	# Compilation

	## Overview

	PyTorch 2.0 introduced `torch.compile`, which makes PyTorch code run faster by JIT-compiling it into optimized kernels. Key features of `torch.compile` include:

	- Performance Improvement: Significantly speeds up model execution by optimizing the computation graph.
	- Ease of Use: Requires minimal code changes to implement, making it highly accessible.
	- Compatibility: Works seamlessly with existing PyTorch code and models.

	When used with Accelerate, `torch.compile` integrates smoothly into distributed training workflows, allowing you to benefit from both distributed execution and compilation optimizations simultaneously.

	The first execution of compiled code typically takes longer as it includes the compilation time, but subsequent runs are significantly faster. For optimal performance in different scenarios, `torch.compile` offers various modes like `"default"`, `"reduce-overhead"` (which uses CUDA graphs to further reduce overhead), and `"max-autotune"` (which performs extensive autotuning to find the best kernels for your model).
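
The warm-up cost described above can be observed with a minimal sketch like the following. It assumes a CPU-only machine and uses the `"eager"` debugging backend so that no GPU or C++ toolchain is needed; for real speedups you would use `backend="inductor"` together with one of the modes listed above. The tiny `Sequential` model and the `timed` helper are illustrative names, not part of any library.

```python
import time

import torch

torch.manual_seed(0)

# Illustrative toy model; any torch.nn.Module works the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)
x = torch.randn(8, 16)

# backend="eager" captures the graph but runs it with regular PyTorch ops,
# so it shows the tracing overhead without generating optimized kernels.
compiled = torch.compile(model, backend="eager")


def timed(fn, *args):
    """Hypothetical helper: return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


with torch.no_grad():
    expected = model(x)
    first_out, first_s = timed(compiled, x)    # includes graph capture/compilation
    second_out, second_s = timed(compiled, x)  # reuses the cached compiled graph

print(f"first call: {first_s:.4f}s, second call: {second_s:.4f}s")
```

The exact timings depend on hardware and PyTorch version, so the sketch prints them rather than assuming a particular ratio.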

	## Using `torch.compile` with Accelerate

	Accelerate provides `TorchDynamoPlugin` for easy and seamless integration of `torch.compile` into your training scripts.

	```python
	from accelerate import Accelerator
	from accelerate.utils import TorchDynamoPlugin

	# Configure the compilation backend
	dynamo_plugin = TorchDynamoPlugin(
	    backend="inductor",  # Options: "inductor", "aot_eager", "aot_nvfuser", etc.
	    mode="default",  # Options: "default", "reduce-overhead", "max-autotune"
	    fullgraph=True,
	    dynamic=False,
	)

	# Initialize accelerator with the plugin
	accelerator = Accelerator(dynamo_plugin=dynamo_plugin)
	# This will apply torch.compile to your model
	model = accelerator.prepare(model)
	```

	`TorchDynamoPlugin` is compatible with all other features and plugins of Accelerate, including mixed precision and distributed training (DDP, FSDP, DeepSpeed).

	## Regional Compilation

	Compiling the whole model at once usually presents a large optimization problem space. Regional compilation instead targets repeated blocks of the same class and compiles them sequentially, so that later blocks hit the compiler's cache. For example, in `GPT2LMHeadModel`, the repeated block class is `GPT2Block`, and its first instance can be accessed as `model.transformer.h[0]`. The rest of the model (e.g. `model.lm_head`) is compiled separately.

	This reduces the compilation overhead (cold start) of models such as LLMs and Transformers in general.
	See the PyTorch recipe on regional compilation for more details.
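
To make the idea concrete, here is a minimal sketch of regional compilation written with plain `torch.compile`. It is not Accelerate's implementation: the `Block` and `TinyModel` classes are hypothetical stand-ins for something like `GPT2Block` and `GPT2LMHeadModel`, and the `"eager"` backend is used only so the example runs on CPU without a compiler toolchain.

```python
import torch


class Block(torch.nn.Module):
    """Hypothetical repeated block, standing in for e.g. a transformer layer."""

    def __init__(self, dim):
        super().__init__()
        self.linear = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))


class TinyModel(torch.nn.Module):
    """Hypothetical model: a stack of identical blocks followed by a head."""

    def __init__(self, dim=16, depth=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList([Block(dim) for _ in range(depth)])
        self.head = torch.nn.Linear(dim, 2)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.head(x)


torch.manual_seed(0)
model = TinyModel()
x = torch.randn(3, 16)

with torch.no_grad():
    expected = model(x)  # reference output before any compilation

# Compile each repeated block individually instead of the whole model.
# All blocks share the same class, so the compiler may be able to reuse
# work done for the first block when it reaches the later ones.
for i in range(len(model.blocks)):
    model.blocks[i] = torch.compile(model.blocks[i], backend="eager")

# The remaining, non-repeated part is compiled separately.
model.head = torch.compile(model.head, backend="eager")

with torch.no_grad():
    output = model(x)

print(output.shape)
```

Because each region is small, the compiler never has to trace the full model in one pass, which is where the cold-start savings are expected to come from.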

	### How to Use Regional Compilation

	Regional compilation can be enabled by setting `use_regional_compilation=True` in the `TorchDynamoPlugin` configuration:

	```python
	from accelerate import Accelerator
	from accelerate.utils import TorchDynamoPlugin

	# Configure the compilation backend
	dynamo_plugin = TorchDynamoPlugin(
	    use_regional_compilation=True,
	    # ... other parameters
	)
	# Initialize accelerator with the plugin
	accelerator = Accelerator(dynamo_plugin=dynamo_plugin)
	# This will apply compile_regions to your model
	model = accelerator.prepare(model)
	```

	You can also use the `accelerate.utils.compile_regions` utility directly, the same way you would use `torch.compile`.

	### Benefits of Regional Compilation

	We have conducted extensive benchmarks comparing full compilation and regional compilation using the `torch.compile` feature in PyTorch. The full results are available in the [accelerate repository](https://github.com/huggingface/accelerate/tree/main/benchmarks/torch.compile/regional_compilation). The key findings from our benchmarks are:

	1. Comparable Performance: Regional compilation delivers performance speedups similar to full compilation, especially for larger models.
	2. Faster Compilation: Regional compilation significantly reduces the time taken to compile models, making it a more efficient choice for deployment.
	3. Batch Size Impact: The performance difference between compilation strategies diminishes with larger batch sizes, indicating that the overhead of compilation is less impactful in those scenarios.
	4. Model Size Consideration: The benefits of regional compilation are more pronounced in larger models, where the compilation time savings can be substantial.
	5. Practical Application: For real-world applications, regional compilation is a practical choice for optimizing training cold start times, especially when working with large models.

	## Conclusion

	Both full and regional compilation can significantly speed up your models. Regional compilation offers a practical balance between compilation time and runtime performance, especially for training large models with substantial batch sizes.
