Nanotron
Nanotron is a distributed training framework with tensor, pipeline, and data parallelism (3D parallelism). It is designed for large-scale training workloads across hundreds of GPUs.
Convert a Transformers model to an optimized Nanotron model implementation for pretraining with the convert_hf_to_nanotron.py script.
torchrun --nproc_per_node=1 examples/llama/convert_hf_to_nanotron.py \
--checkpoint_path=meta-llama/Llama-2-7b-hf \
--save_path=./llama-7b-nanotron
Transformers integration
- Load a supported Transformers model, like Llama, with the from_pretrained() function. This reads the config.json file from the checkpoint directory and creates a LlamaConfig.
- Nanotron maps LlamaConfig to its own config format and creates a Nanotron model.
- Convert the Transformers weights to Nanotron. A weight mapping defines how Nanotron parameter names correspond to Transformers parameter names, including transformations such as fusing the QKV projections and the gate/up projections (see the sketch after this list).
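The conversion can be sketched as follows. This is a minimal illustration, not Nanotron's actual code: the nanotron_config dict and the fused tensor names are assumptions, and the real mapping lives in the convert_hf_to_nanotron.py script.

import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Read config.json from the checkpoint directory and build a LlamaConfig.
hf_config = LlamaConfig.from_pretrained("meta-llama/Llama-2-7b-hf")

# Map the Transformers config fields onto a Nanotron-style config.
# These keys are illustrative; the real mapping is defined by the conversion script.
nanotron_config = {
    "hidden_size": hf_config.hidden_size,
    "intermediate_size": hf_config.intermediate_size,
    "num_hidden_layers": hf_config.num_hidden_layers,
    "num_attention_heads": hf_config.num_attention_heads,
    "num_key_value_heads": hf_config.num_key_value_heads,
    "vocab_size": hf_config.vocab_size,
}

# Convert the weights. A fused layout means the conversion concatenates the
# separate Transformers tensors, for example Q/K/V and gate/up in each layer.
hf_model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
layer = hf_model.model.layers[0]
qkv_weight = torch.cat(
    [layer.self_attn.q_proj.weight, layer.self_attn.k_proj.weight, layer.self_attn.v_proj.weight],
    dim=0,
)
gate_up_weight = torch.cat([layer.mlp.gate_proj.weight, layer.mlp.up_proj.weight], dim=0)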
Nanotron also relies on AutoTokenizer for turning text into token ids during preprocessing and generation.
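A minimal sketch of that preprocessing step, using the same Llama checkpoint as the conversion example above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Turn raw text into token ids for preprocessing and generation.
token_ids = tokenizer("The quick brown fox")["input_ids"]

# Decode back to text to verify the round trip.
text = tokenizer.decode(token_ids, skip_special_tokens=True)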
Resources
- Nanotron repository
- Ultra-Scale Playbook describes how to efficiently scale training with Nanotron