How to Use Multiple GPUs in Hugging Face Transformers: Device Map vs Tensor Parallelism
If you want to use multiple GPUs with Hugging Face
transformers, you need to understand two different approaches:
- `device_map` → memory-based model sharding (simple, inference only)
- Tensor Parallelism (`tp_plan`) → real multi-GPU computation
Before anything else, you must control which GPUs are visible.
Step 0: Restrict to Specific GPUs
If your machine has many GPUs but you can only use some (for example GPU 3 and 4), restrict visibility with:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"
```
Or from the shell:
```shell
export CUDA_VISIBLE_DEVICES="3,4"
```
Now:
- Physical GPU 3 → `cuda:0`
- Physical GPU 4 → `cuda:1`
Transformers and Accelerate will only see these two GPUs.
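The remapping is mechanical: the Nth id in `CUDA_VISIBLE_DEVICES` becomes logical device `cuda:N`. A tiny sketch of that rule (the `logical_device_map` helper is hypothetical, just to illustrate the mapping):

```python
import os

def logical_device_map(visible: str) -> dict:
    """Map physical GPU ids in a CUDA_VISIBLE_DEVICES string to the
    logical cuda:N indices that frameworks will see."""
    physical = [int(x) for x in visible.split(",") if x.strip()]
    return {p: f"cuda:{i}" for i, p in enumerate(physical)}

os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"
print(logical_device_map(os.environ["CUDA_VISIBLE_DEVICES"]))
# {3: 'cuda:0', 4: 'cuda:1'}
```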
Approach 1: device_map="auto" (Big Model Inference)
Best for:
- Models too large for one GPU
- Inference only
- Simple setup
This does not give true parallel speed-up. It splits model layers across GPUs for memory reasons.
Example
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
)

print(model.hf_device_map)
# {
print(model.hf_device_map)
# {
# 'model.embed_tokens': 0,
# 'model.layers.0': 0,
# 'model.layers.1': 0,
# 'model.layers.2': 0,
# 'model.layers.3': 0,
# 'model.layers.4': 0,
# 'model.layers.5': 0,
# 'model.layers.6': 0,
# 'model.layers.7': 0,
# 'model.layers.8': 0,
# 'model.layers.9': 0,
# 'model.layers.10': 0,
# 'model.layers.11': 0,
# 'model.layers.12': 0,
# 'model.layers.13': 0,
# 'model.layers.14': 0,
# 'model.layers.15': 0,
# 'model.layers.16': 1,
# 'model.layers.17': 1,
# 'model.layers.18': 1,
# 'model.layers.19': 1,
# 'model.layers.20': 1,
# 'model.layers.21': 1,
# 'model.layers.22': 1,
# 'model.layers.23': 1,
# 'model.layers.24': 1,
# 'model.layers.25': 'cpu',
# 'model.layers.26': 'cpu',
# 'model.layers.27': 'cpu',
# 'model.layers.28': 'cpu',
# 'model.layers.29': 'cpu',
# 'model.layers.30': 'cpu',
# 'model.layers.31': 'cpu',
# 'model.norm': 'cpu',
# 'model.rotary_emb': 'cpu',
# 'lm_head': 'cpu'
# }
```
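To see the placement at a glance, you can summarize `hf_device_map` by counting modules per device. The `summarize_device_map` helper below is not part of transformers; it is a small sketch working on a shortened version of the dictionary above:

```python
from collections import Counter

def summarize_device_map(device_map: dict) -> dict:
    """Count how many modules land on each device in an hf_device_map."""
    return dict(Counter(str(v) for v in device_map.values()))

example = {"model.embed_tokens": 0, "model.layers.0": 0,
           "model.layers.1": 1, "lm_head": "cpu"}
print(summarize_device_map(example))
# {'0': 2, '1': 1, 'cpu': 1}
```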
What happens?
- Some layers go to `cuda:0`
- Some layers go to `cuda:1`
- Some stay on the CPU
- The forward pass runs sequentially across devices
Pros
- Very easy
- No distributed setup
- Great for large inference
Cons
- No real compute parallelism
- Not for training
Approach 2: Tensor Parallelism (tp_plan="auto")
Best for:
- Real multi-GPU compute
- Faster inference
- Large models
- Distributed setup
This splits tensors (like large matrix multiplications) across GPUs.
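The core idea is easy to see with plain NumPy: shard a weight matrix's columns across two "devices", compute the partial matmuls independently, and concatenate the results. This is a toy column-parallel linear layer, not the actual transformers implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # batch of activations (replicated on both GPUs)
W = rng.standard_normal((8, 6))   # full weight matrix

W0, W1 = np.split(W, 2, axis=1)   # shard columns: one half per "device"
y0 = x @ W0                       # would run on cuda:0
y1 = x @ W1                       # would run on cuda:1, at the same time
y_parallel = np.concatenate([y0, y1], axis=1)

assert np.allclose(y_parallel, x @ W)  # identical to the single-device result
```

Because both halves are computed simultaneously, the wall-clock time of the matmul roughly halves, which is the speed-up `device_map` cannot give you.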
⚠️ Requires torchrun.
Run Script
```shell
torchrun --nproc_per_node=2 run_model.py
```
run_model.py
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    tp_plan="auto",
)
```
What happens?
- Each GPU runs a shard of every large tensor
- Matrix multiplications are split
- Computation runs in parallel
Pros
- Real speed-up
- Better scaling
- Efficient compute use
Cons
- More complex setup
- Not all models support TP
- Requires distributed launch
Check model support here: https://huggingface.co/docs/transformers/en/perf_infer_gpu_multi
Device Map vs Tensor Parallelism
| Feature | `device_map="auto"` | Tensor Parallelism (`tp_plan="auto"`) |
|---|---|---|
| Setup | Simple | Distributed launch required |
| Speed-up | None (sequential execution) | Yes (parallel compute) |
| Memory savings | Yes | Yes |
| Training | No | Possible |
| Best for | Fitting large models for inference | Fast multi-GPU inference |
When Should You Use Each?
Use device_map="auto" if:
- Model does not fit on one GPU
- You want simple inference
- You don’t need speed scaling
Use tp_plan="auto" if:
- You want faster inference
- You can use `torchrun`
- Your model supports tensor parallelism
Final Tip
Always set:
```python
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"
```
before importing torch.
Otherwise, all GPUs may be used.
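One way to guard against getting the order wrong is a quick check that torch has not been imported yet when you set the variable. This is just a defensive convention, not a transformers API:

```python
import os
import sys

# If torch is already loaded, CUDA devices were enumerated at import time
# and changing the env var now has no effect.
assert "torch" not in sys.modules, "Set CUDA_VISIBLE_DEVICES before importing torch"

os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"
# ... only now import torch / transformers ...
```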
Edit: Thanks to Subhasmita Swain for catching a typo in the run_model.py which has been fixed now.