How to Use Multiple GPUs in Hugging Face Transformers: Device Map vs Tensor Parallelism

Community Article Published February 12, 2026

If you want to use multiple GPUs with Hugging Face transformers, you need to understand two different approaches:

  1. device_map → memory-based model sharding (simple, inference only)
  2. Tensor Parallelism (tp_plan) → real multi-GPU computation

Before anything else, you must control which GPUs are visible.

Step 0: Restrict to Specific GPUs

If your machine has many GPUs but you can only use some of them (for example, GPUs 3 and 4), restrict visibility with:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

Or from the shell:

export CUDA_VISIBLE_DEVICES="3,4"

Now:

  • Physical GPU 3 → cuda:0
  • Physical GPU 4 → cuda:1

Transformers and Accelerate will only see these two GPUs.
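The remapping follows directly from the order of the comma-separated list, which you can verify with nothing but the standard library. This is just an illustration of the renumbering described above:

```python
import os

# Must be set before torch (or any library that initializes CUDA) is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

# Logical device indices follow the order of the comma-separated list:
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
mapping = {f"cuda:{i}": f"physical GPU {phys}" for i, phys in enumerate(visible)}
print(mapping)  # {'cuda:0': 'physical GPU 3', 'cuda:1': 'physical GPU 4'}
```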

Approach 1: device_map="auto" (Big Model Inference)

Best for:

  • Models too large for one GPU
  • Inference only
  • Simple setup

This does not give true parallel speed-up. It splits model layers across GPUs for memory reasons.

Example

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
)

print(model.hf_device_map)

# {
#   'model.embed_tokens': 0,
#   'model.layers.0': 0,
#   'model.layers.1': 0,
#   'model.layers.2': 0,
#   'model.layers.3': 0,
#   'model.layers.4': 0,
#   'model.layers.5': 0,
#   'model.layers.6': 0,
#   'model.layers.7': 0,
#   'model.layers.8': 0,
#   'model.layers.9': 0,
#   'model.layers.10': 0,
#   'model.layers.11': 0,
#   'model.layers.12': 0,
#   'model.layers.13': 0,
#   'model.layers.14': 0,
#   'model.layers.15': 0,
#   'model.layers.16': 1,
#   'model.layers.17': 1,
#   'model.layers.18': 1,
#   'model.layers.19': 1,
#   'model.layers.20': 1,
#   'model.layers.21': 1,
#   'model.layers.22': 1,
#   'model.layers.23': 1,
#   'model.layers.24': 1,
#   'model.layers.25': 'cpu',
#   'model.layers.26': 'cpu',
#   'model.layers.27': 'cpu',
#   'model.layers.28': 'cpu',
#   'model.layers.29': 'cpu',
#   'model.layers.30': 'cpu',
#   'model.layers.31': 'cpu',
#   'model.norm': 'cpu',
#   'model.rotary_emb': 'cpu', 
#   'lm_head': 'cpu'
# }

What happens?

  • Some layers go to cuda:0
  • Some layers go to cuda:1
  • Some stay on the CPU
  • Forward pass runs sequentially across devices
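The sequential hand-off can be sketched with a toy two-layer model and a hand-written device map. This is not the Transformers implementation, just an illustration of how the activation hops from one device to the next; both "devices" are CPU here so the sketch runs anywhere, but with real GPUs the values would be "cuda:0" and "cuda:1":

```python
import torch
import torch.nn as nn

# Hypothetical device map for a tiny two-layer model.
device_map = {"layer_0": "cpu", "layer_1": "cpu"}
layers = {name: nn.Linear(4, 4).to(dev) for name, dev in device_map.items()}

x = torch.randn(1, 4)
for name, layer in layers.items():
    x = x.to(device_map[name])  # move the activation to the layer's device...
    x = layer(x)                # ...then run the layer there, one device at a time
print(x.shape)  # torch.Size([1, 4])
```

Note that only one device is computing at any moment, which is exactly why this approach saves memory but gives no speed-up.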

Pros

  • Very easy
  • No distributed setup
  • Great for large-model inference

Cons

  • No real compute parallelism
  • Not for training

Approach 2: Tensor Parallelism (tp_plan="auto")

Best for:

  • Real multi-GPU compute
  • Faster inference
  • Large models
  • Distributed setup

This splits the weight tensors themselves across GPUs, so each large matrix multiplication runs partly on every device at the same time.
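The core math can be sketched with a plain matmul on CPU: split a linear layer's weight matrix column-wise across two shards (standing in for two GPUs), compute each partial product, and concatenate. This is an illustration of why sharding preserves the result, not the Transformers implementation:

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 8)    # activations, replicated on every rank
w = torch.randn(8, 16)   # full weight matrix of one linear layer

# Column-parallel split: rank 0 holds the first 8 output columns, rank 1 the rest.
w_rank0, w_rank1 = w.chunk(2, dim=1)
y = torch.cat([x @ w_rank0, x @ w_rank1], dim=1)  # each half computable in parallel

print(torch.allclose(y, x @ w))  # True
```

Each output column depends on the full input but only one weight column, so the sharded and unsharded results match exactly.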

⚠️ Requires torchrun.

Run Script

torchrun --nproc_per_node=2 run_model.py

run_model.py

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    tp_plan="auto",
)

What happens?

  • Each GPU runs a shard of every large tensor
  • Matrix multiplications are split
  • Computation runs in parallel
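Because torchrun launches one copy of the script per GPU, each process gets RANK, LOCAL_RANK, and WORLD_SIZE environment variables. A common pattern is to let only rank 0 print or save results; defaulting to 0 keeps the script usable without torchrun too:

```python
import os

# torchrun sets these for every spawned process; fall back to a single-process default.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

if rank == 0:
    print(f"running on {world_size} process(es)")  # printed once, not once per GPU
```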

Pros

  • Real speed-up
  • Better scaling
  • Efficient compute use

Cons

  • More complex setup
  • Not all models support TP
  • Requires distributed launch

Check model support here: https://huggingface.co/docs/transformers/en/perf_infer_gpu_multi

Device Map vs Tensor Parallelism

| Feature  | device_map             | Tensor Parallelism          |
|----------|------------------------|-----------------------------|
| Setup    | Simple                 | Distributed launch required |
| Speed    | No real speed gain     | Real parallel speed-up      |
| Memory savings | Yes              | Yes                         |
| Training | No                     | Possible                    |
| Best for | Large-model inference  | Fast multi-GPU inference    |

When Should You Use Each?

Use device_map="auto" if:

  • Model does not fit on one GPU
  • You want simple inference
  • You don’t need speed scaling

Use tp_plan="auto" if:

  • You want faster inference
  • You can use torchrun
  • Your model supports tensor parallelism

Final Tip

Always set:

os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

before importing torch (or any other library that initializes CUDA).

Otherwise, the process can see, and may allocate, every GPU on the machine.

Edit: Thanks to Subhasmita Swain for catching a typo in run_model.py, which has now been fixed.
