How to Use Multiple GPUs in Hugging Face Transformers: Device Map vs Tensor Parallelism
If you want to use multiple GPUs with Hugging Face
transformers, you need to understand two different approaches:
- `device_map` → memory-based model sharding (simple, inference only)
- Tensor Parallelism (`tp_plan`) → real multi-GPU computation
Before anything else, you must control which GPUs are visible.
Step 0: Restrict to Specific GPUs
If your machine has many GPUs but you can only use some (for example GPU 3 and 4), restrict visibility with:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"
```
Or from the shell:
```shell
export CUDA_VISIBLE_DEVICES="3,4"
```
Now:
- Physical GPU 3 → `cuda:0`
- Physical GPU 4 → `cuda:1`
Transformers and Accelerate will only see these two GPUs.
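The remapping is mechanical: the Nth id in `CUDA_VISIBLE_DEVICES` becomes logical device `cuda:N`. A tiny sketch of that rule (the `logical_device_map` helper is hypothetical, just to illustrate the mapping):

```python
import os

def logical_device_map(visible: str) -> dict:
    """Map physical GPU ids in a CUDA_VISIBLE_DEVICES string to the
    logical cuda:N indices that frameworks will see."""
    physical = [int(x) for x in visible.split(",") if x.strip()]
    return {p: f"cuda:{i}" for i, p in enumerate(physical)}

os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"
print(logical_device_map(os.environ["CUDA_VISIBLE_DEVICES"]))
# {3: 'cuda:0', 4: 'cuda:1'}
```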
Approach 1: device_map="auto" (Big Model Inference)
Best for:
- Models too large for one GPU
- Inference only
- Simple setup
This does not give true parallel speed-up. It splits model layers across GPUs for memory reasons.
Example
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
)

print(model.hf_device_map)
# {
print(model.hf_device_map)
# {
# 'model.embed_tokens': 0,
# 'model.layers.0': 0,
# 'model.layers.1': 0,
# 'model.layers.2': 0,
# 'model.layers.3': 0,
# 'model.layers.4': 0,
# 'model.layers.5': 0,
# 'model.layers.6': 0,
# 'model.layers.7': 0,
# 'model.layers.8': 0,
# 'model.layers.9': 0,
# 'model.layers.10': 0,
# 'model.layers.11': 0,
# 'model.layers.12': 0,
# 'model.layers.13': 0,
# 'model.layers.14': 0,
# 'model.layers.15': 0,
# 'model.layers.16': 1,
# 'model.layers.17': 1,
# 'model.layers.18': 1,
# 'model.layers.19': 1,
# 'model.layers.20': 1,
# 'model.layers.21': 1,
# 'model.layers.22': 1,
# 'model.layers.23': 1,
# 'model.layers.24': 1,
# 'model.layers.25': 'cpu',
# 'model.layers.26': 'cpu',
# 'model.layers.27': 'cpu',
# 'model.layers.28': 'cpu',
# 'model.layers.29': 'cpu',
# 'model.layers.30': 'cpu',
# 'model.layers.31': 'cpu',
# 'model.norm': 'cpu',
# 'model.rotary_emb': 'cpu',
# 'lm_head': 'cpu'
# }
```
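To see the placement at a glance, you can summarize `hf_device_map` by counting modules per device. The `summarize_device_map` helper below is not part of transformers; it is a small sketch working on a shortened version of the dictionary above:

```python
from collections import Counter

def summarize_device_map(device_map: dict) -> dict:
    """Count how many modules land on each device in an hf_device_map."""
    return dict(Counter(str(v) for v in device_map.values()))

example = {"model.embed_tokens": 0, "model.layers.0": 0,
           "model.layers.1": 1, "lm_head": "cpu"}
print(summarize_device_map(example))
# {'0': 2, '1': 1, 'cpu': 1}
```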
What happens?
- Some layers go to `cuda:0`
- Some layers go to `cuda:1`
- Some stay on the CPU
- The forward pass runs sequentially across devices
Pros
- Very easy
- No distributed setup
- Great for large inference
Cons
- No real compute parallelism
- Not for training
Approach 2: Tensor Parallelism (tp_plan="auto")
Best for:
- Real multi-GPU compute
- Faster inference
- Large models
- Distributed setup
This splits tensors (like large matrix multiplications) across GPUs.
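The core idea is easy to see with plain NumPy: shard a weight matrix's columns across two "devices", compute the partial matmuls independently, and concatenate the results. This is a toy column-parallel linear layer, not the actual transformers implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # batch of activations (replicated on both GPUs)
W = rng.standard_normal((8, 6))   # full weight matrix

W0, W1 = np.split(W, 2, axis=1)   # shard columns: one half per "device"
y0 = x @ W0                       # would run on cuda:0
y1 = x @ W1                       # would run on cuda:1, at the same time
y_parallel = np.concatenate([y0, y1], axis=1)

assert np.allclose(y_parallel, x @ W)  # identical to the single-device result
```

Because both halves are computed simultaneously, the wall-clock time of the matmul roughly halves, which is the speed-up `device_map` cannot give you.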
⚠️ Requires torchrun.
Run Script
```shell
torchrun --nproc_per_node=2 run_model.py
```
run_model.py
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    tp_plan="auto",
)
```
What happens?
- Each GPU runs a shard of every large tensor
- Matrix multiplications are split
- Computation runs in parallel
Pros
- Real speed-up
- Better scaling
- Efficient compute use
Cons
- More complex setup
- Not all models support TP
- Requires distributed launch
Check model support here: https://huggingface.co/docs/transformers/en/perf_infer_gpu_multi
Device Map vs Tensor Parallelism
| Feature | `device_map="auto"` | Tensor Parallelism (`tp_plan="auto"`) |
|---|---|---|
| Setup | Simple | Distributed launch required |
| Speed-up | None (sequential execution) | Yes (parallel compute) |
| Memory savings | Yes | Yes |
| Training | No | Possible |
| Best for | Fitting large models for inference | Fast multi-GPU inference |
When Should You Use Each?
Use device_map="auto" if:
- Model does not fit on one GPU
- You want simple inference
- You don’t need speed scaling
Use tp_plan="auto" if:
- You want faster inference
- You can use `torchrun`
- Your model supports tensor parallelism
Final Tip
Always set:
```python
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"
```
before importing torch.
Otherwise, all GPUs may be used.
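One way to guard against getting the order wrong is a quick check that torch has not been imported yet when you set the variable. This is just a defensive convention, not a transformers API:

```python
import os
import sys

# If torch is already loaded, CUDA devices were enumerated at import time
# and changing the env var now has no effect.
assert "torch" not in sys.modules, "Set CUDA_VISIBLE_DEVICES before importing torch"

os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"
# ... only now import torch / transformers ...
```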
Edit: Thanks to Subhasmita Swain for catching a typo in the run_model.py which has been fixed now.