We Pitted the Cheapest TPU Against an NVIDIA L4. Here's What 6 Experiments Revealed.

Community Article

Published April 17, 2026

At Thoughtworks, we train and deploy LLMs on both GPUs and TPUs. Everyone defaults to GPUs — they're familiar, well-documented, and just work. But we kept wondering: for a Transformer training workload on GCP, is an NVIDIA L4 actually the right call? Or is the cheapest available TPU already better?

We didn't want opinions. We wanted numbers. So we ran 6 controlled experiments on real GCP hardware at real prices, sweeping batch size, sequence length, model depth, precision, control flow, and framework choice. The results were not what we expected.

Everything in this post is fully reproducible. Every experiment is a Jupyter notebook with saved results, analysis code, and pre-generated charts. Clone the repo, pick a session, and run it yourself — no GCP hardware needed to explore the data, or bring your own instances to re-run the benchmarks from scratch:

github.com/tails-mpt/gpar-workshop — 6 sessions, 3 notebooks each, all results included.

The Setup

We benchmarked a single Transformer encoder block (8-head attention, d_model=512, FFN dim 2048) — small enough to isolate hardware behavior, large enough to be representative. The two contenders:

Spec	NVIDIA L4	TPU v5e
Memory	23.7 GB GDDR6	16 GB HBM2
Bandwidth	~300 GB/s	~819 GB/s
BF16 Compute	~121 TFLOPS	~197 TFLOPS (~393 TOPS is the INT8 peak)
GCP Price	$0.70/hr	$1.20/hr

The L4 is the GPU a small-to-mid team would most likely reach for on GCP — fine-tuning a 7B model, training a classifier, distilling a student model. It's not an H100, and this benchmark isn't aimed at labs with standing clusters. It's for teams making practical cost-performance decisions on cloud spot or on-demand instances. The TPU v5e is the cheapest TPU available. Not best-vs-best — realistic-vs-realistic.

Experiment 1: Batch Size Scaling

The question: At what batch size does the TPU overtake the GPU?

The answer: Batch 32.

Batch	GPU (samples/s)	TPU (samples/s)	TPU / GPU
4	1,388	370	0.27x
16	2,866	1,497	0.52x
32	2,895	2,984	1.03x
64	2,760	5,968	2.16x
256	2,653	23,862	9.0x
1,024	2,575	94,795	36.8x

The GPU plateaus from batch=16 onward (its best result, 2,895 samples/s at batch=32, is only 1% higher), which is consistent with its 300 GB/s memory bus being saturated. The TPU scales almost perfectly linearly: a 256x throughput gain for a 256x batch increase. At batch=1024, the TPU processes roughly 37 samples for every 1 on the GPU.

GPU throughput saturates at batch=16. TPU scales linearly. The crossover happens at batch=32.
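As a quick check on the table, the ratio column and the crossover point can be recomputed from the raw throughput figures. This is a minimal sketch that only uses the numbers printed above; BATCH_RESULTS and first_crossover are illustrative names, not identifiers from the workshop repo.

```python
# batch size -> (GPU samples/s, TPU samples/s), copied from the Experiment 1 table
BATCH_RESULTS = {
    4: (1388, 370),
    16: (2866, 1497),
    32: (2895, 2984),
    64: (2760, 5968),
    256: (2653, 23862),
    1024: (2575, 94795),
}


def first_crossover(results):
    """Return the smallest measured batch size where TPU throughput >= GPU throughput."""
    for batch in sorted(results):
        gpu, tpu = results[batch]
        if tpu >= gpu:
            return batch
    return None  # the TPU never caught up in the measured range


for batch, (gpu, tpu) in sorted(BATCH_RESULTS.items()):
    print(f"batch={batch:>5}  TPU/GPU = {tpu / gpu:.2f}x")

print("first measured batch where TPU >= GPU:", first_crossover(BATCH_RESULTS))
```

Because only six batch sizes were measured, this locates the crossover to the nearest measured point rather than an exact value.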

Experiment 2: Sequence Length

The question: What happens as sequences get longer?

The GPU collapses. The TPU doesn't notice.

seq_len	GPU (samples/s)	TPU (samples/s)	TPU / GPU
64	5,895	3,146	0.53x
128	2,906	3,166	1.09x
512	516	3,129	6.06x
2,048	66	3,128	47.2x

An 89x throughput drop on the GPU for a 32x sequence increase. Attention's O(seq^2) memory footprint saturates the L4's memory bus. The TPU? Its throughput varies by 1.2% across the entire sweep. At seq_len=2048, the TPU is 47x faster.

GPU throughput degrades faster than linearly with sequence length (an 89x drop for a 32x increase). TPU throughput is essentially constant.
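One way to quantify how steeply throughput falls is to fit a power law, samples/s ∝ seq_len^k, between two measured points and read off the exponent k. The sketch below does that with the table's numbers only; SEQ_RESULTS and scaling_exponent are illustrative names, and a two-point log-log slope is a rough estimate rather than a proper fit.

```python
import math

# seq_len -> (GPU samples/s, TPU samples/s), copied from the Experiment 2 table
SEQ_RESULTS = {
    64: (5895, 3146),
    128: (2906, 3166),
    512: (516, 3129),
    2048: (66, 3128),
}


def scaling_exponent(x0, y0, x1, y1):
    """Log-log slope k such that y1 / y0 == (x1 / x0) ** k."""
    return math.log(y1 / y0) / math.log(x1 / x0)


lengths = sorted(SEQ_RESULTS)

# Exponent over the whole sweep (first point to last point).
lo, hi = lengths[0], lengths[-1]
gpu_exp = scaling_exponent(lo, SEQ_RESULTS[lo][0], hi, SEQ_RESULTS[hi][0])
tpu_exp = scaling_exponent(lo, SEQ_RESULTS[lo][1], hi, SEQ_RESULTS[hi][1])
print(f"full sweep: GPU k = {gpu_exp:.2f}, TPU k = {tpu_exp:.3f}")

# Exponent for each adjacent pair, to see where the GPU curve steepens.
for a, b in zip(lengths, lengths[1:]):
    k = scaling_exponent(a, SEQ_RESULTS[a][0], b, SEQ_RESULTS[b][0])
    print(f"GPU {a:>4} -> {b:>4}: k = {k:.2f}")
```

On these figures the GPU exponent comes out near -1.3 over the full sweep and about -1.5 on the last segment, while the TPU exponent is indistinguishable from zero.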

Experiments 3-4: Depth and Control Flow

Depth (1 to 24 layers): The TPU maintains a stable ~2.5x throughput advantage regardless of model depth. Neither device hits OOM at 24 layers — BERT-base (12 layers) uses only 19% of the L4's VRAM.

Control flow: Here's where it gets interesting. A Python if tensor > 0 branch costs the GPU 41% throughput — but it costs the TPU 61%. XLA has to sync mid-step and recompile. The fix? Replace if with torch.where. One line of code recovers 98.5% of the TPU's lost performance. Padding masks — the most common dynamic pattern in NLP — are essentially free on TPU.
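To make the control-flow fix concrete, here is a minimal sketch of the two patterns side by side. The functions scale_with_branch and scale_with_where are hypothetical examples, not the benchmark's code, and the doubling and halving stand in for whatever each branch really computes.

```python
import torch


def scale_with_branch(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical data-dependent branch. Python needs a concrete bool here,
    # so the tensor value has to be read back to the host before the step
    # can continue.
    if x.sum() > 0:
        return x * 2.0
    return x * 0.5


def scale_with_where(x: torch.Tensor) -> torch.Tensor:
    # Same result, but the condition stays a tensor: both alternatives are
    # computed and torch.where selects between them on the device.
    cond = x.sum() > 0  # 0-dim bool tensor, broadcast against x
    return torch.where(cond, x * 2.0, x * 0.5)


x = torch.tensor([1.0, 2.0, -0.5])
print(scale_with_branch(x), scale_with_where(x))    # positive sum: doubled
print(scale_with_branch(-x), scale_with_where(-x))  # negative sum: halved
```

The trade-off is that torch.where evaluates both alternatives, so it suits cheap element-wise choices; it does not skip work the way a real branch does.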

Experiment 5: Precision

This one surprised us.

On the GPU, BF16 delivers a consistent 2x throughput boost and 21-29% VRAM savings. Exactly what you'd expect from Tensor Cores.

On the TPU, BF16 is 1.4-3.3% slower than FP32. Not faster. Slower. XLA's FP32 kernel fusion is more aggressive than its mixed-precision path for this model size. The autocast overhead eats whatever the MXU gains.

GPU gets a clean 2x from BF16. TPU gets a slight regression. The "obvious" optimization isn't obvious at all.

Experiment 6: JAX vs PyTorch on the Same Hardware

Both JAX and PyTorch/XLA compile down to XLA. Same backend. Same hardware. Surely similar performance?

Not even close.

On the GPU, JAX is 1.7-2.3x faster than PyTorch — it fuses the full train step into one XLA program, eliminating the intermediate memory writes that PyTorch's eager dispatch incurs.

On the TPU, it depends on batch size. JAX dominates at small batches (22.8x faster at batch=4) but PyTorch/XLA wins at large batches (3.3x faster at batch=1024). The crossover is around batch 300.

Same XLA backend, dramatically different performance. Framework choice matters more than most practitioners assume.

The Bottom Line: Cost Per Sample

The TPU costs 1.71x as much per hour ($1.20 vs $0.70). But cost-per-sample tells a different story.

Batch	GPU (samples/$)	TPU (samples/$)	Winner
4	7.1M	1.1M	GPU (6.4x cheaper)
32	14.9M	9.0M	GPU (1.7x cheaper)
64	14.2M	17.9M	TPU (1.26x cheaper)
256	13.6M	71.6M	TPU (5.2x cheaper)
1,024	13.2M	284.4M	TPU (21.5x cheaper)

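The cost crossover falls between batch=32 and batch=64. The TPU has to deliver more than 1.71x the GPU's throughput to justify its hourly premium, and it first does so at batch=64 (2.16x). Below that, the GPU's lower hourly rate wins. Above it, the TPU's throughput advantage overwhelms its price premium, and the gap roughly doubles with every batch doubling.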

At batch=1024, every dollar spent on the TPU buys 21.5x more training samples than the same dollar on a GPU.
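The samples-per-dollar figures follow directly from throughput and hourly price: samples/$ = samples/s × 3600 / ($/hr). A minimal sketch of that arithmetic, using only the prices from the setup table and the throughput from Experiment 1 (the names below are illustrative, not from the workshop repo):

```python
GPU_PRICE_PER_HOUR = 0.70  # NVIDIA L4, from the setup table
TPU_PRICE_PER_HOUR = 1.20  # TPU v5e, from the setup table

# batch size -> (GPU samples/s, TPU samples/s), from the Experiment 1 table
THROUGHPUT = {
    4: (1388, 370),
    32: (2895, 2984),
    64: (2760, 5968),
    256: (2653, 23862),
    1024: (2575, 94795),
}


def samples_per_dollar(samples_per_sec, price_per_hour):
    """Training samples processed for each dollar of instance time."""
    return samples_per_sec * 3600 / price_per_hour


# The TPU is cheaper per sample only when its throughput advantage
# exceeds its price ratio.
break_even = TPU_PRICE_PER_HOUR / GPU_PRICE_PER_HOUR
print(f"TPU must be > {break_even:.2f}x faster to win on cost")

for batch, (gpu, tpu) in sorted(THROUGHPUT.items()):
    gpu_spd = samples_per_dollar(gpu, GPU_PRICE_PER_HOUR)
    tpu_spd = samples_per_dollar(tpu, TPU_PRICE_PER_HOUR)
    winner = "TPU" if tpu_spd > gpu_spd else "GPU"
    factor = max(gpu_spd, tpu_spd) / min(gpu_spd, tpu_spd)
    print(f"batch={batch:>5}  GPU {gpu_spd / 1e6:6.1f}M  TPU {tpu_spd / 1e6:6.1f}M  "
          f"{winner} ({factor:.2f}x cheaper)")
```

Swapping in spot or committed-use prices only changes the two constants at the top; the break-even ratio moves with them.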

What This Benchmark Does NOT Capture

These experiments isolate raw hardware behavior using a single Transformer encoder block, with vanilla PyTorch eager on the GPU and PyTorch/XLA on the TPU (JAX appears only in Experiment 6). No custom kernels, no inference frameworks, no multi-device setups. This is deliberate: we wanted to measure the silicon, not the software stack.

But in production, NVIDIA's software ecosystem closes or reverses many of these gaps:

FlashAttention — GPU-native fused attention kernels reduce memory I/O by 5-10x. Our benchmark uses standard nn.MultiheadAttention, which doesn't exploit this. Real attention-heavy workloads on GPU are significantly faster than what we measured.
torch.compile + Triton — Fusing operations into single CUDA kernels via torch.compile can cut GPU framework overhead by 30-50%. Our benchmark runs pure eager mode.
CUDA Graphs — Record-and-replay eliminates CPU-GPU sync jitter, achieving consistent sub-5ms inference latency. TPU has no equivalent for latency-critical serving.
Continuous batching & PagedAttention — GPU inference servers (vLLM, SGLang) dynamically pack requests and page KV cache memory. TPU serving is more static and less memory-efficient for variable-length workloads.
Quantization — GPTQ, AWQ, and bitsandbytes provide GPU-native INT4/INT8 inference kernels. A 70B model quantized to INT4 fits on a single H100 (80 GB) but not a TPU v5e (16 GB). Our INT8 results (Session 5) show TPU quantization is still immature.
NVLink & multi-GPU scaling — Our benchmark is single-device. H100s connected via NVLink (900 GB/s) enable efficient tensor parallelism and FSDP. TPU v5e chips communicate over slower fabric.
Structured sparsity — A100/H100 Tensor Cores can run 2:4 sparse matrices at up to 2x their dense throughput. The TPU v5e has no equivalent hardware support.
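The memory claim in the quantization item above rests on simple arithmetic: weight memory is roughly parameter count × bits per weight / 8. A rough sketch follows, with the loud assumption that it counts weights only and ignores KV cache, activations, and the scale/zero-point metadata that real INT4 formats carry; weight_memory_gb is an illustrative helper, not a library function.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes). Assumption: no overheads."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9
H100_GB = 80     # capacity quoted in the list above
TPU_V5E_GB = 16  # capacity from the setup table

for bits in (16, 8, 4):
    need = weight_memory_gb(PARAMS_70B, bits)
    print(f"{bits:>2}-bit weights: {need:6.1f} GB | "
          f"fits H100: {need <= H100_GB} | fits TPU v5e: {need <= TPU_V5E_GB}")
```

Since runtime overheads come on top of the weights, this gives a lower bound on the memory a deployment would need.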

In short: if you're running a production inference service with FlashAttention, continuous batching, quantization, and CUDA Graphs, the GPU's position is much stronger than our raw throughput numbers suggest. The TPU advantage we measured is real, but our GPU numbers are the floor, not the ceiling, of what GPUs can do with the right software.

When to Use What

Workload	Use	Why
Small-batch inference (batch <= 16)	GPU	Faster, simpler, cheaper
Long sequences (seq >= 256, batch >= 32)	TPU	GPU throughput drops faster than linearly with length; TPU stays flat
High-throughput training (batch >= 64)	TPU	Up to 21.5x more samples per dollar (at batch=1024)
Mixed precision	GPU: BF16, TPU: FP32	BF16 helps GPU 2x; hurts TPU slightly
Framework choice on TPU	JAX (small batch) / PyTorch/XLA (large batch)	Crossover at batch ~300

Reproduce It

The full workshop is open: github.com/tails-mpt/gpar-workshop

Every chart in this post was generated from a notebook you can run. If you disagree with our numbers, run the experiments on your own hardware and let us know.

This is part of the accelerator research we're doing at Thoughtworks. We also train and release EAGLE3 draft models for speculative decoding on both GPU and TPU — if faster LLM inference is your thing, check those out too.

SpecJAX: A Speculative Decoding Library for TPUs

April 20, 2026

1.37x Faster on Alibaba's 80B Code Model: EAGLE3 for Qwen3-Coder-Next

April 15, 2026
