We Pitted the Cheapest TPU Against an NVIDIA L4. Here's What 6 Experiments Revealed.
We didn't want opinions. We wanted numbers. So we ran 6 controlled experiments on real GCP hardware at real prices, sweeping batch size, sequence length, model depth, precision, control flow, and framework choice. The results were not what we expected.
Everything in this post is fully reproducible. Every experiment is a Jupyter notebook with saved results, analysis code, and pre-generated charts. Clone the repo, pick a session, and run it yourself — no GCP hardware needed to explore the data, or bring your own instances to re-run the benchmarks from scratch:
github.com/tails-mpt/gpar-workshop — 6 sessions, 3 notebooks each, all results included.
The Setup
We benchmarked a single Transformer encoder block (8-head attention, d_model=512, FFN dim 2048) — small enough to isolate hardware behavior, large enough to be representative. The two contenders:
| | NVIDIA L4 | TPU v5e |
|---|---|---|
| Memory | 23.7 GB GDDR6 | 16 GB HBM2 |
| Bandwidth | ~300 GB/s | ~819 GB/s |
| BF16 Compute | ~121 TFLOPS | ~197 TFLOPS |
| GCP Price | $0.70/hr | $1.20/hr |
The L4 is the GPU a small-to-mid team would most likely reach for on GCP — fine-tuning a 7B model, training a classifier, distilling a student model. It's not an H100, and this benchmark isn't aimed at labs with standing clusters. It's for teams making practical cost-performance decisions on cloud spot or on-demand instances. The TPU v5e is the cheapest TPU available. Not best-vs-best — realistic-vs-realistic.
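To give a sense of how small the benchmarked block is, here is a rough parameter count. It is a sketch that assumes a conventional encoder layer layout (Q/K/V/output projections with biases, a two-layer FFN, two LayerNorms, no embeddings); the function name `encoder_block_params` is ours for illustration and is not taken from the notebooks.

```python
def encoder_block_params(d_model: int = 512, ffn_dim: int = 2048) -> int:
    """Count weights and biases in one Transformer encoder block.

    Assumes the conventional layout: four d_model x d_model attention
    projections (Q, K, V, output) with biases, a two-layer FFN with biases,
    and two LayerNorms. The head count does not change the total.
    """
    attention = 4 * (d_model * d_model + d_model)
    ffn = (d_model * ffn_dim + ffn_dim) + (ffn_dim * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # scale and shift for each norm
    return attention + ffn + layer_norms


params = encoder_block_params()
print(f"parameters: {params:,}")
print(f"FP32 weights: {params * 4 / 1e6:.1f} MB")
```

Under these assumptions the block has about 3.15M parameters, so the weights occupy only a few megabytes on either device. Memory traffic in the experiments is therefore driven mostly by activations, which grow with batch size and sequence length.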
Experiment 1: Batch Size Scaling
The question: At what batch size does the TPU overtake the GPU?
The answer: Batch 32.
| Batch | GPU (samples/s) | TPU (samples/s) | TPU / GPU |
|---|---|---|---|
| 4 | 1,388 | 370 | 0.27x |
| 16 | 2,866 | 1,497 | 0.52x |
| 32 | 2,895 | 2,984 | 1.03x |
| 64 | 2,760 | 5,968 | 2.16x |
| 256 | 2,653 | 23,862 | 9.0x |
| 1,024 | 2,575 | 94,795 | 36.8x |
The GPU plateaus around batch 16-32 at roughly 2,900 samples/s and then drifts slightly lower: its 300 GB/s memory bus is saturated. The TPU scales almost perfectly linearly, with a 256x throughput gain for a 256x batch increase. At batch=1024, the TPU is processing about 37 samples for every 1 on the GPU.
GPU throughput saturates at batch=16. TPU scales linearly. The crossover happens at batch=32.
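The ratios and the crossover point can be recomputed directly from the table. The snippet below is a minimal sketch using only the throughput figures printed above; the helper names are illustrative and do not come from the workshop notebooks.

```python
# Throughput in samples/s, copied from the batch-size table above.
batches = [4, 16, 32, 64, 256, 1024]
gpu = [1388, 2866, 2895, 2760, 2653, 2575]
tpu = [370, 1497, 2984, 5968, 23862, 94795]


def speedups(gpu_rates, tpu_rates):
    """Return the TPU/GPU throughput ratio for each measured batch size."""
    return [t / g for g, t in zip(gpu_rates, tpu_rates)]


def first_crossover(batch_sizes, ratios, threshold=1.0):
    """Return the smallest measured batch size whose ratio exceeds threshold.

    Assumes batch_sizes is sorted ascending. Returns None if no batch qualifies.
    """
    for batch, ratio in zip(batch_sizes, ratios):
        if ratio > threshold:
            return batch
    return None


ratios = speedups(gpu, tpu)
for batch, ratio in zip(batches, ratios):
    print(f"batch={batch:>5}  TPU/GPU={ratio:.2f}x")
print("first measured batch where TPU wins:", first_crossover(batches, ratios))
```

Because only a handful of batch sizes were measured, this finds the first measured point where the TPU is ahead. The true crossover lies somewhere between batch 16 and batch 32.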
Experiment 2: Sequence Length
The question: What happens as sequences get longer?
The GPU collapses. The TPU doesn't notice.
| seq_len | GPU (samples/s) | TPU (samples/s) | TPU / GPU |
|---|---|---|---|
| 64 | 5,895 | 3,146 | 0.53x |
| 128 | 2,906 | 3,166 | 1.09x |
| 512 | 516 | 3,129 | 6.06x |
| 2,048 | 66 | 3,128 | 47.2x |
An 89x throughput drop on the GPU for a 32x sequence increase. Attention's O(seq^2) memory footprint saturates the L4's memory bus. The TPU? Its throughput varies by 1.2% across the entire sweep. At seq_len=2048, the TPU is 47x faster.
GPU throughput degrades faster than linearly with sequence length (89x for a 32x increase). TPU throughput is essentially constant.
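One way to summarize the sweep is a scaling exponent: the slope of log(throughput) against log(seq_len). An exponent of -1 means per-sample cost grows linearly with sequence length, and -2 would mean fully quadratic growth. The sketch below uses only the numbers in the table above; the function name is ours.

```python
import math

# Throughput in samples/s, copied from the sequence-length table above.
seq_lens = [64, 128, 512, 2048]
gpu = [5895, 2906, 516, 66]
tpu = [3146, 3166, 3129, 3128]


def scaling_exponent(x0, y0, x1, y1):
    """Slope of log(y) versus log(x) between two measured points."""
    return math.log(y1 / y0) / math.log(x1 / x0)


# Exponent over the whole sweep (seq_len 64 to 2048).
gpu_k = scaling_exponent(seq_lens[0], gpu[0], seq_lens[-1], gpu[-1])
tpu_k = scaling_exponent(seq_lens[0], tpu[0], seq_lens[-1], tpu[-1])
# Exponent over the last segment only (seq_len 512 to 2048).
gpu_k_tail = scaling_exponent(seq_lens[2], gpu[2], seq_lens[3], gpu[3])

print(f"GPU exponent, full sweep:   {gpu_k:.2f}")
print(f"GPU exponent, 512 to 2048:  {gpu_k_tail:.2f}")
print(f"TPU exponent, full sweep:   {tpu_k:.3f}")
```

On these figures the GPU exponent comes out near -1.3 over the full sweep and steepens toward -1.5 in the last segment, which fits a workload where the attention term takes a growing share of the cost as sequences lengthen. The TPU exponent is effectively zero.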
Experiments 3-4: Depth and Control Flow
Depth (1 to 24 layers): The TPU maintains a stable ~2.5x throughput advantage regardless of model depth. Neither device hits OOM at 24 layers — BERT-base (12 layers) uses only 19% of the L4's VRAM.
Control flow: Here's where it gets interesting. A Python if tensor > 0 branch costs the GPU 41% throughput — but it costs the TPU 61%. XLA has to sync mid-step and recompile. The fix? Replace if with torch.where. One line of code recovers 98.5% of the TPU's lost performance. Padding masks — the most common dynamic pattern in NLP — are essentially free on TPU.
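The rewrite looks roughly like this. It is a minimal sketch, not the benchmark's actual code: the function names and the `x.sum() > 0` condition are hypothetical stand-ins for whatever data-dependent branch a model contains, and the snippet assumes PyTorch is installed.

```python
import torch


def scale_with_branch(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent Python branch: evaluating the condition requires the
    # tensor's value on the host, which forces a sync on lazy/XLA backends.
    if x.sum() > 0:
        return x * 2.0
    return x * 0.5


def scale_with_where(x: torch.Tensor) -> torch.Tensor:
    # Same decision expressed as a tensor op. Both candidates are computed
    # and the selection happens on the device, so there is no host-side branch.
    return torch.where(x.sum() > 0, x * 2.0, x * 0.5)


x = torch.tensor([1.0, -0.5, 2.0])
print(scale_with_branch(x))
print(scale_with_where(x))
```

On CPU this only demonstrates that the two forms return the same values. The sync and recompile cost described above shows up when the tensors live on an XLA device. The trade-off is that `torch.where` evaluates both branches, so it suits cheap element-wise alternatives better than two expensive code paths.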
Experiment 5: Precision
This one surprised us.
On the GPU, BF16 delivers a consistent 2x throughput boost and 21-29% VRAM savings. Exactly what you'd expect from Tensor Cores.
On the TPU, BF16 is 1.4-3.3% slower than FP32. Not faster. Slower. XLA's FP32 kernel fusion is more aggressive than its mixed-precision path for this model size. The autocast overhead eats whatever the MXU gains.
GPU gets a clean 2x from BF16. TPU gets a slight regression. The "obvious" optimization isn't obvious at all.
Experiment 6: JAX vs PyTorch on the Same Hardware
JAX compiles through XLA on both devices, and on the TPU so does PyTorch/XLA. Same hardware, and on the TPU the same compiler backend. Surely similar performance?
Not even close.
On the GPU, JAX is 1.7-2.3x faster than PyTorch — it fuses the full train step into one XLA program, eliminating the intermediate memory writes that PyTorch's eager dispatch incurs.
On the TPU, it depends on batch size. JAX dominates at small batches (22.8x faster at batch=4) but PyTorch/XLA wins at large batches (3.3x faster at batch=1024). The crossover is around batch 300.
Same XLA backend, dramatically different performance. Framework choice matters more than most practitioners assume.
The Bottom Line: Cost Per Sample
The TPU costs 1.71x as much per hour ($1.20 vs. $0.70). But cost-per-sample tells a different story.
| Batch | GPU (samples/$) | TPU (samples/$) | Winner |
|---|---|---|---|
| 4 | 7.1M | 1.1M | GPU — 6.5x cheaper |
| 32 | 14.9M | 9.0M | GPU — 1.7x cheaper |
| 64 | 14.2M | 17.9M | TPU — 1.26x cheaper |
| 256 | 13.6M | 71.6M | TPU — 5.3x cheaper |
| 1,024 | 13.2M | 284.4M | TPU — 21.5x cheaper |
The cost crossover is at batch=64. Below that, the GPU's lower hourly rate wins. Above it, the TPU's throughput advantage overwhelms its price premium — and the gap doubles with every batch doubling.
At batch=1024, every dollar spent on the TPU buys 21.5x more training samples than the same dollar on a GPU.
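The cost table follows from the throughput numbers and the hourly prices: samples per dollar is samples per second times 3,600, divided by the hourly price. The sketch below reproduces that arithmetic. It assumes the instance is billed for exactly the time it spends training, with no idle or startup overhead, and the names are illustrative.

```python
GPU_PRICE_PER_HOUR = 0.70  # NVIDIA L4, from the setup table
TPU_PRICE_PER_HOUR = 1.20  # TPU v5e, from the setup table

# batch size -> (GPU samples/s, TPU samples/s), from Experiment 1
throughput = {
    4: (1388, 370),
    32: (2895, 2984),
    64: (2760, 5968),
    256: (2653, 23862),
    1024: (2575, 94795),
}


def samples_per_dollar(samples_per_sec: float, price_per_hour: float) -> float:
    """Samples processed per dollar, assuming billing only for compute time."""
    return samples_per_sec * 3600 / price_per_hour


# The TPU wins on cost only when it is at least this many times faster.
break_even_speedup = TPU_PRICE_PER_HOUR / GPU_PRICE_PER_HOUR

for batch, (gpu_rate, tpu_rate) in throughput.items():
    gpu_spd = samples_per_dollar(gpu_rate, GPU_PRICE_PER_HOUR)
    tpu_spd = samples_per_dollar(tpu_rate, TPU_PRICE_PER_HOUR)
    winner = "TPU" if tpu_spd > gpu_spd else "GPU"
    print(f"batch={batch:>5}  GPU={gpu_spd / 1e6:6.1f}M  TPU={tpu_spd / 1e6:6.1f}M  {winner}")
print(f"break-even TPU speedup: {break_even_speedup:.2f}x")
```

The break-even speedup explains why the cost crossover (batch=64) comes later than the throughput crossover (batch=32): at batch=32 the TPU is only about 1.03x faster, well short of the roughly 1.71x it needs to pay for its higher hourly rate.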
What This Benchmark Does NOT Capture
These experiments isolate raw hardware behavior using a single Transformer encoder block, vanilla PyTorch eager on GPU, and PyTorch/XLA on TPU. No custom kernels, no inference frameworks, no multi-device setups. This is deliberate — we wanted to measure the silicon, not the software stack.
But in production, NVIDIA's software ecosystem closes or reverses many of these gaps:
- FlashAttention: GPU-native fused attention kernels reduce memory I/O by 5-10x. Our benchmark uses standard nn.MultiheadAttention, which doesn't exploit this. Real attention-heavy workloads on GPU are significantly faster than what we measured.
- torch.compile + Triton: Fusing operations into single CUDA kernels via torch.compile can cut GPU framework overhead by 30-50%. Our benchmark runs pure eager mode.
- CUDA Graphs: Record-and-replay eliminates CPU-GPU sync jitter, achieving consistent sub-5ms inference latency. TPU has no equivalent for latency-critical serving.
- Continuous batching & PagedAttention: GPU inference servers (vLLM, SGLang) dynamically pack requests and page KV cache memory. TPU serving is more static and less memory-efficient for variable-length workloads.
- Quantization: GPTQ, AWQ, and bitsandbytes provide GPU-native INT4/INT8 inference kernels. A 70B model quantized to INT4 fits on a single H100 (80 GB) but not a TPU v5e (16 GB). Our INT8 results (Session 5) show TPU quantization is still immature.
- NVLink & multi-GPU scaling: Our benchmark is single-device. H100s connected via NVLink (900 GB/s) enable efficient tensor parallelism and FSDP. TPU v5e chips communicate over slower fabric.
- Structured sparsity: A100/H100 Tensor Cores run 2:4 sparse matrices at full speed. TPU has no hardware sparsity support.
In short: if you're running a production inference service with FlashAttention, continuous batching, quantization, and CUDA Graphs, the GPU's position is much stronger than our raw throughput numbers suggest. The TPU advantage we measured is real, but our GPU numbers are a floor, not a ceiling, for what GPUs can do with the right software.
When to Use What
| Workload | Use | Why |
|---|---|---|
| Small-batch inference (batch <= 16) | GPU | Faster, simpler, cheaper |
| Long sequences (seq >= 256, batch >= 32) | TPU | GPU throughput falls faster than linearly with seq_len; TPU stays flat |
| High-throughput training (batch >= 64) | TPU | Up to 21.5x more samples per dollar (at batch=1024) |
| Mixed precision | GPU: BF16, TPU: FP32 | BF16 helps GPU 2x; hurts TPU slightly |
| Framework choice on TPU | JAX (small batch) / PyTorch/XLA (large batch) | Crossover at batch ~300 |
Reproduce It
The full workshop is open: github.com/tails-mpt/gpar-workshop
Every chart in this post was generated from a notebook you can run. If you disagree with our numbers, run the experiments on your own hardware and let us know.
This is part of the accelerator research we're doing at Thoughtworks. We also train and release EAGLE3 draft models for speculative decoding on both GPU and TPU — if faster LLM inference is your thing, check those out too.






