Cloud VM performance benchmarking: a fair test plan for CPU, disk, and network
A cloud VM benchmark is supposed to answer a simple question: “Which option is faster for my workload?” In 2026, that question gets slippery fast.
Instance families now advertise eye-watering ceilings, like multi-terabit networking for GPU boxes and hundreds of vCPUs for general compute, while storage marketing jumps between IOPS, throughput, latency, and durability in the same paragraph.
The uncomfortable truth is that many benchmarks are “correct” and still unfair. They compare different regions, different kernels, different storage tiers, or they let one platform burst while another is pinned to steady state.
Worse, noisy neighbor effects and platform variance can dwarf the differences you thought you were measuring. One 2025 empirical study comparing public cloud clusters found that some I/O results were extremely unstable, including a reported standard deviation of over 100% for a read case on one platform while the comparison platform stayed in the low single digits.
Why “Fast” Looks Different in 2026
Before diving into methodology, it helps to calibrate what the platforms can do today, because “fair” means testing within real limits, not guessing.
On networking
Modern VM families span from high bandwidth general compute to extreme AI and HPC. Google Cloud documents up to 200 Gbps egress for C3 and C4 class instances with Tier_1 networking and up to 3,600 Gbps egress on some A3 and A4 GPU instances. Amazon Web Services documents specialized instances where aggregate networking can reach multiple terabits per second with multiple network cards and EFA or ENA bandwidth tradeoffs.
On storage
Single VM designs can legitimately push “data center SAN class” numbers, but only if you provision and attach them correctly. Microsoft Azure describes Ultra Disk configurations up to 400,000 IOPS per disk and up to 10,000 MB/s throughput per disk, depending on provisioning. Google’s Hyperdisk Extreme documentation shows instance and disk limits that can reach 500,000 IOPS and 10,000 MiB/s and notes that you may need multiple volumes to actually hit the maximum on some instance sizes.
On consistency
The gap between “average” and “tail” is still the trap. Research on cloud noise shows that even when average performance looks fine, rare events can be spectacular. One study of network noise across multiple providers observed latency spikes that could exceed 100x the minimum, with an extreme outlier in one environment far beyond that.
Another 2025 paper on autotuning in the cloud found that even modest noise can dramatically slow convergence and that many “best” configurations picked during tuning can degrade sharply when deployed.
The takeaway: your benchmark plan must measure both speed and stability, and it must do so under comparable provisioning.
Key Rules That Prevent Accidental Bias
I suggest you treat these as non-negotiables. They are deliberately short so you can paste them straight into a benchmark plan.
- Same workload shape: Same OS image family, kernel major line, compiler versions and benchmark versions.
- Same steady state: Avoid burstable CPU and burstable storage modes unless your real workload uses them and you model credit depletion.
- Same topology: Same region type and placement intent, with separate results for same-zone, cross-zone, and cross-region networking.
- Same bottleneck: If one VM is limited by disk or network while “testing CPU,” you are really benchmarking the slowest subsystem.
- Enough repetitions: Report median plus tail, not just a single run.
Step 1: Build a minimal, controlled test environment
Create a repeatable harness that can stand up identical test nodes on each cloud:
- Choose one region per provider and stick to it. If you must compare multiple regions, treat region as its own variable and do not mix it into the same chart.
- Use a fresh VM per major test suite (CPU, disk, network) or at least reboot between suites, because caches and background daemons matter.
- Record the full bill of materials: VM size, CPU model as reported by the guest, NIC driver, storage type, disk size, and provisioned IOPS and throughput. Note whether the instance is “dedicated” or shared.
Also capture “steal time” or scheduler contention indicators if your OS exposes them. When you see weird variance, those metrics help you prove it was not your code.
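Capturing steal time can be scripted. The sketch below assumes a Linux guest with the standard proc(5) layout of /proc/stat; it computes steal time as a share of all CPU ticks between two snapshots:

```python
# Sketch (assumption: Linux guest with the proc(5) /proc/stat layout):
# compute steal time as a share of all CPU ticks between two snapshots.
def parse_cpu_line(stat_text):
    """Return (steal_ticks, total_ticks) from the aggregate 'cpu' line."""
    fields = stat_text.splitlines()[0].split()
    assert fields[0] == "cpu", "expected the aggregate cpu line first"
    ticks = [int(x) for x in fields[1:]]
    steal = ticks[7] if len(ticks) > 7 else 0  # steal is the 8th counter
    return steal, sum(ticks)

def steal_percent(before, after):
    """Steal percentage between two /proc/stat snapshots."""
    s0, t0 = parse_cpu_line(before)
    s1, t1 = parse_cpu_line(after)
    elapsed_ticks = t1 - t0
    return 100.0 * (s1 - s0) / elapsed_ticks if elapsed_ticks else 0.0
```

Read /proc/stat, wait a second or two, read it again, and pass both snapshots in; anything consistently above a few percent is worth flagging in your report.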
Step 2: CPU benchmarking that is comparable
A fair CPU plan separates three questions: 1) single-core speed, 2) all-core throughput and 3) sustained behavior under thermal and power limits.
CPU test set
- Single-threaded: One worker, pinned if possible.
- Multi-threaded: Workers equal to vCPU count, plus an optional sweep (50%, 75%, 100%) to detect scaling cliffs.
- Sustained: Longer runs (10 to 30 minutes) to catch boost decay and noisy neighbor effects.
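The sweep portion can be sketched in a few lines. The sum-of-squares kernel, task counts, and durations below are placeholders; swap in your real workload and run lengths:

```python
# Sketch: time the same fixed batch of CPU-bound tasks at 50%, 75%, and
# 100% of the vCPU count to expose scaling cliffs. The sum-of-squares
# kernel and task counts are placeholders for your real workload.
import os
import time
from multiprocessing import Pool

def busy_work(n):
    """CPU-bound stand-in kernel; replace with your actual workload."""
    return sum(i * i for i in range(n))

def run_sweep(total_tasks=8, work=200_000):
    vcpus = os.cpu_count() or 1
    timings = {}
    for frac in (0.5, 0.75, 1.0):
        workers = max(1, int(vcpus * frac))
        start = time.perf_counter()
        with Pool(workers) as pool:
            pool.map(busy_work, [work] * total_tasks)
        timings[workers] = time.perf_counter() - start
    return timings  # {worker_count: elapsed_seconds}

if __name__ == "__main__":
    for workers, elapsed in run_sweep().items():
        print(f"{workers} workers: {elapsed:.3f} s")
```

For the sustained test, wrap the same kernel in a long loop and record per-interval throughput so boost decay shows up as a downward trend rather than vanishing into one average.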
If you need a benchmark with industry credibility, use standardized suites where possible and always publish build flags. Cloud vendors often cite SPEC derived results for marketing, but your job is to keep your build consistent and reproducible.
Common CPU fairness pitfalls
- Comparing different turbo behavior without measuring sustained clocks.
- Mixing different kernel versions and mitigations that impact branch heavy workloads.
- Letting background agents and cloud-init tasks run during the timed interval.
Report at least median, p95 and a simple dispersion metric like coefficient of variation over repeated runs.
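That reporting step is easy to standardize. A minimal Python helper, using nearest-rank p95 (one of several reasonable percentile definitions; pick one and state it), might look like:

```python
# Sketch: summarize repeated runs as median, p95, and coefficient of
# variation. Nearest-rank p95 is one of several percentile definitions;
# pick one and state it in your report.
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile; predictable for small run counts."""
    ordered = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]

def summarize(run_times):
    """Median, p95, and coefficient of variation for one test case."""
    return {
        "median": statistics.median(run_times),
        "p95": percentile(run_times, 95),
        "cov": statistics.stdev(run_times) / statistics.mean(run_times),
    }
```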
Step 3: Disk benchmarking that separates IOPS, throughput and latency
Disk is where most “apples to oranges” mistakes happen because cloud storage includes local NVMe, network-attached block, and managed tiers that behave differently.
Start by writing down the storage class you are testing:
- Local ephemeral NVMe (fast, not durable)
- Network-attached block (durable, performance depends on tier and provisioning)
- Distributed or managed “extreme” tiers (very fast, often require explicit performance settings)
Then structure your fio style test cases around common patterns:
- Random 4 KiB read and write for IOPS and latency
- Sequential 64 KiB or 1 MiB for throughput
- Mixed 70/30 read/write for database-like profiles
- Queue depth sweeps to see where the device saturates
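To keep the test definitions byte-identical across clouds, it helps to generate every fio invocation from one place. In this sketch the device path, runtime, and queue depths are placeholders to adapt; the flags themselves are standard fio options:

```python
# Sketch: generate fio command lines from one definition so every cloud
# runs byte-identical tests. The device path, runtime, and queue depths
# are placeholders; the flags themselves are standard fio options.
def fio_cmd(name, rw, bs, iodepth, rwmixread=None,
            target="/dev/nvme1n1", runtime=120):
    cmd = [
        "fio", f"--name={name}", f"--filename={target}",
        f"--rw={rw}", f"--bs={bs}", f"--iodepth={iodepth}",
        "--ioengine=libaio", "--direct=1", "--time_based",
        f"--runtime={runtime}", "--output-format=json",
    ]
    if rwmixread is not None:
        cmd.append(f"--rwmixread={rwmixread}")  # read share of the mix
    return cmd

PROFILES = [
    fio_cmd("rand-4k-read", "randread", "4k", 32),
    fio_cmd("rand-4k-write", "randwrite", "4k", 32),
    fio_cmd("seq-1m-read", "read", "1m", 8),
    fio_cmd("db-mix-70-30", "randrw", "4k", 32, rwmixread=70),
]
```

Turn the queue-depth sweep into more entries in the same list rather than ad-hoc edits, so every run is traceable to a committed definition.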
I highly recommend preconditioning the device, as many SSD-backed systems perform differently before and after they reach a steady write state. Run a warmup phase and only time the steady portion.
Finally, match provisioning across clouds. When you present results, do not collapse everything into a single “disk score.” Show IOPS, throughput, average latency and p99 latency for each profile.
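If you run fio with --output-format=json, a small parser can pull those four numbers per profile. The key layout assumed below (clat_ns with percentile keys like "99.000000") matches recent fio releases, but verify it against your fio version:

```python
# Sketch: extract IOPS, throughput, mean latency, and p99 from fio's
# JSON output. The key layout (clat_ns, percentile keys like
# "99.000000") matches recent fio releases; verify against your version.
import json

def extract_profile(fio_json_text, direction="read"):
    job = json.loads(fio_json_text)["jobs"][0]
    side = job[direction]
    return {
        "iops": side["iops"],
        "bw_bytes": side["bw_bytes"],
        "lat_mean_us": side["clat_ns"]["mean"] / 1000.0,
        "p99_us": side["clat_ns"]["percentiles"]["99.000000"] / 1000.0,
    }
```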
Step 4: Network benchmarking that reflects real deployments
Network tests should answer two separate questions: bandwidth and tail latency. They often disagree.
Define three paths
- VM to VM in the same zone
- VM to VM across zones in the same region
- VM to VM across regions
Bandwidth tests:
- Use multiple parallel TCP streams because single-flow limits are common.
- Run long enough to reach steady throughput, not just a burst.
Latency tests:
- Measure both unloaded latency and latency under load.
- Capture p50, p95, p99 and max. A single rare spike can matter more than a small average change.
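Both halves can be sketched briefly: building the parallel-stream iperf3 client command, and summarizing round-trip samples into the percentiles above. Host names and durations are placeholders:

```python
# Sketch: the bandwidth half builds an iperf3 client command with
# parallel TCP streams; the latency half summarizes round-trip samples
# into the percentiles above. Hosts and durations are placeholders.
import math

def iperf3_cmd(server, streams=8, seconds=60):
    """iperf3 client; -P sets the parallel stream count."""
    return ["iperf3", "-c", server, "-P", str(streams),
            "-t", str(seconds), "--json"]

def latency_summary(rtts_ms):
    """p50/p95/p99/max for round-trip times in milliseconds."""
    ordered = sorted(rtts_ms)
    def pct(p):
        k = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[k - 1]  # nearest-rank percentile
    return {"p50": pct(50), "p95": pct(95),
            "p99": pct(99), "max": ordered[-1]}
```

Collect the latency samples both with the bandwidth test idle and with it running; the gap between the two is your loaded-latency story.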
Also be honest about what you are testing. Some providers document very high in-VPC networking bandwidth while limiting egress to destinations outside the VPC, depending on configuration.
On the AWS side, instance networking is shaped by instance size and bandwidth allowances, and published bandwidth is not always achievable in every situation.
If your workload is HPC or distributed training, document whether you used placement features and RDMA capable adapters. Those options can change your result more than the VM family choice.
Step 5: Run schedule, statistics and reporting
A “fair” benchmark is as much about experimental design as it is about tools.
- Run at least 10 iterations per test case, spread across multiple hours, ideally multiple days.
- Randomize the order of test cases to avoid time of day bias.
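The randomized schedule is easy to script. This sketch shuffles the case order independently for each iteration so no test case always lands at the same time of day; the case names are examples and the fixed seed keeps the schedule itself reproducible:

```python
# Sketch: shuffle the case order independently for each iteration so no
# test case always lands at the same time of day. Case names are
# examples; a fixed seed keeps the schedule itself reproducible.
import random

def build_schedule(cases, iterations=10, seed=2026):
    """Return a flat run order with one shuffled pass per iteration."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(iterations):
        order = list(cases)
        rng.shuffle(order)
        schedule.extend(order)
    return schedule

CASES = ["cpu-single", "cpu-all-core", "disk-rand-4k", "net-same-zone"]
```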
- Report median and p95 and include variability. The 2025 MDPI cloud variability study mentioned earlier shows why stability can matter more than mean throughput for I/O heavy systems.
- Keep raw logs. When someone challenges the result, you want to show evidence, not reassurance.
Fair Benchmarks Buy You Trust, Not Just Numbers
Cloud performance has never been simple, and in 2026 it is even less so. The ceilings are huge, like multi-hundred-gigabit networking and hundreds of thousands of IOPS, but the spread between the best case and worst case can still surprise you.
In my opinion, a fair test plan is the only antidote. Control the environment, test CPU, disk and network in ways that match actual workload patterns and publish variability alongside speed.
Do that and your benchmark stops being a screenshot for a slide deck. It becomes a decision tool your team can rely on, even when the next instance generation arrives and everyone wants to rerun the race.
