kernels benchmark

Use kernels benchmark to run benchmark scripts shipped with a kernel repository.

The command:

Downloads the kernel repo at a specific branch or version
Runs all benchmarks/benchmark*.py scripts
Times each benchmark_* workload and prints a results table
Optionally saves results as JSON

Installation

kernels benchmark requires extra dependencies:

uv pip install 'kernels[benchmark]' # or pip install 'kernels[benchmark]'

Example

kernels benchmark kernels-community/activation --version 1

Example output:

Downloading kernels-community/activation@v1...
Running benchmark.py...

  GPU      Apple M3 Max (30 cores)
  CPU      Apple M3 Max
  OS       Darwin 25.2.0
  PyTorch  2.10.0

  Running SiluWorkloads on mps

┌───────────────┬────────────┬─────┬───────────┬────────────┬───────────┬───────────┬───────────┬───────────┬────────────┬───────────┬─────────┐
│ Benchmark     │ Workload   │   N │ Speedup   │   Mean(ms) │   Std(ms) │   Min(ms) │   Max(ms) │   IQR(ms) │   Outliers │   Ref(ms) │ Match   │
├───────────────┼────────────┼─────┼───────────┼────────────┼───────────┼───────────┼───────────┼───────────┼────────────┼───────────┼─────────┤
│ SiluWorkloads │ large      │ 100 │ 1.72x     │     6.5153 │    0.4343 │    6.2883 │    8.4699 │    0.1701 │          8 │   11.2048 │ ✓       │
│ SiluWorkloads │ medium     │ 100 │ 2.48x     │     1.1813 │    0.3976 │    1.04   │    4.2146 │    0.0698 │          5 │    2.9332 │ ✓       │
│ SiluWorkloads │ small      │ 100 │ 1.96x     │     0.4909 │    0.2175 │    0.4407 │    2.6438 │    0.0085 │         16 │    0.9622 │ ✓       │
└───────────────┴────────────┴─────┴───────────┴────────────┴───────────┴───────────┴───────────┴───────────┴────────────┴───────────┴─────────┘

  large: 1.72x faster (95% CI: 6.4302-6.6004ms vs ref 11.2048ms) ✓ significant
  medium: 2.48x faster (95% CI: 1.1034-1.2592ms vs ref 2.9332ms) ✓ significant
  small: 1.96x faster (95% CI: 0.4483-0.5335ms vs ref 0.9622ms) ✓ significant

Kernel: 2385e44  Benchmark: 5b53516
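The summary lines can be reproduced from the table columns. In the sample output above, each speedup matches Ref(ms) / Mean(ms), and each 95% CI matches a normal-approximation interval, mean ± 1.96 · std / √N. This relationship is inferred from the printed numbers, not taken from the tool's source, so the sketch below is an illustration only; `summarize` is a hypothetical helper.

```python
import math

# Rows copied from the sample table above: (workload, N, mean_ms, std_ms, ref_ms)
rows = [
    ("large", 100, 6.5153, 0.4343, 11.2048),
    ("medium", 100, 1.1813, 0.3976, 2.9332),
    ("small", 100, 0.4909, 0.2175, 0.9622),
]


def summarize(n, mean_ms, std_ms, ref_ms, z=1.96):
    """Return (speedup, ci_low, ci_high), assuming a normal-approximation CI."""
    half_width = z * std_ms / math.sqrt(n)
    return ref_ms / mean_ms, mean_ms - half_width, mean_ms + half_width


for name, n, mean_ms, std_ms, ref_ms in rows:
    speedup, low, high = summarize(n, mean_ms, std_ms, ref_ms)
    print(f"{name}: {speedup:.2f}x faster "
          f"(95% CI: {low:.4f}-{high:.4f}ms vs ref {ref_ms}ms)")
```

Run against the three sample rows, this reproduces the speedup and interval figures shown in the summary lines. It does not attempt to reproduce the "significant" marker, whose criterion is not stated here.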

Usage

You must specify which revision to benchmark, either with the --version or --branch flag or with an @... suffix on the repo id:

kernels benchmark <repo_id> --version <N>
kernels benchmark <repo_id> --branch <branch>
kernels benchmark <repo_id>@v<N>
kernels benchmark <repo_id>@<branch>

Examples

Benchmark a tagged kernel version:

kernels benchmark kernels-community/activation --version 1

Equivalent shorthand:

kernels benchmark kernels-community/activation@v1

Benchmark a branch:

kernels benchmark kernels-community/activation --branch main

Tune warmup and iteration count:

kernels benchmark kernels-community/activation@v1 --warmup 20 --iterations 200

Save results to a file (JSON):

kernels benchmark kernels-community/activation@v1 --output results.json

Benchmark a local kernel checkout (must contain benchmarks/):

kernels benchmark ./my_kernel

Output

By default, a table is printed (timings in ms).
--output <file>.json writes the results to disk as a JSON payload.
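The schema of the JSON payload is not described here, so a cautious first step is to load the file and inspect its top-level structure before relying on any field names. The sketch below assumes only that results.json (the file name used in the --output example) contains valid JSON; `describe_payload` is a hypothetical helper, not part of the kernels package.

```python
import json
from pathlib import Path


def describe_payload(path):
    """Load a JSON file and return a short description of its top-level shape."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(payload, dict):
        return f"object with keys: {sorted(payload)}"
    if isinstance(payload, list):
        return f"array with {len(payload)} items"
    return f"scalar of type {type(payload).__name__}"


if __name__ == "__main__":
    # Assumed to have been written by:
    #   kernels benchmark kernels-community/activation@v1 --output results.json
    results = Path("results.json")
    if results.exists():
        print(describe_payload(results))
    else:
        print(f"{results} not found; run the benchmark with --output first")
```

Once the actual keys are known, a comparison script between two runs can be built on top of them.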

Writing Benchmark Scripts

Benchmark scripts must live under benchmarks/ in the kernel repository and match benchmark*.py. Each script should define one or more subclasses of kernels.benchmark.Benchmark.

Minimal example (benchmarks/benchmark_activation.py):

import torch

from kernels.benchmark import Benchmark

class ActivationBenchmark(Benchmark):
    seed = 0

    def setup(self):
        self.x = torch.randn(128, 1024, device=self.device, dtype=torch.float16)
        self.out = torch.empty(128, 512, device=self.device, dtype=torch.float16)

    def benchmark_silu_and_mul(self):
        self.kernel.silu_and_mul(self.out, self.x)

    def verify_silu_and_mul(self):
        # Return reference tensor; runner compares with self.out
        return torch.nn.functional.silu(self.x[..., :512]) * self.x[..., 512:]

The runner will:

Call setup() once per workload (or setup_<name>() if present)
Warm up (--warmup)
Time benchmark_<name>() for --iterations
If verify_<name>() exists, check that the outputs match (torch.allclose(..., atol=1e-2)) and show a speedup vs the reference computation
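As a rough mental model of those steps, the sketch below runs the same sequence in plain Python: setup, warmup, timed iterations, then a tolerance check against a reference. It is not the kernels runner. `ToyBenchmark`, `run_workload`, and `close_enough` are hypothetical stand-ins, the per-workload method lookup (setup_square, benchmark_square, verify_square) is an assumed naming pattern, and the comparison mimics the torch.allclose rule |a - b| <= atol + rtol * |b| on Python lists instead of tensors. A real GPU runner would also need to synchronize the device around the timed region, which this sketch omits.

```python
import statistics
import time


class ToyBenchmark:
    """Hypothetical stand-in for a Benchmark subclass; uses lists, not tensors."""

    def setup(self):
        self.x = [i / 1000 for i in range(10_000)]
        self.out = [0.0] * len(self.x)

    def benchmark_square(self):
        for i, v in enumerate(self.x):
            self.out[i] = v * v

    def verify_square(self):
        # Reference result; the runner compares it with self.out.
        return [v ** 2 for v in self.x]


def close_enough(actual, expected, atol=1e-2, rtol=1e-5):
    """Elementwise |a - b| <= atol + rtol * |b|, the rule torch.allclose applies."""
    return len(actual) == len(expected) and all(
        abs(a - b) <= atol + rtol * abs(b) for a, b in zip(actual, expected)
    )


def run_workload(bench, name, warmup=10, iterations=100):
    """Set up, warm up, time `iterations` calls, then verify if possible."""
    # Prefer a workload-specific setup method, else fall back to setup().
    getattr(bench, f"setup_{name}", bench.setup)()
    fn = getattr(bench, f"benchmark_{name}")
    for _ in range(warmup):  # untimed warmup calls
        fn()
    times_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000)
    verify = getattr(bench, f"verify_{name}", None)
    match = close_enough(bench.out, verify()) if verify else None
    return {"n": iterations, "mean_ms": statistics.mean(times_ms), "match": match}


result = run_workload(ToyBenchmark(), "square", warmup=2, iterations=5)
print(result)
```

Keeping warmup calls outside the timed loop is the point of the --warmup flag: one-time costs such as caching or lazy initialization do not skew the reported mean.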

Troubleshooting

If the repo does not contain a benchmarks/ directory (or no benchmark*.py files), the command exits with an error.
If a benchmark script defines no Benchmark subclasses, the command exits with an error.
If verify_<name>() exists and the outputs do not match, the command exits with an error.
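For a local checkout, the first failure mode can be checked before running the command. The sketch below tests only the layout rule stated earlier (a benchmarks/ directory containing files that match benchmark*.py). `find_benchmark_scripts` is a hypothetical helper, not part of the kernels package, and it does not check whether the scripts define Benchmark subclasses.

```python
from pathlib import Path


def find_benchmark_scripts(repo_dir):
    """Return sorted benchmark*.py paths under <repo_dir>/benchmarks, or [] if none."""
    bench_dir = Path(repo_dir) / "benchmarks"
    if not bench_dir.is_dir():
        return []
    return sorted(p for p in bench_dir.glob("benchmark*.py") if p.is_file())


if __name__ == "__main__":
    # ./my_kernel is the local checkout path used in the example above.
    scripts = find_benchmark_scripts("./my_kernel")
    if scripts:
        for script in scripts:
            print(script)
    else:
        print("no benchmarks/benchmark*.py scripts found")
```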
