# rvLLM H100 artifacts (2026-04-04)
This bundle contains the final H100 artifacts pulled from the Vast instance after the April 4, 2026 tuning pass.
## Environment
- GPU: NVIDIA H100 80GB HBM3
- Model: Qwen/Qwen2.5-7B
- Output length: 128 tokens unless noted
- Repo head pushed to GitHub: acbb62578
- Public source repo: https://github.com/m0at/rvllm
## Final shipped settings
- Batch-1 default decode now uses the reusable batched scratch path.
- Default batched GEMM policy is hybrid:
  - QKV / O-proj / down-proj: cuBLAS or cublasLt
  - GateUp + SiLU: CUTLASS
- Persistent v3 remains an experimental path and is included here only as a correctness/perf artifact.
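The hybrid policy above can be sketched as a simple dispatch table. This is an illustrative sketch only: the names `HYBRID_POLICY`, `select_gemm_backend`, and the op labels are hypothetical and do not correspond to the actual rvLLM API.

```python
# Hypothetical sketch of the hybrid batched-GEMM policy described above:
# QKV / O-proj / down-proj go to cuBLAS/cublasLt, fused GateUp + SiLU to CUTLASS.
HYBRID_POLICY = {
    "qkv_proj": "cublas",
    "o_proj": "cublas",
    "down_proj": "cublas",
    "gate_up_silu": "cutlass",
}


def select_gemm_backend(op_name: str) -> str:
    """Return the GEMM backend for a decode-layer op under the hybrid policy."""
    try:
        return HYBRID_POLICY[op_name]
    except KeyError:
        raise ValueError(f"unknown op: {op_name}")


print(select_gemm_backend("gate_up_silu"))  # cutlass
```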
## Canonical direct-engine results from this pass
Current rvLLM on this H100, alongside the published vLLM 0.19.0 comparison point from the same April 4 campaign:

| N   | rvLLM (tok/s) | vLLM 0.19.0 (tok/s) |
|-----|---------------|---------------------|
| 1   | 133.1         | 165.5               |
| 64  | 8038.0        | 7972.1              |
| 128 | 13110.1       | 13903.5             |
Earlier clean batched logs preserved in this bundle:
- N=32: 4407.5 tok/s (`bench_32.txt`)
- N=64: 7964.0 tok/s (`bench_64.txt`)
- N=128: 13148.3 tok/s (`bench_128.txt`)
## Included files
- `bench_32.txt`, `bench_64.txt`, `bench_128.txt`: clean batched direct-engine benchmark outputs
- `baseline_final_check.txt`: short baseline batch-1 check before persistent-v3 correctness fix work
- `v3_final_check.txt`: log showing real persistent-v3 launches on decode
- `v3_harness_after_fix.txt`: standalone persistent-v3 harness success log
- `v3_memcheck_after_fix.txt`: compute-sanitizer log with `ERROR SUMMARY: 0 errors`
- `flash_attention_3_v3.ptx`: SM90 FA3 v3 PTX artifact
- `persistent_layer_v3.ptx`: SM90 persistent-v3 PTX artifact
- `persistent_layer_v3.cubin`: compiled persistent-v3 cubin from the H100 instance
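One quick way to verify a downloaded bundle is complete is to check for the filenames listed above. The helper below is a hypothetical convenience script, not something shipped in the bundle:

```python
from pathlib import Path

# Filenames taken from the "Included files" list above.
EXPECTED = [
    "bench_32.txt", "bench_64.txt", "bench_128.txt",
    "baseline_final_check.txt", "v3_final_check.txt",
    "v3_harness_after_fix.txt", "v3_memcheck_after_fix.txt",
    "flash_attention_3_v3.ptx", "persistent_layer_v3.ptx",
    "persistent_layer_v3.cubin",
]


def missing_files(bundle_dir: str) -> list:
    """Return the expected bundle files absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("bundle complete" if not missing else f"missing: {missing}")
```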