
Rotary Position Embeddings Benchmarks - Aggregated Results


This document combines benchmark results from two Rotary Position Embeddings implementations: HF Kernels Rotary (hf_kernels_rotary) and PyTorch Rotary (torch_eager).
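As a rough illustration of what combining per-implementation results can look like, the sketch below reads JSONL files (one JSON object per line, like the rotary.jsonl files listed in the log) and formats one summary row per record. The record field names `impl`, `wl`, `p50_ms`, and `ok` are assumptions made for this example, as are the helper names; the actual schema and code used by the benchmark are not shown in this document.

```python
import json
from pathlib import Path


def parse_jsonl(text):
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def load_records(paths):
    """Read and concatenate records from several JSONL files.

    Missing files are reported and skipped rather than raising.
    """
    records = []
    for path in map(Path, paths):
        if not path.exists():
            print(f"missing: {path}")
            continue
        records.extend(parse_jsonl(path.read_text()))
    return records


def summarize(records):
    """Return one formatted row per record, sorted by (impl, wl).

    Assumes hypothetical fields: impl (str), wl (str), p50_ms (float), ok (bool).
    """
    rows = sorted(records, key=lambda r: (r["impl"], r["wl"]))
    return [
        f'{r["impl"]:<20}{r["wl"]:<28}{r["p50_ms"]:>7.2f}  {r["ok"]}'
        for r in rows
    ]
```

Calling `summarize(load_records([...]))` with the two cached rotary.jsonl paths would, under these assumptions, yield rows shaped like those in the combined summary.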


Combined Summary and Visualization

[Figure: latency.svg. Title: "Attention Implementation Latency". X axis: Workload (24 CUDA workloads, cuda_B1_S128_H8_D64_R32 through cuda_B2_S2048_H32_D128_R64). Y axis: Latency P50 (ms), ticks from 0.2 to 0.6. Legend: torch_eager. Generated 2025-10-28T14:09:08.848427 with Matplotlib v3.10.7, https://matplotlib.org/]
Cell: combine | 4.36s

======================================================================
LOADING BENCHMARK DATA
======================================================================
✓ HF Kernels Rotary             : /__w/kernels-benchmarks/kernels-benchmarks/benches/rotary/impls/.uvnote/cache/49ec9501b131c967277abe3cccb638422565260339bb30f5ea386b0076f2183e
✓ PyTorch Rotary                : /__w/kernels-benchmarks/kernels-benchmarks/benches/rotary/impls/.uvnote/cache/abf801d6445dfa81a8dd7b2e6257930c39c18160a9b97a739858c3b244e16cc5

  ✓ Found HF Kernels Rotary
     Path: /__w/kernels-benchmarks/kernels-benchmarks/benches/rotary/impls/.uvnote/cache/49ec9501b131c967277abe3cccb638422565260339bb30f5ea386b0076f2183e/rotary.jsonl
  ✓ Found PyTorch Rotary
     Path: /__w/kernels-benchmarks/kernels-benchmarks/benches/rotary/impls/.uvnote/cache/abf801d6445dfa81a8dd7b2e6257930c39c18160a9b97a739858c3b244e16cc5/rotary.jsonl

======================================================================
Summary: 2 found, 0 skipped, 0 missing
======================================================================

COMBINED BENCHMARK SUMMARY

impl               wl                          p50(ms)  ok
hf_kernels_rotary  cuda_B1_S128_H32_D128_R64      0.09  False
hf_kernels_rotary  cuda_B1_S128_H32_D64_R32       0.09  False
hf_kernels_rotary  cuda_B1_S128_H8_D128_R64       0.09  False
hf_kernels_rotary  cuda_B1_S128_H8_D64_R32        0.08  False
hf_kernels_rotary  cuda_B1_S2048_H32_D128_R64     0.09  False
hf_kernels_rotary  cuda_B1_S2048_H32_D64_R32      0.09  False
hf_kernels_rotary  cuda_B1_S2048_H8_D128_R64      0.09  False
hf_kernels_rotary  cuda_B1_S2048_H8_D64_R32       0.09  False
hf_kernels_rotary  cuda_B1_S512_H32_D128_R64      0.09  False
hf_kernels_rotary  cuda_B1_S512_H32_D64_R32       0.09  False
hf_kernels_rotary  cuda_B1_S512_H8_D128_R64       0.09  False
hf_kernels_rotary  cuda_B1_S512_H8_D64_R32        0.09  False
hf_kernels_rotary  cuda_B2_S128_H32_D128_R64      0.09  False
hf_kernels_rotary  cuda_B2_S128_H32_D64_R32       0.09  False
hf_kernels_rotary  cuda_B2_S128_H8_D128_R64       0.09  False
hf_kernels_rotary  cuda_B2_S128_H8_D64_R32        0.09  False
hf_kernels_rotary  cuda_B2_S2048_H32_D128_R64     0.28  False
hf_kernels_rotary  cuda_B2_S2048_H32_D64_R32      0.10  False
hf_kernels_rotary  cuda_B2_S2048_H8_D128_R64      0.09  False
hf_kernels_rotary  cuda_B2_S2048_H8_D64_R32       0.09  False
hf_kernels_rotary  cuda_B2_S512_H32_D128_R64      0.09  False
hf_kernels_rotary  cuda_B2_S512_H32_D64_R32       0.09  False
hf_kernels_rotary  cuda_B2_S512_H8_D128_R64       0.09  False
hf_kernels_rotary  cuda_B2_S512_H8_D64_R32        0.09  False
torch_eager        cuda_B1_S128_H32_D128_R64      0.22  True
torch_eager        cuda_B1_S128_H32_D64_R32       0.22  True
torch_eager        cuda_B1_S128_H8_D128_R64       0.23  True
torch_eager        cuda_B1_S128_H8_D64_R32        0.17  True
torch_eager        cuda_B1_S2048_H32_D128_R64     0.22  True
torch_eager        cuda_B1_S2048_H32_D64_R32      0.22  True
torch_eager        cuda_B1_S2048_H8_D128_R64      0.22  True
torch_eager        cuda_B1_S2048_H8_D64_R32       0.22  True
torch_eager        cuda_B1_S512_H32_D128_R64      0.22  True
torch_eager        cuda_B1_S512_H32_D64_R32       0.22  True
torch_eager        cuda_B1_S512_H8_D128_R64       0.22  True
torch_eager        cuda_B1_S512_H8_D64_R32        0.22  True
torch_eager        cuda_B2_S128_H32_D128_R64      0.22  True
torch_eager        cuda_B2_S128_H32_D64_R32       0.22  True
torch_eager        cuda_B2_S128_H8_D128_R64       0.22  True
torch_eager        cuda_B2_S128_H8_D64_R32        0.22  True
torch_eager        cuda_B2_S2048_H32_D128_R64     0.64  True
torch_eager        cuda_B2_S2048_H32_D64_R32      0.23  True
torch_eager        cuda_B2_S2048_H8_D128_R64      0.22  True
torch_eager        cuda_B2_S2048_H8_D64_R32       0.22  True
torch_eager        cuda_B2_S512_H32_D128_R64      0.22  True
torch_eager        cuda_B2_S512_H32_D64_R32       0.22  True
torch_eager        cuda_B2_S512_H8_D128_R64       0.22  True
torch_eager        cuda_B2_S512_H8_D64_R32        0.22  True

GENERATING COMBINED VISUALIZATION

Loaded 48 records
✓ Visualization saved as latency.svg
Saved latency.png
✓ Visualization saved as latency.svg
✓ SVG visualization ready!

ANALYSIS COMPLETE
Total implementations analyzed: 2

Implementations included:
  ✓ HF Kernels Rotary
  ✓ PyTorch Rotary
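
To read the summary rows side by side, the sketch below pairs the two implementations per workload, computes the p50 latency ratio, and orders workloads numerically (the summary sorts them as strings, so S2048 precedes S512). The four rows are copied from the summary; the function names are hypothetical. Reading the workload name as batch size (B), sequence length (S), heads (H), head dimension (D), and rotary dimension (R) is an assumption, as is treating the `ok` column as a pass/fail check. Under that reading, a ratio involving an `ok=False` row should not be taken as a validated speedup.

```python
import re

# (impl, workload, p50 in ms, ok) copied from four rows of the combined summary.
ROWS = [
    ("hf_kernels_rotary", "cuda_B1_S128_H8_D64_R32", 0.08, False),
    ("hf_kernels_rotary", "cuda_B2_S2048_H32_D128_R64", 0.28, False),
    ("torch_eager", "cuda_B1_S128_H8_D64_R32", 0.17, True),
    ("torch_eager", "cuda_B2_S2048_H32_D128_R64", 0.64, True),
]

# Assumed naming scheme: <device>_B<batch>_S<seq>_H<heads>_D<head_dim>_R<rotary_dim>.
WL_PATTERN = re.compile(
    r"(?P<device>[a-z]+)_B(?P<B>\d+)_S(?P<S>\d+)_H(?P<H>\d+)_D(?P<D>\d+)_R(?P<R>\d+)"
)


def parse_workload(name):
    """Split a workload name into its device string and integer fields."""
    match = WL_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized workload name: {name}")
    return {
        key: (value if key == "device" else int(value))
        for key, value in match.groupdict().items()
    }


def latency_ratios(rows, baseline="torch_eager", candidate="hf_kernels_rotary"):
    """Return (workload, baseline_p50 / candidate_p50, both_ok) per workload.

    Workloads are ordered numerically by (B, S, H, D, R); workloads missing
    either implementation are skipped.
    """
    by_key = {(impl, wl): (p50, ok) for impl, wl, p50, ok in rows}
    workloads = sorted(
        {wl for _, wl, _, _ in rows},
        key=lambda wl: tuple(parse_workload(wl)[k] for k in "BSHDR"),
    )
    results = []
    for wl in workloads:
        if (baseline, wl) not in by_key or (candidate, wl) not in by_key:
            continue
        base_p50, base_ok = by_key[(baseline, wl)]
        cand_p50, cand_ok = by_key[(candidate, wl)]
        results.append((wl, base_p50 / cand_p50, base_ok and cand_ok))
    return results


for wl, ratio, both_ok in latency_ratios(ROWS):
    print(f"{wl:<28}{ratio:>6.2f}x  both_ok={both_ok}")
```

Because every hf_kernels_rotary row in the summary reports `ok` as False, `both_ok` is False for each pair here, and the ratios describe raw timings only.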

Artifacts:

latency.svg