tritllm-kernel
Multiply-free ternary GEMV CUDA kernel for the codec from "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
The headline number from the paper: 7.8× speedup over cuBLAS FP16 GEMV on an RTX 4090 in the memory-bound regime, projected to full-model token-generation throughput from per-layer benchmarks.
These are kernel-only projections, not end-to-end serving throughput. They exclude attention, KV cache, sampling, and tokenizer overhead. See Section 7 of the paper for methodology.
What it is
A standalone CUDA shared library (libtrit_gemv.so / .dll) callable via ctypes from any language, with no PyTorch dependency. The same algorithm is also wrapped for PyTorch via pybind11 in trit_gemv_wrapper.cu for benchmarking.
The core trick: each ternary weight (-1, 0, +1) reduces a multiply-accumulate to a conditional add/subtract/skip. On Ada, Hopper, and Blackwell the kernel uses the dp4a intrinsic on int4-packed weights and pre-interleaved int8 activations to evaluate four ternary×int8 products per instruction.
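The multiply-free idea can be sketched in plain Python (a scalar reference for intuition only, not the dp4a path): each ternary weight selects add, subtract, or skip.

```python
def ternary_dot(weights, acts):
    """Multiply-free dot product: each ternary weight in {-1, 0, +1}
    turns the multiply-accumulate into a conditional add/subtract/skip."""
    acc = 0
    for w, x in zip(weights, acts):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        # w == 0: contributes nothing, skip entirely
    return acc

# Agrees with the ordinary dot product for ternary weights:
w = [1, -1, 0, 1]
x = [5, 3, 7, -2]
assert ternary_dot(w, x) == sum(wi * xi for wi, xi in zip(w, x))
```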
Build
```sh
cd kernel
./build.sh
```
The build script targets SM 70/75/80/86/89/90/100/120 in one fat binary so the .so runs on V100, T4, A100, RTX 30/40/50, H100, and B100/B200 without recompilation.
Required: nvcc (CUDA 11.8 or newer) and a C++ compiler.
Performance (Qwen2.5-7B, d=2, 3.47 bpw, 3.3 GB model)
| GPU | L2 cache | Tokens/sec | Speedup vs FP16 cuBLAS | Effective BW |
|---|---|---|---|---|
| RTX 4090 | 72 MB | 588 | 7.8× | 1940 GB/s |
| RTX 3090 | 6 MB | 192 | 3.4× | 633 GB/s |
| RTX 4080 Laptop | 64 MB | 133 | 5.8× | 439 GB/s |
| A100 80GB | 40 MB | 201 | 4.2× | 663 GB/s |
These are per-layer GEMV benchmarks projected to full-model token-generation throughput. L2 cache size correlates strongly with speedup because each d=2 layer fits in L2 on the RTX 4090, giving an effective bandwidth of roughly 2× HBM bandwidth.
See kernel/bench_*.py for the benchmark drivers.
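The Effective BW column is consistent with simple bytes-per-token arithmetic, assuming the whole 3.3 GB packed model is streamed once per generated token (effective bandwidth ≈ model size × tokens/sec). A quick sanity check:

```python
model_gb = 3.3  # packed model size from the table caption, in GB

# (tokens/sec, reported effective bandwidth in GB/s) per GPU
rows = {
    "RTX 4090": (588, 1940),
    "RTX 3090": (192, 633),
    "RTX 4080 Laptop": (133, 439),
    "A100 80GB": (201, 663),
}
for gpu, (tps, bw) in rows.items():
    # model_gb * tokens/sec should reproduce the reported GB/s
    assert abs(model_gb * tps - bw) <= 1, gpu
```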
Launch contract
The kernels in trit_gemv.cu and trit_gemv_standalone.cu assume:
| Constraint | Why | What happens if violated |
|---|---|---|
| `blockDim.x == 32` (one warp per block) | Kernels use `__shfl_down_sync(0xFFFFFFFF, ...)` and lane-0 reduction | OOB index reads + race on `y[row]` |
| `in_features % 64 == 0` | Group size is fixed at 64 weights | Trailing partial group is silently dropped → incorrect output for that row |
| Weight, scale, and activation buffers are device-resident and properly aligned | Kernel uses `__ldg` for cached loads | UB / device fault |
If your model's in_features is not divisible by 64, zero-pad the weight matrix along the in_features dimension to the next multiple of 64 before quantizing.
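That padding step is a one-liner's worth of arithmetic; here is a pure-Python sketch (`pad_in_features` is an illustrative helper, not part of the library):

```python
def pad_in_features(weight_rows, group=64):
    """weight_rows: list of rows, each a list of in_features values.
    Returns rows zero-padded so that len(row) % group == 0."""
    cols = len(weight_rows[0])
    padded = (cols + group - 1) // group * group  # round up to next multiple of 64
    pad = [0.0] * (padded - cols)
    return [row + pad for row in weight_rows]

W = [[0.5] * 100, [-0.25] * 100]   # in_features = 100, not divisible by 64
Wp = pad_in_features(W)
assert len(Wp[0]) == 128           # padded up to 2 groups of 64
assert Wp[0][100:] == [0.0] * 28   # new positions are zeros
```

Zero weights quantize to the trit 0, so the padded positions contribute nothing to the dot product.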
API surface
C ABI in trit_gemv_standalone.cu:
```c
// Best-tested d=2 path (champion for 4090)
void trit_gemv_d2_fast(
    const int32_t* pt,    // [rows * num_groups * 8] int4-packed weights
    const float*   ws,    // [rows * num_groups] weight scales
    const int32_t* xt_e,  // [num_groups * 8] even-nibble activations
    const int32_t* xt_o,  // [num_groups * 8] odd-nibble activations
    const float*   xs,    // [num_groups] activation scales
    float*         y,     // [rows] output
    int cols, int rows, int num_groups,
    int use_l2_persist    // 0 = off, 1 = enable L2 persistence
);

// Native-trit packed d=3 (no int4 intermediate)
void trit_gemv_d3_native(
    const int32_t* pt,    // [rows * num_groups * 13] trit-packed weights
    const float*   sc,    // scales
    const float*   x,     // activations
    float*         y,     // [rows] output
    int cols, int rows, int depth
);

// L2 cache size query (for deciding whether to enable persist)
int get_l2_cache_bytes();

void get_gpu_name(char* buf, int buflen);
void cuda_sync();
```
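The buffer element counts for the d=2 entry point follow directly from the signature comments above. A sketch of the host-side size bookkeeping (`d2_buffer_sizes` is an illustrative helper, not a library function):

```python
def d2_buffer_sizes(rows, cols, group=64):
    """Element counts for trit_gemv_d2_fast, derived from the signature
    comments: each 64-weight group packs into 8 int32 words."""
    assert cols % group == 0, "in_features must be a multiple of 64"
    num_groups = cols // group
    return {
        "pt":   rows * num_groups * 8,  # int4-packed weights (int32 words)
        "ws":   rows * num_groups,      # per-group weight scales (float)
        "xt_e": num_groups * 8,         # even-nibble activations (int32 words)
        "xt_o": num_groups * 8,         # odd-nibble activations (int32 words)
        "xs":   num_groups,             # per-group activation scales (float)
        "y":    rows,                   # float output vector
    }

sizes = d2_buffer_sizes(rows=4096, cols=4096)  # num_groups = 64
assert sizes["pt"] == 4096 * 64 * 8
assert sizes["y"] == 4096
```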
Error reporting
All extern "C" entry points return void, so per-call status is delivered through a separate channel:
```c
int trit_gemv_get_last_error();
```
Returns 0 on success. Negative values are host-side argument-validation failures (`TRIT_ERR_NULL_PTR`, `TRIT_ERR_BAD_DIM`, `TRIT_ERR_BAD_GROUP`, `TRIT_ERR_BAD_BUFFER`). Positive values are `cudaError_t` codes captured from the most recent kernel launch.
The host-side validator in each entry point checks pointer non-null, positive dimensions, cols % 64 == 0, and cols / 64 == num_groups. If validation fails, no kernel is launched, the error is recorded, and the call returns silently.
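The same checks can be mirrored on the caller's side before crossing the ctypes boundary, which turns a silent error-code path into an early exception. A sketch (the `TRIT_ERR_*` values here are illustrative stand-ins; the real constants live in trit_gemv_standalone.cu):

```python
# Illustrative stand-ins for the library's negative error codes
TRIT_ERR_NULL_PTR, TRIT_ERR_BAD_DIM, TRIT_ERR_BAD_GROUP = -1, -2, -3

def validate_args(ptrs, rows, cols, num_groups):
    """Mirror of the entry-point checks: non-null pointers, positive
    dimensions, cols % 64 == 0, and cols / 64 == num_groups."""
    if any(p is None for p in ptrs):
        return TRIT_ERR_NULL_PTR
    if rows <= 0 or cols <= 0:
        return TRIT_ERR_BAD_DIM
    if cols % 64 != 0 or cols // 64 != num_groups:
        return TRIT_ERR_BAD_GROUP
    return 0  # success: the library would proceed to launch

assert validate_args([1, 2, 3], 4096, 4096, 64) == 0
assert validate_args([1, None], 4096, 4096, 64) == TRIT_ERR_NULL_PTR
assert validate_args([1], 4096, 4000, 62) == TRIT_ERR_BAD_GROUP
```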
Known limitations
The educational kernels in trit_gemv.cu use a lane-0 scale-and-add reduction that idles 31 of 32 lanes per group. This is a deliberate readability tradeoff; the headline 7.8× number comes from the deferred-reduction k_d3_hardened kernel in trit_gemv_standalone.cu. See KNOWN_ISSUES.md for details and a planned API-cleanup item.
Citation
```bibtex
@article{stentzel2026ternaryptq,
  title  = {Balanced Ternary Post-Training Quantization for Large Language Models},
  author = {Stentzel, Eric},
  year   = {2026},
  note   = {Entrit Systems}
}
```
License
Apache-2.0.