# Known limitations — tritllm-codec

Items previously raised in code review have been addressed in the current
release. This document lists only the deliberate design tradeoffs that the
codec review surfaced; none of them are bugs.

## Design tradeoffs

### Scale codebook upper bound = `max(group_abs_maxes)`
**Where:** [`quantize_model_v2.py`, `trit_quantize_scales`, `log_max = np.max(...)`](quantize_model_v2.py#L107)

The 27-entry log-spaced scale codebook spans `[log_min, log_max]` where
`log_max` is taken to be the maximum group magnitude in the matrix. This is
intentional — an earlier 99.9th-percentile bound (commit prior to `0c16d24`)
clipped large-scale outlier groups and lost their resolution.

The downside: a single extreme-scale outlier group can stretch the log-spaced
range and reduce scale resolution for the bulk of normal-magnitude groups in
the same matrix.
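To make the effect concrete, here is a minimal sketch rather than the codec's actual code. It assumes `log_min` is the log of the smallest group magnitude (this document does not say how `log_min` is chosen), and `log_codebook` and `step_ratio` are hypothetical helper names.

```python
import numpy as np

def log_codebook(group_abs_maxes, n_entries=27):
    """Log-spaced codebook spanning [min, max] of the group magnitudes.

    Assumption: log_min is the log of the smallest group magnitude.
    """
    log_min = np.log(np.min(group_abs_maxes))
    log_max = np.log(np.max(group_abs_maxes))
    return np.exp(np.linspace(log_min, log_max, n_entries))

def step_ratio(codebook):
    """Multiplicative gap between adjacent codebook entries."""
    return float(codebook[1] / codebook[0])

# Bulk of groups within one decade, then the same bulk plus one extreme group.
bulk = np.geomspace(0.01, 0.1, 1024)
with_outlier = np.append(bulk, 10.0)

print(step_ratio(log_codebook(bulk)))          # roughly 1.09
print(step_ratio(log_codebook(with_outlier)))  # roughly 1.30
```

In this toy case one outlier group widens the gap between adjacent scales from about 9% to about 30%, and every normal-magnitude group pays for it.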

We have not seen this cause measurable quality regressions on Qwen2.5,
Llama-3.1, or Mistral-7B. If you observe unexpectedly high perplexity (PPL) on
a new model family with heavy-tailed scale distributions, this is the first
place to look.

We did not change this in the current release because changing it would alter
the bit-exact output of the codec and invalidate published paper numbers; a
future v3 may replace `np.max` with a soft-cap (e.g. `min(max, 4 * p99)`) that
is robust to single extreme outliers without giving up large-scale fidelity.
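A soft cap of that shape could look like the sketch below. It is not part of the current codec: `soft_capped_max` is a hypothetical name, and the factor of 4 and the 99th percentile simply mirror the `min(max, 4 * p99)` example above.

```python
import numpy as np

def soft_capped_max(group_abs_maxes, factor=4.0, pct=99.0):
    """Upper bound for the scale codebook: min(max, factor * p99).

    Hypothetical helper: the true maximum is used unless it exceeds
    `factor` times the `pct`-th percentile of the group magnitudes.
    """
    hard_max = np.max(group_abs_maxes)
    soft_cap = factor * np.percentile(group_abs_maxes, pct)
    return float(min(hard_max, soft_cap))

normal = np.full(999, 1.0)
print(soft_capped_max(np.append(normal, 3.0)))    # 3.0: large group kept intact
print(soft_capped_max(np.append(normal, 100.0)))  # 4.0: extreme outlier capped
```

A moderately large group still sets the bound exactly as `np.max` does today; only a group far beyond the 99th percentile gets clipped.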

### Scale candidate set is fixed at 4 percentiles
**Where:** [`quantize_model_v2.py`, `compute_best_scale_4cand`](quantize_model_v2.py#L75)

The MSE-best scale is selected from four fixed order statistics: indices
`[gs-6, gs-4, gs-2, gs-1]` of sorted `|w|`, where `gs` is the group size, so
the last candidate is the group maximum. This is a deliberate compute/quality
tradeoff (≈50× speedup over an exhaustive sweep, <1% PPL gap measured on
Qwen2.5-7B), not a bug. The function name and docstring now reflect this.
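The selection can be sketched as follows. This illustrates the idea and is not the body of `compute_best_scale_4cand`: the symmetric round-and-clip quantizer, the `n_levels` parameter, and the helper names are assumptions, and groups are assumed to hold at least six weights.

```python
import numpy as np

def quant_mse(w, scale, n_levels=1):
    """MSE of a symmetric round-and-clip quantizer (assumed, not the codec's).

    n_levels=1 gives ternary values {-scale, 0, +scale}.
    """
    q = np.clip(np.round(w / scale * n_levels), -n_levels, n_levels)
    return float(np.mean((w - q * scale / n_levels) ** 2))

def best_scale_4cand(w, n_levels=1):
    """Pick the MSE-best scale among four order statistics of sorted |w|."""
    abs_sorted = np.sort(np.abs(w))
    gs = abs_sorted.size  # group size; must be at least 6
    candidates = abs_sorted[[gs - 6, gs - 4, gs - 2, gs - 1]]

    best_scale, best_mse = 0.0, float(np.mean(w ** 2))  # fallback: all zeros
    for s in candidates:
        if s == 0.0:
            continue  # a zero scale cannot represent anything
        mse = quant_mse(w, s, n_levels)
        if mse < best_mse:
            best_scale, best_mse = float(s), mse
    return best_scale, best_mse

w = np.random.default_rng(0).normal(size=64)
scale, mse = best_scale_4cand(w)
print(scale, mse)
```

Only four quantize-and-measure passes run per group, which is where the speedup over an exhaustive sweep of candidate scales comes from.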