# Known limitations: tritllm-codec
Items previously raised in code review have been addressed in the current
release. This document lists only the deliberate design tradeoffs that the
codec review surfaced; none of them are bugs.
## Design tradeoffs
### Scale codebook upper bound = `max(group_abs_maxes)`
**Where:** [`quantize_model_v2.py`, `trit_quantize_scales`, `log_max = np.max(...)`](quantize_model_v2.py#L107)
The 27-entry log-spaced scale codebook spans `[log_min, log_max]` where
`log_max` is taken to be the maximum group magnitude in the matrix. This is
intentional — an earlier 99.9th-percentile bound (commit prior to `0c16d24`)
clipped large-scale outlier groups and lost their resolution.
The downside: a single extreme-scale outlier group can stretch the log-spaced
range and reduce scale resolution for the bulk of normal-magnitude groups in
the same matrix.
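The stretching effect can be illustrated with a minimal sketch. It is not the code in `quantize_model_v2.py`: `log_codebook` and `log_step` are hypothetical helpers, and the sketch assumes that `log_min` is the smallest group magnitude and that all magnitudes are strictly positive.

```python
import numpy as np

def log_codebook(group_abs_maxes, n_entries=27, upper="max"):
    """Hypothetical helper: log-spaced scale codebook over [log_min, log_max].

    Assumes log_min is the smallest group magnitude and all inputs are > 0.
    """
    logs = np.log(np.asarray(group_abs_maxes, dtype=np.float64))
    log_min = logs.min()
    if upper == "max":
        log_max = logs.max()                 # bound described in this section
    else:
        log_max = np.percentile(logs, 99.9)  # earlier percentile-style bound
    return np.exp(np.linspace(log_min, log_max, n_entries))

def log_step(codebook):
    """Spacing between adjacent codebook entries, measured in log space."""
    return float(np.log(codebook[1]) - np.log(codebook[0]))

rng = np.random.default_rng(0)
bulk = rng.uniform(0.01, 0.1, size=4096)  # synthetic normal-magnitude groups
with_outlier = np.append(bulk, 10.0)      # one extreme-scale group

step_bulk = log_step(log_codebook(bulk))
step_outlier = log_step(log_codebook(with_outlier))
print(f"log step without outlier: {step_bulk:.4f}")
print(f"log step with outlier:    {step_outlier:.4f}")
```

With these synthetic magnitudes, the single outlier widens the log step between adjacent entries for every group in the matrix, while the percentile-style bound (`upper="p999"`) would leave the outlier group above the top of the codebook.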
We do not see this cause measurable quality regressions on Qwen2.5, Llama-3.1,
or Mistral-7B. If you observe unexpectedly high PPL on a new model family with
heavy-tailed scale distributions, this is the first place to look.
We did not change this in the current release because changing it would alter
the bit-exact output of the codec and invalidate published paper numbers; a
future v3 may replace `np.max` with a soft-cap (e.g. `min(max, 4 * p99)`) that
is robust to single extreme outliers without giving up large-scale fidelity.
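A soft cap of the form `min(max, 4 * p99)` could look roughly like the sketch below. This illustrates the idea only and is not a committed v3 design; the helper name `soft_capped_upper` and the `factor` parameter are hypothetical.

```python
import numpy as np

def soft_capped_upper(group_abs_maxes, factor=4.0):
    """Hypothetical upper bound: min(max, factor * 99th percentile)."""
    m = np.asarray(group_abs_maxes, dtype=np.float64)
    return float(min(m.max(), factor * np.percentile(m, 99)))

rng = np.random.default_rng(0)
bulk = rng.uniform(0.01, 0.1, size=4096)  # synthetic group magnitudes
with_outlier = np.append(bulk, 10.0)      # one extreme-scale group

# Without an outlier the cap is inactive and the bound equals the true max.
print(soft_capped_upper(bulk), bulk.max())
# With the outlier the bound is held to factor * p99 instead of 10.0.
print(soft_capped_upper(with_outlier))
```

In this sketch the bound matches `np.max` whenever the largest group lies within `factor` times the 99th percentile, so matrices without extreme outliers would keep their current upper bound.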
### Scale candidate set is fixed at 4 percentiles
**Where:** [`quantize_model_v2.py`, `compute_best_scale_4cand`](quantize_model_v2.py#L75)
The MSE-best scale is selected from four fixed order statistics — indices
`[gs-6, gs-4, gs-2, gs-1]` of sorted `|w|`. This is a deliberate compute /
quality tradeoff (≈50× speedup over an exhaustive sweep, <1% PPL gap measured
on Qwen2.5-7B), not a bug. The function name and docstring now reflect this.
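The selection step can be sketched as follows. This is not the body of `compute_best_scale_4cand`: it assumes `gs` is the number of weights in a group, and it uses a symmetric uniform quantizer with a hypothetical `half_levels` parameter as a stand-in for the codec's actual weight quantizer.

```python
import numpy as np

def best_scale_4cand(w, half_levels=1):
    """Pick the MSE-best scale among four order statistics of |w|.

    Assumptions (not taken from the source): the group has at least 6
    weights, and weights are quantized to integer codes in
    [-half_levels, half_levels] times scale / half_levels.
    """
    w = np.asarray(w, dtype=np.float64)
    gs = w.size
    sorted_abs = np.sort(np.abs(w))
    # Fallback: scale 0 reconstructs every weight as 0.
    best_scale, best_mse = 0.0, float(np.mean(w ** 2))
    for idx in (gs - 6, gs - 4, gs - 2, gs - 1):
        scale = sorted_abs[idx]
        if scale == 0.0:
            continue  # a zero scale cannot improve on the fallback
        codes = np.clip(np.round(w / scale * half_levels),
                        -half_levels, half_levels)
        recon = codes * scale / half_levels
        mse = float(np.mean((w - recon) ** 2))
        if mse < best_mse:
            best_scale, best_mse = float(scale), mse
    return best_scale, best_mse

rng = np.random.default_rng(0)
group = rng.normal(size=64)  # synthetic weight group
scale, mse = best_scale_4cand(group)
print(scale, mse)
```

Each group costs four quantize-and-measure passes regardless of group size, which is where the saving over an exhaustive sweep comes from.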