# Known limitations: tritllm-codec

Items previously raised in code review have been addressed in the current
release. This document only lists deliberate design tradeoffs that the codec
review surfaced, not bugs.

## Design tradeoffs

### Scale codebook upper bound = `max(group_abs_maxes)`

**Where:** [`quantize_model_v2.py`, `trit_quantize_scales`, `log_max = np.max(...)`](quantize_model_v2.py#L107)

The 27-entry log-spaced scale codebook spans `[log_min, log_max]`, where
`log_max` is taken to be the maximum group magnitude in the matrix. This is
intentional: an earlier 99.9th-percentile bound (commit prior to `0c16d24`)
clipped large-scale outlier groups and lost their resolution.

The downside: a single extreme-scale outlier group can stretch the log-spaced
range and reduce scale resolution for the bulk of normal-magnitude groups in
the same matrix.
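
The stretching effect can be illustrated with a minimal sketch. It is not the code in `quantize_model_v2.py`: the helper names are hypothetical, and it assumes `log_min` is the log of the smallest positive group maximum, which this document does not state. On a log-spaced grid the ratio between adjacent entries is constant, so a wider range directly means coarser steps for every group.

```python
import numpy as np

def log_spaced_codebook(group_abs_maxes, n_entries=27):
    """Log-spaced codebook over [min, max] of the positive group maxima.

    Assumption: log_min comes from the smallest positive group maximum.
    """
    positive = group_abs_maxes[group_abs_maxes > 0]
    log_min = np.log(np.min(positive))
    log_max = np.log(np.max(positive))
    return np.exp(np.linspace(log_min, log_max, n_entries))

def step_ratio(codebook):
    """Multiplicative gap between adjacent entries (constant on a log grid)."""
    return float(codebook[1] / codebook[0])

rng = np.random.default_rng(0)
bulk = rng.uniform(0.01, 0.1, size=1024)   # synthetic "normal" group maxima
with_outlier = np.append(bulk, 10.0)       # one extreme-scale group

ratio_bulk = step_ratio(log_spaced_codebook(bulk))
ratio_outlier = step_ratio(log_spaced_codebook(with_outlier))
print(f"step ratio without outlier: {ratio_bulk:.4f}")
print(f"step ratio with outlier:    {ratio_outlier:.4f}")
```

A larger step ratio means each bulk group is matched to a codebook entry that can sit further from its true scale. The data here is synthetic and only shows the direction of the effect, not its size on a real model.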

We do not see this cause measurable quality regressions on Qwen2.5, Llama-3.1,
or Mistral-7B. If you observe unexpectedly high PPL on a new model family with
heavy-tailed scale distributions, this is the first place to look.

We did not change this in the current release because changing it would alter
the bit-exact output of the codec and invalidate published paper numbers; a
future v3 may replace `np.max` with a soft-cap (e.g. `min(max, 4 * p99)`) that
is robust to single extreme outliers without giving up large-scale fidelity.
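
As a rough illustration of the soft cap mentioned above, the sketch below computes `min(max, 4 * p99)` over the group maxima. The function name and parameters are hypothetical, and nothing here reflects a committed v3 design; it only shows how the cap behaves with and without a single extreme group.

```python
import numpy as np

def soft_capped_log_max(group_abs_maxes, cap_factor=4.0, percentile=99.0):
    """Log of min(max, cap_factor * p99) over the group maxima (hypothetical helper)."""
    hard_max = float(np.max(group_abs_maxes))
    p99 = float(np.percentile(group_abs_maxes, percentile))
    return float(np.log(min(hard_max, cap_factor * p99)))

rng = np.random.default_rng(0)
bulk = rng.uniform(0.01, 0.1, size=1000)   # synthetic group maxima
with_outlier = np.append(bulk, 10.0)       # one extreme-scale group

# Without an outlier the true maximum is below 4 * p99, so it is kept as-is.
print(soft_capped_log_max(bulk), np.log(np.max(bulk)))
# With the outlier the cap applies, and the bound stays near the bulk.
print(soft_capped_log_max(with_outlier), np.log(10.0))
```

Under this sketch, any group whose scale exceeds the cap would map to the top codebook entry, so the cap factor trades outlier fidelity against resolution for the bulk.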

### Scale candidate set is fixed at 4 percentiles

**Where:** [`quantize_model_v2.py`, `compute_best_scale_4cand`](quantize_model_v2.py#L75)

The MSE-best scale is selected from four fixed order statistics: indices
`[gs-6, gs-4, gs-2, gs-1]` of sorted `|w|`. This is a deliberate compute/quality
tradeoff (≈50× speedup over an exhaustive sweep, <1% PPL gap measured
on Qwen2.5-7B), not a bug. The function name and docstring now reflect this.
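
The selection can be sketched as follows. This is not the body of `compute_best_scale_4cand`: it assumes `gs` is the group size and that values are rounded to the nearest integer level in `[-levels, levels]` (with `levels=1` giving a ternary grid). Both the function name and the `levels` parameter are hypothetical, since the quantizer itself is not described here.

```python
import numpy as np

def best_scale_4cand_sketch(w, levels=1):
    """Pick the MSE-best scale among four order statistics of |w|.

    Assumptions: `w` is one group of weights (gs = w.size >= 6) and the
    quantizer rounds w / step to integers in [-levels, levels].
    """
    gs = w.size
    if gs < 6:
        raise ValueError("group must contain at least 6 weights")
    sorted_abs = np.sort(np.abs(w))
    candidates = sorted_abs[[gs - 6, gs - 4, gs - 2, gs - 1]]

    # Baseline: scale 0 reconstructs every weight as 0.
    best_scale, best_mse = 0.0, float(np.mean(w ** 2))
    for s in candidates:
        if s <= 0:
            continue  # a zero candidate cannot define a quantization step
        step = s / levels
        q = np.clip(np.round(w / step), -levels, levels)
        mse = float(np.mean((w - q * step) ** 2))
        if mse < best_mse:
            best_scale, best_mse = float(s), mse
    return best_scale, best_mse

rng = np.random.default_rng(0)
group = rng.standard_normal(64)
scale, mse = best_scale_4cand_sketch(group)
print(f"chosen scale: {scale:.4f}, reconstruction MSE: {mse:.4f}")
```

Only four reconstructions are evaluated per group, compared with one per distinct `|w|` value in an exhaustive sweep, which is where the speedup comes from.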