fix: address codex review BLOCKERs and SHOULD-FIXes; update KNOWN_ISSUES

6c2b514 verified 19 days ago

1.86 kB

	# Known limitations — tritllm-codec

	Items previously raised in code review have been addressed in the current
	release. This document only lists deliberate design tradeoffs that the codec
	review surfaced, not bugs.

	## Design tradeoffs

	### Scale codebook upper bound = `max(group_abs_maxes)`
	Where: [`quantize_model_v2.py`, `trit_quantize_scales`, `log_max = np.max(...)`](quantize_model_v2.py#L107)

	The 27-entry log-spaced scale codebook spans `[log_min, log_max]` where
	`log_max` is taken to be the maximum group magnitude in the matrix. This is
	intentional — an earlier 99.9th-percentile bound (commit prior to `0c16d24`)
	clipped large-scale outlier groups and lost their resolution.

	The downside: a single extreme-scale outlier group can stretch the log-spaced
	range and reduce scale resolution for the bulk of normal-magnitude groups in
	the same matrix.

	We do not see this cause measurable quality regressions on Qwen2.5, Llama-3.1,
	or Mistral-7B. If you observe unexpectedly high PPL on a new model family with
	heavy-tailed scale distributions, this is the first place to look.

	We did not change this in the current release because changing it would alter
	the bit-exact output of the codec and invalidate published paper numbers; a
	future v3 may replace `np.max` with a soft-cap (e.g. `min(max, 4 * p99)`) that
	is robust to single extreme outliers without giving up large-scale fidelity.

	### Scale candidate set is fixed at 4 percentiles
	Where: [`quantize_model_v2.py`, `compute_best_scale_4cand`](quantize_model_v2.py#L75)

	The MSE-best scale is selected from four fixed order statistics — indices
	`[gs-6, gs-4, gs-2, gs-1]` of sorted `\|w\|`. This is a deliberate compute /
	quality tradeoff (≈50× speedup over an exhaustive sweep, <1% PPL gap measured
	on Qwen2.5-7B), not a bug. The function name and docstring now reflect this.