fix(quantization): re-quantize with reduce_range=True

The original v1 quantization omitted reduce_range=True,
producing full-range INT8 weights (-128..127). On pre-VNNI x86
CPUs (AMD Zen 3 / EPYC 7543, older Intel), ONNX Runtime's (ORT)
AVX2-only INT8 kernel has an int16 accumulator that overflows
on full-range weights, producing degenerate embeddings (all
inputs collapse to nearly the same vector).
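
The overflow can be checked with a little arithmetic. This is a rough sketch, not the ORT kernel itself: it assumes unsigned 8-bit activations (0..255) and that the AVX2 path adds two adjacent u8*s8 products into one int16 lane, as the vpmaddubsw instruction does. The helper names are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def worst_case_pair_sum(max_activation: int, max_abs_weight: int) -> int:
    """Largest magnitude of two adjacent activation*weight products summed."""
    return 2 * max_activation * max_abs_weight


def fits_int16(magnitude: int) -> bool:
    """True if both +magnitude and -magnitude are representable in int16."""
    return INT16_MIN <= -magnitude and magnitude <= INT16_MAX


# Assumption: activations are uint8, so the largest activation is 255.
full_range = worst_case_pair_sum(255, 128)    # weights in -128..127
reduced_range = worst_case_pair_sum(255, 64)  # weights limited to 7 bits

print(full_range, fits_int16(full_range))
print(reduced_range, fits_int16(reduced_range))
```

Under these assumptions the full-range worst case (65280) is about twice what an int16 lane can hold, while the 7-bit worst case (32640) fits.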

reduce_range=True clamps weights to [-64, 63], leaving 1 bit
of accumulator headroom. VNNI CPUs (Intel Cascade Lake+, AMD
Zen 4+) and Apple Silicon are unaffected: they use different
kernel paths that handle full-range INT8 natively.
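
For reference, a minimal sketch of how such a re-quantization might be invoked. It assumes dynamic quantization through onnxruntime.quantization.quantize_dynamic; this commit does not record which quantization mode or script was used. The function name, the per_channel setting, and the file paths in the usage note are hypothetical.

```python
from pathlib import Path

# reduce_range=True is the fix described above; per_channel=False is an
# assumption (it is also the library default).
QUANT_OPTIONS = {"reduce_range": True, "per_channel": False}


def requantize(fp32_path: str, int8_path: str) -> Path:
    """Quantize an FP32 ONNX model to INT8 weights with a reduced range."""
    # Imported lazily so this module loads without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        fp32_path,
        int8_path,
        weight_type=QuantType.QInt8,  # signed INT8 weights, as in v1
        **QUANT_OPTIONS,
    )
    return Path(int8_path)
```

Usage would look like requantize("model_fp32.onnx", "onnx/model.onnx"), where the input file name is a placeholder.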

New SHA256: 4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db
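
One way to confirm that a downloaded copy matches the new hash, sketched with the standard library only. The path assumes the repository layout shown in this commit, and the function name is hypothetical.

```python
import hashlib
from pathlib import Path

EXPECTED_SHA256 = (
    "4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db"
)


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a ~139 MB model is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    model_path = Path("onnx/model.onnx")  # assumed local checkout path
    if model_path.exists():
        actual = sha256_of_file(str(model_path))
        print("match" if actual == EXPECTED_SHA256 else f"mismatch: {actual}")
```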

onnx/model.onnx +1 -1

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d44183a39a3e27bc2ef80aebeba48e8065556f2911c12211ab9f6ed94f2f26ee
+oid sha256:4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db
 size 138619279