Commit 95c2a74 by mrsladoje (verified)
1 parent: 5588dcf

fix(quantization): re-quantize with reduce_range=True

Original v1 quantization omitted reduce_range=True, producing
full-range INT8 weights (-128..127). On pre-VNNI x86 CPUs
(AMD Zen 3 / EPYC 7543, older Intel), ORT's AVX2-only INT8
kernel has an int16 accumulator that overflows on full-range
weights, producing degenerate embeddings (all inputs collapse
to a near-constant manifold).

reduce_range=True quantizes weights to 7 bits ([-64, 63]),
leaving 1 bit of accumulator headroom. VNNI CPUs (Intel
Cascade Lake+, AMD Zen 4+) and Apple Silicon are unaffected:
they use different kernel paths that handle full-range INT8
natively.
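
The headroom argument can be checked with a little arithmetic. The sketch below is a simplified model, not ORT's actual kernel: it assumes unsigned 8-bit activations (worst case 255) multiplied by signed 8-bit weights, with two adjacent products summed into a saturating int16 lane. The commit message only says the int16 accumulator overflows, so the saturation behavior and the names `saturate_int16` and `pair_sum` are illustrative assumptions.

```python
# Simplified model (assumption): two adjacent u8 x s8 products are summed
# into one signed 16-bit lane, and the result saturates at the int16 limits.
INT16_MIN, INT16_MAX = -32768, 32767
U8_MAX = 255  # assumed worst-case activation value


def saturate_int16(x: int) -> int:
    """Clamp an exact integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def pair_sum(a0: int, w0: int, a1: int, w1: int) -> tuple[int, int]:
    """Return (exact, stored) for a0*w0 + a1*w1 held in an int16 lane."""
    exact = a0 * w0 + a1 * w1
    return exact, saturate_int16(exact)


# Full-range INT8 weights: the exact sum does not fit in int16.
full_hi = pair_sum(U8_MAX, 127, U8_MAX, 127)
full_lo = pair_sum(U8_MAX, -128, U8_MAX, -128)

# 7-bit weights (reduce_range=True): the exact sum always fits.
reduced_hi = pair_sum(U8_MAX, 63, U8_MAX, 63)
reduced_lo = pair_sum(U8_MAX, -64, U8_MAX, -64)

for name, (exact, stored) in [
    ("full +127", full_hi),
    ("full -128", full_lo),
    ("reduced +63", reduced_hi),
    ("reduced -64", reduced_lo),
]:
    status = "ok" if exact == stored else "LOST"
    print(f"{name:12s} exact={exact:7d} stored={stored:7d} {status}")
```

Under these assumptions, the full-range worst case (2 x 255 x 127 = 64770) exceeds 32767, while the 7-bit worst cases (32130 and -32640) stay inside the int16 range.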

New SHA256: 4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db
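
A downloaded copy of the model can be compared against the digest above. This is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib

# Digest stated in the commit message for the re-quantized model.
EXPECTED_SHA256 = "4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large model is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA-256 equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("onnx/model.onnx")` would return `True` only for the re-quantized file, assuming the model was saved at that relative path.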

Files changed (1)
  1. onnx/model.onnx +1 -1
onnx/model.onnx CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:d44183a39a3e27bc2ef80aebeba48e8065556f2911c12211ab9f6ed94f2f26ee
+ oid sha256:4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db
  size 138619279