Optimize ONNX models
#1
by Xenova (HF Staff) - opened
This PR does the following:
- Optimize ONNX exports: you can inspect the graph to see the reduced complexity, and the file size also drops (model.onnx goes from 145492 bytes to 140810 bytes).
- Optimize quantizations: for example, q4 goes from 1217650688 bytes to 850059264 bytes (-30%). We achieve this by quantizing embed_tokens to use 4-bit gathers.
- Add q4f16 (model_q4f16.onnx) and q8 (model_quantized.onnx) quantizations
- Minify tokenizer.json
- Fix transformers.js config values
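The tokenizer.json minification above can be reproduced with nothing but the standard library: re-serialize the JSON without indentation or inter-token whitespace. This is a minimal sketch; the file paths and function name are illustrative, not from the PR.

```python
import json

def minify_json(src: str, dst: str) -> None:
    """Re-serialize a JSON file without whitespace to shrink it on disk."""
    with open(src, encoding="utf-8") as f:
        data = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        # separators=(",", ":") drops the default spaces after , and :
        # ensure_ascii=False keeps multi-byte tokens as raw UTF-8 (smaller than \uXXXX escapes)
        json.dump(data, f, ensure_ascii=False, separators=(",", ":"))

# Example: minify_json("tokenizer.json", "tokenizer.min.json")
```

The content is unchanged byte-for-byte at the JSON level, so any tokenizer loader that parses the file is unaffected.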
My demo reaches over 180 tokens/second with these optimizations, up from ~120 tokens/second before.
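As a sanity check on the numbers quoted in this PR, the byte counts and throughput figures above work out as follows (values copied from the description):

```python
# Sizes in bytes, from the PR description
graph_before, graph_after = 145492, 140810
q4_before, q4_after = 1217650688, 850059264

# q4 shrinks by roughly 30% thanks to the 4-bit gather embeddings
q4_reduction = 1 - q4_after / q4_before

# Throughput in tokens/second, before and after the optimizations
speedup = 180 / 120

print(f"model.onnx saves {graph_before - graph_after} bytes")
print(f"q4 reduction: {q4_reduction:.1%}")   # ~30%
print(f"demo speedup: {speedup:.1f}x")       # 1.5x
```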
Xenova changed pull request title from "Upload folder using huggingface_hub" to "Optimize ONNX models"
ykhrustalev changed pull request status to merged