Optimize ONNX models

#1 opened by Xenova (HF Staff)

This PR does the following:

  1. Optimize the ONNX exports (you can inspect the graph to see the reduced complexity, and the file size also shrinks: model.onnx goes from 145,492 bytes to 140,810 bytes).
  2. Optimize the quantizations: for example, q4 goes from 1,217,650,688 bytes to 850,059,264 bytes (-30%). We achieve this by quantizing embed_tokens to use 4-bit gathers.
  3. Add q4f16 (model_q4f16.onnx) and q8 (model_quantized.onnx) quantizations.
  4. Minify tokenizer.json.
  5. Fix transformers.js config values.
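The idea behind item 2 can be sketched in NumPy: store the embedding table as packed 4-bit codes plus per-row scales, and dequantize only the rows gathered for the current token ids. This is an illustrative re-implementation under assumed details (per-row symmetric scaling, even embedding dimension), not the actual quantization tooling used in the PR:

```python
import numpy as np

def quantize_embeddings_4bit(weights: np.ndarray):
    """Per-row symmetric 4-bit quantization of an embedding table.

    Returns packed uint8 codes (two 4-bit values per byte) and per-row
    float scales. Assumes an even embedding dimension.
    """
    scales = np.maximum(np.abs(weights).max(axis=1, keepdims=True) / 7.0, 1e-8)
    q = (np.clip(np.round(weights / scales), -7, 7) + 8).astype(np.uint8)  # codes in [1, 15]
    packed = (q[:, ::2] << 4) | q[:, 1::2]  # two codes per byte -> 8x smaller than fp32
    return packed, scales

def gather_dequant(packed: np.ndarray, scales: np.ndarray, ids: np.ndarray) -> np.ndarray:
    """Gather rows by token id, then dequantize back to float."""
    rows = packed[ids]
    q = np.empty((rows.shape[0], rows.shape[1] * 2), dtype=np.int8)
    q[:, ::2] = (rows >> 4).astype(np.int8) - 8   # high nibble
    q[:, 1::2] = (rows & 0x0F).astype(np.int8) - 8  # low nibble
    return q * scales[ids]

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 8)).astype(np.float32)  # toy embedding table
packed, scales = quantize_embeddings_4bit(W)
approx = gather_dequant(packed, scales, np.array([2, 5]))
```

Because only the gathered rows are ever dequantized, the full-precision table never needs to be materialized, which is where the size savings come from.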

My demo reaches over 180 tokens/sec with these optimizations, up from ~120 tokens/sec previously.
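For reference, a throughput number like this can be measured with a simple timing harness around the decode loop. A minimal sketch, where `fake_generate` is a hypothetical stand-in for the real ONNX decode loop:

```python
import time

def tokens_per_second(generate, n_tokens: int) -> float:
    """Time a token-generation callable and return decode throughput."""
    start = time.perf_counter()
    generate(n_tokens)
    return n_tokens / (time.perf_counter() - start)

# Stub standing in for the real model's decode loop (assumption: fixed per-token cost)
def fake_generate(n: int) -> None:
    for _ in range(n):
        time.sleep(0.0005)  # pretend each token takes ~0.5 ms

tps = tokens_per_second(fake_generate, 200)
```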

Xenova changed pull request title from "Upload folder using huggingface_hub" to "Optimize ONNX models"
Liquid AI org

@Xenova, thank you for the change!

ykhrustalev changed pull request status to merged
