Optimize ONNX models
#1
by Xenova (HF Staff) - opened
This PR does the following:
- Optimize ONNX exports: you can inspect the graph to see the reduced complexity, and the file size also drops (model.onnx goes from 145492 bytes to 140810 bytes).
- Optimize quantizations: for example, q4 goes from 1217650688 bytes to 850059264 bytes (-30%). We achieve this by quantizing embed_tokens to use 4-bit gathers.
- Add q4f16 (model_q4f16.onnx) and q8 (model_quantized.onnx) quantizations
- Minify tokenizer.json
- Fix transformers.js config values
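The tokenizer.json minification above can be reproduced with nothing but the standard library: re-serialize the JSON without indentation or inter-token whitespace. This is a minimal sketch; the file paths and function name are illustrative, not from the PR.

```python
import json

def minify_json(src: str, dst: str) -> None:
    """Re-serialize a JSON file without whitespace to shrink it on disk."""
    with open(src, encoding="utf-8") as f:
        data = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        # separators=(",", ":") drops the default spaces after , and :
        # ensure_ascii=False keeps multi-byte tokens as raw UTF-8 (smaller than \uXXXX escapes)
        json.dump(data, f, ensure_ascii=False, separators=(",", ":"))

# Example: minify_json("tokenizer.json", "tokenizer.min.json")
```

The content is unchanged byte-for-byte at the JSON level, so any tokenizer loader that parses the file is unaffected.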
My demo reaches over 180 tokens/second with these optimizations, up from ~120 tokens/second before.
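As a sanity check on the numbers quoted in this PR, the byte counts and throughput figures above work out as follows (values copied from the description):

```python
# Sizes in bytes, from the PR description
graph_before, graph_after = 145492, 140810
q4_before, q4_after = 1217650688, 850059264

# q4 shrinks by roughly 30% thanks to the 4-bit gather embeddings
q4_reduction = 1 - q4_after / q4_before

# Throughput in tokens/second, before and after the optimizations
speedup = 180 / 120

print(f"model.onnx saves {graph_before - graph_after} bytes")
print(f"q4 reduction: {q4_reduction:.1%}")   # ~30%
print(f"demo speedup: {speedup:.1f}x")       # 1.5x
```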
Xenova changed pull request title from "Upload folder using huggingface_hub" to "Optimize ONNX models"
ykhrustalev changed pull request status to merged