QAT?

by Downtown-Case

Has Arcee considered a highly quantized (2-3 bit?) QAT run, as Baidu did for ERNIE 4.5, to make tight quantization more usable?

https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle

One could even do “targeted” QAT, quantizing just the MoE FFN layers and leaving the other layers at ~8-bit.
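For concreteness, here is a minimal PyTorch sketch of what such a targeted setup could look like: 2-bit fake quantization with a straight-through estimator, applied only to Linear layers whose qualified names mark them as MoE experts. The `experts` name filter and the per-tensor scale are assumptions made for the sketch, not Trinity's actual module layout or Baidu's recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FakeQuant2Bit(torch.autograd.Function):
    """Symmetric 2-bit fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, w):
        # Per-tensor scale (sketch only; real recipes use group-wise scales).
        scale = w.abs().amax() / 2.0 + 1e-8
        # The int2 grid has 4 levels: {-2, -1, 0, 1}.
        q = torch.clamp(torch.round(w / scale), -2, 1)
        return q * scale  # dequantized weights seen by the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: treat quantization as the identity.
        return grad_out


class QATLinear(nn.Module):
    """Wraps an nn.Linear so its forward pass sees fake-quantized weights."""

    def __init__(self, inner: nn.Linear):
        super().__init__()
        self.inner = inner

    def forward(self, x):
        w_q = FakeQuant2Bit.apply(self.inner.weight)
        return F.linear(x, w_q, self.inner.bias)


def apply_targeted_qat(model: nn.Module, marker: str = "experts") -> nn.Module:
    """Wrap only Linear layers whose qualified name contains `marker`
    (assumed here to identify MoE expert FFNs); all other layers stay in
    full precision and could later be quantized to ~8-bit post hoc."""
    # Collect targets first so we don't mutate the module tree mid-iteration.
    to_wrap = []
    for name, module in model.named_modules():
        for child_name, child in module.named_children():
            full_name = f"{name}.{child_name}" if name else child_name
            if isinstance(child, nn.Linear) and marker in full_name:
                to_wrap.append((module, child_name, child))
    for parent, child_name, child in to_wrap:
        setattr(parent, child_name, QATLinear(child))
    return model
```

During fine-tuning the optimizer still updates the full-precision weights, but the forward pass only ever sees their 2-bit projection, so the model learns weights that survive the rounding; at export time the same grid can be materialized as real 2-bit tensors.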

I understand commercial deployments are a priority, but:

  • A more cheaply deployable 400B would put Trinity Large in a unique niche.

  • At the risk of sounding cynical, a major QAT release would let you capitalize on the current TurboQuant hype.

  • It’d be cheap compared to the finetuning cost.

  • It’d be local-inference friendly: 400B parameters at ~2.5 bits each come to roughly 125 GB of weights, opening up inference on 128GB RAM machines.
