QAT?
by Downtown-Case - opened
Has Arcee considered a highly quantized (2-3 bit?) QAT run, like Baidu did for ERNIE 4.5, to make tight quantization more usable?
https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle
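For context, the core of a QAT run is just fake-quantization in the forward pass with a straight-through estimator in the backward pass. A minimal PyTorch sketch (the symmetric per-row scaling and the 3-bit default are illustrative assumptions, not Baidu's actual recipe):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Symmetric fake-quantization with a straight-through estimator (STE):
    the forward pass snaps weights to a low-bit grid, while the backward
    pass lets gradients through as if quantization were the identity."""

    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1                              # e.g. 3 for 3-bit
        scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                                   # STE: pass-through

def qat_linear(x, weight, bias=None, bits=3):
    """Linear layer whose weights see quantization noise during training."""
    return torch.nn.functional.linear(x, FakeQuant.apply(weight, bits), bias)
```

The optimizer still updates the full-precision weights; only the forward pass sees the low-bit grid, so the model learns to tolerate it.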
One could even do “targeted” QAT on just the MoE FFN layers and leave the other layers at ~8-bit.
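A sketch of the targeting itself, assuming expert FFNs are identifiable by module name (the `experts` substring is a guess at the checkpoint layout, not Trinity's actual naming):

```python
import torch

def plan_bit_widths(model, expert_pattern="experts", expert_bits=2, default_bits=8):
    """Assign a target bit-width per Linear layer: aggressive low-bit for
    MoE expert FFNs, ~8-bit for attention and everything else."""
    return {
        name: expert_bits if expert_pattern in name else default_bits
        for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear)
    }
```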
I understand commercial deployments are a priority, but:
- A more cheaply deployable 400B would put Trinity Large in a unique niche.
- At the risk of sounding cynical, you could ride the current TurboQuant hype with a major QAT release.
- It’d be cheap compared to the finetuning cost.
- It’d be local-inference friendly, opening up inference on 128GB RAM machines (rough numbers below).
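Back-of-the-envelope on that last point, assuming ~400B total parameters and ignoring KV cache and runtime overhead:

```python
params = 400e9  # assumed total parameter count for a "400B" model
for bits in (8, 4, 3, 2):
    print(f"{bits}-bit weights: ~{params * bits / 8 / 1e9:.0f} GB")
# 8-bit: ~400 GB   4-bit: ~200 GB   3-bit: ~150 GB   2-bit: ~100 GB
```

Only around 2-3 bits for the bulk of the weights squeezes under 128 GB, which is exactly the targeted-QAT scenario above.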