Trained on ternary bits?

#3
by LLMToaster - opened

Was it trained with only 3 integer values (-1, 0, +1) from the start? Or is it a quantized version of a full model, i.e. was a full-precision model compressed into this? If it's compressed from a full model, i.e. not trained from scratch with ternary weights, doesn't that affect the quality of the responses? 😕

Also, why is it 16-bit on Hugging Face when downloaded? Doesn't that hurt the quality and speed of generation?

I believe what you are looking for is this: tiiuae/Falcon3-10B-Instruct-1.58bit-GGUF
I haven't tested either of these quantized models, but I think this one was trained that way and is supposed to be better than its GGUF counterpart. @ybelkada right?

The full model is, at least for text-to-text tasks, as good as gpt-4o-mini. Try the full one if you can.

It's 16-bit because Microsoft's BitNet packs the 16-bit weights down into ternary values, so you still get the CUDA speed-up.
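To make the packing idea concrete, here is a minimal sketch (an illustration only, not bitnet.cpp's actual kernel; the function name and 2-bit layout are assumptions) of how ternary weights can be stored at 2 bits each, four per byte:

```python
# Minimal sketch: packing ternary weights {-1, 0, +1} into 2-bit codes,
# four weights per byte. Illustrative only -- not bitnet.cpp's real layout.
import numpy as np

def pack_ternary(w: np.ndarray) -> np.ndarray:
    """Map -1/0/+1 to the 2-bit codes 0/1/2 and pack four codes per byte."""
    codes = (w + 1).astype(np.uint8)   # -1 -> 0, 0 -> 1, +1 -> 2
    codes = codes.reshape(-1, 4)       # four ternary values per output byte
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6))

w = np.random.choice([-1, 0, 1], size=16).astype(np.int8)
packed = pack_ternary(w)
print(f"{w.nbytes} bytes of int8 weights -> {packed.nbytes} packed bytes")
```

The name "1.58-bit" comes from the information content of a ternary weight: log2(3) ≈ 1.58 bits, even though a simple packing like this spends 2 bits per weight.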

Technology Innovation Institute org

We recently released checkpoints trained directly in BitNet format: https://falcon-lm.github.io/blog/falcon-edge/ (HF version: https://huggingface.co/blog/tiiuae/falcon-edge) - check out the blog post for more details.
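For anyone wondering what "trained directly in BitNet format" means in practice: the forward pass ternarizes the latent full-precision weights on the fly, and gradients flow back to those latent weights. Below is a rough sketch of the absmean ternarization step described in the BitNet b1.58 paper (illustrative only, not the actual training code):

```python
# Absmean ternarization from the BitNet b1.58 paper (illustrative sketch).
# In quantization-aware training, the forward pass uses w_q * gamma while
# gradients update the latent float weights (straight-through estimator).
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-5):
    gamma = np.abs(w).mean() + eps               # per-tensor absmean scale
    w_q = np.clip(np.round(w / gamma), -1, 1)    # weights in {-1, 0, +1}
    return w_q, gamma                            # dequantize as w_q * gamma

w = np.random.randn(4, 4).astype(np.float32)
w_q, gamma = ternarize(w)
print(w_q)
```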

Thank you, I always look forward to models from TIIUAE.


I love the fact that a small model (file size below 1 GB) is capable of holding its own against models five times bigger. However, BitNet makes the model really slow on CPUs. The real reason to make models small, I think, is to enable edge computing; if these models are slow on a laptop, how can they work on mobile or SBCs? Are there plans to enable faster inference? (Currently I get less than a token per second on an i7 7th gen.) Small but strong is good, but it also needs to be fairly fast.

Technology Innovation Institute org

Thank you for your comment @supercharge19 !
Just to understand better, have you used bitnet.cpp (https://github.com/microsoft/BitNet)? The authors there claim a big speedup compared to other models on CPU (below is a result on an i7).

[image: bitnet.cpp CPU inference speedup benchmark on an i7]

If you used HF transformers, it is expected to be slow. If you use bitnet.cpp you should see a nice speedup (see for example this comment); you can use the GGUF files exposed in the HF collection.

Thank you for responding; the comment shows good gains. Now I need to know how to run the GGUF file (yes, I tried with Hugging Face transformers).

Technology Innovation Institute org

Thank you @supercharge19 ! You can have a look at this section: https://huggingface.co/tiiuae/Falcon-E-3B-Instruct-GGUF#bitnet on how to run the GGUF files

New results:

```
prompt eval time =   578.80 ms /  14 tokens (41.34 ms per token, 24.19 tokens per second)
       eval time = 18921.55 ms / 228 tokens (82.99 ms per token, 12.05 tokens per second)
```

Still, I think it is surprisingly slow for such a small model. Anyway, the instructions you mentioned are not quite right; it's better to follow the instructions in the BitNet repo:

```bash
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Create the environment however you like; I use pyenv, but the site says
# to use conda, so let's follow the site's instructions:
# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt

# Get the model:
huggingface-cli download tiiuae/Falcon-E-3B-Instruct-GGUF ggml-model-i2_s.gguf --local-dir models/Falcon-E-3B-Instruct/

# Build bitnet.cpp for this model (required step from the BitNet README):
python setup_env.py -md models/Falcon-E-3B-Instruct -q i2_s

# Run inference:
python run_inference_server.py -m models/Falcon-E-3B-Instruct/ggml-model-i2_s.gguf -c 4099
```

8-bit quantized Llama 3.2 1B and Falcon3 1B were also giving about that speed.

Technology Innovation Institute org
β€’
edited May 29, 2025

Thank you @supercharge19
This is surprising, as the bitnet authors reported results for Falcon-E-3B consistent with their benchmarks on an Apple M2 chip:

[image: bitnet.cpp throughput vs. model size plot on Apple M2]

Their table for Falcon-E 3B:

[screenshot: bitnet.cpp benchmark table for Falcon-E 3B]

They got up to 100 tok/s with 12 parallel threads, which is in line with the 96 tok/s obtained for a 2.5B model on their plot. Perhaps this is the culprit; can you try playing with the -t parameter when running inference with bitnet.cpp?
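If it helps, here is a quick hypothetical sketch for sweeping thread counts; it assumes run_inference.py and its -m/-p/-n/-t flags from the BitNet README, plus the model path from the steps above. Each run includes process startup and model load, so treat the numbers as rough:

```python
# Hypothetical helper: sweep bitnet.cpp thread counts and time a short run.
# Run from inside the cloned BitNet directory. Each run reloads the model,
# so the numbers are only a rough guide to how speed scales with -t.
import subprocess
import time

MODEL = "models/Falcon-E-3B-Instruct/ggml-model-i2_s.gguf"

for threads in (1, 2, 4, 8, 12):
    start = time.perf_counter()
    subprocess.run(
        ["python", "run_inference.py", "-m", MODEL,
         "-p", "Explain ternary weights in one sentence.",
         "-n", "64", "-t", str(threads)],
        check=True, capture_output=True,
    )
    elapsed = time.perf_counter() - start
    print(f"-t {threads}: {elapsed:.1f}s wall time for 64 new tokens")
```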

Or could it be that the quantization method is inherently slow on CPUs? In the past I've come across models with i-quants (integer quantization) that were slower than a plain Q4 or the like. I will test the newer H-series models at 8-bit and 4-bit and see how they run on the same hardware.
