Trained on ternary bits?

#3
by LLMToaster - opened

Was it trained with only 3 integer values (-1, 0, +1) from the start? Or is it a quantized version of a full model, i.e. was a full-precision model compressed into this? If it's compressed from a full model, i.e. not trained from scratch with ternary weights, doesn't that affect the quality of the responses? 😕

Also, why is it 16-bit on Hugging Face when downloaded? Doesn't that hurt the quality and speed of generation?

I believe what you are looking for is this: tiiuae/Falcon3-10B-Instruct-1.58bit-GGUF
I haven't tested either of these quantized models, but I think this one was trained that way and is supposed to be better than its GGUF counterpart. @ybelkada right?

The full model is, at least for text-to-text tasks, as good as gpt-4o-mini. Try the full one if you can.

It's 16-bit because Microsoft's BitNet packs the 16-bit weights down into ternary values, so you still get the CUDA speed-up.
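To make the packing idea concrete, here is a minimal sketch (an illustration only, not bitnet.cpp's actual kernel; the function name and 2-bit layout are assumptions) of how ternary weights can be stored at 2 bits each, four per byte:

```python
# Minimal sketch: packing ternary weights {-1, 0, +1} into 2-bit codes,
# four weights per byte. Illustrative only -- not bitnet.cpp's real layout.
import numpy as np

def pack_ternary(w: np.ndarray) -> np.ndarray:
    """Map -1/0/+1 to the 2-bit codes 0/1/2 and pack four codes per byte."""
    codes = (w + 1).astype(np.uint8)   # -1 -> 0, 0 -> 1, +1 -> 2
    codes = codes.reshape(-1, 4)       # four ternary values per output byte
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6))

w = np.random.choice([-1, 0, 1], size=16).astype(np.int8)
packed = pack_ternary(w)
print(f"{w.nbytes} bytes of int8 weights -> {packed.nbytes} packed bytes")
```

The name "1.58-bit" comes from the information content of a ternary weight: log2(3) ≈ 1.58 bits, even though a simple packing like this spends 2 bits per weight.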

Technology Innovation Institute org

We recently released checkpoints trained directly in BitNet format: https://falcon-lm.github.io/blog/falcon-edge/ (HF version: https://huggingface.co/blog/tiiuae/falcon-edge) - check out the blog post for more details.
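For anyone wondering what "trained directly in BitNet format" means in practice: the forward pass ternarizes the latent full-precision weights on the fly, and gradients flow back to those latent weights. Below is a rough sketch of the absmean ternarization step described in the BitNet b1.58 paper (illustrative only, not the actual training code):

```python
# Absmean ternarization from the BitNet b1.58 paper (illustrative sketch).
# In quantization-aware training, the forward pass uses w_q * gamma while
# gradients update the latent float weights (straight-through estimator).
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-5):
    gamma = np.abs(w).mean() + eps               # per-tensor absmean scale
    w_q = np.clip(np.round(w / gamma), -1, 1)    # weights in {-1, 0, +1}
    return w_q, gamma                            # dequantize as w_q * gamma

w = np.random.randn(4, 4).astype(np.float32)
w_q, gamma = ternarize(w)
print(w_q)
```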

Thank you, I always look forward to models from TIIUAE.


I love the fact that a small model (file size below 1 GB) is capable of holding its own against models five times bigger. However, BitNet makes the model really slow on CPUs. The real reason to make models small, I think, is to enable edge computing; if these models are slow on a laptop, how can they work on mobile or SBCs? Are there plans to enable faster inference? (Currently I get less than a token per second on an i7 7th gen.) Small but strong is good, but it also needs to be fairly fast.

Technology Innovation Institute org

Thank you for your comment @supercharge19 !
Just to understand better, have you used bitnet.cpp (https://github.com/microsoft/BitNet)? The authors there claim a big speedup compared to other models on CPU (below is a result on an i7).

[image: bitnet.cpp CPU inference speedup benchmark on an i7]

If you used HF transformers, it is expected to be slow. If you use bitnet.cpp you should see a nice speedup (see for example this comment); you can use the GGUF files exposed in the HF collection.

Thank you for responding; the comment shows good gains. Now I need to know how to run the GGUF file (yes, I tried with Hugging Face transformers).

Technology Innovation Institute org

Thank you @supercharge19 ! You can have a look at this section: https://huggingface.co/tiiuae/Falcon-E-3B-Instruct-GGUF#bitnet on how to run the GGUF files

New results:

```
prompt eval time =   578.80 ms /  14 tokens (41.34 ms per token, 24.19 tokens per second)
       eval time = 18921.55 ms / 228 tokens (82.99 ms per token, 12.05 tokens per second)
```

Still, I think it is surprisingly slow for such a small model. Anyway, the instructions you mentioned are not quite right; it's better to follow the instructions in the BitNet repo:

```bash
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Create the environment however you like; I use pyenv, but the site says
# to use conda, so let's follow the site's instructions:
# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt

# Get the model:
huggingface-cli download tiiuae/Falcon-E-3B-Instruct-GGUF ggml-model-i2_s.gguf --local-dir models/Falcon-E-3B-Instruct/

# Build bitnet.cpp for this model (required step from the BitNet README):
python setup_env.py -md models/Falcon-E-3B-Instruct -q i2_s

# Run inference:
python run_inference_server.py -m models/Falcon-E-3B-Instruct/ggml-model-i2_s.gguf -c 4099
```

8-bit quantized Llama 3.2 1B and Falcon3 1B were also giving about that speed.

Technology Innovation Institute org
β€’
edited May 29, 2025

Thank you @supercharge19
This is surprising, as the bitnet authors reported results for Falcon-E-3B consistent with their benchmarks on an Apple M2 chip:

[image: bitnet.cpp throughput vs. model size plot on Apple M2]

Their table for Falcon-E 3B:

[screenshot: bitnet.cpp benchmark table for Falcon-E 3B]

They got up to 100 tok/s with 12 parallel threads, which is in line with the 96 tok/s obtained for a 2.5B model on their plot. Perhaps this is the culprit; can you try playing with the -t parameter when running inference with bitnet.cpp?
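If it helps, here is a quick hypothetical sketch for sweeping thread counts; it assumes run_inference.py and its -m/-p/-n/-t flags from the BitNet README, plus the model path from the steps above. Each run includes process startup and model load, so treat the numbers as rough:

```python
# Hypothetical helper: sweep bitnet.cpp thread counts and time a short run.
# Run from inside the cloned BitNet directory. Each run reloads the model,
# so the numbers are only a rough guide to how speed scales with -t.
import subprocess
import time

MODEL = "models/Falcon-E-3B-Instruct/ggml-model-i2_s.gguf"

for threads in (1, 2, 4, 8, 12):
    start = time.perf_counter()
    subprocess.run(
        ["python", "run_inference.py", "-m", MODEL,
         "-p", "Explain ternary weights in one sentence.",
         "-n", "64", "-t", str(threads)],
        check=True, capture_output=True,
    )
    elapsed = time.perf_counter() - start
    print(f"-t {threads}: {elapsed:.1f}s wall time for 64 new tokens")
```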

Or could it be that the quantization method is inherently slow on CPUs? In the past I've come across models with i-quants (integer quantization) that were slower than a plain Q4 or the like. I will test the newer H-series models at 8-bit and 4-bit and see how they run on the same hardware.
