Trained on ternary bits?
Was it trained with only 3 integer values (-1, 0, +1) from the start? Or is it a quantized model, i.e. a full-precision model that was compressed into this? If it's compressed from a full model rather than trained from scratch with ternary weights, doesn't that affect the quality of the responses?
Why is it 16-bit on Hugging Face when downloaded? Doesn't that negatively affect the quality and speed of generation?
I believe what you are looking for is this: tiiuae/Falcon3-10B-Instruct-1.58bit-GGUF
I haven't tested either of these quantized models, but I think this one was trained that way and is supposed to be better than its GGUF counterpart. @ybelkada right?
The full model is, at least for text-to-text tasks, as good as gpt-4o-mini. Try the full model if you can.
It's because Microsoft's BitNet packs the 16-bit weights into ternary values (3 possible states, roughly 1.58 bits each), which is where the speedup comes from.
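To make the packing idea concrete, here is a rough NumPy sketch (my own illustration, not the actual layout that bitnet.cpp's i2_s kernels use): four ternary weights {-1, 0, +1} fit into one byte as 2-bit codes, which is where the memory savings over fp16 come from.

```python
import numpy as np

def pack_ternary(weights: np.ndarray) -> np.ndarray:
    """Pack ternary weights (-1, 0, +1) into one byte per four weights."""
    codes = (weights + 1).astype(np.uint8)          # map {-1, 0, +1} -> {0, 1, 2}
    codes = codes.reshape(-1, 4)                    # four 2-bit codes per byte
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Recover the ternary weights from the packed bytes."""
    codes = np.stack([(packed >> shift) & 0b11 for shift in (0, 2, 4, 6)], axis=1)
    return codes.astype(np.int8).reshape(-1) - 1    # map {0, 1, 2} back to {-1, 0, +1}

w = np.random.choice([-1, 0, 1], size=16).astype(np.int8)
assert np.array_equal(unpack_ternary(pack_ternary(w)), w)
print(f"{w.size} fp16 weights: {w.size * 2} bytes -> packed: {pack_ternary(w).nbytes} bytes")
```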
We recently released checkpoints trained directly in Bitnet format: https://falcon-lm.github.io/blog/falcon-edge/ (hf version: https://huggingface.co/blog/tiiuae/falcon-edge) - check out the blog post for more details
Thank you, I always look forward to models from TIIUAE.
We now have a new release of Bitnet models: https://huggingface.co/collections/tiiuae/falcon-edge-series-6804fd13344d6d8a8fa71130 feel free to test them out
blogpost: https://falcon-lm.github.io/blog/falcon-edge/
I love the fact that a small model (file size below 1 GB) is capable of holding its own against models 5 times bigger. However, BitNet makes the model really slow on CPUs. The real reason to make models small, I think, is to enable edge computing; if these models are slow on a laptop, how can they work on mobile or SBCs? Are there plans to enable faster inference? (Currently I get less than a token per second on an i7 7th gen.) Small but strong is good, but it also needs to be reasonably fast.
Thank you for your comment @supercharge19!
Just to understand better, have you used bitnet.cpp: https://github.com/microsoft/BitNet ? The authors there claim a big speedup compared to other models on CPU (below is a result on an i7).
If you used HF transformers, it is expected to be slow. If you use bitnet.cpp you should expect a nice speedup (see for example this comment); you can use the GGUF files exposed in the HF collection.
Thank you for responding; the comment does show good gains. Now I need to know how to run the GGUF file (yes, I tried with Hugging Face transformers).
Thank you @supercharge19 ! You can have a look at this section: https://huggingface.co/tiiuae/Falcon-E-3B-Instruct-GGUF#bitnet on how to run the GGUF files
New results:

```
prompt eval time =   578.80 ms /  14 tokens (41.34 ms per token, 24.19 tokens per second)
       eval time = 18921.55 ms / 228 tokens (82.99 ms per token, 12.05 tokens per second)
```
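Those tokens-per-second figures are just the timings re-expressed per second; a quick sanity check in Python, re-deriving the numbers printed above:

```python
# Re-derive the throughput from the reported timings
prompt_ms, prompt_tokens = 578.80, 14
eval_ms, eval_tokens = 18921.55, 228

print(f"prompt: {prompt_tokens / (prompt_ms / 1000):.2f} tokens per second")  # ~24.19
print(f"eval:   {eval_tokens / (eval_ms / 1000):.2f} tokens per second")      # ~12.05
```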
Still, I think it is surprisingly slow for such a small model. Anyway, the instructions you mentioned are not quite right; it's better to follow the instructions in the BitNet repo:
```bash
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Create the environment however you like; I use pyenv, but the site recommends conda,
# so let's follow the site's instructions:
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt

# Get the model:
huggingface-cli download tiiuae/Falcon-E-3B-Instruct-GGUF ggml-model-i2_s.gguf --local-dir models/Falcon-E-3B-Instruct/

# Build bitnet.cpp and prepare the model (see the BitNet README; exact flags may vary by version):
python setup_env.py -md models/Falcon-E-3B-Instruct -q i2_s

# Run inference:
python run_inference_server.py -m models/Falcon-E-3B-Instruct/ggml-model-i2_s.gguf -c 4099
```
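Once the server is up, it can be queried over HTTP. A minimal client sketch, assuming run_inference_server.py wraps the usual llama.cpp-style server on localhost:8080 (the endpoint, port, and response fields may differ; check the script's startup output):

```python
# Hypothetical client sketch: assumes the BitNet server exposes the standard
# llama.cpp "/completion" endpoint on port 8080 (verify the host/port in the
# startup log before relying on this).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Explain ternary weights in one sentence.", "n_predict": 64},
    timeout=120,
)
resp.raise_for_status()
print(resp.json().get("content", resp.text))
```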
The 8-bit quantized Llama 3.2 1B and Falcon3 1B were also giving about that much speed.
Thank you @supercharge19! This is surprising, as the BitNet authors reported results for Falcon-E-3B consistent with their benchmark on an Apple M2 chip.
Their table for Falcon-E 3B:
They got up to 100 tok/s with 12 parallel threads, which is in line with the 96 tok/s obtained for a 2.5B model on their plot. Perhaps that is the culprit; can you try playing with the -t parameter when running inference with bitnet.cpp?
Or could it be that the quantization method is inherently slow on CPUs? In the past I have come across models with I-quantization (integer quantization) that ran slower than plain Q4 quants or something like that. I will test the newer H-series models at 8-bit and 4-bit and see how they run on the same hardware.


