Question about the benchmarks
Hi,
I'd like to understand the benchmarking methodology used to compare your models with those from other companies and teams, specifically with regard to the lm-evaluation-harness framework.
For example, I've noticed that the reported MMLU and MMLU-PRO scores for Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct appear lower than expected (and also lower than what Meta and Qwen report themselves).
Could you provide more details on the settings or configuration used for these benchmarks? I'd like to make sure that the comparisons are accurate. Thank you.
Hi,
Some details are already given in the blog post: https://huggingface.co/blog/falcon3#:~:text=In%20our%20internal%20evaluation%20pipeline%3A
Hi, thank you. Yes, I did read that before posting, but unfortunately it only provides this one detail:
We report raw scores obtained by applying chat template without fewshot_as_multiturn (unlike Llama3.1).
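To make the quoted setting concrete, here is a minimal sketch of the two lm-evaluation-harness invocations being contrasted. It only builds and prints the command lines; it does not run an evaluation. The `--apply_chat_template` and `--fewshot_as_multiturn` flags are lm-evaluation-harness CLI options, but the task name `leaderboard_mmlu_pro`, the 5-shot setting, and the helper `build_lm_eval_command` are assumptions for illustration, not the Falcon team's confirmed configuration.

```python
import shlex


def build_lm_eval_command(model_id, task, num_fewshot, fewshot_as_multiturn):
    """Build an lm_eval command line (hypothetical helper, for illustration only)."""
    args = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        # Both setups under discussion apply the model's chat template.
        "--apply_chat_template",
    ]
    if fewshot_as_multiturn:
        # Present each few-shot example as its own user/assistant turn
        # instead of packing all examples into a single prompt.
        args.append("--fewshot_as_multiturn")
    return args


MODEL_ID = "tiiuae/Falcon3-3B-Instruct"
TASK = "leaderboard_mmlu_pro"  # assumed task name
NUM_FEWSHOT = 5                # assumed few-shot count

# Setting described in the blog post: chat template, no multiturn few-shot.
blog_style = build_lm_eval_command(MODEL_ID, TASK, NUM_FEWSHOT, False)
# Setting with multiturn few-shot enabled.
multiturn_style = build_lm_eval_command(MODEL_ID, TASK, NUM_FEWSHOT, True)

print(shlex.join(blog_style))
print(shlex.join(multiturn_style))
```

The only difference between the two printed commands is the trailing `--fewshot_as_multiturn` flag, which changes how the few-shot examples are laid out in the prompt.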
Part of why I'm asking is that the official open_llm_leaderboard, which is powered by the same lm-evaluation-harness, reports these results on MMLU-PRO:
As you can see, Falcon3-3B-Instruct is slightly outscored here by both models on MMLU-PRO. However, according to your readme, the results are very different:
So I'm just trying to understand what caused this big discrepancy between the scores.
The difference is in "We report raw scores obtained by applying chat template without fewshot_as_multiturn (unlike Llama3.1)":
- We report raw scores, whereas the HF leaderboard reports normalized scores.
- --fewshot_as_multiturn is not enabled in our evals, whereas it is enabled in the HF leaderboard evals.
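As a rough illustration of the first point, the sketch below shows how a raw accuracy can be rescaled against a random-guess baseline. It assumes the leaderboard's normalization maps the random baseline to 0 and a perfect score to 100, floors anything below the baseline at 0, and uses a baseline of 1/10 for MMLU-PRO (ten answer options). The function name and the example input are hypothetical and are not actual benchmark results.

```python
def normalize_within_range(raw, lower_bound, upper_bound=1.0):
    """Rescale a raw accuracy in [0, 1] to a 0-100 score.

    lower_bound (the random-guess baseline) maps to 0 and upper_bound maps
    to 100; raw scores below the baseline are floored at 0.
    """
    return max(raw - lower_bound, 0.0) / (upper_bound - lower_bound) * 100.0


# Assumption: MMLU-PRO questions have ten answer options.
MMLU_PRO_RANDOM_BASELINE = 1 / 10

# Placeholder raw accuracy, chosen only to show the arithmetic.
example_raw = 0.30
example_normalized = normalize_within_range(example_raw, MMLU_PRO_RANDOM_BASELINE)

print(f"raw = {example_raw * 100:.1f}%, normalized = {example_normalized:.1f}")
```

Under these assumptions a normalized score is always lower than the raw percentage for any imperfect score, so raw and normalized numbers for the same model should not be compared directly.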
Got it. So does that mean fewshot_as_multiturn made all the difference? Even the leaderboard's raw scores show Falcon3-3B-Instruct scoring slightly lower.
I'm just curious whether the readme is an accurate reflection, given that it shows, for example, a 59.9% difference in MMLU-PRO score between Llama-3.2-3B-Instruct and Falcon3-3B-Instruct.
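For reference, a percentage gap like the one quoted above is a relative difference, which depends on which score is taken as the reference. The sketch below uses placeholder numbers that are not actual benchmark scores; the helper name `relative_decrease` is hypothetical.

```python
def relative_decrease(reference, other):
    """Percentage by which `other` is lower than `reference`."""
    return (reference - other) / reference * 100.0


# Placeholder values only, NOT real MMLU-PRO scores.
higher_score = 10.0
lower_score = 4.01

# How much lower the lower score is, relative to the higher one.
decrease = relative_decrease(higher_score, lower_score)
# The same gap in absolute percentage points.
absolute_gap = higher_score - lower_score

print(f"relative decrease: {decrease:.1f}%")
print(f"absolute gap: {absolute_gap:.2f} points")
```

With small scores, a gap of a few absolute points can translate into a large relative percentage, which is worth keeping in mind when reading the comparison.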
Here's the raw score reported by the leaderboard:


