Question about the benchmarks

by quantflex - opened Dec 20, 2024

Discussion

quantflex

Dec 20, 2024

Hi,
I'd like to understand the benchmarking methodology used to compare your models with those from other companies and teams, specifically with regard to the lm-evaluation-harness framework.

For example, I've noticed that the reported MMLU and MMLU-PRO scores for Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct appear lower than expected (and also lower than what Meta and Qwen report themselves).

Could you provide more details on the settings or configuration used for these benchmarks? I'd like to make sure that the comparisons are accurate. Thank you.

slimfrikha

Technology Innovation Institute org Dec 20, 2024

Hi,
Some details are already in the blog post: https://huggingface.co/blog/falcon3#:~:text=In%20our%20internal%20evaluation%20pipeline%3A

quantflex

Dec 20, 2024 (edited Dec 20, 2024)

Hi, thank you. Yes, I did read that before posting, but unfortunately it only provides this one detail:

We report raw scores obtained by applying chat template without fewshot_as_multiturn (unlike Llama3.1).
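
For context, the quoted setting corresponds to two lm-evaluation-harness command-line flags, `--apply_chat_template` and `--fewshot_as_multiturn`. The sketch below only assembles the two command lines so they can be compared side by side; nothing is executed. The helper name `build_lm_eval_command`, the task name `leaderboard_mmlu_pro`, and the 5-shot setting are assumptions for illustration, not the confirmed configuration used by TII or by the leaderboard.

```python
import shlex


def build_lm_eval_command(model_id, task, num_fewshot, fewshot_as_multiturn):
    """Assemble an lm_eval command line (hypothetical helper; nothing is run)."""
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--apply_chat_template",  # both setups apply the chat template
    ]
    if fewshot_as_multiturn:
        # Present each few-shot example as its own user/assistant turn.
        cmd.append("--fewshot_as_multiturn")
    return cmd


MODEL_ID = "tiiuae/Falcon3-3B-Instruct"
TASK = "leaderboard_mmlu_pro"  # assumption: task name chosen for illustration

# Setting described in the quote: chat template, no multi-turn few-shot.
quoted_setup = build_lm_eval_command(MODEL_ID, TASK, 5, fewshot_as_multiturn=False)
# Same run with multi-turn few-shot enabled, for comparison.
multiturn_setup = build_lm_eval_command(MODEL_ID, TASK, 5, fewshot_as_multiturn=True)

print(shlex.join(quoted_setup))
print(shlex.join(multiturn_setup))
```

The two commands differ only in the final flag, which is one reason scores from two otherwise similar runs may not be directly comparable.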

Part of the reason I'm asking is that the official open_llm_leaderboard, which is powered by the same lm-evaluation-harness, reports these results on MMLU-PRO:

As you can see, Falcon3-3B-Instruct is slightly outscored here by both models on MMLU-PRO. However, according to your readme, the results are very different:

So I'm just trying to understand what caused this big discrepancy between the scores.

slimfrikha

Technology Innovation Institute org Dec 20, 2024

The difference is explained by "We report raw scores obtained by applying chat template without fewshot_as_multiturn (unlike Llama3.1)":

We report raw scores, whereas the HF leaderboard reports normalized scores.
--fewshot_as_multiturn is not enabled in our evals, whereas it is enabled in the HF leaderboard evals.
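
The raw versus normalized distinction mentioned above can be sketched as follows. This is a minimal illustration, assuming that normalization rescales raw accuracy so that the random-guess baseline maps to 0 and a perfect score maps to 100, and assuming a baseline of 0.1 for MMLU-PRO (ten answer options). The function name `normalize_within_range` and the raw accuracy value are illustrative assumptions, not numbers from this thread.

```python
def normalize_within_range(raw, lower_bound, higher_bound=1.0):
    """Rescale raw accuracy so lower_bound maps to 0 and higher_bound to 100."""
    # Scores below the random baseline are clipped to 0.
    clipped = max(raw - lower_bound, 0.0)
    return clipped / (higher_bound - lower_bound) * 100.0


RANDOM_BASELINE = 1 / 10  # assumption: 10 answer options per question
raw_accuracy = 0.28       # hypothetical raw accuracy, for illustration only

normalized = normalize_within_range(raw_accuracy, RANDOM_BASELINE)
print(f"raw {raw_accuracy * 100:.1f} -> normalized {normalized:.1f}")
# prints "raw 28.0 -> normalized 20.0"
```

Under these assumptions a normalized score is lower than the raw score for any result short of perfect, so a raw number from a README and a normalized number from a leaderboard are not a like-for-like comparison.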

quantflex

Dec 20, 2024

Got it. So does that mean the few-shot setting made all the difference? Even the leaderboard's raw scores show Falcon3-3B-Instruct scoring slightly lower.
I'm just curious whether this is an accurate reflection when, for example, the Falcon readme shows a 59.9% decrease in MMLU-PRO score between Llama-3.2-3B-Instruct and Falcon3-3B-Instruct.
Here's the raw score reported by the leaderboard:
