Slow inference
The 40B and 7B model cards say that this model is optimised for inference, but it is one of the slowest 7B models I have tried. Maybe I am doing something wrong or missing a required library, but the usage example code runs very slowly on an A100 or an RTX 8000. Is this a common problem, or am I doing something wrong?
Slow for me also, on an RTX 3090. Orders of magnitude slower than other 7B models I've tried.
After warming up, other models summarize an article in 2 to 10 seconds. Falcon takes about 2 minutes for the same article.
I double checked that it's using the GPU and tried running a quantized version, but still slow.
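For anyone who wants to compare numbers in this thread, here is a minimal sketch of how a "time after warming up" measurement could be taken. The helper name `time_after_warmup` and its parameters are my own, not from any library, and `fn` stands in for whatever generation call you are timing. It assumes `fn` blocks until the output is fully produced.

```python
import time


def time_after_warmup(fn, warmup=1, runs=3):
    """Run fn() `warmup` times untimed, then return the mean
    wall-clock seconds over `runs` timed calls.

    `fn` is a hypothetical zero-argument callable, e.g. a lambda
    wrapping a summarization call. It must block until it is done.
    """
    # Untimed calls so one-off costs (lazy initialisation, caches)
    # do not distort the measurement.
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Usage would look something like `time_after_warmup(lambda: pipe(prompt, max_new_tokens=200))`, where `pipe` and `prompt` are whatever you already have loaded.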
It's really slow for me also
@Sven00 I didn't find any official examples for summarization prompts either, but through trial and error I found this works fairly well:
INSTRUCTIONS:
You are a political analyst for a national newspaper.
Only refer to the provided text and no other sources.
Summarize 5 key facts from the following text as a numbered list.
TEXT:
###
{text}
###
SUMMARY:
However, the model neither numbers the items nor counts them correctly.
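In case it is useful, the template above can be filled in from Python roughly like this. This is only a sketch: the names `SUMMARY_TEMPLATE` and `build_summary_prompt` are made up for illustration, and the wording is copied from the prompt above as-is.

```python
# Prompt wording copied from the template above; {text} is the
# placeholder for the article to summarize.
SUMMARY_TEMPLATE = (
    "INSTRUCTIONS:\n"
    "You are a political analyst for a national newspaper.\n"
    "Only refer to the provided text and no other sources.\n"
    "Summarize 5 key facts from the following text as a numbered list.\n"
    "TEXT:\n"
    "###\n"
    "{text}\n"
    "###\n"
    "SUMMARY:"
)


def build_summary_prompt(text: str) -> str:
    """Return the full prompt with the article inserted between the ### markers."""
    return SUMMARY_TEMPLATE.format(text=text.strip())
```

The resulting string can then be passed to whatever generation call you are using.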
@HAvietisov Running quantized is slightly faster for this model on my hardware at least, but not by much.
@patonw what hardware do you use, and which quantization method?
I run int8 quantization via bitsandbytes, with dequantization to float16, on a single RTX 3090.
Changing torch_dtype=torch.bfloat16 to torch_dtype=torch.float16 in the Getting Started code snippet (that is, removing the "b" before "float") led to a significant speedup on a 16 GB VRAM NC4as-v3 machine in Databricks running the falcon-7b-instruct model. Hope this helps others, too.
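A rough sketch of what that change could look like, with the dtype made a parameter so both settings are easy to try. The helper names `check_dtype_name` and `build_pipeline` are hypothetical, and the pipeline arguments are my assumption of what the Getting Started snippet passes. `device_map="auto"` needs the accelerate package installed. Whether float16 is faster will depend on your GPU.

```python
ALLOWED_DTYPES = ("float16", "bfloat16", "float32")


def check_dtype_name(name: str) -> str:
    """Validate a dtype name before touching torch (hypothetical helper)."""
    if name not in ALLOWED_DTYPES:
        raise ValueError(f"unsupported dtype {name!r}; expected one of {ALLOWED_DTYPES}")
    return name


def build_pipeline(model_id="tiiuae/falcon-7b-instruct", dtype_name="float16"):
    """Build a text-generation pipeline with the chosen torch dtype.

    The heavy imports are done here so the module can be imported
    without torch or transformers installed.
    """
    import torch
    import transformers

    dtype = getattr(torch, check_dtype_name(dtype_name))  # e.g. torch.float16
    return transformers.pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=dtype,        # float16 instead of the snippet's bfloat16
        trust_remote_code=True,
        device_map="auto",        # assumption: accelerate is installed
    )
```

Calling `build_pipeline(dtype_name="bfloat16")` and `build_pipeline(dtype_name="float16")` in turn and timing the same prompt on each should show whether the change helps on a given machine.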
How long do the warmup steps usually take to finish for a full fine-tune?