Slow inference

#33

by BigArt - opened Jun 14, 2023

Discussion

BigArt

Jun 14, 2023

The 40B and 7B model cards say that this model is optimised for inference, but it is one of the slowest 7B models. Maybe I am doing something wrong or am missing required libraries, but the usage example code runs very slowly on an A100 or an RTX 8000. Is this a common problem, or am I doing something wrong?

patonw

Jun 14, 2023

Slow for me also, on an RTX 3090. Orders of magnitude slower than other 7B models I've tried.
After warming up, other models summarize an article in 2 to 10 seconds. Falcon takes about 2 minutes for the same article.
I double checked that it's using the GPU and tried running a quantized version, but still slow.
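
To make comparisons like the one above reproducible, a small timing helper can report seconds per run and tokens per second after discarding warm-up runs. This is a minimal sketch: `time_generation` is a hypothetical helper, not part of any library, and it assumes the `generate` callable returns the newly generated token ids only once they are fully available on the host.

```python
import time
from typing import Callable, Sequence


def time_generation(
    generate: Callable[[str], Sequence[int]],
    prompt: str,
    warmup: int = 1,
    runs: int = 3,
) -> dict:
    """Time generate(prompt) and report average latency and throughput."""
    # Warm-up runs are discarded so one-off setup costs do not skew the result.
    for _ in range(warmup):
        generate(prompt)

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start

    return {
        "seconds_per_run": elapsed / runs,
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


# Hypothetical usage with a Transformers model and tokenizer already loaded:
#
# def generate(prompt):
#     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
#     out = model.generate(**inputs, max_new_tokens=128)
#     # .tolist() copies to the host, so timing includes all GPU work.
#     return out[0][inputs["input_ids"].shape[-1]:].tolist()
#
# print(time_generation(generate, "Summarize: ..."))
```

Running the same helper against two models with the same prompt and `max_new_tokens` gives numbers that can be compared directly.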

Sven00

Jun 16, 2023

@patonw Can you please let me know which prompt and parameters you are using for the summarization task? I'm struggling to get a reasonably stable and factually correct summary out of 7B models. Thank you.

HAvietisov

Jun 17, 2023

@patonw Aren't quantized models always slower than models in float16?

treeguard

Jun 19, 2023

It's really slow for me also

patonw

Jun 19, 2023

@Sven00 I didn't find any official examples for summarization prompts either, but through trial and error I found that this works fairly well:

INSTRUCTIONS:
You are a political analyst for a national newspaper.
Only refer to the provided text and no other sources.
Summarize 5 key facts from the following text as a numbered list.

TEXT:
###
{text}
###

SUMMARY:

However, the model neither numbers the items nor counts them correctly.
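
The template above can be filled in programmatically. The following is a minimal sketch; the names `SUMMARY_PROMPT` and `build_summary_prompt` are hypothetical and only illustrate how the `{text}` placeholder is substituted.

```python
# Prompt template copied from the post above; {text} is the only placeholder.
SUMMARY_PROMPT = """INSTRUCTIONS:
You are a political analyst for a national newspaper.
Only refer to the provided text and no other sources.
Summarize 5 key facts from the following text as a numbered list.

TEXT:
###
{text}
###

SUMMARY:"""


def build_summary_prompt(text: str) -> str:
    """Insert the article text between the ### delimiters."""
    # Braces inside `text` are safe: only the template itself is parsed.
    return SUMMARY_PROMPT.format(text=text.strip())


if __name__ == "__main__":
    print(build_summary_prompt("Parliament passed the budget on Tuesday."))
```

The resulting string can be passed as the prompt to whichever text-generation call is in use.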

@HAvietisov Running quantized is slightly faster for this model on my hardware at least, but not by much.

HAvietisov

Jun 19, 2023 • edited Jun 19, 2023

@patonw What hardware do you use, and which quantization method?
I run int8 quantization via bitsandbytes, with dequantization to float16, on a single RTX 3090.

rustamg

Jun 21, 2023 • edited Jun 21, 2023

Changing torch_dtype=torch.bfloat16 to torch_dtype=torch.float16 in the Getting Started code snippet (removing the "b" before "float") led to a significant speedup on a 16GB VRAM NC4as-v3 machine in Databricks running the falcon-7b-instruct model. Hope this helps others, too.
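
One possible explanation, offered as an assumption rather than something confirmed in this thread: NVIDIA GPUs older than Ampere (compute capability below 8.0) have no native bfloat16 support, so bfloat16 can be much slower than float16 on them. The sketch below picks the dtype from the GPU's compute capability. `pick_dtype_name` and `load_falcon_pipeline` are hypothetical helper names, the pipeline arguments are assumed to mirror the Getting Started snippet, and `device_map="auto"` assumes the `accelerate` package is installed.

```python
def pick_dtype_name(compute_capability: tuple) -> str:
    """Return "bfloat16" on Ampere or newer NVIDIA GPUs, otherwise "float16"."""
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


def load_falcon_pipeline(model_id: str = "tiiuae/falcon-7b-instruct"):
    """Build a text-generation pipeline with a dtype suited to the GPU."""
    # Imported lazily so pick_dtype_name stays usable without a GPU stack.
    import torch
    from transformers import pipeline

    if not torch.cuda.is_available():
        raise RuntimeError("This sketch assumes a CUDA GPU is present.")

    # get_device_capability() returns (major, minor) for the current device.
    dtype = getattr(torch, pick_dtype_name(torch.cuda.get_device_capability()))
    return pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=dtype,
        trust_remote_code=True,
        device_map="auto",
    )
```

Note that float16 has a narrower exponent range than bfloat16, so outputs may differ slightly between the two settings.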

michaelomahony

Jul 10, 2023

@rustamg thanks for sharing! Any idea how much of a drop in accuracy this could cause?

Jenny2020

Sep 1, 2023 • edited Sep 1, 2023

How long does it usually take for the warmup steps to finish during a full fine-tune?
