
max_length not working?

#18

by domid10 - opened Jun 2, 2023

Discussion

domid10

Jun 2, 2023

The reply always seems to be under 70 characters, even when I set a higher max_length. Any ideas?
Example reply:
"Life is a journey, a path we must take.
To find our way, we"

from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from dotenv import load_dotenv
load_dotenv()

template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm = HuggingFaceHub(repo_id="tiiuae/falcon-7b-instruct", model_kwargs={"temperature":0.1, "max_length":2000,})
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Write a poem about life"
print(question)
print('➡️ ', llm_chain.run(question))

aditunoe

Jun 12, 2023

I have the same issue. At least in combination with LangChain, the model tends to output only a few tokens and then stops in the middle of a sentence.

It would be nice to know whether we are doing something wrong or whether this is just the way the model works.

aditunoe

Jun 13, 2023

I found our mistake. @domid10 you need to add max_new_tokens and set it to a higher value to get longer replies.
Example:

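from langchain.llms import HuggingFaceEndpoint  # import path as of mid-2023 LangChain releases

# HUGGINFACE_KEY holds your Hugging Face API token
llm = HuggingFaceEndpoint(
    endpoint_url="https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct",
    huggingfacehub_api_token=HUGGINFACE_KEY,
    task="text-generation",
    model_kwargs={
        "temperature": 0.2,
        "max_new_tokens": 400,  # caps generated tokens only
        "num_return_sequences": 1,
    },
)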
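
For readers calling the hosted endpoint without LangChain, the same fix can be sketched as a plain request body. This is a minimal sketch, not an official client: the helper name `build_payload` and its default values are illustrative assumptions, and the URL is the one quoted above. The body is only built and printed here, not sent.

```python
import json

# Endpoint URL as quoted in the thread above.
API_URL = "https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct"


def build_payload(prompt, max_new_tokens=400, temperature=0.2):
    """Build a text-generation request body (hypothetical helper).

    max_new_tokens limits only the generated tokens; the prompt
    length does not count against it.
    """
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be a positive integer")
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


payload = build_payload("Write a poem about life")
body = json.dumps(payload)  # JSON string ready to be sent as the request body
print(body)
```

Assuming the `requests` package and a valid token, the body could then be sent with `requests.post(API_URL, headers={"Authorization": f"Bearer {token}"}, json=payload)`. The key point is the same as in the LangChain snippet: the length limit goes in `max_new_tokens`, not `max_length`.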

ajmalsiddiqui

Jun 13, 2023

edited Jun 13, 2023

Hello all, I am interested in Falcon performance benchmarks on A100 and T4 GPUs. I would be thankful if someone could share inference statistics:
a) GPU type
b) Average inference time per request

domid10

Jun 15, 2023

@aditunoe Thank you! This should really be added to the docs.
