Instructions to use tiiuae/falcon-7b-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use tiiuae/falcon-7b-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tiiuae/falcon-7b-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use tiiuae/falcon-7b-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tiiuae/falcon-7b-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/falcon-7b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
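
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and it assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style chat completion JSON.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and parse the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed base URL: the default address used by the curl example above.
request = build_chat_request(
    "http://localhost:8000",
    "tiiuae/falcon-7b-instruct",
    "What is the capital of France?",
)

# Uncomment once the server is running:
# reply = send_chat_request(request)
# print(reply["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes it easy to point the same helper at a different base URL, for example a server listening on port 30000.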

Use Docker

docker model run hf.co/tiiuae/falcon-7b-instruct

SGLang

How to use tiiuae/falcon-7b-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tiiuae/falcon-7b-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/falcon-7b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tiiuae/falcon-7b-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/falcon-7b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use tiiuae/falcon-7b-instruct with Docker Model Runner:
```
docker model run hf.co/tiiuae/falcon-7b-instruct
```

How to use the CoreML model?

#65

by yyjhao - opened Jul 17, 2023

Discussion

yyjhao

Jul 17, 2023

Sorry if this is a noob question:

I was able to drag the mlpackage folder into my Xcode project and have it generate a class. I then call

let model = try! falcon_7b_64_float32()

and I noticed that the model has a 'prediction' function, but it takes a falcon_7b_64_float32Input type, and its return type looks like another special type as well. How do I convert a string into that input, and the output back into text?

anomalus

Jul 18, 2023

I'm curious as well! It'd be great to have the code from the demo shown in the video, so we can tinker.

I may be overthinking this, but I suspect it involves passing the String to a tokenizer built for this particular model, similar to these Swift CoreML transformers.

pcuenq

Jul 19, 2023

You are right @anomalus : you need to tokenize the text, and then process the outputs to create the output sequence. The model only returns information about the probability of the next token in the sequence, so you need to call it multiple times to get the output.

We intend to publish everything soon.
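
The loop described above (tokenize, ask the model for next-token scores, append the best token, repeat) can be sketched as follows. This is a minimal Python illustration, not the code from the demo: `toy_next_token_scores` is a hypothetical stand-in for the Core ML `prediction` call, the four-entry vocabulary is invented for the example, and greedy selection is only one of several possible decoding strategies.

```python
from typing import Callable, List, Optional, Sequence

# Invented toy vocabulary; a real model ships its own tokenizer.
VOCAB = ["<eos>", "a", "b", "c"]
EOS_ID = 0


def encode(text: str) -> List[int]:
    """Toy tokenizer: one token id per character."""
    return [VOCAB.index(ch) for ch in text]


def decode(token_ids: Sequence[int]) -> str:
    """Map token ids back to text, skipping the end-of-sequence token."""
    return "".join(VOCAB[i] for i in token_ids if i != EOS_ID)


def toy_next_token_scores(token_ids: Sequence[int]) -> List[float]:
    """Hypothetical model: scores the id following the last token highest."""
    scores = [0.0] * len(VOCAB)
    scores[(token_ids[-1] + 1) % len(VOCAB)] = 1.0
    return scores


def greedy_generate(
    next_token_scores: Callable[[Sequence[int]], List[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Call the model once per new token, always picking the top-scoring id."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break  # the model signalled the end of the sequence
    return ids


generated = greedy_generate(toy_next_token_scores, encode("a"), 10, eos_id=EOS_ID)
print(decode(generated))  # prints "abc"
```

With a real model, `encode` and `decode` would be replaced by the model's own tokenizer and `toy_next_token_scores` by a call into the generated model class.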

anomalus

Jul 19, 2023

@pcuenq Fantastic. Looking forward to it!

jayfehr

Aug 5, 2023

> You are right @anomalus : you need to tokenize the text, and then process the outputs to create the output sequence. The model only returns information about the probability of the next token in the sequence, so you need to call it multiple times to get the output.
>
> We intend to publish everything soon.

Would you be able to provide some quick sample code to run the mlpackage?

anomalus

Aug 16, 2023

Posting this here: https://huggingface.co/blog/swift-coreml-llm

Thanks @pcuenq ! The only thing I'm wondering about is that Falcon 7b in Swift Chat is unusably slow: it takes maybe 5 minutes per word. I have a MacBook Pro M1 Max with 32GB of RAM, but Swift Chat uses 55GB+ of RAM on a simple run. Any advice on how to navigate this?
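
A rough back-of-the-envelope estimate may help put these memory figures in context. The sketch below assumes about 7 billion parameters (taken from the model name, not an exact count) and 4 bytes per weight (suggested by the `float32` suffix in `falcon_7b_64_float32`). It counts the weights only and says nothing about what Core ML or Swift Chat allocate on top of them.

```python
def weights_size_gib(num_parameters: float, bytes_per_parameter: int) -> float:
    """Approximate size of the model weights in GiB (weights only)."""
    return num_parameters * bytes_per_parameter / 2**30


# Assumption: roughly 7 billion parameters, as the model name suggests.
ASSUMED_PARAMETERS = 7e9

for label, bytes_per_parameter in (("float32", 4), ("float16", 2)):
    size = weights_size_gib(ASSUMED_PARAMETERS, bytes_per_parameter)
    print(f"{label}: ~{size:.1f} GiB")
# float32: ~26.1 GiB
# float16: ~13.0 GiB
```

Under these assumptions the float32 weights alone come to roughly 26 GiB, close to the 32GB of physical RAM mentioned above, so heavy swapping would be one plausible contributor to the slowdown. This is an estimate, not a diagnosis.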
