How to use the CoreML model?
Sorry if this is a noob question:
I was able to drag the mlpackage folder into my Xcode project and have it generate a class. I then do
let model = try! falcon_7b_64_float32()
and I noticed that the model has a 'prediction' function, but that takes in a falcon_7b_64_float32Input type. It looks like the return type of that function is another special type as well. How do I convert from a string to input and from the output to another string text?
I'm curious as well! It'd be great to have the code from the demo shown in the video, so we can tinker.
I may be overthinking this, but I suspect it involves passing the String to a tokenizer built for this particular model, similar to these Swift CoreML transformers.
You are right @anomalus : you need to tokenize the text, and then process the outputs to create the output sequence. The model only returns information about the probability of the next token in the sequence, so you need to call it multiple times to get the output.
We intend to publish everything soon.
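The loop described above (tokenize, ask the model for next-token scores, pick a token, append it, repeat) can be sketched in plain Swift. This is only an illustration under stated assumptions: `NextTokenScores`, `argmax`, `generate`, and `toyModel` are hypothetical names, the "model" is a toy closure standing in for a wrapper around the generated class's `prediction` call, and the token IDs are made up rather than produced by Falcon's tokenizer.

```swift
// Hypothetical stand-in for the model: given the tokens so far,
// return one score per vocabulary entry for the next token.
// In a real app this closure would wrap the generated class's prediction call.
typealias NextTokenScores = ([Int]) -> [Float]

/// Index of the highest score (greedy choice). Returns 0 for an empty array.
func argmax(_ scores: [Float]) -> Int {
    var best = 0
    for (i, s) in scores.enumerated() where s > scores[best] {
        best = i
    }
    return best
}

/// Greedy decoding: call the model once per new token until EOS or the limit.
func generate(prompt: [Int], maxNewTokens: Int, eosToken: Int,
              scores: NextTokenScores) -> [Int] {
    var tokens = prompt
    for _ in 0..<maxNewTokens {
        let next = argmax(scores(tokens))
        if next == eosToken { break }
        tokens.append(next)
    }
    return tokens
}

// Toy model over a 5-token vocabulary: always prefers (last token + 1).
let vocabSize = 5
let toyModel: NextTokenScores = { tokens in
    var s = [Float](repeating: 0, count: vocabSize)
    let last = tokens.last ?? 0
    s[(last + 1) % vocabSize] = 1
    return s
}

// Token 4 plays the role of end-of-sequence in this toy setup.
let output = generate(prompt: [0], maxNewTokens: 10, eosToken: 4, scores: toyModel)
print(output)  // prints "[0, 1, 2, 3]"
```

With the real model, the prompt array would come from the tokenizer and the final token array would be decoded back into a string. Greedy selection is just the simplest choice here; sampling strategies would replace `argmax`.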
Would you be able to provide quick sample code to run the mlpackage?
Posting this here: https://huggingface.co/blog/swift-coreml-llm
Thanks @pcuenq! The only part I'm unsure about is that Falcon 7B with Swift Chat is unusably slow: it takes maybe 5 minutes per word. I have a MacBook Pro M1 Max with 32GB of RAM, but Swift Chat uses 55GB+ of RAM on a simple run. Any advice on how to navigate this?
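For a sense of scale, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a rough sketch, not a measurement: the parameter count of about 7 billion is an assumption read off the "7b" in the model name, `weightMemoryGB` is a hypothetical helper, and the estimate ignores activations, caches, and any copies the runtime makes while loading.

```swift
/// Approximate weight memory in gigabytes (1 GB = 1e9 bytes).
func weightMemoryGB(parameters: Double, bytesPerParameter: Double) -> Double {
    return parameters * bytesPerParameter / 1e9
}

let parameters = 7e9  // assumption: taken from the "7b" in the model name

// float32 stores each weight in 4 bytes, float16 in 2 bytes.
let float32GB = weightMemoryGB(parameters: parameters, bytesPerParameter: 4)
let float16GB = weightMemoryGB(parameters: parameters, bytesPerParameter: 2)

print(float32GB, float16GB)  // prints "28.0 14.0"
```

Under these assumptions the float32 weights alone come close to the 32GB of RAM on the machine mentioned above, before counting anything else the app allocates.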