Error using example code in model card
Good morning,
I am writing to ask how to overcome the issue popping up while trying to run the pipeline described in the model card.
from transformers import AutoTokenizer
import transformers
import torch

model = "OpenAssistant/falcon-7b-sft-mix-2000"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

input_text = "<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"

sequences = pipeline(
    input_text,
    max_length=500,
    do_sample=True,
    return_full_text=False,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Which triggers the following error
How did you manage to solve this issue to run the pipeline successfully?
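The prompt string in the snippet above wraps the question in the special tokens <|prompter|>, <|endoftext|> and <|assistant|>. A minimal sketch of a helper that assembles such a prompt is shown here. The function name build_prompt and the (role, text) turn structure are hypothetical, and the assumption that the same pattern repeats for multi-turn conversations is not confirmed in this thread.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
END = "<|endoftext|>"


def build_prompt(turns):
    """Assemble a prompt from (role, text) pairs; role is "prompter" or "assistant".

    Hypothetical helper: only the single-turn layout appears in the thread.
    """
    prefixes = {"prompter": PROMPTER, "assistant": ASSISTANT}
    parts = []
    for role, text in turns:
        if role not in prefixes:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"{prefixes[role]}{text}{END}")
    # Leave the assistant tag open so the model writes the next reply.
    return "".join(parts) + ASSISTANT


question = "What is a meme, and what's the history behind this word?"
input_text = build_prompt([("prompter", question)])
```

For a single question this reproduces the input_text used in the post, which makes it easier to rule out a malformed prompt as the cause of an error.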
Is everything updated? Especially check that transformers is updated.
Yes, everything was installed from scratch for a hackathon project using Python 3.11.
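One way to answer the "is everything updated?" question concretely is to print the installed versions and paste them into the thread. The sketch below uses only the standard library. The package list is an assumption: transformers and torch appear in the code above, and accelerate is included because device_map="auto" relies on it.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of relevant packages; adjust to match your environment.
for name, version in installed_versions(["transformers", "torch", "accelerate"]).items():
    print(f"{name}: {version or 'not installed'}")
```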
@Metal3d This is strange. It seems to work fine for me with transformers 4.31.0. Are you sure you copied the sample code right?
The first guy was also using a flask server, so his code couldn't be verified.
If you really want to use the model, try loading it without the pipeline:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/falcon-7b-sft-mix-2000"

# padding_side (not padding_size) is the tokenizer argument that controls where padding goes
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
# Reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).half()

# Encode the prompt and move it to the device the model was placed on
tokens = tokenizer.encode("Hello, my dog is cute", return_tensors="pt", padding=True, truncation=True).to(model.device)
gen = model.generate(tokens, do_sample=True, max_length=100, top_p=0.95, top_k=50, num_return_sequences=1, no_repeat_ngram_size=2, early_stopping=True)
print(tokenizer.decode(gen[0], skip_special_tokens=True))
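Unlike the pipeline call with return_full_text=False, decoding gen[0] directly prints the prompt followed by the completion. A minimal sketch of how the prompt could be dropped first, assuming the output sequence starts with the prompt tokens (as it does for decoder-only models). The helper name strip_prompt is hypothetical, and plain lists stand in for tensors so the example runs without downloading the model.

```python
def strip_prompt(generated_ids, prompt_length):
    """Return only the tokens that follow the prompt."""
    return generated_ids[prompt_length:]


# Stand-in token ids. With the snippet above the call would be
# strip_prompt(gen[0], tokens.shape[-1]) before tokenizer.decode.
prompt_ids = [101, 102, 103]
output_ids = prompt_ids + [7, 8, 9]
new_ids = strip_prompt(output_ids, len(prompt_ids))
```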
Yes, you theoretically can, but it is extremely slow.
Practically speaking, no.
It's been 16 minutes and not a single token has been generated on my 10600K. It may be worth noting that it is single-threaded, but 🤷‍♂️
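If the run really is limited to one thread, one thing worth checking is how many threads PyTorch is configured to use. The sketch below is an assumption about where the bottleneck might be, not a confirmed fix, and a 7B model may remain impractically slow on a CPU either way. The helper name configure_cpu_threads is hypothetical; torch.set_num_threads and torch.get_num_threads are the PyTorch calls it wraps.

```python
import os


def configure_cpu_threads(requested=None):
    """Pick a thread count and apply it to PyTorch when PyTorch is installed."""
    count = requested or os.cpu_count() or 1
    try:
        import torch
    except ImportError:
        return count  # nothing to configure without PyTorch
    torch.set_num_threads(count)
    return torch.get_num_threads()


threads = configure_cpu_threads()
print(f"CPU threads available for inference: {threads}")
```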
Yes, I followed the instructions as given. I will give 4.31.0 a try, but since one model works and another does not, I believe the problem must be in the instructions or in the model itself.

