Instructions to use OpenAssistant/falcon-7b-sft-mix-2000 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use OpenAssistant/falcon-7b-sft-mix-2000 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="OpenAssistant/falcon-7b-sft-mix-2000", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("OpenAssistant/falcon-7b-sft-mix-2000", trust_remote_code=True, dtype="auto")


vLLM

How to use OpenAssistant/falcon-7b-sft-mix-2000 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "OpenAssistant/falcon-7b-sft-mix-2000"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenAssistant/falcon-7b-sft-mix-2000",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
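
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on localhost:8000. The helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request, timeout=120):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete(build_completion_request("http://localhost:8000", "OpenAssistant/falcon-7b-sft-mix-2000", "Once upon a time,"))` should return the generated continuation. Building and sending are kept separate so the payload can be inspected without a live server.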

Use Docker

docker model run hf.co/OpenAssistant/falcon-7b-sft-mix-2000

SGLang

How to use OpenAssistant/falcon-7b-sft-mix-2000 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "OpenAssistant/falcon-7b-sft-mix-2000" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenAssistant/falcon-7b-sft-mix-2000",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "OpenAssistant/falcon-7b-sft-mix-2000" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenAssistant/falcon-7b-sft-mix-2000",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use OpenAssistant/falcon-7b-sft-mix-2000 with Docker Model Runner:
```
docker model run hf.co/OpenAssistant/falcon-7b-sft-mix-2000
```

Error using example code in model card

by Eduardo-AC - opened Jul 20, 2023

Discussion

Eduardo-AC

Jul 20, 2023

Good morning,

I am writing to ask how to overcome the issue popping up while trying to run the pipeline described in the model card.

from transformers import AutoTokenizer
import transformers
import torch

model = "OpenAssistant/falcon-7b-sft-mix-2000"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

input_text="<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"

sequences = pipeline(
    input_text,
    max_length=500,
    do_sample=True,
    return_full_text=False,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Which triggers the following error

How did you manage to solve this issue to run the pipeline successfully?
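
The prompt in the code above wraps the user message in special tokens. A minimal sketch of a helper that builds such prompts follows; `build_prompt` is a hypothetical name, and the multi-turn layout (every earlier turn closed with `<|endoftext|>`) is an assumption extrapolated from the single-turn string in the post.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
END = "<|endoftext|>"


def build_prompt(turns):
    """Format (role, text) turns and leave the prompt open for the assistant's reply."""
    tags = {"prompter": PROMPTER, "assistant": ASSISTANT}
    parts = []
    for role, text in turns:
        # Assumption: each completed turn is terminated by the end-of-text token.
        parts.append(f"{tags[role]}{text}{END}")
    # The trailing assistant tag asks the model to generate the next reply.
    return "".join(parts) + ASSISTANT


input_text = build_prompt(
    [("prompter", "What is a meme, and what's the history behind this word?")]
)
print(input_text)
```

For a single user turn this reproduces the `input_text` string used in the post.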

ThatOneShortGuy

Jul 21, 2023

Is everything updated? Especially check that transformers is updated.
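
One quick way to check this is to print the installed versions. The snippet below is a small sketch using the standard library; the package list is an assumption (`accelerate` is included because `device_map="auto"` depends on it), and `installed_versions` is an illustrative helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    result = {}
    for name in packages:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


for name, version in installed_versions(["transformers", "torch", "accelerate"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Including this output in a bug report makes it easier for others to reproduce the problem.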

Eduardo-AC

Jul 24, 2023

Yes, everything was installed from scratch for a hackathon project using Python 3.11.

Metal3d

Aug 14, 2023

Exactly the same error for me

ThatOneShortGuy

Aug 14, 2023 (edited Aug 14, 2023)

@Metal3d This is strange. It seems to work fine for me with transformers 4.31.0. Are you sure you copied the sample code right?

The first guy was also using a Flask server, so his code couldn't be verified.

If you really want to use the model, try loading it without the pipeline:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/falcon-7b-sft-mix-2000"

# Pad on the left, since a decoder-only model continues from the end of the prompt.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
# Reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token

# Load the weights in float16 and let accelerate place them on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Tokenize the prompt and move it to the same device as the model.
tokens = tokenizer.encode("Hello, my dog is cute", return_tensors="pt", padding=True, truncation=True).to(model.device)
# Sample one continuation of up to 100 tokens (prompt included).
gen = model.generate(tokens, do_sample=True, max_length=100, top_p=0.95, top_k=50, num_return_sequences=1, no_repeat_ngram_size=2, early_stopping=True)
print(tokenizer.decode(gen[0], skip_special_tokens=True))

deepakkaura26

Aug 16, 2023

@ThatOneShortGuy can I run this model on CPU ?

ThatOneShortGuy

Aug 16, 2023

Yes, you theoretically can, but it is extremely slow.

Practically speaking, no.

It's been 16 minutes and not a single token has been generated on my 10600k. It may be worth noting that it is single threaded, but 🤷‍♂️
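
A rough back-of-the-envelope estimate helps explain why CPU inference is impractical: the weights alone must fit in RAM and be read for every generated token. The sketch below assumes roughly 7 billion parameters (taken from the model name, not measured) and ignores activations and the key/value cache.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Assumption: about 7 billion parameters, inferred from "falcon-7b".
N_PARAMS = 7e9

for dtype, size in [("float32", 4), ("float16/bfloat16", 2)]:
    print(f"{dtype}: ~{weight_memory_gib(N_PARAMS, size):.1f} GiB")
```

Under these assumptions the weights take about 26 GiB in float32 and about 13 GiB in a 16-bit format.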

Eduardo-AC

Aug 17, 2023

Yes, I followed the instructions exactly as given. I will give 4.31.0 a try, but since one model works and another does not, I believe the problem must be in the instructions or in the model itself.
