Instructions to use HuggingFaceH4/zephyr-7b-beta with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceH4/zephyr-7b-beta with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
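The default pipeline loads the 7B weights in full precision; on a GPU it is usually worth loading in bfloat16 with automatic device placement. A minimal sketch, assuming torch and accelerate are installed (these are the same settings used in the discussion further below):

```python
import torch
from transformers import pipeline

# Load the 7B model in bfloat16 and let accelerate place it on available devices.
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(pipe([{"role": "user", "content": "Who are you?"}], max_new_tokens=40))
```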
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceH4/zephyr-7b-beta with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HuggingFaceH4/zephyr-7b-beta"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```

Use Docker

```shell
docker model run hf.co/HuggingFaceH4/zephyr-7b-beta
```
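The `docker model run` one-liner above goes through Docker Model Runner (see that section below). To run the vLLM server itself in a container, the usual route is the official `vllm/vllm-openai` image; a sketch based on vLLM's documented invocation (adjust GPU runtime flags and cache mounts to your setup):

```shell
# Serve the model with vLLM's official Docker image (OpenAI-compatible API on port 8000).
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model HuggingFaceH4/zephyr-7b-beta
```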
- SGLang
How to use HuggingFaceH4/zephyr-7b-beta with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HuggingFaceH4/zephyr-7b-beta" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```

Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HuggingFaceH4/zephyr-7b-beta" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```
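Both the vLLM and SGLang servers speak the OpenAI chat completions API, so instead of curl you can call them with the `openai` Python client; a minimal sketch (assumes `pip install openai`; point `base_url` at whichever server you started, port 8000 for vLLM or 30000 for SGLang):

```python
from openai import OpenAI

# Local OpenAI-compatible servers accept any placeholder API key.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="HuggingFaceH4/zephyr-7b-beta",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```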
- Docker Model Runner
How to use HuggingFaceH4/zephyr-7b-beta with Docker Model Runner:
```shell
docker model run hf.co/HuggingFaceH4/zephyr-7b-beta
```
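Docker Model Runner also exposes an OpenAI-compatible endpoint once the model is pulled. A sketch assuming host TCP access is enabled on Docker Model Runner's default port 12434 (both the port and the path are assumptions here; check your Docker settings):

```shell
# Assumes Docker Model Runner host TCP access is enabled (default port 12434).
curl -X POST "http://localhost:12434/engines/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "hf.co/HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```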
How do I achieve streaming output in the code below?
```python
import torch
from transformers import pipeline
pipe = pipeline("text-generation", model=r"E:\model\zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")
# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# Output begins with the formatted prompt: <|system|> ...
```
It does not produce streaming output as written. How can I achieve streaming output?
Please note that I'm using a quantized version of Zephyr. Update `model_name_or_path` and the model loader to match the model you intend to use.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer
import torch

# model_name_or_path = "drive/MyDrive/Mistral-7B-OpenOrca_AWQ_GEMM"
model_name_or_path = 'drive/MyDrive/Mistral-7B-Zephyr_AWQ_GEMM'

# Load model
model = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    fuse_layers=True,
    safetensors=True,
    max_new_tokens=2048,  # Feel free to change your context length
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

# TextStreamer prints tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Define prompts
system_prompt = "You are a pirate chatbot who always responds with Arr!"
user_prompt = "Tell me about AI"
messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to('cuda')

generation_output = model.generate(
    prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    pad_token_id=tokenizer.eos_token_id,
    streamer=streamer,  # Here you can pass in a streamer.
)
'''
AI, or artificial intelligence, is a technology that allows machines to learn and perform tasks that typically require human intelligence. It is powered by complex algorithms and vast amounts of data, which the machine uses to make decisions and solve problems. AI has the potential to revolutionize many industries, from healthcare and finance to transportation and manufacturing. Some common examples of AI include virtual assistants like Siri and Alexa, self-driving cars, and chatbots like me, your faithful pirate companion! But beware, for some fear that AI may one day surpass human intelligence and take over the world! Until then, we'll just keep saying "Arr!" and enjoying the high seas.
'''
```
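If you're not using AWQ, the same `TextStreamer` idea works with the plain `transformers` model from your original question; a minimal sketch (assumes a CUDA-capable GPU; substitute your local path for the model id):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "HuggingFaceH4/zephyr-7b-beta"  # or your local path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# skip_prompt=True prints only newly generated tokens, not the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Tokens are printed to stdout incrementally as generate() produces them.
model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95, streamer=streamer)
```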
Thank you so much!