Instructions to use HuggingFaceH4/zephyr-7b-beta with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceH4/zephyr-7b-beta with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
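The default pipeline loads the 7B weights in full precision; on a GPU it is usually worth loading in bfloat16 with automatic device placement. A minimal sketch, assuming torch and accelerate are installed (these are the same settings used in the discussion further below):

```python
import torch
from transformers import pipeline

# Load the 7B model in bfloat16 and let accelerate place it on available devices.
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(pipe([{"role": "user", "content": "Who are you?"}], max_new_tokens=40))
```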
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceH4/zephyr-7b-beta with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HuggingFaceH4/zephyr-7b-beta"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```

Use Docker

```shell
docker model run hf.co/HuggingFaceH4/zephyr-7b-beta
```
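The `docker model run` one-liner above goes through Docker Model Runner (see that section below). To run the vLLM server itself in a container, the usual route is the official `vllm/vllm-openai` image; a sketch based on vLLM's documented invocation (adjust GPU runtime flags and cache mounts to your setup):

```shell
# Serve the model with vLLM's official Docker image (OpenAI-compatible API on port 8000).
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model HuggingFaceH4/zephyr-7b-beta
```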
- SGLang
How to use HuggingFaceH4/zephyr-7b-beta with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HuggingFaceH4/zephyr-7b-beta" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```

Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HuggingFaceH4/zephyr-7b-beta" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```
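Both the vLLM and SGLang servers speak the OpenAI chat completions API, so instead of curl you can call them with the `openai` Python client; a minimal sketch (assumes `pip install openai`; point `base_url` at whichever server you started, port 8000 for vLLM or 30000 for SGLang):

```python
from openai import OpenAI

# Local OpenAI-compatible servers accept any placeholder API key.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="HuggingFaceH4/zephyr-7b-beta",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```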
- Docker Model Runner
How to use HuggingFaceH4/zephyr-7b-beta with Docker Model Runner:
```shell
docker model run hf.co/HuggingFaceH4/zephyr-7b-beta
```
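Docker Model Runner also exposes an OpenAI-compatible endpoint once the model is pulled. A sketch assuming host TCP access is enabled on Docker Model Runner's default port 12434 (both the port and the path are assumptions here; check your Docker settings):

```shell
# Assumes Docker Model Runner host TCP access is enabled (default port 12434).
curl -X POST "http://localhost:12434/engines/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "hf.co/HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```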
How do I achieve streaming output in the code below?
```python
import torch
from transformers import pipeline
pipe = pipeline("text-generation", model=r"E:\model\zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")
# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# Output begins with the formatted prompt: <|system|> ...
```
It does not produce streaming output as written. How can I achieve streaming output?
Please note that I'm using a quantized version of Zephyr. Update `model_name_or_path` and the model loader to match the model you intend to use.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer
import torch

# model_name_or_path = "drive/MyDrive/Mistral-7B-OpenOrca_AWQ_GEMM"
model_name_or_path = 'drive/MyDrive/Mistral-7B-Zephyr_AWQ_GEMM'

# Load model
model = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    fuse_layers=True,
    safetensors=True,
    max_new_tokens=2048,  # Feel free to change your context length
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

# TextStreamer prints tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Define prompts
system_prompt = "You are a pirate chatbot who always responds with Arr!"
user_prompt = "Tell me about AI"
messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to('cuda')

generation_output = model.generate(
    prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    pad_token_id=tokenizer.eos_token_id,
    streamer=streamer,  # Here you can pass in a streamer.
)
'''
AI, or artificial intelligence, is a technology that allows machines to learn and perform tasks that typically require human intelligence. It is powered by complex algorithms and vast amounts of data, which the machine uses to make decisions and solve problems. AI has the potential to revolutionize many industries, from healthcare and finance to transportation and manufacturing. Some common examples of AI include virtual assistants like Siri and Alexa, self-driving cars, and chatbots like me, your faithful pirate companion! But beware, for some fear that AI may one day surpass human intelligence and take over the world! Until then, we'll just keep saying "Arr!" and enjoying the high seas.
'''
```
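If you're not using AWQ, the same `TextStreamer` idea works with the plain `transformers` model from your original question; a minimal sketch (assumes a CUDA-capable GPU; substitute your local path for the model id):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "HuggingFaceH4/zephyr-7b-beta"  # or your local path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# skip_prompt=True prints only newly generated tokens, not the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Tokens are printed to stdout incrementally as generate() produces them.
model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95, streamer=streamer)
```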
Thank you so much!