Instructions for using HuggingFaceH4/zephyr-7b-beta with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use HuggingFaceH4/zephyr-7b-beta with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceH4/zephyr-7b-beta with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HuggingFaceH4/zephyr-7b-beta"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker
```shell
docker model run hf.co/HuggingFaceH4/zephyr-7b-beta
```
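The servers above and below expose an OpenAI-compatible `/v1/chat/completions` endpoint, so the curl call can be replaced with any HTTP client. A minimal standard-library Python sketch (the URL assumes the default `vllm serve` port 8000 from the snippet above; for the SGLang server further down, use port 30000 instead):

```python
import json
import urllib.request


def build_chat_request(model: str, user_message: str) -> dict:
    """Build the JSON body for a /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }


def chat(base_url: str, model: str, user_message: str) -> str:
    """POST a chat completion and return the assistant's reply text.
    Assumes a server (e.g. `vllm serve ...`) is already running at base_url."""
    body = json.dumps(build_chat_request(model, user_message)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Example (requires the server from the snippet above to be running):
# print(chat("http://localhost:8000", "HuggingFaceH4/zephyr-7b-beta",
#            "What is the capital of France?"))
```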
- SGLang
How to use HuggingFaceH4/zephyr-7b-beta with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HuggingFaceH4/zephyr-7b-beta" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HuggingFaceH4/zephyr-7b-beta" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

- Docker Model Runner
How to use HuggingFaceH4/zephyr-7b-beta with Docker Model Runner:
```shell
docker model run hf.co/HuggingFaceH4/zephyr-7b-beta
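One note on the Transformers snippets above: the prompt string that `tokenizer.apply_chat_template` builds for this model follows Zephyr's `<|user|>` / `<|assistant|>` chat format. A rough, illustrative re-implementation in plain Python (the authoritative template ships with the tokenizer, so treat this as an approximation for understanding the format, not a replacement for `apply_chat_template`):

```python
EOS = "</s>"  # Zephyr's end-of-sequence token


def format_zephyr_chat(messages, add_generation_prompt=True):
    """Approximate zephyr-7b-beta's chat template: each turn becomes
    '<|role|>' on one line, then the content terminated by '</s>'.
    With add_generation_prompt=True, an open '<|assistant|>' header is
    appended so the model continues the conversation as the assistant."""
    prompt = ""
    for message in messages:
        prompt += f"<|{message['role']}|>\n{message['content']}{EOS}\n"
    if add_generation_prompt:
        prompt += "<|assistant|>\n"
    return prompt


print(format_zephyr_chat([{"role": "user", "content": "Who are you?"}]))
```

The real template also covers system messages and edge cases, which is why actual code should always call `tokenizer.apply_chat_template` as shown earlier.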
Very Nice Work, But It Can't Be Prompted To Tell Stories
With some other Mistral-based models like Open Hermes 2, as well as Llama 2 13b AYT, I can prompt a story with a paragraph of instructions, and they will in most cases follow it without creating blatant contradictions.
However, this model stubbornly sticks to standard storytelling conventions, such as building suspense and happy endings, even when they blatantly contradict the prompt, leading to absurd continuity errors that not even a young child would make.
For example, if prompted for a kid to get caught stealing a cookie, other LLMs would simply say something like 'the door flung open'. This LLM, however, keeps saying things like he heard footsteps coming up the hall as he looked at the plate of cookies, then moments later was startled and caught red-handed eating the cookies. And when I asked why, if he was supposed to get caught, it had him hear footsteps coming up the hall, Zephyr Beta said it was to build suspense.
A blatant contradiction like this happens with every one of the paragraph-long story prompts I use to test LLMs, and in every case it's because the model stubbornly sticks to pre-packaged storytelling elements like suspense and happy endings. I know this doesn't have to be the case because other LLMs are smart enough to avoid these contradictions (e.g. the door suddenly opening vs. first hearing footsteps coming down the hall). And it's not that it can't comprehend the prompt: when I ask why hearing footsteps coming up the hall precludes getting caught off guard, it can explain why, and will then tell the story again with the correction.
In short, prompting Zephyr Beta to tell a story turns into a battle against its pre-packaged storytelling elements. Other than this, Zephyr Beta is great and did far better in my testing than Zephyr Alpha, which has the same storytelling stubbornness, resulting in blatant contradictions that not even a human toddler would make when following the prompted instructions.