Instructions for using HuggingFaceH4/zephyr-7b-beta with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use HuggingFaceH4/zephyr-7b-beta with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceH4/zephyr-7b-beta with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HuggingFaceH4/zephyr-7b-beta"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker
```shell
docker model run hf.co/HuggingFaceH4/zephyr-7b-beta
```
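The servers above and below expose an OpenAI-compatible `/v1/chat/completions` endpoint, so the curl call can be replaced with any HTTP client. A minimal standard-library Python sketch (the URL assumes the default `vllm serve` port 8000 from the snippet above; for the SGLang server further down, use port 30000 instead):

```python
import json
import urllib.request


def build_chat_request(model: str, user_message: str) -> dict:
    """Build the JSON body for a /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }


def chat(base_url: str, model: str, user_message: str) -> str:
    """POST a chat completion and return the assistant's reply text.
    Assumes a server (e.g. `vllm serve ...`) is already running at base_url."""
    body = json.dumps(build_chat_request(model, user_message)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Example (requires the server from the snippet above to be running):
# print(chat("http://localhost:8000", "HuggingFaceH4/zephyr-7b-beta",
#            "What is the capital of France?"))
```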
- SGLang
How to use HuggingFaceH4/zephyr-7b-beta with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HuggingFaceH4/zephyr-7b-beta" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HuggingFaceH4/zephyr-7b-beta" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

- Docker Model Runner
How to use HuggingFaceH4/zephyr-7b-beta with Docker Model Runner:
```shell
docker model run hf.co/HuggingFaceH4/zephyr-7b-beta
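One note on the Transformers snippets above: the prompt string that `tokenizer.apply_chat_template` builds for this model follows Zephyr's `<|user|>` / `<|assistant|>` chat format. A rough, illustrative re-implementation in plain Python (the authoritative template ships with the tokenizer, so treat this as an approximation for understanding the format, not a replacement for `apply_chat_template`):

```python
EOS = "</s>"  # Zephyr's end-of-sequence token


def format_zephyr_chat(messages, add_generation_prompt=True):
    """Approximate zephyr-7b-beta's chat template: each turn becomes
    '<|role|>' on one line, then the content terminated by '</s>'.
    With add_generation_prompt=True, an open '<|assistant|>' header is
    appended so the model continues the conversation as the assistant."""
    prompt = ""
    for message in messages:
        prompt += f"<|{message['role']}|>\n{message['content']}{EOS}\n"
    if add_generation_prompt:
        prompt += "<|assistant|>\n"
    return prompt


print(format_zephyr_chat([{"role": "user", "content": "Who are you?"}]))
```

The real template also covers system messages and edge cases, which is why actual code should always call `tokenizer.apply_chat_template` as shown earlier.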
Very Nice Work, But It Can't Be Prompted To Tell Stories
With some other Mistral-based models like Open Hermes 2, as well as Llama 2 13b AYT, I can prompt a story with a paragraph of instructions, and they will in most cases follow it without creating blatant contradictions.
However, this model stubbornly sticks to standard storytelling conventions, such as building suspense and happy endings, even when they blatantly contradict the prompt, leading to absurd continuity errors that not even a young child would make.
For example, if prompted for a kid to get caught stealing a cookie, other LLMs would simply say something like 'the door flung open'. This LLM, however, keeps saying things like he heard footsteps coming up the hall as he looked at the plate of cookies, then moments later was startled and caught red-handed eating the cookies. And when I asked why, if he was supposed to get caught, it had him hear footsteps coming up the hall, Zephyr Beta said it was to build suspense.
A blatant contradiction like this happens with every one of the paragraph-long story prompts I use to test LLMs, and in every case it's because the model stubbornly sticks to pre-packaged storytelling elements like suspense and happy endings. I know this doesn't have to be the case because other LLMs are smart enough to avoid these contradictions (e.g. the door suddenly opening vs. first hearing footsteps coming down the hall). And it's not that it can't comprehend the prompt: when I ask why hearing footsteps coming up the hall precludes getting caught off guard, it can explain why, and will then tell the story again with the correction.
In short, prompting Zephyr Beta to tell a story turns into a battle against its pre-packaged storytelling elements. Other than this, Zephyr Beta is great and did far better in my testing than Zephyr Alpha, which has the same storytelling stubbornness, resulting in blatant contradictions that not even a human toddler would make when following the prompted instructions.