Instructions to use HuggingFaceH4/zephyr-7b-beta with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use HuggingFaceH4/zephyr-7b-beta with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
Local Apps Settings

vLLM

How to use HuggingFaceH4/zephyr-7b-beta with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HuggingFaceH4/zephyr-7b-beta"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceH4/zephyr-7b-beta",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/HuggingFaceH4/zephyr-7b-beta

SGLang

How to use HuggingFaceH4/zephyr-7b-beta with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HuggingFaceH4/zephyr-7b-beta" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceH4/zephyr-7b-beta",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HuggingFaceH4/zephyr-7b-beta" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceH4/zephyr-7b-beta",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use HuggingFaceH4/zephyr-7b-beta with Docker Model Runner:
```
docker model run hf.co/HuggingFaceH4/zephyr-7b-beta
```

BFloat16 is not supported on MPS

#27

by mhelmy - opened Nov 12, 2023

Discussion

mhelmy

Nov 12, 2023

Hello,

I am an absolute newb to machine learning and the torch/transformers libraries, and I'm trying to run the model on MacOS, but I'm getting an error saying that BFloat16 is not supported on MPS. Can someone please advise how I can resolve this issue?

System:

$ sw_vers
ProductName:		macOS
ProductVersion:		13.5.1
BuildVersion:		22G90

Code:

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")

Output:

Loading checkpoint shards:   0%|                                                                                                                                                                                             | 0/8 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/myself/repos/local/vai/backend/.env/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 870, in pipeline
    framework, model = infer_framework_load_model(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myself/repos/local/vai/backend/.env/lib/python3.11/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myself/repos/local/vai/backend/.env/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myself/repos/local/vai/backend/.env/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3480, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myself/repos/local/vai/backend/.env/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3870, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myself/repos/local/vai/backend/.env/lib/python3.11/site-packages/transformers/modeling_utils.py", line 743, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/Users/myself/repos/local/vai/backend/.env/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 317, in set_module_tensor_to_device
    new_value = value.to(device)
                ^^^^^^^^^^^^^^^^
TypeError: BFloat16 is not supported on MPS

alvarobartt

Hugging Face H4 org Nov 28, 2023

Hey! Just change your snippet with

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.float16, device_map="auto")

Or just use a quantized version of the model via llama-cpp-python as it will run faster (q5_0, q5_k_s, q5_k_m, q4_k_s, q4_k_m) variants recommended, find all the quantized models at https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/tree/main

010gvr

Nov 28, 2023

@mhelmy

M1 doesn't support BFloat16. Interestingly, M2 does. To workaround, on top of the suggestion from @alvarobartt , also run accelerate config to configure the option for mixed_precision to fp16. Verify by running accelerate env and then try executing the code.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment