Instructions to use anakin87/zephyr-7b-alpha-sharded with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use anakin87/zephyr-7b-alpha-sharded with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="anakin87/zephyr-7b-alpha-sharded") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("anakin87/zephyr-7b-alpha-sharded") model = AutoModelForCausalLM.from_pretrained("anakin87/zephyr-7b-alpha-sharded") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use anakin87/zephyr-7b-alpha-sharded with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "anakin87/zephyr-7b-alpha-sharded" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "anakin87/zephyr-7b-alpha-sharded", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/anakin87/zephyr-7b-alpha-sharded
- SGLang
How to use anakin87/zephyr-7b-alpha-sharded with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "anakin87/zephyr-7b-alpha-sharded" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "anakin87/zephyr-7b-alpha-sharded", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "anakin87/zephyr-7b-alpha-sharded" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "anakin87/zephyr-7b-alpha-sharded", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use anakin87/zephyr-7b-alpha-sharded with Docker Model Runner:
docker model run hf.co/anakin87/zephyr-7b-alpha-sharded
Zephyr 7B Alpha - Sharded
UPDATE The original model (Zephyr 7B Alpha) was recently sharded. You can use the original model.
π§©π§©π§© Just a sharded version of Zephyr 7B Alpha.
π» Using this version, you can smoothly load the model on Colab and play with it!
From the original model card:
Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-Ξ± is the first model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful. However, this means that model is likely to generate problematic text when prompted to do so and should only be used for educational and research purposes.
Usage
This version of the model is meant primarily to run smoothly on Colab. I suggest loading the model with 8-bit quantization, so that you have some free GPU to perform inference.
However, it is perfectly fine to load the model in half-precision or with stronger quantization (4-bit).
! pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model = AutoModelForCausalLM.from_pretrained("anakin87/zephyr-7b-alpha-sharded", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("anakin87/zephyr-7b-alpha-sharded")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
{
"role": "system",
"content": "You are a friendly chatbot who always responds in the style of a rapper",
},
{"role": "user", "content": "What is GPU?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
#<|system|>
#You are a friendly chatbot who always responds in the style of a rapper</s>
#<|user|>
#What is GPU?</s>
#<|assistant|>
#Yo, what's up fam, you askin' 'bout the GPU?
#Well, let me break it down for you, it's a pretty sick dud
#It stands for Graphics Processing Unit, a tech that's quite rude
#This bad boy's the one that's in charge of all the graphics you see
#On your computer screen or your high-tech TV
#It's a powerful tool that can handle intense 3D games and movies
#And it's built to handle multiple tasks with ease
#So if you're looking to take your gaming or video editing to the next level
#Just make sure you've got a top-notch GPU to make it happen.
#Peace out!
- Downloads last month
- 13