Prompt Format
In the README you say:
WizardLM adopts the prompt format from Vicuna and supports multi-turn conversation. The prompt should be as following:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: hello, who are you? ASSISTANT:
Are the \n separators between the roles missing, and did you use a </s> after the ASSISTANT turn? That is what the official Vicuna format uses [REF].
Here is a sample:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: Hello!
ASSISTANT: Hello!</s>
USER: How are you?
ASSISTANT: I am good.</s>
Also note that, according to the config.json, this model was trained on top of Llama-2-70b-chat-hf rather than Llama-2-70b-hf, and Llama-2-70b-chat-hf has a prompt format like:
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt} [/INST]
To continue a conversation:
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt} [/INST] {model_reply} [INST] {prompt} [/INST]
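For anyone who wants to try the Llama-2 chat layout quoted above, here is a minimal sketch that assembles it from a list of turns. The function name `build_llama2_chat_prompt` and the `(user, reply)` tuple shape are my own choices for illustration, not anything from the model repository. It reproduces only the text layout shown in this post and does not add BOS/EOS tokens.

```python
def build_llama2_chat_prompt(system, turns):
    """Assemble a Llama-2 chat style prompt as quoted in this thread.

    turns: list of (user_message, model_reply) tuples; use None as the
    reply of the last turn to leave it open for the model to complete.
    """
    parts = []
    for i, (user, reply) in enumerate(turns):
        if i == 0:
            # The system block is embedded in the first user turn only.
            user = f"<<SYS>>\n{system}\n<</SYS>>\n{user}"
        parts.append(f"[INST] {user} [/INST]")
        if reply is not None:
            parts.append(reply)
    # Segments are joined with single spaces, as in the quoted layout.
    return " ".join(parts)


if __name__ == "__main__":
    print(build_llama2_chat_prompt(
        "You are a helpful assistant.",
        [("Hello!", "Hi there."), ("How are you?", None)],
    ))
```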
So this model was trained to follow two different prompt formats, and I imagine its personality changes dramatically depending on which prompt format you use.
WizardLM adopts the prompt format from Vicuna and supports multi-turn conversation. The prompt should be as following:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello.</s>USER: Who are you? ASSISTANT: I am WizardLM.</s>......
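As a rough sketch, the multi-turn layout in the reply above could be built like this. The helper `build_wizardlm_prompt` and its parameters `role_sep` and `turn_sep` are hypothetical names of mine. The defaults follow the reply as written (single spaces between roles, nothing after `</s>`); passing `"\n"` for both gives the newline layout asked about elsewhere in this thread, which has not been confirmed for this model.

```python
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the "
    "user's questions."
)


def build_wizardlm_prompt(turns, system=SYSTEM, role_sep=" ", turn_sep="", eos="</s>"):
    """Assemble a Vicuna-style prompt as described in this thread.

    turns: list of (user_message, assistant_reply) tuples; use None as the
    reply of the last turn so the prompt ends with an open "ASSISTANT:".
    role_sep / turn_sep are assumptions exposed as parameters because the
    thread leaves the question of newlines open.
    """
    prompt = system + role_sep
    for user, reply in turns:
        prompt += f"USER: {user}{role_sep}ASSISTANT:"
        if reply is not None:
            # Completed assistant turns are closed with the EOS marker,
            # written here as literal text.
            prompt += f" {reply}{eos}{turn_sep}"
    return prompt


if __name__ == "__main__":
    history = [("Hi", "Hello."), ("Who are you?", None)]
    print(build_wizardlm_prompt(history))
    print(build_wizardlm_prompt(history, role_sep="\n", turn_sep="\n"))
```

Keeping the separators as parameters makes it easy to compare the two layouts on the same conversation.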
Hey @WizardLM,
Thank you for the response! I see in your comment that you have added </s> after the ASSISTANT turn. Could you also say whether there should be a \n between the turns, as in Vicuna?
That is, should the prompt look like the one below?
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: Hello!
ASSISTANT: Hello!</s>
USER: How are you?
ASSISTANT: I am good.</s>
I am trying to use the WizardLM model in chat-conversational-react-description, and the prompt schema inside ChatPrompt has a big impact on the result, especially in multi-turn conversation. I tried both USER/ASSISTANT with </s> and the usual Llama-2 style, and I am not sure which prompting style is best when it comes to marking the beginning and end of the system, user, and assistant roles.
