extremely variable response time for inference?
This model is awesome. But for some reason we are getting extremely variable response times for inference, anywhere between 0.40 seconds and 15 seconds on an A40.
Could this be caused by the prompt format or other inference parameters?
That's interesting.. Actually I never experienced this before. What kind of inference package are you using? TGI, vLLM or other stuff?
Just something we have had for testing other models for quite a while. We are checking to see if somehow it is doing something weird. Just thought it would be good to post here to check.
BUT: I CAN TELL YOU SO FAR IT'S INSANELY AWESOME. 8-)
Like crazy accurate.
Haha thank you! I'm glad you like it! Could it also be due to the Mistral structure itself? Not sure if Mistral base / instruct will have the same issue.
yes, it could be that as well. we will check that, running additional tests now.
It's kind of interesting. It flies, then for some reason it will sit on one inference for about 15 seconds. Then it flies again.
just in case, what is the prompt format? Can you post an example?
The prompt format is listed in the model card. FYI, it is:
GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:
But I don't think the prompt format will change inference speed. Is it only happening for Starling but not other Mistral-based models? That is very mysterious.
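For anyone assembling this prompt by hand, here is a minimal sketch of the format quoted above. The function name `build_starling_prompt` and the list-of-dicts message shape are assumptions made for illustration; only the role prefixes and the `<|end_of_turn|>` separator come from the example in this thread.

```python
END_OF_TURN = "<|end_of_turn|>"

# Role prefixes taken from the prompt example quoted in this thread.
ROLE_PREFIX = {
    "user": "GPT4 Correct User",
    "assistant": "GPT4 Correct Assistant",
}


def build_starling_prompt(messages):
    """Join chat messages into a single prompt string (hypothetical helper)."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_PREFIX:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"{ROLE_PREFIX[role]}: {message['content']}{END_OF_TURN}")
    # Leave the assistant turn open so the model writes the next reply.
    parts.append(f"{ROLE_PREFIX['assistant']}:")
    return "".join(parts)


if __name__ == "__main__":
    print(build_starling_prompt([
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi"},
        {"role": "user", "content": "How are you today?"},
    ]))
```

Building the string in one place makes it easier to confirm that every request sent to the model uses exactly the same format.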
Yeah, tracing the code to see what's up, it's really weird.
Is it slow for the same prompt consistently?
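One way to answer that is to time the same prompt several times and record how many tokens each call produced. A rough sketch follows, assuming a hypothetical `generate_fn(prompt)` that wraps whatever inference stack is in use and returns the number of newly generated tokens; none of these names come from the thread.

```python
import statistics
import time


def time_repeated_calls(generate_fn, prompt, runs=10):
    """Call generate_fn(prompt) `runs` times; return (latency_s, new_tokens) pairs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Hypothetical contract: generate_fn returns the count of generated tokens.
        new_tokens = generate_fn(prompt)
        samples.append((time.perf_counter() - start, new_tokens))
    return samples


def summarize(samples):
    """Report the latency spread alongside the output-length spread."""
    latencies = [latency for latency, _ in samples]
    tokens = [count for _, count in samples]
    return {
        "min_s": min(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
        "min_tokens": min(tokens),
        "max_tokens": max(tokens),
    }
```

If the slow calls are also the ones with many more generated tokens, the variance probably comes from output length rather than from the model stalling. If the token counts are similar but the latency is not, the serving stack is the more likely suspect.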
The model is just too slow actually. I have been testing this model on an AWS instance with NVIDIA Tesla T4 GPU and it takes 2-3 minutes for each response. Once, it took about 9 minutes to generate a simple response. IDK what is going on and my internet is good too.
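One thing worth ruling out on a T4, though nothing in this thread confirms it is the cause: whether the weights fit in GPU memory at the precision being loaded. A back-of-the-envelope sketch, assuming roughly 7 billion parameters (from the "7B" in the model name) and the T4's 16 GB of memory; it ignores activations and the KV cache, so real usage is higher.

```python
def estimate_weight_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    NUM_PARAMS = 7e9  # assumption: approximate size implied by "7B"
    T4_MEMORY_GB = 16
    for name, width in [("float32", 4), ("float16", 2), ("int8", 1)]:
        needed = estimate_weight_gb(NUM_PARAMS, width)
        verdict = "fits" if needed < T4_MEMORY_GB else "does not fit"
        print(f"{name}: ~{needed:.0f} GB of weights, {verdict} in {T4_MEMORY_GB} GB")
```

If the weights do not fit, some setups spill to CPU memory, and that could account for responses taking minutes rather than seconds.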