Instructions to use berkeley-nest/Starling-LM-7B-alpha with libraries and local apps.

Libraries

How to use berkeley-nest/Starling-LM-7B-alpha with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="berkeley-nest/Starling-LM-7B-alpha")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# Starling-LM-7B-alpha is a text-only causal language model, so load it with AutoModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
model = AutoModelForCausalLM.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use berkeley-nest/Starling-LM-7B-alpha with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "berkeley-nest/Starling-LM-7B-alpha"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "berkeley-nest/Starling-LM-7B-alpha",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
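
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000 and returns the usual OpenAI-style chat completion JSON. The helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="berkeley-nest/Starling-LM-7B-alpha",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60.0):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the server to be running):
# reply = send_chat_request(build_chat_request("What is the capital of France?"))
# print(reply["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes it easy to inspect the payload, or to wrap only the send step in a timer.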

Use Docker

docker model run hf.co/berkeley-nest/Starling-LM-7B-alpha

SGLang

How to use berkeley-nest/Starling-LM-7B-alpha with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "berkeley-nest/Starling-LM-7B-alpha" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "berkeley-nest/Starling-LM-7B-alpha",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "berkeley-nest/Starling-LM-7B-alpha" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "berkeley-nest/Starling-LM-7B-alpha",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use berkeley-nest/Starling-LM-7B-alpha with Docker Model Runner:
```
docker model run hf.co/berkeley-nest/Starling-LM-7B-alpha
```

extremely variable response time for inference?

#21

by silvacarl - opened Dec 5, 2023

Discussion

silvacarl

Dec 5, 2023

This model is awesome. But for some reason we are getting extremely variable response time for inference, anywhere between 0.40 seconds and 15 seconds on an A40.

Could this be caused by the prompt format or other inference parameters?

banghua

Berkeley-Nest org Dec 5, 2023

That's interesting.. Actually I never experienced this before. What kind of inference package are you using? TGI, vLLM or other stuff?

silvacarl

Dec 5, 2023

Just something we have had for quite a while to test other models. We are checking to see if somehow it is doing something weird. Just thought it would be good to post here to check.

BUT: I CAN TELL YOU SO FAR IT'S INSANELY AWESOME. 8-)

Like crazy accurate.

banghua

Berkeley-Nest org Dec 5, 2023

Haha thank you! I'm glad you like it! It's also likely due to the Mistral structure itself? Not sure if Mistral base / instruct will have the same issue.

silvacarl

Dec 5, 2023

Yes, it could be that as well. We will check that; running additional tests now.

silvacarl

Dec 8, 2023 • edited Dec 9, 2023

It's kind of interesting. It flies, then for some reason it will sit on one inference for about 15 seconds. Then it flies again.

Just in case, what is the prompt format? Can you post an example?

banghua

Berkeley-Nest org Dec 9, 2023

The prompt format is listed in the model card. FYI is

But I don't think the prompt format will change inference speed. Is it only happening for Starling but not other Mistral-based models? That is very mysterious..
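
One thing worth ruling out is output length: autoregressive generation time grows with the number of tokens produced, so a 0.40 second reply and a 15 second reply may simply differ in length. The following is a minimal sketch, assuming each measurement records wall-clock seconds and the number of newly generated tokens. `Measurement`, `seconds_per_token`, and `flag_slow_requests` are hypothetical names, and the numbers in the usage comment are invented for illustration, not measured.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Measurement:
    seconds: float   # wall-clock time for one request
    new_tokens: int  # tokens generated in that request


def seconds_per_token(m):
    """Latency normalized by output length (guards against zero tokens)."""
    return m.seconds / max(m.new_tokens, 1)


def flag_slow_requests(measurements, factor=3.0):
    """Return indices whose per-token time exceeds `factor` times the median."""
    rates = [seconds_per_token(m) for m in measurements]
    typical = median(rates)
    return [i for i, rate in enumerate(rates) if rate > factor * typical]


# Illustrative (made-up) data: the second request is long but not slow per token,
# the third takes just as long while producing few tokens.
# flag_slow_requests([Measurement(0.4, 20), Measurement(15.0, 750), Measurement(15.0, 20)])
# returns [2]
```

If the slow requests disappear after normalizing, the variation is probably explained by how much text the model chose to generate rather than by the prompt format.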

silvacarl

Dec 9, 2023

Yeah, tracing the code to see what's up. It's really weird.

beenotung

Dec 9, 2023

Is it slow for the same prompt consistently?
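
One way to answer that question is to time each prompt several times and compare the spread. The following is a minimal sketch, assuming `generate` is any callable that takes a prompt string and blocks until the reply is complete (for example a wrapper around a pipeline or an HTTP call). `time_repeated_prompts` and `summarize` are hypothetical helper names.

```python
import time
from statistics import mean


def time_repeated_prompts(generate, prompts, repeats=5):
    """Call generate(prompt) `repeats` times per prompt.

    Returns a dict mapping each prompt to its list of wall-clock seconds.
    """
    timings = {}
    for prompt in prompts:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            generate(prompt)
            samples.append(time.perf_counter() - start)
        timings[prompt] = samples
    return timings


def summarize(timings):
    """Reduce each prompt's samples to min / mean / max seconds."""
    return {
        prompt: {"min": min(s), "mean": mean(s), "max": max(s)}
        for prompt, s in timings.items()
    }
```

For the runs to be comparable, it helps to use greedy decoding or a fixed `max_new_tokens`, since sampled replies to the same prompt can differ in length. A wide min-to-max gap for a single prompt would point away from the prompt itself, while one prompt that is slow on every repeat would point toward its content or reply length.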

Sujan42024

Dec 9, 2023

The model is just too slow actually. I have been testing this model on an AWS instance with NVIDIA Tesla T4 GPU and it takes 2-3 minutes for each response. Once, it took about 9 minutes to generate a simple response. IDK what is going on and my internet is good too.
