Instructions to use HuggingFaceH4/starchat-beta with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceH4/starchat-beta with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta")

# Load the model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/starchat-beta")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceH4/starchat-beta with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HuggingFaceH4/starchat-beta"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceH4/starchat-beta",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker
```shell
docker model run hf.co/HuggingFaceH4/starchat-beta
```
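The vLLM server above exposes an OpenAI-compatible API, so it can also be called from Python instead of curl. A minimal sketch using only the standard library — the helper name and its defaults are mine for illustration, not part of vLLM:

```python
import json
import urllib.request

def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request(
    "http://localhost:8000", "HuggingFaceH4/starchat-beta", "Once upon a time,"
)
# Sending it requires the vLLM server from the snippet above to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

The same request shape works against the SGLang server below; only the port changes.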
- SGLang
How to use HuggingFaceH4/starchat-beta with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "HuggingFaceH4/starchat-beta" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceH4/starchat-beta",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "HuggingFaceH4/starchat-beta" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceH4/starchat-beta",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use HuggingFaceH4/starchat-beta with Docker Model Runner:
```shell
docker model run hf.co/HuggingFaceH4/starchat-beta
```
Conversation derails after a certain number of tokens (?)
Just came here to say, wow, this model is extremely good - I was quite surprised at the helpfulness of this rather small model!
However, I've noticed that after a certain number of turns, it seems to cut off abruptly - and if you attempt to continue the conversation, it goes completely off the rails! It switches from helpful and objective to being all like "haha! nope!" and using a whole bunch of emoji.
I tried to delete the end of the conversation and resume, but this appears to happen consistently after a certain number of turns/tokens.
I'm otherwise extremely surprised and impressed with its ability to explain some rather complex and exotic programming topics I was asking about!
Really promising stuff. :-)
I noticed the same thing.
It started talking Spanish after answering my prompt.
Hi, can you please provide a short snippet showing how you used starchat as a conversation bot? I deployed starcoder in SageMaker using the deployment script from HF. With that sample code it only does autocomplete - how do I use it like a chatbot?
There's a chat/demo here:
https://huggingface.co/spaces/HuggingFaceH4/starchat-playground
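If you're calling the raw model from your own endpoint rather than the playground, the main trick is formatting the conversation with StarChat's special dialogue tokens before sending the prompt. A rough sketch, assuming the `<|system|>` / `<|user|>` / `<|assistant|>` / `<|end|>` template described on the model card (the helper function itself is mine, not from any library):

```python
def build_starchat_prompt(messages, system_prompt=""):
    """Format a list of {"role", "content"} turns into one StarChat prompt string."""
    prompt = f"<|system|>\n{system_prompt}<|end|>\n"
    for msg in messages:
        prompt += f"<|{msg['role']}|>\n{msg['content']}<|end|>\n"
    # End with the assistant tag so the model continues as the assistant:
    prompt += "<|assistant|>\n"
    return prompt

prompt = build_starchat_prompt(
    [{"role": "user", "content": "Write a hello-world in Rust."}]
)
```

When generating, pass `<|end|>` as a stop sequence so the model stops after its own turn instead of hallucinating the next user message.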
By the way, I understand now why the conversation derails - from my limited understanding, this happens with all models when you exceed the maximum context length they were trained on. Some interfaces (such as ChatGPT) work around this problem internally, either by truncating or summarizing the conversation behind the scenes, when the conversation length starts to approach the limit.
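The truncation workaround mentioned above can be sketched in a few lines - drop the oldest turns (while keeping the system message) once the conversation would exceed the model's context budget. All names here are illustrative, and the word-count tokenizer is a stand-in for a real one:

```python
def truncate_history(messages, count_tokens, max_tokens):
    """messages[0] is the system turn; count_tokens is any token counter."""
    system, turns = messages[0], list(messages[1:])
    total = lambda: sum(count_tokens(m["content"]) for m in [system] + turns)
    while turns and total() > max_tokens:
        turns.pop(0)  # forget the oldest turn first
    return [system] + turns

# Toy token counter (whitespace words) for illustration only:
count = lambda text: len(text.split())

msgs = [
    {"role": "system", "content": "be helpful"},
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six"},
    {"role": "user", "content": "seven"},
]
trimmed = truncate_history(msgs, count, max_tokens=6)
# The oldest user turn is dropped to fit the 6-token budget.
```

In practice you would count tokens with the model's own tokenizer, since context limits are measured in model tokens, not words.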
I wonder when we'll start to see implementation of this:
https://github.com/ggerganov/llama.cpp/discussions/2936
It looks relatively simple, and supposedly solves the conversation length issue by enabling the model to selectively forget things that fall out of conversation scope. It also apparently speeds up the model by 2-4x!
David Shapiro talks about it in this video:
https://www.youtube.com/watch?v=5XaJQKgL9Hs&t=8s&pp=ygULbG0taW5maW5pdGU%3D
I'm not sure why we're not seeing implementations of this everywhere yet - it sounds like a slam dunk.