Instructions to use HuggingFaceH4/starchat-beta with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceH4/starchat-beta with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta")

# Load the model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/starchat-beta")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceH4/starchat-beta with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HuggingFaceH4/starchat-beta"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceH4/starchat-beta",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker
```shell
docker model run hf.co/HuggingFaceH4/starchat-beta
```
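The vLLM server above exposes an OpenAI-compatible API, so it can also be called from Python instead of curl. A minimal sketch using only the standard library — the helper name and its defaults are mine for illustration, not part of vLLM:

```python
import json
import urllib.request

def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request(
    "http://localhost:8000", "HuggingFaceH4/starchat-beta", "Once upon a time,"
)
# Sending it requires the vLLM server from the snippet above to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

The same request shape works against the SGLang server below; only the port changes.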
- SGLang
How to use HuggingFaceH4/starchat-beta with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "HuggingFaceH4/starchat-beta" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceH4/starchat-beta",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "HuggingFaceH4/starchat-beta" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceH4/starchat-beta",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use HuggingFaceH4/starchat-beta with Docker Model Runner:
```shell
docker model run hf.co/HuggingFaceH4/starchat-beta
```
Conversation derails after a certain number of tokens (?)
Just came here to say, wow, this model is extremely good - I was quite surprised at the helpfulness of this rather small model!
However, I've noticed that after a certain number of turns, it seems to cut off abruptly - and if you attempt to continue the conversation, it goes completely off the rails! It switches from helpful and objective to being all like "haha! nope!" and using a whole bunch of emoji.
I tried to delete the end of the conversation and resume, but this appears to happen consistently after a certain number of turns/tokens.
I'm otherwise extremely surprised and impressed with its ability to explain some rather complex and exotic programming topics I was asking about!
Really promising stuff. :-)
I noticed the same thing.
It started talking Spanish after answering my prompt.
Hi, can you please provide a short snippet showing how you used starchat as a conversation bot? I deployed starcoder in SageMaker using the deployment script from HF. With that sample code it only does autocomplete - how do I use it like a chatbot?
There's a chat/demo here:
https://huggingface.co/spaces/HuggingFaceH4/starchat-playground
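If you're calling the raw model from your own endpoint rather than the playground, the main trick is formatting the conversation with StarChat's special dialogue tokens before sending the prompt. A rough sketch, assuming the `<|system|>` / `<|user|>` / `<|assistant|>` / `<|end|>` template described on the model card (the helper function itself is mine, not from any library):

```python
def build_starchat_prompt(messages, system_prompt=""):
    """Format a list of {"role", "content"} turns into one StarChat prompt string."""
    prompt = f"<|system|>\n{system_prompt}<|end|>\n"
    for msg in messages:
        prompt += f"<|{msg['role']}|>\n{msg['content']}<|end|>\n"
    # End with the assistant tag so the model continues as the assistant:
    prompt += "<|assistant|>\n"
    return prompt

prompt = build_starchat_prompt(
    [{"role": "user", "content": "Write a hello-world in Rust."}]
)
```

When generating, pass `<|end|>` as a stop sequence so the model stops after its own turn instead of hallucinating the next user message.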
By the way, I understand now why the conversation derails - from my limited understanding, this happens with all models when you exceed the maximum context length they were trained on. Some interfaces (such as ChatGPT) work around this problem internally, either by truncating or summarizing the conversation behind the scenes, when the conversation length starts to approach the limit.
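The truncation workaround mentioned above can be sketched in a few lines - drop the oldest turns (while keeping the system message) once the conversation would exceed the model's context budget. All names here are illustrative, and the word-count tokenizer is a stand-in for a real one:

```python
def truncate_history(messages, count_tokens, max_tokens):
    """messages[0] is the system turn; count_tokens is any token counter."""
    system, turns = messages[0], list(messages[1:])
    total = lambda: sum(count_tokens(m["content"]) for m in [system] + turns)
    while turns and total() > max_tokens:
        turns.pop(0)  # forget the oldest turn first
    return [system] + turns

# Toy token counter (whitespace words) for illustration only:
count = lambda text: len(text.split())

msgs = [
    {"role": "system", "content": "be helpful"},
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six"},
    {"role": "user", "content": "seven"},
]
trimmed = truncate_history(msgs, count, max_tokens=6)
# The oldest user turn is dropped to fit the 6-token budget.
```

In practice you would count tokens with the model's own tokenizer, since context limits are measured in model tokens, not words.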
I wonder when we'll start to see implementation of this:
https://github.com/ggerganov/llama.cpp/discussions/2936
It looks relatively simple, and supposedly solves the conversation length issue by enabling the model to selectively forget things that fall out of conversation scope. It also apparently speeds up the model by 2-4x!
David Shapiro talks about it in this video:
https://www.youtube.com/watch?v=5XaJQKgL9Hs&t=8s&pp=ygULbG0taW5maW5pdGU%3D
I'm not sure why we're not seeing implementations of this everywhere yet - it sounds like a slam dunk.