Instructions to use bigscience/bloom with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bigscience/bloom with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bigscience/bloom")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bigscience/bloom with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "bigscience/bloom"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigscience/bloom",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```shell
docker model run hf.co/bigscience/bloom
```
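Since the server exposes an OpenAI-compatible API, you can also call it from Python instead of curl. A minimal sketch using only the standard library, mirroring the curl payload above (it assumes the vLLM server from the previous step is already serving bigscience/bloom on localhost:8000; the helper names are illustrative, not part of any library):

```python
import json
import urllib.request

def build_payload(prompt, model="bigscience/bloom", max_tokens=512, temperature=0.5):
    # Same JSON body the curl example sends.
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def complete(prompt, base_url="http://localhost:8000"):
    # POST to the OpenAI-compatible /v1/completions endpoint.
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Example (requires the server to be running):
#   print(complete("Once upon a time,"))
```

The same sketch works for the SGLang server below by changing `base_url` to port 30000.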
- SGLang
How to use bigscience/bloom with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "bigscience/bloom" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigscience/bloom",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "bigscience/bloom" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigscience/bloom",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use bigscience/bloom with Docker Model Runner:
```shell
docker model run hf.co/bigscience/bloom
```
"Temperature needs to be >0" error
When trying to query the Inference API with a temperature of 0, I get an error saying `Temperature needs to be >0`. Why is that? A temperature of 0 should just mean always taking the highest score.
Hi Sleven! That's because temperature is a float that divides the logits to reshape them for sampling: https://github.com/huggingface/transformers/blob/main/src/transformers/generation_logits_process.py#L140
Personally, I don't think it makes sense to have a temperature of zero be hardcoded to bypass the fact that the generating process is in sampling mode, but that's probably a design discussion to be had on the transformers repo.
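The mechanics are easy to see in isolation. Here is a small sketch of temperature scaling followed by softmax (plain Python for illustration, not the actual transformers code): dividing the logits by a smaller and smaller temperature concentrates the distribution on the argmax, and at exactly 0 the division is undefined, hence the validation error.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Temperature divides the logits before softmax, reshaping
    # the distribution that tokens are sampled from.
    if temperature <= 0:
        raise ValueError("`temperature` has to be a strictly positive float")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))    # regular sampling
print(softmax_with_temperature(logits, 0.01))   # nearly one-hot on the argmax
print(softmax_with_temperature(logits, 100.0))  # close to uniform
```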
Fair enough! Might want to update the docs because they say 0 is okay to disable sampling:
> (Default: 1.0). Float (0.0-100.0). The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, 100.0 is getting closer to uniform probability.
I am encountering the same error when trying to set the temperature to 0.0:
```
ValueError: `temperature` has to be a strictly positive float, but is 0.0
```
Any updates on this yet?
In some cases, setting temperature to zero does make sense, e.g. to ensure the reproducibility of results.
You probably don't have to. The reason I was trying with 0 was because the docs seem to say that's the way to switch to greedy sampling (and they still say that).
But another, more obvious way is to set top_k=1, which means "sample from the single highest-scoring token" and is therefore the same as greedy decoding. Then set the temperature to 1 for good measure.
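The equivalence is easy to check on toy logits. A small sketch (plain Python with illustrative helper names, not the transformers sampler): masking everything but the top-1 logit leaves a one-point distribution on the argmax, so every draw returns it regardless of temperature.

```python
import math
import random

def top_k_filter(logits, k):
    # Keep the k largest logits; mask the rest to -inf so they get zero probability.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

def sample(logits, temperature=1.0, top_k=None, rng=random):
    if top_k is not None:
        logits = top_k_filter(logits, top_k)
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # exp(-inf) == 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index from the distribution.
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [0.3, 2.5, 1.1, -0.7]
greedy = max(range(len(logits)), key=lambda i: logits[i])
# With top_k=1, every draw is the argmax, whatever the temperature.
assert all(sample(logits, temperature=t, top_k=1) == greedy
           for t in (0.1, 1.0, 10.0))
```

In transformers' `generate()`, passing `do_sample=False` is the built-in way to get greedy decoding without touching the temperature at all; the Inference API at the time of this thread did not expose that switch, which is why the `top_k=1` workaround comes up.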