Instructions to use bigscience/bloom with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bigscience/bloom with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="bigscience/bloom")# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom") model = AutoModelForCausalLM.from_pretrained("bigscience/bloom") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use bigscience/bloom with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "bigscience/bloom" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bigscience/bloom", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/bigscience/bloom
- SGLang
How to use bigscience/bloom with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "bigscience/bloom" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bigscience/bloom", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "bigscience/bloom" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bigscience/bloom", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use bigscience/bloom with Docker Model Runner:
docker model run hf.co/bigscience/bloom
How to use bloom-176B to generate or evaluate on Multi-graphics?
I have a machine with 8*A100, and i would like to use bloom-176B to generate text, and apply evaluation on different datasets. I use the code in bigscience-workshop/bigscience/sortval repo,bigscience/evaluation/generation/generate.py. But this code only allows the model to run on a graphics card.
Apperently single A100 is not enough:
RuntimeError: CUDA out of memory. Tried to allocate 1.53 GiB (GPU 0; 79.17 GiB total capacity; 77.14 GiB already allocated; 1.20 GiB free; 77.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
How to use bloom-176B to generate or evaluate on Multi-graphics? Should I change the code in generate.py,or use other code?
I have tried use accelerate:
from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForCausalLM
import torch
from accelerate import Accelerator
def generate_from_text(model, text, tokenizer, max_length=200, greedy=False, top_k=0):
input_ids = tokenizer.encode(text, return_tensors='pt').to("cuda:0")
max_length = input_ids.size(-1) + max_length
greedy_output = model.generate(
input_ids.to('cuda:0'),
max_length=max_length,
do_sample=not greedy,
top_k=None if greedy else top_k,
)
return tokenizer.decode(greedy_output[0], skip_special_tokens=True)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
print("load tokenizer")
accelerator = Accelerator()
device = accelerator.device
model = AutoModelForCausalLM.from_pretrained(
arg.checkpoints,
device_map="auto",
torch_dtype=torch.bfloat16,
offload_folder="./offload",
)
model = accelerator.prepare(model)
print("load model")
text = 'hello'
output = generate_from_text(model, text, tokenizer, max_length=500, greedy=True, top_k=1)
print(output)
and it shows :
load tokenizer
Traceback (most recent call last):
File "/mnt/yrfs/xuyifan/bloom/bigscience-sorteval/eva.py", line 28, in
model = accelerator.prepare(model)
File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 545, in prepare
result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 545, in
result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 441, in _prepare_one
return self.prepare_model(obj)
File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 565, in prepare_model
model = model.to(self.device)
File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
return self._apply(convert)
File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 1.53 GiB (GPU 0; 79.17 GiB total capacity; 77.14 GiB already allocated; 1.20 GiB free; 77.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Hi @xuyifan ! Sorry for the late reply. Why is it that you need Accelerator? I don't think you need it.