Related paper: The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset (arXiv:2303.03915)
How to use TurkuNLP/bloom-finnish-176b with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TurkuNLP/bloom-finnish-176b")

# Or load the tokenizer and model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bloom-finnish-176b")
model = AutoModelForCausalLM.from_pretrained("TurkuNLP/bloom-finnish-176b")
```

How to use TurkuNLP/bloom-finnish-176b with vLLM:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "TurkuNLP/bloom-finnish-176b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TurkuNLP/bloom-finnish-176b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
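The same OpenAI-compatible endpoint can also be called from Python with only the standard library. A minimal sketch (the helper name and defaults are illustrative, and the commented-out call assumes the vLLM server above is running on localhost:8000):

```python
import json
from urllib import request

# Endpoint of the vLLM server started above (assumes the default host/port).
API_URL = "http://localhost:8000/v1/completions"

def build_completion_request(model, prompt, max_tokens=512, temperature=0.5):
    """Build an urllib Request mirroring the curl call above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_completion_request("TurkuNLP/bloom-finnish-176b", "Once upon a time,")
# To actually send it, the server from the previous step must be running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```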
How to use TurkuNLP/bloom-finnish-176b with SGLang:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TurkuNLP/bloom-finnish-176b" \
    --host 0.0.0.0 \
    --port 30000

# Or start the server in Docker:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TurkuNLP/bloom-finnish-176b" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TurkuNLP/bloom-finnish-176b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

How to use TurkuNLP/bloom-finnish-176b with Docker Model Runner:

```shell
docker model run hf.co/TurkuNLP/bloom-finnish-176b
```
A multilingual generative pretrained transformer with 176B parameters and capacity for Finnish. The model builds on pretrained BLOOM, which is further pretrained for 40B tokens on a combined ROOTS + Finnish dataset (without weighting).
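For a rough sense of scale (assuming ~4 characters per token for Finnish, an illustrative figure rather than the model's measured tokenizer fertility), the 207B Finnish characters alone correspond to roughly 50B tokens, so the 40B-token budget covers less than one pass over even the Finnish portion of the mix:

```python
# Back-of-envelope scale check; chars_per_token is an assumption, not measured.
finnish_chars = 207.0e9      # total Finnish characters (see dataset table)
chars_per_token = 4.0        # illustrative assumption
finnish_tokens = finnish_chars / chars_per_token
token_budget = 40e9          # continued-pretraining budget from the model card
print(finnish_tokens / 1e9)  # ~51.75 (billions of tokens)
print(token_budget / finnish_tokens)
```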
Datasets
We used a combination of several Finnish resources.
Sampling ratios for Finnish
In the table below, Chars is the character count (in billions), Ratio the unweighted share of characters, Weight the upsampling factor, and W.Ratio the resulting weighted sampling ratio.
| Dataset | Chars | Ratio | Weight | W.Ratio |
|---|---|---|---|---|
| Parsebank | 35.0B | 16.9% | 1.5 | 22.7% |
| mC4-Fi | 46.3B | 22.4% | 1.0 | 20.0% |
| CC-Fi | 79.6B | 38.5% | 1.0 | 34.4% |
| Fiwiki | 0.8B | 0.4% | 3.0 | 1.0% |
| Lönnrot | 0.8B | 0.4% | 3.0 | 1.0% |
| Yle | 1.6B | 0.8% | 2.0 | 1.4% |
| STT | 2.2B | 1.1% | 2.0 | 1.9% |
| ePub | 13.5B | 6.5% | 1.0 | 5.8% |
| Lehdet | 5.8B | 2.8% | 1.0 | 2.5% |
| Suomi24 | 20.6B | 9.9% | 1.0 | 8.9% |
| Reddit-Fi | 0.7B | 0.4% | 1.0 | 0.3% |
| TOTAL | 207.0B | 100.0% | N/A | 100.0% |
For the continued pretraining as a whole, ROOTS is mixed in alongside the Finnish data.
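The W.Ratio column follows from Chars × Weight, normalized over all datasets. A quick reproduction of the table's weighted ratios:

```python
# Recompute the weighted sampling ratios (W.Ratio) from the Chars and Weight columns.
datasets = {  # name: (chars in billions, upsampling weight)
    "Parsebank": (35.0, 1.5),
    "mC4-Fi":    (46.3, 1.0),
    "CC-Fi":     (79.6, 1.0),
    "Fiwiki":    (0.8, 3.0),
    "Lönnrot":   (0.8, 3.0),
    "Yle":       (1.6, 2.0),
    "STT":       (2.2, 2.0),
    "ePub":      (13.5, 1.0),
    "Lehdet":    (5.8, 1.0),
    "Suomi24":   (20.6, 1.0),
    "Reddit-Fi": (0.7, 1.0),
}
weighted_total = sum(chars * weight for chars, weight in datasets.values())
w_ratio = {
    name: 100 * chars * weight / weighted_total
    for name, (chars, weight) in datasets.items()
}
for name, ratio in w_ratio.items():
    print(f"{name}: {ratio:.1f}%")
```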