Related paper: The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset (arXiv:2303.03915)
A multilingual generative pretrained transformer with 176B parameters, extended with capacity for Finnish. The model is built on pretrained BLOOM, which is further pretrained for 40B tokens on a combined ROOTS + Finnish dataset (without weighting between the two corpora).
Datasets
We used a combination of several Finnish resources, listed below with their sizes, upsampling weights, and resulting sampling ratios.
Sampling ratios for Finnish
| Dataset | Characters | Raw ratio | Weight | Weighted ratio |
|---|---|---|---|---|
| Parsebank | 35.0B | 16.9% | 1.5 | 22.7% |
| mC4-Fi | 46.3B | 22.4% | 1.0 | 20.0% |
| CC-Fi | 79.6B | 38.5% | 1.0 | 34.4% |
| Fiwiki | 0.8B | 0.4% | 3.0 | 1.0% |
| Lönnrot | 0.8B | 0.4% | 3.0 | 1.0% |
| Yle | 1.6B | 0.8% | 2.0 | 1.4% |
| STT | 2.2B | 1.1% | 2.0 | 1.9% |
| ePub | 13.5B | 6.5% | 1.0 | 5.8% |
| Lehdet | 5.8B | 2.8% | 1.0 | 2.5% |
| Suomi24 | 20.6B | 9.9% | 1.0 | 8.9% |
| Reddit-Fi | 0.7B | 0.4% | 1.0 | 0.3% |
| TOTAL | 207.0B | 100.0% | N/A | 100.0% |
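As an illustration of how the weighted ratios above follow from the character counts and weights, here is a minimal Python sketch that multiplies each dataset's size by its upsampling weight and normalizes. This is a reconstruction for clarity using the numbers from the table, not the project's actual data pipeline:

```python
# Derive weighted sampling ratios from character counts (in billions)
# and per-dataset upsampling weights, as in the table above.
datasets = {
    # name: (characters in billions, sampling weight)
    "Parsebank": (35.0, 1.5),
    "mC4-Fi":    (46.3, 1.0),
    "CC-Fi":     (79.6, 1.0),
    "Fiwiki":    (0.8,  3.0),
    "Lönnrot":   (0.8,  3.0),
    "Yle":       (1.6,  2.0),
    "STT":       (2.2,  2.0),
    "ePub":      (13.5, 1.0),
    "Lehdet":    (5.8,  1.0),
    "Suomi24":   (20.6, 1.0),
    "Reddit-Fi": (0.7,  1.0),
}

# Total size after upsampling: sum of chars * weight over all datasets.
total_weighted = sum(chars * weight for chars, weight in datasets.values())

# Each dataset's weighted sampling ratio is its upsampled size
# divided by the upsampled total (e.g. Parsebank: 52.5 / 231.4 = 22.7%).
for name, (chars, weight) in datasets.items():
    ratio = chars * weight / total_weighted
    print(f"{name:10s} {ratio:6.1%}")
```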
For the continued pretraining as a whole, ROOTS is mixed in with the Finnish data described above.