Instructions to use yhavinga/gpt2-medium-dutch with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use yhavinga/gpt2-medium-dutch with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="yhavinga/gpt2-medium-dutch")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yhavinga/gpt2-medium-dutch")
model = AutoModelForCausalLM.from_pretrained("yhavinga/gpt2-medium-dutch")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use yhavinga/gpt2-medium-dutch with vLLM:
Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "yhavinga/gpt2-medium-dutch"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "yhavinga/gpt2-medium-dutch",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- SGLang
How to use yhavinga/gpt2-medium-dutch with SGLang:
Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "yhavinga/gpt2-medium-dutch" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "yhavinga/gpt2-medium-dutch",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "yhavinga/gpt2-medium-dutch" \
  --host 0.0.0.0 \
  --port 30000
```

- Docker Model Runner
How to use yhavinga/gpt2-medium-dutch with Docker Model Runner:
```shell
docker model run hf.co/yhavinga/gpt2-medium-dutch
```
GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱
A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, with perplexity 15.1 on cleaned Dutch mC4.
How To Use
You can use this GPT2 model directly with a pipeline for text generation.

```python
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel

MODEL_DIR = 'yhavinga/gpt2-medium-dutch'
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
model = GPT2LMHeadModel.from_pretrained(MODEL_DIR)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

generated_text = generator('In Antwerpen heeft zich gisteren', max_length=100,
                           do_sample=True, top_k=40, top_p=0.95,
                           repetition_penalty=2.0)
```
"In Antwerpen heeft zich gisteren" - " een dramatische ontknoping voorgedaan in de Vlaamse deelregering. De VLD, die sinds afgelopen woensdag aan het bewind is in Vlaams-Waals gebied (de zogenaamde gewestelijke en niet rechtstreeks met Vlaanderen samenwerkende gewesten), krijgt toch geen meerderheidszetels bij verkiezingen voor gemeenteraadsverkiezingen in oktober of november volgend jaar in Westmalle, Berchem, Tervuren enz., aldus premier Jean-Pierre Van Cauwenberghe van Wallonië vandaag"
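The `top_p=0.95` argument above enables nucleus sampling: at each generation step, only the smallest set of tokens whose probabilities sum to at least `top_p` is kept before sampling. A minimal pure-Python sketch of that filtering step (the toy probability values are made up for illustration):

```python
def nucleus_filter(probs, top_p=0.9):
    # Keep the smallest set of token indices whose cumulative
    # probability reaches top_p; sampling then happens among these.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return kept

# Five-token toy distribution: the two rare tail tokens are cut off.
print(nucleus_filter([0.5, 0.3, 0.15, 0.04, 0.01], top_p=0.9))  # [0, 1, 2]
```

In the real generator this runs per step over the model's full vocabulary, combined with the `top_k=40` cut-off and the repetition penalty.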
Tokenizer
- BPE tokenizer trained from scratch for Dutch on cleaned Dutch mC4, using scripts from the HuggingFace Transformers Flax examples.
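A byte-level BPE tokenizer like this one can be trained with the HuggingFace `tokenizers` library. The sketch below uses a tiny in-memory corpus and a small vocabulary purely for illustration; the actual tokenizer was trained on the full cleaned Dutch mC4 with the Flax example scripts:

```python
from tokenizers import ByteLevelBPETokenizer

# Tiny stand-in corpus; the real training used cleaned Dutch mC4.
corpus = [
    "De kat zit op de mat.",
    "Amsterdam is de hoofdstad van Nederland.",
    "Het weer is vandaag zonnig.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=300,                      # GPT2 uses 50257 in practice
    special_tokens=["<|endoftext|>"],
)
print(tokenizer.encode("De kat zit").tokens)
```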
Dataset
This model was trained on the full configuration (33B tokens) of cleaned Dutch mC4, which is the original mC4, except:
- Documents that contained words from a selection of the Dutch and English List of Dirty, Naughty, Obscene and Otherwise Bad Words are removed
- Sentences with fewer than 3 words are removed
- Sentences containing a word of more than 1000 characters are removed
- Documents with fewer than 5 sentences are removed
- Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
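The document-level rules above can be sketched as a single filter function. This is an illustrative reconstruction, not the actual cleaning code; `BAD_WORDS` is a placeholder for the real Dutch and English word lists, and the sentence splitting here is deliberately naive:

```python
BAD_WORDS = {"voorbeeldscheldwoord"}   # placeholder; the real lists are much larger
BOILERPLATE = ["javascript", "lorum ipsum", "terms of use", "privacy policy",
               "cookie policy", "uses cookies", "use of cookies", "use cookies",
               "elementen ontbreken", "deze printversie"]

def keep_document(doc: str) -> bool:
    lower = doc.lower()
    # Drop documents containing boilerplate phrases or bad words.
    if any(phrase in lower for phrase in BOILERPLATE):
        return False
    if any(word in BAD_WORDS for word in lower.split()):
        return False
    # Sentence-level rules: drop short sentences and sentences with
    # absurdly long words, then require at least 5 surviving sentences.
    sentences = [s.strip() for s in doc.split('.') if s.strip()]
    sentences = [s for s in sentences
                 if len(s.split()) >= 3
                 and max(len(w) for w in s.split()) <= 1000]
    return len(sentences) >= 5

print(keep_document("This site uses cookies. Meer tekst."))  # False
```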
Models
TL;DR: yhavinga/gpt2-medium-dutch is the best model: it matches the lowest perplexity in the table below (15.1) with far fewer parameters than the larger models.
- The models with `a/b` in the step column have been trained to step `a` of a total of `b` steps.
| model | type | params | train seq len | ppl | loss | batch size | epochs | steps | optim | lr | duration | config |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| yhavinga/gpt-neo-125M-dutch | gpt neo | 125M | 512 | 20.9 | 3.04 | 128 | 1 | 190000/558608 | adam | 2.4e-3 | 1d 12h | full |
| yhavinga/gpt2-medium-dutch | gpt2 | 345M | 512 | 15.1 | 2.71 | 128 | 1 | 320000/520502 | adam | 8e-4 | 7d 2h | full |
| yhavinga/gpt2-large-dutch | gpt2 | 762M | 512 | 15.1 | 2.72 | 32 | 1 | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h | large |
| yhavinga/gpt-neo-1.3B-dutch | gpt neo | 1.3B | 512 | 16.0 | 2.77 | 16 | 1 | 960000/3049896 | adafactor | 5e-4 | 7d 11h | full |
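As a sanity check on the table, perplexity is just the exponential of the reported loss; agreement is within rounding, since the losses are shown to two decimals:

```python
import math

# (loss, reported ppl) pairs from the table above.
reported = {
    "gpt-neo-125M-dutch": (3.04, 20.9),
    "gpt2-medium-dutch":  (2.71, 15.1),
    "gpt2-large-dutch":   (2.72, 15.1),
    "gpt-neo-1.3B-dutch": (2.77, 16.0),
}

for name, (loss, ppl) in reported.items():
    # Perplexity is exp(mean cross-entropy loss); small differences
    # come from rounding of the reported loss.
    assert abs(math.exp(loss) - ppl) < 0.2, name
    print(f"{name}: exp({loss:.2f}) = {math.exp(loss):.1f}")
```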
Acknowledgements
This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The HuggingFace 🤗 ecosystem was also instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM and training the models:
- Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP
- HuggingFace Flax MLM examples
- gpt2-medium-persian
- gpt2-medium-indonesian
Created by Yeb Havinga