BLOOM-CLP German (6.4B parameters)

This is a monolingual German language model trained using the CLP-Transfer method based on BLOOM-7b1.

You can try out the model on the European Language Grid.

UPDATE: We recently released an instruction-tuned version of this model: malteos/bloom-6b4-clp-german-oasst-v0.1.

How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='malteos/bloom-6b4-clp-german')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=3)

[{'generated_text': "Hello, I'm a language model, a language for thinking, a language for expressing thoughts."},
 {'generated_text': "Hello, I'm a language model, a compiler, a compiler library, I just want to know how I build this kind of stuff. I don"},
 {'generated_text': "Hello, I'm a language model, and also have more than a few of your own, but I understand that they're going to need some help"},]
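Because the model is monolingual German, a German prompt is the more natural input. The following is a minimal sketch, not the authors' reference code: it loads the tokenizer and model explicitly instead of going through `pipeline`. The helper name `generate_german`, the sampling settings (`top_p=0.9`), and the choice of float16 on GPU are illustrative assumptions.

```python
MODEL_ID = "malteos/bloom-6b4-clp-german"


def generate_german(prompt: str, max_new_tokens: int = 40, seed: int = 42) -> str:
    """Sample one continuation for a German prompt (hypothetical helper)."""
    # Heavy dependencies are imported lazily, so defining the helper is cheap.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

    set_seed(seed)  # sampling is random; fix the seed for reproducibility

    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    # Assumption: half precision on GPU, full precision on CPU.
    dtype = torch.float16 if use_cuda else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=dtype)
    model.to(device)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,  # illustrative value, not taken from the model card
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_german("Die Hauptstadt von Deutschland ist")` downloads the checkpoint on first use. With 6.4B parameters, the weights alone need roughly 12.8 GB in float16 (2 bytes per parameter) and about twice that in float32.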

Training dataset

ca. 50B German tokens
Web-crawled content from the German subset of OSCAR v22.01 (excluding content tagged as header, footer, noisy, or adult)
Web-crawled content from the GC4 Corpus (including only the head and middle parts)
Both web-crawled datasets are deduplicated with Google's suffix array implementation
German court decisions from Open Legal Data

Code

BigScience's Megatron-Deepspeed fork

Hardware

32xA100-40GB GPUs
12.5 days
Tensorboard logs
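The hardware figures above translate into a compute budget with simple arithmetic. The sketch below is a back-of-the-envelope estimate: the GPU count and duration come from this card, while treating the "ca. 50B" dataset tokens as the number of tokens processed in a single pass is an assumption, so the throughput numbers are rough.

```python
NUM_GPUS = 32            # from the card: 32xA100-40GB
TRAINING_DAYS = 12.5     # from the card
TRAINING_TOKENS = 50e9   # "ca. 50B"; ASSUMPTION: one pass over the data

# Total accelerator time.
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24

# Average throughput over the whole run (includes any idle or restart time).
wall_seconds = TRAINING_DAYS * 24 * 3600
tokens_per_second = TRAINING_TOKENS / wall_seconds
tokens_per_gpu_second = tokens_per_second / NUM_GPUS

print(f"GPU-hours:          {gpu_hours:,.0f}")              # 9,600
print(f"tokens/s (cluster): {tokens_per_second:,.0f}")      # 46,296
print(f"tokens/s per GPU:   {tokens_per_gpu_second:,.0f}")  # 1,447
```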

Evaluation

[Figure: validation perplexity (PPL) compared to from-scratch training; lower is better]

Additional evaluations can be found in our paper.
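For reference, perplexity is the exponential of the mean negative log-likelihood per token, which is why lower values are better. The sketch below illustrates only this definition. The function name and the input log-probabilities are hypothetical and are not results from this model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: a model that assigns probability 0.25 to every token
# is as uncertain as a uniform choice among 4 options, so PPL is about 4.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```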

How to cite

If you are using our code or models, please cite our paper:

@misc{Ostendorff2023clp,
  doi = {10.48550/ARXIV.2301.09626},
  author = {Ostendorff, Malte and Rehm, Georg},
  title = {Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning},
  publisher = {arXiv},
  year = {2023}
}

License

BigScience BLOOM RAIL 1.0

Model size

6B params (Safetensors, tensor type F16)
