---
license: apache-2.0
datasets:
- PleIAs/common_corpus
---

# ByteLlama

ByteLlama is a series of tiny byte-level language models built on the Llama architecture. Instead of a learned subword tokenizer, ByteLlama operates directly on raw UTF-8 bytes, using a simple octet tokenizer with a vocabulary size of 512.
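Byte-level modeling means the base vocabulary is just the 256 possible byte values, with the remaining ids left for special and language tokens. Independent of any ByteLlama API, the raw byte view of a string can be inspected with plain Python:

```python
# Raw UTF-8 bytes of a string: the unit a byte-level model consumes.
text = "Héllo"
raw = list(text.encode("utf-8"))
print(raw)  # 'é' expands to two bytes (0xC3, 0xA9)
```

Note that non-ASCII characters expand to multiple bytes, so byte sequences are somewhat longer than subword token sequences for the same text.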
+
## Models
|
| 15 |
+
|
| 16 |
+
| Variant | Parameters | Layers | Embed Dim | Heads | KV Heads | Intermediate Dim | Max Seq Len |
|
| 17 |
+
|---------|-----------|--------|-----------|-------|----------|-------------------|-------------|
|
| 18 |
+
| tiny | 7M | 18 | 192 | 6 | 2 | 512 | 2048 |
|
| 19 |
+
| small | 16M | 18 | 288 | 9 | 3 | 768 | 2048 |
|
| 20 |
+
| base | 43M | 12 | 576 | 9 | 3 | 1536 | 2048 |
|
| 21 |
+
| large | 101M | 30 | 576 | 9 | 3 | 1536 | 2048 |
|
| 22 |
+
|
| 23 |
+
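As a rough sanity check, the parameter counts can be approximated from the table alone, assuming a standard Llama block (grouped-query attention plus a SwiGLU MLP) with untied input/output embeddings; norm weights are ignored, so this is an estimate, not the exact count:

```python
def approx_params(layers, d, heads, kv_heads, inter, vocab=512):
    # Attention: Wq and Wo are (d x d); Wk and Wv are (d x kv_dim) under GQA.
    head_dim = d // heads
    kv_dim = kv_heads * head_dim
    attn = 2 * d * d + 2 * d * kv_dim
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * d * inter
    # Untied input embedding plus output head.
    emb = 2 * vocab * d
    return layers * (attn + mlp) + emb

print(approx_params(12, 576, 9, 3, 1536))  # base variant: ~43M
```

The estimate lands close to the table for the smaller variants (e.g. ~7.3M for tiny, ~43M for base).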
## Training Data

All models are trained on the **Historical and Cultural** subset of the Common Corpus, i.e. the open-culture and pre-1900 filtered portion. Each model has been trained on 100 billion tokens (100,000 steps with an effective batch size of 512 and a sequence length of 2048).

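The stated token budget follows directly from the training schedule:

```python
steps = 100_000
effective_batch = 512
seq_len = 2048

total_tokens = steps * effective_batch * seq_len
print(f"{total_tokens:,}")  # 104,857,600,000, i.e. ~100B tokens
```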
## Language Tokens

The models are trained with use in [party](https://github.com/mittagessen/kraken) in mind. They expect a **language token** as the first input after the BOS token, indicating the language of the prompt that follows. The tokenizer supports over 200 languages identified by ISO 639-3 codes. Given the language distribution of the training corpus, it is not known which of these languages the model has actually seen during training.

For example, encoding with a language tag:

```python
from bytellama.tokenizer import OctetTokenizer

tokenizer = OctetTokenizer()

# Encode with a language tag
tokens = tokenizer.encode("Hello world!", langs=["eng"], add_bos=True, add_eos=True)

# Encode without a language tag
tokens = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
```

The token sequence with a language tag looks like `[BOS, LANG, byte1, byte2, ..., EOS]`.

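That layout can be sketched in plain Python. The concrete id values below are hypothetical (the real `OctetTokenizer` defines its own id assignments); only the ordering of BOS, language tag, raw bytes, and EOS is taken from the description above:

```python
BOS_ID, EOS_ID = 256, 257            # hypothetical special-token ids
LANG_IDS = {"eng": 258, "deu": 259}  # hypothetical language-tag ids

def encode_with_lang(text: str, lang: str) -> list[int]:
    # [BOS, LANG, byte1, byte2, ..., EOS]
    return [BOS_ID, LANG_IDS[lang], *text.encode("utf-8"), EOS_ID]

print(encode_with_lang("Hi", "eng"))  # [256, 258, 72, 105, 257]
```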
## Installation

```shell
pip install .
```

## Training

Training is driven through the `bytellama` CLI:

```shell
bytellama --device cuda:0 --precision bf16-mixed train \
    --dataset <hf_dataset_or_path> \
    --model-variant base \
    --batch-size 128 \
    --max-steps 100000 \
    --accumulate-grad-batches 4 \
    -o /path/to/checkpoints
```

Configuration can also be provided via a YAML file:

```shell
bytellama --config experiment.yaml train --dataset <hf_dataset_or_path>
```

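A minimal `experiment.yaml` might look like the following; the key names are assumptions mirroring the CLI flags shown above, not a documented schema:

```yaml
# Hypothetical config mirroring the CLI flags (key names are assumed).
model_variant: base
batch_size: 128
max_steps: 100000
accumulate_grad_batches: 4
```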
## Inference

Use the `infer` command with a prompt and one or more language tokens:

```shell
bytellama --device cuda:0 infer \
    --model models/model.safetensors \
    --max-new-tokens 200 \
    "Blaue Blasen sind" deu
```

Language tokens can be ISO 639-3 codes (e.g. `eng`) or language names that map to supported tokenizer languages.