---
license: apache-2.0
datasets:
- PleIAs/common_corpus
---

# ByteLlama

ByteLlama is a series of tiny byte-level language models built on the Llama architecture. Instead of a learned subword tokenizer, ByteLlama operates directly on raw UTF-8 bytes, using a simple octet tokenizer with a vocabulary size of 512.
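Byte-level modeling means the base vocabulary is just the 256 possible byte values, with the remaining ids left for special and language tokens. Independent of any ByteLlama API, the raw byte view of a string can be inspected with plain Python:

```python
# Raw UTF-8 bytes of a string: the unit a byte-level model consumes.
text = "Héllo"
raw = list(text.encode("utf-8"))
print(raw)  # 'é' expands to two bytes (0xC3, 0xA9)
```

Note that non-ASCII characters expand to multiple bytes, so byte sequences are somewhat longer than subword token sequences for the same text.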
+
## Models
|
| 15 |
+
|
| 16 |
+
| Variant | Parameters | Layers | Embed Dim | Heads | KV Heads | Intermediate Dim | Max Seq Len |
|
| 17 |
+
|---------|-----------|--------|-----------|-------|----------|-------------------|-------------|
|
| 18 |
+
| tiny | 7M | 18 | 192 | 6 | 2 | 512 | 2048 |
|
| 19 |
+
| small | 16M | 18 | 288 | 9 | 3 | 768 | 2048 |
|
| 20 |
+
| base | 43M | 12 | 576 | 9 | 3 | 1536 | 2048 |
|
| 21 |
+
| large | 101M | 30 | 576 | 9 | 3 | 1536 | 2048 |
|
| 22 |
+
|
| 23 |
+
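As a rough sanity check, the parameter counts can be approximated from the table alone, assuming a standard Llama block (grouped-query attention plus a SwiGLU MLP) with untied input/output embeddings; norm weights are ignored, so this is an estimate, not the exact count:

```python
def approx_params(layers, d, heads, kv_heads, inter, vocab=512):
    # Attention: Wq and Wo are (d x d); Wk and Wv are (d x kv_dim) under GQA.
    head_dim = d // heads
    kv_dim = kv_heads * head_dim
    attn = 2 * d * d + 2 * d * kv_dim
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * d * inter
    # Untied input embedding plus output head.
    emb = 2 * vocab * d
    return layers * (attn + mlp) + emb

print(approx_params(12, 576, 9, 3, 1536))  # base variant: ~43M
```

The estimate lands close to the table for the smaller variants (e.g. ~7.3M for tiny, ~43M for base).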
## Training Data

All models are trained on the **Historical and Cultural** subset of the Common Corpus, i.e. the open-culture and pre-1900 filtered portion. Each model has been trained on 100 billion tokens (100,000 steps with an effective batch size of 512 and a sequence length of 2048).

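The stated token budget follows directly from the training schedule:

```python
steps = 100_000
effective_batch = 512
seq_len = 2048

total_tokens = steps * effective_batch * seq_len
print(f"{total_tokens:,}")  # 104,857,600,000, i.e. ~100B tokens
```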
## Language Tokens

The models are trained with use in [party](https://github.com/mittagessen/kraken) in mind. They expect a **language token** as the first input after the BOS token, indicating the language of the prompt that follows. The tokenizer supports over 200 languages identified by ISO 639-3 codes. Given the language distribution of the training corpus, it is not known which of these languages the model has actually seen during training.

For example, encoding with a language tag:

```python
from bytellama.tokenizer import OctetTokenizer

tokenizer = OctetTokenizer()

# Encode with a language tag
tokens = tokenizer.encode("Hello world!", langs=["eng"], add_bos=True, add_eos=True)

# Encode without a language tag
tokens = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
```

The token sequence with a language tag looks like `[BOS, LANG, byte1, byte2, ..., EOS]`.

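That layout can be sketched in plain Python. The concrete id values below are hypothetical (the real `OctetTokenizer` defines its own id assignments); only the ordering of BOS, language tag, raw bytes, and EOS is taken from the description above:

```python
BOS_ID, EOS_ID = 256, 257            # hypothetical special-token ids
LANG_IDS = {"eng": 258, "deu": 259}  # hypothetical language-tag ids

def encode_with_lang(text: str, lang: str) -> list[int]:
    # [BOS, LANG, byte1, byte2, ..., EOS]
    return [BOS_ID, LANG_IDS[lang], *text.encode("utf-8"), EOS_ID]

print(encode_with_lang("Hi", "eng"))  # [256, 258, 72, 105, 257]
```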
## Installation

```shell
pip install .
```

## Training

Training is driven through the `bytellama` CLI:

```shell
bytellama --device cuda:0 --precision bf16-mixed train \
    --dataset <hf_dataset_or_path> \
    --model-variant base \
    --batch-size 128 \
    --max-steps 100000 \
    --accumulate-grad-batches 4 \
    -o /path/to/checkpoints
```

Configuration can also be provided via a YAML file:

```shell
bytellama --config experiment.yaml train --dataset <hf_dataset_or_path>
```

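A minimal `experiment.yaml` might look like the following; the key names are assumptions mirroring the CLI flags shown above, not a documented schema:

```yaml
# Hypothetical config mirroring the CLI flags (key names are assumed).
model_variant: base
batch_size: 128
max_steps: 100000
accumulate_grad_batches: 4
```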
## Inference

Use the `infer` command with a prompt and one or more language tokens:

```shell
bytellama --device cuda:0 infer \
    --model models/model.safetensors \
    --max-new-tokens 200 \
    "Blaue Blasen sind" deu
```

Language tokens can be ISO 639-3 codes (e.g. `eng`) or language names that map to supported tokenizer languages.