---
license: apache-2.0
datasets:
- PleIAs/common_corpus
---

# ByteLlama

ByteLlama is a series of tiny byte-level language models built on the Llama
architecture. Instead of a learned subword tokenizer, ByteLlama operates
directly on raw UTF-8 bytes, using a simple octet tokenizer with a vocabulary
size of 512.
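Because the tokenizer is byte-level, sequence lengths are measured in UTF-8 bytes rather than words or subwords. As a purely illustrative sketch (the real `OctetTokenizer` id layout and special-token offsets are assumptions here, not the library's API):

```python
# Illustrative byte-level tokenization; the id offset and special-token
# reservation below are assumptions, not the actual OctetTokenizer layout.
OFFSET = 3  # hypothetical: reserve a few low ids for PAD/BOS/EOS

def encode_bytes(text: str) -> list[int]:
    # One token id per UTF-8 byte of the input.
    return [b + OFFSET for b in text.encode("utf-8")]

tokens = encode_bytes("héllo")
# "é" encodes as two UTF-8 bytes, so 5 characters become 6 token ids.
print(len(tokens))        # 6
print(max(tokens) < 512)  # True: all 256 byte values fit well within 512
```

The vocabulary slots above the 256 raw byte values leave room for the special tokens and language tokens described below.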
## Models

| Variant | Parameters | Layers | Embed Dim | Heads | KV Heads | Intermediate Dim | Max Seq Len |
|---------|------------|--------|-----------|-------|----------|------------------|-------------|
| tiny    | 7M         | 18     | 192       | 6     | 2        | 512              | 2048        |
| small   | 16M        | 18     | 288       | 9     | 3        | 768              | 2048        |
| base    | 43M        | 12     | 576       | 9     | 3        | 1536             | 2048        |
| large   | 101M       | 30     | 576       | 9     | 3        | 1536             | 2048        |
## Training Data

All models are trained on the **Historical and Cultural** subset of the Common
Corpus, i.e. the open-culture and pre-1900 filtered portion. Each model has
been trained on approximately 100 billion tokens (100,000 steps with an
effective batch size of 512 and a sequence length of 2048).
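The stated token budget follows directly from the step count, effective batch size, and sequence length:

```python
# Sanity check of the training token budget quoted above.
steps = 100_000
effective_batch_size = 512
seq_len = 2048

total_tokens = steps * effective_batch_size * seq_len
print(total_tokens)  # 104_857_600_000, i.e. roughly 100 billion tokens
```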
## Language Tokens

The models are trained with use in
[party](https://github.com/mittagessen/kraken) in mind. They expect a
**language token** as the first input after the BOS token, indicating the
language of the following prompt. The tokenizer supports over 200 languages
identified by ISO 639-3 codes. Given the language distribution of the
training corpus, it is unknown which of these languages the models have
actually seen during training.
For example, encoding with a language tag:

```python
from bytellama.tokenizer import OctetTokenizer

tokenizer = OctetTokenizer()

# Encode with a language tag
tokens = tokenizer.encode("Hello world!", langs=["eng"], add_bos=True, add_eos=True)

# Encode without a language tag
tokens = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
```

The token sequence with a language tag looks like: `[BOS, LANG, byte1, byte2, ..., EOS]`.
## Installation

```shell
pip install .
```
## Training

Training is driven through the `bytellama` CLI:

```shell
bytellama --device cuda:0 --precision bf16-mixed train \
    --dataset <hf_dataset_or_path> \
    --model-variant base \
    --batch-size 128 \
    --max-steps 100000 \
    --accumulate-grad-batches 4 \
    -o /path/to/checkpoints
```

Configuration can also be provided via a YAML file:

```shell
bytellama --config experiment.yaml train --dataset <hf_dataset_or_path>
```
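The YAML schema is not documented in this README; as a sketch, a config mirroring the CLI flags above might look like the following (all key names are assumptions):

```yaml
# experiment.yaml -- hypothetical key names mirroring the CLI flags above
device: cuda:0
precision: bf16-mixed
model_variant: base
batch_size: 128
max_steps: 100000
accumulate_grad_batches: 4
output: /path/to/checkpoints
```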
## Inference

Use the `infer` command with a prompt and one or more language tokens:

```shell
bytellama --device cuda:0 infer \
    --model models/model.safetensors \
    --max-new-tokens 200 \
    "Blaue Blasen sind" deu
```

Language tokens can be ISO 639-3 codes (e.g. `eng`) or language names that
map to supported tokenizer languages.