# ByteLlama
ByteLlama is a series of tiny byte-level language models built on the Llama architecture. Instead of a learned subword tokenizer, ByteLlama operates directly on raw UTF-8 bytes, using a simple octet tokenizer with a vocabulary size of 512.
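Byte-level modeling means the "tokenizer" is essentially the raw UTF-8 encoding of the text. The snippet below uses only Python's built-in `str.encode` to show the byte values such a model consumes; how ByteLlama maps these bytes plus its special and language tokens into its 512-slot vocabulary is not shown here and is left as an implementation detail.

```python
# Byte-level tokenization: one id per UTF-8 byte of the input.
# Non-ASCII characters expand to multiple bytes, so sequence length
# depends on the encoding, not on the character count.
text = "Héllo"
byte_ids = list(text.encode("utf-8"))
print(len(text), len(byte_ids))  # 5 characters, 6 bytes ("é" takes two)
```

Note that 256 possible byte values leave plenty of headroom in a 512-entry vocabulary for BOS/EOS and the language tokens described below.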
## Models
| Variant | Parameters | Layers | Embed Dim | Heads | KV Heads | Intermediate Dim | Max Seq Len |
|---|---|---|---|---|---|---|---|
| tiny | 7M | 18 | 192 | 6 | 2 | 512 | 2048 |
| small | 16M | 18 | 288 | 9 | 3 | 768 | 2048 |
| base | 43M | 12 | 576 | 9 | 3 | 1536 | 2048 |
| large | 101M | 30 | 576 | 9 | 3 | 1536 | 2048 |
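The parameter counts in the table can be roughly reproduced from the other columns. The sketch below assumes standard Llama components (grouped-query attention and a gated SwiGLU MLP with three projection matrices) and ignores the small embedding, norm, and output-head terms; it is an estimate, not ByteLlama's exact accounting.

```python
# Rough parameter estimate for the "base" variant, assuming standard
# Llama building blocks. Embedding and norm parameters are tiny at
# vocab size 512 and mostly omitted.
vocab, dim, layers, heads, kv_heads, inter = 512, 576, 12, 9, 3, 1536
head_dim = dim // heads
attn = dim * dim * 2 + dim * head_dim * kv_heads * 2  # q/o + k/v projections
mlp = 3 * dim * inter                                 # gate, up, down
total = layers * (attn + mlp) + vocab * dim           # + token embeddings
print(f"{total / 1e6:.1f}M")  # close to the 43M listed for "base"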
## Training Data
All models are trained on the Historical and Cultural subset of the Common Corpus, i.e. the open-culture and pre-1900 filtered portion. Each model has seen roughly 100 billion tokens (100,000 steps with an effective batch size of 512 and a sequence length of 2048).
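The quoted token budget follows directly from the training schedule:

```python
# Sanity-check the token budget: steps x effective batch size x sequence length.
steps, batch, seq_len = 100_000, 512, 2048
tokens = steps * batch * seq_len
print(f"{tokens:,}")  # ~105 billion, i.e. roughly the quoted 100B tokens
```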
## Language Tokens
The models are trained with multilingual use in mind: they expect a language token as the first input token after the BOS token, indicating the language of the prompt that follows. The tokenizer supports over 200 languages identified by ISO 639-3 codes. Given the skewed distribution of languages in the training corpus, it is unknown which of these languages the model has actually seen during training.
For example, encoding with and without a language tag:

```python
from bytellama.tokenizer import OctetTokenizer

tokenizer = OctetTokenizer()

# Encode with a language tag
tokens = tokenizer.encode("Hello world!", langs=["eng"], add_bos=True, add_eos=True)

# Encode without a language tag
tokens = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
```
The token sequence with a language tag looks like `[BOS, LANG, byte1, byte2, ..., EOS]`.
## Installation

```shell
pip install .
```
## Training

Training is driven through the `bytellama` CLI:

```shell
bytellama --device cuda:0 --precision bf16-mixed train \
    --dataset <hf_dataset_or_path> \
    --model-variant base \
    --batch-size 128 \
    --max-steps 100000 \
    --accumulate-grad-batches 4 \
    -o /path/to/checkpoints
```
Configuration can also be provided via a YAML file:

```shell
bytellama --config experiment.yaml train --dataset <hf_dataset_or_path>
```
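A config file might mirror the CLI flags above. The sketch below is an assumption about the schema (key names copied from the flags), not documented ByteLlama configuration keys:

```yaml
# experiment.yaml — hypothetical sketch mirroring the CLI flags;
# the exact schema is an assumption, not a documented format.
device: cuda:0
precision: bf16-mixed
model-variant: base
batch-size: 128
max-steps: 100000
accumulate-grad-batches: 4
```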
## Inference

Use the `infer` command with a prompt and one or more language tokens:

```shell
bytellama --device cuda:0 infer \
    --model models/model.safetensors \
    --max-new-tokens 200 \
    "Blaue Blasen sind" deu
```
Language tokens can be ISO 639-3 codes (e.g. `eng`) or language names that map to supported tokenizer languages.