ByteLlama

ByteLlama is a series of tiny byte-level language models built on the Llama architecture. Instead of a learned subword tokenizer, ByteLlama operates directly on raw UTF-8 bytes, using a simple octet tokenizer with a vocabulary size of 512.
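Because the model works on raw bytes, encoding needs no learned merges: every UTF-8 byte maps to one token, and the remaining vocabulary slots hold special tokens such as BOS, EOS, and the language tags. A minimal sketch of the idea (the offset and ID layout are assumptions for illustration, not the actual OctetTokenizer layout):

```python
BYTE_OFFSET = 256  # assumed: IDs 0-255 reserved for special and language tokens

def encode_bytes(text: str) -> list[int]:
    """Map each UTF-8 byte of `text` to a token ID by a fixed offset."""
    return [BYTE_OFFSET + b for b in text.encode("utf-8")]

def decode_bytes(ids: list[int]) -> str:
    """Invert the mapping back to a UTF-8 string."""
    return bytes(i - BYTE_OFFSET for i in ids).decode("utf-8")

# Multi-byte characters simply become several tokens:
# "é" is two UTF-8 bytes, hence two tokens.
```

This is why the vocabulary can stay at 512 entries regardless of language: 256 IDs cover every possible byte, and the rest are free for metadata tokens.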

Models

| Variant | Parameters | Layers | Embed Dim | Heads | KV Heads | Intermediate Dim | Max Seq Len |
|---------|------------|--------|-----------|-------|----------|------------------|-------------|
| tiny    | 7M         | 18     | 192       | 6     | 2        | 512              | 2048        |
| small   | 16M        | 18     | 288       | 9     | 3        | 768               | 2048        |
| base    | 43M        | 12     | 576       | 9     | 3        | 1536              | 2048        |
| large   | 101M       | 30     | 576       | 9     | 3        | 1536              | 2048        |
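The quoted sizes can be roughly re-derived from the table with a Llama-style parameter count. The sketch below assumes untied input/output embeddings, RMSNorm weights, and a SwiGLU MLP; these architectural details are assumptions, not stated in the card:

```python
def llama_params(vocab, layers, d, heads, kv_heads, inter):
    """Rough parameter count for a Llama-style decoder."""
    head_dim = d // heads
    attn = d * d + 2 * d * (kv_heads * head_dim) + d * d  # q, k, v, o projections
    mlp = 3 * d * inter                                   # gate, up, down
    norms = 2 * d                                         # two RMSNorms per layer
    # + input embedding, output head, final norm
    return layers * (attn + mlp + norms) + 2 * vocab * d + d

print(llama_params(512, 18, 192, 6, 2, 512))  # ~7.3M, close to the quoted 7M for tiny
```

Plugging in the base row (12 layers, dim 576, 9 heads, 3 KV heads, intermediate 1536) lands near the quoted 43M as well.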

Training Data

All models are trained on the Historical and Cultural subset of the Common Corpus, i.e. the portion filtered for open-culture and pre-1900 content. Each model has been trained on roughly 100 billion tokens (100,000 steps with an effective batch size of 512 and a sequence length of 2048).
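The token budget follows directly from the schedule: an effective batch of 512 sequences of 2,048 tokens consumes about one million tokens per step.

```python
steps, effective_batch, seq_len = 100_000, 512, 2048
tokens = steps * effective_batch * seq_len
print(f"{tokens:,}")  # 104,857,600,000 -> roughly 100 billion tokens
```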

Language Tokens

The models are trained with multilingual use in mind. They expect a language token as the first input after the BOS token, indicating the language of the prompt that follows. The tokenizer supports over 200 languages identified by ISO 639-3 codes. Given the language distribution of the training corpus, it is unknown which of these languages the model has actually seen during training.

For example, encoding with a language tag:

from bytellama.tokenizer import OctetTokenizer

tokenizer = OctetTokenizer()

# Encode with a language tag
tokens = tokenizer.encode("Hello world!", langs=["eng"], add_bos=True, add_eos=True)

# Encode without a language tag
tokens = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)

The token sequence with a language tag looks like: [BOS, LANG, byte1, byte2, ..., EOS].
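The framing can be illustrated with placeholder IDs (the BOS, EOS, language-tag, and byte-offset values below are assumptions for illustration, not bytellama's actual IDs):

```python
BOS, EOS, LANG_ENG, BYTE_OFFSET = 0, 1, 2, 256  # all assumed placeholder IDs

def frame(text: str) -> list[int]:
    """Wrap a prompt's UTF-8 bytes as [BOS, LANG, byte1, ..., EOS]."""
    return [BOS, LANG_ENG] + [BYTE_OFFSET + b for b in text.encode("utf-8")] + [EOS]

print(frame("Hi"))  # [0, 2, 328, 361, 1] -> BOS, LANG, 'H', 'i', EOS
```

Without the language tag, the LANG position is simply omitted and the byte tokens follow BOS directly.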

Installation

pip install .

Training

Training is driven through the bytellama CLI:

bytellama --device cuda:0 --precision bf16-mixed train \
    --dataset <hf_dataset_or_path> \
    --model-variant base \
    --batch-size 128 \
    --max-steps 100000 \
    --accumulate-grad-batches 4 \
    -o /path/to/checkpoints

Configuration can also be provided via a YAML file:

bytellama --config experiment.yaml train --dataset <hf_dataset_or_path>
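The YAML mirrors the CLI options. A hypothetical experiment.yaml for the run above might look like this (the key names are assumed to match the flag names and are not taken from bytellama's documentation):

```yaml
# Hypothetical config; keys assumed to mirror the CLI flags shown above.
device: cuda:0
precision: bf16-mixed
model_variant: base
batch_size: 128
max_steps: 100000
accumulate_grad_batches: 4
```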

Inference

Use the infer command with a prompt and one or more language tokens:

bytellama --device cuda:0 infer \
    --model models/model.safetensors \
    --max-new-tokens 200 \
    "Blaue Blasen sind" deu

Language tokens can be ISO 639-3 codes (e.g. eng) or language names that map to supported tokenizer languages.

