ByteLlama

ByteLlama is a series of tiny byte-level language models built on the Llama architecture. Instead of a learned subword tokenizer, ByteLlama operates directly on raw UTF-8 bytes, using a simple octet tokenizer with a vocabulary size of 512.
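Because the model works on raw bytes, encoding needs no learned merges: every UTF-8 byte maps to one token, and the remaining vocabulary slots hold special tokens such as BOS, EOS, and the language tags. A minimal sketch of the idea (the offset and ID layout are assumptions for illustration, not the actual OctetTokenizer layout):

```python
BYTE_OFFSET = 256  # assumed: IDs 0-255 reserved for special and language tokens

def encode_bytes(text: str) -> list[int]:
    """Map each UTF-8 byte of `text` to a token ID by a fixed offset."""
    return [BYTE_OFFSET + b for b in text.encode("utf-8")]

def decode_bytes(ids: list[int]) -> str:
    """Invert the mapping back to a UTF-8 string."""
    return bytes(i - BYTE_OFFSET for i in ids).decode("utf-8")

# Multi-byte characters simply become several tokens:
# "é" is two UTF-8 bytes, hence two tokens.
```

This is why the vocabulary can stay at 512 entries regardless of language: 256 IDs cover every possible byte, and the rest are free for metadata tokens.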

Models

| Variant | Parameters | Layers | Embed Dim | Heads | KV Heads | Intermediate Dim | Max Seq Len |
|---------|------------|--------|-----------|-------|----------|------------------|-------------|
| tiny    | 7M         | 18     | 192       | 6     | 2        | 512              | 2048        |
| small   | 16M        | 18     | 288       | 9     | 3        | 768               | 2048        |
| base    | 43M        | 12     | 576       | 9     | 3        | 1536              | 2048        |
| large   | 101M       | 30     | 576       | 9     | 3        | 1536              | 2048        |
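The quoted sizes can be roughly re-derived from the table with a Llama-style parameter count. The sketch below assumes untied input/output embeddings, RMSNorm weights, and a SwiGLU MLP; these architectural details are assumptions, not stated in the card:

```python
def llama_params(vocab, layers, d, heads, kv_heads, inter):
    """Rough parameter count for a Llama-style decoder."""
    head_dim = d // heads
    attn = d * d + 2 * d * (kv_heads * head_dim) + d * d  # q, k, v, o projections
    mlp = 3 * d * inter                                   # gate, up, down
    norms = 2 * d                                         # two RMSNorms per layer
    # + input embedding, output head, final norm
    return layers * (attn + mlp + norms) + 2 * vocab * d + d

print(llama_params(512, 18, 192, 6, 2, 512))  # ~7.3M, close to the quoted 7M for tiny
```

Plugging in the base row (12 layers, dim 576, 9 heads, 3 KV heads, intermediate 1536) lands near the quoted 43M as well.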

Training Data

All models are trained on the Historical and Cultural subset of the Common Corpus, i.e. the portion filtered for open-culture and pre-1900 content. Each model has been trained on roughly 100 billion tokens (100,000 steps with an effective batch size of 512 and a sequence length of 2048).
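The token budget follows directly from the schedule: an effective batch of 512 sequences of 2,048 tokens consumes about one million tokens per step.

```python
steps, effective_batch, seq_len = 100_000, 512, 2048
tokens = steps * effective_batch * seq_len
print(f"{tokens:,}")  # 104,857,600,000 -> roughly 100 billion tokens
```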

Language Tokens

The models are trained with multilingual use in mind. They expect a language token as the first input after the BOS token, indicating the language of the prompt that follows. The tokenizer supports over 200 languages identified by ISO 639-3 codes. Given the language distribution of the training corpus, it is unknown which of these languages the model has actually seen during training.

For example, encoding with a language tag:

from bytellama.tokenizer import OctetTokenizer

tokenizer = OctetTokenizer()

# Encode with a language tag
tokens = tokenizer.encode("Hello world!", langs=["eng"], add_bos=True, add_eos=True)

# Encode without a language tag
tokens = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)

The token sequence with a language tag looks like: [BOS, LANG, byte1, byte2, ..., EOS].
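The framing can be illustrated with placeholder IDs (the BOS, EOS, language-tag, and byte-offset values below are assumptions for illustration, not bytellama's actual IDs):

```python
BOS, EOS, LANG_ENG, BYTE_OFFSET = 0, 1, 2, 256  # all assumed placeholder IDs

def frame(text: str) -> list[int]:
    """Wrap a prompt's UTF-8 bytes as [BOS, LANG, byte1, ..., EOS]."""
    return [BOS, LANG_ENG] + [BYTE_OFFSET + b for b in text.encode("utf-8")] + [EOS]

print(frame("Hi"))  # [0, 2, 328, 361, 1] -> BOS, LANG, 'H', 'i', EOS
```

Without the language tag, the LANG position is simply omitted and the byte tokens follow BOS directly.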

Installation

pip install .

Training

Training is driven through the bytellama CLI:

bytellama --device cuda:0 --precision bf16-mixed train \
    --dataset <hf_dataset_or_path> \
    --model-variant base \
    --batch-size 128 \
    --max-steps 100000 \
    --accumulate-grad-batches 4 \
    -o /path/to/checkpoints

Configuration can also be provided via a YAML file:

bytellama --config experiment.yaml train --dataset <hf_dataset_or_path>
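The YAML mirrors the CLI options. A hypothetical experiment.yaml for the run above might look like this (the key names are assumed to match the flag names and are not taken from bytellama's documentation):

```yaml
# Hypothetical config; keys assumed to mirror the CLI flags shown above.
device: cuda:0
precision: bf16-mixed
model_variant: base
batch_size: 128
max_steps: 100000
accumulate_grad_batches: 4
```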

Inference

Use the infer command with a prompt and one or more language tokens:

bytellama --device cuda:0 infer \
    --model models/model.safetensors \
    --max-new-tokens 200 \
    "Blaue Blasen sind" deu

Language tokens can be ISO 639-3 codes (e.g. eng) or language names that map to supported tokenizer languages.

