Tiny Llama4 MoE Text

This repository contains a tiny Llama4 text-only Mixture-of-Experts causal language model for validation and debugging.

The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Llama4 MoE text-model code paths in Hugging Face Transformers.

This checkpoint is useful for implementation testing because it includes chunked attention, full attention, grouped-query attention, QK norm, and MoE routing with multiple experts.

Model purpose

This model is designed for:

testing Llama4ForCausalLM
validating Llama4TextConfig
testing Llama4 text-only MoE model loading
checking model save/load behavior
checking tokenizer save/load behavior
exercising chunked attention layers
exercising full attention layers
exercising grouped-query attention
exercising QK norm paths
exercising MoE expert parameters
exercising top-k expert routing
providing a compact Llama4 MoE checkpoint for inference-engine validation

It is not designed for:

high-quality story generation
instruction following
chat use
OCR
multimodal inference
benchmark comparison against production language models
production deployment

Model architecture

The model uses Llama4ForCausalLM with a small Llama4 text MoE configuration.

Representative configuration:

model_type: llama4_text
vocab_size: 1024

hidden_size: 96
intermediate_size: 192
intermediate_size_mlp: 384

num_hidden_layers: 5
num_attention_heads: 4
num_key_value_heads: 1
head_dim: 24

max_position_embeddings: 1024
attention_chunk_size: 128

layer_types:
  - chunked_attention
  - full_attention
  - chunked_attention
  - full_attention
  - chunked_attention

hidden_act: silu
tie_word_embeddings: false
attention_bias: false
rms_norm_eps: 1e-05
rope_theta: 500000.0

use_qk_norm: true
attn_temperature_tuning: false
floor_scale: 8192
attn_scale: 0.1

num_local_experts: 4
num_experts_per_tok: 2
moe_layers:
  - 0
  - 1
  - 2
  - 3
  - 4
interleave_moe_layer_step: 1
router_aux_loss_coef: 0.001
router_jitter_noise: 0.0
output_router_logits: false

pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
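
The shape-related values above can be cross-checked with a few lines of plain Python. This is a minimal sketch, not part of the repository: the helper name check_config and the returned keys are hypothetical, and the dictionary only copies values from the representative configuration.

```python
# Values copied from the representative configuration above.
config = {
    "hidden_size": 96,
    "num_hidden_layers": 5,
    "num_attention_heads": 4,
    "num_key_value_heads": 1,
    "head_dim": 24,
    "num_local_experts": 4,
    "num_experts_per_tok": 2,
    "layer_types": [
        "chunked_attention",
        "full_attention",
        "chunked_attention",
        "full_attention",
        "chunked_attention",
    ],
    "moe_layers": [0, 1, 2, 3, 4],
}


def check_config(cfg):
    """Hypothetical helper: validate basic consistency and return derived sizes."""
    heads = cfg["num_attention_heads"]
    kv_heads = cfg["num_key_value_heads"]
    layers = cfg["num_hidden_layers"]

    if heads % kv_heads != 0:
        raise ValueError("query heads must divide evenly across KV heads")
    if len(cfg["layer_types"]) != layers:
        raise ValueError("layer_types needs one entry per hidden layer")
    if not 1 <= cfg["num_experts_per_tok"] <= cfg["num_local_experts"]:
        raise ValueError("num_experts_per_tok must be in [1, num_local_experts]")
    if any(i < 0 or i >= layers for i in cfg["moe_layers"]):
        raise ValueError("moe_layers contains an out-of-range layer index")

    return {
        "q_proj_width": heads * cfg["head_dim"],      # total query width
        "kv_proj_width": kv_heads * cfg["head_dim"],  # total key (or value) width
        "query_heads_per_kv_head": heads // kv_heads,
    }


derived = check_config(config)
print(derived)
```

With these values, all four query heads share a single key/value head, which is the most aggressive form of grouped-query attention.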

The attention pattern is:

CFCFC

where C means chunked_attention and F means full_attention.

This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the chunked attention path.
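
The CFCFC shorthand maps directly onto the layer_types list. A small sketch of that expansion follows; the function name expand_layer_pattern is hypothetical and is not taken from the training script.

```python
# Single-letter codes used in this document's shorthand.
PATTERN_CODES = {"C": "chunked_attention", "F": "full_attention"}


def expand_layer_pattern(pattern):
    """Expand a shorthand such as 'CFCFC' into a layer_types list."""
    layer_types = []
    for ch in pattern.upper():
        if ch not in PATTERN_CODES:
            raise ValueError(f"unknown layer code: {ch!r}")
        layer_types.append(PATTERN_CODES[ch])
    return layer_types


layer_types = expand_layer_pattern("CFCFC")
print(layer_types)
# ['chunked_attention', 'full_attention', 'chunked_attention', 'full_attention', 'chunked_attention']
```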

MoE configuration

This model enables Llama4 MoE blocks.

num_local_experts: 4
num_experts_per_tok: 2
moe_layers: all layers
interleave_moe_layer_step: 1
router_aux_loss_coef: 0.001

The num_local_experts=4 and num_experts_per_tok=2 setting is intentional. A smaller configuration such as num_local_experts=2, num_experts_per_tok=1 would exercise only a much simpler routing path.

This checkpoint is intended to cover:

router parameters
multiple experts
top-2 expert selection
weighted expert combination
MoE FFN parameters
chunked and full attention interaction
MoE layer execution in a small model
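
To make "top-2 expert selection" and "weighted expert combination" concrete, here is a generic top-k routing sketch in plain Python. It is an illustration under stated assumptions, not the Llama4 implementation in Transformers: the softmax over the selected logits, the toy scaling experts, and the names top_k_route and moe_forward are all choices made for this example. The real implementation may normalize router scores differently and may add a shared expert.

```python
import math


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and softmax-normalize their scores."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts, k=2):
    """Combine the selected experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Four toy "experts" that only scale their input (illustrative stand-ins).
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]

routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
print(routes)  # experts 1 and 3 tie, so each gets weight 0.5
print(moe_forward([1.0, -2.0], [0.1, 2.0, -1.0, 2.0], experts, k=2))
```

With k=1 the combination step reduces to picking a single expert's output, which is why a top-1 configuration would leave the weighting logic largely untested.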

Parameter count

The exact parameter count depends on the saved checkpoint configuration. The default training script prints the parameter count at startup.

For the default configuration, check the run log for:

Parameter count: ...
MoE/router/expert parameter names found: ...
Prefix breakdown:
  ...

After training, the metadata file also records the count:

artifact_metadata.json
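
If the run log is not available, a similar breakdown can be computed from parameter names and sizes. The sketch below is an assumption about what such a summary could look like; it does not reproduce the training script's actual log format, and the example names are hypothetical placeholders rather than names read from the checkpoint.

```python
from collections import Counter


def prefix_breakdown(named_sizes, depth=2):
    """Sum parameter counts by the first `depth` dot-separated name parts."""
    totals = Counter()
    for name, numel in named_sizes:
        prefix = ".".join(name.split(".")[:depth])
        totals[prefix] += numel
    return dict(totals)


def moe_parameter_names(names, keywords=("router", "expert")):
    """Return the names that look MoE-related, based on simple substrings."""
    return [n for n in names if any(k in n for k in keywords)]


# Hypothetical names and sizes, for illustration only.
example = [
    ("model.embed_tokens.weight", 1024 * 96),
    ("model.layers.0.feed_forward.router.weight", 4 * 96),
    ("lm_head.weight", 1024 * 96),
]

print("Parameter count:", sum(n for _, n in example))
print("Prefix breakdown:", prefix_breakdown(example))
print("MoE-related names:", moe_parameter_names(name for name, _ in example))
```

With a loaded PyTorch model, the same helper can be fed from model.named_parameters(), for example prefix_breakdown((n, p.numel()) for n, p in model.named_parameters()).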

Training data

The model was trained on TinyStories-style English story text.

The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. The small vocabulary keeps the checkpoint compact, but it also limits text generation quality.

Training setup

Representative training settings:

num_epochs: 1
learning_rate: 3e-4
batch_size: 32
block_size: 256

vocab_size: 1024
hidden_size: 96
intermediate_size: 192
intermediate_size_mlp: 384
num_hidden_layers: 5
num_attention_heads: 4
num_key_value_heads: 1
head_dim: 24

layer_pattern: CFCFC
attention_chunk_size: 128

num_local_experts: 4
num_experts_per_tok: 2
moe_layers: all
interleave_moe_layer_step: 1
router_aux_loss_coef: 0.001
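
A quick arithmetic check shows why these settings exercise chunked attention during training. It assumes that block_size is the training sequence length in tokens, which the settings above do not state explicitly.

```python
batch_size = 32
block_size = 256            # assumed to be the sequence length in tokens
attention_chunk_size = 128

tokens_per_step = batch_size * block_size
# Ceiling division: number of attention chunks one training sequence spans.
chunks_per_sequence = -(-block_size // attention_chunk_size)

print(tokens_per_step)       # 8192
print(chunks_per_sequence)   # 2
```

Because each sequence spans two chunks, the chunked attention layers see a mask that differs from plain causal attention. With block_size at or below 128, both layer types would behave identically during training.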

The reference run can be reproduced with:

python train_llama4_moe_text_tinystories_epoch.py \
  --output-dir tinyllama4moe2m \
  --num-epochs 1 \
  --max-steps 0

A short smoke test can be run with:

python train_llama4_moe_text_tinystories_epoch.py \
  --output-dir tinyllama4moe_smoke \
  --max-rows 2000 \
  --max-steps 100 \
  --log-steps 10 \
  --eval-steps 50

Reference result

The reference run completed one epoch of TinyStories-style training.

Final loss: 2.458171248435974
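
For a rough sense of scale, the loss can be converted to perplexity. This assumes the reported value is a mean per-token cross-entropy in nats; the source does not say whether it is a training or evaluation loss.

```python
import math

final_loss = 2.458171248435974  # value reported above

# Valid only if the loss is mean cross-entropy per token, measured in nats.
perplexity = math.exp(final_loss)
print(f"perplexity: {perplexity:.2f}")
```

Under that assumption the perplexity is about 11.7 over a 1024-token vocabulary.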

Example generations from the reference checkpoint:

Prompt: Once upon

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, she saw a big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big,

Prompt: There was a little

There was a little girl named Lily. She loved to play with her toys and her friends. One day, she saw a big, big, big carrot in the park. She wanted to play with it, but she was very sad.

Lily saw a big dog who was playing with a ballrot. She wanted to play with it, but she was too big to play with. She asked her mom to help her. She said, "I can't go to the park, but I

Prompt: One day

One day, a little girl named Lily went to the park with her mom. They saw a big tree with a big bowl of cookies. Lily wanted to play with the big, but her mom said they could go to the park.

Lily was very happy. She saw a big tree with a big bowl of cookies. She wanted to play with it. She asked her mom, "Why are you sad?" Her mom said, "I'm sorry

The model can generate TinyStories-like fragments, but repetition, template collapse, and occasional invented words (such as "ballrot" above) are expected. None of these count as a failure for the checkpoint's intended validation purpose.

Usage

If the model files are stored under an hf/ subdirectory, use the following example.

import torch
from transformers import PreTrainedTokenizerFast, Llama4ForCausalLM

repo = "shibatch/tinyllama4moe2m"

tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Llama4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Loading requirements

This checkpoint requires a Transformers version that supports Llama4 and Llama4 MoE.

The following imports should work:

from transformers import Llama4ForCausalLM, Llama4TextConfig

If these imports fail, update Transformers to a version with Llama4 support.
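
The import check can also be scripted, for example at the start of a test suite. This is a minimal sketch; the function name has_llama4_support is hypothetical, and the check only confirms that the class names are exposed, not that a particular checkpoint loads.

```python
import importlib

REQUIRED_NAMES = ("Llama4ForCausalLM", "Llama4TextConfig")


def has_llama4_support():
    """Return True if the installed Transformers exposes the Llama4 text classes."""
    try:
        transformers = importlib.import_module("transformers")
        return all(hasattr(transformers, name) for name in REQUIRED_NAMES)
    except Exception:
        # Covers a missing package and errors raised while lazily importing.
        return False


if has_llama4_support():
    print("Llama4 classes are importable")
else:
    print("Llama4 classes not found; try: pip install --upgrade transformers")
```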

Tokenizer note

This repository uses a custom byte-level BPE tokenizer saved as a PreTrainedTokenizerFast.

For this reason, examples use:

from transformers import PreTrainedTokenizerFast

instead of AutoTokenizer.

Using AutoTokenizer may fail in some environments if the tokenizer backend cannot be inferred automatically.

The expected tokenizer files include:

tokenizer.json
tokenizer_config.json
special_tokens_map.json

Intended validation coverage

This checkpoint is intended to validate support for:

Llama4TextConfig
Llama4ForCausalLM
chunked_attention layers
full_attention layers
GQA with num_key_value_heads = 1
QK norm
Llama4 RMSNorm behavior
Llama4 MLP activation: silu
Llama4 MoE expert parameters
num_local_experts = 4
num_experts_per_tok = 2
moe_layers
expert routing
expert output combination
generate()
save_pretrained()
from_pretrained()

Limitations

This is a tiny debug model. It should not be used as a general-purpose language model.

Known limitations:

frequent phrase repetition may occur
TinyStories template collapse may occur
weak long-form coherence
small vocabulary
weak semantic consistency
no instruction tuning
no chat formatting
no multimodal capability
no OCR capability
not suitable for production use

The checkpoint is primarily intended to make Llama4 MoE text-model code paths easy to test without downloading a large model.

Why include MoE?

A dense tiny Llama4 model is simpler to train, but it does not cover MoE-specific implementation paths.

This checkpoint intentionally includes:

num_local_experts = 4
num_experts_per_tok = 2

to exercise a more realistic MoE routing path than a minimal top-1 routing configuration.

Why include chunked attention?

A full-attention-only tiny model may train more cleanly, but it would not cover Llama4 chunked attention behavior.

This checkpoint uses:

chunked_attention
full_attention
chunked_attention
full_attention
chunked_attention

to cover both attention implementations.
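
The difference between the two layer types can be pictured as two attention masks. The sketch below assumes that chunked attention lets a token attend only to earlier tokens within its own chunk of attention_chunk_size positions. That matches my understanding of Llama4 chunked attention, but treat it as an assumption rather than a description of the Transformers code. The tiny sizes are chosen for readability.

```python
def causal_mask(seq_len):
    """Full causal attention: position i may attend to every j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def chunked_causal_mask(seq_len, chunk_size):
    """Causal attention restricted to positions in the same chunk (assumed rule)."""
    return [
        [j <= i and i // chunk_size == j // chunk_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw a mask with 'x' for allowed and '.' for blocked positions."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(render(chunked_causal_mask(6, 3)))
# x.....
# xx....
# xxx...
# ...x..
# ...xx.
# ...xxx
```

When the sequence is no longer than one chunk, the two masks are identical, so an inference engine needs inputs longer than attention_chunk_size tokens (128 here) to actually test the chunked path.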

Notes on OCR and multimodal use

This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.

A Llama4 OCR or Llama4 multimodal MoE validation model would be a separate project. It would require a tiny multimodal Llama4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

Suggested repository name

Suggested Hugging Face repository name:

shibatch/tinyllama4moe2m

Alternative names:

shibatch/tinyllama4moetext2m
shibatch/tinyllama4-moe-text

Citation

This is a synthetic tiny validation checkpoint derived from Llama4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.
