Tiny Llama4 MoE Text
This repository contains a tiny Llama4 text-only Mixture-of-Experts causal language model for validation and debugging.
The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Llama4 MoE text-model code paths in Hugging Face Transformers.
This checkpoint is useful for implementation testing because it includes chunked attention, full attention, grouped-query attention, QK norm, and MoE routing with multiple experts.
Model purpose
This model is designed for:
- testing Llama4ForCausalLM
- validating Llama4TextConfig
- testing Llama4 text-only MoE model loading
- checking model save/load behavior
- checking tokenizer save/load behavior
- exercising chunked attention layers
- exercising full attention layers
- exercising grouped-query attention
- exercising QK norm paths
- exercising MoE expert parameters
- exercising top-k expert routing
- providing a compact Llama4 MoE checkpoint for inference-engine validation
It is not designed for:
- high-quality story generation
- instruction following
- chat use
- OCR
- multimodal inference
- benchmark comparison against production language models
- production deployment
Model architecture
The model uses Llama4ForCausalLM with a small Llama4 text MoE configuration.
Representative configuration:
model_type: llama4_text
vocab_size: 1024
hidden_size: 96
intermediate_size: 192
intermediate_size_mlp: 384
num_hidden_layers: 5
num_attention_heads: 4
num_key_value_heads: 1
head_dim: 24
max_position_embeddings: 1024
attention_chunk_size: 128
layer_types:
- chunked_attention
- full_attention
- chunked_attention
- full_attention
- chunked_attention
hidden_act: silu
tie_word_embeddings: false
attention_bias: false
rms_norm_eps: 1e-05
rope_theta: 500000.0
use_qk_norm: true
attn_temperature_tuning: false
floor_scale: 8192
attn_scale: 0.1
num_local_experts: 4
num_experts_per_tok: 2
moe_layers:
- 0
- 1
- 2
- 3
- 4
interleave_moe_layer_step: 1
router_aux_loss_coef: 0.001
router_jitter_noise: 0.0
output_router_logits: false
pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
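To make the grouped-query attention settings concrete, the sketch below derives the projection widths implied by the values above. It assumes the usual Llama-family layout, where the query width is num_attention_heads × head_dim and the key/value width is num_key_value_heads × head_dim. The function name and the (in_features, out_features) tuple convention are illustrative only and are not taken from the repository.

```python
def gqa_projection_shapes(hidden_size, num_attention_heads,
                          num_key_value_heads, head_dim):
    """Return (in_features, out_features) for each attention projection.

    Assumption: standard grouped-query attention layout, where several
    query heads share one key/value head.
    """
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must be a multiple of key/value heads")
    q_width = num_attention_heads * head_dim
    kv_width = num_key_value_heads * head_dim
    return {
        "q_proj": (hidden_size, q_width),
        "k_proj": (hidden_size, kv_width),
        "v_proj": (hidden_size, kv_width),
        "o_proj": (q_width, hidden_size),
        "queries_per_kv_head": num_attention_heads // num_key_value_heads,
    }


# Values from the representative configuration above.
shapes = gqa_projection_shapes(
    hidden_size=96,
    num_attention_heads=4,
    num_key_value_heads=1,
    head_dim=24,
)
for name, value in shapes.items():
    print(name, value)
```

With num_key_value_heads set to 1, all four query heads share a single key/value head, which is the most aggressive form of grouping and a useful edge case for inference engines.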
The attention pattern is:
CFCFC
where C means chunked_attention and F means full_attention.
This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the chunked attention path.
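The short pattern string maps directly onto the layer_types list in the configuration. A minimal sketch of that expansion follows; the helper name expand_layer_pattern is hypothetical and is not claimed to match the training script.

```python
# Mapping from the one-letter codes used in this card to layer type names.
LAYER_CODES = {"C": "chunked_attention", "F": "full_attention"}


def expand_layer_pattern(pattern):
    """Expand a string such as "CFCFC" into a list of layer type names."""
    layer_types = []
    for code in pattern.upper():
        if code not in LAYER_CODES:
            raise ValueError(f"unknown layer code: {code!r}")
        layer_types.append(LAYER_CODES[code])
    return layer_types


layer_types = expand_layer_pattern("CFCFC")
print(layer_types)
```

The resulting list has one entry per hidden layer, so its length should equal num_hidden_layers (5 here).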
MoE configuration
This model enables Llama4 MoE blocks.
num_local_experts: 4
num_experts_per_tok: 2
moe_layers: all layers
interleave_moe_layer_step: 1
router_aux_loss_coef: 0.001
The num_local_experts=4 and num_experts_per_tok=2 setting is intentional. A smaller configuration such as num_local_experts=2, num_experts_per_tok=1 would exercise only a much simpler routing path.
This checkpoint is intended to cover:
router parameters
multiple experts
top-2 expert selection
weighted expert combination
MoE FFN parameters
chunked and full attention interaction
MoE layer execution in a small model
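The idea behind top-2 selection and weighted expert combination can be sketched in plain Python. This is a generic illustration only: it normalizes the selected router logits with a softmax, and the Llama4 implementation in Transformers may use a different gating function and additional components. The toy experts and all function names are assumptions made for this example.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Assumption: weights are a softmax over the selected logits only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def combine_experts(x, experts, routes):
    """Weighted sum of the selected experts' outputs for one token."""
    return sum(weight * experts[i](x) for i, weight in routes)


# Four toy "experts" that simply scale their input (num_local_experts = 4).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# One token's router logits; experts 1 and 3 tie for the top score.
routes = route_top_k([0.1, 2.0, -1.0, 2.0], k=2)
y = combine_experts(10.0, experts, routes)
print(routes, y)
```

With num_experts_per_tok = 1 the weighted sum collapses to a single term, which is why a top-1 configuration exercises less of the combination logic.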
Parameter count
The exact parameter count depends on the saved checkpoint configuration. The default training script prints the parameter count at startup.
For the default configuration, check the run log for:
Parameter count: ...
MoE/router/expert parameter names found: ...
Prefix breakdown:
...
After training, the metadata file also records the count:
artifact_metadata.json
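A per-prefix breakdown like the one in the log can be approximated with a small helper. The sketch below is not the training script's code; it groups (name, element count) pairs by the leading components of the parameter name. The three example names are illustrative placeholders, with sizes derived from vocab_size = 1024 and hidden_size = 96.

```python
from collections import Counter


def prefix_breakdown(named_sizes, depth=2):
    """Sum element counts by the first `depth` dot-separated name parts."""
    totals = Counter()
    for name, numel in named_sizes:
        prefix = ".".join(name.split(".")[:depth])
        totals[prefix] += numel
    return dict(totals)


# Illustrative entries only; real parameter names may differ.
example = [
    ("model.embed_tokens.weight", 1024 * 96),
    ("model.norm.weight", 96),
    ("lm_head.weight", 1024 * 96),
]
breakdown = prefix_breakdown(example)
total = sum(breakdown.values())
print(breakdown, total)
```

For a loaded PyTorch model, the same helper can be fed with `((name, p.numel()) for name, p in model.named_parameters())`.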
Training data
The model was trained on TinyStories-style English story text.
The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. The small vocabulary keeps the checkpoint compact, but it also limits text generation quality.
Training setup
Representative training settings:
num_epochs: 1
learning_rate: 3e-4
batch_size: 32
block_size: 256
vocab_size: 1024
hidden_size: 96
intermediate_size: 192
intermediate_size_mlp: 384
num_hidden_layers: 5
num_attention_heads: 4
num_key_value_heads: 1
head_dim: 24
layer_pattern: CFCFC
attention_chunk_size: 128
num_local_experts: 4
num_experts_per_tok: 2
moe_layers: all
interleave_moe_layer_step: 1
router_aux_loss_coef: 0.001
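Two quantities follow from these settings by simple arithmetic: the number of tokens processed per optimizer step and the number of attention chunks each training sequence spans. The sketch assumes every sequence is filled to block_size and that no gradient accumulation is used; neither is stated in this card.

```python
def tokens_per_step(batch_size, block_size):
    """Tokens seen per step, assuming full blocks and no accumulation."""
    return batch_size * block_size


def chunks_per_block(block_size, attention_chunk_size):
    """Number of attention chunks a sequence spans (ceiling division)."""
    return -(-block_size // attention_chunk_size)


step_tokens = tokens_per_step(batch_size=32, block_size=256)
n_chunks = chunks_per_block(block_size=256, attention_chunk_size=128)
print(step_tokens)  # 8192
print(n_chunks)     # 2
```

Because block_size (256) is larger than attention_chunk_size (128), each training sequence crosses a chunk boundary, so the chunked attention layers behave differently from the full attention layers during training.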
The reference run can be reproduced with:
python train_llama4_moe_text_tinystories_epoch.py \
    --output-dir tinyllama4moe2m \
    --num-epochs 1 \
    --max-steps 0
A short smoke test can be run with:
python train_llama4_moe_text_tinystories_epoch.py \
    --output-dir tinyllama4moe_smoke \
    --max-rows 2000 \
    --max-steps 100 \
    --log-steps 10 \
    --eval-steps 50
Reference result
The reference run completed one epoch of TinyStories-style training.
Final loss: 2.458171248435974
Example generations from the reference checkpoint:
Prompt: Once upon
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, she saw a big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big, big,
Prompt: There was a little
There was a little girl named Lily. She loved to play with her toys and her friends. One day, she saw a big, big, big carrot in the park. She wanted to play with it, but she was very sad.
Lily saw a big dog who was playing with a ballrot. She wanted to play with it, but she was too big to play with. She asked her mom to help her. She said, "I can't go to the park, but I
Prompt: One day
One day, a little girl named Lily went to the park with her mom. They saw a big tree with a big bowl of cookies. Lily wanted to play with the big, but her mom said they could go to the park.
Lily was very happy. She saw a big tree with a big bowl of cookies. She wanted to play with it. She asked her mom, "Why are you sad?" Her mom said, "I'm sorry
The model can generate TinyStories-like fragments, but repetition, template collapse, and occasional invented words are expected. This is normal for this checkpoint and is not considered a failure for its intended validation purpose.
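When comparing outputs between implementations, it can help to quantify repetition rather than judge it by eye. The helper below is a hypothetical convenience, not part of this repository: it reports the fraction of word n-grams in a text that repeat an earlier n-gram.

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram.

    Returns 0.0 for texts shorter than n words.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


looping = repeated_ngram_fraction("big big big big big big")
varied = repeated_ngram_fraction("a little girl went to the park")
print(looping, varied)
```

A high value on greedy output from this checkpoint is expected. For validation, the more useful signal is whether two implementations produce the same token sequence for the same prompt.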
Usage
If the model files are stored under an hf/ subdirectory, use the following example.
import torch
from transformers import PreTrainedTokenizerFast, Llama4ForCausalLM

repo = "shibatch/tinyllama4moe2m"

# Load the tokenizer and model from the hf/ subdirectory of the repository.
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Llama4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding, so the output is deterministic for a given checkpoint.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
Loading requirements
This checkpoint requires a Transformers version that supports Llama4 and Llama4 MoE.
The following imports should work:
from transformers import Llama4ForCausalLM, Llama4TextConfig
If these imports fail, update Transformers to a version with Llama4 support.
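The import check can also be done programmatically, for example at the start of a test suite. The sketch below only probes for the two class names mentioned above and makes no claim about which Transformers release first shipped them; the function name is an assumption.

```python
import importlib

REQUIRED_NAMES = ("Llama4ForCausalLM", "Llama4TextConfig")


def has_llama4_support():
    """Return True if the installed transformers exposes the Llama4 classes."""
    try:
        module = importlib.import_module("transformers")
        for name in REQUIRED_NAMES:
            getattr(module, name)  # raises if the class is unavailable
    except Exception:
        # Covers a missing package, missing attributes, and import errors
        # raised while resolving the class.
        return False
    return True


supported = has_llama4_support()
print("Llama4 support available:", supported)
```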
Tokenizer note
This repository uses a custom byte-level BPE tokenizer saved as a PreTrainedTokenizerFast.
For this reason, examples use:
from transformers import PreTrainedTokenizerFast
instead of AutoTokenizer.
Using AutoTokenizer may fail in some environments if the tokenizer backend cannot be inferred automatically.
The expected tokenizer files include:
tokenizer.json
tokenizer_config.json
special_tokens_map.json
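Before loading from a local directory, a quick check for these three files can give a clearer error than a failed tokenizer load. This is a minimal sketch using only the standard library; the function name is hypothetical, and the demo uses a temporary directory rather than a real checkpoint.

```python
import tempfile
from pathlib import Path

EXPECTED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_tokenizer_files(directory):
    """Return the expected tokenizer files that are absent from directory."""
    root = Path(directory)
    return [name for name in EXPECTED_TOKENIZER_FILES
            if not (root / name).is_file()]


# Demo: a directory that contains only tokenizer.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    missing = missing_tokenizer_files(tmp)

print(missing)
```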
Intended validation coverage
This checkpoint is intended to validate support for:
Llama4TextConfig
Llama4ForCausalLM
chunked_attention layers
full_attention layers
GQA with num_key_value_heads = 1
QK norm
Llama4 RMSNorm behavior
Llama4 MLP activation: silu
Llama4 MoE expert parameters
num_local_experts = 4
num_experts_per_tok = 2
moe_layers
expert routing
expert output combination
generate()
save_pretrained()
from_pretrained()
Limitations
This is a tiny debug model. It should not be used as a general-purpose language model.
Known limitations:
- frequent phrase repetition may occur
- TinyStories template collapse may occur
- weak long-form coherence
- small vocabulary
- weak semantic consistency
- no instruction tuning
- no chat formatting
- no multimodal capability
- no OCR capability
- not suitable for production use
The checkpoint is primarily intended to make Llama4 MoE text-model code paths easy to test without downloading a large model.
Why include MoE?
A dense tiny Llama4 model is simpler to train, but it does not cover MoE-specific implementation paths.
This checkpoint intentionally includes:
num_local_experts = 4
num_experts_per_tok = 2
to exercise a more realistic MoE routing path than a minimal top-1 routing configuration.
Why include chunked attention?
A full-attention-only tiny model may train more cleanly, but it would not cover Llama4 chunked attention behavior.
This checkpoint uses:
chunked_attention
full_attention
chunked_attention
full_attention
chunked_attention
to cover both attention implementations.
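The difference between the two layer types can be illustrated with boolean attention masks. The sketch below reflects the general idea of chunk-local causal attention: a token attends only to earlier tokens inside its own chunk. It is an assumption-based illustration, and the actual mask construction in Transformers may differ in detail.

```python
def causal_mask(seq_len):
    """Full causal mask: position i may attend to every j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def chunked_causal_mask(seq_len, chunk_size):
    """Causal mask restricted to positions within the same chunk."""
    return [
        [j <= i and i // chunk_size == j // chunk_size
         for j in range(seq_len)]
        for i in range(seq_len)
    ]


full = causal_mask(6)
chunked = chunked_causal_mask(6, chunk_size=3)

# Print the chunked mask as rows of 1s and 0s.
for row in chunked:
    print("".join("1" if allowed else "0" for allowed in row))
```

For sequences no longer than attention_chunk_size (128 here), both masks are identical, so a validation prompt needs more than 128 tokens to actually distinguish the chunked path from the full path.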
Notes on OCR and multimodal use
This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.
A Llama4 OCR or Llama4 multimodal MoE validation model would be a separate project. It would require a tiny multimodal Llama4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.
Suggested repository name
Suggested Hugging Face repository name:
shibatch/tinyllama4moe2m
Alternative names:
shibatch/tinyllama4moetext2m
shibatch/tinyllama4-moe-text
Citation
This is a synthetic tiny validation checkpoint derived from Llama4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.