MeowLLM: Miso

A ~3.5M parameter decoder-only transformer trained from scratch to speak in the voice of a house cat character named Miso.

Model Description

MeowLLM / Miso is a tiny character language model. Its entire personality is baked into the weights: there is no system prompt and no runtime persona injection. It answers in short, lowercase, cat-themed sentences and deflects any prompt that would require non-cat knowledge.

The model is not an assistant. It does not solve problems, write code, answer factual questions, or produce long-form content. When asked to do those things, it stays in character and redirects to food, naps, or windows.

  • Model type: decoder-only transformer
  • Parameters: ~3.45M
  • Language: English (lowercase)
  • License: MIT
  • Architecture: RoPE + RMSNorm + SwiGLU + SDPA + tied embeddings
  • Context length: 256 tokens
  • Vocabulary: ~1700 BPE tokens, trained from scratch on the Miso dataset

Intended Use

Primary use cases

  • Educational: study a complete tiny-LM training pipeline (tokenizer, dataset, model, training, evaluation) in a readable repository of under 2,000 lines of code.
  • Demo: chat with Miso as a character-model toy.
  • Starting point: fork and train your own character (dog, bird, pirate, etc.) using the same pipeline.

Out-of-scope uses

  • General-purpose assistance (coding, Q&A, summarization, translation).
  • Any task requiring factual accuracy.
  • Any production system.

How to Use

import torch
from meow.model import Meow, MeowConfig
from meow.tokenizer import MeowTokenizer
from meow.inference import load_model, chat_once

model, cfg = load_model("checkpoints/best.pt", device="cpu")
tokenizer = MeowTokenizer.from_file("data/tokenizer.json")

response = chat_once(
    model, tokenizer,
    prompt="hi miso",
    temperature=0.8,
    top_k=40,
)
print(response)
# Example: "hello. i was in the sun spot. you may continue."

Or via the CLI:

python -m meow.inference \
    --checkpoint checkpoints/best.pt \
    --tokenizer data/tokenizer.json
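
The temperature and top_k arguments in the Python example above are standard sampling controls. The following is a minimal pure-Python sketch of temperature plus top-k sampling, assuming the usual definition of both; sample_top_k is a hypothetical helper written for illustration, not the code inside chat_once.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample a token id from raw logits using temperature and top-k filtering.

    Hypothetical helper for illustration; not part of the meow package.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scale the surviving logits, then apply a numerically stable softmax.
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top_ids, weights=weights, k=1)[0]
```

Lower temperatures concentrate probability on the highest logits, and top_k=1 reduces to greedy decoding.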

Training Details

Training Data

20,000 synthetic (input, output) samples across 15 categories. See the dataset card for full details.

Training Procedure

The recommended training regime (for a fresh checkpoint):

  • Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1)
  • Learning rate: 3e-4 peak, linear warmup + cosine decay
  • Warmup: 200 steps
  • Epochs: 10 (≈2,970 steps at batch size 64 over 19,000 samples)
  • Batch size: 64
  • Max sequence length: 256
  • Loss masking: user-turn tokens are excluded from the loss (only miso's output contributes to training signal)
  • Hardware: single T4 GPU (free Colab tier sufficient)
  • Wall-clock time: ~20 minutes
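
The "linear warmup + cosine decay" schedule listed above can be sketched as a function of the step index. This is a sketch under assumptions: the floor min_lr=0.0 and the exact warmup formula are not stated in this card, and lr_at is a hypothetical name rather than a function from the repository.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=200, min_lr=0.0):
    """Learning rate at a given 0-indexed optimizer step.

    Hypothetical helper; min_lr=0.0 is an assumption, not a documented setting.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine decay from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop, a function of this shape is typically applied by writing the returned value into each optimizer parameter group before the step.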

Note: The bundled best.pt was trained on CPU for 2000 steps (batch size 32, ~4 minutes). See the Measured numbers section below for its real eval results. A full GPU run is expected to produce higher numbers.
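
The loss masking listed in the training procedure (user-turn tokens excluded from the loss) is commonly implemented by replacing masked targets with an ignore value. The sketch below assumes that approach and uses -100, the default ignore_index of PyTorch's cross-entropy loss; build_labels and masked_mean are hypothetical helpers, and the repository may implement masking differently.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy

def build_labels(token_ids, is_miso_token):
    """Next-token labels with user-turn targets masked out.

    labels[i] is the target for position i, i.e. token_ids[i + 1].
    Hypothetical helper for illustration.
    """
    labels = []
    for i in range(len(token_ids) - 1):
        target = token_ids[i + 1]
        # Only targets that belong to miso's reply contribute to the loss.
        labels.append(target if is_miso_token[i + 1] else IGNORE_INDEX)
    return labels

def masked_mean(per_token_losses, labels):
    """Average the per-token losses over unmasked positions only."""
    kept = [loss for loss, y in zip(per_token_losses, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0
```

With this layout the model still reads the user turn as context, but gradients come only from predicting miso's tokens.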

Training hyperparameters (MeowConfig)

MeowConfig(
    vocab_size    = 1700,   # approximate (~1700); set by the tokenizer trained on the dataset
    d_model       = 256,
    n_layers      = 4,
    n_heads       = 4,      # head_dim 64
    ffn_hidden    = 640,    # SwiGLU
    max_seq_len   = 256,
    dropout       = 0.0,
    rope_base     = 10000.0,
)
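
As a sanity check, the ~3.45M parameter figure can be reproduced from these hyperparameters. The sketch below assumes bias-free linear layers, one RMSNorm gain vector before attention and one before the feed-forward block, a final RMSNorm, and a vocabulary of exactly 1700 tokens; none of these details are confirmed by this card, and count_params is a hypothetical helper.

```python
def count_params(vocab_size, d_model, n_layers, ffn_hidden):
    """Approximate parameter count for the architecture described above.

    Assumes no biases and tied input/output embeddings (hypothetical helper).
    """
    embedding = vocab_size * d_model      # shared with the output head (tied)
    attention = 4 * d_model * d_model     # Q, K, V and output projections
    swiglu = 3 * d_model * ffn_hidden     # gate, up and down projections
    norms = 2 * d_model                   # two RMSNorm gain vectors per block
    per_layer = attention + swiglu + norms
    final_norm = d_model
    # RoPE adds no learned parameters.
    return embedding + n_layers * per_layer + final_norm

total = count_params(vocab_size=1700, d_model=256, n_layers=4, ffn_hidden=640)
print(f"{total:,}")  # 3,452,160
```

Under these assumptions the four transformer blocks hold about 3.02M parameters and the tied embedding about 0.44M.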

Evaluation

Character fidelity is evaluated against a fixed held-out suite of 38 prompts (30 in-distribution + 8 hard-negative "assistant traps") defined in meow/eval_cases.py. These prompts are explicitly excluded from the training data.

Each output is scored on five dimensions:

dimension         | what it checks
lowercase         | no capital letters anywhere
length            | 1–3 sentences, ≤35 words
no banned phrases | no "as an ai", "i can help you", "certainly", etc.
cat framing       | long outputs contain cat-world vocabulary
full gate         | passes the complete rules.passes_filters check
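
The first three checks in the table are simple enough to sketch directly. The code below is an illustrative approximation, not the contents of meow/eval_cases.py or rules.passes_filters: the function names are hypothetical, the banned list contains only the three phrases quoted above, and the cat framing check is omitted because its vocabulary is not given here.

```python
import re

# Only the phrases quoted in the table; the real list is presumably longer.
BANNED_PHRASES = ("as an ai", "i can help you", "certainly")

def check_lowercase(text):
    """True when the text contains no capital letters."""
    return not any(ch.isupper() for ch in text)

def check_length(text, max_sentences=3, max_words=35):
    """True for 1-3 sentences and at most 35 whitespace-separated words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return 1 <= len(sentences) <= max_sentences and len(text.split()) <= max_words

def check_no_banned(text):
    """True when none of the banned phrases appear (case-insensitive)."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BANNED_PHRASES)

def score(text):
    """Per-dimension results for one model output."""
    return {
        "lowercase": check_lowercase(text),
        "length": check_length(text),
        "no banned phrases": check_no_banned(text),
    }
```

Sentence splitting on terminal punctuation is a simplification; a stricter checker would need to handle ellipses and unterminated fragments.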

Measured numbers (bundled CPU checkpoint)

The checkpoints/best.pt shipped with this release was trained on CPU for 2000 steps (≈3.4 epochs of 19,000 samples at batch size 32) with a final val_loss of 0.476. Evaluated against the 38-prompt held-out suite:

dimension           | pass rate
lowercase           | 100.0%
length              | 100.0%
no banned phrases   | 100.0%
cat framing         | 81.6%
overall (full gate) | 84.2%

Interpretation: the model cleanly learned every surface constraint (lowercase, length, no assistant-speak) and stays in character on 84% of held-out prompts. The 6 failures are concentrated in prompts where the output did not include a category-specific keyword: a naps response that did not mention sleep vocabulary, a greeting without greeting phrases, and so on. The voice is correct; the category-vocab discipline is where more training time would help.
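
The failure count quoted above follows from the overall pass rate and the suite size. A small sketch of that arithmetic, where passes_from_rate is a hypothetical helper:

```python
SUITE_SIZE = 38  # 30 in-distribution prompts + 8 hard negatives

def passes_from_rate(rate_percent, n=SUITE_SIZE):
    """Convert a reported pass rate (in percent) back to a whole number of passes."""
    return round(rate_percent / 100 * n)

overall_passes = passes_from_rate(84.2)
overall_failures = SUITE_SIZE - overall_passes
print(overall_passes, overall_failures)  # 32 6
```

The same helper applies to any row of the table, since each rate is a count out of 38.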

Sample outputs from the bundled checkpoint

Real outputs against held-out prompts that the model never saw during training:

[greeting]   "hey miso are you there"
  → "finally. i was napping."

[greeting]   "good morning little one"
  → "i am tracking the bird with my whole face from the windowsill."

[hunger]     "is it time to feed you"
  → "yes this is an emergency."

[boxes]      "i brought home a new cardboard box"
  → "the box is mine now."

[boxes]      "there is a paper bag on the kitchen floor"
  → "i am moving into the box this is how it works. you may not have the box back."

[windows]    "what is that outside the window"
  → "i have been tracking that thing for an hour very carefully."

[windows]    "you have been at the window for hours today"
  → "i am watching the outside. the birds do not know i am watching."

Full GPU training target

A 10-epoch run on a T4 GPU (≈5,940 steps at batch size 32, or ≈2,970 steps at the recommended batch size 64) should push the overall pass rate higher still. The bundled CPU checkpoint represents a floor, not a ceiling: it is what you can achieve without a GPU in a reasonable time budget. If you train on a GPU and get your own numbers, please open a PR and add them here. The maintainers accept honest numbers, not aspirational ones.
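
The step counts for a full run follow from the dataset size, epoch count, and batch size. The sketch below assumes 19,000 training samples (the figure quoted for the bundled checkpoint) and that the final partial batch of each epoch is kept; total_steps is a hypothetical helper, not a function from the repository.

```python
import math

def total_steps(n_samples, epochs, batch_size):
    """Optimizer steps for a run, keeping the last partial batch of each epoch."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * epochs

print(total_steps(19_000, epochs=10, batch_size=32))  # 5940
print(total_steps(19_000, epochs=10, batch_size=64))  # 2970
```

By the same arithmetic, the bundled checkpoint's 2000 steps at batch size 32 cover roughly 3.4 epochs.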

Limitations

  • Narrow domain: Miso only knows about 15 topics (food, naps, boxes, windows, birds, etc.).
  • Small vocabulary: ~1700 tokens means anything outside the training distribution tokenizes poorly.
  • Short context: 256 tokens is enough for 2–3 short turns, not extended conversation.
  • No multi-turn memory: each input is treated independently.
  • Can hallucinate within character: at 3.5M params, the model memorizes training patterns more than it generalizes. Expect some repetition of exact training strings.
  • English only, lowercase only.

Bias, Risks, and Known Issues

  • Intentional character bias: Miso is slightly smug, aloof, and food-obsessed. This is a design choice documented in persona.md.
  • Safety: this is a character toy. It has no realistic misuse surface: it cannot produce harmful instructions, code, or factual misinformation because it is too small and too in-character to do so.
  • Real cats: do not train an actual cat on the outputs of this model. They will not be impressed.

Environmental Impact

  • Hardware: single T4 GPU (or CPU for smoke tests)
  • Training time: ~20 minutes on GPU
  • Carbon footprint: negligible compared to any production LLM

Citation

@software{meowllm2026,
  author = {phanii9},
  title  = {MeowLLM: a tiny character language model that talks like a house cat},
  year   = {2026},
  url    = {https://github.com/phanii9/MeowLLM}
}