Micro-Terse

GitHub Technical Report License

1. Model Introduction

Micro-Terse is a 423M-parameter (≈320M active) ternary-weight language model trained from scratch for ≈**$150**, deployable as a 182 MB CPU-only GGUF. Its weights are constrained to {−1, 0, +1} (≈1.58 bits), so TQ2_0 packs them exactly; the released 182 MB file pairs that with a Q6_K tied embedding.

It is a research proof-of-concept, not a production assistant. At an 8B-token budget it is data-limited: fluent for a clause or two, near chance on knowledge benchmarks. The point is capability per megabyte and per joule — a from-scratch ternary model an individual can train and run on owned hardware.

Key Features

  • Ternary weights {−1, 0, +1} on all internal projections.
  • Clean-room architecture and ternary training operator.
  • 182 MB GGUF (ternary weights packed exactly; Q6_K tied embedding), CPU-only inference.
  • Trained from scratch for ≈$150 on a single RTX A6000.

Model Variants

File Stage Best for
terse-micro-base.TQ2_0.gguf Pretrained LM next-token prediction / completion
terse-micro-sft.TQ2_0.gguf Supervised fine-tuned chat (most fluent)
terse-micro-orpo.TQ2_0.gguf ORPO-aligned identity-aligned responses

2. Model Overview

Property Value
Total / active parameters ≈423 M / ≈320 M (MoE top-2)
Layers / hidden 12 / 1024
Attention GQA 8 query / 2 KV heads (4:1), head dim 128, QK-Norm before RoPE (θ=500000)
FFN 2816 intermediate, squared-ReLU gated
MoE 4 experts, top-2, odd layers; aux-loss-free bias-EMA balancing
MTP 1 head (training only, dropped at inference)
Embeddings tied input/output, full precision (~31% of params)
Tokenizer Llama-3.1 (128,256 vocab)
Context 4096

3. Training

Stage Details
Pretraining 8B tokens FineWeb-Edu; AdamW; LR 3e-4 → 3e-5 cosine; 488,282 steps; bf16; MTP aux 0.1
SFT 3 epochs, 44,558 ChatML conversations, prompt-masked loss
ORPO 1 epoch, ~3,500 identity/charter preference pairs, reference-free
Hardware 1× RTX A6000 48 GB, ≈250 GPU-hours, ≈$150 total
Export F32 GGUF (lossless for ternary) → TQ2_0 ≈ 182 MB

4. Evaluation (measured)

Standard academic benchmarks (MMLU/HellaSwag/ARC) were not run; at this data budget knowledge accuracy is expected near chance. What we measured:

  • Perplexity (held-out English, lower better): base 56.7, SFT 97.5, ORPO 125.0.
  • Identity preference (mean log-prob margin, charter vs "ChatGPT", 4 probes): base −1.81 (0/4) → SFT −1.09 (0/4) → ORPO +0.90 (3/4).
  • Single-token factual recall (base, top-1): "…painted by Leonardo da" → Vinci (90%), "…Neil" → Armstrong (84%), "hydrogen and" → oxygen (73%), "…revolves around the" → sun (66%). ≈14/18 curated prompts correct.

5. Quickstart

The model uses a custom terse architecture, so it needs the small llama.cpp fork (branch terse-arch). After building it:

huggingface-cli download MicheRomChis/micro-terse terse-micro-sft.TQ2_0.gguf --local-dir .
./llama-cli -m terse-micro-sft.TQ2_0.gguf -p "Hello" -n 128

Use terse-micro-base.TQ2_0.gguf for completion and terse-micro-orpo.TQ2_0.gguf for identity-aligned output.

6. Limitations

  • Not a production assistant. Free-generation is incoherent beyond a clause or two (GPT-2-medium-class); it is data-limited.
  • Near-chance on knowledge/reasoning benchmarks is expected. Do not use for factual QA without retrieval.
  • May hallucinate and reflect web-text biases; no safety tuning beyond the ORPO pass.
  • Ternary gives no training-memory savings (STE keeps fp masters); the win is inference footprint/energy.

7. License

Apache-2.0.

8. Citation

@techreport{romerochisco2026tersemicro,
  title  = {Terse-Micro: A 423M-Parameter Ternary-Weight Language Model Trained From Scratch for \$150},
  author = {Romero Chisco, Michelangelo},
  year   = {2026},
  note   = {Apache-2.0. github.com/michelangeloromerochisco/micro-terse}
}
Downloads last month
1,160
GGUF
Model size
0.4B params
Architecture
terse
Hardware compatibility
Log In to add your hardware

2-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support