1. Model Introduction
Micro-Terse is a 423M-parameter (≈320M active) ternary-weight language model trained from
scratch for ≈**$150**, deployable as a 182 MB CPU-only GGUF. Its weights are constrained to
{−1, 0, +1} (≈1.58 bits), so TQ2_0 packs them exactly; the released 182 MB file pairs that with a Q6_K tied embedding.
It is a research proof-of-concept, not a production assistant. At an 8B-token budget it is data-limited: fluent for a clause or two, and expected to score near chance on knowledge benchmarks (none were run). The point is capability per megabyte and per joule: a from-scratch ternary model that an individual can train and run on hardware they own.
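To illustrate why a 2-bit format can store {−1, 0, +1} weights without loss, here is a minimal sketch that packs four ternary values per byte and unpacks them again. The function names are hypothetical and the byte layout is an assumption chosen for clarity; it is not llama.cpp's actual TQ2_0 memory layout, which also stores a per-block scale.

```python
import math

# Information content of one ternary value: log2(3), about 1.58 bits.
BITS_PER_TRIT = math.log2(3)


def pack_ternary(weights):
    """Pack values from {-1, 0, +1} four per byte using 2-bit codes 0, 1, 2.

    Illustrative layout only (assumption), not the real TQ2_0 block format.
    """
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be ternary")
    # Pad with zeros so the length is a multiple of four.
    padded = list(weights) + [0] * (-len(weights) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for j, w in enumerate(padded[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # map -1/0/+1 to codes 0/1/2
        out.append(byte)
    return bytes(out)


def unpack_ternary(packed, count):
    """Inverse of pack_ternary; returns the first `count` values."""
    values = []
    for byte in packed:
        for j in range(4):
            values.append(((byte >> (2 * j)) & 0b11) - 1)
    return values[:count]
```

Each 2-bit code has four possible states and only three are used, which is why the round trip is exact while spending slightly more than the ≈1.58-bit information content per weight.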
Key Features
- Ternary weights {−1, 0, +1} on all internal projections.
- Clean-room architecture and ternary training operator.
- 182 MB GGUF (ternary weights packed exactly; Q6_K tied embedding), CPU-only inference.
- Trained from scratch for ≈$150 on a single RTX A6000.
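The 182 MB figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes that every non-embedding parameter is stored as TQ2_0 and the tied embedding as Q6_K, and it uses the block sizes from upstream ggml as I understand them (66 bytes per 256 weights for TQ2_0, 210 bytes per 256 weights for Q6_K). It ignores GGUF metadata and the small full-precision norm tensors, so treat it as an approximation rather than an exact accounting of the file.

```python
# Figures taken from the model card.
VOCAB, HIDDEN = 128_256, 1024
TOTAL_PARAMS = 423e6  # approximate total reported in the card

# Assumed ggml block sizes: bytes per 256-weight block, converted to bits per weight.
TQ2_0_BITS_PER_WEIGHT = 66 * 8 / 256   # 2.0625
Q6_K_BITS_PER_WEIGHT = 210 * 8 / 256   # 6.5625


def estimate_gguf_megabytes():
    """Rough file size in MB (10**6 bytes), ignoring metadata and norm tensors."""
    embedding_params = VOCAB * HIDDEN
    ternary_params = TOTAL_PARAMS - embedding_params  # assumption: all are ternary
    embedding_bytes = embedding_params * Q6_K_BITS_PER_WEIGHT / 8
    ternary_bytes = ternary_params * TQ2_0_BITS_PER_WEIGHT / 8
    return (embedding_bytes + ternary_bytes) / 1e6


print(f"estimated size: {estimate_gguf_megabytes():.1f} MB")
```

Under these assumptions the estimate comes out at roughly 183 MB, with the Q6_K embedding accounting for more than half of the file.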
Model Variants
| File | Stage | Best for |
|---|---|---|
| terse-micro-base.TQ2_0.gguf | Pretrained LM | next-token prediction / completion |
| terse-micro-sft.TQ2_0.gguf | Supervised fine-tuned | chat (most fluent) |
| terse-micro-orpo.TQ2_0.gguf | ORPO-aligned | identity-aligned responses |
2. Model Overview
| Property | Value |
|---|---|
| Total / active parameters | ≈423 M / ≈320 M (MoE top-2) |
| Layers / hidden | 12 / 1024 |
| Attention | GQA 8 query / 2 KV heads (4:1), head dim 128, QK-Norm before RoPE (θ=500000) |
| FFN | 2816 intermediate, squared-ReLU gated |
| MoE | 4 experts, top-2, odd layers; aux-loss-free bias-EMA balancing |
| MTP | 1 head (training only, dropped at inference) |
| Embeddings | tied input/output, full precision (~31% of params) |
| Tokenizer | Llama-3.1 (128,256 vocab) |
| Context | 4096 |
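The total and active parameter counts can be roughly reproduced from the table. This sketch makes several assumptions that the card does not state: no bias terms, norm parameters ignored, a gated FFN made of three matrices (gate, up, down), "odd layers" meaning 6 of the 12 layers, a plain linear router per MoE layer, and the training-only MTP head excluded.

```python
# Values from the overview table.
VOCAB = 128_256
HIDDEN = 1024
LAYERS = 12
Q_HEADS, KV_HEADS, HEAD_DIM = 8, 2, 128
FFN_INTERMEDIATE = 2816
EXPERTS, TOP_K = 4, 2
MOE_LAYERS = LAYERS // 2  # assumption: "odd layers" means 6 of the 12 layers


def count_params():
    """Approximate parameter counts; biases, norms and the MTP head are ignored."""
    embedding = VOCAB * HIDDEN  # tied, so counted once
    q_dim, kv_dim = Q_HEADS * HEAD_DIM, KV_HEADS * HEAD_DIM
    # Q and O projections plus the smaller K and V projections (GQA).
    attention = 2 * HIDDEN * q_dim + 2 * HIDDEN * kv_dim
    ffn = 3 * HIDDEN * FFN_INTERMEDIATE  # gate, up, down (assumed)
    router = HIDDEN * EXPERTS            # assumed linear router
    dense_layers = LAYERS - MOE_LAYERS
    shared = (embedding + LAYERS * attention
              + dense_layers * ffn + MOE_LAYERS * router)
    total = shared + MOE_LAYERS * EXPERTS * ffn
    active = shared + MOE_LAYERS * TOP_K * ffn  # only top-2 experts run per token
    return {"embedding": embedding, "total": total, "active": active}


counts = count_params()
print({name: f"{value / 1e6:.1f}M" for name, value in counts.items()})
print(f"embedding share: {counts['embedding'] / counts['total']:.1%}")
```

With these assumptions the counts land close to the reported ≈423 M total, ≈320 M active, and ~31% embedding share, which suggests the table is internally consistent.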
3. Training
| Stage | Details |
|---|---|
| Pretraining | 8B tokens FineWeb-Edu; AdamW; LR 3e-4 → 3e-5 cosine; 488,282 steps; bf16; MTP aux 0.1 |
| SFT | 3 epochs, 44,558 ChatML conversations, prompt-masked loss |
| ORPO | 1 epoch, ~3,500 identity/charter preference pairs, reference-free |
| Hardware | 1× RTX A6000 48 GB, ≈250 GPU-hours, ≈$150 total |
| Export | F32 GGUF (lossless for ternary) → TQ2_0 ≈ 182 MB |
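The training figures imply a batch size and an hourly rate that the table does not spell out. The sketch below derives them from the reported numbers only; the throughput line additionally assumes that all ≈250 GPU-hours went to pretraining, which overstates the time actually spent there because SFT and ORPO are included in the total.

```python
# Figures from the training table.
PRETRAIN_TOKENS = 8e9
PRETRAIN_STEPS = 488_282
CONTEXT = 4096
GPU_HOURS = 250
TOTAL_COST_USD = 150


def training_arithmetic():
    """Derived quantities; approximate because the inputs are rounded."""
    tokens_per_step = PRETRAIN_TOKENS / PRETRAIN_STEPS
    return {
        "tokens_per_step": tokens_per_step,
        "sequences_per_step": tokens_per_step / CONTEXT,
        "usd_per_gpu_hour": TOTAL_COST_USD / GPU_HOURS,
        # Lower bound on pretraining throughput (assumes all hours were pretraining).
        "tokens_per_second": PRETRAIN_TOKENS / (GPU_HOURS * 3600),
    }


for name, value in training_arithmetic().items():
    print(f"{name}: {value:,.2f}")
```

The step count is consistent with about 16,384 tokens per optimizer step, i.e. roughly four full-context sequences, at about $0.60 per GPU-hour.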
4. Evaluation (measured)
Standard academic benchmarks (MMLU/HellaSwag/ARC) were not run; at this data budget knowledge accuracy is expected near chance. What we measured:
- Perplexity (held-out English, lower better): base 56.7, SFT 97.5, ORPO 125.0.
- Identity preference (mean log-prob margin, charter vs "ChatGPT", 4 probes): base −1.81 (0/4) → SFT −1.09 (0/4) → ORPO +0.90 (3/4).
- Single-token factual recall (base, top-1): "…painted by Leonardo da" → Vinci (90%), "…Neil" → Armstrong (84%), "hydrogen and" → oxygen (73%), "…revolves around the" → sun (66%). ≈14/18 curated prompts correct.
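For readers comparing the perplexity figures, the following sketch shows the standard definition (the exponential of the mean negative log-likelihood per token) and converts the reported values back to cross-entropy in nats per token. The helper names are hypothetical, and the evaluation script actually used for the card is not shown in the source.

```python
import math

# Perplexities reported above (held-out English).
REPORTED_PERPLEXITY = {"base": 56.7, "sft": 97.5, "orpo": 125.0}
VOCAB = 128_256


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def nats_per_token(ppl):
    """Cross-entropy in nats per token implied by a perplexity value."""
    return math.log(ppl)


for stage, ppl in REPORTED_PERPLEXITY.items():
    print(f"{stage}: {nats_per_token(ppl):.2f} nats/token")

# Reference point: a uniform guess over the vocabulary has perplexity VOCAB.
print(f"uniform baseline: {nats_per_token(VOCAB):.2f} nats/token")
```

On this scale the gap between the base and ORPO models is under one nat per token, and all three stages sit far below the uniform-guess baseline.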
5. Quickstart
The model uses a custom terse architecture, so it needs the small llama.cpp fork
(branch terse-arch). After building it:
huggingface-cli download MicheRomChis/micro-terse terse-micro-sft.TQ2_0.gguf --local-dir .
./llama-cli -m terse-micro-sft.TQ2_0.gguf -p "Hello" -n 128
Use terse-micro-base.TQ2_0.gguf for completion and terse-micro-orpo.TQ2_0.gguf for
identity-aligned output.
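The same two steps can be scripted from Python. This is a minimal sketch, assuming the forked `llama-cli` binary has already been built and sits in the current directory, and that the `huggingface_hub` package is installed; `build_cli_command` and `download_and_run` are hypothetical helper names, not part of any library.

```python
import subprocess

REPO_ID = "MicheRomChis/micro-terse"
VARIANTS = {
    "base": "terse-micro-base.TQ2_0.gguf",
    "sft": "terse-micro-sft.TQ2_0.gguf",
    "orpo": "terse-micro-orpo.TQ2_0.gguf",
}


def build_cli_command(variant, prompt, n_predict=128, cli_path="./llama-cli"):
    """Return the argv list for the forked llama-cli (cli_path is an assumption)."""
    return [cli_path, "-m", VARIANTS[variant], "-p", prompt, "-n", str(n_predict)]


def download_and_run(variant, prompt, n_predict=128):
    """Download one GGUF file into the current directory, then run the CLI on it."""
    # Imported lazily so the command builder works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    hf_hub_download(repo_id=REPO_ID, filename=VARIANTS[variant], local_dir=".")
    subprocess.run(build_cli_command(variant, prompt, n_predict), check=True)


# Example (requires network access and the fork's binary):
# download_and_run("sft", "Hello")
```

Keeping the command construction separate from the download makes it easy to switch variants or point `cli_path` at a different build directory.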
6. Limitations
- Not a production assistant. Free generation becomes incoherent after a clause or two (roughly GPT-2-medium class) because the model is data-limited.
- Near-chance accuracy on knowledge and reasoning benchmarks is expected. Do not use the model for factual QA without retrieval.
- May hallucinate and reflect web-text biases; no safety tuning beyond the ORPO pass.
- Ternary weights give no training-memory savings (the straight-through estimator keeps full-precision master weights); the win is inference footprint and energy.
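The last point can be illustrated with a toy straight-through-estimator (STE) step for a single linear neuron. This is a generic sketch, not the card's clean-room training operator: the absmean scale, the squared-error loss, and the choice to treat the scale as a constant in the backward pass are all assumptions made for illustration.

```python
def absmean_scale(weights):
    """Mean absolute value of the master weights (assumed scaling rule)."""
    return sum(abs(w) for w in weights) / len(weights) or 1.0


def ternarize(weights, scale):
    """Round each scaled weight to the nearest value in {-1, 0, +1}."""
    return [max(-1, min(1, round(w / scale))) for w in weights]


def ste_step(masters, inputs, target, lr=0.1):
    """One SGD step: the forward pass uses ternary weights, the update hits fp masters."""
    scale = absmean_scale(masters)
    ternary = ternarize(masters, scale)
    # Forward pass with the quantized weights.
    output = sum(scale * q * x for q, x in zip(ternary, inputs))
    error = output - target  # gradient of 0.5 * (output - target) ** 2
    # STE: the gradient w.r.t. the effective weight is applied to the master weight
    # as if quantization were the identity (scale treated as a constant).
    grads = [error * x for x in inputs]
    new_masters = [w - lr * g for w, g in zip(masters, grads)]
    return new_masters, ternary, output


masters = [0.4, -0.05, -0.7]
new_masters, ternary, output = ste_step(masters, [1.0, 2.0, -1.0], target=0.0)
print(ternary, new_masters)
```

The full-precision `masters` list has to stay in memory alongside the optimizer state for the whole of training, which is why the ternary format only pays off after export.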
7. License
Apache-2.0.
8. Citation
@techreport{romerochisco2026tersemicro,
title = {Terse-Micro: A 423M-Parameter Ternary-Weight Language Model Trained From Scratch for \$150},
author = {Romero Chisco, Michelangelo},
year = {2026},
note = {Apache-2.0. github.com/michelangeloromerochisco/micro-terse}
}