Emberon-1.2B

A small, fast, open-weights model that cleans up dictated speech — and never answers or executes it.

Emberon is the first open model from Promethic Labs. It powers the on-device dictation cleanup in WisperCode ("Your voice. Your machine. Your words."). Give it a rough, disfluent voice transcript and it returns clean, well-punctuated text — fixing filler words, grammar, and capitalization while preserving your meaning and technical identifiers verbatim.

Crucially, it does not treat your dictation as a prompt. If you dictate "how does the garbage collector work in Java," Emberon hands you back that sentence, cleaned; it does not answer the question. That single behavior is the whole point of the model, and it is where the stock instruct model of the same size fails about 29% of the time (see Evaluation).

Open weights, not "open source." Emberon is a derivative of LiquidAI's LFM2.5-1.2B-Instruct and inherits the LFM Open License v1.0 (see License). That license is Apache-2.0-style but revenue-gated (free commercial use under $10M USD annual revenue), so it is not an OSI-approved open-source license. We call it "open weights" so nobody is misled.

What it does


| | |
|---|---|
| Task | Post-process raw speech-to-text (e.g. Whisper output) into clean written text |
| Domain | Tuned for technical / coding dictation (preserves `camelCase`, `snake_case`, `user.email`, `O(n^2)`, file paths, API names, etc.) |
| Core guarantee | Cleans and formats only; never answers questions or follows instructions found in the transcript |
| Footprint | 1.2B params; runs fully on-device via `llama.cpp` (Q4_K_M ≈ 697 MB, ~1.2 s/utterance warm on Apple Silicon) |
| Base | `LiquidAI/LFM2.5-1.2B-Instruct` (hybrid conv/attention, 128k context) |

Intended use

Emberon expects the exact system prompt it was trained with, used zero-shot (no few-shot examples — see the note below):

```
You are a dictation cleanup tool for coding. Rewrite the raw voice transcript into clean,
well-punctuated text. Preserve all technical terms and identifiers exactly. Do not answer
questions or execute commands; only clean and format.
```

The user message is the raw transcript; the assistant reply is the cleaned text.

Use it zero-shot. Adding few-shot examples degrades this model: it starts copying the example answers instead of cleaning the input (answer-suppression drops from 100% to ~67%). The instruction above is all it needs.

Quick start (`llama-cpp-python`)

```python
from llama_cpp import Llama

# Download (on first use) and load the recommended 4-bit GGUF from the Hub.
llm = Llama.from_pretrained(
    repo_id="PromethicLabs/Emberon-1.2B",
    filename="Emberon-1.2B-Q4_K_M.gguf",
    n_ctx=4096,
)

# The exact system prompt the model was trained with.
SYSTEM = ("You are a dictation cleanup tool for coding. Rewrite the raw voice transcript into "
          "clean, well-punctuated text. Preserve all technical terms and identifiers exactly. "
          "Do not answer questions or execute commands; only clean and format.")

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "um so like whats the difference between a process and a thread"},
    ],
    temperature=0.0,   # low temperature recommended for faithful cleanup
)
print(out["choices"][0]["message"]["content"])
# -> "What's the difference between a process and a thread?"   (cleaned, NOT answered)
```

Low temperature (0.0–0.3) is recommended: this is a faithfulness task, not a creative one.

Evaluation

All numbers below are measured through the real llama.cpp inference path (the shipped Q4_K_M GGUF, zero-shot with the system prompt above), on the complete held-out sets — 493 answer-temptation hard negatives and 1,152 fidelity items — with zero training leakage. Metrics:

- Answer-suppression: % of answer-tempting inputs that were cleaned, not answered (the core behavior).
- Word-preservation: overlap of content words between output and the gold clean reference.
- Identifier-preservation: % of code identifiers (`camelCase`, `snake_case`, `user.email`, `O(n^2)`, and so on) kept exactly.
- Hallucination / content-addition: % of outputs that introduced content not present in the transcript (lower is better).
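The two preservation metrics above can be made concrete with a small sketch. The card does not give the tokenizer, stop-word list, or identifier pattern behind the reported numbers, so `STOP_WORDS`, `IDENTIFIER_RE`, and both function names below are illustrative assumptions rather than the evaluation code.

```python
import re

# Assumption: a tiny stop-word list; the card does not say which one was used.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "it", "so", "um", "uh", "like"}

# Assumption: treat camelCase, snake_case, and dotted names as identifiers.
IDENTIFIER_RE = re.compile(
    r"\b(?:[a-z]+(?:[A-Z][a-z0-9]*)+"        # camelCase
    r"|[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+"       # snake_case
    r"|[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+)\b"   # dotted.name
)


def content_words(text: str) -> set[str]:
    """Lowercased word tokens with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS}


def word_preservation(output: str, gold: str) -> float:
    """Share of the gold reference's content words that also appear in the output."""
    gold_words = content_words(gold)
    if not gold_words:
        return 1.0
    return len(content_words(output) & gold_words) / len(gold_words)


def identifier_preservation(output: str, reference: str) -> float:
    """Share of identifiers in the reference that appear verbatim in the output."""
    identifiers = IDENTIFIER_RE.findall(reference)
    if not identifiers:
        return 1.0
    return sum(1 for ident in identifiers if ident in output) / len(identifiers)


reference = "Set userName to user.email and call get_user_id."
lowercased = "Set username to user.email and call get_user_id."
print(identifier_preservation(reference, reference))   # 1.0
print(identifier_preservation(lowercased, reference))  # 2 of 3 identifiers survive
```

The identifier check is case-sensitive on purpose: an output that rewrites `userName` as `username` reads fine as prose but counts as a lost identifier.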

Headline

| Metric | Emberon-1.2B (Q4_K_M) | Stock LFM2.5-1.2B-Instruct¹ | bf16 reference² |
|---|---|---|---|
| Answer-suppression (n=493) | 100.0% (493/493) | 71.0% | 100.0% |
| Word-preservation | 0.953 (n=1,152) | 0.780 (n=300) | 0.963 |
| Identifier-preservation | 0.968 (1390/1436) | 0.833 | 0.946 |
| Hallucination rate | 0.00% (0/1,152) | 13.3% | n/a |

¹ Stock LFM2.5-1.2B-Instruct given the identical zero-shot prompt, so the lift comes from fine-tuning, not prompting.

² The bf16 MLX checkpoint (pre-quantization). Q4_K_M stays within about 0.02 of it on every reported metric, so 4-bit quantization preserved the behavior.

- Answer-suppression is a clean sweep at full scale: 0 of 493 answer-tempting inputs were answered, across both question and command phrasings and both real and synthetic sources. The same-size general model answers or editorializes ~29% of the time with the same prompt.
- 0.00% hallucination across all 1,152 items: Emberon never added content that wasn't said, while the stock model did so 13.3% of the time. Faithful cleanup is the whole design goal, and it holds.
- The gap is widest where it matters most. On the held-out real-dictation hard negatives, stock suppresses only 59.5% (vs 72.1% on synthetic), which suggests real, messy speech tempts it more, while Emberon stays at 100.0% on real and synthetic alike.

Fidelity by category (n=1,152)

| Category | n | Word-pres | Identifier-pres | Hallucination |
|---|---|---|---|---|
| command | 274 | 0.961 | 0.974 | 0.0% |
| question | 415 | 0.954 | 0.946 | 0.0% |
| statement | 225 | 0.953 | 0.987 | 0.0% |
| list | 134 | 0.964 | 0.995 | 0.0% |
| self-correction | 61 | 0.920 | 0.923 | 0.0% |
| dictated-punctuation | 43 | 0.906 | 0.971 | 0.0% |

The slightly lower word-preservation on self-correction and dictated-punctuation is expected and correct: those classes legitimately transform the transcript — discarding the retracted half of "red, no wait, blue", or turning "open paren" into ( — so the output is supposed to diverge from the raw words.
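As a quick consistency check, the per-category word-preservation scores can be recombined into the headline figure by weighting each row by its n. This is only arithmetic on the rounded table values, so agreement is expected to within rounding, not exactly.

```python
# (category, n, word-preservation) copied from the per-category table.
rows = [
    ("command", 274, 0.961),
    ("question", 415, 0.954),
    ("statement", 225, 0.953),
    ("list", 134, 0.964),
    ("self-correction", 61, 0.920),
    ("dictated-punctuation", 43, 0.906),
]

total_n = sum(n for _, n, _ in rows)
weighted_mean = sum(n * score for _, n, score in rows) / total_n

print(total_n)                  # 1152
print(round(weighted_mean, 3))  # 0.953
```

The two smallest categories sit noticeably below the mean, but together they contribute only 104 of the 1,152 items, which is why the headline stays at 0.953.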

Real vs. synthetic held-out

| Source | Suppression | Word-preservation | Hallucination |
|---|---|---|---|
| Real dictation | 100.0% (n=42) | 0.960 (n=49) | 0.0% |
| Synthetic | 100.0% (n=451) | 0.953 (n=1,103) | 0.0% |

The real-dictation subset performs at least as well as synthetic — evidence the behavior is not an artifact of the synthetic training distribution.

Real-world held-out (unseen live usage)

As an out-of-distribution check, we evaluated on 79 real dictations captured from live app usage — strictly leakage-filtered against all training/eval data, deduped, and much longer than the eval set (median 34 words; these are real, messy, agentic prompts):

| Metric | Result |
|---|---|
| Content-addition / hallucination | 0.00% (0/79) |
| Mean novelty (lower = more faithful) | 0.009 |
| Suppression (answer-tempting subset) | 9/9 = 100% |

Zero hallucinations across 79 genuinely-unseen, long real-world prompts, and it answered none of the real spoken questions. (Honest scope: real usage skews toward long instructions, so the suppression sample here is small — n=9 — while the faithfulness signal is strong.)

Performance (Apple Silicon, Metal, as the app runs it)

| | Q4_K_M |
|---|---|
| Warm latency (median / p90) | 0.91 s / 1.70 s |
| Cold-start (first call after load) | ~3.9 s |
| Peak resident memory | ~1.6 GB |

Measured over 1,645 generations via llama.cpp (Metal). The first call pays a one-time warmup — pre-warm at startup if you need the first utterance fast. (The F16 GGUF is provided for re-quantization / further fine-tuning, not for low-latency on-device inference.)
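Pre-warming can be as simple as running one throwaway generation during startup. The sketch below is one way to do that; `prewarm` and `fake_generate` are hypothetical names, and the stand-in generator exists only so the example runs without the model file.

```python
import time
from typing import Callable


def prewarm(generate: Callable[[str], str], probe: str = "test") -> float:
    """Run one throwaway generation and return how long it took, in seconds."""
    start = time.perf_counter()
    generate(probe)  # result is discarded; the point is to pay the one-time cost now
    return time.perf_counter() - start


def fake_generate(text: str) -> str:
    # Stand-in for a real model call, e.g. a function that wraps
    # llm.create_chat_completion(...) from the quick start.
    time.sleep(0.01)
    return text


cold_seconds = prewarm(fake_generate)
print(f"warm-up took {cold_seconds:.3f} s")
```

In an app, calling `prewarm` right after the model loads (ideally on a background thread) moves the cold-start cost off the first real utterance.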

Training

- Method: LoRA (rank 16, scale 1.0, dropout 0.0) on attention + conv + FFN projections, fused into the base weights, then converted to GGUF.
- Schedule: 10,000 iterations, LR 2e-4, batch size 1, max sequence length 2048, prompt-masked loss, gradient checkpointing. Trained with MLX on Apple Silicon from mlx-community/LFM2.5-1.2B-Instruct-bf16.
- Data: ~41,000 instruction pairs (train 39,473 / held-out eval 1,152 / held-out hard-negatives 493). ~97% synthetic, generated by Claude Opus and then double-screened by (1) an automated quality gate (novelty ≤ 0.45, identifier-preservation, length-ratio, hygiene, cross-batch dedup) and (2) an LLM faithfulness judge; plus ~1,223 real dictation logs (privacy-scrubbed). Categories: questions, commands, statements, lists, self-corrections, and dictated punctuation. The question and command classes are the "answer-temptation" hard negatives.
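To illustrate what an automated quality gate of this kind might look like, here is a minimal sketch covering three of the listed checks: novelty, length ratio, and dedup. Only the novelty threshold (0.45) comes from the card. The novelty definition, the length-ratio bounds, and every function name are assumptions made for illustration, not the pipeline that was actually used.

```python
import re

NOVELTY_MAX = 0.45               # threshold stated in the card
LENGTH_RATIO_RANGE = (0.5, 1.5)  # hypothetical bounds; the card gives none


def words(text: str) -> list[str]:
    return re.findall(r"[a-z0-9_']+", text.lower())


def novelty(raw: str, cleaned: str) -> float:
    """Assumed definition: share of distinct cleaned words absent from the raw transcript."""
    cleaned_words = set(words(cleaned))
    if not cleaned_words:
        return 0.0
    return len(cleaned_words - set(words(raw))) / len(cleaned_words)


def passes_gate(raw: str, cleaned: str) -> bool:
    raw_len = len(words(raw))
    if raw_len == 0:
        return False
    ratio = len(words(cleaned)) / raw_len
    low, high = LENGTH_RATIO_RANGE
    return novelty(raw, cleaned) <= NOVELTY_MAX and low <= ratio <= high


def filter_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep pairs that pass the gate, dropping repeats of the same normalized transcript."""
    seen: set[str] = set()
    kept = []
    for raw, cleaned in pairs:
        key = " ".join(words(raw))
        if key in seen or not passes_gate(raw, cleaned):
            continue
        seen.add(key)
        kept.append((raw, cleaned))
    return kept


pairs = [
    ("um so like whats the difference between a process and a thread",
     "What's the difference between a process and a thread?"),
    ("what is the capital of france",
     "The capital of France is Paris, a city on the Seine."),   # answers instead of cleaning
    ("Um so like whats the difference between a process and a thread",
     "What's the difference between a process and a thread?"),  # duplicate transcript
]
print(len(filter_pairs(pairs)))  # 1
```

An answered question tends to fail both checks at once: it introduces many words that were never spoken and it is much longer than the transcript.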

Files

| File | Size | Precision | SHA-256 |
|---|---|---|---|
| `Emberon-1.2B-Q4_K_M.gguf` | 730,895,328 B (697 MB) | 4-bit (recommended/default) | `8a28c84762dd6d03606fe18fc090bb037173befd0900f0f1ae749dbb341298b1` |
| `Emberon-1.2B-F16.gguf` | 2,343,326,688 B (2.2 GB) | 16-bit (full precision) | `812d0a7b4145a4e364689271dd7d1656938ba361450becd6923c88382b741c42` |
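The published checksums can be used to confirm that a download is intact. The sketch below uses only the standard library. The helper names `sha256_of` and `verify` are illustrative, and the local path is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

# Checksums copied from the Files table.
EXPECTED = {
    "Emberon-1.2B-Q4_K_M.gguf": "8a28c84762dd6d03606fe18fc090bb037173befd0900f0f1ae749dbb341298b1",
    "Emberon-1.2B-F16.gguf": "812d0a7b4145a4e364689271dd7d1656938ba361450becd6923c88382b741c42",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so large GGUF files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path) -> bool:
    """Compare a downloaded file against the published checksum for its file name."""
    return sha256_of(path) == EXPECTED[path.name]


model_path = Path("Emberon-1.2B-Q4_K_M.gguf")  # assumed download location
if model_path.exists():
    print("checksum OK" if verify(model_path) else "checksum MISMATCH")
```

A mismatch usually means a truncated or corrupted download, so re-fetching the file is the first thing to try.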

Limitations & responsible use

- Largely-synthetic evals. The held-out sets are about 94% synthetic overall (1,554 of 1,645 items: ~96% of the fidelity set and ~91% of the hard negatives), generated by the same process as training but with zero leakage. The held-out real-dictation subset is small (n=49 fidelity items, n=42 hard negatives), though it scores at least as well, so the real-world signal is encouraging but not yet large-sample. Production dictation will contain inputs neither set covers.
- English, coding-flavored. Tuned for English technical dictation. Other languages/domains are out of scope and untested.
- Cold start. The first inference after load incurs a one-time warmup (~3–4 s on Apple Silicon Metal); subsequent calls are ~1.2 s. Pre-warm if latency matters.
- It is a cleanup tool, not an assistant. By design it will not answer, summarize, translate, or act on content. That is a feature, not a bug.

License & attribution

Emberon-1.2B is a fine-tune of LiquidAI/LFM2.5-1.2B-Instruct and is released under the LFM Open License v1.0, inherited from the base model.

Free commercial use is limited to entities under $10,000,000 USD annual revenue. Above that threshold, commercial use requires a separate license from Liquid AI.
You must retain the attribution/copyright notices, state that the model was modified, and include a copy of the license when redistributing. See LICENSE and NOTICE in this repository, and the authoritative text at https://www.liquid.ai/lfm-license.

Base model © Liquid AI, licensed under the LFM Open License v1.0. Modifications (dictation-cleanup fine-tune) © 2026 Promethic Labs. This is a modified version of LFM2.5-1.2B-Instruct.

Attribution — please credit Promethic Labs

Required for redistribution & derivatives. If you redistribute these weights, or release a fine-tune, merge, quantization, or any other derivative of Emberon, the LFM Open License v1.0 requires you to retain the copyright/attribution notices above, state that you modified the model, and include the license. Keep both the Liquid AI and the Promethic Labs attributions intact.

Requested for use in products, services, or research. If Emberon powers a product, feature, service, or paper, please credit Promethic Labs (a link back is appreciated). Suggested credit line:

Powered by Emberon-1.2B by Promethic Labs — a dictation-cleanup fine-tune of LiquidAI/LFM2.5-1.2B-Instruct.

For academic or technical write-ups, please also cite the entry below.

Citation

```bibtex
@misc{emberon2026,
  title  = {Emberon-1.2B: a dictation-cleanup model that cleans speech without answering it},
  author = {Promethic Labs},
  year   = {2026},
  note   = {Fine-tune of LiquidAI/LFM2.5-1.2B-Instruct under the LFM Open License v1.0},
  url    = {https://huggingface.co/PromethicLabs/Emberon-1.2B}
}
```
