---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3.5-0.8B
- Qwen/Qwen3.5-2B
- Qwen/Qwen3.5-4B
library_name: gguf
tags:
- dictation
- speech-to-text
- text-cleanup
- post-asr
- qwen3.5
- gguf
- llama.cpp
- on-device
pipeline_tag: text-generation
inference: false
---
# Quill: on-device dictation cleanup models
**Quill** is a family of small language models that turn raw speech-to-text
output into clean, written text, **entirely on your own device**. It removes
filler words (*um*, *uh*, *like*, *you know*), fixes punctuation and
capitalization, repairs spoken self-corrections and false starts, and collapses
the stutters and repeats that dictation produces, without changing your words
or sending anything to the cloud.

Quill is the cleanup stage of **[Quobi](https://huggingface.co/quobi)**, a
private, offline dictation app for desktop and mobile.
## What this is
When you dictate, a speech recognizer (e.g. Whisper) produces a literal, messy
transcript:
> *"um so i was thinking like maybe we could you know meet up at three"*

Quill rewrites that into what you actually meant to write:

> **"So I was thinking maybe we could meet up at three."**

It is **not** a chatbot and not an instruction-following assistant. It does one
job: clean dictated text. Feeding it questions or commands will not get answers;
it will just clean the text.
## Base model & credit
Quill is a fine-tune of **[Qwen3.5](https://huggingface.co/Qwen)** by the Qwen
team (Alibaba), used under the **Apache 2.0** license. Qwen3.5 is a hybrid
architecture interleaving **Mamba-2 / state-space (SSM)** layers with periodic
full-attention layers, which makes the small sizes fast and memory-light,
well suited to on-device, low-latency cleanup. All credit for the base models
goes to the Qwen team; Quill only adds task-specific fine-tuning.

| Quill tier | Base model | Size (Q4_K_M) |
|---|---|---|
| `quill-0.8b-Q4_K_M.gguf` | [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) | 505 MB |
| `quill-2b-Q4_K_M.gguf` | [Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) | 1.2 GB |
| `quill-4b-Q4_K_M.gguf` | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) | 2.6 GB |
## Which tier to use
| Tier | Best for | Behavior |
|---|---|---|
| **0.8B** | Phones and any CPU (recommended default) | **Verbatim**: faithful cleanup, no rephrasing |
| **2B** | Mid-range machines / a modest GPU | Verbatim + light tidying |
| **4B** | Desktops with a GPU | Verbatim + tidying + light formatting |

The smaller tiers are deliberately conservative. The **0.8B is verbatim-only by
design**: it is paired with a deterministic post-processing scaffold (symbol,
email, URL, and number normalization) so the model never has to *guess* at
conversions like "at" → `@`. This keeps the tiny model accurate and predictable;
the larger tiers take on more rewriting and structure.
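
To make the idea of a deterministic scaffold concrete, here is a minimal sketch of one such rule: turning a spoken email address into its written form. This is an illustration only, not Quobi's actual scaffold; the function name `normalize_spoken_email`, the regular expression, and the short list of top-level domains are all assumptions made for the example.

```python
import re

# Hypothetical rule (not Quobi's real scaffold):
# "<word> at <domain words> dot <tld>"  ->  "<word>@<domain>.<tld>"
# The TLD list is deliberately tiny and is an assumption for this sketch.
_SPOKEN_EMAIL = re.compile(
    r"\b([a-z0-9_]+) at ([a-z0-9-]+(?: dot [a-z0-9-]+)*) dot (com|org|net|io)\b",
    re.IGNORECASE,
)


def normalize_spoken_email(text: str) -> str:
    """Rewrite spoken email addresses; leave all other text untouched."""

    def _join(match: re.Match) -> str:
        local, domain, tld = match.group(1), match.group(2), match.group(3)
        # Any remaining spoken "dot" inside the domain becomes a literal dot.
        domain = re.sub(r" dot ", ".", domain, flags=re.IGNORECASE)
        return f"{local}@{domain}.{tld}"

    return _SPOKEN_EMAIL.sub(_join, text)


if __name__ == "__main__":
    print(normalize_spoken_email("email me at jane at example dot com"))
    print(normalize_spoken_email("meet me at three"))
```

Because the rule is a fixed pattern, the same input always produces the same output, which is the property the paragraph above relies on. A rule this naive is also over-eager: a phrase such as "look at example dot com" would be rewritten as an address, so a real scaffold would need more context than this sketch uses.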
## Usage (llama.cpp)
```bash
llama-server -m quill-0.8b-Q4_K_M.gguf --host 127.0.0.1 --port 8080 -ngl 99
```
**Prompt format (important).** Use ChatML with the assistant turn pre-seeded
with an **empty think block** so the model does not emit chain-of-thought:
```
<|im_start|>system
You clean up dictated text.<|im_end|>
<|im_start|>user
yeah so um the meeting is gonna be like at uh three thirty tomorrow i think<|im_end|>
<|im_start|>assistant
<think>

</think>

```
→ **"The meeting is at 3:30 tomorrow."**
> ⚠️ Do **not** pass `--jinja`. It re-enables chain-of-thought leakage. Use the
> raw prompt above (or the `/completion` endpoint) with the pre-seeded empty
> `<think></think>` block. Greedy decoding (`temperature = 0`) is recommended.
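
As a rough sketch of how a client might send this prompt, the Python below builds the ChatML string with the pre-seeded empty think block and posts it to the server's `/completion` endpoint with greedy decoding. The helper names (`build_prompt`, `build_request`, `clean`), the `n_predict` limit of 256, the stop string, and the exact whitespace inside the think block (taken from the usual Qwen3 convention) are assumptions for illustration; the host and port match the `llama-server` command above.

```python
import json
import urllib.request

SYSTEM_PROMPT = "You clean up dictated text."


def build_prompt(transcript: str) -> str:
    """Build the raw ChatML prompt, ending with an empty think block."""
    return (
        "<|im_start|>system\n"
        f"{SYSTEM_PROMPT}<|im_end|>\n"
        "<|im_start|>user\n"
        f"{transcript}<|im_end|>\n"
        "<|im_start|>assistant\n"
        # Assumed whitespace: the usual Qwen3 "no thinking" pre-fill.
        "<think>\n\n</think>\n\n"
    )


def build_request(transcript: str, n_predict: int = 256) -> dict:
    """JSON body for llama-server's /completion endpoint."""
    return {
        "prompt": build_prompt(transcript),
        "temperature": 0,          # greedy decoding, as recommended
        "n_predict": n_predict,    # assumed output cap for short dictation
        "stop": ["<|im_end|>"],    # stop at the end of the assistant turn
    }


def clean(transcript: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """Send one transcript to a running llama-server and return the cleaned text."""
    body = json.dumps(build_request(transcript)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["content"].strip()
```

With the server running, `clean("um so i was thinking like maybe we could you know meet up at three")` would return the model's cleaned text. Only the standard library is used, so the sketch has no dependencies beyond Python itself.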
## Intended use & limitations
- **Intended:** post-ASR cleanup of first-person English dictation.
- **Not intended:** as a general assistant, translator, or summarizer; for
languages other than English (non-English text is passed through, not
cleaned); for safety-critical rewriting.
- Like any language model, it can occasionally over- or under-edit. The verbatim tiers
minimize this by preserving your wording; pair them with the deterministic
scaffold for symbol/number normalization.
## License
**Apache 2.0**, inherited from the Qwen3.5 base models (also Apache 2.0). You
are free to use, modify, and redistribute, including commercially, under the
terms of the license. Fine-tuned and released as part of the **Quobi** project.