---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3.5-0.8B
- Qwen/Qwen3.5-2B
- Qwen/Qwen3.5-4B
library_name: gguf
tags:
- dictation
- speech-to-text
- text-cleanup
- post-asr
- qwen3.5
- gguf
- llama.cpp
- on-device
pipeline_tag: text-generation
inference: false
---

# Quill: on-device dictation cleanup models

**Quill** is a family of small language models that turn raw speech-to-text
output into clean, written text, **entirely on your own device**. It removes
filler words (*um*, *uh*, *like*, *you know*), fixes punctuation and
capitalization, repairs spoken self-corrections and false starts, and collapses
the stutters and repeats that dictation produces, without changing your words
or sending anything to the cloud.

Quill is the cleanup stage of **[Quobi](https://huggingface.co/quobi)**, a
private, offline dictation app for desktop and mobile.

## What this is

When you dictate, a speech recognizer (e.g. Whisper) produces a literal, messy
transcript:

> *"um so i was thinking like maybe we could you know meet up at three"*

Quill rewrites that into what you actually meant to write:

> **"So I was thinking maybe we could meet up at three."**

It is **not** a chatbot and not an instruction-following assistant. It does one
job: clean dictated text. Feeding it questions or commands will not get answers;
it will just clean the text.

## Base model & credit

Quill is a fine-tune of **[Qwen3.5](https://huggingface.co/Qwen)** by the Qwen
team (Alibaba), used under the **Apache 2.0** license. Qwen3.5 is a hybrid
architecture interleaving **Gated DeltaNet (linear-attention)** layers with
periodic full-attention layers, which keeps the small sizes fast and
memory-light and well suited to on-device, low-latency cleanup. All credit for
the base models goes to the Qwen team; Quill only adds task-specific
fine-tuning.

| Quill tier | Base model | Size (Q4_K_M) |
|---|---|---|
| `quill-0.8b-Q4_K_M.gguf` | [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) | 505 MB |
| `quill-2b-Q4_K_M.gguf` | [Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) | 1.2 GB |
| `quill-4b-Q4_K_M.gguf` | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) | 2.6 GB |

## Which tier to use

| Tier | Best for | Behavior |
|---|---|---|
| **0.8B** | Phones and any CPU (recommended default) | **Verbatim**: faithful cleanup, no rephrasing |
| **2B** | Mid-range machines / a modest GPU | Verbatim + light tidying |
| **4B** | Desktops with a GPU | Verbatim + tidying + light formatting |

The smaller tiers are deliberately conservative. The **0.8B is verbatim-only by
design**: it is paired with a deterministic post-processing scaffold (symbol,
email, URL, and number normalization) so the model never has to *guess* at
conversions like "at" → `@`. This keeps the tiny model accurate and predictable;
the larger tiers take on more rewriting and structure.

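To make the idea of a deterministic scaffold concrete, here is a minimal sketch of one such rule: spoken email addresses. It is an illustration only, not Quobi's actual scaffold; the function name `normalize_email` and the regular expression are assumptions. The rule fires only when the words after "at" contain an explicit "dot", so an ordinary "at" is left alone.

```python
import re

# Hypothetical rule for illustration; not the scaffold shipped with Quobi.
# Matches "<local> at <label> dot <label> [dot <label> ...]".
_SPOKEN_EMAIL = re.compile(
    r"\b([a-z0-9._-]+) at ([a-z0-9-]+(?: dot [a-z0-9-]+)+)\b",
    re.IGNORECASE,
)


def normalize_email(text: str) -> str:
    """Rewrite spoken email addresses such as 'jane at example dot com'."""

    def _join(match: re.Match) -> str:
        local, domain = match.group(1), match.group(2)
        # Turn every spoken " dot " inside the domain into a literal ".".
        domain = re.sub(r" dot ", ".", domain, flags=re.IGNORECASE)
        return f"{local}@{domain}"

    return _SPOKEN_EMAIL.sub(_join, text)
```

Because the rule is a fixed pattern, the same input always gives the same output: `normalize_email("meet me at three")` returns the text unchanged, while a phrase containing "jane at example dot com" gets `jane@example.com`.
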
## Usage (llama.cpp)

```bash
llama-server -m quill-0.8b-Q4_K_M.gguf --host 127.0.0.1 --port 8080 -ngl 99
```

**Prompt format (important).** Use ChatML with the assistant turn pre-seeded
with an **empty think block** so the model does not emit chain-of-thought:

```
<|im_start|>system
You clean up dictated text.<|im_end|>
<|im_start|>user
yeah so um the meeting is gonna be like at uh three thirty tomorrow i think<|im_end|>
<|im_start|>assistant
<think>
</think>
```

→ **"The meeting is at 3:30 tomorrow."**

> ⚠️ Do **not** pass `--jinja`. It re-enables chain-of-thought leakage. Use the
> raw prompt above (or the `/completion` endpoint) with the pre-seeded empty
> `<think></think>` block. Greedy decoding (`temperature = 0`) is recommended.

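As a rough sketch of how a client might follow this advice, the Python below builds the raw ChatML prompt and posts it to the llama.cpp server's `/completion` endpoint with greedy decoding. The helper names `build_prompt` and `clean`, the `n_predict` cap of 256, and the `<|im_end|>` stop string are assumptions made for the example. The whitespace around the think block mirrors the prompt as printed above and may need adjusting.

```python
import json
import urllib.request

SYSTEM_PROMPT = "You clean up dictated text."


def build_prompt(transcript: str, system: str = SYSTEM_PROMPT) -> str:
    """Build a raw ChatML prompt with an empty think block pre-seeded."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{transcript}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n</think>\n"
    )


def clean(transcript: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """Send the raw prompt to llama-server and return the cleaned text."""
    payload = {
        "prompt": build_prompt(transcript),
        "temperature": 0,        # greedy decoding, as recommended
        "n_predict": 256,        # assumed output cap; tune to your inputs
        "stop": ["<|im_end|>"],  # assumed stop string: end of the turn
    }
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["content"].strip()
```

With the server from the command above running, `clean("yeah so um the meeting is gonna be like at uh three thirty tomorrow i think")` sends exactly the prompt shown in this section. No chat template is applied, so `--jinja` is not involved.
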
## Intended use & limitations

- **Intended:** post-ASR cleanup of first-person English dictation.
- **Not intended:** as a general assistant, translator, or summarizer; for
  languages other than English (non-English text is passed through, not
  cleaned); for safety-critical rewriting.
- Like any LM it can occasionally over- or under-edit. The verbatim tiers
  minimize this by preserving your wording; pair them with the deterministic
  scaffold for symbol/number normalization.

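One way an application might catch over-editing from a verbatim tier is to check that the cleaned text only drops words and never introduces new ones. The sketch below is a hypothetical guard, not part of Quill or Quobi; the function names and the `min_retention` threshold of 0.5 are assumptions. It is meant to run on the model output before any symbol or number normalization, and it will also reject legitimate fixes that change a word's spelling, such as an added apostrophe.

```python
import re


def _words(text: str) -> list[str]:
    """Lowercase word tokens, ignoring punctuation and capitalization."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_word_subsequence(transcript: str, cleaned: str) -> bool:
    """True if every word of `cleaned` occurs in `transcript`, in order."""
    remaining = iter(_words(transcript))
    # `word in remaining` advances the iterator until the word is found.
    return all(word in remaining for word in _words(cleaned))


def guard(transcript: str, cleaned: str, min_retention: float = 0.5) -> str:
    """Return `cleaned` if it looks like a faithful edit, else the raw text."""
    source = _words(transcript)
    kept = len(_words(cleaned)) / len(source) if source else 1.0
    if is_word_subsequence(transcript, cleaned) and kept >= min_retention:
        return cleaned
    return transcript  # fall back to the unedited transcript
```

Falling back to the raw transcript is a conservative design choice: a messy but accurate sentence is usually easier to fix by hand than a fluent one that changed the meaning.
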
## License

**Apache 2.0**, inherited from the Qwen3.5 base models (also Apache 2.0). You
are free to use, modify, and redistribute, including commercially, under the
terms of the license. Fine-tuned and released as part of the **Quobi** project.