# Mumble cleanup model report

A small fine-tuned language model that cleans dictation transcripts. Trained on a GPU, runs on a CPU.

## What it does

You dictate something messy. It returns the cleaned version. Five things it handles:

- removes filler words (um, uh, like, you know)
- collapses word stutters ("we we" -> "we")
- recovers punctuation and capitalization
- corrects homophones (their / there, your / you're)
- formats numbers, dates, lists where the cue is clear

Example: `"um so the the meeting is at three thirty tomorrow"` becomes `"The meeting is at 3:30 tomorrow."`

## How it was built

```mermaid
flowchart TD
    A[hand-curated seed jsonl<br/>688 pairs, 8 categories] --> B[stratified split<br/>85/10/5 train/val/test]
    B --> C[lora sft on qwen2.5-0.5b-instruct<br/>1 epoch on a single rtx 4090]
    C --> D[eval on held-out test<br/>raw vs base vs fine-tuned]
    D --> E[merge lora + export onnx<br/>fp32 + int8 for cpu inference]
    E --> F[cpu latency benchmark<br/>run on the target laptop]
```

The seed dataset was generated by a multi-agent workflow that spawned eight specialist agents in parallel, each producing ~70-80 pairs in a distinct dictation category. After dedup, the final dataset has 612 unique pairs.
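The stratified 85/10/5 split from the flowchart can be sketched roughly as below. This is a minimal illustration rather than the project's actual split script: the `category` field name, the `stratified_split` helper, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(pairs, fractions=(0.85, 0.10, 0.05), seed=0):
    """Split pairs into train/val/test while keeping each category's share.

    `pairs` is a list of dicts; the "category" key is a hypothetical field name.
    """
    by_category = defaultdict(list)
    for pair in pairs:
        by_category[pair["category"]].append(pair)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_val = round(len(items) * fractions[1])
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        # Whatever remains goes to test, so no example is dropped.
        test.extend(items[n_train + n_val:])
    return train, val, test
```

Splitting inside each category, instead of over the whole file, keeps small categories from vanishing from the validation and test sets.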

Training uses TRL's `SFTTrainer` with `DataCollatorForCompletionOnlyLM`. The collator sets the labels of system and user tokens to `-100`, so cross-entropy is computed only on the assistant turn. As a result, the model is trained to produce the cleaned output and is never trained to reproduce the raw, disfluent input.
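The effect of the masking can be illustrated in plain Python. This is a sketch of the idea only, not TRL's implementation: the `mask_prompt_tokens` helper and the `response_start` index are hypothetical, and the real collator locates the assistant turn by searching the token sequence for a response template.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_prompt_tokens(input_ids, response_start):
    """Return labels in which every token before the assistant turn is ignored.

    `response_start` (hypothetical) is the index of the first assistant token.
    """
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels


# Token ids 11-13 stand in for system + user tokens, 14-15 for the assistant turn.
labels = mask_prompt_tokens([11, 12, 13, 14, 15], response_start=3)
```

Only the positions that keep their original token id contribute to the loss.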

## Accuracy

> Filled in after running `make evaluate`.

| model | disfluency removal | punct f1 | faithfulness | length ratio | pass rate |
|---|---:|---:|---:|---:|---:|
| raw (no cleanup) | _tbd_ | _tbd_ | _tbd_ | _tbd_ | _tbd_ |
| Qwen base zero-shot | _tbd_ | _tbd_ | _tbd_ | _tbd_ | _tbd_ |
| **fine-tuned** | _tbd_ | _tbd_ | _tbd_ | _tbd_ | _tbd_ |

Pass rate is the percentage of test examples that meet all four thresholds at once: disfluency removal ≥ 0.95, punctuation F1 ≥ 0.85, faithfulness ≥ 0.98, and length ratio in [0.85, 1.05].
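The pass-rate rule can be written out as a short check. This is a minimal sketch, assuming each test example already has its four metrics computed; the dictionary keys and function names are illustrative and may not match the evaluation script.

```python
def passes(metrics):
    """True if one example meets all four thresholds from the report."""
    return (
        metrics["disfluency_removal"] >= 0.95
        and metrics["punct_f1"] >= 0.85
        and metrics["faithfulness"] >= 0.98
        and 0.85 <= metrics["length_ratio"] <= 1.05
    )


def pass_rate(per_example_metrics):
    """Percentage of examples that pass; 0.0 for an empty list."""
    if not per_example_metrics:
        return 0.0
    passed = sum(passes(m) for m in per_example_metrics)
    return 100.0 * passed / len(per_example_metrics)
```

Because the thresholds are combined with `and`, a single weak metric fails the example even when the other three are perfect.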

The base model has a documented failure mode: it **answers** questions instead of cleaning them (for example, "what's the capital of france" → "Paris"). The adversarial question check tests whether fine-tuning corrects this.

![per metric comparison](report_images/metrics_comparison.png)

## Training

![learning curves](report_images/learning_curves.png)

> Filled in after running `make evaluate`. Look for: train loss drops smoothly, val loss tracks train, no late divergence.

## Speed on CPU

> Measured on a laptop CPU. Laptop timings are noisy because they depend on what else the machine is doing; treat these as approximate. For an authoritative number, run the benchmark inside the actual deployment environment (the Mumble Tauri app via the Rust `ort` crate).

![latency](report_images/latency.png)

| input length (tokens) | fp32 p50 (ms) | fp32 p95 (ms) | int8 p50 (ms) | int8 p95 (ms) |
|---:|---:|---:|---:|---:|
| 16 | _tbd_ | _tbd_ | _tbd_ | _tbd_ |
| 32 | _tbd_ | _tbd_ | _tbd_ | _tbd_ |
| 64 | _tbd_ | _tbd_ | _tbd_ | _tbd_ |
| 128 | _tbd_ | _tbd_ | _tbd_ | _tbd_ |
| 256 | _tbd_ | _tbd_ | _tbd_ | _tbd_ |
| 512 | _tbd_ | _tbd_ | _tbd_ | _tbd_ |

Realistic mix on ~500 real test inputs (variable length): _tbd_.
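The p50 and p95 columns above can be produced by a timing loop along these lines. This is a sketch under assumptions: `benchmark` and `percentile` are hypothetical helpers, the nearest-rank percentile definition is one of several in common use, and `fn` stands in for a single inference call on a fixed-length input.

```python
import math
import time


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(fn, warmup=3, runs=50):
    """Time `fn` repeatedly and report p50/p95 latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {"p50_ms": percentile(timings_ms, 50), "p95_ms": percentile(timings_ms, 95)}
```

Reporting p95 alongside p50 shows how much a noisy laptop stretches the slow tail, which a mean would hide.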

## What you get

The deliverables are:

- `runs/<run-id>/onnx/model.onnx` — fp32 ONNX, ~1 GB
- `runs/<run-id>/onnx/int8/model.onnx` — int8 ONNX, ~250 MB (target for the Mumble app)

Both run on CPU with `onnxruntime`. The Rust `ort` crate consumes the int8 build.

## Limits

- English only.
- Trained on synthetic data. Test set is held out from the same synthetic distribution. Real ASR output may have failure modes the synthetic operators did not model. The cleanup operators were tuned to match Parakeet's failure distribution as observed in the bench harness; expect some domain shift in production.
- Inputs longer than 512 tokens must be chunked before cleanup.
- Single-turn only. Does not maintain conversation history.
- Fixed system prompt baked in at training time. Changing the prompt at inference time will likely degrade quality.
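The 512-token limit above means the caller has to split long inputs before cleanup. A minimal sketch of fixed-size chunking over already-tokenized input follows; `chunk_tokens` is a hypothetical helper, and a real integration would probably prefer to cut at pauses or sentence boundaries so that no chunk starts mid-phrase.

```python
def chunk_tokens(token_ids, max_tokens=512):
    """Split a token sequence into consecutive chunks of at most `max_tokens`."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        token_ids[start:start + max_tokens]
        for start in range(0, len(token_ids), max_tokens)
    ]


# A 1200-token input becomes chunks of 512, 512, and 176 tokens.
chunks = chunk_tokens(list(range(1200)))
```

Each chunk would then be cleaned independently and the outputs joined in order.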