
Mumble cleanup model report

A small fine-tuned language model that cleans dictation transcripts. Trained on a GPU, runs on a CPU.

What it does

You dictate something messy. It returns the cleaned version. Five things it handles:

  • removes filler words (um, uh, like, you know)
  • collapses word stutters ("we we" -> "we")
  • recovers punctuation and capitalization
  • corrects homophones (their / there, your / you're)
  • formats numbers, dates, lists where the cue is clear

Example: "um so the the meeting is at three thirty tomorrow" becomes "The meeting is at 3:30 tomorrow."

How it was built

flowchart TD
    A[hand-curated seed jsonl<br/>688 pairs, 8 categories] --> B[stratified split<br/>85/10/5 train/val/test]
    B --> C[lora sft on qwen2.5-0.5b-instruct<br/>1 epoch on a single rtx 4090]
    C --> D[eval on held-out test<br/>raw vs base vs fine-tuned]
    D --> E[merge lora + export onnx<br/>fp32 + int8 for cpu inference]
    E --> F[cpu latency benchmark<br/>run on the target laptop]

The seed dataset was generated by a multi-agent workflow that spawned eight specialist agents in parallel, each covering a distinct dictation category. After dedup, the final dataset has 688 unique pairs, an average of 86 per category.
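The stratified 85/10/5 split can be sketched as follows. This is a minimal illustration, not the project's actual split code: the `category`, `raw`, and `clean` field names and the `stratified_split` helper are assumptions, and the toy pairs are placeholders that only reproduce the dataset's shape (8 categories, 688 pairs).

```python
import random
from collections import defaultdict


def stratified_split(pairs, fractions=(0.85, 0.10, 0.05), seed=0):
    """Split pairs into train/val/test while keeping category proportions.

    `pairs` is a list of dicts with a "category" key (an assumed schema).
    """
    by_category = defaultdict(list)
    for pair in pairs:
        by_category[pair["category"]].append(pair)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_val = round(len(items) * fractions[1])
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])  # the remainder goes to test
    return train, val, test


# Placeholder data with the same shape as the seed set: 8 categories x 86 pairs.
pairs = [
    {"category": f"cat{c}", "raw": f"raw {c}-{i}", "clean": f"clean {c}-{i}"}
    for c in range(8)
    for i in range(86)
]
train, val, test = stratified_split(pairs)
```

Splitting inside each category, rather than over the whole pool, keeps every category represented in the small test split.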

Training uses TRL's SFTTrainer with DataCollatorForCompletionOnlyLM. The collator sets the labels of system and user tokens to -100, so cross-entropy is computed only on the assistant turn. This is what keeps the model honest: the loss rewards producing the cleaned output and never rewards reproducing the raw disfluent input, which serves only as context.
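The masking idea can be shown without any training library. The sketch below imitates what a completion-only collator does to the labels. It is not TRL's implementation, and the token ids and the `mask_prompt_labels` helper are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy ignores by default


def find_subsequence(sequence, pattern):
    """Return the first index where `pattern` occurs in `sequence`, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_labels(input_ids, response_template_ids):
    """Copy input_ids to labels and mask everything up to and including
    the response template, so only the assistant tokens carry loss."""
    labels = list(input_ids)
    start = find_subsequence(input_ids, response_template_ids)
    if start == -1:
        # No assistant turn found: mask the whole example.
        return [IGNORE_INDEX] * len(labels)
    response_start = start + len(response_template_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels


# Made-up token ids standing in for a tokenized chat example.
system_and_user = [11, 12, 13, 21, 22, 23, 24]
response_template = [90, 91]   # hypothetical ids for the assistant header
assistant = [31, 32, 33, 2]    # cleaned text plus an end-of-turn token
input_ids = system_and_user + response_template + assistant
labels = mask_prompt_labels(input_ids, response_template)
```

Only the last four positions keep real labels, so only the cleaned output contributes to the loss.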

Accuracy

Filled in after running make evaluate.

| model | disfluency removal | punctuation F1 | faithfulness | length ratio | pass rate |
|---|---|---|---|---|---|
| raw (no cleanup) | tbd | tbd | tbd | tbd | tbd |
| Qwen base zero-shot | tbd | tbd | tbd | tbd | tbd |
| fine-tuned | tbd | tbd | tbd | tbd | tbd |

Pass rate is the percentage of test examples that simultaneously meet: disfluency removal ≥ 0.95, punctuation F1 ≥ 0.85, faithfulness ≥ 0.98, length ratio in [0.85, 1.05].
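The pass-rate definition translates directly into code. This is a minimal sketch using the thresholds stated above. The `ExampleScores` container and the sample scores are illustrative assumptions, not real evaluation output.

```python
from dataclasses import dataclass


@dataclass
class ExampleScores:
    disfluency_removal: float
    punct_f1: float
    faithfulness: float
    length_ratio: float


def passes(s):
    """True only when all four criteria hold at the same time."""
    return (
        s.disfluency_removal >= 0.95
        and s.punct_f1 >= 0.85
        and s.faithfulness >= 0.98
        and 0.85 <= s.length_ratio <= 1.05
    )


def pass_rate(scores):
    """Percentage of examples that pass every criterion."""
    if not scores:
        return 0.0
    return 100.0 * sum(passes(s) for s in scores) / len(scores)


# Illustrative scores only.
examples = [
    ExampleScores(1.00, 0.90, 1.00, 0.97),  # passes
    ExampleScores(1.00, 0.80, 1.00, 0.97),  # fails punctuation F1
    ExampleScores(0.96, 0.88, 0.99, 1.10),  # fails length ratio
    ExampleScores(0.95, 0.85, 0.98, 0.85),  # passes exactly on the boundaries
]
rate = pass_rate(examples)
```

Because the criteria are combined with a logical AND, the pass rate can be much lower than any single per-metric average.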

The base model has a documented failure mode: it answers questions instead of cleaning them ("what's the capital of france" → "Paris"). The adversarial question check confirms whether fine-tuning corrects this.
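One way such a check could work is to measure how much of the output is drawn from the input: a cleaned question reuses the input's words, while an answer mostly does not. The sketch below is an assumption about the approach, not the project's actual check. The helper names and the 0.5 threshold are hypothetical.

```python
import re


def words(text):
    """Lowercased word tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def retained_fraction(raw, cleaned):
    """Fraction of the output's words that also occur in the raw input."""
    raw_words = set(words(raw))
    out_words = words(cleaned)
    if not out_words:
        return 0.0
    return sum(w in raw_words for w in out_words) / len(out_words)


def answered_instead_of_cleaned(raw, cleaned, threshold=0.5):
    """Flag outputs that share too few words with the input (assumed threshold)."""
    return retained_fraction(raw, cleaned) < threshold


raw = "what's the capital of france"
flag_answer = answered_instead_of_cleaned(raw, "Paris")
flag_cleanup = answered_instead_of_cleaned(raw, "What's the capital of France?")
```

A word-overlap heuristic like this will misfire on inputs where cleanup legitimately rewrites many tokens, such as spelled-out numbers, so it suits question-style probes better than general scoring.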

[Figure: per-metric comparison]

Training

[Figure: learning curves]

Filled in after running make evaluate. Look for: train loss drops smoothly, val loss tracks train, no late divergence.

Speed on CPU

Measured on a laptop CPU. Laptop timings are noisy because they depend on what else the machine is doing; treat these as approximate. For an authoritative number, run the benchmark inside the actual deployment environment (the Mumble Tauri app via the Rust ort crate).

[Figure: latency]

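| input length (tokens) | fp32 p50 (ms) | fp32 p95 (ms) | int8 p50 (ms) | int8 p95 (ms) |
|---|---|---|---|---|
| 16 | tbd | tbd | tbd | tbd |
| 32 | tbd | tbd | tbd | tbd |
| 64 | tbd | tbd | tbd | tbd |
| 128 | tbd | tbd | tbd | tbd |
| 256 | tbd | tbd | tbd | tbd |
| 512 | tbd | tbd | tbd | tbd |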

Realistic mix on the held-out test inputs (variable length; a 5% split of 688 pairs is about 34 examples): tbd.
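The p50 and p95 columns come from repeated timed runs. The sketch below shows one way to collect and summarize them, assuming a stand-in workload: `fake_inference` is hypothetical and takes the place of a real model call, and the warmup and run counts are arbitrary choices.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=30):
    """Time `fn` repeatedly and return per-run latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches before measuring
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return samples_ms


def p50_p95(samples_ms):
    """Median and 95th percentile of the samples."""
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def fake_inference():
    """Stand-in workload; replace with a real model call when benchmarking."""
    sum(i * i for i in range(20_000))


samples = benchmark(fake_inference)
p50, p95 = p50_p95(samples)
```

Reporting p95 alongside p50 matters on a laptop, where background load shows up in the tail long before it moves the median.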

What you get

The deliverable is:

  • runs/<run-id>/onnx/model.onnx: fp32 ONNX, ~2 GB (about 0.5B parameters at 4 bytes each)
  • runs/<run-id>/onnx/int8/model.onnx: int8 ONNX, ~500 MB (target for the Mumble app)

Both run on CPU with onnxruntime. The Rust ort crate consumes the int8 build.
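For a quick check from Python, either build can be opened with onnxruntime's CPU execution provider. This is a minimal sketch: `artifact_paths` and `open_cpu_session` are hypothetical helpers, and `"example-run"` is a placeholder run id, not a real one.

```python
from pathlib import Path


def artifact_paths(run_id, root="runs"):
    """Locations of the two exported builds for a given run."""
    base = Path(root) / run_id / "onnx"
    return {
        "fp32": base / "model.onnx",
        "int8": base / "int8" / "model.onnx",
    }


def open_cpu_session(model_path):
    """Open an ONNX model on CPU. Requires onnxruntime and an existing file."""
    # Imported lazily so the path helper works without onnxruntime installed.
    import onnxruntime as ort

    return ort.InferenceSession(
        str(model_path), providers=["CPUExecutionProvider"]
    )


paths = artifact_paths("example-run")  # placeholder run id
```

Calling `open_cpu_session(paths["int8"]).get_inputs()` lists the input names and shapes the export expects. Tokenization and the generation loop are not shown here.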

Limits

  • English only.
  • Trained on synthetic data. The test set is held out from the same synthetic distribution, so real ASR output may have failure modes the synthetic operators did not model. The operators were tuned to match Parakeet's failure distribution as observed in the bench harness; expect some domain shift in production.
  • Inputs longer than 512 tokens must be chunked before cleanup.
  • Single-turn only. Does not maintain conversation history.
  • Fixed system prompt baked in at training time. Changing the prompt at inference will degrade quality.
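The 512-token limit means long dictations have to be split before cleanup. The sketch below packs words greedily into chunks. It is one possible approach, not the app's actual chunker: `chunk_words` is a hypothetical helper, and the default whitespace word count is only a stand-in for the model tokenizer's token count, which should be passed in as `count_tokens` in practice.

```python
def chunk_words(text, max_tokens=512, count_tokens=None):
    """Greedily pack whitespace-separated words into chunks of at most
    `max_tokens` tokens, as measured by `count_tokens`."""
    if count_tokens is None:
        # Assumption for the sketch: one token per whitespace-separated word.
        count_tokens = lambda s: len(s.split())

    chunks, current = [], []
    for word in text.split():
        candidate = current + [word]
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]  # a single oversized word still gets its own chunk
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


long_text = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(long_text, max_tokens=512)
```

Splitting at arbitrary word boundaries can cut a sentence in half, which may hurt punctuation recovery near the seam. Breaking at pauses or sentence-like boundaries is likely a better choice where that information is available.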