
data report

The synthetic seed dataset that backs the Mumble cleanup model. Built by a multi-agent workflow that spawned 8 specialist agents in parallel and produced 612 pairs across 8 dictation categories; a polish pass added 76 more long_form_thoughts pairs with strictly diverse openers, bringing the total to 688 pairs.

Every pair is { raw: <Parakeet-shaped lowercase no-punct disfluent input>, clean: <proper English output> }. The clean side is designed to be faithful: every content word in clean should also exist in raw (modulo standard homophone fixes, contractions, number formatting, and casing). This is what stops the model from learning to hallucinate.
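The pair shape described above can be checked mechanically. The following is a minimal sketch, not the project's actual validator: the function names `is_raw_shaped` and `validate_pair` are hypothetical, and "no-punct" is assumed to mean "no ASCII punctuation characters".

```python
import string


def is_raw_shaped(text: str) -> bool:
    """True if text is non-empty, lowercase, and free of ASCII punctuation.

    Assumption: "Parakeet-shaped" is approximated by these three checks only.
    """
    has_punct = any(ch in string.punctuation for ch in text)
    return bool(text.strip()) and text == text.lower() and not has_punct


def validate_pair(pair: dict) -> list[str]:
    """Return a list of problems found in one {raw, clean} pair (empty if none)."""
    if set(pair) != {"raw", "clean"}:
        return ["pair must have exactly the keys 'raw' and 'clean'"]
    problems = []
    if not is_raw_shaped(pair["raw"]):
        problems.append("raw is not lowercase / punctuation-free")
    if not pair["clean"].strip():
        problems.append("clean is empty")
    return problems


# Usage with one of the sample pairs from this report.
print(validate_pair({"raw": "can you send me the deck",
                     "clean": "Can you send me the deck?"}))
```

A check like this only covers surface shape; it says nothing about whether the clean side is faithful to the raw side.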

category mix

[Figure: category counts]

| category | count |
| --- | --- |
| casual_messages | 81 |
| long_form_thoughts | 145 |
| meeting_notes | 81 |
| mixed_content | 70 |
| professional_emails | 80 |
| questions_and_asks | 70 |
| technical_dictation | 80 |
| todo_lists | 81 |
| total | 688 |
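As a quick arithmetic cross-check of the category counts, the sketch below sums them and derives each category's share of the dataset. The numbers are copied from the report; the variable names are illustrative only.

```python
# Per-category pair counts, copied from the table in this report.
counts = {
    "casual_messages": 81,
    "long_form_thoughts": 145,
    "meeting_notes": 81,
    "mixed_content": 70,
    "professional_emails": 80,
    "questions_and_asks": 70,
    "technical_dictation": 80,
    "todo_lists": 81,
}

total = sum(counts.values())

# Share of the dataset per category, as a percentage rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# The initial multi-agent run plus the polish pass should also give the total.
assert total == 612 + 76 == 688

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:22s} {counts[name]:4d}  {share:5.1f}%")
```

Under these counts long_form_thoughts makes up roughly a fifth of the dataset, while every other category sits between about 10% and 12%.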

long_form_thoughts is intentionally over-weighted because paragraph-length cleanup is the hardest behavior (multiple sentence boundaries, sustained context, false starts), and 145 examples give the model the signal it needs to handle 60-90 word inputs.

length distribution

[Figure: length distribution]

Raw inputs span 2 to 70 words with a median of 17. Clean outputs are slightly shorter (median of 16 words) because fillers and stutters have been removed. The categories show meaningfully different length distributions: short utterances dominate casual_messages, questions_and_asks, and mixed_content; long paragraph-shaped inputs dominate long_form_thoughts.
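The length statistics above can be reproduced along these lines. This is a sketch that assumes words are whitespace-separated tokens; the report does not say how words were counted, and `length_summary` is a hypothetical helper name.

```python
from statistics import median


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (assumed definition of a 'word')."""
    return len(text.split())


def length_summary(pairs: list[dict]) -> dict:
    """Min/max/median raw length and median clean length, in words."""
    raw_lens = [word_count(p["raw"]) for p in pairs]
    clean_lens = [word_count(p["clean"]) for p in pairs]
    return {
        "raw_min": min(raw_lens),
        "raw_max": max(raw_lens),
        "raw_median": median(raw_lens),
        "clean_median": median(clean_lens),
    }


# Three sample pairs taken from this report, for illustration only.
sample = [
    {"raw": "can you send me the deck",
     "clean": "Can you send me the deck?"},
    {"raw": "pick up bread milk and eggs",
     "clean": "Pick up bread, milk, and eggs."},
    {"raw": "uh action item add a slack channel for the the procurement project",
     "clean": "Action item: add a Slack channel for the procurement project."},
]
print(length_summary(sample))
```

Grouping the same summary by category would need a category label on each pair, which the pair schema shown in this report does not include.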

raw vs clean length

[Figure: raw vs clean length]

Points below the diagonal mean clean is shorter than raw: the model is being trained to remove material, not add it. The cluster sits just below the diagonal, which is the expected shape for a faithful cleanup task: a few words removed per input on average, and never more than ~25% of the raw input.
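One way to express "how far below the diagonal" a pair sits is the fraction of raw words that the clean side drops. The helper below is a hypothetical sketch that again assumes whitespace-separated word counts.

```python
def removed_fraction(raw: str, clean: str) -> float:
    """Fraction of raw words removed in clean, by whitespace word count.

    0.0 means same length; a negative value would mean clean is longer.
    """
    raw_words = len(raw.split())
    clean_words = len(clean.split())
    if raw_words == 0:
        return 0.0
    return (raw_words - clean_words) / raw_words


# Sample pair from this report: 16 raw words, 14 clean words.
raw = "ok so we decided to ship the dark mode rollout next sprint priya is the owner"
clean = "We decided to ship the dark mode rollout next sprint. Priya is the owner."
print(removed_fraction(raw, clean))
```

Number formatting can also shift word counts (for example "two fifteen" becoming "2:15"), so this ratio measures length change, not filler removal alone.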

disfluency intensity

[Figure: filler intensity by category]

Average filler-word count per raw input, by category. meeting_notes and long_form_thoughts carry the heaviest disfluency load (people think out loud during meetings); mixed_content and questions_and_asks are leanest (those categories are about precision, not verbosity).

[Figure: top fillers]

Distribution of filler words across the entire dataset. um and uh dominate (matching real Parakeet output), with like, you know, and so following at a moderate rate. The mix matches what shows up in real dictation transcripts.
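A filler count of the kind plotted above could be computed roughly as follows. This is a naive sketch: the filler list is assumed from the words named in this report, and whole-word matching will also count legitimate uses of "like" and "so", so it overestimates true disfluency.

```python
import re
from collections import Counter

# Assumed filler list, taken from the fillers this report names.
FILLERS = ("you know", "um", "uh", "like", "so")


def count_fillers(raw: str) -> Counter:
    """Count whole-word occurrences of each filler in one raw input."""
    counts = Counter()
    for filler in FILLERS:
        pattern = r"\b" + re.escape(filler) + r"\b"
        hits = len(re.findall(pattern, raw))
        if hits:  # store only fillers that actually occur
            counts[filler] = hits
    return counts


def mean_fillers(raws: list[str]) -> float:
    """Average number of filler occurrences per raw input."""
    if not raws:
        return 0.0
    return sum(sum(count_fillers(r).values()) for r in raws) / len(raws)


print(count_fillers("uh action item add a slack channel for the the procurement project"))
```

The word-boundary anchors keep "um" from matching inside words such as "bump". Separating filler "like" from content "like" would need more context than a regex provides.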

faithfulness check

[Figure: faithfulness distribution]

For each pair, we compute the fraction of content words in the clean side that also appear in the raw side. A perfect value is 1.0 (every clean content word came from raw); lower values indicate the clean introduced content the raw did not have, which would train the model to hallucinate.

  • Mean faithfulness: 0.929
  • Median faithfulness: 0.957
  • Pairs above 0.95 threshold: 391 of 688 (56.8%)
  • Pairs above 0.90 threshold: 527 of 688 (76.6%)

Small drops below 1.0 come from legitimate sources: number-word to digit conversion ("two thirty" -> "2:30"); proper-noun capitalization ("acme" -> "Acme"), which the simple lowercase comparison should count as a match, though the naive check may miss some cases; and contractions restored with an apostrophe ("im" -> "I'm").
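The faithfulness score described in this section can be sketched as below. This is not the report's actual implementation: the stopword list, the tokenizer, the apostrophe stripping, and the use of unique words rather than token counts are all assumptions, since the report does not define "content word" precisely.

```python
import re

# Assumed stopword list; the report does not say which words count as non-content.
STOPWORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "to", "of", "in", "on", "at",
    "for", "with", "is", "are", "was", "be", "it", "i", "you", "we", "me", "can",
})


def content_words(text: str) -> set[str]:
    """Lowercase, tokenize, strip apostrophes, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Removing apostrophes lets "I've" in clean match "ive" in raw.
    return {t.replace("'", "") for t in tokens} - STOPWORDS - {""}


def faithfulness(raw: str, clean: str) -> float:
    """Fraction of clean content words that also appear in raw."""
    clean_words = content_words(clean)
    if not clean_words:
        return 1.0
    return len(clean_words & content_words(raw)) / len(clean_words)


def share_above(scores: list[float], threshold: float) -> float:
    """Fraction of pairs whose score is strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


print(faithfulness("can you send me the deck", "Can you send me the deck?"))
print(faithfulness("meeting with sarah at two fifteen pm tomorrow",
                   "Meeting with Sarah at 2:15 PM tomorrow."))
```

Under these assumptions the first sample pair scores 1.0, while the second scores 4/6 because "2" and "15" do not appear in the raw text. That is the number-formatting drop described above, and it shows how a legitimate transformation can lower the score without any hallucinated content.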

sample pairs

Two per category, illustrating the shape of the dataset:

casual_messages

  • raw: hey sarah you free for lunch around noon

  • clean: Hey Sarah, you free for lunch around noon?

  • raw: did you finish the writeup for the all hands

  • clean: Did you finish the writeup for the all hands?

long_form_thoughts

  • raw: so ive been mulling over the pricing change and honestly i think we we moved too fast on the annual discount like the data showed a bump in conversions but um when you look at retention three months out the cohort actually churns harder than the monthly folks so we might be optimizing for a vanity metric instead of long term value

  • clean: I've been mulling over the pricing change, and honestly I think we moved too fast on the annual discount. The data showed a bump in conversions, but when you look at retention three months out, the cohort actually churns harder than the monthly folks. We might be optimizing for a vanity metric instead of long-term value.

  • raw: the thing that bugs me about our okr process is that we we set them in january then never look at them again until december so they become this performative document instead of a living tool and um if a goal isnt influencing weekly decisions its not really a goal its just a wish we wrote down

  • clean: The thing that bugs me about our OKR process is that we set them in January then never look at them again until December, so they become this performative document instead of a living tool. If a goal isn't influencing weekly decisions, it's not really a goal. It's just a wish we wrote down.

meeting_notes

  • raw: ok so we decided to ship the dark mode rollout next sprint priya is the owner

  • clean: We decided to ship the dark mode rollout next sprint. Priya is the owner.

  • raw: uh action item add a slack channel for the the procurement project

  • clean: Action item: add a Slack channel for the procurement project.

mixed_content

  • raw: meeting with sarah at two fifteen pm tomorrow

  • clean: Meeting with Sarah at 2:15 PM tomorrow.

  • raw: uber driver arriving in four minutes silver toyota camry license seven a b c three four five

  • clean: Uber driver arriving in 4 minutes. Silver Toyota Camry, license 7ABC345.

professional_emails

  • raw: hi marcus quick follow up on the contract did legal sign off yet

  • clean: Hi Marcus, quick follow up on the contract. Did legal sign off yet?

  • raw: thanks ill review and get back to you by tomorrow noon

  • clean: Thanks, I'll review and get back to you by tomorrow noon.

questions_and_asks

  • raw: can you send me the deck

  • clean: Can you send me the deck?

  • raw: is there a reason why we picked dynamodb over postgres for this i want to understand the tradeoff

  • clean: Is there a reason why we picked DynamoDB over Postgres for this? I want to understand the tradeoff.

technical_dictation

  • raw: the mutex is held across the await point which deadlocks tokio

  • clean: The mutex is held across the await point, which deadlocks Tokio.

  • raw: well i mean the the perf regression came from the new react server component because its now making a fresh database connection per render instead of reusing the pool

  • clean: The perf regression came from the new React server component because it's now making a fresh database connection per render instead of reusing the pool.

todo_lists

  • raw: pick up bread milk and eggs

  • clean: Pick up bread, milk, and eggs.

  • raw: favorite restaurants in seattle are canlis altura and the walrus and the carpenter

  • clean: Favorite restaurants in Seattle:

      • Canlis

      • Altura

      • The Walrus and the Carpenter

limitations

  • Synthetic origin: every pair was generated by an LLM workflow, not transcribed from real Parakeet output. The disfluency patterns are modeled to match real ASR failure modes but may under-represent edge cases the model will face in production.
  • Size: 688 pairs is on the lower-middle end of the documented sweet spot for narrow LoRA fine-tunes (200-500 floor, 2k-5k comfortable). Adequate for a v1 ship; if eval pass rate is below 0.85 we regenerate another 600-1000 pairs and retrain.
  • Faithfulness is statistical, not strict: 297 of 688 pairs (43.2%) score at or below 0.95, which the faithfulness check attributes to legitimate transformations (numeric formatting, proper-noun casing, contractions). We don't filter these out because the training task explicitly wants the model to learn those transformations.
  • English only.