# data report

The synthetic seed dataset that backs the Mumble cleanup model. Built by a multi-agent workflow that spawned 8 specialist agents in parallel and produced 612 pairs across 8 dictation categories; a polish pass added 76 more `long_form_thoughts` pairs with strictly diverse openers, bringing the total to **688 pairs**.

Every pair is `{ raw: <Parakeet-shaped lowercase no-punct disfluent input>, clean: <proper English output> }`. The clean side is faithful by construction: every content word in `clean` exists in `raw` (modulo standard homophone fixes, contractions, and casing). This is what stops the model from learning to hallucinate.
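As a rough illustration of the pair shape, the sketch below checks that a raw side is lowercase and punctuation-free. The helper name `is_raw_shaped` is hypothetical, and it assumes "no-punct" means no ASCII punctuation at all, apostrophes included (consistent with samples such as `ive` and `isnt`).

```python
import string

def is_raw_shaped(text: str) -> bool:
    """Return True if text looks like the raw side: non-empty, lowercase, no punctuation."""
    has_punct = any(ch in string.punctuation for ch in text)
    return bool(text.strip()) and text == text.lower() and not has_punct

# One pair taken from the sample section of this report.
pair = {
    "raw": "can you send me the deck",
    "clean": "Can you send me the deck?",
}

raw_ok = is_raw_shaped(pair["raw"])        # raw side should pass the shape check
clean_ok = is_raw_shaped(pair["clean"])    # clean side has casing and a "?", so it should not
```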

## category mix

![category counts](data_images/category_counts.png)

| category | count |
|---|---:|
| `casual_messages` | 81 |
| `long_form_thoughts` | 145 |
| `meeting_notes` | 81 |
| `mixed_content` | 70 |
| `professional_emails` | 80 |
| `questions_and_asks` | 70 |
| `technical_dictation` | 80 |
| `todo_lists` | 81 |
| **total** | **688** |

`long_form_thoughts` is intentionally over-weighted: paragraph-length cleanup is the hardest behavior (multiple sentence boundaries, sustained context, false starts), and 145 examples give the model the signal it needs to handle 60-90 word inputs.
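The table arithmetic can be double-checked with a few lines. The counts below are copied from the table; the variable names are only illustrative.

```python
# Per-category pair counts, copied from the table above.
category_counts = {
    "casual_messages": 81,
    "long_form_thoughts": 145,
    "meeting_notes": 81,
    "mixed_content": 70,
    "professional_emails": 80,
    "questions_and_asks": 70,
    "technical_dictation": 80,
    "todo_lists": 81,
}

total = sum(category_counts.values())

# Share of the dataset taken by each category.
share = {name: count / total for name, count in category_counts.items()}

# Pairs produced before the polish pass added 76 long_form_thoughts pairs.
first_pass_total = total - 76

print(total, first_pass_total, round(share["long_form_thoughts"], 3))
```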

## length distribution

![length distribution](data_images/length_distribution.png)

Raw inputs span **2 to 70 words** with a median of **17**. Clean outputs are slightly shorter (median of 16 words) because fillers and stutters are removed. Length distributions differ meaningfully by category: short utterances dominate `casual_messages`, `questions_and_asks`, and `mixed_content`, while long, paragraph-shaped inputs dominate `long_form_thoughts`.
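A minimal sketch of how these length statistics could be computed, assuming word counts come from plain whitespace splitting (the report does not say how words were counted). The three pairs are taken from the sample section, so the medians here describe only this tiny subset, not the full dataset.

```python
from statistics import median

def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())

pairs = [
    {"raw": "can you send me the deck",
     "clean": "Can you send me the deck?"},
    {"raw": "uh action item add a slack channel for the the procurement project",
     "clean": "Action item: add a Slack channel for the procurement project."},
    {"raw": "pick up bread milk and eggs",
     "clean": "Pick up bread, milk, and eggs."},
]

raw_lengths = [word_count(p["raw"]) for p in pairs]
clean_lengths = [word_count(p["clean"]) for p in pairs]

print(min(raw_lengths), max(raw_lengths), median(raw_lengths), median(clean_lengths))
```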

## raw vs clean length

![raw vs clean length](data_images/raw_vs_clean_length.png)

Points below the diagonal are pairs where clean is shorter than raw: the model is being trained to remove material, not add it. The cluster sits just below the diagonal, which is the expected shape for a faithful cleanup task: a few words removed per input on average, never more than ~25%.
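One way to express "distance below the diagonal" is the fraction of words removed per pair. The sketch below is an assumption about how that could be measured, not the script behind the plot. `removed_fraction` and `MAX_REMOVED` are hypothetical names, and counts are whitespace-based, so digit formatting ("two fifteen" to "2:15") also registers as a reduction.

```python
MAX_REMOVED = 0.25  # the ~25% ceiling quoted above, treated here as a soft bound

def removed_fraction(raw: str, clean: str) -> float:
    """Fraction of raw words that did not survive into clean, by word count."""
    raw_words = len(raw.split())
    clean_words = len(clean.split())
    return (raw_words - clean_words) / raw_words

# A meeting_notes pair from the sample section: 16 raw words, 14 clean words.
raw = "ok so we decided to ship the dark mode rollout next sprint priya is the owner"
clean = "We decided to ship the dark mode rollout next sprint. Priya is the owner."

fraction = removed_fraction(raw, clean)
within_bound = 0.0 <= fraction <= MAX_REMOVED
print(fraction, within_bound)
```

A negative value would mean the clean side grew, which is the case the plot is meant to rule out.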

## disfluency intensity

![filler intensity by category](data_images/filler_intensity.png)

Average filler-word count per raw input, by category. `meeting_notes` and `long_form_thoughts` carry the heaviest disfluency load (people think out loud during meetings); `mixed_content` and `questions_and_asks` are leanest (those categories are about precision, not verbosity).

![top fillers](data_images/top_fillers.png)

Distribution of filler words across the entire dataset. `um` and `uh` dominate, with `like`, `you know`, and `so` following at a moderate rate. The mix matches what shows up in real Parakeet output and dictation transcripts.
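For reference, a naive filler counter might look like the sketch below. The filler list is limited to the five named above, and the matching is purely lexical, so content uses of `like` or `so` are counted too. The script behind the charts may use a different list or rules.

```python
from collections import Counter

# Assumed filler inventory: only the words named in this section.
SINGLE_FILLERS = {"um", "uh", "like", "so"}
PHRASE_FILLERS = {("you", "know")}

def count_fillers(raw: str) -> Counter:
    """Count single-word and two-word fillers in a raw input (lexical match only)."""
    words = raw.split()
    counts = Counter(w for w in words if w in SINGLE_FILLERS)
    for first, second in zip(words, words[1:]):
        if (first, second) in PHRASE_FILLERS:
            counts[f"{first} {second}"] += 1
    return counts

# The two meeting_notes raw inputs from the sample section.
meeting_notes_raw = [
    "ok so we decided to ship the dark mode rollout next sprint priya is the owner",
    "uh action item add a slack channel for the the procurement project",
]

per_input = [sum(count_fillers(raw).values()) for raw in meeting_notes_raw]
average_fillers = sum(per_input) / len(per_input)
print(per_input, average_fillers)
```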

## faithfulness check

![faithfulness distribution](data_images/faithfulness.png)

For each pair, we compute the fraction of content words in the clean side that also appear in the raw side. A perfect value is 1.0 (every clean content word came from raw); lower values indicate the clean introduced content the raw did not have, which would train the model to hallucinate.

- **Mean faithfulness**: 0.929
- **Median faithfulness**: 0.957
- **Pairs above 0.95 threshold**: 391 of 688 (56.8%)
- **Pairs above 0.90 threshold**: 527 of 688 (76.6%)

Small drops below 1.0 come from legitimate sources:

- number-word to digit conversion ("two thirty" -> "2:30");
- proper-noun capitalization ("acme" -> "Acme"), which should count as a match but which our naive comparison might miss in some cases;
- contractions restored with an apostrophe ("im" -> "I'm").
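The metric can be sketched as follows. This is an approximation under stated assumptions, not the exact check used for the numbers above: the stopword list is hypothetical, tokens are lowercased, and apostrophes are dropped so that "I'll" compares equal to "ill".

```python
import re

# Hypothetical stopword list; the report does not define "content word" precisely.
STOPWORDS = {"a", "an", "the", "to", "of", "and", "for", "is", "at", "in"}

def content_words(text: str) -> list[str]:
    """Lowercase, drop apostrophes, split on non-alphanumerics, remove stopwords."""
    normalized = text.lower().replace("'", "")
    tokens = re.findall(r"[a-z0-9]+", normalized)
    return [t for t in tokens if t not in STOPWORDS]

def faithfulness(raw: str, clean: str) -> float:
    """Fraction of clean content words that also appear in raw."""
    raw_vocab = set(content_words(raw))
    clean_words = content_words(clean)
    if not clean_words:
        return 1.0  # nothing in clean that could have been invented
    return sum(w in raw_vocab for w in clean_words) / len(clean_words)

# Contraction restoration keeps a perfect score under this normalization.
email_score = faithfulness(
    "thanks ill review and get back to you by tomorrow noon",
    "Thanks, I'll review and get back to you by tomorrow noon.",
)

# Digit conversion lowers it: "2" and "15" never appear in raw.
time_score = faithfulness(
    "meeting with sarah at two fifteen pm tomorrow",
    "Meeting with Sarah at 2:15 PM tomorrow.",
)
print(email_score, round(time_score, 3))
```

In this sketch the second pair scores 5/7 even though nothing was hallucinated, which is the kind of legitimate drop described above.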

## sample pairs

Two per category, illustrating the shape of the dataset:

### `casual_messages`

- **raw**: `hey sarah you free for lunch around noon`
- **clean**: Hey Sarah, you free for lunch around noon?

- **raw**: `did you finish the writeup for the all hands`
- **clean**: Did you finish the writeup for the all hands?

### `long_form_thoughts`

- **raw**: `so ive been mulling over the pricing change and honestly i think we we moved too fast on the annual discount like the data showed a bump in conversions but um when you look at retention three months out the cohort actually churns harder than the monthly folks so we might be optimizing for a vanity metric instead of long term value`
- **clean**: I've been mulling over the pricing change, and honestly I think we moved too fast on the annual discount. The data showed a bump in conversions, but when you look at retention three months out, the cohort actually churns harder than the monthly folks. We might be optimizing for a vanity metric instead of long-term value.

- **raw**: `the thing that bugs me about our okr process is that we we set them in january then never look at them again until december so they become this performative document instead of a living tool and um if a goal isnt influencing weekly decisions its not really a goal its just a wish we wrote down`
- **clean**: The thing that bugs me about our OKR process is that we set them in January then never look at them again until December, so they become this performative document instead of a living tool. If a goal isn't influencing weekly decisions, it's not really a goal. It's just a wish we wrote down.

### `meeting_notes`

- **raw**: `ok so we decided to ship the dark mode rollout next sprint priya is the owner`
- **clean**: We decided to ship the dark mode rollout next sprint. Priya is the owner.

- **raw**: `uh action item add a slack channel for the the procurement project`
- **clean**: Action item: add a Slack channel for the procurement project.

### `mixed_content`

- **raw**: `meeting with sarah at two fifteen pm tomorrow`
- **clean**: Meeting with Sarah at 2:15 PM tomorrow.

- **raw**: `uber driver arriving in four minutes silver toyota camry license seven a b c three four five`
- **clean**: Uber driver arriving in 4 minutes. Silver Toyota Camry, license 7ABC345.

### `professional_emails`

- **raw**: `hi marcus quick follow up on the contract did legal sign off yet`
- **clean**: Hi Marcus, quick follow up on the contract. Did legal sign off yet?

- **raw**: `thanks ill review and get back to you by tomorrow noon`
- **clean**: Thanks, I'll review and get back to you by tomorrow noon.

### `questions_and_asks`

- **raw**: `can you send me the deck`
- **clean**: Can you send me the deck?

- **raw**: `is there a reason why we picked dynamodb over postgres for this i want to understand the tradeoff`
- **clean**: Is there a reason why we picked DynamoDB over Postgres for this? I want to understand the tradeoff.

### `technical_dictation`

- **raw**: `the mutex is held across the await point which deadlocks tokio`
- **clean**: The mutex is held across the await point, which deadlocks Tokio.

- **raw**: `well i mean the the perf regression came from the new react server component because its now making a fresh database connection per render instead of reusing the pool`
- **clean**: The perf regression came from the new React server component because it's now making a fresh database connection per render instead of reusing the pool.

### `todo_lists`

- **raw**: `pick up bread milk and eggs`
- **clean**: Pick up bread, milk, and eggs.

- **raw**: `favorite restaurants in seattle are canlis altura and the walrus and the carpenter`
- **clean**: Favorite restaurants in Seattle:
  - Canlis
  - Altura
  - The Walrus and the Carpenter

## limitations

- **Synthetic origin**: every pair was generated by an LLM workflow, not transcribed from real Parakeet output. The disfluency patterns are modeled to match real ASR failure modes but may under-represent edge cases the model will face in production.
- **Size**: 688 pairs is on the lower-middle end of the documented sweet spot for narrow LoRA fine-tunes (200-500 floor, 2k-5k comfortable). Adequate for a v1 ship; if the eval pass rate falls below 0.85, we generate another 600-1000 pairs and retrain.
- **Faithfulness is statistical, not strict**: 297 of 688 pairs (43.2%) do not clear the 0.95 threshold, which we attribute to legitimate transformations (numeric formatting, proper-noun casing). We don't filter these out because the training task explicitly wants the model to learn those transformations.
- **English only.**