data report
This is the synthetic seed dataset that backs the Mumble cleanup model. It was built by a multi-agent workflow that spawned 8 specialist agents in parallel and produced 612 pairs across 8 dictation categories. A polish pass then added 76 more long_form_thoughts pairs with strictly diverse openers, bringing the total to 688 pairs.
Every pair is { raw: <Parakeet-shaped lowercase no-punct disfluent input>, clean: <proper English output> }. The clean side is faithful by construction: every content word in clean exists in raw (modulo standard homophone fixes, contractions, and casing). This is what stops the model from learning to hallucinate.
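The raw-side shape described above (lowercase, no punctuation) can be checked mechanically. The following is a minimal sketch, not the validation the workflow actually ran; the helper name `is_raw_shaped` is hypothetical, and it assumes "no punctuation" means no ASCII punctuation characters at all, apostrophes included.

```python
import string


def is_raw_shaped(text: str) -> bool:
    """Return True if text looks like a raw input: non-empty, lowercase, no punctuation.

    Assumption: "no-punct" is read as "no ASCII punctuation at all",
    so contractions must appear as "ive", "isnt", and so on.
    """
    if not text.strip():
        return False
    has_upper = any(ch.isupper() for ch in text)
    has_punct = any(ch in string.punctuation for ch in text)
    return not has_upper and not has_punct


pair = {
    "raw": "can you send me the deck",
    "clean": "Can you send me the deck?",
}
print(is_raw_shaped(pair["raw"]))    # raw side
print(is_raw_shaped(pair["clean"]))  # clean side
```

A check like this only covers the surface shape of `raw`. Whether the input is realistically disfluent is a separate question.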
category mix
| category | count |
|---|---|
| casual_messages | 81 |
| long_form_thoughts | 145 |
| meeting_notes | 81 |
| mixed_content | 70 |
| professional_emails | 80 |
| questions_and_asks | 70 |
| technical_dictation | 80 |
| todo_lists | 81 |
| total | 688 |
long_form_thoughts is intentionally over-weighted because paragraph-length cleanup is the hardest behavior (multiple sentence boundaries, sustained context, false starts), and 145 examples give the model the signal it needs to handle 60-90 word inputs.
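The arithmetic behind the mix can be reproduced from the table. This sketch copies the counts above and computes each category's share; the variable names are illustrative only.

```python
# Counts copied from the category mix table.
category_counts = {
    "casual_messages": 81,
    "long_form_thoughts": 145,
    "meeting_notes": 81,
    "mixed_content": 70,
    "professional_emails": 80,
    "questions_and_asks": 70,
    "technical_dictation": 80,
    "todo_lists": 81,
}

total = sum(category_counts.values())
shares = {name: count / total for name, count in category_counts.items()}

# Print categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<22} {category_counts[name]:>4}  {share:.1%}")
print(f"{'total':<22} {total:>4}")
```

With these counts, long_form_thoughts makes up roughly a fifth of the dataset, while every other category sits between about 10% and 12%.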
length distribution
Raw inputs span 2 to 70 words with a median of 17. Clean outputs are slightly shorter on average (median of 16 words) because fillers and stutters have been removed. The categories show meaningfully different length distributions: short utterances dominate casual_messages, questions_and_asks, and mixed_content; long paragraph-shaped inputs dominate long_form_thoughts.
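Length statistics of this kind can be computed with whitespace word counts. The sketch below assumes pairs are held as a list of `{"raw", "clean"}` dicts and that a "word" is any whitespace-separated token; both are assumptions, and `length_summary` is a hypothetical helper. The three pairs are taken from the samples later in this report.

```python
from statistics import median


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def length_summary(pairs: list[dict]) -> dict:
    """Summarize raw and clean lengths (in words) for a list of pairs."""
    raw_lengths = [word_count(p["raw"]) for p in pairs]
    clean_lengths = [word_count(p["clean"]) for p in pairs]
    return {
        "raw_min": min(raw_lengths),
        "raw_max": max(raw_lengths),
        "raw_median": median(raw_lengths),
        "clean_median": median(clean_lengths),
    }


sample_pairs = [
    {"raw": "uh action item add a slack channel for the the procurement project",
     "clean": "Action item: add a Slack channel for the procurement project."},
    {"raw": "can you send me the deck",
     "clean": "Can you send me the deck?"},
    {"raw": "pick up bread milk and eggs",
     "clean": "Pick up bread, milk, and eggs."},
]
print(length_summary(sample_pairs))
```

Grouping the pairs by category before calling `length_summary` would give the per-category view described above.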
raw vs clean length
Points below the diagonal mean clean is shorter than raw: the model is being trained to remove material, not add it. The cluster sits just below the diagonal, which is the expected shape for a faithful cleanup task: a few words removed per input on average, never more than ~25%.
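The distance below the diagonal corresponds to the fraction of words removed from each input. A minimal sketch of that quantity, again assuming whitespace word counts; `removed_fraction` is a hypothetical helper, not part of the original tooling.

```python
def removed_fraction(raw: str, clean: str) -> float:
    """Fraction of raw words that do not survive into clean, by word count.

    Positive values mean clean is shorter than raw (below the diagonal);
    negative values would mean clean is longer.
    """
    raw_len = len(raw.split())
    clean_len = len(clean.split())
    if raw_len == 0:
        return 0.0
    return (raw_len - clean_len) / raw_len


raw = "uh action item add a slack channel for the the procurement project"
clean = "Action item: add a Slack channel for the procurement project."
print(f"{removed_fraction(raw, clean):.3f}")
```

In this pair the filler "uh" and the repeated "the" are dropped, so 2 of 12 words are removed, well under the ~25% ceiling mentioned above.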
disfluency intensity
Average filler-word count per raw input, by category. meeting_notes and long_form_thoughts carry the heaviest disfluency load (people think out loud during meetings); mixed_content and questions_and_asks are leanest (those categories are about precision, not verbosity).
Distribution of filler words across the entire dataset. um and uh dominate (matching real Parakeet output), with like, you know, and so following at a moderate rate. The mix matches what shows up in real dictation transcripts.
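A filler count per raw input can be approximated with whole-word matching. The sketch below uses only the fillers named above; the actual filler list behind the charts is not given here, so `FILLERS` is an assumption. A naive count like this also overcounts "like" and "so" when they are used as ordinary words.

```python
import re
from collections import Counter

# Assumed filler list: only the fillers named in this report.
FILLERS = ("you know", "um", "uh", "like", "so")


def count_fillers(raw: str) -> Counter:
    """Count whole-word occurrences of each filler in a raw input."""
    counts = Counter()
    for filler in FILLERS:
        pattern = r"\b" + re.escape(filler) + r"\b"
        matches = re.findall(pattern, raw)
        if matches:
            counts[filler] = len(matches)
    return counts


raw = "ok so we decided to ship the dark mode rollout next sprint priya is the owner"
counts = count_fillers(raw)
print(dict(counts), sum(counts.values()))
```

Averaging `sum(counts.values())` over the raw inputs in each category would give a per-category intensity figure of the kind described above.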
faithfulness check
For each pair, we compute the fraction of content words in the clean side that also appear in the raw side. A perfect value is 1.0 (every clean content word came from raw); lower values indicate the clean introduced content the raw did not have, which would train the model to hallucinate.
- Mean faithfulness: 0.929
- Median faithfulness: 0.957
- Pairs above 0.95 threshold: 391 of 688 (56.8%)
- Pairs above 0.90 threshold: 527 of 688 (76.6%)
Small drops below 1.0 come from legitimate sources: number-word to digit conversion ("two thirty" -> "2:30"), proper-noun capitalization ("acme" -> "Acme", which should count as a match under our simple lowercase comparison, though the naive check may still miss some cases), and contractions ("im" -> "I'm" via apostrophe restoration).
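The check described above can be sketched as follows. This is an illustrative reimplementation, not the script that produced the reported numbers: the stopword list, the tokenizer, and the choice to strip apostrophes before comparing are all assumptions.

```python
import re

# Assumed stopword list; the real check's definition of "content word" is not given.
STOPWORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "for", "with", "is", "are", "was", "were", "be", "it", "this", "that",
    "i", "you", "we", "they", "he", "she", "me", "my", "our", "your",
})


def content_words(text: str) -> list[str]:
    """Lowercase, drop apostrophes, split on non-alphanumerics, remove stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower().replace("'", ""))
    return [t for t in tokens if t not in STOPWORDS]


def faithfulness(raw: str, clean: str) -> float:
    """Fraction of clean content words that also appear in raw."""
    raw_words = set(content_words(raw))
    clean_words = content_words(clean)
    if not clean_words:
        return 1.0
    return sum(w in raw_words for w in clean_words) / len(clean_words)


print(faithfulness("can you send me the deck", "Can you send me the deck?"))
print(faithfulness("meeting with sarah at two fifteen pm tomorrow",
                   "Meeting with Sarah at 2:15 PM tomorrow."))
```

The second pair scores below 1.0 under this sketch because "2" and "15" do not appear in the raw text, which is the number-formatting effect described above.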
sample pairs
Two per category, illustrating the shape of the dataset:
casual_messages
raw:
hey sarah you free for lunch around noon
clean:
Hey Sarah, you free for lunch around noon?
raw:
did you finish the writeup for the all hands
clean:
Did you finish the writeup for the all hands?
long_form_thoughts
raw:
so ive been mulling over the pricing change and honestly i think we we moved too fast on the annual discount like the data showed a bump in conversions but um when you look at retention three months out the cohort actually churns harder than the monthly folks so we might be optimizing for a vanity metric instead of long term value
clean:
I've been mulling over the pricing change, and honestly I think we moved too fast on the annual discount. The data showed a bump in conversions, but when you look at retention three months out, the cohort actually churns harder than the monthly folks. We might be optimizing for a vanity metric instead of long-term value.
raw:
the thing that bugs me about our okr process is that we we set them in january then never look at them again until december so they become this performative document instead of a living tool and um if a goal isnt influencing weekly decisions its not really a goal its just a wish we wrote down
clean:
The thing that bugs me about our OKR process is that we set them in January then never look at them again until December, so they become this performative document instead of a living tool. If a goal isn't influencing weekly decisions, it's not really a goal. It's just a wish we wrote down.
meeting_notes
raw:
ok so we decided to ship the dark mode rollout next sprint priya is the owner
clean:
We decided to ship the dark mode rollout next sprint. Priya is the owner.
raw:
uh action item add a slack channel for the the procurement project
clean:
Action item: add a Slack channel for the procurement project.
mixed_content
raw:
meeting with sarah at two fifteen pm tomorrow
clean:
Meeting with Sarah at 2:15 PM tomorrow.
raw:
uber driver arriving in four minutes silver toyota camry license seven a b c three four five
clean:
Uber driver arriving in 4 minutes. Silver Toyota Camry, license 7ABC345.
professional_emails
raw:
hi marcus quick follow up on the contract did legal sign off yet
clean:
Hi Marcus, quick follow up on the contract. Did legal sign off yet?
raw:
thanks ill review and get back to you by tomorrow noon
clean:
Thanks, I'll review and get back to you by tomorrow noon.
questions_and_asks
raw:
can you send me the deck
clean:
Can you send me the deck?
raw:
is there a reason why we picked dynamodb over postgres for this i want to understand the tradeoff
clean:
Is there a reason why we picked DynamoDB over Postgres for this? I want to understand the tradeoff.
technical_dictation
raw:
the mutex is held across the await point which deadlocks tokio
clean:
The mutex is held across the await point, which deadlocks Tokio.
raw:
well i mean the the perf regression came from the new react server component because its now making a fresh database connection per render instead of reusing the pool
clean:
The perf regression came from the new React server component because it's now making a fresh database connection per render instead of reusing the pool.
todo_lists
raw:
pick up bread milk and eggs
clean:
Pick up bread, milk, and eggs.
raw:
favorite restaurants in seattle are canlis altura and the walrus and the carpenter
clean:
Favorite restaurants in Seattle:
Canlis
Altura
The Walrus and the Carpenter
limitations
- Synthetic origin: every pair was generated by an LLM workflow, not transcribed from real Parakeet output. The disfluency patterns are modeled to match real ASR failure modes but may under-represent edge cases the model will face in production.
- Size: 688 pairs is on the lower-middle end of the documented sweet spot for narrow LoRA fine-tunes (200-500 floor, 2k-5k comfortable). Adequate for a v1 ship; if eval pass rate is below 0.85 we regenerate another 600-1000 pairs and retrain.
- Faithfulness is statistical, not strict: 297 of 688 pairs do not clear the 0.95 threshold, mostly because of legitimate transformations (numeric formatting, proper-noun casing). We don't filter these out because the training task explicitly wants the model to learn those transformations.
- English only.





