mumble-cleanup / docs /data_report.md

adikuma

initial upload: cleanup code and 688-pair seed dataset

fd0b01f verified 29 days ago

data report

The synthetic seed dataset that backs the Mumble cleanup model. Built by a multi-agent workflow that spawned 8 specialist agents in parallel and produced 612 pairs across 8 dictation categories; a polish pass added 76 more long_form_thoughts pairs with strictly diverse openers, bringing the total to 688 pairs.

Every pair is { raw: <Parakeet-shaped lowercase no-punct disfluent input>, clean: <proper English output> }. The clean side is faithful by construction: every content word in clean exists in raw (modulo standard homophone fixes, contractions, and casing). This is what stops the model from learning to hallucinate.
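The pair shape described above can be sketched as a small record plus a check on the raw side. This is a minimal illustration, not the project's own code: the `Pair` class, the `is_parakeet_shaped` helper, and the exact rule (lowercase letters, digits, and single spaces only) are assumptions made for this example.

```python
import re
from dataclasses import dataclass

# Assumption: "Parakeet-shaped" is approximated here as lowercase letters
# and digits separated by single spaces, with no punctuation at all.
RAW_SHAPE = re.compile(r"[a-z0-9]+(?: [a-z0-9]+)*")


@dataclass(frozen=True)
class Pair:
    """One training example: disfluent ASR-style input and its cleaned form."""
    raw: str
    clean: str


def is_parakeet_shaped(raw: str) -> bool:
    """Return True if the text looks like lowercase, unpunctuated ASR output."""
    return RAW_SHAPE.fullmatch(raw) is not None


# Sample pair taken from the meeting_notes examples in this report.
pair = Pair(
    raw="uh action item add a slack channel for the the procurement project",
    clean="Action item: add a Slack channel for the procurement project.",
)

print(is_parakeet_shaped(pair.raw))    # the raw side passes the shape check
print(is_parakeet_shaped(pair.clean))  # the clean side has casing and punctuation
```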

category mix

| category | count |
| --- | --- |
| `casual_messages` | 81 |
| `long_form_thoughts` | 145 |
| `meeting_notes` | 81 |
| `mixed_content` | 70 |
| `professional_emails` | 80 |
| `questions_and_asks` | 70 |
| `technical_dictation` | 80 |
| `todo_lists` | 81 |
| total | 688 |

long_form_thoughts is intentionally over-weighted: paragraph-length cleanup is the hardest behavior (multiple sentence boundaries, sustained context, false starts), and 145 examples give the model the signal it needs to handle 60-90 word inputs.
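To make the over-weighting concrete, the category counts can be turned into shares of the dataset. The counts are copied from the category mix; the variable names are illustrative only.

```python
# Category counts as listed in the category mix.
counts = {
    "casual_messages": 81,
    "long_form_thoughts": 145,
    "meeting_notes": 81,
    "mixed_content": 70,
    "professional_emails": 80,
    "questions_and_asks": 70,
    "technical_dictation": 80,
    "todo_lists": 81,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Print categories from largest to smallest share of the dataset.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{counts[name]:>4}  {share:6.1%}")
print(f"{'total':<22}{total:>4}")
```

long_form_thoughts comes out at roughly a fifth of the dataset, while every other category sits between about 10% and 12%.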

length distribution

Raw inputs span 2 to 70 words with a median of 17. Clean outputs are slightly shorter on average (median of 16 words) because fillers and stutters are removed. The categories show meaningfully different length distributions: short utterances dominate casual_messages, questions_and_asks, and mixed_content; long paragraph-shaped inputs dominate long_form_thoughts.
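A length summary of this kind could be computed as follows. This is a sketch under the assumption that a "word" is a whitespace-separated token; the three pairs are sample pairs from this report, so the numbers printed describe only those three, not the full dataset.

```python
from statistics import median


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (assumed definition of a word)."""
    return len(text.split())


# Three (raw, clean) sample pairs from this report.
pairs = [
    ("can you send me the deck",
     "Can you send me the deck?"),
    ("uh action item add a slack channel for the the procurement project",
     "Action item: add a Slack channel for the procurement project."),
    ("ok so we decided to ship the dark mode rollout next sprint priya is the owner",
     "We decided to ship the dark mode rollout next sprint. Priya is the owner."),
]

raw_lengths = [word_count(raw) for raw, _ in pairs]
clean_lengths = [word_count(clean) for _, clean in pairs]

print("raw   min/median/max:", min(raw_lengths), median(raw_lengths), max(raw_lengths))
print("clean min/median/max:", min(clean_lengths), median(clean_lengths), max(clean_lengths))
```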

raw vs clean length

Points below the diagonal are pairs where clean is shorter than raw, so the model is being trained to remove material, not add it. The cluster sits just below the diagonal, which is the expected shape for a faithful cleanup task: a few words removed per input on average, never more than ~25%.
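The "below the diagonal" property can be checked per pair as the fraction of words removed. The sketch below assumes whitespace tokenization and uses the ~25% ceiling quoted above as a flagging threshold; `removed_fraction` and `MAX_REMOVED` are names made up for this example.

```python
MAX_REMOVED = 0.25  # the "~25%" ceiling quoted in this report


def removed_fraction(raw: str, clean: str) -> float:
    """Fraction of raw words that no longer appear in the clean word count.

    0.0 means the pair sits on the diagonal; positive values sit below it.
    """
    raw_words = len(raw.split())
    clean_words = len(clean.split())
    if raw_words == 0:
        raise ValueError("raw must contain at least one word")
    return 1.0 - clean_words / raw_words


# Two (raw, clean) sample pairs from this report.
pairs = [
    ("can you send me the deck",
     "Can you send me the deck?"),
    ("uh action item add a slack channel for the the procurement project",
     "Action item: add a Slack channel for the procurement project."),
]

for raw, clean in pairs:
    frac = removed_fraction(raw, clean)
    status = "ok" if 0.0 <= frac <= MAX_REMOVED else "check"
    print(f"{frac:6.1%}  {status}  {raw[:40]}")
```

Pairs flagged "check" would either add words (negative fraction) or remove more than the ceiling, and are candidates for manual review.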

disfluency intensity

Average filler-word count per raw input, by category. meeting_notes and long_form_thoughts carry the heaviest disfluency load (people think out loud during meetings); mixed_content and questions_and_asks are leanest (those categories are about precision, not verbosity).

Distribution of filler words across the entire dataset. um and uh dominate (matching real Parakeet output), with like, you know, and so following at a moderate rate. The mix matches what shows up in real dictation transcripts.
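A filler tally like the one described here might be produced with a whole-word regular expression. The filler list below contains only the five fillers named above, and the matching is deliberately naive: it counts every occurrence of "like" or "so", including uses that are not disfluencies, so treat it as an upper bound rather than the method behind the report.

```python
import re
from collections import Counter

# Assumption: only the fillers named in this report are counted.
FILLERS = ("you know", "um", "uh", "like", "so")

# \b anchors keep "um" from matching inside "sum" and "like" inside "likely".
FILLER_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(f) for f in FILLERS) + r")\b"
)


def count_fillers(raw: str) -> Counter:
    """Count whole-word filler occurrences in one raw input."""
    return Counter(FILLER_PATTERN.findall(raw))


# Raw sample inputs from this report.
raws = [
    "uh action item add a slack channel for the the procurement project",
    "ok so we decided to ship the dark mode rollout next sprint priya is the owner",
]

dataset_totals = Counter()
for raw in raws:
    dataset_totals.update(count_fillers(raw))

print(dataset_totals)
print("average fillers per input:", sum(dataset_totals.values()) / len(raws))
```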

faithfulness check

For each pair, we compute the fraction of content words in the clean side that also appear in the raw side. A perfect value is 1.0 (every clean content word came from raw); lower values indicate the clean introduced content the raw did not have, which would train the model to hallucinate.

Mean faithfulness: 0.929
Median faithfulness: 0.957
Pairs above 0.95 threshold: 391 of 688 (56.8%)
Pairs above 0.90 threshold: 527 of 688 (76.6%)

Small drops below 1.0 come from legitimate sources: number-word to digit conversion ("two thirty" -> "2:30"); proper-noun capitalization, which can add new tokens to the content-word set under our simple lowercase comparison ("acme" -> "Acme" should count as a match, but the naive check may miss some cases); and contractions restored with an apostrophe ("im" -> "I'm").
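The faithfulness score can be sketched as below. This is an illustration under stated assumptions, not the script that produced the numbers above: the stopword list is a small made-up one, tokens are lowercased, and apostrophes are dropped before comparison so that "I'll" matches a raw "ill".

```python
import re

# Assumption: a tiny illustrative stopword list; the report does not say
# which words it excludes when it talks about "content words".
STOPWORDS = frozenset(
    {"a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at", "for", "is", "it"}
)


def content_words(text: str) -> list[str]:
    """Lowercase, drop apostrophes, split on non-alphanumerics, remove stopwords."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS]


def faithfulness(raw: str, clean: str) -> float:
    """Fraction of clean content words that also appear in raw (1.0 is perfect)."""
    raw_vocab = set(content_words(raw))
    clean_tokens = content_words(clean)
    if not clean_tokens:
        return 1.0  # nothing in clean that could have been invented
    return sum(t in raw_vocab for t in clean_tokens) / len(clean_tokens)


# Sample pairs from this report.
exact = faithfulness(
    "thanks ill review and get back to you by tomorrow noon",
    "Thanks, I'll review and get back to you by tomorrow noon.",
)
numeric = faithfulness(
    "meeting with sarah at two fifteen pm tomorrow",
    "Meeting with Sarah at 2:15 PM tomorrow.",
)
print(f"contraction pair: {exact:.3f}")
print(f"number pair:      {numeric:.3f}")
```

The second pair scores below 1.0 only because "2" and "15" do not appear literally in the raw text, which is the number-word to digit effect described above.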

sample pairs

Two per category, illustrating the shape of the dataset:

`casual_messages`

raw: hey sarah you free for lunch around noon
clean: Hey Sarah, you free for lunch around noon?
raw: did you finish the writeup for the all hands
clean: Did you finish the writeup for the all hands?

`long_form_thoughts`

raw: so ive been mulling over the pricing change and honestly i think we we moved too fast on the annual discount like the data showed a bump in conversions but um when you look at retention three months out the cohort actually churns harder than the monthly folks so we might be optimizing for a vanity metric instead of long term value
clean: I've been mulling over the pricing change, and honestly I think we moved too fast on the annual discount. The data showed a bump in conversions, but when you look at retention three months out, the cohort actually churns harder than the monthly folks. We might be optimizing for a vanity metric instead of long-term value.
raw: the thing that bugs me about our okr process is that we we set them in january then never look at them again until december so they become this performative document instead of a living tool and um if a goal isnt influencing weekly decisions its not really a goal its just a wish we wrote down
clean: The thing that bugs me about our OKR process is that we set them in January then never look at them again until December, so they become this performative document instead of a living tool. If a goal isn't influencing weekly decisions, it's not really a goal. It's just a wish we wrote down.

`meeting_notes`

raw: ok so we decided to ship the dark mode rollout next sprint priya is the owner
clean: We decided to ship the dark mode rollout next sprint. Priya is the owner.
raw: uh action item add a slack channel for the the procurement project
clean: Action item: add a Slack channel for the procurement project.

`mixed_content`

raw: meeting with sarah at two fifteen pm tomorrow
clean: Meeting with Sarah at 2:15 PM tomorrow.
raw: uber driver arriving in four minutes silver toyota camry license seven a b c three four five
clean: Uber driver arriving in 4 minutes. Silver Toyota Camry, license 7ABC345.

`professional_emails`

raw: hi marcus quick follow up on the contract did legal sign off yet
clean: Hi Marcus, quick follow up on the contract. Did legal sign off yet?
raw: thanks ill review and get back to you by tomorrow noon
clean: Thanks, I'll review and get back to you by tomorrow noon.

`questions_and_asks`

raw: can you send me the deck
clean: Can you send me the deck?
raw: is there a reason why we picked dynamodb over postgres for this i want to understand the tradeoff
clean: Is there a reason why we picked DynamoDB over Postgres for this? I want to understand the tradeoff.

`technical_dictation`

raw: the mutex is held across the await point which deadlocks tokio
clean: The mutex is held across the await point, which deadlocks Tokio.
raw: well i mean the the perf regression came from the new react server component because its now making a fresh database connection per render instead of reusing the pool
clean: The perf regression came from the new React server component because it's now making a fresh database connection per render instead of reusing the pool.

`todo_lists`

raw: pick up bread milk and eggs
clean: Pick up bread, milk, and eggs.
raw: favorite restaurants in seattle are canlis altura and the walrus and the carpenter
clean: Favorite restaurants in Seattle:
Canlis
Altura
The Walrus and the Carpenter

limitations

Synthetic origin: every pair was generated by an LLM workflow, not transcribed from real Parakeet output. The disfluency patterns are modeled to match real ASR failure modes but may under-represent edge cases the model will face in production.
Size: 688 pairs sits at the lower-middle end of the documented sweet spot for narrow LoRA fine-tunes (200-500 floor, 2k-5k comfortable). That is adequate for a v1 ship; if the eval pass rate falls below 0.85, we will generate another 600-1000 pairs and retrain.
Faithfulness is statistical, not strict: 297 of the 688 pairs do not clear the 0.95 threshold, because of legitimate transformations (numeric formatting, proper-noun casing). We don't filter these out because the training task explicitly wants the model to learn those transformations.
English only.