Mumble cleanup model report
A small fine-tuned language model that cleans dictation transcripts. Trained on a GPU, runs on a CPU.
What it does
You dictate something messy. It returns the cleaned version. Five things it handles:
- removes filler words (um, uh, like, you know)
- collapses word stutters ("we we" -> "we")
- recovers punctuation and capitalization
- corrects homophones (their / there, your / you're)
- formats numbers, dates, lists where the cue is clear
Example: "um so the the meeting is at three thirty tomorrow" becomes "The meeting is at 3:30 tomorrow."
How it was built
```mermaid
flowchart TD
    A[hand-curated seed jsonl<br/>688 pairs, 8 categories] --> B[stratified split<br/>85/10/5 train/val/test]
    B --> C[lora sft on qwen2.5-0.5b-instruct<br/>1 epoch on a single rtx 4090]
    C --> D[eval on held-out test<br/>raw vs base vs fine-tuned]
    D --> E[merge lora + export onnx<br/>fp32 + int8 for cpu inference]
    E --> F[cpu latency benchmark<br/>run on the target laptop]
```
The seed dataset was generated by a multi-agent workflow that spawned eight specialist agents in parallel, each producing ~70-80 pairs in a distinct dictation category. After dedup, the final dataset has 612 unique pairs.
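The stratified 85/10/5 split named in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the project's actual split script: the record fields `category`, `raw`, and `clean` are hypothetical names, and the rounding rule is an assumption.

```python
import random
from collections import defaultdict


def stratified_split(pairs, fractions=(0.85, 0.10, 0.05), seed=0):
    """Split records into train/val/test, keeping each category's share roughly equal."""
    by_category = defaultdict(list)
    for pair in pairs:
        by_category[pair["category"]].append(pair)  # "category" is a hypothetical field name

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_val = round(len(items) * fractions[1])
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])  # whatever remains goes to test
    return train, val, test


# Toy data: 8 categories with 20 placeholder pairs each.
pairs = [
    {"category": f"cat{c}", "raw": f"raw {c}-{i}", "clean": f"clean {c}-{i}"}
    for c in range(8)
    for i in range(20)
]
train, val, test = stratified_split(pairs)
print(len(train), len(val), len(test))
```

Splitting per category rather than over the whole pool means a small category cannot end up missing from the validation or test set by chance.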
Training uses TRL's SFTTrainer with DataCollatorForCompletionOnlyLM. The collator masks system and user tokens with -100, so cross-entropy only fires on the assistant turn. This is what keeps the model honest: gradients flow only through the cleaned output, never through the raw disfluent input.
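To make the masking concrete, here is a plain-Python sketch of the idea behind completion-only labels. It does not call TRL; the token ids and the response-template ids are made-up placeholders, and the helper names are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_subsequence(sequence, pattern):
    """Return the index where pattern first occurs in sequence, or -1 if absent."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_tokens(input_ids, response_template_ids):
    """Build labels that keep only the tokens after the response template."""
    start = find_subsequence(input_ids, response_template_ids)
    if start < 0:
        # No assistant turn found: in this sketch the whole example is ignored.
        return [IGNORE_INDEX] * len(input_ids)
    response_start = start + len(response_template_ids)
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])


# Placeholder ids: system+user tokens, then the assistant header, then the cleaned text.
system_and_user = [1, 11, 12, 13]
assistant_header = [7, 8]
assistant_text = [21, 22, 2]

input_ids = system_and_user + assistant_header + assistant_text
labels = mask_prompt_tokens(input_ids, assistant_header)
print(labels)
```

Only the last three positions keep real labels, so the loss is computed on the cleaned output alone and the disfluent input contributes nothing to the gradient.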
Accuracy
Filled in after running `make evaluate`.
| model | disfluency removal | punct f1 | faithfulness | length ratio | pass rate |
|---|---|---|---|---|---|
| raw (no cleanup) | tbd | tbd | tbd | tbd | tbd |
| Qwen base zero-shot | tbd | tbd | tbd | tbd | tbd |
| fine-tuned | tbd | tbd | tbd | tbd | tbd |
Pass rate is the percentage of test examples that simultaneously meet: disfluency removal ≥ 0.95, punctuation F1 ≥ 0.85, faithfulness ≥ 0.98, length ratio in [0.85, 1.05].
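The pass-rate rule above can be written as a small check. This is a sketch under assumptions: the `ExampleMetrics` class and its field names are hypothetical, and the numbers in the usage example are toy values chosen to exercise each threshold, not evaluation results.

```python
from dataclasses import dataclass


@dataclass
class ExampleMetrics:
    """Per-example scores; field names are illustrative, not the project's schema."""
    disfluency_removal: float
    punct_f1: float
    faithfulness: float
    length_ratio: float


def passes(m):
    """True only when all four thresholds from the report hold at once."""
    return (
        m.disfluency_removal >= 0.95
        and m.punct_f1 >= 0.85
        and m.faithfulness >= 0.98
        and 0.85 <= m.length_ratio <= 1.05
    )


def pass_rate(metrics):
    """Percentage of examples that pass; 0.0 for an empty list."""
    if not metrics:
        return 0.0
    return 100.0 * sum(passes(m) for m in metrics) / len(metrics)


# Toy values only, to show the rule in action.
examples = [
    ExampleMetrics(1.00, 0.90, 1.00, 0.92),  # meets every threshold
    ExampleMetrics(1.00, 0.90, 1.00, 0.70),  # output too short: length ratio fails
    ExampleMetrics(0.90, 0.90, 1.00, 0.95),  # leftover disfluencies
    ExampleMetrics(0.95, 0.85, 0.98, 1.05),  # exactly on every boundary, still passes
]
rate = pass_rate(examples)
print(rate)
```

Because the conditions are combined with `and`, a single weak metric fails the example, which makes pass rate stricter than any of the per-metric averages.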
The base model has a documented failure mode: it answers questions instead of cleaning them ("what's the capital of france" → "Paris"). The adversarial question check confirms whether fine-tuning corrects this.
Training
Filled in after running `make evaluate`. Look for: train loss drops smoothly, val loss tracks train, no late divergence.
Speed on CPU
Measured on a laptop CPU. Laptop timings are noisy because they depend on what else the machine is doing; treat these as approximate. For an authoritative number, run the benchmark inside the actual deployment environment (the Mumble Tauri app via the Rust `ort` crate).
| input length (tokens) | fp32 p50 (ms) | fp32 p95 (ms) | int8 p50 (ms) | int8 p95 (ms) |
|---|---|---|---|---|
| 16 | tbd | tbd | tbd | tbd |
| 32 | tbd | tbd | tbd | tbd |
| 64 | tbd | tbd | tbd | tbd |
| 128 | tbd | tbd | tbd | tbd |
| 256 | tbd | tbd | tbd | tbd |
| 512 | tbd | tbd | tbd | tbd |
Realistic mix on ~500 real test inputs (variable length): tbd.
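For readers who want to reproduce p50/p95 columns like the ones above, a minimal timing harness might look like this. It is a sketch, not the project's benchmark script: `fake_cleanup` is a stand-in workload, and the warmup and run counts are arbitrary assumptions. In practice the callable would wrap one ONNX inference at a fixed input length.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=50):
    """Time fn repeatedly and return p50/p95 latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches before measuring

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # quantiles(n=100) returns 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def fake_cleanup():
    """Placeholder workload standing in for one model call."""
    sum(i * i for i in range(10_000))


result = benchmark(fake_cleanup)
print(f"p50={result['p50']:.3f} ms  p95={result['p95']:.3f} ms")
```

Reporting p95 alongside p50 matters on a laptop because background load shows up in the tail long before it moves the median.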
What you get
The deliverable is:
- `runs/<run-id>/onnx/model.onnx`: fp32 ONNX, ~1 GB
- `runs/<run-id>/onnx/int8/model.onnx`: int8 ONNX, ~250 MB (target for the Mumble app)
Both run on CPU with onnxruntime. The Rust ort crate consumes the int8 build.
Limits
- English only.
- Trained on synthetic data. Test set is held out from the same synthetic distribution. Real ASR output may have failure modes the synthetic operators did not model. The cleanup operators were tuned to match Parakeet's failure distribution as observed in the bench harness; expect some domain shift in production.
- Inputs longer than 512 tokens must be chunked before cleanup.
- Single-turn only. Does not maintain conversation history.
- Fixed system prompt baked in at training time. Changing the prompt at inference will degrade quality.
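The 512-token limit means long dictations need to be split by the caller. One possible approach is greedy packing on word boundaries, sketched below. The function name is hypothetical, and the default whitespace token counter is only a stand-in: a real integration would presumably count with the model's own tokenizer and leave headroom for the system prompt and chat template.

```python
def chunk_transcript(text, max_tokens=512, count_tokens=None):
    """Greedily pack whole words into chunks that stay within max_tokens."""
    if count_tokens is None:
        # Stand-in counter: one token per whitespace-separated word (an assumption).
        count_tokens = lambda s: len(s.split())

    chunks, current = [], []
    for word in text.split():
        candidate = current + [word]
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            chunks.append(" ".join(current))  # close the full chunk
            current = [word]  # start the next chunk with this word
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy transcript of 1200 placeholder words.
transcript = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_transcript(transcript, max_tokens=512)
print([len(c.split()) for c in chunks])
```

Raw dictation usually has no reliable punctuation, so this sketch splits on words rather than sentences. A cut can still land mid-sentence, which may hurt punctuation recovery near chunk edges; splitting at pauses reported by the ASR system would likely be a better boundary where that information is available.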


