Instructions to use adikuma/mumble-cleanup with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use adikuma/mumble-cleanup with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="adikuma/mumble-cleanup")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("adikuma/mumble-cleanup")
model = AutoModelForCausalLM.from_pretrained("adikuma/mumble-cleanup")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use adikuma/mumble-cleanup with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "adikuma/mumble-cleanup"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adikuma/mumble-cleanup",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/adikuma/mumble-cleanup

SGLang

How to use adikuma/mumble-cleanup with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "adikuma/mumble-cleanup" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adikuma/mumble-cleanup",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "adikuma/mumble-cleanup" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adikuma/mumble-cleanup",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use adikuma/mumble-cleanup with Docker Model Runner:
```
docker model run hf.co/adikuma/mumble-cleanup
```

adikuma

initial upload: cleanup code and 688-pair seed dataset

fd0b01f verified about 1 month ago

Mumble cleanup model report

A small fine-tuned language model that cleans dictation transcripts. Trained on a GPU, runs on a CPU.

What it does

You dictate something messy. It returns the cleaned version. Five things it handles:

- removes filler words (um, uh, like, you know)
- collapses word stutters ("we we" → "we")
- recovers punctuation and capitalization
- corrects homophones (their / there, your / you're)
- formats numbers, dates, and lists where the cue is clear

Example: "um so the the meeting is at three thirty tomorrow" becomes "The meeting is at 3:30 tomorrow."

How it was built

flowchart TD
    A[hand-curated seed jsonl<br/>688 pairs, 8 categories] --> B[stratified split<br/>85/10/5 train/val/test]
    B --> C[lora sft on qwen2.5-0.5b-instruct<br/>1 epoch on a single rtx 4090]
    C --> D[eval on held-out test<br/>raw vs base vs fine-tuned]
    D --> E[merge lora + export onnx<br/>fp32 + int8 for cpu inference]
    E --> F[cpu latency benchmark<br/>run on the target laptop]

The seed dataset was generated by a multi-agent workflow that spawned eight specialist agents in parallel, each producing ~70-80 pairs in a distinct dictation category. After dedup, the final dataset has 688 unique pairs.
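The stratified 85/10/5 split named in the flowchart can be sketched as follows. This is a minimal illustration, not the project's actual split script: the field names `category`, `raw`, and `clean`, the equal-sized toy categories, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(pairs, ratios=(0.85, 0.10, 0.05), seed=0):
    """Split pairs into train/val/test so each category keeps roughly the same share."""
    by_category = defaultdict(list)
    for pair in pairs:
        by_category[pair["category"]].append(pair)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_val = round(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # whatever remains goes to test
    return train, val, test

# Toy data: 8 hypothetical categories of equal size (688 pairs in total).
pairs = [
    {"category": f"cat_{c}", "raw": f"raw {c}-{i}", "clean": f"clean {c}-{i}"}
    for c in range(8)
    for i in range(86)
]
train, val, test = stratified_split(pairs)
print(len(train), len(val), len(test))
```

Splitting inside each category, rather than over the pooled list, keeps every category represented in the small test set.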

Training uses TRL's SFTTrainer with DataCollatorForCompletionOnlyLM. The collator sets the labels of system and user tokens to -100, so cross-entropy is computed only on the assistant turn. This is what keeps the model honest: the loss rewards producing the cleaned output and never rewards reproducing the raw disfluent input, which the model still reads as context.
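The effect of the -100 masking can be shown in plain Python. This is a simplified sketch rather than TRL's implementation: it skips the next-token shift a causal LM applies, and the token ids and the `response_start` index are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips

def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids into labels, masking every position before the assistant turn."""
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])

def masked_cross_entropy(logprobs, labels):
    """Mean negative log-likelihood over positions whose label is not masked.

    logprobs[t] maps a token id to the log-probability predicted at position t.
    """
    kept = [(lp, y) for lp, y in zip(logprobs, labels) if y != IGNORE_INDEX]
    return -sum(lp[y] for lp, y in kept) / len(kept)

# Toy sequence: 4 prompt tokens (system + user) followed by 3 assistant tokens.
input_ids = [11, 12, 13, 14, 21, 22, 23]
labels = mask_prompt_labels(input_ids, response_start=4)
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]

# Toy predictions: the correct token gets probability 0.5 at every position.
logprobs = [{tok: math.log(0.5)} for tok in input_ids]
loss = masked_cross_entropy(logprobs, labels)
print(round(loss, 3))  # 0.693, averaged over the 3 assistant tokens only
```

The four prompt positions contribute nothing to the average, which is the behavior the collator provides during fine-tuning.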

Accuracy

Filled in after running make evaluate.

| model | disfluency removal | punct f1 | faithfulness | length ratio | pass rate |
| --- | --- | --- | --- | --- | --- |
| raw (no cleanup) | tbd | tbd | tbd | tbd | tbd |
| Qwen base zero-shot | tbd | tbd | tbd | tbd | tbd |
| fine-tuned | tbd | tbd | tbd | tbd | tbd |

Pass rate is the percentage of test examples that simultaneously meet: disfluency removal ≥ 0.95, punctuation F1 ≥ 0.85, faithfulness ≥ 0.98, length ratio in [0.85, 1.05].
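The pass-rate rule can be written down directly from those four thresholds. The sketch below assumes the per-example metrics are already computed and stored under the hypothetical keys shown; the scores are invented to exercise the rule and are not evaluation results.

```python
# Minimum values taken from the pass-rate definition above.
THRESHOLDS = {
    "disfluency_removal": 0.95,
    "punct_f1": 0.85,
    "faithfulness": 0.98,
}
LENGTH_RATIO_RANGE = (0.85, 1.05)

def passes(example):
    """True only if every threshold and the length-ratio range hold at once."""
    low, high = LENGTH_RATIO_RANGE
    meets_minimums = all(example[name] >= minimum for name, minimum in THRESHOLDS.items())
    return meets_minimums and low <= example["length_ratio"] <= high

def pass_rate(examples):
    """Percentage of examples that pass."""
    return 100.0 * sum(passes(e) for e in examples) / len(examples)

# Invented scores for illustration only.
examples = [
    {"disfluency_removal": 1.00, "punct_f1": 0.90, "faithfulness": 1.00, "length_ratio": 0.92},  # passes
    {"disfluency_removal": 1.00, "punct_f1": 0.80, "faithfulness": 1.00, "length_ratio": 0.92},  # punct f1 too low
    {"disfluency_removal": 1.00, "punct_f1": 0.90, "faithfulness": 1.00, "length_ratio": 1.20},  # output too long
    {"disfluency_removal": 0.95, "punct_f1": 0.85, "faithfulness": 0.98, "length_ratio": 0.85},  # on the boundary, passes
]
print(pass_rate(examples))  # 50.0
```

Because the criteria are combined with a logical AND, the pass rate can never exceed the pass rate of any single metric on its own.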

The base model has a documented failure mode: it answers questions instead of cleaning them ("what's the capital of france" → "Paris"). The adversarial question check confirms whether fine-tuning corrects this.

Training

Filled in after running make evaluate. Look for: train loss drops smoothly, val loss tracks train, no late divergence.

Speed on CPU

Measured on a laptop CPU. Laptop timings are noisy because they depend on what else the machine is doing; treat these as approximate. For an authoritative number, run the benchmark inside the actual deployment environment (the Mumble Tauri app via the Rust ort crate).

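| input length (tokens) | fp32 p50 (ms) | fp32 p95 (ms) | int8 p50 (ms) | int8 p95 (ms) |
| --- | --- | --- | --- | --- |
| 16 | tbd | tbd | tbd | tbd |
| 32 | tbd | tbd | tbd | tbd |
| 64 | tbd | tbd | tbd | tbd |
| 128 | tbd | tbd | tbd | tbd |
| 256 | tbd | tbd | tbd | tbd |
| 512 | tbd | tbd | tbd | tbd |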

Realistic mix on ~500 real test inputs (variable length): tbd.
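A p50/p95 measurement of this shape could be collected with a loop like the one below. It is a sketch under assumptions: `fake_cleanup` is a placeholder workload standing in for a real inference call, the warmup and run counts are arbitrary, and the percentile uses the nearest-rank definition, which may differ from what the project's benchmark uses.

```python
import math
import time

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of the data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def benchmark(fn, arg, warmup=3, runs=50):
    """Time fn(arg) repeatedly and return p50 and p95 in milliseconds."""
    for _ in range(warmup):  # warmup runs are discarded
        fn(arg)
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arg)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {"p50": percentile(timings_ms, 50), "p95": percentile(timings_ms, 95)}

def fake_cleanup(tokens):
    # Placeholder workload; a real benchmark would run the ONNX model here.
    return sum(t * t for t in tokens)

for length in (16, 32, 64, 128, 256, 512):
    stats = benchmark(fake_cleanup, list(range(length)))
    print(length, f"p50={stats['p50']:.4f} ms", f"p95={stats['p95']:.4f} ms")
```

Reporting p95 next to p50 matters on a laptop, where background load shows up as a long tail rather than as a shift in the median.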

What you get

The deliverables are:

- runs/<run-id>/onnx/model.onnx: fp32 ONNX, ~1 GB
- runs/<run-id>/onnx/int8/model.onnx: int8 ONNX, ~250 MB (target for the Mumble app)

Both run on CPU with onnxruntime. The Rust ort crate consumes the int8 build.

Limits

- English only.
- Trained on synthetic data. Test set is held out from the same synthetic distribution. Real ASR output may have failure modes the synthetic operators did not model. The cleanup operators were tuned to match Parakeet's failure distribution as observed in the bench harness; expect some domain shift in production.
- Inputs longer than 512 tokens must be chunked before cleanup.
- Single-turn only. Does not maintain conversation history.
- Fixed system prompt baked in at training time. Changing the prompt at inference will degrade quality.
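One way to respect the 512-token limit is to cut long transcripts into fixed-size windows and clean each window separately. The sketch below is an assumption-heavy illustration: it counts whitespace-separated words as a stand-in for model tokens, and `cleanup_fn` is a hypothetical callable wrapping the model. A real implementation would count tokens with the model's tokenizer and leave headroom, since one word can map to several tokens.

```python
def chunk_words(text, max_tokens=512):
    """Split text into consecutive windows of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def cleanup_long(text, cleanup_fn, max_tokens=512):
    """Clean each window independently and join the results with spaces."""
    return " ".join(cleanup_fn(chunk) for chunk in chunk_words(text, max_tokens))

# Toy transcript of 1200 words.
transcript = " ".join(["um"] * 1200)
chunks = chunk_words(transcript)
print([len(c.split()) for c in chunks])  # [512, 512, 176]

# Identity function used as a stand-in for the model call.
restored = cleanup_long(transcript, cleanup_fn=lambda chunk: chunk)
print(restored == transcript)  # True
```

Fixed windows can cut a sentence in half, so punctuation recovery near chunk boundaries may be weaker than in the middle of a chunk.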