Qwen Enhanced Typo Fixer

A fine-tuned Qwen model for typo correction using advanced error patterns and multi-domain training data.

Model Description

This model is a fine-tuned version of Qwen/Qwen2-0.5B for typo correction. It was trained on an enhanced dataset featuring:

  • 80,677 training examples with realistic error patterns
  • Multi-domain coverage: conversational, professional, educational, creative, instructional, general
  • Advanced error types: keyboard errors, phonetic confusions, contextual mistakes, punctuation variations
  • Balanced punctuation: 50/50 split between sentences with/without ending punctuation

Training Details

  • Base Model: Qwen/Qwen2-0.5B
  • Training Hardware: Dual RTX 5090 (48 GB total VRAM)
  • Dataset Size: 80,677 examples
  • Epochs: 3
  • Batch Size: 32 (16 per GPU × 2 GPUs)
  • Learning Rate: 5e-5

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mazhewitt/qwen-typo-fixer", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("mazhewitt/qwen-typo-fixer", trust_remote_code=True)

# Example usage
text_with_typos = "I beleive this is teh correct answr."
prompt = f"<|im_start|>user\nCorrect the typos in this text: {text_with_typos}<|im_end|>\n<|im_start|>assistant\n"

inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is required for temperature to take effect; a low
# temperature keeps corrections near-deterministic.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens so the prompt is not echoed back.
correction = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(correction)
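The ChatML-style prompt above can be wrapped in small helpers so the formatting lives in one place. `build_prompt` and `extract_correction` are illustrative names, not part of the model's API:

```python
def build_prompt(text: str) -> str:
    """Wrap raw text in the ChatML-style prompt the model was trained on."""
    return (
        "<|im_start|>user\n"
        f"Correct the typos in this text: {text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def extract_correction(decoded: str) -> str:
    """Pull the assistant's reply out of a fully decoded generation."""
    reply = decoded.split("<|im_start|>assistant\n")[-1]
    return reply.split("<|im_end|>")[0].strip()
```

Because the prompt ends with the opening assistant tag, generation continues directly with the corrected sentence, and `extract_correction` recovers it even when special tokens are kept in the decoded string.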

Training Configuration

The model was trained with the following key parameters:

  • Learning rate: 5e-5
  • Batch size: 32 (16 per GPU × 2 GPUs)
  • Gradient accumulation: 2 steps
  • Weight decay: 0.01
  • Warmup ratio: 0.1
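The training script itself is not published; as a sketch, the parameters above would map onto `transformers.TrainingArguments` roughly as follows (the `output_dir` name is hypothetical):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the run configuration from the
# numbers listed in this card; not the actual training script.
training_args = TrainingArguments(
    output_dir="qwen-typo-fixer",
    num_train_epochs=3,
    per_device_train_batch_size=16,   # x 2 GPUs = 32 per optimizer step
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    bf16=True,                        # published weights are BF16
)
```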

Dataset Features

Error Pattern Distribution

  • Spelling errors: ~60% of injected errors
  • Keyboard errors: ~30%
  • Phonetic errors: ~8%
  • Grammar errors: ~2%
  • Punctuation errors: <1%
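Keyboard errors of the kind listed above are typically generated by swapping a character for a physically adjacent key. The dataset generator is not published; this is a minimal sketch, with `KEY_NEIGHBORS` covering only a few keys for illustration:

```python
import random

# Tiny adjacency map for illustration; a real generator would
# cover the full QWERTY layout.
KEY_NEIGHBORS = {
    "a": "qs", "e": "wr", "h": "gj", "t": "ry", "o": "ip",
}

def inject_keyboard_error(word: str, rng: random.Random) -> str:
    """Replace one character with an adjacent key, if any character qualifies."""
    candidates = [i for i, c in enumerate(word) if c in KEY_NEIGHBORS]
    if not candidates:
        return word
    i = rng.choice(candidates)
    typo = rng.choice(KEY_NEIGHBORS[word[i]])
    return word[:i] + typo + word[i + 1:]
```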

Domain Distribution

  • Educational: 15,837 (19.6%)
  • Instructional: 15,469 (19.2%)
  • Creative: 15,254 (18.9%)
  • Professional: 14,582 (18.1%)
  • Conversational: 12,875 (16.0%)
  • General: 6,660 (8.3%)

Complexity Distribution

  • Simple: 39,159 (48.5%)
  • Medium: 28,959 (35.9%)
  • Complex: 12,559 (15.6%)

Evaluation

The model achieves strong performance on typo correction tasks, with particular strength in:

  • Single and multi-word typos
  • Contextual corrections
  • Maintaining original meaning and style
  • Handling various text domains

Limitations

  • Optimized for English text
  • Best performance on sentences under 150 characters
  • May struggle with highly technical or domain-specific terminology
  • Designed for typo correction, not general text improvement
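Given the ~150-character sweet spot noted above, longer inputs can be corrected one sentence at a time. A naive splitter is sketched below; real text may warrant a proper sentence tokenizer, and `correct_fn` stands in for any per-sentence correction callable:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence split on ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def correct_long_text(text: str, correct_fn) -> str:
    """Apply a per-sentence correction function and rejoin with spaces."""
    return " ".join(correct_fn(s) for s in split_sentences(text))
```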

Citation

If you use this model, please cite:

@misc{qwen-enhanced-typo-fixer,
  title={Qwen Enhanced Typo Fixer},
  author={mazhewitt},
  year={2025},
  url={https://huggingface.co/mazhewitt/qwen-typo-fixer}
}

Training Data Details

  • Average Difficulty Score: 35.9
  • Average Errors per Example: 1.8
  • Top Error Types:
    • spelling: 70,135 occurrences
    • keyboard: 35,935 occurrences
    • phonetic: 9,875 occurrences
    • grammar: 1,833 occurrences
    • punctuation: 79 occurrences