Qwen Enhanced Typo Fixer
A fine-tuned Qwen model for typo correction using advanced error patterns and multi-domain training data.
Model Description
This model is a fine-tuned version of Qwen/Qwen2-0.5B for typo correction. It was trained on an enhanced dataset featuring:
- 80,677 training examples with realistic error patterns
- Multi-domain coverage: conversational, professional, educational, creative, instructional, general
- Advanced error types: keyboard errors, phonetic confusions, contextual mistakes, punctuation variations
- Balanced punctuation: 50/50 split between sentences with/without ending punctuation
Training Details
- Base Model: Qwen/Qwen2-0.5B
- Training Hardware: Dual RTX 5090 (48 GB total VRAM)
- Dataset Size: 80,677 examples
- Epochs: 3
- Batch Size: 32 (16 per GPU × 2 GPUs)
- Learning Rate: 5e-5
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mazhewitt/qwen-typo-fixer", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("mazhewitt/qwen-typo-fixer", trust_remote_code=True)
# Example usage
text_with_typos = "I beleive this is teh correct answr."
prompt = f"<|im_start|>user\nCorrect the typos in this text: {text_with_typos}<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt")
# temperature only takes effect when sampling is enabled, so set do_sample=True
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens so the prompt is not echoed back
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
correction = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print(correction)
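For repeated calls, the prompt format and generation step shown above can be wrapped in a small helper. The following is a minimal sketch, not part of the released model: the names `build_prompt` and `fix_typos` are hypothetical, and greedy decoding (`do_sample=False`) is an assumption chosen here for deterministic output.

```python
def build_prompt(text: str) -> str:
    """Wrap the input in the ChatML-style prompt used in the usage example."""
    return (
        "<|im_start|>user\n"
        f"Correct the typos in this text: {text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def fix_typos(text: str, tokenizer, model, max_new_tokens: int = 50) -> str:
    """Return the model's correction for `text` (hypothetical helper).

    `tokenizer` and `model` are the objects loaded with from_pretrained.
    """
    inputs = tokenizer(build_prompt(text), return_tensors="pt")
    # Greedy decoding: an assumption made here so results are reproducible.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens and keep only the generated continuation.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
```

With the tokenizer and model already loaded, a call would look like `fix_typos("I beleive this is teh correct answr.", tokenizer, model)`.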
Training Configuration
The model was trained with the following key parameters:
- Learning rate: 5e-5
- Batch size: 32 (16 per GPU × 2 GPUs)
- Gradient accumulation: 2 steps
- Weight decay: 0.01
- Warmup ratio: 0.1
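To see what these parameters imply for the optimizer schedule, the arithmetic can be sketched as follows. This assumes that all 80,677 examples were used for training (no held-out split), that gradient accumulation multiplies the 32-sample global batch, and that warmup steps are rounded up. None of these details is stated in this card, so treat the numbers as estimates.

```python
import math

# Values taken from this card
num_examples = 80_677
per_gpu_batch = 16
num_gpus = 2
grad_accum_steps = 2
epochs = 3
warmup_ratio = 0.1

# Samples per forward/backward pass across both GPUs
global_batch = per_gpu_batch * num_gpus              # 32
# Samples per optimizer update (assumed: accumulation multiplies the batch)
effective_batch = global_batch * grad_accum_steps    # 64

steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
```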
Dataset Features
Error Pattern Distribution
- Spelling errors: 60%+ coverage
- Keyboard errors: 30%+ coverage
- Phonetic errors: 8%+ coverage
- Grammar errors: 2%+ coverage
- Punctuation errors: <1% coverage
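One possible reading of these percentages is that they are each error type's share of all error occurrences. As a quick check under that assumption, the sketch below derives the shares from the occurrence counts listed under Training Data Details. The results land close to the rounded figures above, though not exactly on them.

```python
# Occurrence counts from the Training Data Details section of this card
error_counts = {
    "spelling": 70_135,
    "keyboard": 35_935,
    "phonetic": 9_875,
    "grammar": 1_833,
    "punctuation": 79,
}

total_occurrences = sum(error_counts.values())

# Share of each error type among all error occurrences, in percent
shares = {name: 100 * count / total_occurrences for name, count in error_counts.items()}

for name, pct in shares.items():
    print(f"{name:<12}{pct:6.2f}%")
```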
Domain Distribution
- Educational: 15,837 (19.6%)
- Instructional: 15,469 (19.2%)
- Creative: 15,254 (18.9%)
- Professional: 14,582 (18.1%)
- Conversational: 12,875 (16.0%)
- General: 6,660 (8.3%)
Complexity Distribution
- Simple: 39,159 (48.5%)
- Medium: 28,959 (35.9%)
- Complex: 12,559 (15.6%)
Evaluation
No benchmark scores are reported in this card. Qualitatively, the model performs best on:
- Single and multi-word typos
- Contextual corrections
- Maintaining original meaning and style
- Handling various text domains
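Since this card lists strengths without numeric scores, one simple way to quantify them on your own data is sentence-level exact-match accuracy over (corrupted, clean) pairs. The sketch below is an illustration only: `exact_match_accuracy` and the toy `lookup_corrector` are hypothetical names, and the metric is an assumption, not the evaluation the author ran.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    pairs: Iterable[Tuple[str, str]],
    correct_fn: Callable[[str], str],
) -> float:
    """Fraction of pairs where correct_fn(corrupted) equals the clean text."""
    total = 0
    hits = 0
    for corrupted, clean in pairs:
        total += 1
        if correct_fn(corrupted).strip() == clean.strip():
            hits += 1
    return hits / total if total else 0.0


# Toy stand-in for the model so the example runs without downloading weights
_known_fixes = {"teh cat": "the cat"}


def lookup_corrector(text: str) -> str:
    return _known_fixes.get(text, text)


sample_pairs = [
    ("teh cat", "the cat"),          # fixed by the toy corrector
    ("I beleive you", "I believe you"),  # left unchanged, counts as a miss
]
accuracy = exact_match_accuracy(sample_pairs, lookup_corrector)
print(f"exact match: {accuracy:.2f}")
```

To score the real model, pass a function that runs the generation code from the Usage section in place of `lookup_corrector`.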
Limitations
- Optimized for English text
- Best performance on sentences under 150 characters
- May struggle with highly technical or domain-specific terminology
- Designed for typo correction, not general text improvement
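Given the 150-character guideline above, longer inputs can be split into sentence-sized chunks and corrected one at a time. The following is a minimal sketch: the name `split_for_correction` is hypothetical, and the regex sentence split is a naive assumption, not the segmentation used to build the training data.

```python
import re

MAX_CHARS = 150  # guideline from the limitations list


def split_for_correction(text: str, max_chars: int = MAX_CHARS) -> list:
    """Split text into chunks shorter than max_chars where possible.

    A single word longer than max_chars is kept whole.
    """
    # Naive split after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for sentence in sentences:
        if not sentence:
            continue
        if len(sentence) < max_chars:
            chunks.append(sentence)
            continue
        # Over-long sentence: pack words greedily into chunks under the limit
        current = ""
        for word in sentence.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) >= max_chars and current:
                chunks.append(current)
                current = word
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


print(split_for_correction("I beleive this. Teh answr is here!"))
```

Each chunk can then be passed to the model separately and the corrected pieces joined with spaces.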
Citation
If you use this model, please cite:
@misc{qwen-enhanced-typo-fixer,
title={Qwen Enhanced Typo Fixer},
author={mazhewitt},
year={2025},
url={https://huggingface.co/mazhewitt/qwen-typo-fixer}
}
Training Data Details
- Average Difficulty Score: 35.9
- Average Errors per Example: 1.8
- Top Error Types:
  - spelling: 70,135 occurrences
  - keyboard: 35,935 occurrences
  - phonetic: 9,875 occurrences
  - grammar: 1,833 occurrences
  - punctuation: 79 occurrences