Qwen Enhanced Typo Fixer

A fine-tuned Qwen model for typo correction using advanced error patterns and multi-domain training data.

Model Description

This model is a fine-tuned version of Qwen/Qwen2-0.5B for typo correction. It was trained on an enhanced dataset featuring:

  • 80,677 training examples with realistic error patterns
  • Multi-domain coverage: conversational, professional, educational, creative, instructional, general
  • Advanced error types: keyboard errors, phonetic confusions, contextual mistakes, punctuation variations
  • Balanced punctuation: 50/50 split between sentences with/without ending punctuation

Training Details

  • Base Model: Qwen/Qwen2-0.5B
  • Training Hardware: Dual RTX 5090 (48 GB total VRAM)
  • Dataset Size: 80,677 examples
  • Epochs: 3
  • Batch Size: 32 (16 per GPU × 2 GPUs)
  • Learning Rate: 5e-5

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mazhewitt/qwen-typo-fixer", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("mazhewitt/qwen-typo-fixer", trust_remote_code=True)

# Example usage
text_with_typos = "I beleive this is teh correct answr."
prompt = f"<|im_start|>user\nCorrect the typos in this text: {text_with_typos}<|im_end|>\n<|im_start|>assistant\n"

inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is required for temperature to take effect; a low
# temperature keeps corrections near-deterministic.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens so the prompt is not echoed back.
correction = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(correction)
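The ChatML-style prompt above can be wrapped in small helpers so the formatting lives in one place. `build_prompt` and `extract_correction` are illustrative names, not part of the model's API:

```python
def build_prompt(text: str) -> str:
    """Wrap raw text in the ChatML-style prompt the model was trained on."""
    return (
        "<|im_start|>user\n"
        f"Correct the typos in this text: {text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def extract_correction(decoded: str) -> str:
    """Pull the assistant's reply out of a fully decoded generation."""
    reply = decoded.split("<|im_start|>assistant\n")[-1]
    return reply.split("<|im_end|>")[0].strip()
```

Because the prompt ends with the opening assistant tag, generation continues directly with the corrected sentence, and `extract_correction` recovers it even when special tokens are kept in the decoded string.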

Training Configuration

The model was trained with the following key parameters:

  • Learning rate: 5e-5
  • Batch size: 32 (16 per GPU × 2 GPUs)
  • Gradient accumulation: 2 steps
  • Weight decay: 0.01
  • Warmup ratio: 0.1
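The training script itself is not published; as a sketch, the parameters above would map onto `transformers.TrainingArguments` roughly as follows (the `output_dir` name is hypothetical):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the run configuration from the
# numbers listed in this card; not the actual training script.
training_args = TrainingArguments(
    output_dir="qwen-typo-fixer",
    num_train_epochs=3,
    per_device_train_batch_size=16,   # x 2 GPUs = 32 per optimizer step
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    bf16=True,                        # published weights are BF16
)
```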

Dataset Features

Error Pattern Distribution

  • Spelling errors: ~60% of injected errors
  • Keyboard errors: ~30%
  • Phonetic errors: ~8%
  • Grammar errors: ~2%
  • Punctuation errors: <1%
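Keyboard errors of the kind listed above are typically generated by swapping a character for a physically adjacent key. The dataset generator is not published; this is a minimal sketch, with `KEY_NEIGHBORS` covering only a few keys for illustration:

```python
import random

# Tiny adjacency map for illustration; a real generator would
# cover the full QWERTY layout.
KEY_NEIGHBORS = {
    "a": "qs", "e": "wr", "h": "gj", "t": "ry", "o": "ip",
}

def inject_keyboard_error(word: str, rng: random.Random) -> str:
    """Replace one character with an adjacent key, if any character qualifies."""
    candidates = [i for i, c in enumerate(word) if c in KEY_NEIGHBORS]
    if not candidates:
        return word
    i = rng.choice(candidates)
    typo = rng.choice(KEY_NEIGHBORS[word[i]])
    return word[:i] + typo + word[i + 1:]
```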

Domain Distribution

  • Educational: 15,837 (19.6%)
  • Instructional: 15,469 (19.2%)
  • Creative: 15,254 (18.9%)
  • Professional: 14,582 (18.1%)
  • Conversational: 12,875 (16.0%)
  • General: 6,660 (8.3%)

Complexity Distribution

  • Simple: 39,159 (48.5%)
  • Medium: 28,959 (35.9%)
  • Complex: 12,559 (15.6%)

Evaluation

The model achieves strong performance on typo correction tasks, with particular strength in:

  • Single and multi-word typos
  • Contextual corrections
  • Maintaining original meaning and style
  • Handling various text domains

Limitations

  • Optimized for English text
  • Best performance on sentences under 150 characters
  • May struggle with highly technical or domain-specific terminology
  • Designed for typo correction, not general text improvement
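Given the ~150-character sweet spot noted above, longer inputs can be corrected one sentence at a time. A naive splitter is sketched below; real text may warrant a proper sentence tokenizer, and `correct_fn` stands in for any per-sentence correction callable:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence split on ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def correct_long_text(text: str, correct_fn) -> str:
    """Apply a per-sentence correction function and rejoin with spaces."""
    return " ".join(correct_fn(s) for s in split_sentences(text))
```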

Citation

If you use this model, please cite:

@misc{qwen-enhanced-typo-fixer,
  title={Qwen Enhanced Typo Fixer},
  author={mazhewitt},
  year={2025},
  url={https://huggingface.co/mazhewitt/qwen-typo-fixer}
}

Training Data Details

  • Average Difficulty Score: 35.9
  • Average Errors per Example: 1.8
  • Top Error Types:
    • spelling: 70,135 occurrences
    • keyboard: 35,935 occurrences
    • phonetic: 9,875 occurrences
    • grammar: 1,833 occurrences
    • punctuation: 79 occurrences