---
title: Multilingual Punctuation Capitalization Correction
emoji: 🌍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---

# 🌍 Multilingual Punctuation & Capitalization Correction

This Space provides an interactive interface for restoring punctuation, fixing capitalization, and detecting sentence boundaries in text across **47 languages**.

## Features

- **Multi-language support**: Works with 47 languages including English, French, Spanish, German, Italian, Portuguese, Russian, Turkish, Chinese, Japanese, Arabic, and more
- **Three correction modes**:
  - 📝 **Conservative**: Minimal changes, preserves original flow
  - 📖 **With Sentence Boundaries**: Splits text into clear sentences
  - ⚖️ **Balanced**: Smart chunking for longer texts
- **Interactive UI**: Compare different correction styles and select the best one
- **Copy functionality**: Easy clipboard access for each version

## Model

This application uses the [1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase](https://huggingface.co/1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase) model, which is an XLM-RoBERTa model fine-tuned for:

- Punctuation restoration
- True-casing (capitalization)
- Sentence boundary detection

## Usage

1. Enter text without proper punctuation or capitalization
2. Click "Add Punctuation & Capitalization"
3. Review the three different correction styles
4. Select and copy the version that best fits your needs

## Examples

Try these example inputs:

- English: "hello there how are you doing today i hope everything is going well"
- French: "bonjour comment allez vous aujourdhui jespere que tout va bien"
- Spanish: "hola como estas espero que todo este bien contigo y tu familia"

## Technical Details

- **Base Model**: XLM-RoBERTa
- **Languages Supported**: 47
- **Tasks**: Punctuation restoration, capitalization, sentence boundary detection
- **Framework**: Gradio interface with ONNX runtime for efficient inference

## Limitations

- Model was primarily trained on news data
- May not perform optimally on conversational or informal text
- Some languages may have better performance than others based on training data distribution
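
The **Balanced** mode listed under Features is described as chunking longer texts before correction. The sketch below illustrates one way such chunking could work; it is not taken from this Space's `app.py`. The names `chunk_words`, `correct_in_chunks`, and `demo_restore` are hypothetical, the 100-word default is an assumed value, and `demo_restore` is a trivial stand-in for the model call so that the example runs without downloading anything.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 100) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def correct_in_chunks(
    text: str,
    restore: Callable[[str], str],
    max_words: int = 100,
) -> str:
    """Apply restore to each chunk and join the results with single spaces."""
    return " ".join(restore(chunk) for chunk in chunk_words(text, max_words))


def demo_restore(chunk: str) -> str:
    """Hypothetical stand-in for the model: capitalize and add a period."""
    return chunk[:1].upper() + chunk[1:] + "."


if __name__ == "__main__":
    raw = "hello there how are you doing today i hope everything is going well"
    print(correct_in_chunks(raw, demo_restore, max_words=7))
```

In a real pipeline, `restore` would wrap the punctuation model instead of `demo_restore`. Fixed-size chunks like these can cut a sentence in half, so a production version would likely need overlapping chunks or a boundary-aware split.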