metadata
title: Multilingual Punctuation Capitalization Correction
emoji: 🌍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0

🌍 Multilingual Punctuation & Capitalization Correction

This Space provides an interactive interface for restoring punctuation, fixing capitalization, and detecting sentence boundaries in text across 47 languages.

Features

  • Multi-language support: Works with 47 languages including English, French, Spanish, German, Italian, Portuguese, Russian, Turkish, Chinese, Japanese, Arabic, and more
  • Three correction modes:
    • 📝 Conservative: Minimal changes, preserves original flow
    • 📖 With Sentence Boundaries: Splits text into clear sentences
    • ⚖️ Balanced: Smart chunking for longer texts
  • Interactive UI: Compare different correction styles and select the best one
  • Copy functionality: Easy clipboard access for each version
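
The three modes can be thought of as different ways of presenting the same model output. The sketch below is illustrative only and is not taken from this Space's app.py: it assumes the model returns a list of sentences per input, and the function names (`conservative`, `with_sentence_boundaries`, `chunk_words`) and the 100-word chunk size are hypothetical.

```python
from typing import List


def conservative(sentences: List[str]) -> str:
    """Keep the text as one paragraph: join sentences with single spaces."""
    return " ".join(s.strip() for s in sentences if s.strip())


def with_sentence_boundaries(sentences: List[str]) -> str:
    """Put each detected sentence on its own line."""
    return "\n".join(s.strip() for s in sentences if s.strip())


def chunk_words(text: str, max_words: int = 100) -> List[str]:
    """Split long input into word-bounded chunks before inference.

    A 'balanced' mode could run the model on each chunk and then
    join the results; the chunk size here is an assumed default.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Chunking on word boundaries keeps each piece within a predictable size, at the cost of possibly splitting a sentence across two chunks.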

Model

This application uses the 1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase model, which is an XLM-RoBERTa model fine-tuned for:

  • Punctuation restoration
  • True-casing (capitalization)
  • Sentence boundary detection
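
For readers who want to call the model outside this Space, the following is a minimal sketch. It assumes the `punctuators` Python package is installed and that its `PunctCapSegModelONNX` class accepts this model ID in `from_pretrained` and exposes `infer(texts=..., apply_sbd=...)`; check the model card before relying on these details. The helper names `load_model` and `restore` are hypothetical.

```python
from typing import Any, List


def load_model() -> Any:
    """Load the ONNX model (downloads weights on first use).

    Assumption: the `punctuators` package provides this class and
    accepts the Hugging Face model ID shown here.
    """
    from punctuators.models import PunctCapSegModelONNX  # imported lazily

    return PunctCapSegModelONNX.from_pretrained(
        "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase"
    )


def restore(texts: List[str], model: Any) -> List[List[str]]:
    """Return one list of punctuated, true-cased sentences per input text.

    `apply_sbd=True` asks the model to split on sentence boundaries.
    """
    return model.infer(texts=texts, apply_sbd=True)
```

A typical call would be `restore(["hello there how are you"], load_model())`; the exact output depends on the model and is not shown here.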

Usage

  1. Enter text without proper punctuation or capitalization
  2. Click "Add Punctuation & Capitalization"
  3. Review the three different correction styles
  4. Select and copy the version that best fits your needs
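
The steps above correspond roughly to a Gradio Blocks layout like the one sketched here. This is not the Space's actual app.py: `punctuate` is a hypothetical callable that returns a list of corrected sentences, and only two of the three output styles are wired up to keep the example short.

```python
from typing import Callable, List, Tuple


def make_handler(punctuate: Callable[[str], List[str]]):
    """Build the click handler; `punctuate` is an assumed inference function."""

    def handler(text: str) -> Tuple[str, str]:
        text = text.strip()
        if not text:
            return "", ""  # nothing to correct
        sentences = punctuate(text)
        # One paragraph version and one sentence-per-line version.
        return " ".join(sentences), "\n".join(sentences)

    return handler


def build_demo(punctuate: Callable[[str], List[str]]):
    """Assemble the UI; gradio is imported lazily so the handler can be tested alone."""
    import gradio as gr

    handler = make_handler(punctuate)
    with gr.Blocks() as demo:
        inp = gr.Textbox(label="Input text", lines=4)
        btn = gr.Button("Add Punctuation & Capitalization")
        out_a = gr.Textbox(label="Conservative", show_copy_button=True)
        out_b = gr.Textbox(label="With Sentence Boundaries", show_copy_button=True)
        btn.click(fn=handler, inputs=inp, outputs=[out_a, out_b])
    return demo
```

Calling `build_demo(my_punctuate).launch()` would start the interface locally, assuming `gradio` is installed and `my_punctuate` wraps the model.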

Examples

Try these example inputs:

  • English: "hello there how are you doing today i hope everything is going well"
  • French: "bonjour comment allez vous aujourdhui jespere que tout va bien"
  • Spanish: "hola como estas espero que todo este bien contigo y tu familia"

Technical Details

  • Base Model: XLM-RoBERTa
  • Languages Supported: 47
  • Tasks: Punctuation restoration, capitalization, sentence boundary detection
  • Framework: Gradio interface with ONNX Runtime for efficient inference

Limitations

  • The model was trained primarily on news data
  • It may perform worse on conversational or informal text
  • Performance may vary across languages, depending on the training data distribution