asdfasdfdsafdsa committed on
Commit 06027df · verified · 1 Parent(s): 2435057

Upload 3 files

Files changed (3)
  1. README.md +53 -6
  2. app.py +215 -0
  3. requirements.txt +5 -0
README.md CHANGED
@@ -1,12 +1,59 @@
  ---
- title: Punctuation Capitalization Correction
- emoji: 🐨
- colorFrom: pink
- colorTo: yellow
+ title: Multilingual Punctuation Capitalization Correction
+ emoji: 🌍
+ colorFrom: blue
+ colorTo: purple
  sdk: gradio
- sdk_version: 5.46.0
+ sdk_version: 4.44.0
  app_file: app.py
  pinned: false
+ license: apache-2.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🌍 Multilingual Punctuation & Capitalization Correction
+
+ This Space provides an interactive interface for restoring punctuation, fixing capitalization, and detecting sentence boundaries in text across **47 languages**.
+
+ ## Features
+
+ - **Multi-language support**: Works with 47 languages including English, French, Spanish, German, Italian, Portuguese, Russian, Turkish, Chinese, Japanese, Arabic, and more
+ - **Three correction modes**:
+   - 📝 **Conservative**: Minimal changes, preserves original flow
+   - 📖 **With Sentence Boundaries**: Splits text into clear sentences
+   - ⚖️ **Balanced**: Smart chunking for longer texts
+ - **Interactive UI**: Compare different correction styles and select the best one
+ - **Copy functionality**: Easy clipboard access for each version
+
+ ## Model
+
+ This application uses the [1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase](https://huggingface.co/1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase) model, which is an XLM-RoBERTa model fine-tuned for:
+ - Punctuation restoration
+ - True-casing (capitalization)
+ - Sentence boundary detection
+
+ ## Usage
+
+ 1. Enter text without proper punctuation or capitalization
+ 2. Click "Add Punctuation & Capitalization"
+ 3. Review the three different correction styles
+ 4. Select and copy the version that best fits your needs
+
+ ## Examples
+
+ Try these example inputs:
+ - English: "hello there how are you doing today i hope everything is going well"
+ - French: "bonjour comment allez vous aujourdhui jespere que tout va bien"
+ - Spanish: "hola como estas espero que todo este bien contigo y tu familia"
+
+ ## Technical Details
+
+ - **Base Model**: XLM-RoBERTa
+ - **Languages Supported**: 47
+ - **Tasks**: Punctuation restoration, capitalization, sentence boundary detection
+ - **Framework**: Gradio interface with ONNX runtime for efficient inference
+
+ ## Limitations
+
+ - Model was primarily trained on news data
+ - May not perform optimally on conversational or informal text
+ - Some languages may have better performance than others based on training data distribution
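The correction modes listed in the README differ partly in whether sentence boundary detection is applied and in how the result is turned into display text. The sketch below illustrates only that output handling. It assumes, as the `isinstance` check in app.py suggests, that a result is a list of sentences when `apply_sbd=True` and a plain string otherwise. `fake_infer` and `render` are hypothetical names that are not part of this commit, and the stub restores no real punctuation.

```python
from typing import List, Union


def render(result: Union[str, List[str]]) -> str:
    """Join a list of sentences with newlines; pass a plain string through."""
    if isinstance(result, list):
        return "\n".join(result)
    return result


def fake_infer(texts: List[str], apply_sbd: bool):
    """Hypothetical stand-in for model.infer (assumed result shape only)."""
    outputs = []
    for text in texts:
        # Pretend the whole input is one sentence: capitalize and add a period.
        sentence = text.capitalize() + "."
        outputs.append([sentence] if apply_sbd else sentence)
    return outputs


# Conservative mode: one string per input text.
conservative = render(fake_infer(["hello there"], apply_sbd=False)[0])
# Sentence-boundary mode: one list of sentences per input text.
with_boundaries = render(fake_infer(["hello there"], apply_sbd=True)[0])
```

Keeping the join logic in one helper means every mode produces a plain string for its textbox, whichever shape the model call returns.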
app.py ADDED
@@ -0,0 +1,215 @@
+ import textwrap
+
+ import gradio as gr
+ from punctuators.models import PunctCapSegModelONNX
+
+ # Load the punctuation model
+ print("Loading XLM-RoBERTa punctuation model...")
+ model = PunctCapSegModelONNX.from_pretrained(
+     "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase"
+ )
+ print("Model loaded successfully!")
+
+ def punctuate_text(input_text, progress=gr.Progress()):
+     """
+     Generate 3 different punctuation corrections with varying strategies
+     """
+     if not input_text.strip():
+         return ["", "", ""]
+
+     corrections = []
+
+     # Three different approaches
+     configs = [
+         {"name": "Conservative", "apply_sbd": False},
+         {"name": "With Sentence Boundaries", "apply_sbd": True},
+         {"name": "Balanced", "apply_sbd": True}
+     ]
+
+     for i, config in enumerate(configs):
+         progress((i + 0.5) / 3, desc=f"Generating {config['name']} version...")
+
+         if config["name"] == "Conservative":
+             # Single text processing without sentence boundaries
+             result = model.infer(texts=[input_text], apply_sbd=config["apply_sbd"])
+             corrected_text = result[0]
+
+         elif config["name"] == "With Sentence Boundaries":
+             # Process with sentence boundary detection
+             result = model.infer(texts=[input_text], apply_sbd=config["apply_sbd"])
+             corrected_text = "\n".join(result[0]) if isinstance(result[0], list) else result[0]
+
+         else:  # Balanced
+             # Process text in chunks if it's long
+             if len(input_text) > 500:
+                 # Split on whitespace into non-overlapping chunks of at most 500
+                 # characters, so no text is repeated and words are not cut in half
+                 chunks = textwrap.wrap(input_text, width=500, break_on_hyphens=False)
+                 results = []
+                 for chunk in chunks:
+                     chunk_result = model.infer(texts=[chunk], apply_sbd=False)
+                     results.append(chunk_result[0])
+                 corrected_text = " ".join(results)
+             else:
+                 result = model.infer(texts=[input_text], apply_sbd=config["apply_sbd"])
+                 corrected_text = "\n".join(result[0]) if isinstance(result[0], list) else result[0]
+
+         corrections.append(corrected_text)
+         progress((i + 1) / 3, desc=f"{config['name']} version complete")
+
+     progress(1.0, desc="All corrections generated!")
+     return corrections
+
+ # Create Gradio interface
+ with gr.Blocks(title="Multilingual Punctuation & Capitalization Correction", theme=gr.themes.Soft()) as demo:
+     gr.Markdown("""
+     # 🌍 Multilingual Punctuation & Capitalization Correction
+
+     This tool uses **XLM-RoBERTa** to restore punctuation, fix capitalization, and detect sentence boundaries in **47 languages**.
+
+     Enter text without proper punctuation or capitalization, and get 3 different correction styles:
+     - **📝 Conservative**: Minimal changes, preserves original flow
+     - **📖 With Sentence Boundaries**: Splits text into clear sentences
+     - **⚖️ Balanced**: Smart chunking for longer texts
+     """)
+
+     with gr.Row():
+         with gr.Column(scale=2):
+             input_text = gr.Textbox(
+                 label="Input Text (any of 47 supported languages)",
+                 placeholder="enter text without punctuation or capitalization like this example here it will be fixed",
+                 lines=12,
+                 max_lines=20
+             )
+             correct_btn = gr.Button("🚀 Add Punctuation & Capitalization", variant="primary", size="lg")
+
+     # Output section with 3 versions
+     gr.Markdown("### 📝 Correction Options")
+
+     with gr.Row():
+         with gr.Column():
+             gr.Markdown("#### 📝 Conservative")
+             output_conservative = gr.Textbox(
+                 label="Conservative Correction",
+                 lines=10,
+                 max_lines=15,
+                 interactive=True,
+                 elem_id="conservative_output"
+             )
+             copy_btn_1 = gr.Button("📋 Copy", variant="secondary", size="sm")
+
+         with gr.Column():
+             gr.Markdown("#### 📖 With Sentence Boundaries")
+             output_boundaries = gr.Textbox(
+                 label="Sentence Boundary Detection",
+                 lines=10,
+                 max_lines=15,
+                 interactive=True,
+                 elem_id="boundaries_output"
+             )
+             copy_btn_2 = gr.Button("📋 Copy", variant="secondary", size="sm")
+
+         with gr.Column():
+             gr.Markdown("#### ⚖️ Balanced")
+             output_balanced = gr.Textbox(
+                 label="Balanced Correction",
+                 lines=10,
+                 max_lines=15,
+                 interactive=True,
+                 elem_id="balanced_output"
+             )
+             copy_btn_3 = gr.Button("📋 Copy", variant="secondary", size="sm")
+
+     # Selected version display
+     with gr.Row():
+         gr.Markdown("### ✅ Selected Correction")
+         selected_text = gr.Textbox(
+             label="Your Selected Correction",
+             lines=5,
+             interactive=True,
+             placeholder="Click 'Use This' under any correction to select it"
+         )
+
+     # Add selection buttons
+     with gr.Row():
+         with gr.Column():
+             select_btn_1 = gr.Button("✅ Use This", variant="primary", size="sm")
+         with gr.Column():
+             select_btn_2 = gr.Button("✅ Use This", variant="primary", size="sm")
+         with gr.Column():
+             select_btn_3 = gr.Button("✅ Use This", variant="primary", size="sm")
+
+     # Add examples
+     gr.Examples(
+         examples=[
+             ["hello there how are you doing today i hope everything is going well"],
+             ["the quick brown fox jumps over the lazy dog this is a test sentence"],
+             ["machine learning is revolutionizing many industries from healthcare to finance"],
+             ["bonjour comment allez vous aujourdhui jespere que tout va bien"],
+             ["hola como estas espero que todo este bien contigo y tu familia"],
+         ],
+         inputs=input_text,
+         label="Example sentences (click to try)"
+     )
+
+     # Set up event handlers
+     outputs = [output_conservative, output_boundaries, output_balanced]
+     correct_btn.click(fn=punctuate_text, inputs=input_text, outputs=outputs)
+     input_text.submit(fn=punctuate_text, inputs=input_text, outputs=outputs)
+
+     # Selection handlers
+     select_btn_1.click(fn=lambda x: x, inputs=output_conservative, outputs=selected_text)
+     select_btn_2.click(fn=lambda x: x, inputs=output_boundaries, outputs=selected_text)
+     select_btn_3.click(fn=lambda x: x, inputs=output_balanced, outputs=selected_text)
+
+     # JavaScript for copy functionality
+     copy_btn_1.click(
+         None,
+         None,
+         None,
+         js="""
+         () => {
+             const outputText = document.querySelector('#conservative_output textarea').value;
+             navigator.clipboard.writeText(outputText);
+             alert('Conservative version copied to clipboard!');
+         }
+         """
+     )
+
+     copy_btn_2.click(
+         None,
+         None,
+         None,
+         js="""
+         () => {
+             const outputText = document.querySelector('#boundaries_output textarea').value;
+             navigator.clipboard.writeText(outputText);
+             alert('Sentence boundaries version copied to clipboard!');
+         }
+         """
+     )
+
+     copy_btn_3.click(
+         None,
+         None,
+         None,
+         js="""
+         () => {
+             const outputText = document.querySelector('#balanced_output textarea').value;
+             navigator.clipboard.writeText(outputText);
+             alert('Balanced version copied to clipboard!');
+         }
+         """
+     )
+
+     gr.Markdown("""
+     ---
+     **Model:** [1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase](https://huggingface.co/1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase)
+
+     **Supports 47 languages** including: English, French, Spanish, German, Italian, Portuguese, Russian, Turkish, Chinese, Japanese, Arabic, and many more!
+     """)
+
+ # Launch the app
+ if __name__ == "__main__":
+     demo.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False
+     )
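The three copy handlers in app.py differ only in the element id they query and the label shown in the alert. One possible way to remove that duplication is to generate the JavaScript string from those two values. This is a sketch only: `make_copy_js`, `COPY_TARGETS`, and `copy_js` are hypothetical names that are not part of this commit.

```python
def make_copy_js(elem_id: str, label: str) -> str:
    """Build the JS callback that copies the textarea inside #elem_id.

    Assumes elem_id and label contain no single quotes, since both are
    placed inside single-quoted JavaScript string literals.
    """
    return f"""
    () => {{
        const outputText = document.querySelector('#{elem_id} textarea').value;
        navigator.clipboard.writeText(outputText);
        alert('{label} version copied to clipboard!');
    }}
    """


# Element ids and alert labels taken from the handlers in app.py.
COPY_TARGETS = {
    "conservative_output": "Conservative",
    "boundaries_output": "Sentence boundaries",
    "balanced_output": "Balanced",
}

# One generated JS snippet per output textbox.
copy_js = {elem_id: make_copy_js(elem_id, label) for elem_id, label in COPY_TARGETS.items()}
```

Each generated string could then be passed as the `js=` argument of the matching `copy_btn_N.click(None, None, None, js=...)` call, in place of the repeated literal.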
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ gradio>=4.0.0
+ punctuators>=1.0.0
+ torch>=2.0.0
+ onnx>=1.14.0
+ onnxruntime>=1.15.0