Ander Arriandiaga committed on
Commit a7c0c81 · 1 Parent(s): 42aaddc

Initial commit for Hugging Face Space
.gitattributes CHANGED
@@ -1,35 +1,3 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ # Track large dictionary files and the binary with Git LFS if enabled
+ dict/* filter=lfs diff=lfs merge=lfs -text
+ modulo1y2/modulo1y2 filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1 @@
+ outputs/
README.md CHANGED
@@ -1,14 +1,18 @@
- ---
- title: Phonemizer Eus Esp
- emoji: 🏃
- colorFrom: green
- colorTo: gray
- sdk: gradio
- sdk_version: 6.0.0
- app_file: app.py
- pinned: false
- license: cc-by-nc-4.0
- short_description: Web UI to phonemize Basque (eu) and Spanish (es) tex
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Phonemizer — Gradio demo (Hugging Face Space)
+
+ This Space provides a small web UI to phonemize Basque (eu) and Spanish (es) text.
+
+ How to use
+ - Input text: paste text into the main box or upload a `.txt` file.
+ - Language: select `eu` (Basque) or `es` (Spanish).
+ - Symbols: choose `sampa` (default) or `ipa` for the phoneme output format.
+ - Separate phonemes: toggle whether phonemes are separated by spaces, which makes multi-character phonemes easier to see.
+ - Submit: press `Submit` to run normalization + phonemization.
+ - Download: use the download buttons to get the phonemes or normalized text as `.txt` files.
+
+ Privacy
+ - This Space does not store user inputs beyond temporary files used to serve downloads. Do not upload sensitive data.
+
+ Credits
+ - Developed by Ander Arriandiaga at Aholab (HiTZ).
+
README_developer.md ADDED
@@ -0,0 +1,57 @@
+ # Phonemizer Gradio Space — Developer Notes
+
+ This repository contains a Gradio app wrapper for the Phonemizer used in this project.
+
+ Files to keep in the Space repo for runtime
+ - `gradio_phonemizer.py` (UI) and `eu_phonemizer_v2.py` (phonemizer logic)
+ - `app.py` (Gradio entrypoint)
+ - `modulo1y2/modulo1y2` (the phonemizer executable) OR source+build files in `modulo1y2/`
+ - `dict/` containing `eu_dicc` (or `eu_dicc.dic`) and `es_dicc` (or `es_dicc.dic`)
+ - `requirements.txt`
+
+ Recommended deployment options
+
+ - Ship the `modulo1y2` executable and the minimal dictionary files in the repo (fastest).
+ - OR keep only sources and build the executable on Space startup using an `apt.txt` and a `make` step.
+ - OR host large dictionaries/executables on the Hugging Face Hub (dataset/model repo) and download them at startup using `huggingface_hub.hf_hub_download`.
+
+ Quick local test
+
+ 1. Create a venv and install dependencies:
+
+ ```bash
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install -r requirements.txt
+ ```
+
+ 2. Ensure the executable is present and executable:
+
+ ```bash
+ chmod +x modulo1y2/modulo1y2
+ ls -l modulo1y2/modulo1y2
+ ls -l dict/eu_dicc* dict/es_dicc*
+ ```
+
+ 3. Run the app locally:
+
+ ```bash
+ python app.py
+ # then open http://localhost:7860
+ ```
+
+ Pushing to Hugging Face Spaces
+
+ 1. (Optional) Install git-lfs and track large files:
+
+ ```bash
+ git lfs install
+ git lfs track "dict/*"
+ git lfs track "modulo1y2/modulo1y2"
+ ```
+
+ 2. Create a Space (via web UI or `huggingface-cli repo create <user>/<space> --type=space`), then push this repo to the Space remote.
+
+ Licensing and redistribution
+
+ Before uploading binaries or dictionary files, confirm you have the right to redistribute them.
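The third deployment option (downloading assets from the Hub at startup) could be sketched as follows; the repo id and filenames here are hypothetical placeholders, not values from this project:

```python
from pathlib import Path


def fetch_asset(repo_id: str, filename: str, local_dir: str = ".") -> str:
    """Download one file from a Hub dataset repo into local_dir and return its path."""
    # Deferred import so this sketch can be read/imported without huggingface_hub installed
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename,
                           repo_type="dataset", local_dir=local_dir)


def ensure_assets() -> None:
    """Fetch dictionaries only if they are not already present in the repo."""
    # Hypothetical asset repo — replace with wherever the dictionaries are hosted
    if not Path("dict/eu_dicc.dic").exists():
        for name in ("dict/eu_dicc.dic", "dict/es_dicc.dic"):
            fetch_asset("your-user/phonemizer-assets", name)
```

A call to `ensure_assets()` at the top of `app.py` would then run before `build_interface()` needs the files.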
app.py ADDED
@@ -0,0 +1,9 @@
+ import os
+ from gradio_phonemizer import build_interface
+
+ demo = build_interface()
+
+ if __name__ == "__main__":
+     # Respect common env vars used by hosting platforms
+     port = int(os.environ.get("PORT", os.environ.get("GRADIO_SERVER_PORT", 7860)))
+     demo.launch(server_name="0.0.0.0", server_port=port)
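The port lookup in `app.py` gives `PORT` precedence over `GRADIO_SERVER_PORT` and falls back to 7860; a standalone sketch of that resolution logic:

```python
def resolve_port(env: dict) -> int:
    """Mirror app.py's lookup: PORT wins, then GRADIO_SERVER_PORT, then 7860."""
    return int(env.get("PORT", env.get("GRADIO_SERVER_PORT", 7860)))


print(resolve_port({}))                                              # → 7860
print(resolve_port({"GRADIO_SERVER_PORT": "8080"}))                  # → 8080
print(resolve_port({"PORT": "9000", "GRADIO_SERVER_PORT": "8080"}))  # → 9000
```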
dict/es_dicc.dic ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3880d688565dcfc4c1a239cb94c6cc0466b603cbf86fbf8a20ca411d64cb3c03
+ size 141770
dict/es_dicc_20241204.dic ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3880d688565dcfc4c1a239cb94c6cc0466b603cbf86fbf8a20ca411d64cb3c03
+ size 141770
dict/eu_dicc.dic ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a4c6553965ac7c7937b599d3e8a3d8d94df48a0bdef943a84c63f4b261172f8
+ size 865575
dict/eu_dicc_20250326.dic ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a4c6553965ac7c7937b599d3e8a3d8d94df48a0bdef943a84c63f4b261172f8
+ size 865575
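The four dictionary files above are Git LFS pointer stubs rather than the dictionary data itself; the real content is addressed by the `oid` and fetched on checkout. A minimal sketch of reading such a pointer (using the `es_dicc.dic` stub shown above):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3880d688565dcfc4c1a239cb94c6cc0466b603cbf86fbf8a20ca411d64cb3c03
size 141770
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # → 141770 (size in bytes of the real file)
```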
eu_phonemizer_v2.py ADDED
@@ -0,0 +1,333 @@
+ import subprocess
+ import logging
+ import string
+ from pathlib import Path
+ from collections import OrderedDict
+ from nltk.tokenize import TweetTokenizer
+ from typing import List, Dict, Optional
+ import re
+
+ # Constants
+ SUPPORTED_LANGUAGES = {'eu', 'es'}
+ SUPPORTED_SYMBOLS = {'sampa', 'ipa'}
+ SAMPA_TO_IPA = OrderedDict([
+     ("p", "p"), ("b", "b"), ("t", "t"), ("c", "c"), ("d", "d"),
+     ("k", "k"), ("g", "ɡ"), ("tS", "tʃ"), ("ts", "ts"), ("ts`", "tʂ"),
+     ("gj", "ɟ"), ("jj", "ɪ"), ("f", "f"), ("B", "β"), ("T", "θ"),
+     ("D", "ð"), ("s", "s"), ("s`", "ʂ"), ("S", "ʃ"), ("x", "x"),
+     ("G", "ɣ"), ("m", "m"), ("n", "n"), ("J", "ɲ"), ("l", "l"),
+     ("L", "ʎ"), ("r", "ɾ"), ("rr", "r"), ("j", "j"), ("w", "w"),
+     ("i", "i"), ("'i", "'i"), ("e", "e"), ("'e", "'e"), ("a", "a"),
+     ("'a", "'a"), ("o", "o"), ("'o", "'o"), ("u", "u"), ("'u", "'u"),
+     ("y", "y"), ("Z", "ʒ"), ("h", "h"), ("ph", "pʰ"), ("kh", "kʰ"),
+     ("th", "tʰ")
+ ])
+
+ MULTICHAR_TO_SINGLECHAR = {
+     "tʃ": "C",
+     "ts": "V",
+     "tʂ": "P",
+     "'i": "I",
+     "'e": "E",
+     "'a": "A",
+     "'o": "O",
+     "'u": "U",
+     "pʰ": "H",
+     "kʰ": "K",
+     "tʰ": "T"
+ }
+
+ class PhonemizerError(Exception):
+     """Custom exception for Phonemizer errors."""
+     pass
+
+ class Phonemizer:
+     def __init__(self, language: str = "eu", symbol: str = "sampa",
+                  path_modulo1y2: str = "modulo1y2/modulo1y2",
+                  path_dicts: str = "dict") -> None:
+         """Initialize the Phonemizer with the given language and symbol."""
+         if language not in SUPPORTED_LANGUAGES:
+             raise PhonemizerError(f"Unsupported language: {language}")
+         if symbol not in SUPPORTED_SYMBOLS:
+             raise PhonemizerError(f"Unsupported symbol type: {symbol}")
+
+         self.language = language
+         self.symbol = symbol
+         self.path_modulo1y2 = Path(path_modulo1y2)
+         self.path_dicts = Path(path_dicts)
+         self.logger = logging.getLogger(__name__)
+
+         # Initialize SAMPA to IPA dictionary
+         self._sampa_to_ipa_dict = SAMPA_TO_IPA
+
+         # Initialize word splitter regex
+         self._word_splitter = re.compile(r'\w+|[^\w\s]', re.UNICODE)
+
+         self._validate_paths()
+
+     def normalize(self, text: str) -> str:
+         """Normalize the given text using an external command."""
+         try:
+             command = self._build_normalization_command()
+             process = subprocess.Popen(
+                 command,
+                 stdin=subprocess.PIPE,
+                 stdout=subprocess.PIPE,
+                 stderr=subprocess.PIPE,
+                 text=True,
+                 encoding='ISO-8859-15',
+                 shell=True
+             )
+             stdout, stderr = process.communicate(input=text)
+
+             if process.returncode != 0:
+                 # Filter out the SetDur warning from the error message
+                 filtered_stderr = '\n'.join(line for line in stderr.split('\n')
+                                             if 'Warning: argument not used SetDur' not in line)
+                 if filtered_stderr.strip():  # Only raise if there are other errors
+                     error_msg = f"Normalization failed: {filtered_stderr}"
+                     self.logger.error(error_msg)
+                     raise PhonemizerError(error_msg)
+
+             return stdout.strip()
+
+         except Exception as e:
+             error_msg = f"Error during normalization: {str(e)}"
+             self.logger.error(error_msg)
+             return text
+
+     def getPhonemes(self, text: str, separate_phonemes: bool = False) -> str:
+         """Extract phonemes from the given text.
+
+         Args:
+             text (str): The input text to convert to phonemes
+             separate_phonemes (bool): If True, keeps spaces between phonemes. If False, produces
+                 compact phoneme strings. Defaults to False.
+
+         Returns:
+             str: The phoneme sequence, one output line per input line
+         """
+         try:
+             # Pre-process text to handle dots consistently:
+             # replace multiple dots with a single dot to avoid issues with ellipsis
+             text = re.sub(r'\.{2,}', '.', text)
+
+             # Process input line-by-line so we preserve original newlines
+             lines = text.split('\n')
+             per_line_outputs = []
+             for line in lines:
+                 # If the input line is empty, preserve the empty line
+                 if not line.strip():
+                     per_line_outputs.append('')
+                     continue
+
+                 command = self._build_phoneme_extraction_command()
+                 proc = subprocess.Popen(
+                     command,
+                     stdin=subprocess.PIPE,
+                     stdout=subprocess.PIPE,
+                     stderr=subprocess.PIPE,
+                     text=True,
+                     encoding='ISO-8859-15',
+                     shell=True
+                 )
+                 stdout, stderr = proc.communicate(input=line)
+                 if proc.returncode != 0:
+                     error_msg = f"Phoneme extraction failed: {stderr}"
+                     self.logger.error(error_msg)
+                     raise PhonemizerError(error_msg)
+
+                 # Replace any internal newlines in the tool output with a sentinel
+                 # (shouldn't normally occur for a single input line)
+                 stdout_line = stdout.replace('\n', ' | _ | ')
+
+                 # Split into per-word phoneme sequences for this line
+                 word_phonemes = stdout_line.split(" | ")
+                 cleaned_phonemes = []
+                 for phoneme_seq in word_phonemes:
+                     if not phoneme_seq.strip():
+                         continue
+                     if phoneme_seq.strip() == "_":
+                         continue
+                     cleaned_phonemes.append(phoneme_seq.strip())
+                 # Tokenize the original line into words/punctuation
+                 words = self._word_splitter.findall(line)
+
+                 # Count non-punctuation words
+                 non_punct_words = [w for w in words if w not in string.punctuation]
+
+                 # Ensure we have enough phoneme sequences for all non-punctuation words
+                 while len(cleaned_phonemes) < len(non_punct_words):
+                     if cleaned_phonemes:
+                         cleaned_phonemes.append(cleaned_phonemes[-1])
+                     else:
+                         cleaned_phonemes.append("a")
+
+                 # Process words and phonemes together for this line
+                 phoneme_idx = 0
+                 word_idx = 0
+                 line_result = []
+
+                 while word_idx < len(words):
+                     word = words[word_idx]
+
+                     if word in string.punctuation:
+                         line_result.append(word)
+                         word_idx += 1
+                         continue
+
+                     # Regular word processing
+                     if phoneme_idx < len(cleaned_phonemes):
+                         phonemes = cleaned_phonemes[phoneme_idx].split()
+                         if self.symbol == "sampa":
+                             if separate_phonemes:
+                                 processed_phonemes = " ".join(p for p in phonemes if p != "-")
+                             else:
+                                 processed_phonemes = "".join(p for p in phonemes if p != "-")
+                         else:
+                             ipa_phonemes = [self._sampa_to_ipa_dict.get(p, p) for p in phonemes if p != "-"]
+                             if separate_phonemes:
+                                 processed_phonemes = " ".join(ipa_phonemes)
+                             else:
+                                 processed_phonemes = "".join(ipa_phonemes)
+
+                         line_result.append(processed_phonemes)
+                         phoneme_idx += 1
+                         word_idx += 1
+                     else:
+                         # No phoneme left for this word: skip it
+                         word_idx += 1
+
+                 # If there are leftover phonemes, append them
+                 while phoneme_idx < len(cleaned_phonemes):
+                     phonemes = cleaned_phonemes[phoneme_idx].split()
+                     if self.symbol == "sampa":
+                         # Honor separate_phonemes here too, as in the main loop
+                         processed_phonemes = (" " if separate_phonemes else "").join(p for p in phonemes if p != "-")
+                     else:
+                         ipa_phonemes = [self._sampa_to_ipa_dict.get(p, p) for p in phonemes if p != "-"]
+                         if separate_phonemes:
+                             processed_phonemes = " ".join(ipa_phonemes)
+                         else:
+                             processed_phonemes = "".join(ipa_phonemes)
+
+                     line_result.append(processed_phonemes)
+                     phoneme_idx += 1
+
+                 # Format final output for this line using spacing rules
+                 out_parts = []
+                 for token in line_result:
+                     if token not in string.punctuation:
+                         out_parts.append(re.sub(r"\s+", " ", token.strip()))
+                     else:
+                         out_parts.append(token)
+
+                 final_line = ""
+                 for i, tok in enumerate(out_parts):
+                     if i == 0:
+                         final_line += tok
+                         continue
+
+                     prev = out_parts[i-1]
+
+                     if tok in string.punctuation:
+                         final_line = final_line.rstrip(' ')
+                         final_line += ' ' + tok
+                         # Preserve input line boundaries: do NOT insert newlines mid-line.
+                         # Always add the standard separator after punctuation.
+                         if i < len(out_parts) - 1:
+                             final_line += ' '
+                     else:
+                         if prev in string.punctuation:
+                             final_line += tok
+                         else:
+                             final_line += ' ' + tok
+
+                 # If sentence-ending punctuation is followed by a capital letter,
+                 # split into separate lines (keeps numeric periods like "1980. urtean" intact).
+                 # This turns "... ? Ni ..." into two lines at the sentence boundary.
+                 split_line = re.sub(r"(?<=[\?\!\.])\s+(?=[A-ZÁÉÍÓÚÜÑ])", "\n", final_line)
+                 per_line_outputs.append(split_line)
+
+             return "\n".join(per_line_outputs)
+
+         except Exception as e:
+             error_msg = f"Error in phoneme extraction: {str(e)}"
+             self.logger.error(error_msg)
+             return ""
+
+     def _build_normalization_command(self) -> str:
+         """Build the command string for normalization."""
+         modulo_path = self._get_file_path() / self.path_modulo1y2
+         dict_path = self._get_file_path() / self.path_dicts
+         dict_file = f"{self.language}_dicc"
+         return f'{modulo_path} -TxtMode=Word -Lang={self.language} -HDic={dict_path/dict_file}'
+
+     def _build_phoneme_extraction_command(self) -> str:
+         """Build the command string for phoneme extraction."""
+         modulo_path = self._get_file_path() / self.path_modulo1y2
+         dict_path = self._get_file_path() / self.path_dicts
+         dict_file = f"{self.language}_dicc"
+         return f'{modulo_path} -Lang={self.language} -HDic={dict_path/dict_file}'
+
+     def _get_file_path(self) -> Path:
+         return Path(__file__).parent
+
+     def _validate_paths(self) -> None:
+         """Validate paths with enhanced error reporting."""
+         try:
+             # Resolve relative to this module so validation matches the built commands
+             if not (self._get_file_path() / self.path_modulo1y2).exists():
+                 raise PhonemizerError(f"Modulo1y2 executable not found at: {self._get_file_path() / self.path_modulo1y2}")
+             if not (self._get_file_path() / self.path_dicts).exists():
+                 raise PhonemizerError(f"Dictionary directory not found at: {self._get_file_path() / self.path_dicts}")
+
+             # Check for both possible dictionary files
+             dict_file = self._get_file_path() / self.path_dicts / f"{self.language}_dicc"
+             if not dict_file.exists():
+                 # Try with .dic extension as a fallback
+                 dict_file_alt = self._get_file_path() / self.path_dicts / f"{self.language}_dicc.dic"
+                 if not dict_file_alt.exists():
+                     raise PhonemizerError(f"Dictionary file not found at either {dict_file} or {dict_file_alt}")
+
+         except Exception as e:
+             self.logger.error(f"Path validation error: {str(e)}")
+             raise
+
+     def _transform_multichar_phonemes(self, phoneme_sequence: str) -> str:
+         """
+         Transform multicharacter IPA phonemes to single characters using the
+         MULTICHAR_TO_SINGLECHAR mapping.
+
+         Args:
+             phoneme_sequence (str): A string containing phonemes separated by spaces
+
+         Returns:
+             str: The sequence with multicharacter phonemes replaced by single characters
+         """
+         # Split the sequence into individual phonemes
+         phonemes = phoneme_sequence.split()
+         transformed_phonemes = []
+
+         for phoneme in phonemes:
+             # Check if the phoneme exists in our mapping
+             if phoneme in MULTICHAR_TO_SINGLECHAR:
+                 transformed_phonemes.append(MULTICHAR_TO_SINGLECHAR[phoneme])
+             else:
+                 transformed_phonemes.append(phoneme)
+
+         return " ".join(transformed_phonemes)
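To illustrate the per-token SAMPA→IPA substitution that `getPhonemes` performs when `symbol="ipa"`, here is a standalone sketch over a handful of entries copied from the mapping above (the token list is a made-up example, not real tool output):

```python
from collections import OrderedDict

# A few entries copied from SAMPA_TO_IPA in eu_phonemizer_v2.py
SAMPA_TO_IPA = OrderedDict([
    ("tS", "tʃ"), ("rr", "r"), ("r", "ɾ"), ("'a", "'a"), ("a", "a"),
])


def to_ipa(tokens):
    # Per-token lookup, falling back to the token itself; "-" markers are dropped,
    # mirroring the list comprehension used in getPhonemes
    return [SAMPA_TO_IPA.get(t, t) for t in tokens if t != "-"]


print("".join(to_ipa(["tS", "a", "rr", "-", "o"])))  # → tʃaro
```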
gradio_phonemizer.py ADDED
@@ -0,0 +1,506 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import gradio as gr
2
+ import tempfile
3
+ import base64
4
+ import re
5
+ import socket
6
+ import os
7
+ from pathlib import Path
8
+ from typing import Optional, Tuple
9
+ import threading
10
+ import time
11
+ import atexit
12
+
13
+ # Output cleanup configuration
14
+ OUTPUTS_DIR = Path(__file__).parent / 'outputs'
15
+ OUTPUT_CLEANUP_TTL = 24 * 3600 # seconds, default 24 hours
16
+ OUTPUT_CLEANUP_MAX_FILES = 500 # keep at most this many files
17
+ OUTPUT_CLEANUP_INTERVAL = 60 * 60 # in seconds, run cleanup every hour
18
+
19
+
20
+ def _cleanup_outputs(out_dir: Path = None, max_files: int = None, ttl: int = None):
21
+ """Delete old files in `out_dir` older than `ttl` seconds and keep at most
22
+ `max_files` newest files. If parameters are None, use module defaults."""
23
+ if out_dir is None:
24
+ out_dir = OUTPUTS_DIR
25
+ if not out_dir.exists():
26
+ return
27
+ if max_files is None:
28
+ max_files = OUTPUT_CLEANUP_MAX_FILES
29
+ if ttl is None:
30
+ ttl = OUTPUT_CLEANUP_TTL
31
+
32
+ now = time.time()
33
+ files = [p for p in out_dir.iterdir() if p.is_file()]
34
+ # Remove files older than ttl
35
+ for p in files:
36
+ try:
37
+ if now - p.stat().st_mtime > ttl:
38
+ p.unlink()
39
+ except Exception:
40
+ pass
41
+
42
+ # Re-list and trim to max_files
43
+ files = sorted([p for p in out_dir.iterdir() if p.is_file()], key=lambda p: p.stat().st_mtime, reverse=True)
44
+ if len(files) > max_files:
45
+ for p in files[max_files:]:
46
+ try:
47
+ p.unlink()
48
+ except Exception:
49
+ pass
50
+
51
+
52
+ def _cleanup_all_on_exit():
53
+ """Remove all files in outputs folder on process exit."""
54
+ try:
55
+ if OUTPUTS_DIR.exists():
56
+ for p in OUTPUTS_DIR.iterdir():
57
+ try:
58
+ if p.is_file():
59
+ p.unlink()
60
+ except Exception:
61
+ pass
62
+ except Exception:
63
+ pass
64
+
65
+
66
+ def _start_periodic_cleanup():
67
+ def _worker():
68
+ while True:
69
+ try:
70
+ _cleanup_outputs(OUTPUTS_DIR)
71
+ except Exception:
72
+ pass
73
+ time.sleep(OUTPUT_CLEANUP_INTERVAL)
74
+
75
+ t = threading.Thread(target=_worker, daemon=True, name='outputs-cleaner')
76
+ t.start()
77
+
78
+
79
+ # Ensure outputs dir exists and start background cleaner; register atexit
80
+ OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)
81
+ _start_periodic_cleanup()
82
+ atexit.register(_cleanup_all_on_exit)
83
+ from eu_phonemizer_v2 import Phonemizer, PhonemizerError
84
+
85
+
86
+ def _read_uploaded_file(file_obj) -> str:
87
+ if not file_obj:
88
+ return ""
89
+ # gradio will provide a temporary file path
90
+ p = Path(file_obj.name) if hasattr(file_obj, "name") else Path(file_obj)
91
+ try:
92
+ return p.read_text(encoding='utf-8')
93
+ except Exception:
94
+ return p.read_text(encoding='ISO-8859-15')
95
+
96
+
97
+ def process(text: str,
98
+ uploaded_file,
99
+ language: str,
100
+ symbol: str,
101
+ separate_phonemes: bool) -> Tuple[str, Optional[str], str, Optional[str]]:
102
+ """Process either text input or uploaded txt file and return (text_output, download_file_path)
103
+
104
+ If the user uploaded a file, the function will return the path to a tmp file
105
+ suitable for download as the second return value and an empty text output.
106
+ If the user provided text in the box, the function will return the phonemes
107
+ as text and also a downloadable txt file containing the same output.
108
+ """
109
+ # Prefer uploaded file if present
110
+ source_text = ""
111
+ is_file_input = False
112
+ if uploaded_file:
113
+ source_text = _read_uploaded_file(uploaded_file)
114
+ is_file_input = True
115
+ else:
116
+ source_text = text or ""
117
+
118
+ # Try to instantiate Phonemizer using repo-local modulo1y2 and dicts
119
+ try:
120
+ phon = Phonemizer(language=language, symbol=symbol)
121
+ except PhonemizerError as e:
122
+ if language == 'eu':
123
+ err = f"Ezin izan da fonemizadorea hasi: {e}\nEgiaztatu 'modulo1y2' eta 'dict' karpetak."
124
+ else:
125
+ err = f"No se pudo inicializar el fonemizador: {e}\nComprueba las carpetas 'modulo1y2' y 'dict'."
126
+ # Return 6 outputs matching the UI: result text, file, normalized text, norm file, ph_path, norm_path
127
+ return err, None, "", None, "", ""
128
+ except Exception as e:
129
+ if language == 'eu':
130
+ return f"Hasieratze errore ezezaguna: {e}", None, "", None, "", ""
131
+ return f"Error inesperado al inicializar: {e}", None
132
+
133
+
134
+ # Normalize then get phonemes. Run normalization per original input line so the
135
+ # external normalizer doesn't insert extra newlines across sentences and
136
+ # we preserve the user's original line boundaries.
137
+ try:
138
+ lines = source_text.split('\n')
139
+ normalized_lines = []
140
+ for ln in lines:
141
+ if not ln.strip():
142
+ normalized_lines.append('')
143
+ else:
144
+ # normalize each line independently, collapse any internal newlines
145
+ # produced by the external normalizer, collapse multiple whitespace
146
+ # (this avoids producing double spaces when the normalizer inserts
147
+ # a '\n' while the original text already had a space), and strip
148
+ norm_line = phon.normalize(ln)
149
+ norm_line = norm_line.replace('\n', ' ')
150
+ norm_line = re.sub(r"\s+", ' ', norm_line).strip()
151
+ normalized_lines.append(norm_line)
152
+ normalized = '\n'.join(normalized_lines)
153
+
154
+ phonemes = phon.getPhonemes(normalized, separate_phonemes=separate_phonemes)
155
+ # Defensive cleanup: if any '|' separators remain, replace them with single spaces
156
+ if isinstance(phonemes, str) and '|' in phonemes:
157
+ phonemes = re.sub(r"\s*\|\s*", " ", phonemes)
158
+ except PhonemizerError as e:
159
+ if language == 'eu':
160
+ msg = f"Fonemizazio errorea: {e}"
161
+ else:
162
+ msg = f"Error del fonemizador: {e}"
163
+ return msg, None, "", None, "", ""
164
+ except Exception as e:
165
+ if language == 'eu':
166
+ msg = f"Errore ezezaguna prozesatzean: {e}"
167
+ else:
168
+ msg = f"Error inesperado al procesar: {e}"
169
+ return msg, None, "", None, "", ""
170
+
171
+ # Create persistent downloadable files under outputs/ so the browser can reliably
172
+ # download them using Gradio's `gr.File` component (avoid ephemeral tmp files
173
+ # that some browsers may not fetch correctly).
174
+ out_dir = Path(__file__).parent / 'outputs'
175
+ out_dir.mkdir(parents=True, exist_ok=True)
176
+ from datetime import datetime
177
+ ts = datetime.now().strftime('%Y%m%d_%H%M%S')
178
+ ph_file = out_dir / f'phonemes_{ts}.txt'
179
+ norm_file = out_dir / f'normalized_{ts}.txt'
180
+ ph_file.write_text(phonemes, encoding='utf-8')
181
+ norm_file.write_text(normalized, encoding='utf-8')
182
+
183
+ # Cleanup old files opportunistically after creating new ones
184
+ try:
185
+ _cleanup_outputs(out_dir)
186
+ except Exception:
187
+ pass
188
+
189
+ # Return phonemes and normalized text in all cases (text or uploaded file)
190
+ # so users who upload a .txt can see the processed text inline and download it.
191
+ return phonemes, str(ph_file), normalized, str(norm_file), str(ph_file), str(norm_file)
192
+
193
+
194
+ def download_from_text(text: str) -> Optional[str]:
195
+ """Create a temporary .txt file from the given text and return its path for download."""
196
+ if not text:
197
+ return None
198
+ # Save into a persistent outputs/ directory with a readable timestamped filename
199
+ out_dir = Path(__file__).parent / 'outputs'
200
+ out_dir.mkdir(parents=True, exist_ok=True)
201
+ from datetime import datetime
202
+ ts = datetime.now().strftime('%Y%m%d_%H%M%S')
203
+ filename = f'phonemes_{ts}.txt'
204
+ out_path = out_dir / filename
205
+ out_path.write_text(text, encoding='utf-8')
206
+ # Return the path string so Gradio's File component can serve it
207
+ return str(out_path)
208
+
209
+
210
+ def build_interface():
211
+ with gr.Blocks(title="Eu/Es Phonemizer") as demo:
212
+ # Simple header (image removed per user preference)
213
+ header = gr.Markdown("# Fonemizadorea — Euskara (eu) eta Gaztelania (es)")
214
+ # Style the Submit button to be orange for better visibility (higher specificity)
215
+ gr.HTML("""
216
+ <style>
217
+ /* Stronger selectors to override theme/defaults */
218
+ #submit_btn, #submit_btn button, button#submit_btn, .gradio-container #submit_btn button {
219
+ background-color: #ff8c00 !important;
220
+ color: white !important;
221
+ border-radius: 6px !important;
222
+ padding: 6px 12px !important;
223
+ border: none !important;
224
+ }
225
+ #submit_btn:hover, #submit_btn button:hover, button#submit_btn:hover {
226
+ background-color: #ff7a00 !important;
227
+ }
228
+ /* Don't force download buttons to orange */
229
+ #download_ph_btn button, #download_norm_btn button { background-color: transparent !important; }
230
+
231
+ /* Compact upload file box */
232
+ #upload_file { max-width: 160px !important; }
233
+ #upload_file .gr-file {
234
+ height: 32px !important;
235
+ padding: 2px 6px !important;
236
+ font-size: 0.9rem !important;
237
+ line-height: 1 !important;
238
+ }
239
+ #upload_file .gr-file input[type=file] { height: 32px !important; }
240
+
241
+ /* Make textareas vertically resizable and more roomy */
242
+ #input_text textarea, #normalized_box textarea, #result_box textarea {
243
+ resize: vertical !important;
244
+ min-height: 120px !important;
245
+ max-height: 800px !important;
246
+ width: 100% !important;
247
+ box-sizing: border-box !important;
248
+ }
249
+
250
+ /* Center container and add padding for a cleaner look */
251
+ .gradio-container { max-width: 1100px; margin: 12px auto !important; padding: 8px !important; }
252
+ /* Fix the controls column width so changing labels doesn't reflow the layout.
+    Use a slightly smaller fixed width so the upload column sits closer. */
+ /* Make the controls column appear taller by increasing the internal spacing
+    between control rows rather than forcing the whole column height.
+    This avoids adding extra vertical gap between adjacent columns
+    (upload box / buttons). */
+ #controls_col { min-width: 220px; max-width: 260px; flex: 0 0 240px; align-self: flex-start; padding-top: 6px; padding-bottom: 6px; box-sizing: border-box; }
+ /* Increase the gap between controls so the column looks taller without
+    enlarging its outer box or shifting neighboring columns. */
+ #controls_col .gr-row { gap: 12px; row-gap: 12px; }
+ #controls_col .gr-label, #controls_col label { line-height: 1.4; }
+
+ /* Ensure the upload column aligns to the top of the row so it doesn't
+    get vertically centered when other columns grow; keep the upload box
+    compact but aligned with the controls stack. */
+ #upload_col { min-height: 110px; display: flex !important; align-items: flex-start !important; justify-content: center !important; align-self: flex-start; padding-top: 6px; }
+ /* Ensure labels wrap instead of expanding the layout */
+ #controls_col .gr-label, #controls_col label { white-space: normal !important; word-break: break-word !important; }
+ /* Enforce a pixel-perfect identical size and box model for both action
+    buttons so they don't push the layout when the language changes */
+ #submit_btn button, #clear_btn button {
+     width: 120px !important;
+     height: 40px !important;
+     min-height: 40px !important;
+     box-sizing: border-box !important;
+     padding: 6px 12px !important;
+     display: inline-flex !important;
+     align-items: center !important;
+     justify-content: center !important;
+     font-size: 14px !important;
+     line-height: 1 !important;
+     border-radius: 6px !important;
+     border: none !important;
+     margin: 0 !important;
+     vertical-align: middle !important;
+     font-family: inherit !important;
+     background-clip: padding-box !important;
+ }
+ /* Make the main column flexible and allow it to shrink without pushing the controls */
+ #main_col { flex: 1 1 auto; min-width: 0; }
+ /* Pull the upload box a bit left to close the gap if needed */
+ #upload_file { margin-left: -6px !important; }
+ /* Keep the file control compact so it doesn't become taller than the
+    nearby control stack */
+ #upload_file .gr-file { max-height: 44px !important; height: 36px !important; box-sizing: border-box !important; }
+ /* Position the decorative image absolutely so it doesn't force wrapping;
+    reserve space on the right of #top_row to avoid overlap. */
+ #top_row { position: relative !important; padding-right: 520px !important; }
+ #img_col { position: absolute !important; right: 8px !important; top: 6px !important; width: 480px !important; max-width: 100% !important; box-sizing: border-box !important; }
+ #download_img img { width: 480px !important; max-width: 100% !important; height: auto !important; display: block !important; pointer-events: none !important; user-select: none !important; }
+ /* Button height and vertical alignment are consolidated into the single
+    authoritative sizing block above to avoid conflicting rules. */
+ </style>
+ """)
+
+         with gr.Row():
+             # Left controls column
+             with gr.Column(scale=1, elem_id='controls_col'):
+                 language = gr.Radio(choices=['eu', 'es'], value='eu', label='Hizkuntza / Idioma')
+                 symbol = gr.Radio(choices=['sampa', 'ipa'], value='sampa', label='Sinboloak / Símbolos (Irteera)')
+                 # Checked by default, Basque-only label; switches to Spanish when the language changes
+                 separate_phonemes = gr.Checkbox(label='Banatu fonemak espazioz', value=True)
+
+             # Small column to the right of the controls that holds the upload box
+             with gr.Column(scale=1, elem_id='upload_col'):
+                 upload = gr.File(file_types=['.txt'], label='Igo .txt fitxategia / Subir archivo .txt', elem_id='upload_file')
+
+             # Decorative/download image column to the right of the upload box.
+             # Embed the local `img/download.png` as a base64 <img> inside gr.HTML
+             # so Gradio doesn't add overlay controls (download/enlarge).
+             # An integer `scale` avoids Gradio's float-scale warning; the column
+             # stays compact because its width is reserved via CSS (#img_col).
+             with gr.Column(scale=1, elem_id='img_col'):
+                 img_path = Path(__file__).parent / 'img' / 'download.png'
+                 _img_data_uri = ''
+                 try:
+                     with open(img_path, 'rb') as _img_f:
+                         _img_b64 = base64.b64encode(_img_f.read()).decode('ascii')
+                         _img_data_uri = f"data:image/png;base64,{_img_b64}"
+                 except Exception:
+                     _img_data_uri = ''
+
+                 # Render HTML with a non-interactive <img>; let CSS control the width
+                 download_img = gr.HTML(f'<img src="{_img_data_uri}" alt="download" style="height:auto;pointer-events:none;user-select:none;">', elem_id='download_img')
+
+             # Main column on the right: buttons above the wide input textbox
+             with gr.Column(scale=3, elem_id='main_col'):
+                 with gr.Row():
+                     submit_btn = gr.Button('Submit', elem_id='submit_btn')
+                     clear_btn = gr.Button('Clear', elem_id='clear_btn')
+                 with gr.Row():
+                     with gr.Column(scale=5):
+                         input_text = gr.Textbox(lines=12, elem_id='input_text', label="Sarrera testua (utzi hutsik .txt fitxategia igotzen baduzu) / Texto de entrada (dejar vacío si subes un .txt)")
+
+         # Outputs area: normalized text and phoneme output side by side
+         with gr.Row():
+             with gr.Column(scale=1):
+                 normalized_box = gr.Textbox(lines=12, elem_id='normalized_box', label='Normalizatua', interactive=False)
+                 download_norm_btn = gr.DownloadButton('Deskargatu normalizatua', elem_id='download_norm_btn')
+
+             with gr.Column(scale=1):
+                 result_box = gr.Textbox(lines=12, elem_id='result_box', label='Fonemak', interactive=False)
+                 download_ph_btn = gr.DownloadButton('Deskargatu fonemak', elem_id='download_ph_btn')
+
+         # Hidden boxes that hold the latest generated file paths so the download buttons can trigger
+         ph_path_box = gr.Textbox(visible=False, elem_id='ph_path_box')
+         norm_path_box = gr.Textbox(visible=False, elem_id='norm_path_box')
+
+         def _on_click(input_text, upload, language, symbol, separate_phonemes):
+             return process(input_text, upload, language, symbol, separate_phonemes)
+
+         # When a user uploads a .txt file, read its contents and populate the
+         # `input_text` box so they can review or edit it before sending.
+         def _on_upload(uploaded_file):
+             if not uploaded_file:
+                 return gr.update(value="")
+             try:
+                 content = _read_uploaded_file(uploaded_file)
+             except Exception:
+                 content = ''
+             return gr.update(value=content)
+
+         def _clear_all():
+             # Clear the input, outputs and hidden path boxes so the UI resets
+             return (
+                 gr.update(value=""),    # input_text
+                 gr.update(value=None),  # upload (clear any uploaded file)
+                 gr.update(value=""),    # normalized_box
+                 gr.update(value=""),    # result_box
+                 gr.update(value=None),  # download_ph_btn
+                 gr.update(value=None),  # download_norm_btn
+                 gr.update(value=""),    # ph_path_box
+                 gr.update(value="")     # norm_path_box
+             )
+
+         # Re-run processing automatically when the symbol or separation options
+         # change so users don't have to press the Submit button again.
+         symbol.change(fn=_on_click, inputs=[input_text, upload, language, symbol, separate_phonemes], outputs=[result_box, download_ph_btn, normalized_box, download_norm_btn, ph_path_box, norm_path_box])
+         separate_phonemes.change(fn=_on_click, inputs=[input_text, upload, language, symbol, separate_phonemes], outputs=[result_box, download_ph_btn, normalized_box, download_norm_btn, ph_path_box, norm_path_box])
+
+         # Populate the input textbox when a file is uploaded so users can see and
+         # edit it before sending. Does not auto-run processing.
+         upload.change(fn=_on_upload, inputs=[upload], outputs=[input_text])
+
+         # Update UI texts when the language selection changes
+         def _update_language_ui(lang):
+             # Note: the header is intentionally NOT updated here, to avoid large
+             # DOM changes that reflow the layout when switching languages.
+             if lang == 'eu':
+                 return (
+                     gr.update(label='Sinboloak (Irteera)'),      # symbol
+                     gr.update(label='Banatu fonemak espazioz'),  # separate_phonemes
+                     # keep input/upload labels stable (not updated, to avoid reflow)
+                     gr.update(label='Fonemak'),
+                     gr.update(label='Deskargatu irteera (.txt)'),
+                     gr.update(label='Normalizatua'),
+                     gr.update(label='Deskargatu normalizatua (.txt)'),
+                     gr.update(value=''),
+                     gr.update(value='')
+                 )
+             else:
+                 return (
+                     gr.update(label='Símbolos (Salida)'),
+                     gr.update(label='Separar fonemas con espacios'),
+                     # keep input/upload labels stable (not updated, to avoid reflow)
+                     gr.update(label='Fonemas'),
+                     gr.update(label='Descargar salida (.txt)'),
+                     gr.update(label='Normalizado'),
+                     gr.update(label='Descargar normalizado (.txt)'),
+                     gr.update(value=''),
+                     gr.update(value='')
+                 )
+
+         # Note: `header`, `input_text`, the upload control and the action buttons
+         # are excluded from the outputs to avoid reflow when changing language.
+         # Only the smaller output labels and the hidden path boxes are updated,
+         # matching the 8 values the function actually returns.
+         language.change(fn=_update_language_ui, inputs=[language], outputs=[symbol, separate_phonemes, result_box, download_ph_btn, normalized_box, download_norm_btn, ph_path_box, norm_path_box])
+
+         submit_btn.click(fn=_on_click, inputs=[input_text, upload, language, symbol, separate_phonemes], outputs=[result_box, download_ph_btn, normalized_box, download_norm_btn, ph_path_box, norm_path_box])
+         clear_btn.click(fn=_clear_all, inputs=[], outputs=[input_text, upload, normalized_box, result_box, download_ph_btn, download_norm_btn, ph_path_box, norm_path_box])
+
+         # Note: the download buttons themselves are created in the outputs area above.
+
+         def _download_file(path: str):
+             # Simple path-return helper, kept for backwards compatibility
+             if not path:
+                 return None
+             p = Path(path)
+             if not p.exists():
+                 return None
+             return str(p)
+
+         # Download callbacks generate the outputs on demand, so a single click
+         # both creates the file and returns its path to the browser.
+         def _download_ph_from_inputs(input_text, upload, language, symbol, separate_phonemes):
+             # Call the same process() function to ensure the files are generated.
+             # process() returns (result_text, ph_path, normalized_text, norm_path, ph_path, norm_path).
+             res = process(input_text, upload, language, symbol, separate_phonemes)
+             if isinstance(res, tuple) and len(res) >= 2:
+                 return _download_file(res[1])
+             return None
+
+         def _download_norm_from_inputs(input_text, upload, language, symbol, separate_phonemes):
+             res = process(input_text, upload, language, symbol, separate_phonemes)
+             if isinstance(res, tuple) and len(res) >= 4:
+                 return _download_file(res[3])
+             return None
+
+         # Wire the DownloadButtons to generate-and-return callbacks so a single
+         # click performs generation and triggers an immediate download.
+         download_ph_btn.click(fn=_download_ph_from_inputs, inputs=[input_text, upload, language, symbol, separate_phonemes], outputs=[download_ph_btn])
+         download_norm_btn.click(fn=_download_norm_from_inputs, inputs=[input_text, upload, language, symbol, separate_phonemes], outputs=[download_norm_btn])
+
+     return demo
+
+
+ def _find_free_port(start: int = 7860, end: int = 7870) -> Optional[int]:
+     """Find a free TCP port in the given inclusive range."""
+     for port in range(start, end + 1):
+         with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+             try:
+                 s.bind(('0.0.0.0', port))
+                 return port
+             except OSError:
+                 continue
+     return None
+
+
+ if __name__ == '__main__':
+     app = build_interface()
+
+     # Allow explicit override via environment variable
+     env_port = os.environ.get('GRADIO_SERVER_PORT')
+     if env_port:
+         try:
+             port = int(env_port)
+         except ValueError:
+             print(f"Invalid GRADIO_SERVER_PORT='{env_port}', falling back to automatic selection.")
+             port = None
+     else:
+         port = None
+
+     if port is None:
+         port = _find_free_port(7860, 7880)
+
+     if port is None:
+         raise OSError("No free port found in range 7860-7880. Set GRADIO_SERVER_PORT to a free port.")
+
+     print(f"Launching Gradio on port {port} (server_name=0.0.0.0)")
+     app.launch(server_name='0.0.0.0', server_port=port)
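The port-scanning helper above is easy to exercise in isolation. A minimal standalone sketch (the function mirrors `_find_free_port`; the surrounding demo code and names are illustrative, not part of app.py):

```python
import socket
from typing import Optional


def find_free_port(start: int = 7860, end: int = 7870) -> Optional[int]:
    """Return the first TCP port in [start, end] that binds cleanly, else None."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("0.0.0.0", port))
                return port  # bind succeeded, so the port was free at probe time
            except OSError:
                continue  # port in use (or not permitted); try the next one
    return None


# Occupy one port, then confirm the scan does not pick it.
blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
blocker.bind(("0.0.0.0", 0))  # let the OS choose any free port
busy = blocker.getsockname()[1]
chosen = find_free_port(busy, busy + 10)
print(chosen is None or chosen != busy)
blocker.close()
```

Note the inherent race: a port reported free here can still be taken by another process before `app.launch()` binds it, which is why the `GRADIO_SERVER_PORT` override remains useful.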
img/download.png ADDED
modulo1y2/modulo1y2 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c122bd6197e5e360d534957322f8d98a06cb3bcb4d412ee9978e891ae1b43e8a
+ size 2245952
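The three lines above are a Git LFS pointer: the file in the repository stores only the spec version, the SHA-256 object ID and the byte size of the real binary. A small illustrative sketch of reading such a pointer (the helper name is hypothetical, not part of this repo):

```python
# Parse a Git LFS pointer file: one "key value" pair per line.
def parse_lfs_pointer(text: str) -> dict:
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:c122bd6197e5e360d534957322f8d98a06cb3bcb4d412ee9978e891ae1b43e8a
size 2245952
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # → 2245952
```

This is why `.gitattributes` must mark `modulo1y2/modulo1y2` with `filter=lfs`: without the filter, the 2.2 MB binary itself would be committed instead of this small pointer.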
prepare.sh ADDED
@@ -0,0 +1,19 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ echo "Preparing phonemizer workspace..."
+
+ # Make sure the executable bit is set if the binary is present
+ if [ -f "modulo1y2/modulo1y2" ]; then
+     chmod +x modulo1y2/modulo1y2 || true
+     echo "Ensured modulo1y2/modulo1y2 is executable."
+ else
+     echo "Warning: modulo1y2/modulo1y2 not found. If you plan to ship the binary, add it to the repo."
+ fi
+
+ echo "Preparation complete. To run locally:
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install -r requirements.txt
+ python app.py
+ "
push_to_hf.sh ADDED
@@ -0,0 +1,35 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ # Safe push script for Hugging Face Spaces using the HF_TOKEN env var.
+ # Usage:
+ #   export HF_TOKEN="<your_token>"
+ #   cd /path/to/tmp_space
+ #   chmod +x push_to_hf.sh
+ #   ./push_to_hf.sh
+
+ REPO_DIR="$(cd "$(dirname "$0")" && pwd)"
+ cd "$REPO_DIR"
+
+ if [ -z "${HF_TOKEN:-}" ]; then
+     echo "ERROR: HF_TOKEN is not set. Run: export HF_TOKEN=\"<your_token>\""
+     exit 1
+ fi
+
+ # Show the current branch and changes
+ git --no-pager status --porcelain --branch
+
+ # Pass the token via http.extraHeader with `git -c` so it applies to this
+ # single command only and is never stored in git config or logs.
+ echo "Pushing to origin (authenticated via HF_TOKEN) ..."
+ if git -c http.extraHeader="Authorization: Bearer $HF_TOKEN" push origin HEAD:main; then
+     echo "Push succeeded. Space should start building shortly on Hugging Face."
+ else
+     RET=$?
+     echo "Push failed with exit code $RET"
+     exit $RET
+ fi
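The reason the script passes the token through `git -c` is that such a value applies to a single git invocation and is never written to any config file. A small sketch demonstrating that behavior (requires `git` on PATH; the `dummy` token and temp repo are illustrative only):

```python
import subprocess
import tempfile

# Create a throwaway repo to run git commands in.
tmp = tempfile.mkdtemp()
subprocess.run(["git", "init", "-q", tmp], check=True)

# The -c value is part of this one invocation's configuration...
out = subprocess.run(
    ["git", "-c", "http.extraHeader=Authorization: Bearer dummy",
     "config", "--get", "http.extraHeader"],
    cwd=tmp, capture_output=True, text=True)
print(out.stdout.strip())

# ...but it is not persisted: a plain lookup finds nothing and exits non-zero.
plain = subprocess.run(["git", "config", "--get", "http.extraHeader"],
                       cwd=tmp, capture_output=True, text=True)
print(plain.returncode != 0)
```

By contrast, embedding the token in the remote URL or running `git config http.extraHeader ...` would leave the credential on disk in `.git/config`.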
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ gradio>=3.0
+ nltk
+ huggingface-hub