Ander Arriandiaga committed on
Commit a7c0c81 · 1 Parent(s): 42aaddc

Initial commit for Hugging Face Space
.gitattributes CHANGED
@@ -1,35 +1,3 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ # Track large dictionary files and the binary with Git LFS if enabled
+ dict/* filter=lfs diff=lfs merge=lfs -text
+ modulo1y2/modulo1y2 filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1 @@
+ outputs/
README.md CHANGED
@@ -1,14 +1,18 @@
- ---
- title: Phonemizer Eus Esp
- emoji: 🏃
- colorFrom: green
- colorTo: gray
- sdk: gradio
- sdk_version: 6.0.0
- app_file: app.py
- pinned: false
- license: cc-by-nc-4.0
- short_description: Web UI to phonemize Basque (eu) and Spanish (es) tex
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Phonemizer — Gradio demo (Hugging Face Space)
+
+ This Space provides a small web UI to phonemize Basque (eu) and Spanish (es) text.
+
+ How to use
+ - Input text: paste text into the main box or upload a `.txt` file.
+ - Language: select `eu` (Basque) or `es` (Spanish).
+ - Symbols: choose `sampa` (default) or `ipa` for the phoneme output format.
+ - Separate phonemes: toggle whether phonemes are separated by spaces, which makes multi-character phonemes easier to see.
+ - Submit: press `Submit` to run normalization + phonemization.
+ - Download: use the download buttons to get the phonemes or normalized text as `.txt` files.
+
+ Privacy
+ - This Space does not store user inputs beyond temporary files used to serve downloads. Do not upload sensitive data.
+
+ Credits
+ - Developed by Ander Arriandiaga at Aholab (HiTZ).
+
README_developer.md ADDED
@@ -0,0 +1,57 @@
+ # Phonemizer Gradio Space — Developer Notes
+
+ This repository contains a Gradio app wrapper for the Phonemizer used in this project.
+
+ Files to keep in the Space repo for runtime
+ - `gradio_phonemizer.py` (UI) and `eu_phonemizer_v2.py` (phonemizer logic)
+ - `app.py` (Gradio entrypoint)
+ - `modulo1y2/modulo1y2` (the phonemizer executable) OR source+build files in `modulo1y2/`
+ - `dict/` containing `eu_dicc` (or `eu_dicc.dic`) and `es_dicc` (or `es_dicc.dic`)
+ - `requirements.txt`
+
+ Recommended deployment options
+
+ - Ship the `modulo1y2` executable and the minimal dictionary files in the repo (fastest).
+ - OR keep only sources and build the executable on Space startup using an `apt.txt` and a `make` step.
+ - OR host large dictionaries/executables on the Hugging Face Hub (dataset/model repo) and download them at startup using `huggingface_hub.hf_hub_download`.
+
+ Quick local test
+
+ 1. Create a venv and install dependencies:
+
+ ```bash
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install -r requirements.txt
+ ```
+
+ 2. Ensure the executable is present and executable:
+
+ ```bash
+ chmod +x modulo1y2/modulo1y2
+ ls -l modulo1y2/modulo1y2
+ ls -l dict/eu_dicc* dict/es_dicc*
+ ```
+
+ 3. Run the app locally:
+
+ ```bash
+ python app.py
+ # then open http://localhost:7860
+ ```
+
+ Pushing to Hugging Face Spaces
+
+ 1. (Optional) Install git-lfs and track large files:
+
+ ```bash
+ git lfs install
+ git lfs track "dict/*"
+ git lfs track "modulo1y2/modulo1y2"
+ ```
+
+ 2. Create a Space (via web UI or `huggingface-cli repo create <user>/<space> --type=space`), then push this repo to the Space remote.
+
+ Licensing and redistribution
+
+ Before uploading binaries or dictionary files, confirm you have the right to redistribute them.
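The third deployment option (downloading assets from the Hub at startup) could be sketched as follows; the repo id and filenames here are hypothetical placeholders, not values from this project:

```python
from pathlib import Path


def fetch_asset(repo_id: str, filename: str, local_dir: str = ".") -> str:
    """Download one file from a Hub dataset repo into local_dir and return its path."""
    # Deferred import so this sketch can be read/imported without huggingface_hub installed
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename,
                           repo_type="dataset", local_dir=local_dir)


def ensure_assets() -> None:
    """Fetch dictionaries only if they are not already present in the repo."""
    # Hypothetical asset repo — replace with wherever the dictionaries are hosted
    if not Path("dict/eu_dicc.dic").exists():
        for name in ("dict/eu_dicc.dic", "dict/es_dicc.dic"):
            fetch_asset("your-user/phonemizer-assets", name)
```

A call to `ensure_assets()` at the top of `app.py` would then run before `build_interface()` needs the files.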
app.py ADDED
@@ -0,0 +1,9 @@
+ import os
+ from gradio_phonemizer import build_interface
+
+ demo = build_interface()
+
+ if __name__ == "__main__":
+     # Respect common env vars used by hosting platforms
+     port = int(os.environ.get("PORT", os.environ.get("GRADIO_SERVER_PORT", 7860)))
+     demo.launch(server_name="0.0.0.0", server_port=port)
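The port lookup in `app.py` gives `PORT` precedence over `GRADIO_SERVER_PORT` and falls back to 7860; a standalone sketch of that resolution logic:

```python
def resolve_port(env: dict) -> int:
    """Mirror app.py's lookup: PORT wins, then GRADIO_SERVER_PORT, then 7860."""
    return int(env.get("PORT", env.get("GRADIO_SERVER_PORT", 7860)))


print(resolve_port({}))                                              # → 7860
print(resolve_port({"GRADIO_SERVER_PORT": "8080"}))                  # → 8080
print(resolve_port({"PORT": "9000", "GRADIO_SERVER_PORT": "8080"}))  # → 9000
```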
dict/es_dicc.dic ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3880d688565dcfc4c1a239cb94c6cc0466b603cbf86fbf8a20ca411d64cb3c03
+ size 141770
dict/es_dicc_20241204.dic ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3880d688565dcfc4c1a239cb94c6cc0466b603cbf86fbf8a20ca411d64cb3c03
+ size 141770
dict/eu_dicc.dic ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a4c6553965ac7c7937b599d3e8a3d8d94df48a0bdef943a84c63f4b261172f8
+ size 865575
dict/eu_dicc_20250326.dic ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a4c6553965ac7c7937b599d3e8a3d8d94df48a0bdef943a84c63f4b261172f8
+ size 865575
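The four dictionary files above are Git LFS pointer stubs rather than the dictionary data itself; the real content is addressed by the `oid` and fetched on checkout. A minimal sketch of reading such a pointer (using the `es_dicc.dic` stub shown above):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3880d688565dcfc4c1a239cb94c6cc0466b603cbf86fbf8a20ca411d64cb3c03
size 141770
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # → 141770 (size in bytes of the real file)
```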
eu_phonemizer_v2.py ADDED
@@ -0,0 +1,333 @@
+ import subprocess
+ import logging
+ import string
+ from pathlib import Path
+ from collections import OrderedDict
+ from nltk.tokenize import TweetTokenizer
+ from typing import List, Dict, Optional
+ import re
+
+ # Constants
+ SUPPORTED_LANGUAGES = {'eu', 'es'}
+ SUPPORTED_SYMBOLS = {'sampa', 'ipa'}
+ SAMPA_TO_IPA = OrderedDict([
+     ("p", "p"), ("b", "b"), ("t", "t"), ("c", "c"), ("d", "d"),
+     ("k", "k"), ("g", "ɡ"), ("tS", "tʃ"), ("ts", "ts"), ("ts`", "tʂ"),
+     ("gj", "ɟ"), ("jj", "ɪ"), ("f", "f"), ("B", "β"), ("T", "θ"),
+     ("D", "ð"), ("s", "s"), ("s`", "ʂ"), ("S", "ʃ"), ("x", "x"),
+     ("G", "ɣ"), ("m", "m"), ("n", "n"), ("J", "ɲ"), ("l", "l"),
+     ("L", "ʎ"), ("r", "ɾ"), ("rr", "r"), ("j", "j"), ("w", "w"),
+     ("i", "i"), ("'i", "'i"), ("e", "e"), ("'e", "'e"), ("a", "a"),
+     ("'a", "'a"), ("o", "o"), ("'o", "'o"), ("u", "u"), ("'u", "'u"),
+     ("y", "y"), ("Z", "ʒ"), ("h", "h"), ("ph", "pʰ"), ("kh", "kʰ"),
+     ("th", "tʰ")
+ ])
+
+ MULTICHAR_TO_SINGLECHAR = {
+     "tʃ": "C",
+     "ts": "V",
+     "tʂ": "P",
+     "'i": "I",
+     "'e": "E",
+     "'a": "A",
+     "'o": "O",
+     "'u": "U",
+     "pʰ": "H",
+     "kʰ": "K",
+     "tʰ": "T"
+ }
+
+ class PhonemizerError(Exception):
+     """Custom exception for Phonemizer errors."""
+     pass
+
+ class Phonemizer:
+     def __init__(self, language: str = "eu", symbol: str = "sampa",
+                  path_modulo1y2: str = "modulo1y2/modulo1y2",
+                  path_dicts: str = "dict") -> None:
+         """Initialize the Phonemizer with the given language and symbol."""
+         if language not in SUPPORTED_LANGUAGES:
+             raise PhonemizerError(f"Unsupported language: {language}")
+         if symbol not in SUPPORTED_SYMBOLS:
+             raise PhonemizerError(f"Unsupported symbol type: {symbol}")
+
+         self.language = language
+         self.symbol = symbol
+         self.path_modulo1y2 = Path(path_modulo1y2)
+         self.path_dicts = Path(path_dicts)
+         self.logger = logging.getLogger(__name__)
+
+         # Initialize SAMPA to IPA dictionary
+         self._sampa_to_ipa_dict = SAMPA_TO_IPA
+
+         # Initialize word splitter regex
+         self._word_splitter = re.compile(r'\w+|[^\w\s]', re.UNICODE)
+
+         self._validate_paths()
+
+     def normalize(self, text: str) -> str:
+         """Normalize the given text using an external command."""
+         try:
+             command = self._build_normalization_command()
+             process = subprocess.Popen(
+                 command,
+                 stdin=subprocess.PIPE,
+                 stdout=subprocess.PIPE,
+                 stderr=subprocess.PIPE,
+                 text=True,
+                 encoding='ISO-8859-15',
+                 shell=True
+             )
+             stdout, stderr = process.communicate(input=text)
+
+             if process.returncode != 0:
+                 # Filter out the SetDur warning from the error message
+                 filtered_stderr = '\n'.join(line for line in stderr.split('\n')
+                                             if 'Warning: argument not used SetDur' not in line)
+                 if filtered_stderr.strip():  # Only raise if there are other errors
+                     error_msg = f"Normalization failed: {filtered_stderr}"
+                     self.logger.error(error_msg)
+                     raise PhonemizerError(error_msg)
+
+             return stdout.strip()
+
+         except Exception as e:
+             error_msg = f"Error during normalization: {str(e)}"
+             self.logger.error(error_msg)
+             return text
+
+     def getPhonemes(self, text: str, separate_phonemes: bool = False) -> str:
+         """Extract phonemes from the given text.
+
+         Args:
+             text (str): The input text to convert to phonemes
+             separate_phonemes (bool): If True, keeps spaces between phonemes. If False, produces
+                 compact phoneme strings. Defaults to False.
+
+         Returns:
+             str: The phoneme sequence, one output line per input line
+         """
+         try:
+             # Pre-process text to handle dots consistently:
+             # replace multiple dots with a single dot to avoid issues with ellipsis
+             text = re.sub(r'\.{2,}', '.', text)
+
+             # Process input line-by-line so we preserve original newlines
+             lines = text.split('\n')
+             per_line_outputs = []
+             for line in lines:
+                 # If the input line is empty, preserve the empty line
+                 if not line.strip():
+                     per_line_outputs.append('')
+                     continue
+
+                 command = self._build_phoneme_extraction_command()
+                 proc = subprocess.Popen(
+                     command,
+                     stdin=subprocess.PIPE,
+                     stdout=subprocess.PIPE,
+                     stderr=subprocess.PIPE,
+                     text=True,
+                     encoding='ISO-8859-15',
+                     shell=True
+                 )
+                 stdout, stderr = proc.communicate(input=line)
+                 if proc.returncode != 0:
+                     error_msg = f"Phoneme extraction failed: {stderr}"
+                     self.logger.error(error_msg)
+                     raise PhonemizerError(error_msg)
+
+                 # Replace any internal newlines in the tool output with a sentinel
+                 # (shouldn't normally occur for a single input line)
+                 stdout_line = stdout.replace('\n', ' | _ | ')
+
+                 # Split into per-word phoneme sequences for this line
+                 word_phonemes = stdout_line.split(" | ")
+                 cleaned_phonemes = []
+                 for phoneme_seq in word_phonemes:
+                     if not phoneme_seq.strip():
+                         continue
+                     if phoneme_seq.strip() == "_":
+                         continue
+                     cleaned_phonemes.append(phoneme_seq.strip())
+                 # Tokenize the original line into words/punctuation
+                 words = self._word_splitter.findall(line)
+
+                 # Count non-punctuation words
+                 non_punct_words = [w for w in words if w not in string.punctuation]
+
+                 # Ensure we have enough phoneme sequences for all non-punctuation words
+                 while len(cleaned_phonemes) < len(non_punct_words):
+                     if cleaned_phonemes:
+                         cleaned_phonemes.append(cleaned_phonemes[-1])
+                     else:
+                         cleaned_phonemes.append("a")
+
+                 # Process words and phonemes together for this line
+                 phoneme_idx = 0
+                 word_idx = 0
+                 line_result = []
+
+                 while word_idx < len(words):
+                     word = words[word_idx]
+
+                     if word in string.punctuation:
+                         line_result.append(word)
+                         word_idx += 1
+                         continue
+
+                     # Regular word processing
+                     if phoneme_idx < len(cleaned_phonemes):
+                         phonemes = cleaned_phonemes[phoneme_idx].split()
+                         if self.symbol == "sampa":
+                             if separate_phonemes:
+                                 processed_phonemes = " ".join(p for p in phonemes if p != "-")
+                             else:
+                                 processed_phonemes = "".join(p for p in phonemes if p != "-")
+                         else:
+                             ipa_phonemes = [self._sampa_to_ipa_dict.get(p, p) for p in phonemes if p != "-"]
+                             if separate_phonemes:
+                                 processed_phonemes = " ".join(ipa_phonemes)
+                             else:
+                                 processed_phonemes = "".join(ipa_phonemes)
+
+                         line_result.append(processed_phonemes)
+                         phoneme_idx += 1
+                         word_idx += 1
+                     else:
+                         # No phoneme left for this word: skip it
+                         word_idx += 1
+
+                 # If there are leftover phonemes, append them
+                 while phoneme_idx < len(cleaned_phonemes):
+                     phonemes = cleaned_phonemes[phoneme_idx].split()
+                     if self.symbol == "sampa":
+                         # Honor separate_phonemes here too, as in the main loop
+                         processed_phonemes = (" " if separate_phonemes else "").join(p for p in phonemes if p != "-")
+                     else:
+                         ipa_phonemes = [self._sampa_to_ipa_dict.get(p, p) for p in phonemes if p != "-"]
+                         if separate_phonemes:
+                             processed_phonemes = " ".join(ipa_phonemes)
+                         else:
+                             processed_phonemes = "".join(ipa_phonemes)
+
+                     line_result.append(processed_phonemes)
+                     phoneme_idx += 1
+
+                 # Format final output for this line using spacing rules
+                 out_parts = []
+                 for token in line_result:
+                     if token not in string.punctuation:
+                         out_parts.append(re.sub(r"\s+", " ", token.strip()))
+                     else:
+                         out_parts.append(token)
+
+                 final_line = ""
+                 for i, tok in enumerate(out_parts):
+                     if i == 0:
+                         final_line += tok
+                         continue
+
+                     prev = out_parts[i-1]
+
+                     if tok in string.punctuation:
+                         final_line = final_line.rstrip(' ')
+                         final_line += ' ' + tok
+                         # Preserve input line boundaries: do NOT insert newlines mid-line.
+                         # Always add the standard separator after punctuation.
+                         if i < len(out_parts) - 1:
+                             final_line += ' '
+                     else:
+                         if prev in string.punctuation:
+                             final_line += tok
+                         else:
+                             final_line += ' ' + tok
+
+                 # If sentence-ending punctuation is followed by a capital letter,
+                 # split into separate lines (keeps numeric periods like "1980. urtean" intact).
+                 # This turns "... ? Ni ..." into two lines at the sentence boundary.
+                 split_line = re.sub(r"(?<=[\?\!\.])\s+(?=[A-ZÁÉÍÓÚÜÑ])", "\n", final_line)
+                 per_line_outputs.append(split_line)
+
+             return "\n".join(per_line_outputs)
+
+         except Exception as e:
+             error_msg = f"Error in phoneme extraction: {str(e)}"
+             self.logger.error(error_msg)
+             return ""
+
+     def _build_normalization_command(self) -> str:
+         """Build the command string for normalization."""
+         modulo_path = self._get_file_path() / self.path_modulo1y2
+         dict_path = self._get_file_path() / self.path_dicts
+         dict_file = f"{self.language}_dicc"
+         return f'{modulo_path} -TxtMode=Word -Lang={self.language} -HDic={dict_path/dict_file}'
+
+     def _build_phoneme_extraction_command(self) -> str:
+         """Build the command string for phoneme extraction."""
+         modulo_path = self._get_file_path() / self.path_modulo1y2
+         dict_path = self._get_file_path() / self.path_dicts
+         dict_file = f"{self.language}_dicc"
+         return f'{modulo_path} -Lang={self.language} -HDic={dict_path/dict_file}'
+
+     def _get_file_path(self) -> Path:
+         return Path(__file__).parent
+
+     def _validate_paths(self) -> None:
+         """Validate paths with enhanced error reporting."""
+         try:
+             # Resolve relative to this module so validation matches the built commands
+             if not (self._get_file_path() / self.path_modulo1y2).exists():
+                 raise PhonemizerError(f"Modulo1y2 executable not found at: {self._get_file_path() / self.path_modulo1y2}")
+             if not (self._get_file_path() / self.path_dicts).exists():
+                 raise PhonemizerError(f"Dictionary directory not found at: {self._get_file_path() / self.path_dicts}")
+
+             # Check for both possible dictionary files
+             dict_file = self._get_file_path() / self.path_dicts / f"{self.language}_dicc"
+             if not dict_file.exists():
+                 # Try with .dic extension as a fallback
+                 dict_file_alt = self._get_file_path() / self.path_dicts / f"{self.language}_dicc.dic"
+                 if not dict_file_alt.exists():
+                     raise PhonemizerError(f"Dictionary file not found at either {dict_file} or {dict_file_alt}")
+
+         except Exception as e:
+             self.logger.error(f"Path validation error: {str(e)}")
+             raise
+
+     def _transform_multichar_phonemes(self, phoneme_sequence: str) -> str:
+         """
+         Transform multicharacter IPA phonemes to single characters using the
+         MULTICHAR_TO_SINGLECHAR mapping.
+
+         Args:
+             phoneme_sequence (str): A string containing phonemes separated by spaces
+
+         Returns:
+             str: The sequence with multicharacter phonemes replaced by single characters
+         """
+         # Split the sequence into individual phonemes
+         phonemes = phoneme_sequence.split()
+         transformed_phonemes = []
+
+         for phoneme in phonemes:
+             # Check if the phoneme exists in our mapping
+             if phoneme in MULTICHAR_TO_SINGLECHAR:
+                 transformed_phonemes.append(MULTICHAR_TO_SINGLECHAR[phoneme])
+             else:
+                 transformed_phonemes.append(phoneme)
+
+         return " ".join(transformed_phonemes)
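To illustrate the per-token SAMPA→IPA substitution that `getPhonemes` performs when `symbol="ipa"`, here is a standalone sketch over a handful of entries copied from the mapping above (the token list is a made-up example, not real tool output):

```python
from collections import OrderedDict

# A few entries copied from SAMPA_TO_IPA in eu_phonemizer_v2.py
SAMPA_TO_IPA = OrderedDict([
    ("tS", "tʃ"), ("rr", "r"), ("r", "ɾ"), ("'a", "'a"), ("a", "a"),
])


def to_ipa(tokens):
    # Per-token lookup, falling back to the token itself; "-" markers are dropped,
    # mirroring the list comprehension used in getPhonemes
    return [SAMPA_TO_IPA.get(t, t) for t in tokens if t != "-"]


print("".join(to_ipa(["tS", "a", "rr", "-", "o"])))  # → tʃaro
```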
gradio_phonemizer.py ADDED
@@ -0,0 +1,506 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import gradio as gr
2
+ import tempfile
3
+ import base64
4
+ import re
5
+ import socket
6
+ import os
7
+ from pathlib import Path
8
+ from typing import Optional, Tuple
9
+ import threading
10
+ import time
11
+ import atexit
12
+
13
+ # Output cleanup configuration
14
+ OUTPUTS_DIR = Path(__file__).parent / 'outputs'
15
+ OUTPUT_CLEANUP_TTL = 24 * 3600 # seconds, default 24 hours
16
+ OUTPUT_CLEANUP_MAX_FILES = 500 # keep at most this many files
17
+ OUTPUT_CLEANUP_INTERVAL = 60 * 60 # in seconds, run cleanup every hour
18
+
19
+
20
+ def _cleanup_outputs(out_dir: Path = None, max_files: int = None, ttl: int = None):
21
+ """Delete old files in `out_dir` older than `ttl` seconds and keep at most
22
+ `max_files` newest files. If parameters are None, use module defaults."""
23
+ if out_dir is None:
24
+ out_dir = OUTPUTS_DIR
25
+ if not out_dir.exists():
26
+ return
27
+ if max_files is None:
28
+ max_files = OUTPUT_CLEANUP_MAX_FILES
29
+ if ttl is None:
30
+ ttl = OUTPUT_CLEANUP_TTL
31
+
32
+ now = time.time()
33
+ files = [p for p in out_dir.iterdir() if p.is_file()]
34
+ # Remove files older than ttl
35
+ for p in files:
36
+ try:
37
+ if now - p.stat().st_mtime > ttl:
38
+ p.unlink()
39
+ except Exception:
40
+ pass
41
+
42
+ # Re-list and trim to max_files
43
+ files = sorted([p for p in out_dir.iterdir() if p.is_file()], key=lambda p: p.stat().st_mtime, reverse=True)
44
+ if len(files) > max_files:
45
+ for p in files[max_files:]:
46
+ try:
47
+ p.unlink()
48
+ except Exception:
49
+ pass
50
+
51
+
52
+ def _cleanup_all_on_exit():
53
+ """Remove all files in outputs folder on process exit."""
54
+ try:
55
+ if OUTPUTS_DIR.exists():
56
+ for p in OUTPUTS_DIR.iterdir():
57
+ try:
58
+ if p.is_file():
59
+ p.unlink()
60
+ except Exception:
61
+ pass
62
+ except Exception:
63
+ pass
64
+
65
+
66
+ def _start_periodic_cleanup():
67
+ def _worker():
68
+ while True:
69
+ try:
70
+ _cleanup_outputs(OUTPUTS_DIR)
71
+ except Exception:
72
+ pass
73
+ time.sleep(OUTPUT_CLEANUP_INTERVAL)
74
+
75
+ t = threading.Thread(target=_worker, daemon=True, name='outputs-cleaner')
76
+ t.start()
77
+
78
+
79
+ # Ensure outputs dir exists and start background cleaner; register atexit
80
+ OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)
81
+ _start_periodic_cleanup()
82
+ atexit.register(_cleanup_all_on_exit)
83
+ from eu_phonemizer_v2 import Phonemizer, PhonemizerError
84
+
85
+
86
+ def _read_uploaded_file(file_obj) -> str:
87
+ if not file_obj:
88
+ return ""
89
+ # gradio will provide a temporary file path
90
+ p = Path(file_obj.name) if hasattr(file_obj, "name") else Path(file_obj)
91
+ try:
92
+ return p.read_text(encoding='utf-8')
93
+ except Exception:
94
+ return p.read_text(encoding='ISO-8859-15')
95
+
96
+
97
+ def process(text: str,
98
+ uploaded_file,
99
+ language: str,
100
+ symbol: str,
101
+ separate_phonemes: bool) -> Tuple[str, Optional[str], str, Optional[str]]:
102
+ """Process either text input or uploaded txt file and return (text_output, download_file_path)
103
+
104
+ If the user uploaded a file, the function will return the path to a tmp file
105
+ suitable for download as the second return value and an empty text output.
106
+ If the user provided text in the box, the function will return the phonemes
107
+ as text and also a downloadable txt file containing the same output.
108
+ """
109
+ # Prefer uploaded file if present
110
+ source_text = ""
111
+ is_file_input = False
112
+ if uploaded_file:
113
+ source_text = _read_uploaded_file(uploaded_file)
114
+ is_file_input = True
115
+ else:
116
+ source_text = text or ""
117
+
118
+ # Try to instantiate Phonemizer using repo-local modulo1y2 and dicts
119
+ try:
120
+ phon = Phonemizer(language=language, symbol=symbol)
121
+ except PhonemizerError as e:
122
+ if language == 'eu':
123
+ err = f"Ezin izan da fonemizadorea hasi: {e}\nEgiaztatu 'modulo1y2' eta 'dict' karpetak."
124
+ else:
125
+ err = f"No se pudo inicializar el fonemizador: {e}\nComprueba las carpetas 'modulo1y2' y 'dict'."
126
+ # Return 6 outputs matching the UI: result text, file, normalized text, norm file, ph_path, norm_path
127
+ return err, None, "", None, "", ""
128
+ except Exception as e:
129
+ if language == 'eu':
130
+ return f"Hasieratze errore ezezaguna: {e}", None, "", None, "", ""
131
+ return f"Error inesperado al inicializar: {e}", None
132
+
133
+
134
+ # Normalize then get phonemes. Run normalization per original input line so the
135
+ # external normalizer doesn't insert extra newlines across sentences and
136
+ # we preserve the user's original line boundaries.
137
+ try:
138
+ lines = source_text.split('\n')
139
+ normalized_lines = []
140
+ for ln in lines:
141
+ if not ln.strip():
142
+ normalized_lines.append('')
143
+ else:
144
+ # normalize each line independently, collapse any internal newlines
145
+ # produced by the external normalizer, collapse multiple whitespace
146
+ # (this avoids producing double spaces when the normalizer inserts
147
+ # a '\n' while the original text already had a space), and strip
148
+ norm_line = phon.normalize(ln)
149
+ norm_line = norm_line.replace('\n', ' ')
150
+ norm_line = re.sub(r"\s+", ' ', norm_line).strip()
151
+ normalized_lines.append(norm_line)
152
+ normalized = '\n'.join(normalized_lines)
153
+
154
+ phonemes = phon.getPhonemes(normalized, separate_phonemes=separate_phonemes)
155
+ # Defensive cleanup: if any '|' separators remain, replace them with single spaces
156
+ if isinstance(phonemes, str) and '|' in phonemes:
157
+ phonemes = re.sub(r"\s*\|\s*", " ", phonemes)
158
+ except PhonemizerError as e:
159
+ if language == 'eu':
160
+ msg = f"Fonemizazio errorea: {e}"
161
+ else:
162
+ msg = f"Error del fonemizador: {e}"
163
+ return msg, None, "", None, "", ""
164
+ except Exception as e:
165
+ if language == 'eu':
166
+ msg = f"Errore ezezaguna prozesatzean: {e}"
167
+ else:
168
+ msg = f"Error inesperado al procesar: {e}"
169
+ return msg, None, "", None, "", ""
170
+
171
+ # Create persistent downloadable files under outputs/ so the browser can reliably
172
+ # download them using Gradio's `gr.File` component (avoid ephemeral tmp files
173
+ # that some browsers may not fetch correctly).
174
+ out_dir = Path(__file__).parent / 'outputs'
175
+ out_dir.mkdir(parents=True, exist_ok=True)
176
+ from datetime import datetime
177
+ ts = datetime.now().strftime('%Y%m%d_%H%M%S')
178
+ ph_file = out_dir / f'phonemes_{ts}.txt'
179
+ norm_file = out_dir / f'normalized_{ts}.txt'
180
+ ph_file.write_text(phonemes, encoding='utf-8')
181
+ norm_file.write_text(normalized, encoding='utf-8')
182
+
183
+ # Cleanup old files opportunistically after creating new ones
184
+ try:
185
+ _cleanup_outputs(out_dir)
186
+ except Exception:
187
+ pass
188
+
189
+ # Return phonemes and normalized text in all cases (text or uploaded file)
190
+ # so users who upload a .txt can see the processed text inline and download it.
191
+ return phonemes, str(ph_file), normalized, str(norm_file), str(ph_file), str(norm_file)
192
+
193
+
194
+ def download_from_text(text: str) -> Optional[str]:
195
+ """Create a temporary .txt file from the given text and return its path for download."""
196
+ if not text:
197
+ return None
198
+ # Save into a persistent outputs/ directory with a readable timestamped filename
199
+ out_dir = Path(__file__).parent / 'outputs'
200
+ out_dir.mkdir(parents=True, exist_ok=True)
201
+ from datetime import datetime
202
+ ts = datetime.now().strftime('%Y%m%d_%H%M%S')
203
+ filename = f'phonemes_{ts}.txt'
204
+ out_path = out_dir / filename
205
+ out_path.write_text(text, encoding='utf-8')
206
+ # Return the path string so Gradio's File component can serve it
207
+ return str(out_path)
208
+
209
+
210
+ def build_interface():
211
+ with gr.Blocks(title="Eu/Es Phonemizer") as demo:
212
+ # Simple header (image removed per user preference)
213
+ header = gr.Markdown("# Fonemizadorea — Euskara (eu) eta Gaztelania (es)")
214
+ # Style the Submit button to be orange for better visibility (higher specificity)
215
+ gr.HTML("""
216
+ <style>
217
+ /* Stronger selectors to override theme/defaults */
218
+ #submit_btn, #submit_btn button, button#submit_btn, .gradio-container #submit_btn button {
219
+ background-color: #ff8c00 !important;
220
+ color: white !important;
221
+ border-radius: 6px !important;
222
+ padding: 6px 12px !important;
223
+ border: none !important;
224
+ }
225
+ #submit_btn:hover, #submit_btn button:hover, button#submit_btn:hover {
226
+ background-color: #ff7a00 !important;
227
+ }
228
+ /* Don't force download buttons to orange */
229
+ #download_ph_btn button, #download_norm_btn button { background-color: transparent !important; }
230
+
231
+ /* Compact upload file box */
232
+ #upload_file { max-width: 160px !important; }
233
+ #upload_file .gr-file {
234
+ height: 32px !important;
235
+ padding: 2px 6px !important;
236
+ font-size: 0.9rem !important;
237
+ line-height: 1 !important;
238
+ }
239
+ #upload_file .gr-file input[type=file] { height: 32px !important; }
240
+
241
+ /* Make textareas vertically resizable and more roomy */
242
+ #input_text textarea, #normalized_box textarea, #result_box textarea {
243
+ resize: vertical !important;
244
+ min-height: 120px !important;
245
+ max-height: 800px !important;
246
+ width: 100% !important;
247
+ box-sizing: border-box !important;
248
+ }
249
+
250
+ /* Center container and add padding for a cleaner look */
251
+ .gradio-container { max-width: 1100px; margin: 12px auto !important; padding: 8px !important; }
252
+ /* Fix the controls column width so changing labels doesn't reflow the layout.
+    Use a slightly smaller fixed width so the upload column sits closer. */
+ /* Make the controls column appear taller by increasing the internal spacing
+    between control rows rather than forcing the whole column height.
+    This avoids adding extra vertical gap between adjacent columns
+    (upload box / buttons). */
+ #controls_col { min-width: 220px; max-width: 260px; flex: 0 0 240px; align-self: flex-start; padding-top: 6px; padding-bottom: 6px; box-sizing: border-box; }
+ /* Increase the gap between controls so the column looks taller without
+    enlarging its outer box or shifting neighboring columns. */
+ #controls_col .gr-row { gap: 12px; row-gap: 12px; }
+ #controls_col .gr-label, #controls_col label { line-height: 1.4; }
+
+ /* Ensure the upload column aligns to the top of the row so it doesn't
+    get vertically centered when other columns grow; keep the upload box
+    compact but aligned with the controls stack. */
+ #upload_col { min-height: 110px; display: flex !important; align-items: flex-start !important; justify-content: center !important; align-self: flex-start; padding-top: 6px; }
+ /* Ensure labels wrap instead of expanding the layout */
+ #controls_col .gr-label, #controls_col label { white-space: normal !important; word-break: break-word !important; }
+ /* Enforce a pixel-perfect identical size and box model for both action
+    buttons so they don't push the layout when the language changes */
+ #submit_btn button, #clear_btn button {
+     width: 120px !important;
+     height: 40px !important;
+     min-height: 40px !important;
+     box-sizing: border-box !important;
+     padding: 6px 12px !important;
+     display: inline-flex !important;
+     align-items: center !important;
+     justify-content: center !important;
+     font-size: 14px !important;
+     line-height: 1 !important;
+     border-radius: 6px !important;
+     border: none !important;
+     margin: 0 !important;
+     vertical-align: middle !important;
+     font-family: inherit !important;
+     background-clip: padding-box !important;
+ }
+ /* Make the main column flexible and allow it to shrink without pushing the controls */
+ #main_col { flex: 1 1 auto; min-width: 0; }
+ /* Pull the upload box a bit left to close the gap if needed */
+ #upload_file { margin-left: -6px !important; }
+ /* Keep the file control compact so it doesn't become taller than the
+    nearby control stack */
+ #upload_file .gr-file { max-height: 44px !important; height: 36px !important; box-sizing: border-box !important; }
+ /* Position the decorative image absolutely so it doesn't force wrapping;
+    reserve space on the right of #top_row to avoid overlap. */
+ #top_row { position: relative !important; padding-right: 520px !important; }
+ #img_col { position: absolute !important; right: 8px !important; top: 6px !important; width: 480px !important; max-width: 100% !important; box-sizing: border-box !important; }
+ #download_img img { width: 480px !important; max-width: 100% !important; height: auto !important; display: block !important; pointer-events: none !important; user-select: none !important; }
+ /* Button height and vertical alignment are consolidated into the single
+    authoritative sizing block above to avoid conflicting rules. */
+ </style>
+ """)
+
+         with gr.Row():
+             # Left controls column
+             with gr.Column(scale=1, elem_id='controls_col'):
+                 language = gr.Radio(choices=['eu', 'es'], value='eu', label='Hizkuntza / Idioma')
+                 symbol = gr.Radio(choices=['sampa', 'ipa'], value='sampa', label='Sinboloak / Símbolos (Irteera)')
+                 # Checked by default, Basque-only label; switches to Spanish when the language changes
+                 separate_phonemes = gr.Checkbox(label='Banatu fonemak espazioz', value=True)
+
+             # Small column to the right of the controls that holds the upload box
+             with gr.Column(scale=1, elem_id='upload_col'):
+                 upload = gr.File(file_types=['.txt'], label='Igo .txt fitxategia / Subir archivo .txt', elem_id='upload_file')
+
+             # Decorative/download image column to the right of the upload box.
+             # Embed the local `img/download.png` as a base64 <img> inside gr.HTML
+             # so Gradio doesn't add overlay controls (download/enlarge).
+             # An integer `scale` avoids Gradio's float-scale warning; the column
+             # stays compact because its width is reserved via CSS (#img_col).
+             with gr.Column(scale=1, elem_id='img_col'):
+                 img_path = Path(__file__).parent / 'img' / 'download.png'
+                 _img_data_uri = ''
+                 try:
+                     with open(img_path, 'rb') as _img_f:
+                         _img_b64 = base64.b64encode(_img_f.read()).decode('ascii')
+                         _img_data_uri = f"data:image/png;base64,{_img_b64}"
+                 except Exception:
+                     _img_data_uri = ''
+
+                 # Render HTML with a non-interactive <img>; let CSS control the width
+                 download_img = gr.HTML(f'<img src="{_img_data_uri}" alt="download" style="height:auto;pointer-events:none;user-select:none;">', elem_id='download_img')
+
+             # Main column on the right: buttons above the wide input textbox
+             with gr.Column(scale=3, elem_id='main_col'):
+                 with gr.Row():
+                     submit_btn = gr.Button('Submit', elem_id='submit_btn')
+                     clear_btn = gr.Button('Clear', elem_id='clear_btn')
+                 with gr.Row():
+                     with gr.Column(scale=5):
+                         input_text = gr.Textbox(lines=12, elem_id='input_text', label="Sarrera testua (utzi hutsik .txt fitxategia igotzen baduzu) / Texto de entrada (dejar vacío si subes un .txt)")
+
+         # Outputs area: normalized text and phoneme output side by side
+         with gr.Row():
+             with gr.Column(scale=1):
+                 normalized_box = gr.Textbox(lines=12, elem_id='normalized_box', label='Normalizatua', interactive=False)
+                 download_norm_btn = gr.DownloadButton('Deskargatu normalizatua', elem_id='download_norm_btn')
+
+             with gr.Column(scale=1):
+                 result_box = gr.Textbox(lines=12, elem_id='result_box', label='Fonemak', interactive=False)
+                 download_ph_btn = gr.DownloadButton('Deskargatu fonemak', elem_id='download_ph_btn')
+
+         # Hidden boxes that hold the latest generated file paths so the download buttons can trigger
+         ph_path_box = gr.Textbox(visible=False, elem_id='ph_path_box')
+         norm_path_box = gr.Textbox(visible=False, elem_id='norm_path_box')
+
+         def _on_click(input_text, upload, language, symbol, separate_phonemes):
+             return process(input_text, upload, language, symbol, separate_phonemes)
+
+         # When a user uploads a .txt file, read its contents and populate the
+         # `input_text` box so they can review or edit it before sending.
+         def _on_upload(uploaded_file):
+             if not uploaded_file:
+                 return gr.update(value="")
+             try:
+                 content = _read_uploaded_file(uploaded_file)
+             except Exception:
+                 content = ''
+             return gr.update(value=content)
+
+         def _clear_all():
+             # Clear the input, outputs and hidden path boxes so the UI resets
+             return (
+                 gr.update(value=""),    # input_text
+                 gr.update(value=None),  # upload (clear any uploaded file)
+                 gr.update(value=""),    # normalized_box
+                 gr.update(value=""),    # result_box
+                 gr.update(value=None),  # download_ph_btn
+                 gr.update(value=None),  # download_norm_btn
+                 gr.update(value=""),    # ph_path_box
+                 gr.update(value="")     # norm_path_box
+             )
+
+         # Re-run processing automatically when the symbol or separation options
+         # change so users don't have to press the Submit button again.
+         symbol.change(fn=_on_click, inputs=[input_text, upload, language, symbol, separate_phonemes], outputs=[result_box, download_ph_btn, normalized_box, download_norm_btn, ph_path_box, norm_path_box])
+         separate_phonemes.change(fn=_on_click, inputs=[input_text, upload, language, symbol, separate_phonemes], outputs=[result_box, download_ph_btn, normalized_box, download_norm_btn, ph_path_box, norm_path_box])
+
+         # Populate the input textbox when a file is uploaded so users can see and
+         # edit it before sending. Does not auto-run processing.
+         upload.change(fn=_on_upload, inputs=[upload], outputs=[input_text])
+
+         # Update UI texts when the language selection changes
+         def _update_language_ui(lang):
+             # Note: the header is intentionally NOT updated here, to avoid large
+             # DOM changes that reflow the layout when switching languages.
+             if lang == 'eu':
+                 return (
+                     gr.update(label='Sinboloak (Irteera)'),      # symbol
+                     gr.update(label='Banatu fonemak espazioz'),  # separate_phonemes
+                     # keep input/upload labels stable (not updated, to avoid reflow)
+                     gr.update(label='Fonemak'),
+                     gr.update(label='Deskargatu irteera (.txt)'),
+                     gr.update(label='Normalizatua'),
+                     gr.update(label='Deskargatu normalizatua (.txt)'),
+                     gr.update(value=''),
+                     gr.update(value='')
+                 )
+             else:
+                 return (
+                     gr.update(label='Símbolos (Salida)'),
+                     gr.update(label='Separar fonemas con espacios'),
+                     # keep input/upload labels stable (not updated, to avoid reflow)
+                     gr.update(label='Fonemas'),
+                     gr.update(label='Descargar salida (.txt)'),
+                     gr.update(label='Normalizado'),
+                     gr.update(label='Descargar normalizado (.txt)'),
+                     gr.update(value=''),
+                     gr.update(value='')
+                 )
+
+         # Note: `header`, `input_text`, the upload control and the action buttons
+         # are excluded from the outputs to avoid reflow when changing language.
+         # Only the smaller output labels and the hidden path boxes are updated,
+         # matching the 8 values the function actually returns.
+         language.change(fn=_update_language_ui, inputs=[language], outputs=[symbol, separate_phonemes, result_box, download_ph_btn, normalized_box, download_norm_btn, ph_path_box, norm_path_box])
+
+         submit_btn.click(fn=_on_click, inputs=[input_text, upload, language, symbol, separate_phonemes], outputs=[result_box, download_ph_btn, normalized_box, download_norm_btn, ph_path_box, norm_path_box])
+         clear_btn.click(fn=_clear_all, inputs=[], outputs=[input_text, upload, normalized_box, result_box, download_ph_btn, download_norm_btn, ph_path_box, norm_path_box])
+
+         # Note: the download buttons themselves are created in the outputs area above.
+
+         def _download_file(path: str):
+             # Simple path-return helper, kept for backwards compatibility
+             if not path:
+                 return None
+             p = Path(path)
+             if not p.exists():
+                 return None
+             return str(p)
+
+         # Download callbacks generate the outputs on demand, so a single click
+         # both creates the file and returns its path to the browser.
+         def _download_ph_from_inputs(input_text, upload, language, symbol, separate_phonemes):
+             # Call the same process() function to ensure the files are generated.
+             # process() returns (result_text, ph_path, normalized_text, norm_path, ph_path, norm_path).
+             res = process(input_text, upload, language, symbol, separate_phonemes)
+             if isinstance(res, tuple) and len(res) >= 2:
+                 return _download_file(res[1])
+             return None
+
+         def _download_norm_from_inputs(input_text, upload, language, symbol, separate_phonemes):
+             res = process(input_text, upload, language, symbol, separate_phonemes)
+             if isinstance(res, tuple) and len(res) >= 4:
+                 return _download_file(res[3])
+             return None
+
+         # Wire the DownloadButtons to generate-and-return callbacks so a single
+         # click performs generation and triggers an immediate download.
+         download_ph_btn.click(fn=_download_ph_from_inputs, inputs=[input_text, upload, language, symbol, separate_phonemes], outputs=[download_ph_btn])
+         download_norm_btn.click(fn=_download_norm_from_inputs, inputs=[input_text, upload, language, symbol, separate_phonemes], outputs=[download_norm_btn])
+
+     return demo
+
+
+ def _find_free_port(start: int = 7860, end: int = 7870) -> Optional[int]:
+     """Find a free TCP port in the given inclusive range."""
+     for port in range(start, end + 1):
+         with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+             try:
+                 s.bind(('0.0.0.0', port))
+                 return port
+             except OSError:
+                 continue
+     return None
+
+
+ if __name__ == '__main__':
+     app = build_interface()
+
+     # Allow explicit override via environment variable
+     env_port = os.environ.get('GRADIO_SERVER_PORT')
+     if env_port:
+         try:
+             port = int(env_port)
+         except ValueError:
+             print(f"Invalid GRADIO_SERVER_PORT='{env_port}', falling back to automatic selection.")
+             port = None
+     else:
+         port = None
+
+     if port is None:
+         port = _find_free_port(7860, 7880)
+
+     if port is None:
+         raise OSError("No free port found in range 7860-7880. Set GRADIO_SERVER_PORT to a free port.")
+
+     print(f"Launching Gradio on port {port} (server_name=0.0.0.0)")
+     app.launch(server_name='0.0.0.0', server_port=port)
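The port-scanning helper above is easy to exercise in isolation. A minimal standalone sketch (the function mirrors `_find_free_port`; the surrounding demo code and names are illustrative, not part of app.py):

```python
import socket
from typing import Optional


def find_free_port(start: int = 7860, end: int = 7870) -> Optional[int]:
    """Return the first TCP port in [start, end] that binds cleanly, else None."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("0.0.0.0", port))
                return port  # bind succeeded, so the port was free at probe time
            except OSError:
                continue  # port in use (or not permitted); try the next one
    return None


# Occupy one port, then confirm the scan does not pick it.
blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
blocker.bind(("0.0.0.0", 0))  # let the OS choose any free port
busy = blocker.getsockname()[1]
chosen = find_free_port(busy, busy + 10)
print(chosen is None or chosen != busy)
blocker.close()
```

Note the inherent race: a port reported free here can still be taken by another process before `app.launch()` binds it, which is why the `GRADIO_SERVER_PORT` override remains useful.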
img/download.png ADDED
modulo1y2/modulo1y2 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c122bd6197e5e360d534957322f8d98a06cb3bcb4d412ee9978e891ae1b43e8a
+ size 2245952
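The three lines above are a Git LFS pointer: the file in the repository stores only the spec version, the SHA-256 object ID and the byte size of the real binary. A small illustrative sketch of reading such a pointer (the helper name is hypothetical, not part of this repo):

```python
# Parse a Git LFS pointer file: one "key value" pair per line.
def parse_lfs_pointer(text: str) -> dict:
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:c122bd6197e5e360d534957322f8d98a06cb3bcb4d412ee9978e891ae1b43e8a
size 2245952
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # → 2245952
```

This is why `.gitattributes` must mark `modulo1y2/modulo1y2` with `filter=lfs`: without the filter, the 2.2 MB binary itself would be committed instead of this small pointer.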
prepare.sh ADDED
@@ -0,0 +1,19 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ echo "Preparing phonemizer workspace..."
+
+ # Make sure the executable bit is set if the binary is present
+ if [ -f "modulo1y2/modulo1y2" ]; then
+     chmod +x modulo1y2/modulo1y2 || true
+     echo "Ensured modulo1y2/modulo1y2 is executable."
+ else
+     echo "Warning: modulo1y2/modulo1y2 not found. If you plan to ship the binary, add it to the repo."
+ fi
+
+ echo "Preparation complete. To run locally:
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install -r requirements.txt
+ python app.py
+ "
push_to_hf.sh ADDED
@@ -0,0 +1,35 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ # Safe push script for Hugging Face Spaces using the HF_TOKEN env var.
+ # Usage:
+ #   export HF_TOKEN="<your_token>"
+ #   cd /path/to/tmp_space
+ #   chmod +x push_to_hf.sh
+ #   ./push_to_hf.sh
+
+ REPO_DIR="$(cd "$(dirname "$0")" && pwd)"
+ cd "$REPO_DIR"
+
+ if [ -z "${HF_TOKEN:-}" ]; then
+     echo "ERROR: HF_TOKEN is not set. Run: export HF_TOKEN=\"<your_token>\""
+     exit 1
+ fi
+
+ # Show the current branch and changes
+ git --no-pager status --porcelain --branch
+
+ # Pass the token via http.extraHeader with `git -c` so it applies to this
+ # single command only and is never stored in git config or logs.
+ echo "Pushing to origin (authenticated via HF_TOKEN) ..."
+ if git -c http.extraHeader="Authorization: Bearer $HF_TOKEN" push origin HEAD:main; then
+     echo "Push succeeded. Space should start building shortly on Hugging Face."
+ else
+     RET=$?
+     echo "Push failed with exit code $RET"
+     exit $RET
+ fi
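The reason the script passes the token through `git -c` is that such a value applies to a single git invocation and is never written to any config file. A small sketch demonstrating that behavior (requires `git` on PATH; the `dummy` token and temp repo are illustrative only):

```python
import subprocess
import tempfile

# Create a throwaway repo to run git commands in.
tmp = tempfile.mkdtemp()
subprocess.run(["git", "init", "-q", tmp], check=True)

# The -c value is part of this one invocation's configuration...
out = subprocess.run(
    ["git", "-c", "http.extraHeader=Authorization: Bearer dummy",
     "config", "--get", "http.extraHeader"],
    cwd=tmp, capture_output=True, text=True)
print(out.stdout.strip())

# ...but it is not persisted: a plain lookup finds nothing and exits non-zero.
plain = subprocess.run(["git", "config", "--get", "http.extraHeader"],
                       cwd=tmp, capture_output=True, text=True)
print(plain.returncode != 0)
```

By contrast, embedding the token in the remote URL or running `git config http.extraHeader ...` would leave the credential on disk in `.git/config`.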
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ gradio>=3.0
+ nltk
+ huggingface-hub