---
license: apache-2.0
language:
- es
pipeline_tag: translation
tags:
- Translation
- Capitalization-and-punctuation
- Transformer
---

# HiTZ's Capitalization & Punctuation model for Spanish

## Model description

This model was trained from scratch using Marian NMT. The training dataset contains 9,784,905 sentences. The model was evaluated on the Flores-101 dev and devtest datasets and on randomly selected sentences from the CommonVoice dataset.

* **Developed by**: HiTZ Research Center (University of the Basque Country EHU)
* **Model type**: Capitalization and Punctuation
* **Language**: Spanish

## Intended uses and limitations

You can use this model to restore punctuation and capitalization in Spanish text.

## Usage

Required packages:

- torch
- transformers
- sentencepiece

### Capitalizing using python:

Clone the repository to download the model:

```bash
git clone https://huggingface.co/HiTZ/cap-punct-es
```

In the code below, `MODEL_PATH` is the path to the downloaded `marianmt-cap-punct-es` folder.

```python
from transformers import pipeline

# Path to the downloaded model folder; adjust to your local setup.
MODEL_PATH = "marianmt-cap-punct-es"

device = 0  # 0 --> GPU, -1 --> CPU

# Normalized input: lowercase, no punctuation, numbers and acronyms spelled out.
segment_list = [
    "hola buenos días a todos",
    "faktoria se escucha en la radio de e i te be",
    "más o menos el cuarenta y dos por ciento",
    "cuatro ocho quince dieciséis veintitrés cuarenta y dos",
    "mi año de nacimiento es mil novecientos noventa y seis",
    "más información en uve doble uve doble uve doble punto e hache u punto eus",
]

# The model is exposed through the translation pipeline.
translator = pipeline(task="translation", model=MODEL_PATH, tokenizer=MODEL_PATH, device=device)
result_list = translator(segment_list)

cp_segment_list = [result["translation_text"] for result in result_list]

for text, cp_text in zip(segment_list, cp_segment_list):
    print(f"Normalized: {text}\n With C&P: {cp_text}\n")
```

### Expected output:

```bash
Normalized: hola buenos días a todos
 With C&P: Hola, buenos días a todos.

Normalized: faktoria se escucha en la radio de e i te be
 With C&P: Faktoria se escucha en la radio de EiTB

Normalized: más o menos el cuarenta y dos por ciento
 With C&P: Más o menos el 42 %.

Normalized: cuatro ocho quince dieciséis veintitrés cuarenta y dos
 With C&P: Cuatro, ocho, quince, dieciséis, veintitrés, cuarenta y dos.

Normalized: mi año de nacimiento es mil novecientos noventa y seis
 With C&P: Mi año de nacimiento es 1996.

Normalized: más información en uve doble uve doble uve doble punto e hache u punto eus
 With C&P: Más información en www.ehu.eus
```

## Training

### Data preparation

The training data was compiled by our research group from multiple heterogeneous sources and consists of 9,784,905 sentences. This dataset is a subset of the data used to train the machine translation model [mt-hitz-eu-es](https://huggingface.co/HiTZ/mt-hitz-eu-es).

Prior to training, the data underwent preprocessing steps including cleaning, punctuation standardization, filtering, and the creation of aligned input–output sentence pairs for the capitalization and punctuation restoration task.

To generate the input–output pairs, the target sentences were lowercased, punctuation was removed, and text normalization was applied using an in-house normalization tool.

Example:

```bash
Output (Cleaned, filtered and standardized): Esto supone pasar de los 0,22 euros por elector de las Elecciones Generales de 2011 a 0,18 euros en las de 2015.
Input (Lowercased, without punctuation and normalized): esto supone pasar de los cero coma veintidos euros por elector de las elecciones generales de dos mil once a cero coma dieciocho euros en las de dos mil quince
```

### Training procedure

The model was trained using the official [MarianNMT](https://marian-nmt.github.io/quickstart/) implementation. Training was performed on a single NVIDIA TITAN RTX GPU.

## Performance

The following table shows the model performance, measured with the Word Error Rate (WER) metric.
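As a rough illustration of the Word Error Rate metric, the sketch below computes WER as the word-level edit distance divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the case-sensitive comparison with punctuation attached to words are assumptions made for this example; the card does not state which tool or tokenization was used for the reported scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared case-sensitively,
    so "Hola," and "hola" count as different words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "Hola, buenos días a todos."

# Normalized text scored against the reference: 2 of 5 words differ.
print(word_error_rate(reference, "hola buenos días a todos"))  # 0.4

# A perfectly restored sentence scores zero.
print(word_error_rate(reference, "Hola, buenos días a todos."))  # 0.0
```

Under these assumptions, every word whose casing or trailing punctuation differs from the reference counts as one error, which is why unpunctuated lowercase text already has a non-zero error rate.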
The WER-WITHOUT metric corresponds to the Word Error Rate computed on the evaluation dataset before applying capitalization and punctuation restoration. The WER metric corresponds to the output after processing with the model. The evaluation dataset underwent the same processing as the training dataset.

| Metric      | FLORES-101 | COMMON-VOICE |
|-------------|------------|--------------|
| WER-WITHOUT | 19.55%     | 22.42%       |
| WER         | 5.99%      | 5.75%        |

# Additional Information

## Author

HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.

## Copyright

Copyright (c) 2025 HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.

## Licensing Information

This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

## Funding

The development of these models has been partially funded by the Ministerio de Transformación Digital and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU ILENIA (with reference 2022/TL22/00215335), by the project IkerGaitu funded by the Basque Government and by the project HiTZketan (COLAB22/13) funded by the University of the Basque Country EHU.

## Disclaimer
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the models (HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU) be liable for any results arising from the use made by third parties of these models.