---
language:
- ht
thumbnail: null
tags:
- data2vec
license: cc-by-nc-sa-4.0
extra_gated_prompt: >-
  To help us better understand how the model is being used and by whom,
  we ask you to provide some basic information.
  This will support future improvements and help ensure the model continues to meet the needs of its user community.
  Please note: this model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
extra_gated_fields:
  University/Company: text
  Website: text
---

# data2vec-HAT-1.4K-base

This repository provides access to a data2vec1-Base model for Haitian Creole (hat).

## Model

### Model and data description

The model was pretrained on the following datasets:

* [Atlas Linguistique d'Haïti](https://cocoon.huma-num.fr/exist/crdo/meta/cocoon-8ea988d2-bf16-303d-81a0-0c55cc035240), consisting of fieldwork recordings (directed by Dominique Fattier) collected between 1975 and 1985
* [Corpus of Northern Haitian Creole](https://archive.org/details/interview-8-ujf-107-a-ujm-107-a), consisting of fieldwork recordings (by Albert Valdman) collected in Cap-Haïtien
* [Haiti-CMU](http://www.speech.cs.cmu.edu/haitian/), consisting of read speech
* [IARPA Babel Haitian Creole Language Pack](https://catalog.ldc.upenn.edu/LDC2017S03), consisting of telephone-based conversational speech and read speech
* [VoxLingua107](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/), consisting of 90h of recordings in Haitian Creole scraped from YouTube
* [Radio Haiti](https://repository.duke.edu/dc/radiohaiti), consisting of radio broadcast recordings (1950 to early 2000s) in Haitian Creole

The pre-processing scripts are available at https://gin.g-node.org/CREAM/SSL-Haitian/

The original `fairseq` models were converted to HuggingFace format using [convert_data2vec_to_hf](https://github.com/LLL-Orleans/convert_data2vec_to_hf). The original `fairseq` model is also available, enabling continued pre-training or fine-tuning with that framework.

For more details, see the paper cited under "Referencing this model".
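
Because the checkpoint is distributed in HuggingFace format, it can presumably be loaded with the `transformers` library. The following is a minimal sketch under stated assumptions, not an official recipe: `REPO_ID` is a hypothetical placeholder for this repository's actual id, the 16 kHz mono input rate is assumed from the original data2vec audio models, and the repository is assumed to ship a feature-extractor configuration. Since the model is gated, an access token for an approved account is needed; here it is read from the `HF_TOKEN` environment variable.

```python
import os

# Hypothetical repository id: replace it with the actual id of this model page.
REPO_ID = "your-namespace/data2vec-HAT-1.4K-base"

# Assumption: as with the original data2vec audio models, input is 16 kHz mono.
EXPECTED_SAMPLE_RATE = 16_000


def check_waveform(waveform, sample_rate):
    """Validate a mono waveform (a sequence of floats) and return it as a list."""
    if sample_rate != EXPECTED_SAMPLE_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLE_RATE} Hz audio, got {sample_rate} Hz; "
            "resample before calling the model"
        )
    samples = list(waveform)
    if not samples:
        raise ValueError("waveform is empty")
    return samples


def extract_features(waveform, sample_rate, repo_id=REPO_ID, token=None):
    """Return frame-level hidden states for one utterance."""
    # Imported lazily so that check_waveform works without torch/transformers.
    import torch
    from transformers import AutoFeatureExtractor, Data2VecAudioModel

    samples = check_waveform(waveform, sample_rate)
    # The model is gated: a token from an approved account is required.
    token = token or os.environ.get("HF_TOKEN")

    # Assumption: the repository includes a feature-extractor configuration.
    extractor = AutoFeatureExtractor.from_pretrained(repo_id, token=token)
    model = Data2VecAudioModel.from_pretrained(repo_id, token=token)
    model.eval()

    inputs = extractor(samples, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (batch=1, number_of_frames, hidden_size)
    return outputs.last_hidden_state
```

Calling `extract_features(samples, 16_000)` on a decoded utterance would then return one vector per frame, which can feed a downstream classifier or a fine-tuning head.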

### Intended uses & limitations

This model is distributed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

This is a **gated model**. Access is granted on a per-user basis, pending formal approval by the CREAM project PI, Prof. Emmanuel Schang.

## Acknowledgments

The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grant ANR-20-CE38-0006 (project CREAM). Experiments were conducted using Grid'5000, developed under INRIA ALADDIN with support from CNRS, RENATER, and various universities (see https://www.grid5000.fr). Additional resources include the CaSciModOT cluster (https://cascimodot.fr/) at Centre de Calcul Scientifique en région Centre-Val de Loire and HPC resources from IDRIS provided by GENCI (allocation 2024-AD011014940).

## Referencing this model

```bibtex
@inproceedings{havard-et-al-taln25,
    author = "Havard, William N. and Govain, Renauld and Lecouteux, Benjamin and Schang, Emmanuel",
    title = "Mod\`eles auto-supervis\'es de traitement de la parole pour le Cr\'eole Haitien",
    booktitle = "Actes de CORIA-TALN-RJCRI-RECITAL 2025. Actes des 32\`eme Conf\'erence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux",
    month = "6",
    year = "2025",
    address = "Marseille, France",
    publisher = "Association pour le Traitement Automatique des Langues",
    pages = "543--555",
    note = "",
    url = "https://talnarchives.atala.org/TALN/TALN-2025/98.pdf"
}
```