	---
	language:
	- ht
	thumbnail: null
	tags:
	- data2vec
	license: cc-by-nc-sa-4.0
	extra_gated_prompt: >-
	  To help us better understand how the model is being used and by whom,
	  we ask you to provide some basic information.
	  This will support future improvements and help ensure the model continues to meet the needs of its user community.
	  Please note: this model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
	extra_gated_fields:
	  University/Company: text
	  Website: text
	---

	# data2vec-HAT-1.4K-base

	This repository provides access to a data2vec1-Base model for Haitian Creole (hat).

	## Model
	### Model and data description

	The model was pretrained on the following data sets:
	* [Atlas Linguistique d'Haïti](https://cocoon.huma-num.fr/exist/crdo/meta/cocoon-8ea988d2-bf16-303d-81a0-0c55cc035240) consisting of fieldwork recordings (directed by Dominique Fattier) collected between 1975 and 1985
	* [Corpus of Northern Haitian Creole](https://archive.org/details/interview-8-ujf-107-a-ujm-107-a) consisting of fieldwork recordings (by Albert Valdman) collected in Cap-Haïtien
	* [Haiti-CMU](http://www.speech.cs.cmu.edu/haitian/) consisting of read speech
	* [IARPA Babel Haitian Creole Language Pack](https://catalog.ldc.upenn.edu/LDC2017S03) consisting of phone-based conversational speech and read speech
	* [VoxLingua107](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/) consisting of 90h of recordings in Haitian Creole scraped from YouTube
	* [Radio Haiti](https://repository.duke.edu/dc/radiohaiti) consisting of radio broadcast recordings (1950 to early 2000s) in Haitian Creole

	The pre-processing scripts are available at <https://gin.g-node.org/CREAM/SSL-Haitian/>.
	The original `fairseq` models were converted to the Hugging Face format using the code at <https://github.com/LLL-Orleans/convert_data2vec_to_hf>. The original `fairseq` model is also available, so pre-training can be continued, or the model fine-tuned, within that framework.

	For more details, see the paper cited under "Referencing this model".
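
Since the checkpoint is distributed in Hugging Face format, frame-level representations can presumably be extracted with the `transformers` library. The snippet below is a minimal sketch, not an official usage recipe. It assumes that the converted checkpoint loads with `Data2VecAudioModel`, that the input is 16 kHz mono audio normalized per utterance (as for the original data2vec audio models), and that access to the gated repository has already been granted. The helper names `load_model` and `frame_embeddings` are illustrative and do not come from this repository.

```python
import numpy as np
import torch
from transformers import Data2VecAudioModel, Wav2Vec2FeatureExtractor

# Assumption: 16 kHz mono input, as for the original data2vec audio models.
SAMPLING_RATE = 16_000


def load_model(repo_id, token=None):
    """Load the converted checkpoint from the Hub (illustrative helper).

    `repo_id` is the full repository name on the Hub; `token` is a
    Hugging Face access token, needed because the model is gated.
    """
    model = Data2VecAudioModel.from_pretrained(repo_id, token=token)
    return model.eval()


def frame_embeddings(model, waveform):
    """Return frame-level hidden states of shape (frames, hidden_size).

    `waveform` is a 1-D float array sampled at SAMPLING_RATE.
    """
    # Assumption: zero-mean / unit-variance normalization per utterance.
    extractor = Wav2Vec2FeatureExtractor(
        feature_size=1,
        sampling_rate=SAMPLING_RATE,
        padding_value=0.0,
        do_normalize=True,
    )
    inputs = extractor(waveform, sampling_rate=SAMPLING_RATE, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs.input_values)
    # Drop the batch dimension: one utterance in, one sequence of frames out.
    return outputs.last_hidden_state[0]
```

A typical call would be `model = load_model("<org>/data2vec-audio-HAT-1.4K-base", token="hf_...")` followed by `frame_embeddings(model, waveform)`, where `<org>` stands for the organization hosting this repository and `waveform` is audio already resampled to 16 kHz. Older `transformers` releases may expect `use_auth_token` instead of `token`.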

	### Intended uses & limitations

	This model is distributed under the [Creative Commons Attribution Non Commercial Share Alike 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

	This is a gated model. Access is granted on a per-user basis, pending formal approval by the CREAM project PI, Prof. Emmanuel Schang.


	## Acknowledgments

	The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grant ANR-20-CE38-0006 (project CREAM). Experiments were conducted using Grid'5000, developed under INRIA ALADDIN with support from CNRS, RENATER, and various universities (see https://www.grid5000.fr). Additional resources include the CaSciModOT cluster (https://cascimodot.fr/) at Centre de Calcul Scientifique en région Centre-Val de Loire and HPC resources from IDRIS provided by GENCI (allocation 2024-AD011014940).

	## Referencing this model

	```bibtex
	@inproceedings{havard-et-al-taln25,
	author = "Havard, William N. and Govain, Renauld and Lecouteux, Benjamin and Schang, Emmanuel",
	title = "Mod\`eles auto-supervis\'es de traitement de la parole pour le Cr\'eole Haitien",
	booktitle = "Actes de CORIA-TALN-RJCRI-RECITAL 2025. Actes des 32\`eme Conf\'erence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux",
	month = "6",
	year = "2025",
	address = "Marseille, France",
	publisher = "Association pour le Traitement Automatique des Langues",
	pages = "543-555",
	note = "",
	url = "https://talnarchives.atala.org/TALN/TALN-2025/98.pdf"
	}
	```