Baoore Dataset 1 (French-Mooré)
Dataset Description
- Repository: mooredataset/baoore-dataset-1
- Languages: French (fra) - Mooré (mos)
- Total sentences: 233,945
This dataset is a comprehensive parallel corpus for Machine Translation between French and Mooré. It is aggregated from multiple sources including dictionary entries, specific curated translations, generic translated data, and synthetic/filtered datasets. The dataset has been systematically curated and deduplicated for fine-tuning modern machine translation models such as NLLB.
Dataset Structure
The data is provided in JSONL format. Each line contains a JSON object representing a translation pair and its metadata.
Data Instances
A typical example from the dataset looks like this:
    {
      "translation": {
        "fra": "Il mange une pomme.",
        "mos": "A rita pomme.",
        "origin": "traduction"
      }
    }
Data Fields
translation: A dictionary containing:
- fra: The French sentence (string).
- mos: The Mooré translation (string).
- origin: The source of the pair (string). Possible values:
  - laion-coco-nllb: Generic data from LAION-COCO-NLLB.
  - jw: Biblical data from JW (Jehovah's Witnesses).
  - french_moore: Human translation (D1).
  - traduction: Human translation (D2).
  - mafand: Data from the MAFAND dataset.
  - wibonary: Data from the Webonary dictionary.
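As a minimal sketch of how these fields can be read, the snippet below parses JSONL lines with Python's standard library and groups pairs by origin. The helper name `read_pairs` and the in-memory sample are illustrative assumptions; the second sample record uses placeholder text, not real corpus content.

```python
import io
import json
from collections import Counter

# In-memory stand-in for a JSONL file. The first line is the example record
# from this card; the second is a placeholder invented for illustration.
SAMPLE_JSONL = (
    '{"translation": {"fra": "Il mange une pomme.", '
    '"mos": "A rita pomme.", "origin": "traduction"}}\n'
    '{"translation": {"fra": "<French placeholder>", '
    '"mos": "<Moore placeholder>", "origin": "jw"}}\n'
)

def read_pairs(handle):
    """Yield (fra, mos, origin) tuples from an open JSONL handle."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)["translation"]
        yield record["fra"], record["mos"], record["origin"]

pairs = list(read_pairs(io.StringIO(SAMPLE_JSONL)))

# Count pairs per origin and keep only the two human-translation origins.
origin_counts = Counter(origin for _, _, origin in pairs)
human_pairs = [p for p in pairs if p[2] in {"french_moore", "traduction"}]
```

With a real file, `io.StringIO(SAMPLE_JSONL)` would be replaced by `open(path, encoding="utf-8")`.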
Corpus Statistics
The total corpus comprises 230,343 aligned sentence pairs (excluding SMOL data). The composition of the dataset by configuration (config_name) is detailed below. The statistics for the combined subsets are aggregated from their respective base subsets.
| Configuration | Sentences | Share (%) | Words (FR) | Words (MOS) | Ratio (MOS/FR) |
|---|---|---|---|---|---|
| all-data | 230,343 | 100.0% | 2,379,468 | 2,548,167 | 1.07 |
| generique (gen) | 116,431 | 50.5% | 1,116,198 | 1,137,263 | 1.02 |
| traduction (trad) | 78,025 | 32.3% | 351,579 | 425,030 | 1.21 |
| existing (old) | 39,489 | 17.1% | 911,691 | 985,874 | 1.08 |
| trad-generique | 190,854 | 82.9% | 1,467,777 | 1,562,293 | 1.06 |
| trad-existing | 113,912 | 49.5% | 1,263,270 | 1,410,904 | 1.12 |
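Note: 'existing' data groups the previously available datasets (MAFAND, Webonary, and Jehovah's Witnesses). 'trad-generique' is the combination of 'traduction' and 'generique'. 'trad-existing' is the combination of 'traduction' and 'existing'. Percentages are relative to 'all-data'. The sentence count listed for 'traduction' (78,025) is 3,602 higher than the 74,423 implied by the combined rows and by its 32.3% share; this equals the gap between 233,945 and 230,343, which suggests that the 'traduction' count includes the SMOL data while the other figures exclude it.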
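The aggregation described in the note can be checked with a few lines of arithmetic. This is only a sketch: the word counts are copied from the table, while the names `BASE`, `combine`, and `share` are illustrative.

```python
# (French words, Mooré words) for the three base configurations, from the table.
BASE = {
    "generique": (1_116_198, 1_137_263),
    "traduction": (351_579, 425_030),
    "existing": (911_691, 985_874),
}

def combine(*names):
    """Sum word counts over base configs and return (fr, mos, mos/fr ratio)."""
    fr = sum(BASE[n][0] for n in names)
    mos = sum(BASE[n][1] for n in names)
    return fr, mos, round(mos / fr, 2)

def share(sentences, total=230_343):
    """Percentage of the all-data sentence count, rounded to one decimal."""
    return round(100 * sentences / total, 1)

trad_generique = combine("traduction", "generique")
trad_existing = combine("traduction", "existing")
all_data = combine("generique", "traduction", "existing")
```

Each result matches the word counts and ratios in the corresponding table row, and `share(116_431)` gives the 50.5 listed for 'generique'.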
Subsets
The repository is structured into different subdirectories containing various subsets of the data, each corresponding to a specific configuration:
- all_data: Combination of all the data (config: all-data).
- gen: Generic data from laion-coco-nllb (config: generique).
- trad: Human translation data (config: traduction).
- old: Previously available data (MAFAND, Webonary, JW) (config: existing).
- trad_gen: Combination of traduction and generique (config: trad-generique).
- trad_old: Combination of traduction and existing (config: trad-existing).
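The subdirectory-to-configuration mapping can be written down as a small lookup table. The sketch below assumes that the configuration names are registered on the Hub under the repository ID given in this card and that the Hugging Face `datasets` library is installed; neither is confirmed here, and `load_subset` is a hypothetical helper.

```python
# Subdirectory name -> configuration name, as listed in this card.
SUBDIR_TO_CONFIG = {
    "all_data": "all-data",
    "gen": "generique",
    "trad": "traduction",
    "old": "existing",
    "trad_gen": "trad-generique",
    "trad_old": "trad-existing",
}

REPO_ID = "mooredataset/baoore-dataset-1"

def load_subset(subdir):
    """Load one subset by subdirectory name (hypothetical helper).

    Assumes the config names above are exposed by the Hub repository.
    """
    # Imported lazily so the mapping can be used without the library installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, SUBDIR_TO_CONFIG[subdir])
```

For example, `load_subset("trad")` would request the 'traduction' configuration, provided the assumptions above hold.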
Preprocessing & Filtering
- Deduplication: The corpus has undergone exact deduplication across French and Mooré sentence pairs.
- Filtering: Some subsets (such as the LAION subset) have been filtered by alignment confidence, computed with LaBSE through Sentence-Transformers, keeping pairs above a threshold (e.g., > 0.80).
- Normalization: Basic character normalization is applied for the Mooré language to standardise orthography and maintain data quality.
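The three steps above can be sketched in plain Python. This is not the pipeline used to build the corpus: NFC normalization with whitespace collapsing stands in for the unspecified Mooré normalization, cosine similarity is an assumed form of the alignment score, and `embed` is a hypothetical callable that would wrap a LaBSE encoder in practice.

```python
import math
import unicodedata

def normalize(text):
    """NFC-normalize and collapse whitespace (assumed stand-in normalization)."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def dedup_exact(pairs):
    """Drop exact duplicate (fra, mos) pairs, keeping first occurrences in order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_by_alignment(pairs, embed, threshold=0.80):
    """Keep pairs whose embedding similarity is strictly above the threshold."""
    return [(f, m) for f, m in pairs if cosine(embed(f), embed(m)) > threshold]

# Toy two-dimensional embeddings, invented purely to exercise the functions.
TOY_VECTORS = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
kept = filter_by_alignment([("a", "b"), ("a", "c")], TOY_VECTORS.__getitem__)
```

Normalizing before deduplicating lets pairs that differ only in Unicode composition or spacing collapse into one entry; whether the original pipeline ran the steps in that order is not stated.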
Intended Use
This dataset is intended for training and fine-tuning neural machine translation systems for the French-Mooré language pair, supporting the preservation and digitization of the Mooré language.