Baoore Dataset 1 (French-Mooré)

Dataset Description

Repository: mooredataset/baoore-dataset-1
Languages: French (fra) - Mooré (mos)
Total sentences: 233,945

This dataset is a parallel corpus for machine translation between French and Mooré. It aggregates multiple sources, including dictionary entries, curated human translations, generic translated data, and synthetic or filtered datasets. The corpus has been curated and deduplicated for fine-tuning machine translation models such as NLLB.

Dataset Structure

The data is provided in JSONL format. Each line contains a JSON object representing a translation pair and its metadata.

Data Instances

A typical example from the dataset looks like this:

{
  "translation": {
    "fra": "Il mange une pomme.",
    "mos": "A rita pomme.",
    "origin": "traduction"
  }
}

Data Fields

translation: A dictionary containing:
- fra: The French sentence (string).
- mos: The Mooré translation (string).
- origin: The source origin of the data (string). Possible values:
  - laion-coco-nllb: Generic data from LAION-COCO-NLLB.
  - jw: Biblical data from JW.
  - french_moore: Human translation (D1).
  - traduction: Human translation (D2).
  - mafand: Data from the MAFAND dataset.
  - wibonary: Data from the Webonary dictionary.
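
As a minimal sketch of how one record could be read, the snippet below parses a single JSONL line and checks that the three documented fields are present. The helper name `parse_record` and the constant `EXPECTED_KEYS` are hypothetical and not part of the dataset; the sample line is the example instance shown above.

```python
import json

# Fields documented for the `translation` dictionary.
EXPECTED_KEYS = {"fra", "mos", "origin"}

def parse_record(line: str) -> dict:
    """Parse one JSONL line and return its `translation` dictionary."""
    record = json.loads(line)
    translation = record["translation"]
    missing = EXPECTED_KEYS - translation.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return translation

# The example instance from this card, written as a single JSONL line.
sample = (
    '{"translation": {"fra": "Il mange une pomme.", '
    '"mos": "A rita pomme.", "origin": "traduction"}}'
)
pair = parse_record(sample)
print(pair["fra"], "->", pair["mos"], f"({pair['origin']})")
```

Filtering on `pair["origin"]` would then allow selecting, for instance, only the human-translated pairs.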

Corpus Statistics

The total corpus comprises 230,343 aligned sentence pairs (excluding SMOL data). The composition of the dataset by configuration (config_name) is detailed below. The statistics for the combined subsets are aggregated from their respective base subsets.

Configuration	Sentences	Share (%)	Words (FR)	Words (MOS)	Ratio (MOS/FR)
`all-data`	230,343	100.0%	2,379,468	2,548,167	1.07
`generique` (gen)	116,431	50.5%	1,116,198	1,137,263	1.02
`traduction` (trad)	74,423	32.3%	351,579	425,030	1.21
`existing` (old)	39,489	17.1%	911,691	985,874	1.08
`trad-generique`	190,854	82.9%	1,467,777	1,562,293	1.06
`trad-existing`	113,912	49.5%	1,263,270	1,410,904	1.12

Note: 'existing' data groups the previously available datasets (MAFAND, Webonary, and Jehovah's Witnesses). 'trad-generique' is the combination of 'traduction' and 'generique'. 'trad-existing' is the combination of 'traduction' and 'existing'. Percentages reflect the proportion relative to 'all-data'.
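
The aggregation described in the note can be checked with a short script. The sketch below uses only the word counts from the table and confirms that each combined configuration equals the sum of its base subsets, then recomputes the MOS/FR ratio. The variable and function names are illustrative.

```python
# Word counts (FR, MOS) copied from the table above.
base = {
    "generique": (1_116_198, 1_137_263),
    "traduction": (351_579, 425_030),
    "existing": (911_691, 985_874),
}
reported = {
    "all-data": (2_379_468, 2_548_167),
    "trad-generique": (1_467_777, 1_562_293),
    "trad-existing": (1_263_270, 1_410_904),
}
# Base subsets that make up each combined configuration.
parts = {
    "all-data": ["generique", "traduction", "existing"],
    "trad-generique": ["traduction", "generique"],
    "trad-existing": ["traduction", "existing"],
}

def combine(names):
    """Sum the FR and MOS word counts of the given base subsets."""
    fr = sum(base[n][0] for n in names)
    mos = sum(base[n][1] for n in names)
    return fr, mos

for config, names in parts.items():
    fr, mos = combine(names)
    assert (fr, mos) == reported[config], config
    print(f"{config}: ratio MOS/FR = {mos / fr:.2f}")
```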

Subsets

The repository is structured into different subdirectories containing various subsets of the data, each corresponding to a specific configuration:

all_data: Combination of all the data (config: all-data).
gen: Generic data from laion-coco-nllb (config: generique).
trad: Human translation data (config: traduction).
old: Previously available data (MAFAND, Webonary, JW) (config: existing).
trad_gen: Combination of traduction and generique (config: trad-generique).
trad_old: Combination of traduction and existing (config: trad-existing).
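
The mapping between subdirectories and configurations can be written down as a small lookup table. The sketch below assumes the configurations are registered on the Hub under the names listed above and can be passed as the second argument of `datasets.load_dataset`; this has not been verified against the repository. `SUBDIR_TO_CONFIG` and `load_subset` are hypothetical names.

```python
# Subdirectory -> config name, as listed above.
SUBDIR_TO_CONFIG = {
    "all_data": "all-data",
    "gen": "generique",
    "trad": "traduction",
    "old": "existing",
    "trad_gen": "trad-generique",
    "trad_old": "trad-existing",
}

REPO_ID = "mooredataset/baoore-dataset-1"

def load_subset(subdir: str):
    """Load one subset by its subdirectory name.

    Assumes the `datasets` library is installed and that the config
    names above are the ones exposed by the repository.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    return load_dataset(REPO_ID, SUBDIR_TO_CONFIG[subdir])
```

For example, `load_subset("trad")` would request the human-translation subset, provided the assumption about config names holds.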

Preprocessing & Filtering

Deduplication: The corpus has undergone exact deduplication across French and Mooré sentence pairs.
Filtering: Some subsets (such as the LAION subset) have been filtered with an alignment-confidence threshold (e.g., > 0.80) computed with LaBSE via Sentence-Transformers.
Normalization: Basic character normalization is applied for the Mooré language to standardise orthography and maintain data quality.
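
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: the card does not specify the normalization rules, so Unicode NFC plus whitespace collapsing stands in for them, and the alignment scores are taken as precomputed inputs instead of being calculated with LaBSE. All function names and the score value are hypothetical.

```python
import unicodedata

def normalize(text: str) -> str:
    """Assumed normalization: Unicode NFC plus whitespace collapsing."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def deduplicate(pairs):
    """Drop exact duplicate (fra, mos) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for fra, mos in pairs:
        key = (normalize(fra), normalize(mos))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

def filter_by_score(pairs, scores, threshold=0.80):
    """Keep pairs whose precomputed alignment score exceeds the threshold."""
    return [pair for pair, score in zip(pairs, scores) if score > threshold]

pairs = [
    ("Il mange une pomme.", "A rita pomme."),
    ("Il mange  une pomme.", "A rita pomme."),  # same pair once whitespace is collapsed
]
unique = deduplicate(pairs)
kept = filter_by_score(unique, scores=[0.91])  # 0.91 is a made-up score
```

Normalizing before deduplicating is a design choice in this sketch: it lets pairs that differ only in spacing or Unicode composition count as exact duplicates.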

Intended Use

This dataset is intended for training and fine-tuning neural machine translation systems for the French-Mooré language pair, supporting the preservation and digitization of the Mooré language.
