Baoore Dataset 1 (French-Mooré)

Dataset Description

  • Repository: mooredataset/baoore-dataset-1
  • Languages: French (fra) - Mooré (mos)
  • Total sentences: 233,945

This dataset is a parallel corpus for machine translation between French and Mooré. It aggregates several sources: dictionary entries, curated human translations, generic translated data, and synthetic or filtered datasets. The corpus has been curated and deduplicated for fine-tuning machine translation models such as NLLB.

Dataset Structure

The data is provided in JSONL format. Each line contains a JSON object representing a translation pair and its metadata.

Data Instances

A typical example from the dataset looks like this:

{
  "translation": {
    "fra": "Il mange une pomme.",
    "mos": "A rita pomme.",
    "origin": "traduction"
  }
}

Data Fields

  • translation: A dictionary containing:
    • fra: The French sentence (string).
    • mos: The Mooré translation (string).
    • origin: The source origin of the data (string). Possible values:
      • laion-coco-nllb: Generic data from LAION-COCO-NLLB.
      • jw: Biblical data from JW.
      • french_moore: Human translation (D1).
      • traduction: Human translation (D2).
      • mafand: Data from the MAFAND dataset.
      • wibonary: Data from the Webonary dictionary.
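
As a minimal sketch of how these fields can be read, the snippet below parses JSONL lines with the standard library and counts records per origin. The second sample record uses placeholder strings and is not taken from the dataset; the helper name parse_pairs is hypothetical.

```python
import json
from collections import Counter

# Sample JSONL lines in the record layout described above.
# The first is the documented example; the second uses placeholder text
# (not real dataset content) purely to show a different origin value.
sample_lines = [
    '{"translation": {"fra": "Il mange une pomme.", "mos": "A rita pomme.", "origin": "traduction"}}',
    '{"translation": {"fra": "<French sentence>", "mos": "<Moore sentence>", "origin": "mafand"}}',
]


def parse_pairs(lines):
    """Yield (fra, mos, origin) tuples from JSONL lines, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)["translation"]
        yield record["fra"], record["mos"], record["origin"]


pairs = list(parse_pairs(sample_lines))

# Count how many pairs come from each origin.
origin_counts = Counter(origin for _, _, origin in pairs)
print(origin_counts)
```

The same generator works on an open file handle, since iterating over a text file yields one line at a time.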

Corpus Statistics

The total corpus comprises 230,343 aligned sentence pairs (excluding SMOL data). The composition of the dataset by configuration (config_name) is detailed below. The statistics for the combined subsets are aggregated from their respective base subsets.

| Configuration | Sentences | % of all-data | Words (FR) | Words (MOS) | Ratio (MOS/FR) |
|---|---|---|---|---|---|
| all-data | 230,343 | 100.0% | 2,379,468 | 2,548,167 | 1.07 |
| generique (gen) | 116,431 | 50.5% | 1,116,198 | 1,137,263 | 1.02 |
| traduction (trad) | 78,025 | 32.3% | 351,579 | 425,030 | 1.21 |
| existing (old) | 39,489 | 17.1% | 911,691 | 985,874 | 1.08 |
| trad-generique | 190,854 | 82.9% | 1,467,777 | 1,562,293 | 1.06 |
| trad-existing | 113,912 | 49.5% | 1,263,270 | 1,410,904 | 1.12 |

Note: 'existing' data groups the previously available datasets (MAFAND, Webonary, and Jehovah's Witnesses). 'trad-generique' is the combination of 'traduction' and 'generique'. 'trad-existing' is the combination of 'traduction' and 'existing'. Percentages reflect the proportion relative to 'all-data'.
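
The word-count columns can be cross-checked with a few lines of code. The sketch below, using only figures copied from the statistics above, recomputes the MOS/FR ratio as Mooré words divided by French words and confirms that the combined configurations' word counts are the sums of their base subsets. The variable and function names are illustrative.

```python
# Word counts copied from the corpus statistics:
# configuration -> (words FR, words MOS, reported MOS/FR ratio)
word_counts = {
    "all-data": (2_379_468, 2_548_167, 1.07),
    "generique": (1_116_198, 1_137_263, 1.02),
    "traduction": (351_579, 425_030, 1.21),
    "existing": (911_691, 985_874, 1.08),
    "trad-generique": (1_467_777, 1_562_293, 1.06),
    "trad-existing": (1_263_270, 1_410_904, 1.12),
}

# Combined configurations and the base subsets they aggregate.
combinations = {
    "all-data": ("generique", "traduction", "existing"),
    "trad-generique": ("traduction", "generique"),
    "trad-existing": ("traduction", "existing"),
}


def mos_fr_ratio(words_fr, words_mos):
    """Ratio of Moore words to French words, rounded to two decimals."""
    return round(words_mos / words_fr, 2)


recomputed = {
    name: mos_fr_ratio(fr, mos) for name, (fr, mos, _) in word_counts.items()
}

# For each combined configuration, sum the word counts of its parts.
summed_words = {
    name: (
        sum(word_counts[part][0] for part in parts),
        sum(word_counts[part][1] for part in parts),
    )
    for name, parts in combinations.items()
}

for name, (_, _, reported) in word_counts.items():
    print(f"{name:15s} reported={reported:.2f} recomputed={recomputed[name]:.2f}")
```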

Subsets

The repository is structured into different subdirectories containing various subsets of the data, each corresponding to a specific configuration:

  • all_data: Combination of all the data (config: all-data).
  • gen: Generic data from laion-coco-nllb (config: generique).
  • trad: Human translation data (config: traduction).
  • old: Previously available data (MAFAND, Webonary, JW) (config: existing).
  • trad_gen: Combination of traduction and generique (config: trad-generique).
  • trad_old: Combination of traduction and existing (config: trad-existing).
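
One possible way to select a subset is by its config name through the Hugging Face datasets library. This is a sketch only: it assumes the datasets package is installed, that network access is available, and that the repository exposes the config names listed above. The SUBSET_CONFIGS mapping and the load_subset helper are hypothetical names introduced for illustration.

```python
# Subdirectory -> config name, as listed in the subsets above.
SUBSET_CONFIGS = {
    "all_data": "all-data",
    "gen": "generique",
    "trad": "traduction",
    "old": "existing",
    "trad_gen": "trad-generique",
    "trad_old": "trad-existing",
}


def load_subset(subdir, repo_id="mooredataset/baoore-dataset-1"):
    """Load one configuration by its subdirectory name (needs network access).

    Assumes the repository exposes the config names in SUBSET_CONFIGS.
    """
    # Imported lazily so the mapping is usable without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(repo_id, name=SUBSET_CONFIGS[subdir])


# Example call (not executed here because it downloads data):
# trad = load_subset("trad")
```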

Preprocessing & Filtering

  • Deduplication: The corpus has undergone exact deduplication across French and Mooré sentence pairs.
  • Filtering: Some subsets (such as the LAION subset) have been filtered with an alignment confidence threshold (e.g., > 0.80) computed with LaBSE via Sentence-Transformers.
  • Normalization: Basic character normalization is applied for the Mooré language to standardise orthography and maintain data quality.
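
The three steps above can be sketched with the standard library. This is not the pipeline used to build the dataset: the normalization shown (Unicode NFC plus whitespace collapsing) is an assumption, since the exact rules are not specified here, and the scores are invented placeholders standing in for LaBSE similarities. The function names are hypothetical.

```python
import unicodedata


def normalize_mos(text):
    """Assumed normalization: Unicode NFC plus whitespace collapsing."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(scored_pairs, threshold=0.80):
    """Keep pairs scoring above `threshold` and drop exact duplicate pairs."""
    seen = set()
    kept = []
    for fra, mos, score in scored_pairs:
        # Filtering: discard pairs at or below the confidence threshold.
        if score <= threshold:
            continue
        # Normalization: applied to the Moore side only.
        key = (fra, normalize_mos(mos))
        # Deduplication: exact match on the (French, Moore) pair.
        if key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept


# Placeholder scores; real values would come from an alignment model.
scored = [
    ("Il mange une pomme.", "A rita pomme.", 0.91),
    ("Il mange une pomme.", "A rita  pomme.", 0.91),  # duplicate once whitespace is collapsed
    ("Il mange une pomme.", "A rita pomme.", 0.42),   # below the threshold
]
cleaned = clean_pairs(scored)
print(cleaned)
```

Normalizing before building the deduplication key means that pairs differing only in spacing or Unicode composition are treated as the same pair.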

Intended Use

This dataset is intended for training and fine-tuning neural machine translation systems for the French-Mooré language pair, supporting the preservation and digitization of the Mooré language.
