chrismazii/kinycomet_unbabel

@@ -74,7 +74,7 @@ Rwanda's thriving MT ecosystem includes companies like Digital Umuganda, KINLP,
 | BLEU | N/A | 0.30 | 0.34 | 0.23 | 0.62 |
 | chrF | N/A | 0.38 | 0.30 | 0.21 | 0.34 |
-** State-of-the-Art Results**: Both KinyCOMET variants significantly outperform existing baselines, with KinyCOMET-Unbabel achieving the highest correlation across all metrics.
 ## Performance Highlights
@@ -98,105 +98,136 @@ Rwanda's thriving MT ecosystem includes companies like Digital Umuganda, KINLP,
 - Both KinyCOMET variants significantly outperform AfriCOMET baselines despite including Kinyarwanda
 - Surprising finding: Unbabel baseline (not trained on Kinyarwanda) outperforms AfriCOMET variants
-# KinyCOMET Model Usage
-This model evaluates machine translation quality for Kinyarwanda translations.
 ## Installation
 ```bash
-pip install unbabel-comet huggingface_hub
 ```
-### Quick Start
-```from huggingface_hub import hf_hub_download
 from comet import load_from_checkpoint
-import os
-import shutil
-# Download model files from Hugging Face
-checkpoint_path = hf_hub_download(
-    repo_id="chrismazii/kinycomet_unbabel",
-    filename="KinyCOMET+Unbabel.ckpt"
-)
-hparams_path = hf_hub_download(
-    repo_id="chrismazii/kinycomet_unbabel",
-    filename="hparams.yaml"
-)
-# Set up directory structure for COMET
-work_dir = "/tmp/kinycomet"
-checkpoints_dir = os.path.join(work_dir, "checkpoints")
-os.makedirs(checkpoints_dir, exist_ok=True)
-# Copy files to expected structure
-shutil.copy(checkpoint_path, os.path.join(checkpoints_dir, "KinyCOMET+Unbabel.ckpt"))
-shutil.copy(hparams_path, os.path.join(work_dir, "hparams.yaml"))
-# Load the model
-model = load_from_checkpoint(os.path.join(checkpoints_dir, "KinyCOMET+Unbabel.ckpt"))
-# Now use the model with your data
-data = [{"src": "source text", "mt": "translation"}]
-segment_scores, system_score = model.predict(data, gpus=0)
-print(segment_scores, system_score)
 ```
-### Hugging Face Integration
 ```python
-from transformers import pipeline
-# Load via Hugging Face
-quality_estimator = pipeline(
-    "text-classification",
-    model="chrismazii/kinycomet_unbabel",
-    tokenizer="chrismazii/kinycomet_unbabel"
-)
-# Estimate quality
-result = quality_estimator({
-    "src": "Umugabo ararya",
-    "mt": "The man is eating"
 })
 ```
-## Training Details
-### Dataset
-- **Custom Kinyarwanda-English QE Dataset** with train/validation/test splits
-- **Score Normalization**: All quality scores normalized to [0,1] range during preprocessing
-- **Bidirectional Coverage**: Includes both translation directions
 ## Training Details
-### Dataset Construction
-Our training dataset represents a major community effort:
-- **4,323 samples** from three high-quality parallel corpora:
-  - Mbaza Education Dataset
-  - Mbaza Tourism Dataset
-  - Digital Umuganda Dataset
-- **15 linguistics students** as human annotators using Direct Assessment (DA) methodology
-- **Quality control**: Minimum 3 annotations per sample, removed samples with σ > 20 (410 samples/9.48%)
-- **Data split**: 80% train (3,497) / 10% validation (404) / 10% test (422)
-### Translation Systems Evaluated
-Six diverse MT systems for comprehensive coverage:
-- **LLM-based**: Claude 3.7-Sonnet, GPT-4o, GPT-4.1, Gemini Flash 2.0
-- **Traditional**: Facebook NLLB (1.3B and 600M parameters)
 ### Training Configuration
-- **Base Models**: XLM-RoBERTa-large and Unbabel/wmt22-comet-da
 - **Methodology**: COMET framework with Direct Assessment supervision
 - **Evaluation Metrics**: Kendall's τ and Spearman ρ correlation with human DA scores
-- **Note**: XLM-RoBERTa was not originally trained on Kinyarwanda data, yet achieves strong performance
-### Data Distribution
-**DA Score Statistics**:
-- Overall: μ=87.73, σ=14.14
-- English→Kinyarwanda: μ=84.60, σ=16.28
-- Kinyarwanda→English: μ=91.05, σ=10.47
-Distribution pattern similar to WMT datasets (2017-2022), indicating alignment with international evaluation standards.
 ### MT System Benchmarking Results
 Our evaluation of production MT systems reveals interesting insights:
 | MT System | Kinyarwanda→English | English→Kinyarwanda | Overall |
@@ -213,75 +244,25 @@ Our evaluation of production MT systems reveals interesting insights:
 - All systems perform better on Kinyarwanda→English than English→Kinyarwanda
 - Score differences are subtle but statistically meaningful with KinyCOMET's precision
-## Evaluation & Metrics
-The model is evaluated using standard quality estimation metrics:
-- **Pearson Correlation**: Measures linear correlation with human judgments
-- **Spearman Correlation**: Measures monotonic correlation with human rankings
-- **System Score**: Overall translation system quality assessment
-- **MAE/RMSE**: Mean absolute error and root mean square error
-## Dataset Access
-Our human-annotated Kinyarwanda-English quality estimation dataset is publicly available:
-```python
-from huggingface_hub import hf_hub_download
-import pandas as pd
-# Download dataset files
-train_file = hf_hub_download(repo_id="chrismazii/kinycomet_unbabel", filename="train.csv")
-val_file = hf_hub_download(repo_id="chrismazii/kinycomet_unbabel", filename="valid.csv")
-test_file = hf_hub_download(repo_id="chrismazii/kinycomet_unbabel", filename="test.csv")
-# Load the datasets
-train_df = pd.read_csv(train_file)
-val_df = pd.read_csv(val_file)
-test_df = pd.read_csv(test_file)
-print(f"Training samples: {len(train_df)}")
-print(f"Validation samples: {len(val_df)}")
-print(f"Test samples: {len(test_df)}")
-# Convert to list of dictionaries for COMET usage
-train_samples = train_df.to_dict('records')
-# Example sample structure
-print(train_samples[0])
-# {
-#   'src': 'Umugabo ararya',
-#   'mt': 'The man is eating',
-#   'ref': 'The man is eating',
-#   'score': 0.89,
-#   'direction': 'kin2eng'
-# }
-```
-**Dataset Characteristics**:
-- **Total samples**: 4,323 (train: 3,497, val: 404, test: 422)
-- **Directions**: Bidirectional rw↔en
-- **Annotation**: Human Direct Assessment scores [0-100] normalized to [0-1]
-- **Quality**: Multi-annotator agreement, high-variance samples removed
-- **Coverage**: Multiple MT systems and domains (education, tourism)
 ## Real-World Impact & Applications
 ### Addressing Rwanda's NLP Ecosystem Needs
 KinyCOMET directly addresses pain points identified by the Rwandan MT community:
 **Before KinyCOMET:**
--  BLEU scores poorly correlate with human judgment for Kinyarwanda
--  Expensive, time-consuming human evaluation required
--  No reliable automatic metrics for morphologically rich Kinyarwanda
 **With KinyCOMET:**
--  **2.5x better correlation** with human judgments than BLEU
--  **Instant evaluation** for production MT systems
--  **Cost-effective** alternative to human annotation
--  **Specialized for Kinyarwanda** morphological complexity
 ### Production Use Cases
 **For MT Companies** (Digital Umuganda, KINLP, Awesomity, Artemis AI):
 - Real-time translation quality monitoring
 - A/B testing of model improvements
@@ -299,10 +280,15 @@ KinyCOMET directly addresses pain points identified by the Rwandan MT community:
 ## Limitations & Considerations
-- **Domain Specificity**: Trained on specific text domains; may not generalize to all content types
 - **Language Variants**: Optimized for standard Kinyarwanda; dialectal variations may affect performance
 - **Resource Requirements**: Requires COMET library and substantial computational resources
 - **Score Interpretation**: Scores are relative to training data distribution
 ## Citation & Research
@@ -311,7 +297,7 @@ If you use KinyCOMET in your research, please cite:
 ```bibtex
 @misc{kinycomet2025,
     title={KinyCOMET: Translation Quality Estimation for Kinyarwanda-English},
-    author={[Prince Chris Mazimpaka] and [Jan Nehring]},
     year={2025},
     publisher={Hugging Face},
     howpublished={\url{https://huggingface.co/chrismazii/kinycomet_unbabel}}
@@ -329,14 +315,18 @@ KinyCOMET contributes to the growing ecosystem of African language NLP tools. We
 ## License
-This model is released under the Apache 2.0 License.
 ## Acknowledgments
-- **COMET Framework**: Built on the excellent COMET quality estimation framework
 - **Base Models**: Leverages XLM-RoBERTa and Unbabel's WMT22 COMET-DA models
 - **African NLP Community**: Inspired by ongoing efforts to advance African language technologies
-- **Contributors**: Thanks to all researchers and annotators who made this work possible
 ---

 | BLEU | N/A | 0.30 | 0.34 | 0.23 | 0.62 |
 | chrF | N/A | 0.38 | 0.30 | 0.21 | 0.34 |
+**State-of-the-Art Results**: Both KinyCOMET variants significantly outperform existing baselines, with KinyCOMET-Unbabel achieving the highest correlation across all metrics.
 ## Performance Highlights
 - Both KinyCOMET variants significantly outperform AfriCOMET baselines despite including Kinyarwanda
 - Surprising finding: Unbabel baseline (not trained on Kinyarwanda) outperforms AfriCOMET variants
 ## Installation
+Make sure you have Python ≥ 3.8 and install COMET via pip:
 ```bash
+pip install unbabel-comet
 ```
+You can verify the CLI tool is installed:
+```bash
+which comet-score
+# should print something like: /usr/local/bin/comet-score
+```
+For more details on COMET, see the [official documentation](https://unbabel.github.io/COMET/html/index.html).
+## Usage
+### Load and Use the Model in Python
+Here's a simple example to score translations directly in Python:
+```python
 from comet import load_from_checkpoint
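+# Download the public KinyCOMET checkpoint, then load it from the local path.
+# load_from_checkpoint expects a checkpoint file path, not a Hub repo ID;
+# download_model assumes the repo follows COMET's checkpoints/model.ckpt layout.
+from comet import download_model
+model_path = download_model("chrismazii/kinycomet_unbabel")
+model = load_from_checkpoint(model_path)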
+# Example translations
+samples = [
+    {
+        "src": "Umugabo ararya.",
+        "mt": "The man is eating.",
+        "ref": "The man is eating."
+    },
+    {
+        "src": "Umwana arasinzira.",
+        "mt": "A dog sleeps.",
+        "ref": "The child is sleeping."
+    }
+]
+# Predict scores
+pred = model.predict(samples, gpus=0)
+print(pred)
 ```
+**Output Example:**
 ```python
+Prediction({
+  'scores': [0.9899, 0.8813],
+  'system_score': 0.9356
 })
 ```
+### Using the Command Line Interface (CLI)
+You can also evaluate translations directly using the terminal.
+**Step 1: Create the text files**
+```bash
+cat > source.txt <<'SRC'
+Umugabo ararya.
+Umwana arasinzira.
+Uyu mwanya neza cyane.
+SRC
+cat > reference.txt <<'REF'
+The man is eating.
+The child is sleeping.
+This place is very nice.
+REF
+cat > hypothesis.txt <<'HYP'
+The man is eating.
+A dog sleeps.
+This place is very nice.
+HYP
+```
+**Step 2: Run KinyCOMET**
+```bash
+comet-score -s source.txt -r reference.txt -t hypothesis.txt \
+  --model chrismazii/kinycomet_unbabel --gpus 0 --to_json results.json
+```
+**Step 3: View the results**
+```bash
+cat results.json
+```
+**Example Output:**
+```json
+{
+  "system_score": 0.9547,
+  "segments": [
+    {"src":"Umugabo ararya.","mt":"The man is eating.","ref":"The man is eating.","score":0.9899},
+    {"src":"Umwana arasinzira.","mt":"A dog sleeps.","ref":"The child is sleeping.","score":0.8813},
+    {"src":"Uyu mwanya neza cyane.","mt":"This place is very nice.","ref":"This place is very nice.","score":0.9927}
+  ]
+}
+```
+### Score Interpretation
+- **Scores range from 0 to 1**: Higher scores indicate better translation quality
+- **System score**: Average quality across all translations
+- **Segment scores**: Individual quality scores for each translation pair
+- **Threshold guidance**: Scores above 0.8 typically indicate high-quality translations
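The interpretation rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not part of the COMET API: the helper name `summarize` and the constant `HIGH_QUALITY_THRESHOLD` are hypothetical, and the 0.8 cutoff is simply the guidance value quoted above rather than a calibrated threshold.

```python
from statistics import mean

# Guidance value from the list above; an assumption, not a calibrated cutoff.
HIGH_QUALITY_THRESHOLD = 0.8


def summarize(segment_scores, threshold=HIGH_QUALITY_THRESHOLD):
    """Return (system_score, flagged_indices) for a list of segment scores.

    system_score is the plain average of the segment scores;
    flagged_indices lists the segments that are NOT above the threshold.
    """
    if not segment_scores:
        raise ValueError("segment_scores must not be empty")
    system_score = mean(segment_scores)
    flagged = [i for i, score in enumerate(segment_scores) if score <= threshold]
    return system_score, flagged


# Segment scores copied from the example CLI output shown earlier.
scores = [0.9899, 0.8813, 0.9927]
system_score, flagged = summarize(scores)
print(round(system_score, 4), flagged)
```

Note that the deliberately wrong hypothesis "A dog sleeps." scores 0.8813 in the example output, which is still above 0.8, so a fixed cutoff like this is best treated as a starting point to tune on your own data.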
 ## Training Details
+### Model Architecture
+- **Base Models**: XLM-RoBERTa-large and Unbabel/wmt22-comet-da
+- **Framework**: COMET quality estimation framework
+- **Training Data**: 4,323 human-annotated Kinyarwanda-English translation pairs
 ### Training Configuration
 - **Methodology**: COMET framework with Direct Assessment supervision
 - **Evaluation Metrics**: Kendall's τ and Spearman ρ correlation with human DA scores
+- **Data Split**: 80% train (3,497) / 10% validation (404) / 10% test (422)
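As a rough illustration of how a rank correlation such as Kendall's τ compares metric scores with human DA scores, here is a minimal pure-Python sketch of the tau-a variant (no tie correction). The function name `kendall_tau_a` and the toy score lists are made up for illustration; they are not KinyCOMET data, and the reported results may well have been computed with a tie-aware variant from a statistics library.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy values: human DA scores vs. metric scores for 4 segments.
human = [0.9, 0.4, 0.7, 0.2]
metric = [0.95, 0.6, 0.5, 0.3]
tau = kendall_tau_a(human, metric)
print(round(tau, 4))  # 5 concordant pairs, 1 discordant -> 0.6667
```

A value of 1.0 would mean the metric ranks every pair of segments the same way the annotators did; -1.0 would mean the ranking is fully reversed.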
 ### MT System Benchmarking Results
 Our evaluation of production MT systems reveals interesting insights:
 | MT System | Kinyarwanda→English | English→Kinyarwanda | Overall |
 - All systems perform better on Kinyarwanda→English than English→Kinyarwanda
 - Score differences are subtle but statistically meaningful with KinyCOMET's precision
 ## Real-World Impact & Applications
 ### Addressing Rwanda's NLP Ecosystem Needs
 KinyCOMET directly addresses pain points identified by the Rwandan MT community:
 **Before KinyCOMET:**
+- BLEU scores poorly correlate with human judgment for Kinyarwanda
+- Expensive, time-consuming human evaluation required
+- No reliable automatic metrics for morphologically rich Kinyarwanda
 **With KinyCOMET:**
+- **2.5x better correlation** with human judgments than BLEU
+- **Instant evaluation** for production MT systems
+- **Cost-effective** alternative to human annotation
+- **Specialized for Kinyarwanda** morphological complexity
 ### Production Use Cases
 **For MT Companies** (Digital Umuganda, KINLP, Awesomity, Artemis AI):
 - Real-time translation quality monitoring
 - A/B testing of model improvements
 ## Limitations & Considerations
+- **Domain Specificity**: Trained on education and tourism domains; may not generalize to all content types
 - **Language Variants**: Optimized for standard Kinyarwanda; dialectal variations may affect performance
 - **Resource Requirements**: Requires COMET library and substantial computational resources
 - **Score Interpretation**: Scores are relative to training data distribution
+- **Reference Dependency**: Best performance achieved with reference translations
+## Dataset Access
+The training dataset is available separately. See the [KinyCOMET Dataset Card](https://huggingface.co/datasets/chrismazii/kinycomet_dataset) for details on accessing the human-annotated quality estimation data.
 ## Citation & Research
 ```bibtex
 @misc{kinycomet2025,
     title={KinyCOMET: Translation Quality Estimation for Kinyarwanda-English},
+    author={Prince Chris Mazimpaka and Jan Nehring},
     year={2025},
     publisher={Hugging Face},
     howpublished={\url{https://huggingface.co/chrismazii/kinycomet_unbabel}}
 ## License
+This model is released under the Apache 2.0 License.
 ## Acknowledgments
+- **COMET Framework**: Built on the excellent [COMET quality estimation framework](https://unbabel.github.io/COMET/html/index.html)
 - **Base Models**: Leverages XLM-RoBERTa and Unbabel's WMT22 COMET-DA models
 - **African NLP Community**: Inspired by ongoing efforts to advance African language technologies
+- **Contributors**: Thanks to the 15 linguistics students and all researchers who made this work possible
 ---
+**Resources:**
+- [COMET Documentation](https://unbabel.github.io/COMET/html/index.html)
+- [Dataset Card](https://huggingface.co/datasets/chrismazii/kinycomet_dataset)
+- [Model Files](https://huggingface.co/chrismazii/kinycomet_unbabel/tree/main)