+---
+license: other
+license_name: cometh-reserved
+datasets:
+  - wasanx/cometh_human_annot
+language:
+  - en
+  - th
+metrics:
+  - spearman correlation
+tags:
+  - translation-evaluation
+  - thai
+  - english
+  - translation-metrics
+  - mqm
+  - claude-augmented
+  - comet
+  - translation-quality
+base_model: Unbabel/wmt22-cometkiwi-da
+pipeline_tag: translation
+library_name: unbabel-comet
+model-index:
+  - name: ComETH-Augmented
+    results:
+      - task:
+          type: translation-quality-estimation
+          name: Thai-English Translation Quality Assessment
+        dataset:
+          type: wasanx/cometh_human_annot
+          name: COMETH Human Annotations
+        metrics:
+          - name: Spearman correlation
+            type: spearman
+            value: 0.4795
+            verified: false
+      - task:
+          type: translation-quality-estimation
+          name: Thai-English Translation Quality Comparison
+        dataset:
+          type: wasanx/cometh_human_annot
+          name: COMETH Baseline Comparison
+        metrics:
+          - name: COMET baseline
+            type: spearman
+            value: 0.4570
+            verified: false
+          - name: ComETH (human-only)
+            type: spearman
+            value: 0.4639
+            verified: false
+---
+# ComETH: Thai-English Translation Quality Metrics
+ComETH is a fine-tuned version of the COMET (Crosslingual Optimized Metric for Evaluation of Translation) model, optimized for Thai-English translation quality assessment. It scores machine translation outputs against their source text, and its scores correlate with human judgments more closely than those of the base model (Spearman's ρ up to 0.4795).
+## Model Overview
+- **Model Type**: Translation Quality Estimation
+- **Languages**: Thai-English
+- **Base Model**: COMET (Unbabel/wmt22-cometkiwi-da)
+- **Encoder**: XLM-RoBERTa-based (microsoft/infoxlm-large)
+- **Architecture**: Unified Metric with sentence-level scoring
+- **Framework**: COMET (Unbabel)
+- **Task**: Machine Translation Evaluation
+- **Parameters**: 565M (558M encoder + 6.3M estimator)
+## Versions
+We offer two variants of ComETH with different training approaches:
+- **ComETH**: Fine-tuned on human MQM annotations (Spearman's ρ = 0.4639)
+- **ComETH-Augmented**: Fine-tuned on human + Claude-assisted annotations (Spearman's ρ = 0.4795)
+Both models outperform the base COMET model (Spearman's ρ = 0.4570) on Thai-English translation evaluation. The Claude-augmented version adds LLM-generated annotations to the training data, which raises correlation with human judgments by 0.0225 absolute, a relative gain of about 4.9% over the baseline.
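The 4.9% figure is a relative gain in Spearman's ρ, not an absolute one. A minimal sketch of the arithmetic, using only the three correlations quoted above (the variable and function names are illustrative):

```python
# Spearman correlations quoted in this model card.
baseline = 0.4570    # base COMET model
human_only = 0.4639  # ComETH
augmented = 0.4795   # ComETH-Augmented


def relative_gain(new: float, old: float) -> float:
    """Return the relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


augmented_gain = relative_gain(augmented, baseline)
human_only_gain = relative_gain(human_only, baseline)

print(f"ComETH-Augmented vs baseline: {augmented_gain:.1f}%")
print(f"ComETH vs baseline: {human_only_gain:.1f}%")
```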
+## Technical Specifications
+- **Training Framework**: PyTorch Lightning
+- **Loss Function**: MSE
+- **Input Segments**: [mt, src]
+- **Final Layer Architecture**: [3072, 1024]
+- **Layer Transformation**: Sparsemax
+- **Activation Function**: Tanh
+- **Dropout**: 0.1
+- **Learning Rate**: 1.5e-05 (Encoder: 1e-06)
+- **Layerwise Decay**: 0.95
+- **Word Layer**: 24
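As a sanity check, the 6.3M estimator parameters listed in the Model Overview are consistent with a feed-forward head of sizes [3072, 1024]. The sketch below counts weights and biases under two assumptions not stated in this card: the head takes a 1024-dimensional sentence embedding (the hidden size of InfoXLM-large) and ends in a single scalar output. The function name is illustrative.

```python
def feed_forward_params(input_dim: int, hidden_sizes: list[int], output_dim: int = 1) -> int:
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for size in hidden_sizes + [output_dim]:
        total += previous * size + size  # weight matrix + bias vector
        previous = size
    return total


# Assumed: 1024-dim input embedding and one scalar quality score as output.
estimator_params = feed_forward_params(1024, [3072, 1024])
print(f"{estimator_params:,} parameters (~{estimator_params / 1e6:.1f}M)")
```

The count leaves out any layer-mixing weights, so it should be read as an approximation of the estimator size rather than an exact breakdown.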
+## Training Data
+The models were trained on:
+- **Size**: 23,530 English-Thai translation pairs
+- **Source Domains**: Diverse, including technical, conversational, and e-commerce
+- **Annotation Framework**: Multidimensional Quality Metrics (MQM)
+- **Error Categories**:
+  - Minor: Issues that don't significantly impact meaning or usability
+  - Major: Errors that significantly impact meaning but don't render content unusable
+  - Critical: Errors that make content unusable or could have serious consequences
+- **Claude Augmentation**: Claude 3.5 Sonnet was used to generate supplementary quality judgments, enhancing the model's alignment with human evaluations
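MQM annotations are usually turned into a numeric target by weighting each error by its severity. This card does not state the weights or the normalization used for ComETH, so the sketch below is an assumption: the weights (1, 5, 10) and the `mqm_penalty` helper are purely illustrative.

```python
from collections import Counter

# Hypothetical severity weights; the card does not specify the real ones.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}


def mqm_penalty(severities: list[str], weights: dict[str, float] = SEVERITY_WEIGHTS) -> float:
    """Sum the weighted error counts for one translation (higher is worse)."""
    counts = Counter(severities)
    unknown = set(counts) - set(weights)
    if unknown:
        raise ValueError(f"unknown severity labels: {sorted(unknown)}")
    return sum(weights[label] * n for label, n in counts.items())


# One translation annotated with two minor errors and one major error.
penalty = mqm_penalty(["minor", "major", "minor"])
print(penalty)
```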
+## Training Process
+ComETH was trained using a multi-step process:
+1. Starting from the wmt22-cometkiwi-da checkpoint
+2. Fine-tuning on human MQM annotations for 5 epochs
+3. Using gradient accumulation (8 steps) to simulate larger batch sizes
+4. Using the unified metric architecture, which combines source and MT embeddings
+5. For the augmented variant: additional training with Claude-assisted annotations, weighted to balance human and machine judgments
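Step 3 relies on the fact that, for a mean-reduced loss such as MSE, averaging the gradients of equally sized micro-batches gives the same gradient as one large batch. The sketch below checks this for a one-parameter linear model in plain Python. The toy data and function names are illustrative; this is not the actual training code, which runs in PyTorch Lightning.

```python
def mse_gradient(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy regression data: 16 samples, split into 8 micro-batches of 2.
xs = [float(i) for i in range(1, 17)]
ys = [0.5 * x + 1.0 for x in xs]
w = 0.1
accumulation_steps = 8
micro_batch_size = len(xs) // accumulation_steps

# Gradient from a single large batch.
full_batch_grad = mse_gradient(w, xs, ys)

# Accumulate micro-batch gradients, scaling each by 1 / accumulation_steps
# so that the sum matches the large-batch mean.
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    end = start + micro_batch_size
    accumulated_grad += mse_gradient(w, xs[start:end], ys[start:end]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

The optimizer step is then taken once per 8 micro-batches, which is why accumulation simulates a larger batch without the extra memory.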
+## Performance
+### Correlation with Human Judgments (Spearman's ρ)
+| Model | Spearman's ρ | RMSE |
+|-------|-------------|------|
+| COMET (baseline) | 0.4570 | 0.3185 |
+| ComETH (human annotations) | 0.4639 | 0.3093 |
+| ComETH-Augmented (human + Claude) | **0.4795** | **0.3078** |
+The Claude-augmented version shows the highest correlation with human judgments and the lowest RMSE, improving Spearman's ρ by 0.0225 over the baseline and by 0.0156 over the human-only model.
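For readers who want to reproduce this kind of comparison on their own data, the two columns can be computed as follows. This is a dependency-free sketch, not the evaluation script behind the table; the helper names and the five example scores are made up for illustration.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2.0 + 1.0  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(a: list[float], b: list[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


def rmse(predicted: list[float], target: list[float]) -> float:
    """Root mean squared error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))


# Made-up scores for five segments, for illustration only.
human = [0.90, 0.40, 0.75, 0.10, 0.60]
metric = [0.80, 0.50, 0.70, 0.30, 0.45]
print(spearman_rho(human, metric), rmse(metric, human))
```

Spearman's ρ only measures how well the metric ranks segments, while RMSE also penalizes scores that are on the wrong scale, which is why the table reports both.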
+### Comparison with LLM Evaluators
+| Model | Spearman's ρ |
+|-------|-------------|
+| ComETH-Augmented | **0.4795** |
+| Claude 3.5 Sonnet | 0.4383 |
+| GPT-4o Mini | 0.4352 |
+| Gemini 2.0 Flash | 0.3918 |
+ComETH-Augmented outperforms direct evaluation by the three LLMs tested, while being more computationally efficient for large-scale translation quality assessment.
+## Usage Examples
+### Basic Evaluation
+```python
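+from comet import download_model, load_from_checkpoint
+# Download the checkpoint from the Hugging Face Hub, then load it from the local path
+model_path = download_model("cometh-team/ComETH-Augmented")
+model = load_from_checkpoint(model_path)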
+# Prepare input data
+translations = [
+    {
+        "src": "This is an English source text.",
+        "mt": "นี่คือข้อความภาษาอังกฤษ", # Machine translation to evaluate
+    }
+]
+# Get quality scores
+results = model.predict(translations, batch_size=8, gpus=1)
+scores = results['scores']
+```
+### Batch Processing With Progress Tracking
+```python
+import pandas as pd
+from tqdm import tqdm
+# Load translations from CSV
+df = pd.read_csv("translations.csv")
+input_data = df[['src', 'mt']].to_dict('records')
+# Process in batches
+batch_size = 32
+all_scores = []
+for i in tqdm(range(0, len(input_data), batch_size)):
+    batch = input_data[i:i+batch_size]
+    results = model.predict(batch, batch_size=len(batch), gpus=1)
+    all_scores.extend(results['scores'])
+# Add scores back to dataframe
+df['quality_score'] = all_scores
+```
+### System-Level Evaluation
+```python
+# Assumes df has a 'system_name' column and the 'quality_score' column computed above
+# Group by system and compute average scores
+systems = df.groupby('system_name')['quality_score'].agg(['mean', 'std', 'count']).reset_index()
+# Rank systems by average quality
+systems = systems.sort_values('mean', ascending=False)
+print(systems)
+```
+## License
+```
+The COMETH Reserved License
+Cometh English-to-Thai Translation Data and Model License
+Copyright (C) Cometh Team. All rights reserved.
+This license governs the use of the Cometh English-to-Thai translation data and model ("Cometh Model Data"), including but not limited to MQM scores, human translations, and human rankings from various translation sources.
+Permitted Use
+The Cometh Model Data is licensed exclusively for internal use by the designated Cometh team.
+Prohibited Use
+The following uses are strictly prohibited:
+1. Any usage outside the designated purposes unanimously approved by the Cometh team.
+2. Redistribution, sharing, or distribution of the Cometh Model Data in any form.
+3. Citation or public reference to the Cometh Model Data in any academic, commercial, or non-commercial context.
+4. Any use beyond the internal operations of the Cometh team.
+Legal Enforcement
+Unauthorized use, distribution, or citation of the Cometh Model Data constitutes a violation of this license and may result in legal action, including but not limited to prosecution under applicable laws.
+Reservation of Rights
+All rights to the Cometh Model Data are reserved by the Cometh team. This license does not transfer any ownership rights.
+By accessing or using the Cometh Model Data, you agree to be bound by the terms of this license.
+```
+## Citation
+```
+@misc{cometh2025,
+  title     = {COMETH: Thai-English Translation Quality Metrics},
+  author    = {COMETH Team},
+  year      = {2025},
+  howpublished = {Hugging Face Model Repository},
+  url       = {https://huggingface.co/wasanx/ComeTH}
+}
+```
+## Contact
+For questions or feedback: comethteam@gmail.com