|
|
--- |
|
|
license: other |
|
|
license_name: cometh-reserved |
|
|
datasets: |
|
|
- wasanx/cometh_claude_augment |
|
|
- wasanx/cometh_finetune |
|
|
language: |
|
|
- en |
|
|
- th |
|
|
metrics: |
|
|
- spearmanr
|
|
tags: |
|
|
- translation-evaluation |
|
|
- thai |
|
|
- english |
|
|
- translation-metrics |
|
|
- mqm |
|
|
- claude-augmented |
|
|
- comet |
|
|
- translation-quality |
|
|
base_model: Unbabel/wmt22-cometkiwi-da |
|
|
pipeline_tag: translation |
|
|
library_name: unbabel-comet |
|
|
model-index: |
|
|
- name: ComeTH |
|
|
results: |
|
|
- task: |
|
|
type: translation-quality-estimation |
|
|
name: English-Thai Translation Quality Assessment |
|
|
dataset: |
|
|
type: wasanx/cometh_claude_augment |
|
|
name: COMETH Claude Augmentation Datasets |
|
|
metrics: |
|
|
- name: Spearman correlation |
|
|
type: spearman |
|
|
value: 0.4795 |
|
|
verified: false |
|
|
- task: |
|
|
type: translation-quality-estimation |
|
|
name: English-Thai Translation Quality Comparison |
|
|
dataset: |
|
|
type: wasanx/cometh_human_annot |
|
|
name: COMETH Baseline Comparison |
|
|
metrics: |
|
|
- name: COMET baseline |
|
|
type: spearman |
|
|
value: 0.4570 |
|
|
verified: false |
|
|
- name: ComeTH (human-only)
|
|
type: spearman |
|
|
value: 0.4639 |
|
|
verified: false |
|
|
--- |
|
|
# ComeTH (คำไทย): English-Thai Translation Quality Metrics |
|
|
|
|
|
ComeTH is a fine-tuned version of the COMET (Crosslingual Optimized Metric for Evaluation of Translation) model, optimized specifically for English-Thai translation quality assessment. It scores machine translation outputs, producing quality estimates that correlate with human judgments more strongly than the baseline COMET model.
|
|
|
|
|
## Model Overview |
|
|
|
|
|
- **Model Type**: Translation Quality Estimation |
|
|
- **Languages**: English-Thai |
|
|
- **Base Model**: COMET (Unbabel/wmt22-cometkiwi-da) |
|
|
- **Encoder**: XLM-RoBERTa-based (microsoft/infoxlm-large) |
|
|
- **Architecture**: Unified Metric with sentence-level scoring |
|
|
- **Framework**: COMET (Unbabel) |
|
|
- **Task**: Machine Translation Evaluation |
|
|
- **Parameters**: 565M (558M encoder + 6.3M estimator) |
|
|
|
|
|
## Versions |
|
|
|
|
|
We offer two variants of ComeTH with different training approaches: |
|
|
|
|
|
- **ComeTH**: Fine-tuned on human MQM annotations (Spearman's ρ = 0.4639) |
|
|
- **ComeTH-Augmented**: Fine-tuned on human + Claude-assisted annotations (Spearman's ρ = 0.4795) |
|
|
|
|
|
Both models outperform the base COMET model (Spearman's ρ = 0.4570) on English-Thai translation evaluation. The Claude-augmented version leverages LLM-generated annotations to enhance correlation with human judgments by 4.9% over the baseline. |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
- **Training Framework**: PyTorch Lightning |
|
|
- **Loss Function**: MSE |
|
|
- **Input Segments**: [mt, src] |
|
|
- **Final Layer Architecture**: [3072, 1024] |
|
|
- **Layer Transformation**: Sparsemax |
|
|
- **Activation Function**: Tanh |
|
|
- **Dropout**: 0.1 |
|
|
- **Learning Rate**: 1.5e-05 (Encoder: 1e-06) |
|
|
- **Layerwise Decay**: 0.95 |
|
|
- **Word Layer**: 24 |
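For quick reference, the specifications above can be collected into a single Python dictionary. The key names below are illustrative (they loosely follow unbabel-comet's hyperparameter conventions but are not copied from the published hparams file):

```python
# Illustrative summary of the training specifications listed above;
# key names are assumptions, values come from the list.
hparams = {
    "input_segments": ["mt", "src"],
    "loss": "mse",
    "hidden_sizes": [3072, 1024],          # final layer architecture
    "layer_transformation": "sparsemax",
    "activations": "Tanh",
    "dropout": 0.1,
    "learning_rate": 1.5e-5,
    "encoder_learning_rate": 1e-6,
    "layerwise_decay": 0.95,
    "word_layer": 24,
}
```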
|
|
|
|
|
## Training Data |
|
|
|
|
|
The models were trained on: |
|
|
- **Size**: 23,530 English-Thai translation pairs |
|
|
- **Source Domains**: Diverse, including technical, conversational, and e-commerce |
|
|
- **Annotation Framework**: Multidimensional Quality Metrics (MQM) |
|
|
- **Error Categories**: |
|
|
- Minor: Issues that don't significantly impact meaning or usability |
|
|
- Major: Errors that significantly impact meaning but don't render content unusable |
|
|
- Critical: Errors that make content unusable or could have serious consequences |
|
|
- **Claude Augmentation**: Claude 3.5 Sonnet was used to generate supplementary quality judgments, enhancing the model's alignment with human evaluations |
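To make the MQM framework concrete, here is a minimal sketch of severity-weighted segment scoring. The weights (minor = 1, major = 5, critical = 10) and the normalization constant are common MQM-style defaults, not the exact values from the ComeTH annotation guidelines:

```python
# Illustrative MQM-style scoring: each annotated error contributes a
# severity penalty, and the total penalty is mapped onto a 0-1 quality
# score. Weights and max_penalty are assumptions for illustration only.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_penalty(errors):
    """Sum severity penalties for a list of annotated error severities."""
    return sum(SEVERITY_WEIGHTS[e] for e in errors)

def segment_score(errors, max_penalty=25):
    """Map a penalty onto a 0-1 quality score (1.0 = no errors)."""
    return max(0.0, 1.0 - mqm_penalty(errors) / max_penalty)
```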
|
|
|
|
|
## Training Process |
|
|
|
|
|
ComeTH was trained using a multi-step process: |
|
|
1. Starting from the wmt22-cometkiwi-da checkpoint |
|
|
2. Fine-tuning on human MQM annotations for 5 epochs |
|
|
3. Using gradient accumulation (8 steps) to simulate larger batch sizes |
|
|
4. Utilizing unified metric architecture that combines source and MT embeddings |
|
|
5. For the augmented variant: additional training with Claude-assisted annotations, weighted to balance human and machine judgments |
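Step 3 (gradient accumulation) can be illustrated in isolation: gradients from several micro-batches are summed and a single parameter update is applied, which simulates an 8× larger effective batch. The toy least-squares problem below is purely illustrative and unrelated to the actual training code:

```python
# Gradient accumulation sketch: sum gradients over 8 micro-batches,
# then apply one averaged update. Toy 1-D least-squares model y = w * x.
def grad(w, batch):
    # d/dw of the mean squared error over a micro-batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, accumulation_steps=8, lr=0.01, epochs=5, w=0.0):
    micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]
    acc, n = 0.0, 0
    for _ in range(epochs):
        for batch in micro_batches:
            acc += grad(w, batch)
            n += 1
            if n == accumulation_steps:       # one update per 8 micro-batches
                w -= lr * acc / accumulation_steps
                acc, n = 0.0, 0
    return w
```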
|
|
|
|
|
## Performance |
|
|
|
|
|
### Correlation with Human Judgments (Spearman's ρ) |
|
|
|
|
|
| Model | Spearman's ρ | RMSE | |
|
|
|-------|-------------|------| |
|
|
| COMET (baseline) | 0.4570 | 0.3185 | |
|
|
| ComeTH (human annotations) | 0.4639 | 0.3093 | |
|
|
| ComeTH-Augmented (human + Claude) | **0.4795** | **0.3078** | |
|
|
|
|
|
The Claude-augmented version correlates most strongly with human judgments: a 4.9% relative improvement over the baseline and 3.4% over the human-only model.
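For clarity, the Spearman's ρ values above are rank correlations: both the metric scores and the human judgments are converted to ranks, and the Pearson correlation of the ranks is taken. A dependency-free sketch (equivalent in principle to `scipy.stats.spearmanr`):

```python
# Spearman's rho: Pearson correlation of the ranks, with average ranks
# assigned to tied values.
def _ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                            # extend the tie group
        avg = (i + j) / 2 + 1                 # average rank, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```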
|
|
|
|
|
### Comparison with Other LLM Evaluators |
|
|
|
|
|
| Model | Spearman's ρ | |
|
|
|-------|-------------| |
|
|
| ComeTH-Augmented | **0.4795** | |
|
|
| Claude 3.5 Sonnet | 0.4383 | |
|
|
| GPT-4o Mini | 0.4352 | |
|
|
| Gemini 2.0 Flash | 0.3918 | |
|
|
|
|
|
ComeTH-Augmented outperforms direct evaluations from state-of-the-art LLMs, while being more computationally efficient for large-scale translation quality assessments. |
|
|
|
|
|
## Advanced Usage Examples |
|
|
|
|
|
### Basic Evaluation |
|
|
|
|
|
```python
from comet import download_model, load_from_checkpoint

# Download the checkpoint from the Hugging Face Hub and load it
model_path = download_model("wasanx/ComeTH")
model = load_from_checkpoint(model_path)

# Reference-free evaluation: each sample needs only the source ("src")
# and the machine translation ("mt")
translations = [
    {
        "src": "This is an English source text.",
        "mt": "นี่คือข้อความภาษาอังกฤษ",
    }
]

results = model.predict(translations, batch_size=8, gpus=1)
scores = results["scores"]  # one quality score per input segment
```
|
|
|
|
|
### Batch Processing With Progress Tracking |
|
|
|
|
|
```python
import pandas as pd
from tqdm import tqdm

# Assumes `model` was loaded as in the basic example above and that
# translations.csv has "src" and "mt" columns
df = pd.read_csv("translations.csv")
input_data = df[["src", "mt"]].to_dict("records")

batch_size = 32
all_scores = []

# Chunking the input lets tqdm report progress and preserves partial
# results if a long run is interrupted
for i in tqdm(range(0, len(input_data), batch_size)):
    batch = input_data[i:i + batch_size]
    results = model.predict(batch, batch_size=len(batch), gpus=1)
    all_scores.extend(results["scores"])

df["quality_score"] = all_scores
```
|
|
|
|
|
### System-Level Evaluation |
|
|
|
|
|
```python
# Assumes df carries a "system_name" column identifying the MT system
# behind each segment, plus the quality_score column computed above
systems = (
    df.groupby("system_name")["quality_score"]
    .agg(["mean", "std", "count"])
    .reset_index()
    .sort_values("mean", ascending=False)
)
print(systems)
```
|
|
|
|
|
## Citation |
|
|
|
|
|
```
@misc{cometh2025,
  title        = {ComeTH: English-Thai Translation Quality Metrics},
  author       = {{ComeTH Team}},
  year         = {2025},
  howpublished = {Hugging Face Model Repository},
  url          = {https://huggingface.co/wasanx/ComeTH}
}
```
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or feedback: comethteam@gmail.com |
|
|
|
|
|
## License |
|
|
|
|
|
``` |
|
|
The COMETH Reserved License |
|
|
|
|
|
Cometh English-to-Thai Translation Data and Model License |
|
|
|
|
|
Copyright (C) Cometh Team. All rights reserved. |
|
|
|
|
|
This license governs the use of the Cometh English-to-Thai translation data and model ("Cometh Model Data"), including but not limited to MQM scores, human translations, and human rankings from various translation sources. |
|
|
|
|
|
Permitted Use |
|
|
The Cometh Model Data is licensed exclusively for internal use by the designated Cometh team. |
|
|
|
|
|
Prohibited Use |
|
|
The following uses are strictly prohibited: |
|
|
1. Any usage outside the designated purposes unanimously approved by the Cometh team. |
|
|
2. Redistribution, sharing, or distribution of the Cometh Model Data in any form. |
|
|
3. Citation or public reference to the Cometh Model Data in any academic, commercial, or non-commercial context. |
|
|
4. Any use beyond the internal operations of the Cometh team. |
|
|
|
|
|
Legal Enforcement |
|
|
Unauthorized use, distribution, or citation of the Cometh Model Data constitutes a violation of this license and may result in legal action, including but not limited to prosecution under applicable laws. |
|
|
|
|
|
Reservation of Rights |
|
|
All rights to the Cometh Model Data are reserved by the Cometh team. This license does not transfer any ownership rights. |
|
|
|
|
|
By accessing or using the Cometh Model Data, you agree to be bound by the terms of this license. |
|
|
``` |