---
license: mit
datasets:
- Darsala/english_georgian_corpora
language:
- ka
- en
metrics:
- comet
- bleu
- chrf
pipeline_tag: translation
tags:
- translation
- Georgian
- NMT
- MT
- encoder-decoder
model-index:
- name: Georgian-Translation
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: FLORES Test Set
      type: flores
    metrics:
    - type: comet
      value: 0.79
      name: COMET Score
base_model: bert-base-uncased
---

# Georgian Translation Model

## Model Description

This is an English-to-Georgian neural machine translation model developed as part of a bachelor's thesis project. The model uses an encoder-decoder architecture that pairs a pretrained BERT encoder with a randomly initialized decoder.

## Architecture

- **Model Type**: Encoder-Decoder
- **Encoder**: Pretrained BERT model (`bert-base-uncased`)
- **Decoder**: Randomly initialized with a custom configuration
- **Decoder Tokenizer**: `RichNachos/georgian-corpus-tokenizer-test`
- **Parameters**: 266M in total

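As a rough illustration of how an encoder and a freshly initialized decoder can be wired together in `transformers`, here is a minimal sketch. It is not the training code used for this model: the layer sizes and vocabulary sizes are hypothetical and deliberately tiny so the snippet runs offline, and using a BERT-style decoder is an assumption, since the actual decoder configuration is not documented here.

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Hypothetical, tiny sizes chosen only so this sketch builds quickly and offline.
# In practice the encoder would instead be loaded from pretrained weights
# (for example bert-base-uncased) and only the decoder would start from scratch.
encoder_config = BertConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)

# Assumption: a BERT-style decoder with its own (Georgian) vocabulary size.
decoder_config = BertConfig(
    vocab_size=1200,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)

# This helper marks the second config as a decoder and enables cross-attention,
# so the decoder can attend to the encoder's hidden states.
config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)

# Building from a config gives randomly initialized weights for both halves.
model = EncoderDecoderModel(config=config)

num_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters in this toy model: {num_params}")
```

Keeping separate vocabularies for the encoder and decoder is what makes a dedicated decoder tokenizer necessary.
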
## Training Details

- **Training Data**: English-Georgian parallel corpus (see [Darsala/english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora))
- **Training Duration**: 16 epochs
- **Hardware**: Nvidia A100 80GB
- **Batch Size**: 128 with 2 gradient accumulation steps
- **Scheduler**: Cosine learning rate scheduler
- **Training Pipeline**: Complete data cleaning, preprocessing, and augmentation pipeline

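To make the batch size and scheduler settings concrete, the sketch below works through the arithmetic in plain Python. It rests on two assumptions that this card does not confirm: that 128 is the per-step batch size (so two accumulation steps give 256 examples per optimizer update), and that the cosine schedule decays to zero over a single half cycle. The `peak_lr`, `total_steps`, and `warmup_steps` values are hypothetical.

```python
import math

PER_STEP_BATCH = 128   # from the training details
GRAD_ACCUM_STEPS = 2   # from the training details

# Assumption: 128 is the per-step batch, so each optimizer update sees 256 examples.
effective_batch = PER_STEP_BATCH * GRAD_ACCUM_STEPS


def cosine_lr(step: int, total_steps: int, peak_lr: float, warmup_steps: int = 0) -> float:
    """Learning rate at `step`: optional linear warmup, then cosine decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical values, for illustration only.
peak_lr, total_steps = 1e-4, 1000
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps, peak_lr))
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches zero at the final step.
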
## Performance

- **COMET Score**: 0.79 (on the FLORES test set)
- **Comparison**: Google Translate scores 0.83 and Kona scores 0.84 on the same dataset
- **Translation Style**: Output reads as more literary and natural Georgian than Google Translate's

## Usage

**Important**: This model relies on a custom `EncoderDecoderTokenizer` that ships with the repository, so you need to download the repo locally before you can import it.

```python
import re
import sys

import torch
from huggingface_hub import snapshot_download
from transformers import EncoderDecoderModel

# Download the repo to a local folder
path_to_downloaded = snapshot_download(
    repo_id="Darsala/Georgian-Translation",
    local_dir="./Georgian-Translation",
)

# Add the downloaded folder to the Python path so the custom tokenizer can be imported
sys.path.append(path_to_downloaded)
from encoder_decoder_tokenizer import EncoderDecoderTokenizer

# Load the model and tokenizer from the downloaded folder
model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def translate(
    text: str,
    num_beams: int = 5,
    max_length: int = 256,
) -> str:
    """Translate a single English string into Georgian."""
    # Lowercase the input and collapse runs of whitespace into single spaces
    text = text.lower()
    text = re.sub(r"\s+", " ", text)

    # Tokenize and move the tensors to the model's device
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest",
    ).to(device)

    # Generate with beam search
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True,
    )

    # Decode the best hypothesis back to text
    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"English: {text}")
    print(f"Translated: {output}")

    return output


# Example usage
translation = translate("Hello, how are you?")
```

## Strengths and Limitations

### Strengths
- Produces more literary and natural Georgian translations
- Performs well on general text translation
- Specialized for the characteristics of the Georgian language

### Limitations
- Struggles with proper names and company names
- Has trouble with terms that should be copied verbatim from the English source
- Limited by tokenizer coverage for certain English terms

## Demo

Try the model in the interactive demo: [Georgian Translation Space](https://huggingface.co/spaces/Darsala/Georgian-Translation)

## Citation

```bibtex
@mastersthesis{darsalia2025georgian,
  title={English Translation Quality Assessment and Computer Translation},
  author={Luka Darsalia},
  year={2025},
  school={Tbilisi University},
  note={Bachelor's Thesis - Computer Science}
}
```

## Related Resources

- **Training Data**: [english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora)
- **Georgian COMET Model**: [georgian_comet](https://huggingface.co/Darsala/georgian_comet)
- **Evaluation Data**: [georgian_metric_evaluation](https://huggingface.co/datasets/Darsala/georgian_metric_evaluation)