---
license: mit
datasets:
  - Darsala/english_georgian_corpora
language:
  - ka
  - en
metrics:
  - comet
  - bleu
  - chrf
pipeline_tag: translation
tags:
  - translation
  - Georgian
  - NMT
  - MT
  - encoder-decoder
base_model: bert-base-uncased
model-index:
  - name: Georgian-Translation
    results:
      - task:
          type: translation
          name: Machine Translation
        dataset:
          name: FLORES Test Set
          type: flores
        metrics:
          - type: comet
            value: 0.79
            name: COMET Score
---
# Georgian Translation Model
## Model Description
This is an English-to-Georgian neural machine translation model developed as part of a bachelor thesis project. The model uses an encoder-decoder architecture with a pretrained BERT encoder and a randomly initialized decoder.
## Architecture
- **Model Type**: Encoder-Decoder
- **Encoder**: Pretrained BERT model
- **Decoder**: Randomly initialized with custom configuration
- **Decoder Tokenizer**: `RichNachos/georgian-corpus-tokenizer-test`
- **Parameters**: 266M total parameters
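To illustrate how such a model can be assembled, here is a minimal sketch using the `transformers` `EncoderDecoderModel` API. The config sizes below are deliberately small and illustrative, not the actual ones used in this model (which pairs a pretrained `bert-base-uncased` encoder with a randomly initialized decoder):

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Illustrative small configs; the real model uses a pretrained BERT encoder
# loaded from a checkpoint and a larger, randomly initialized decoder.
encoder_cfg = BertConfig(
    vocab_size=30522, hidden_size=256, num_hidden_layers=4,
    num_attention_heads=4, intermediate_size=512,
)
decoder_cfg = BertConfig(
    vocab_size=50000, hidden_size=256, num_hidden_layers=4,
    num_attention_heads=4, intermediate_size=512,
    is_decoder=True, add_cross_attention=True,
)

# Combine the two configs and instantiate; both halves are randomly
# initialized here, whereas the real encoder comes from a pretrained BERT.
config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_cfg, decoder_cfg)
model = EncoderDecoderModel(config=config)

n_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {n_params / 1e6:.1f}M")
```

In practice the pretrained encoder would be loaded with `EncoderDecoderModel.from_encoder_decoder_pretrained` or by assigning a pretrained `BertModel` as the encoder.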
## Training Details
- **Training Data**: English-Georgian parallel corpus (see [Darsala/english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora))
- **Training Duration**: 16 epochs
- **Hardware**: Nvidia A100 80GB
- **Batch Size**: 128 with 2 gradient accumulation steps
- **Scheduler**: Cosine learning rate scheduler
- **Training Pipeline**: Complete data cleaning, preprocessing, and augmentation pipeline applied before training
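For reference, a standard cosine learning-rate schedule decays from a peak value to roughly zero over training; a small sketch of that formula follows. The peak learning rate and warmup settings below are illustrative assumptions, not values stated in this card:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 5e-5) -> float:
    """Standard cosine decay from peak_lr down to ~0 over total_steps."""
    progress = step / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Effective batch size: 128 per optimizer step x 2 gradient accumulation steps
effective_batch = 128 * 2  # = 256

print(cosine_lr(0, 1000))     # peak LR at the start of training
print(cosine_lr(500, 1000))   # half the peak LR midway through
print(cosine_lr(1000, 1000))  # ~0 at the end
```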
## Performance
- **COMET Score**: 0.79 (on FLORES test set)
- **Comparison**: Google Translate scores 0.83 and Kona scores 0.84 on the same test set
- **Translation Style**: More literary and natural Georgian compared to Google Translate
## Usage
**Important**: This model uses a custom `EncoderDecoderTokenizer` that is included in the repository, so you need to download the repository locally before loading the model.
```python
import re
import sys

import torch
from huggingface_hub import snapshot_download
from transformers import EncoderDecoderModel

# Download the repo to a local folder
path_to_downloaded = snapshot_download(
    repo_id="Darsala/Georgian-Translation",
    local_dir="./Georgian-Translation",
    local_dir_use_symlinks=False,
)

# Add the downloaded folder to the Python path so the custom tokenizer can be imported
sys.path.append(path_to_downloaded)
from encoder_decoder_tokenizer import EncoderDecoderTokenizer

# Load the model and tokenizer from the downloaded folder
model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def translate(
    text: str,
    num_beams: int = 5,
    max_length: int = 256,
) -> str:
    """Translate a single English string to Georgian."""
    # The encoder is an uncased BERT, so lowercase and normalize whitespace
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()

    # Tokenize and move to the model's device
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest",
    ).to(device)

    # Beam-search generation
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True,
    )

    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"English: {text}")
    print(f"Translated: {output}")
    return output


# Example usage
translation = translate("Hello, how are you?")
```
## Strengths and Limitations
### Strengths
- Produces more literary and natural Georgian translations
- Good performance on general text translation
- Specialized for Georgian language characteristics
### Limitations
- Struggles with proper names and company names
- Issues with terms requiring direct English text copying
- Limited by tokenizer coverage for certain English terms
## Demo
Try the model in the interactive demo: [Georgian Translation Space](https://huggingface.co/spaces/Darsala/Georgian-Translation)
## Citation
```bibtex
@mastersthesis{darsalia2025georgian,
  title  = {English Translation Quality Assessment and Computer Translation},
  author = {Luka Darsalia},
  year   = {2025},
  school = {Tbilisi University},
  note   = {Bachelor's Thesis - Computer Science}
}
```
## Related Resources
- **Training Data**: [english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora)
- **Georgian COMET Model**: [georgian_comet](https://huggingface.co/Darsala/georgian_comet)
- **Evaluation Data**: [georgian_metric_evaluation](https://huggingface.co/datasets/Darsala/georgian_metric_evaluation)