---
license: mit
datasets:
- Darsala/english_georgian_corpora
language:
- ka
- en
metrics:
- comet
- bleu
- chrf
pipeline_tag: translation
tags:
- translation
- Georgian
- NMT
- MT
- encoder-decoder
model-index:
- name: Georgian-Translation
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: FLORES Test Set
      type: flores
    metrics:
    - type: comet
      value: 0.79
      name: COMET Score
base_model: bert-base-uncased
---

# Georgian Translation Model

## Model Description

This is an English-to-Georgian neural machine translation model developed as part of a bachelor's thesis project. The model uses an encoder-decoder architecture with a pretrained BERT encoder and a randomly initialized decoder.

## Architecture

- **Model Type**: Encoder-Decoder
- **Encoder**: Pretrained BERT model
- **Decoder**: Randomly initialized with a custom configuration
- **Decoder Tokenizer**: `RichNachos/georgian-corpus-tokenizer-test`
- **Parameters**: 266M total

## Training Details

- **Training Data**: English-Georgian parallel corpus (see [Darsala/english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora))
- **Training Duration**: 16 epochs
- **Hardware**: NVIDIA A100 80GB
- **Batch Size**: 128 with 2 gradient accumulation steps
- **Scheduler**: Cosine learning-rate scheduler
- **Training Pipeline**: Complete data cleaning, preprocessing, and augmentation pipeline

## Performance

- **COMET Score**: 0.79 on the FLORES test set
- **Comparison**: Google Translate scores 0.83 and Kona 0.84 on the same dataset
- **Translation Style**: Produces more literary and natural Georgian than Google Translate

## Usage

**Important**: This model uses a custom `EncoderDecoderTokenizer` that is included in the repository, so you need to download the repo locally to import it.
```python
import re
import sys

import torch
from huggingface_hub import snapshot_download
from transformers import EncoderDecoderModel

# Download the repo to a local folder
path_to_downloaded = snapshot_download(
    repo_id="Darsala/Georgian-Translation",
    local_dir="./Georgian-Translation",
    local_dir_use_symlinks=False,
)

# Add the downloaded folder to the Python path so we can import the custom tokenizer
sys.path.append(path_to_downloaded)
from encoder_decoder_tokenizer import EncoderDecoderTokenizer

# Load the model and tokenizer from the downloaded folder
model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def translate(
    text: str,
    num_beams: int = 5,
    max_length: int = 256,
) -> str:
    """Translate a single English string with the EncoderDecoderModel."""
    # The model expects lowercased, whitespace-normalized input
    text = text.lower()
    text = re.sub(r"\s+", " ", text)

    # Tokenize and move tensors to the model's device
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest",
    ).to(device)

    # Beam-search generation
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True,
    )

    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"English: {text}")
    print(f"Translated: {output}")
    return output


# Example usage
translation = translate("Hello, how are you?")
```
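One detail worth calling out: because the encoder is a lowercase BERT, the `translate` helper above lowercases and whitespace-normalizes its input before tokenization. The same step is shown below as a standalone sketch (the function name `preprocess` is illustrative, not part of the repository):

```python
import re


def preprocess(text: str) -> str:
    """Mirror the normalization applied before tokenization:
    lowercase the English input and collapse runs of whitespace."""
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Hello,   how\nare you?"))  # → "hello, how are you?"
```

Skipping this step can degrade translations, since cased or irregularly spaced input was never seen during training.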
## Strengths and Limitations

### Strengths

- Produces more literary and natural Georgian translations
- Good performance on general text translation
- Specialized for the characteristics of the Georgian language

### Limitations

- Struggles with proper names and company names
- Has trouble with terms that must be copied verbatim from English
- Limited by the decoder tokenizer's coverage of certain English terms

## Demo

Try the model in the interactive demo: [Georgian Translation Space](https://huggingface.co/spaces/Darsala/Georgian-Translation)

## Citation

```bibtex
@mastersthesis{darsalia2025georgian,
  title={English Translation Quality Assessment and Computer Translation},
  author={Luka Darsalia},
  year={2025},
  school={Tbilisi University},
  note={Bachelor's Thesis - Computer Science}
}
```

## Related Resources

- **Training Data**: [english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora)
- **Georgian COMET Model**: [georgian_comet](https://huggingface.co/Darsala/georgian_comet)
- **Evaluation Data**: [georgian_metric_evaluation](https://huggingface.co/datasets/Darsala/georgian_metric_evaluation)