	---
	license: mit
	datasets:
	- Darsala/english_georgian_corpora
	language:
	- ka
	- en
	metrics:
	- comet
	- bleu
	- chrf
	pipeline_tag: translation
	tags:
	- translation
	- Georgian
	- NMT
	- MT
	- encoder-decoder
	model-index:
	- name: Georgian-Translation
	  results:
	  - task:
	      type: translation
	      name: Machine Translation
	    dataset:
	      name: FLORES Test Set
	      type: flores
	    metrics:
	    - type: comet
	      value: 0.79
	      name: COMET Score
	base_model: bert-base-uncased
	---

	# Georgian Translation Model

	## Model Description

	This is an English-to-Georgian neural machine translation model developed as part of a bachelor's thesis project. The model uses an encoder-decoder architecture that pairs a pretrained BERT encoder with a randomly initialized decoder.

	## Architecture

	- Model Type: Encoder-Decoder
	- Encoder: Pretrained BERT model (`bert-base-uncased`, as listed under `base_model` in the metadata)
	- Decoder: Randomly initialized with custom configuration
	- Decoder Tokenizer: `RichNachos/georgian-corpus-tokenizer-test`
	- Parameters: 266M in total

	## Training Details

	- Training Data: English-Georgian parallel corpus (see [Darsala/english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora))
	- Training Duration: 16 epochs
	- Hardware: Nvidia A100 80GB
	- Batch Size: 128 with 2 gradient accumulation steps
	- Scheduler: Cosine learning rate scheduler
	- Training Pipeline: Complete data cleaning, preprocessing, and augmentation pipeline
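
To make the batch size and scheduler settings above concrete, here is a minimal sketch of a cosine learning rate schedule together with the effective batch size arithmetic. It is not the training code used for this model. The peak learning rate, the number of steps, and the absence of warmup are hypothetical values chosen for illustration, and it assumes that 128 is the per-step batch size (so gradient accumulation doubles it).

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    # Clamp progress to [0, 1] so out-of-range steps do not extrapolate.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Values taken from the model card.
per_step_batch_size = 128
grad_accum_steps = 2

# Assumption: 128 is the size of each forward/backward batch, so one
# optimizer update accumulates gradients over 128 * 2 examples.
effective_batch_size = per_step_batch_size * grad_accum_steps

# Hypothetical values, not reported in the model card.
peak_lr = 1e-4
total_steps = 1000

if __name__ == "__main__":
    print(f"Effective batch size: {effective_batch_size}")
    for step in (0, 250, 500, 750, 1000):
        print(f"step {step:>4}: lr = {cosine_lr(step, total_steps, peak_lr):.2e}")
```

The schedule starts at the peak rate, passes through half of it at the midpoint, and reaches the minimum at the final step. A warmup phase, if one was used in training, would be added in front of this curve.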

	## Performance

	- COMET Score: 0.79 (on FLORES test set)
	- Comparison: Google Translate (0.83) and Kona (0.84) on the same dataset
	- Translation Style: More literary and natural Georgian compared to Google Translate

	## Usage

	Important: This model uses a custom `EncoderDecoderTokenizer` that ships with the repository rather than with `transformers`. Download the repository locally before importing it.

	```python
	import re
	import sys

	import torch
	from huggingface_hub import snapshot_download
	from transformers import EncoderDecoderModel

	# Download the repo to a local folder
	path_to_downloaded = snapshot_download(
	    repo_id="Darsala/Georgian-Translation",
	    local_dir="./Georgian-Translation",
	    local_dir_use_symlinks=False,
	)

	# Add the downloaded folder to the Python path so the custom tokenizer can be imported
	sys.path.append(path_to_downloaded)
	from encoder_decoder_tokenizer import EncoderDecoderTokenizer

	# Load the model and tokenizer from the downloaded folder
	model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
	tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model.to(device)


	def translate(
	    text: str,
	    num_beams: int = 5,
	    max_length: int = 256,
	) -> str:
	    """
	    Translate a single string with the given EncoderDecoderModel.
	    """
	    # Normalize the input: lowercase and collapse whitespace runs
	    text = text.lower()
	    text = re.sub(r"\s+", " ", text)

	    # Tokenize and move the tensors to the model's device
	    inputs = tokenizer(
	        text,
	        return_tensors="pt",
	        truncation=True,
	        padding="longest",
	    ).to(device)

	    # Generate with beam search
	    generated_ids = model.generate(
	        input_ids=inputs.input_ids,
	        attention_mask=inputs.attention_mask,
	        num_beams=num_beams,
	        max_length=max_length,
	        early_stopping=True,
	    )

	    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
	    print(f"English: {text}")
	    print(f"Translated: {output}")

	    return output


	# Example usage
	translation = translate("Hello, how are you?")
	```
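
The input normalization inside `translate` can be pulled out into a standalone helper, which makes it easy to check what the model actually receives. The sketch below mirrors the two steps from the usage example (lowercasing and whitespace collapsing). The function name `normalize_source` and the final `strip()` are additions for illustration and are not part of the repository.

```python
import re


def normalize_source(text: str) -> str:
    """Lowercase the text and collapse whitespace runs into single spaces."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    # Assumption: trimming leading/trailing spaces is harmless for the model.
    # The original usage example does not do this.
    return text.strip()


if __name__ == "__main__":
    print(normalize_source("  Hello,\n\tHow   ARE you? "))
```

Because the input is lowercased before tokenization, casing information in the English source is not available to the model.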

	## Strengths and Limitations

	### Strengths
	- Produces more literary and natural Georgian translations
	- Good performance on general text translation
	- Specialized for Georgian language characteristics

	### Limitations
	- Struggles with proper names and company names
	- Issues with terms requiring direct English text copying
	- Limited by tokenizer coverage for certain English terms

	## Demo

	Try the model in the interactive demo: [Georgian Translation Space](https://huggingface.co/spaces/Darsala/Georgian-Translation)

	## Citation

	```bibtex
	@mastersthesis{darsalia2025georgian,
	  title={English Translation Quality Assessment and Computer Translation},
	  author={Luka Darsalia},
	  year={2025},
	  school={Tbilisi University},
	  note={Bachelor's Thesis - Computer Science}
	}
	```

	## Related Resources

	- Training Data: [english_georgian_corpora](https://huggingface.co/datasets/Darsala/english_georgian_corpora)
	- Georgian COMET Model: [georgian_comet](https://huggingface.co/Darsala/georgian_comet)
	- Evaluation Data: [georgian_metric_evaluation](https://huggingface.co/datasets/Darsala/georgian_metric_evaluation)