	---
	base_model: t5-small
	license: mit
	language: syc
	tags:
	- text2text-generation
	- transliteration
	- syriac
	- low-resource
	- cultural-nlp
	- t5
	pipeline_tag: text-generation
	model-index:
	- name: AramT5
	  results:
	  - task:
	      name: Transliteration
	      type: text-generation
	    dataset:
	      name: Syriac Transliteration Corpus
	      type: custom
	    metrics:
	    - name: Training Loss
	      type: loss
	      value: 1.9013
	    - name: Evaluation Loss
	      type: loss
	      value: 2.0293
	    - name: CER
	      type: cer
	      value: 0.1602
	    - name: Exact Match
	      type: accuracy
	      value: 0.6217
	---
	# AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration ♰

	AramT5 is a fine-tuned version of `t5-small`, trained to transliterate Syriac text into latinised Serto (West Syriac) and Madnḥaya (East Syriac).

	⚠️ Current Limitations

	- Occasional under-generation (shorter outputs than expected)
	- Occasional vowel omission or compression
	- Reliability varies on very long, uncommon, or morphologically complex words and sentences

	> Development information
	> - 🚧 Current version: v3.2 (stage 4)
	> - ⏳ Upcoming release: v4 (stage 5)

	> Note: Due to limited availability, I'm currently unlikely to be able to run stages 5 and 6. Until I have enough time to complete all stage runs, I'll focus on smaller, targeted improvements to polish the model.

	---

	## 🌍 About the Project

	AramT5 is a transformer model fine-tuned on Syriac-to-Latin data, with a focus on Serto and Madnḥaya. It performs script conversion, not translation, which makes it well suited to educational tools and linguistic preservation.

	This project:
	- Supports underrepresented languages in AI
	- Offers open access to transliteration tools for the Syriac language
	- Was created with humility, curiosity, and deep care by a Syriac learner and enthusiast

	---

	## 💻 Try it out

	Prefix the input with one of the following to select the output dialect:

	- `Syriac2WestLatin`
	- `Syriac2EastLatin`

	Then call the model through Hugging Face 🤗 Transformers:

	```python
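	from transformers import pipeline

	# Load the fine-tuned model from the Hugging Face Hub
	pipe = pipeline("text2text-generation", model="crossroderick/aramt5")

	text = "ܒܡܠܟܘܬܐ ܕܐܠܗܐ ܕܐܒܪܗܡ."
	# The prefix selects the output dialect (West Syriac here)
	input_text = f"Syriac2WestLatin: {text}"
	# The pipeline returns a list with one dict per input
	output = pipe(input_text, max_length=128)[0]["generated_text"]

	print(output)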
	```

	Example input and output:

	```
	Input:
	ܐܒܘܢ ܕܒܫܡܝܐ

	Output (West):
	ʾabun d-b-šmayo
	```
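
The prefix convention above can be wrapped in a small helper so that the dialect is chosen by name instead of by hand-typed string. This is a minimal sketch: the two prefixes come from this README, while `PREFIXES` and `build_prompt` are hypothetical names that are not part of the AramT5 codebase.

```python
# The two prefixes are the ones documented in this README.
# PREFIXES and build_prompt are hypothetical helper names.
PREFIXES = {
    "west": "Syriac2WestLatin",
    "east": "Syriac2EastLatin",
}


def build_prompt(text: str, dialect: str = "west") -> str:
    """Return the model input string for the requested output dialect."""
    if dialect not in PREFIXES:
        raise ValueError(f"dialect must be one of {sorted(PREFIXES)}")
    # Same "<prefix>: <text>" layout as the pipeline example above
    return f"{PREFIXES[dialect]}: {text.strip()}"
```

The returned string can be passed to the pipeline in place of `input_text`, for example `pipe(build_prompt(text, "east"), max_length=128)`.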

	---

	## 🙏 Acknowledgements

	Although it is an independent project, AramT5 relies on four important datasets:

	- The Syriac translation of the Bible (Peshitta), obtained from [OPUS' Bible dataset](https://opus.nlpl.eu/datasets/bible-uedin?pair=en&syr)
	- Syriac texts from the [Syriac Digital Corpus](https://syriaccorpus.org/index.html), containing writings from celebrated authors such as Isaac of Nineveh, Saint Ephrem the Syrian, and Aphrahat
	- Beth Mardutho's [Syriac Electronic Data Research Archive (SEDRA)](https://sedra.bethmardutho.org), a comprehensive online linguistic and literary database for the Syriac language
	- The Wikipedia dump of articles in the Aramaic (Syriac) language, obtained via the `wikiextractor` Python package

	---

	## 🤖 Fine-tuning instructions

	> Note: AramT5 as a project makes use of `uv` for project management - although not mandatory, installing it is highly encouraged

	Given their total size, the datasets are not included in this model's repository. To fine-tune AramT5 yourself, start with the following steps:

	1. Run the `get_data.sh` shell script file in the "src/data" folder

	> Observation: `get_data.sh` likely won't work on Windows. You can still get the data by following the links in the file and performing its steps manually. `generate_clean_corpus.sh` will also fail there, so you will need a Windows equivalent that filters blank or empty lines out of `syriac_east_corpus.jsonl` and `syriac_west_corpus.jsonl` and shuffles both files. Additionally, be sure to install the `wikiextractor` and `sentencepiece` packages beforehand (the exact versions are listed in `requirements.txt`).

	2. Run the `generate_syr_lat_pairs.py` file in the same folder
	3. Run `generate_clean_corpus.sh` to clean the West and East Syriac corpora files and shuffle the datasets
	4. Run `train_tokeniser.py` to train the tokeniser on the cleaned corpora
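
For platforms where `generate_clean_corpus.sh` cannot run, the cleaning step described above (drop blank lines, then shuffle) can be approximated in plain Python. This is a rough sketch, not a port of the script: `clean_and_shuffle` is a hypothetical name, the fixed seed is an assumption made for reproducibility, and the real script may do more than is shown here.

```python
import random
from pathlib import Path


def clean_and_shuffle(path: Path, seed: int = 42) -> int:
    """Drop blank lines from a JSONL file, shuffle the rest in place.

    Returns the number of lines kept. The seed is an assumption; the
    original shell script may shuffle differently.
    """
    lines = [
        line.strip()
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()  # skips empty and whitespace-only lines
    ]
    random.Random(seed).shuffle(lines)
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

It would be called once per corpus file, for example `clean_and_shuffle(Path("syriac_west_corpus.jsonl"))`, with the path adjusted to wherever the file sits in your checkout.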

	The model training process follows a curriculum learning format and comprises six stages:

	| Stage | Samples | Max. sentence length | Mixes shorter sentences | Objective |
	|-------|---------|----------------------|-------------------------|-----------|
	| 1 | 20000 | 15 | No | Expose the base T5 model to Syriac morphology |
	| 2 | 40000 | 30 | Yes | Introduce short sentences to AramT5 |
	| 3 | 60000 | 50 | Yes | Introduce medium sentences to AramT5 |
	| 4 | 120000 | 70 | Yes | Introduce longer sentences to AramT5 |
	| 5 | 150000 | 100 | Yes | Reinforce longer sentences to AramT5 |
	| 6 | 180000 | 150 | Yes | Introduce the full practical corpus to AramT5 |
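
To make the curriculum easier to reason about, the table can be written down as data. The sketch below copies the numbers from the table. The `Stage` class, the `STAGES` mapping and `fits_stage` are hypothetical names, and measuring sentence length in whitespace-separated words is an assumption, since the table does not state the unit. How `src/train_t5.py` actually applies these settings is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    samples: int         # number of training samples
    max_len: int         # maximum sentence length (unit assumed: words)
    mixes_shorter: bool  # whether shorter sentences are mixed in


# Values copied from the curriculum table in this README.
STAGES = {
    1: Stage(20_000, 15, False),
    2: Stage(40_000, 30, True),
    3: Stage(60_000, 50, True),
    4: Stage(120_000, 70, True),
    5: Stage(150_000, 100, True),
    6: Stage(180_000, 150, True),
}


def fits_stage(sentence: str, stage: int) -> bool:
    """True if the sentence is within the stage's length cap.

    Length is counted in whitespace-separated words, which is an
    assumption rather than the project's confirmed definition.
    """
    return len(sentence.split()) <= STAGES[stage].max_len
```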

	To run stage 1 training, run the script directly from your IDE or use the following command:

	```bash
	uv run python src/train_t5.py --stage 1
	```

	For stages 2 to 6, use the following command instead:


	```bash
	uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
	```

	\* Replace the `2` in the command with the stage you want to run, e.g. `3` for stage 3.

	> Observation: Model files are saved in `src/checkpoints/stage{n}-final`, where `n` is the stage used for fine-tuning
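
When several stages are run back to back, it can help to build the command and the expected checkpoint path programmatically. The sketch below only mirrors the two commands and the folder pattern given above. `train_command` and `checkpoint_dir` are hypothetical helpers, and treating `--hf-model` as mandatory for stages 2 to 6 is an assumption drawn from the example command, not from the script itself.

```python
from pathlib import Path
from typing import List, Optional


def train_command(stage: int, hf_model: Optional[str] = None) -> List[str]:
    """Build the `uv run` command for a given curriculum stage."""
    if not 1 <= stage <= 6:
        raise ValueError("stage must be between 1 and 6")
    cmd = ["uv", "run", "python", "src/train_t5.py", "--stage", str(stage)]
    if stage > 1:
        # Assumption: stages 2 to 6 always need a starting model
        if hf_model is None:
            raise ValueError("stages 2 to 6 expect an --hf-model value")
        cmd += ["--hf-model", hf_model]
    return cmd


def checkpoint_dir(stage: int) -> Path:
    """Folder where the README says the stage's model files are saved."""
    return Path("src/checkpoints") / f"stage{stage}-final"
```

The resulting list can be handed to `subprocess.run(train_command(1), check=True)` from the repository root.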

	To run a controlled fine-tuning pass on a specialised corpus, add to or replace the contents of `correction_dataset.jsonl` and run the corrections script:

	```bash
	uv run python finetune_corrections.py --epochs x --repetitions y
	```

	\* Remember to replace the `x` and `y` parameters with the desired number of epochs and item repetitions, respectively

	> Observation: Fine-tune correction run results are saved in the `src/checkpoints/correction/final` folder
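
One plausible reading of "item repetitions" is that each record in `correction_dataset.jsonl` is repeated a fixed number of times so that a small correction set carries more weight during training. The sketch below illustrates that reading only. It is an assumption, not the confirmed behaviour of `finetune_corrections.py`, `load_repeated` is a hypothetical name, and records are treated as opaque JSON objects because the file's field names are not documented here.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_repeated(path: Path, repetitions: int) -> List[Dict[str, Any]]:
    """Read a JSONL file and repeat every record `repetitions` times.

    Assumption: this is how "item repetitions" could work; the real
    script may implement it differently.
    """
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    records = [
        json.loads(line)
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()  # ignore blank lines
    ]
    # Each record appears `repetitions` times, in file order
    return [record for record in records for _ in range(repetitions)]
```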

	---

	## 📋 Version Changelog

	* AramT5 Baseline (May 20, 2026): Base `t5-small` model fine-tuned on 20k records, across 30 epochs, leveraging the stage 1 configuration. Baseline version with a surprisingly good initial understanding of how to transliterate properly, shown to capture some roots and Syriac morphology in a limited manner

	* AramT5 v1 (May 20, 2026): Fine-tuned on 40k records, across 20 epochs, leveraging the stage 2 configuration. A massive upgrade compared to the baseline version, v1 showcased significantly improved morphological handling of not only single words but also sequences with noticeable complexity

	* AramT5 v2 (May 20, 2026): Fine-tuned on 60k records, across 20 epochs, leveraging the stage 3 configuration. Making use of additional augmented data for atomic tokens, this version proved much more reliable at handling single-word input while exhibiting improvements in transliterating longer Syriac sentences

	* AramT5 v3 (May 21, 2026): Fine-tuned on 80k records, across 20 epochs, leveraging the stage 4 configuration. This version showcased even stronger transliteration capabilities for longer sentences, while retaining existing knowledge on multiple single words

	* AramT5 v3.1 (May 22, 2026): Fine-tuned on 120k records, across 20 epochs, leveraging the stage 4 configuration. Essentially a re-run or fine-tuning of v3, this version was trained on more data with a different distribution (and more manual entries) to leverage a more balanced mix between single words and multi-word phrases, culminating in a version that exhibits superior transliteration capabilities

	* AramT5 v3.2 (May 23, 2026): Fine-tuned on 120k records, across 10 epochs, leveraging the stage 4 configuration. A refinement of v3.1, this version leveraged corrected word forms, a more comprehensive manual vocabulary, and the addition of fully-vocalised and seyame-based plurals. As a result, the model corrected its understanding of various atomic words and learned a more comprehensive distinction between singular and plural words