Commit 0e528c7
Parent(s): cf5dfba
v3 update
- README.md +15 -13
- generation_config.json +2 -2
- model.safetensors +1 -1
- src/train_t5.py +52 -16
- tokenizer.json +16 -2
README.md
CHANGED
|
@@ -22,16 +22,16 @@ model-index:
|
|
| 22 |
metrics:
|
| 23 |
- name: Training Loss
|
| 24 |
type: loss
|
| 25 |
-
value: 3.
|
| 26 |
- name: Evaluation Loss
|
| 27 |
type: loss
|
| 28 |
-
value: 3.
|
| 29 |
- name: CER
|
| 30 |
type: cer
|
| 31 |
-
value: 0.
|
| 32 |
- name: Exact Match
|
| 33 |
type: accuracy
|
| 34 |
-
value: 0.
|
| 35 |
---
|
| 36 |
# AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration ♰
|
| 37 |
|
|
@@ -44,8 +44,8 @@ model-index:
|
|
| 44 |
- Less stable on very long or morphologically complex words
|
| 45 |
|
| 46 |
> Development information
|
| 47 |
-
> - 🚧 **Current version:**
|
| 48 |
-
> - ⏳ **Upcoming release:**
|
| 49 |
|
| 50 |
---
|
| 51 |
|
|
@@ -116,14 +116,14 @@ Given the total size of the datasets, they haven't been included in this model's
|
|
| 116 |
3. Run `generate_clean_corpus.sh` to clean the West and East Syriac corpora files and shuffle the datasets
|
| 117 |
4. Run `train_tokeniser.py` to train the tokeniser on the cleaned corpora
|
| 118 |
|
| 119 |
-
The model training process follows the curriculum learning format and is comprised of
|
| 120 |
|
| 121 |
- **Stage 1:** Fine-tune a baseline T5 model on ~20k records (by default) consisting of shorter sentences. By the end of stage 1, the model should be uploaded to Hugging Face (HF) Hub
|
| 122 |
- **Stage 2:** Fine-tune the HF Hub-hosted model on ~40k records (by default) consisting of a mix of shorter and longer sentences. This new model should overwrite the one available on HF Hub or be uploaded to a new repository
|
| 123 |
- **Stage 3:** Fine-tune the HF Hub-hosted model on ~60k records (by default) with a considerable sentence length limit; the dataset also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub or be uploaded to a new repository
|
| 124 |
-
- **Stage 4:** Fine-tune the HF Hub-hosted model on ~
|
| 125 |
-
- **Stage 5:** Fine-tune the HF Hub-hosted model on ~
|
| 126 |
-
|
| 127 |
|
| 128 |
To do a stage 1-based training run, just run the script directly from your IDE or use the following command:
|
| 129 |
|
|
@@ -131,14 +131,14 @@ To do a stage 1-based training run, just run the script directly from your IDE o
|
|
| 131 |
uv run python src/train_t5.py --stage 1
|
| 132 |
```
|
| 133 |
|
| 134 |
-
For stages 2 to
|
| 135 |
|
| 136 |
|
| 137 |
```python
|
| 138 |
uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
|
| 139 |
```
|
| 140 |
|
| 141 |
-
\* *Remember to replace the '2' in the command with '3' for stage 3
|
| 142 |
|
| 143 |
|
| 144 |
> **Observation:** Model files are saved in `src/checkpoints/stage{n}-final`, where `n` corresponds to the stage used in model fine-tuning
|
|
@@ -151,4 +151,6 @@ uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
|
|
| 151 |
|
| 152 |
* **AramT5 v1 (May 16, 2026):** AramT5 fine-tuned on 40k records, across 20 epochs, leveraging the stage 2 configuration. The first proper model iteration, it exhibited noticeable improvements in training and evaluation losses, as well as CER and exact word match percentage. Furthermore, this version was the first that was shown to be capable of capturing some word roots and morphemes
|
| 153 |
|
| 154 |
-
* **AramT5 v2 (May 17, 2026):** AramT5 fine-tuned on 60k records, across 15 epochs, leveraging the stage 3 configuration. A noticeable evolution from v1, it showcased better transliteration capabilities by handling more complex sentences
|
| 22 |
metrics:
|
| 23 |
- name: Training Loss
|
| 24 |
type: loss
|
| 25 |
+
value: 3.5204
|
| 26 |
- name: Evaluation Loss
|
| 27 |
type: loss
|
| 28 |
+
value: 3.7236
|
| 29 |
- name: CER
|
| 30 |
type: cer
|
| 31 |
+
value: 0.2760
|
| 32 |
- name: Exact Match
|
| 33 |
type: accuracy
|
| 34 |
+
value: 0.3655
|
| 35 |
---
|
| 36 |
# AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration ♰
|
| 37 |
|
|
|
|
| 44 |
- Less stable on very long or morphologically complex words
|
| 45 |
|
| 46 |
> Development information
|
| 47 |
+
> - 🚧 **Current version:** v3 (stage 4)
|
| 48 |
+
> - ⏳ **Upcoming release:** v4 (stage 5)
|
| 49 |
|
| 50 |
---
|
| 51 |
|
|
|
|
| 116 |
3. Run `generate_clean_corpus.sh` to clean the West and East Syriac corpora files and shuffle the datasets
|
| 117 |
4. Run `train_tokeniser.py` to train the tokeniser on the cleaned corpora
|
| 118 |
|
| 119 |
+
The model training process follows the curriculum learning format and comprises 6 stages:
|
| 120 |
|
| 121 |
- **Stage 1:** Fine-tune a baseline T5 model on ~20k records (by default) consisting of shorter sentences. By the end of stage 1, the model should be uploaded to Hugging Face (HF) Hub
|
| 122 |
- **Stage 2:** Fine-tune the HF Hub-hosted model on ~40k records (by default) consisting of a mix of shorter and longer sentences. This new model should overwrite the one available on HF Hub or be uploaded to a new repository
|
| 123 |
- **Stage 3:** Fine-tune the HF Hub-hosted model on ~60k records (by default) with a considerable sentence length limit; the dataset also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub or be uploaded to a new repository
|
| 124 |
+
- **Stage 4:** Fine-tune the HF Hub-hosted model on ~80k records (by default) with a long sentence limit. This stage acts as a bridge between stages 3 and 5. Once more, this new model should overwrite the one available on HF Hub or be uploaded to a new repository
|
| 125 |
+
- **Stage 5:** Fine-tune the HF Hub-hosted model on ~100k records (by default) with a very long sentence limit, which exposes the model to huge sentences in the corpus. The dataset also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub or be uploaded to a new repository
|
| 126 |
+
- **Stage 6:** Fine-tune the HF Hub-hosted model on ~120k records (by default) with a very long sentence limit, which exposes the model to the full practical corpus. The dataset also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub or be uploaded to a new repository.
|
| 127 |
|
| 128 |
To do a stage 1-based training run, just run the script directly from your IDE or use the following command:
|
| 129 |
|
|
|
|
| 131 |
uv run python src/train_t5.py --stage 1
|
| 132 |
```
|
| 133 |
|
| 134 |
+
For stages 2 to 6, use the following command instead:
|
| 135 |
|
| 136 |
|
| 137 |
```python
|
| 138 |
uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
|
| 139 |
```
|
| 140 |
|
| 141 |
+
\* *Remember to replace the '2' in the command with the number of the stage you are running (e.g. '3' for stage 3).*
|
| 142 |
|
| 143 |
|
| 144 |
> **Observation:** Model files are saved in `src/checkpoints/stage{n}-final`, where `n` corresponds to the stage used in model fine-tuning
|
|
|
|
| 151 |
|
| 152 |
* **AramT5 v1 (May 16, 2026):** AramT5 fine-tuned on 40k records, across 20 epochs, leveraging the stage 2 configuration. The first proper model iteration, it exhibited noticeable improvements in training and evaluation losses, as well as CER and exact word match percentage. Furthermore, this version was the first that was shown to be capable of capturing some word roots and morphemes
|
| 153 |
|
| 154 |
+
* **AramT5 v2 (May 17, 2026):** AramT5 fine-tuned on 60k records, across 15 epochs, leveraging the stage 3 configuration. A noticeable evolution from v1, it showcased better transliteration capabilities by handling more complex sentences and words, also exhibiting lower CER and losses, and a noticeably higher exact word match percentage
|
| 155 |
+
|
| 156 |
+
* **AramT5 v3 (May 17, 2026):** AramT5 fine-tuned on 80k records, across 15 epochs, leveraging the stage 4 configuration. A further refinement of the model and an evolution of v2, this version was able to capture several more complex word morphologies, with most mistakes occurring due to bad vowel placement or incorrect word endings. Additionally, loss and CER values decreased further, while exact word match percentage increased
|
generation_config.json
CHANGED
|
@@ -1,9 +1,9 @@
|
|
| 1 |
{
|
| 2 |
"decoder_start_token_id": 0,
|
| 3 |
"eos_token_id": 3,
|
| 4 |
-
"max_length":
|
| 5 |
"no_repeat_ngram_size": 2,
|
| 6 |
"pad_token_id": 0,
|
| 7 |
-
"repetition_penalty": 1.
|
| 8 |
"transformers_version": "4.51.2"
|
| 9 |
}
|
|
|
|
| 1 |
{
|
| 2 |
"decoder_start_token_id": 0,
|
| 3 |
"eos_token_id": 3,
|
| 4 |
+
"max_length": 64,
|
| 5 |
"no_repeat_ngram_size": 2,
|
| 6 |
"pad_token_id": 0,
|
| 7 |
+
"repetition_penalty": 1.2,
|
| 8 |
"transformers_version": "4.51.2"
|
| 9 |
}
|
model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 258368552
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:0f0c999955d4e7358b5d14b20313b9f7a8b3a7529def26a7a39c4dbbcf92b02c
|
| 3 |
size 258368552
|
src/train_t5.py
CHANGED
|
@@ -5,13 +5,13 @@ Implements a staged training approach inspired by human learning:
|
|
| 5 |
- Stage 1: Train on short sequences (20k records, ≤15 chars) → baseline model
|
| 6 |
- Stage 2: Medium-short sequences (40k records, ≤30 chars)
|
| 7 |
- Stage 3: Medium-long sequences (60k records, ≤50 chars)
|
| 8 |
-
- Stage 4:
|
| 9 |
-
- Stage 5:
|
|
|
|
| 10 |
|
| 11 |
Features:
|
| 12 |
- Curriculum learning: short → long sequences
|
| 13 |
- Catastrophic forgetting mitigation: mixes short examples into later stages
|
| 14 |
-
- Middle-range oversampling in stage 4 to bridge word→sentence gap
|
| 15 |
- Character Error Rate (CER) evaluation for transliteration quality
|
| 16 |
- Early stopping based on validation loss improvement threshold
|
| 17 |
"""
|
|
@@ -66,21 +66,30 @@ STAGE_CONFIGS = {
|
|
| 66 |
"learning_rate": 9e-5,
|
| 67 |
},
|
| 68 |
4: {
|
| 69 |
-
"description": "
|
| 70 |
-
"num_samples":
|
| 71 |
"max_src_length": 100,
|
| 72 |
"short_mix_ratio": 0.10, # 10% short examples
|
| 73 |
-
"
|
| 74 |
-
"num_epochs": 12,
|
| 75 |
"learning_rate": 7e-5,
|
| 76 |
},
|
| 77 |
5: {
|
| 78 |
-
"description": "
|
| 79 |
-
"num_samples":
|
| 80 |
-
"max_src_length":
|
| 81 |
"short_mix_ratio": 0.10, # 10% short examples
|
| 82 |
-
"num_epochs":
|
| 83 |
"learning_rate": 5e-5,
|
|
|
|
| 84 |
},
|
| 85 |
}
|
| 86 |
|
|
@@ -95,8 +104,8 @@ def parse_args():
|
|
| 95 |
"--stage",
|
| 96 |
type=int,
|
| 97 |
default=1,
|
| 98 |
-
choices=[1, 2, 3, 4, 5],
|
| 99 |
-
help="Training stage (1=baseline, 2=medium-short, 3=medium-long, 4=
|
| 100 |
)
|
| 101 |
parser.add_argument(
|
| 102 |
"--hf-model",
|
|
@@ -472,9 +481,26 @@ def create_compute_metrics(tokeniser):
|
|
| 472 |
f" Avg pred len: {np.mean(pred_lens):.1f}, Avg label len: {np.mean(label_lens):.1f}"
|
| 473 |
)
|
| 474 |
|
| 475 |
return {
|
| 476 |
"cer": np.mean(cer_scores),
|
| 477 |
"exact_match": np.mean(exact_matches),
|
| 478 |
}
|
| 479 |
|
| 480 |
return compute_metrics
|
|
@@ -547,16 +573,26 @@ def train(args):
|
|
| 547 |
greater_is_better=False,
|
| 548 |
report_to="none",
|
| 549 |
predict_with_generate=True,
|
| 550 |
-
generation_max_length=
|
| 551 |
)
|
| 552 |
|
| 553 |
# Configure generation to prevent repetition and control length
|
| 554 |
# Using greedy decoding with repetition penalty (no beam search)
|
| 555 |
-
|
|
|
|
| 556 |
model.generation_config.no_repeat_ngram_size = 2 # Prevent bigram repetition
|
| 557 |
-
|
|
|
|
|
|
|
| 558 |
model.generation_config.eos_token_id = tokeniser.eos_token_id
|
| 559 |
model.generation_config.pad_token_id = tokeniser.pad_token_id
|
| 560 |
|
| 561 |
# Callbacks
|
| 562 |
callbacks = []
|
|
|
|
| 5 |
- Stage 1: Train on short sequences (20k records, ≤15 chars) → baseline model
|
| 6 |
- Stage 2: Medium-short sequences (40k records, ≤30 chars)
|
| 7 |
- Stage 3: Medium-long sequences (60k records, ≤50 chars)
|
| 8 |
+
- Stage 4: Longer sequences (80k records, ≤100 chars)
|
| 9 |
+
- Stage 5: Extended sequences (100k records, ≤200 chars)
|
| 10 |
+
- Stage 6: Full practical corpus (120k records, ≤300 chars) — filters document outliers
|
| 11 |
|
| 12 |
Features:
|
| 13 |
- Curriculum learning: short → long sequences
|
| 14 |
- Catastrophic forgetting mitigation: mixes short examples into later stages
|
|
|
|
| 15 |
- Character Error Rate (CER) evaluation for transliteration quality
|
| 16 |
- Early stopping based on validation loss improvement threshold
|
| 17 |
"""
|
|
|
|
| 66 |
"learning_rate": 9e-5,
|
| 67 |
},
|
| 68 |
4: {
|
| 69 |
+
"description": "Extension: longer sequences (natural distribution)",
|
| 70 |
+
"num_samples": 80_000,
|
| 71 |
"max_src_length": 100,
|
| 72 |
"short_mix_ratio": 0.10, # 10% short examples
|
| 73 |
+
"num_epochs": 15,
|
|
|
|
| 74 |
"learning_rate": 7e-5,
|
| 75 |
},
|
| 76 |
5: {
|
| 77 |
+
"description": "Extension: even longer sequences (almost full practical corpus)",
|
| 78 |
+
"num_samples": 100_000,
|
| 79 |
+
"max_src_length": 200, # Filter out document-length outliers
|
| 80 |
+
"short_mix_ratio": 0.10, # 10% short examples
|
| 81 |
+
"num_epochs": 15,
|
| 82 |
+
"learning_rate": 6e-5, # Gradual decrease from stage 4's 7e-5
|
| 83 |
+
"repetition_penalty": 1.2, # Prevent repetitive outputs
|
| 84 |
+
},
|
| 85 |
+
6: {
|
| 86 |
+
"description": "Full practical corpus: sentences and short paragraphs",
|
| 87 |
+
"num_samples": 120_000,
|
| 88 |
+
"max_src_length": 300, # Filter out document-length outliers
|
| 89 |
"short_mix_ratio": 0.10, # 10% short examples
|
| 90 |
+
"num_epochs": 12,
|
| 91 |
"learning_rate": 5e-5,
|
| 92 |
+
"repetition_penalty": 1.2, # Prevent repetitive outputs
|
| 93 |
},
|
| 94 |
}
|
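How `num_samples`, `max_src_length` and `short_mix_ratio` might combine when a stage's training set is drawn can be sketched as follows. This is a minimal illustration, not the sampling code in `train_t5.py`, which this diff does not show. The helper `build_stage_sample`, the `short_max_length=15` cut-off (borrowed from the stage 1 limit in the docstring) and the fixed seed are all assumptions.

```python
import random


def build_stage_sample(pairs, num_samples, max_src_length,
                       short_mix_ratio, short_max_length=15, seed=0):
    """Draw a curriculum sample (hypothetical helper, not from train_t5.py).

    pairs: list of (source, target) strings.
    A fixed share of short examples is mixed in so that earlier-stage
    behaviour keeps being rehearsed; the rest comes from longer sources
    that still fit under max_src_length.
    """
    rng = random.Random(seed)
    short = [p for p in pairs if len(p[0]) <= short_max_length]
    longer = [p for p in pairs
              if short_max_length < len(p[0]) <= max_src_length]

    # Never request more items than a pool actually holds.
    n_short = min(round(num_samples * short_mix_ratio), len(short))
    n_long = min(num_samples - n_short, len(longer))

    sample = rng.sample(short, n_short) + rng.sample(longer, n_long)
    rng.shuffle(sample)
    return sample


# Toy corpus: one pair per source length from 1 to 200 characters.
corpus = [("x" * n, "y" * n) for n in range(1, 201)]
stage4_like = build_stage_sample(corpus, num_samples=50,
                                 max_src_length=100, short_mix_ratio=0.10)
print(len(stage4_like), sum(len(s) <= 15 for s, _ in stage4_like))
```

Keeping the short and longer pools disjoint makes the share of short examples exactly `short_mix_ratio` whenever both pools are large enough. Whether the real script samples this way is not visible here.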
| 95 |
|
|
|
|
| 104 |
"--stage",
|
| 105 |
type=int,
|
| 106 |
default=1,
|
| 107 |
+
choices=[1, 2, 3, 4, 5, 6],
|
| 108 |
+
help="Training stage (1=baseline, 2=medium-short, 3=medium-long, 4=longer, 5=extended, 6=full practical)",
|
| 109 |
)
|
| 110 |
parser.add_argument(
|
| 111 |
"--hf-model",
|
|
|
|
| 481 |
f" Avg pred len: {np.mean(pred_lens):.1f}, Avg label len: {np.mean(label_lens):.1f}"
|
| 482 |
)
|
| 483 |
|
| 484 |
+
# Compute length ratio penalty (penalise under-generation)
|
| 485 |
+
# Ratio < 1 means output is shorter than target
|
| 486 |
+
length_ratios = [
|
| 487 |
+
len(pred) / max(len(target), 1)
|
| 488 |
+
for pred, target in zip(pred_strs, label_strs)
|
| 489 |
+
]
|
| 490 |
+
# Penalty: how much shorter outputs are on average (0 = perfect, higher = worse)
|
| 491 |
+
# Only penalise under-generation (ratio < 1), not over-generation
|
| 492 |
+
length_penalties = [
|
| 493 |
+
max(0, 1 - ratio) for ratio in length_ratios
|
| 494 |
+
]
|
| 495 |
+
avg_length_penalty = np.mean(length_penalties)
|
| 496 |
+
avg_length_ratio = np.mean(length_ratios)
|
| 497 |
+
print(f" Avg length ratio: {avg_length_ratio:.3f}, Avg length penalty: {avg_length_penalty:.3f}")
|
| 498 |
+
|
| 499 |
return {
|
| 500 |
"cer": np.mean(cer_scores),
|
| 501 |
"exact_match": np.mean(exact_matches),
|
| 502 |
+
"length_ratio": avg_length_ratio,
|
| 503 |
+
"length_penalty": avg_length_penalty,
|
| 504 |
}
|
| 505 |
|
| 506 |
return compute_metrics
|
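The length-ratio metrics added to `compute_metrics` above can be checked in isolation. The sketch below assumes plain Python lists in place of the decoded `pred_strs` and `label_strs`. The helper name `length_metrics` and the sample strings are hypothetical and not part of `train_t5.py`.

```python
from statistics import mean


def length_metrics(pred_strs, label_strs):
    """Return (avg_length_ratio, avg_length_penalty) for decoded strings."""
    # Ratio < 1 means the prediction is shorter than its target.
    ratios = [len(pred) / max(len(target), 1)
              for pred, target in zip(pred_strs, label_strs)]
    # Only under-generation is penalised; over-generation scores 0.
    penalties = [max(0.0, 1.0 - ratio) for ratio in ratios]
    return mean(ratios), mean(penalties)


# One prediction is half the target length, the other matches it exactly.
ratio, penalty = length_metrics(["shl", "shlama"], ["shlama", "shlama"])
print(f"ratio={ratio:.3f} penalty={penalty:.3f}")
```

Because over-long predictions contribute a penalty of 0 but still raise the average ratio, the two numbers are best read together: a ratio near 1 with a non-zero penalty indicates a mix of over- and under-generation.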
|
|
|
| 573 |
greater_is_better=False,
|
| 574 |
report_to="none",
|
| 575 |
predict_with_generate=True,
|
| 576 |
+
generation_max_length=128 if args.stage >= 5 else (64 if args.stage >= 4 else 24),
|
| 577 |
)
|
| 578 |
|
| 579 |
# Configure generation to prevent repetition and control length
|
| 580 |
# Greedy decoding with repetition penalty by default; beam search is enabled only for stages 5+
|
| 581 |
+
gen_max_len = 128 if args.stage >= 5 else (64 if args.stage >= 4 else 24)
|
| 582 |
+
model.generation_config.max_length = gen_max_len
|
| 583 |
model.generation_config.no_repeat_ngram_size = 2 # Prevent bigram repetition
|
| 584 |
+
# Use stage-specific repetition penalty if defined, else default
|
| 585 |
+
rep_penalty = stage_config.get("repetition_penalty", 1.5 if args.stage < 4 else 1.3)
|
| 586 |
+
model.generation_config.repetition_penalty = rep_penalty
|
| 587 |
model.generation_config.eos_token_id = tokeniser.eos_token_id
|
| 588 |
model.generation_config.pad_token_id = tokeniser.pad_token_id
|
| 589 |
+
# Minimum length to discourage under-generation in later stages
|
| 590 |
+
if args.stage >= 5:
|
| 591 |
+
model.generation_config.min_length = 3 # At least a few tokens
|
| 592 |
+
# Use beam search with length_penalty to encourage full-length outputs
|
| 593 |
+
model.generation_config.num_beams = 4
|
| 594 |
+
model.generation_config.length_penalty = 1.2 # > 0.0 promotes longer sequences (applies to beam search only)
|
| 595 |
+
model.generation_config.early_stopping = True
|
| 596 |
|
| 597 |
# Callbacks
|
| 598 |
callbacks = []
|
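The stage-dependent generation settings in the `train` hunk above can be summarised per stage with a small stand-alone function. This is a sketch under two assumptions: indentation is lost in this diff view, so `min_length` and the three beam-search settings are all taken to sit inside the `if args.stage >= 5` branch, and stage configs that do not show a `repetition_penalty` key are taken to have none. The function name `generation_settings` is hypothetical.

```python
def generation_settings(stage, stage_config):
    """Mirror the stage-dependent generation logic from the diff (sketch)."""
    settings = {
        "max_length": 128 if stage >= 5 else (64 if stage >= 4 else 24),
        "no_repeat_ngram_size": 2,
        # A stage-specific value wins; otherwise fall back by stage.
        "repetition_penalty": stage_config.get(
            "repetition_penalty", 1.5 if stage < 4 else 1.3
        ),
    }
    if stage >= 5:
        # Assumed to be grouped under the same condition in the script.
        settings.update(min_length=3, num_beams=4,
                        length_penalty=1.2, early_stopping=True)
    return settings


# Only the key relevant here is modelled; the other config entries are omitted.
configs = {3: {}, 4: {}, 5: {"repetition_penalty": 1.2}}
for stage, cfg in configs.items():
    print(stage, generation_settings(stage, cfg))
```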
tokenizer.json
CHANGED
|
@@ -1,7 +1,21 @@
|
|
| 1 |
{
|
| 2 |
"version": "1.0",
|
| 3 |
-
"truncation":
|
| 4 |
-
|
| 5 |
"added_tokens": [
|
| 6 |
{
|
| 7 |
"id": 0,
|
|
|
|
| 1 |
{
|
| 2 |
"version": "1.0",
|
| 3 |
+
"truncation": {
|
| 4 |
+
"direction": "Right",
|
| 5 |
+
"max_length": 128,
|
| 6 |
+
"strategy": "LongestFirst",
|
| 7 |
+
"stride": 0
|
| 8 |
+
},
|
| 9 |
+
"padding": {
|
| 10 |
+
"strategy": {
|
| 11 |
+
"Fixed": 128
|
| 12 |
+
},
|
| 13 |
+
"direction": "Right",
|
| 14 |
+
"pad_to_multiple_of": null,
|
| 15 |
+
"pad_id": 0,
|
| 16 |
+
"pad_type_id": 0,
|
| 17 |
+
"pad_token": "<pad>"
|
| 18 |
+
},
|
| 19 |
"added_tokens": [
|
| 20 |
{
|
| 21 |
"id": 0,
|