v3 update

Files changed (5)

README.md +15 -13
generation_config.json +2 -2
model.safetensors +1 -1
src/train_t5.py +52 -16
tokenizer.json +16 -2

README.md CHANGED

@@ -22,16 +22,16 @@ model-index:
       metrics:
         - name: Training Loss
           type: loss
-          value: 3.4987
         - name: Evaluation Loss
           type: loss
-          value: 3.8796
         - name: CER
           type: cer
-          value: 0.3543
         - name: Exact Match
           type: accuracy
-          value: 0.2562
 ---
 # AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration ♰
@@ -44,8 +44,8 @@ model-index:
 - Less stable on very long or morphologically complex words
 > Development information
-> - 🚧 **Current version:** v2 (stage 3)
-> - ⏳ **Upcoming release:** v3 (stage 4)
 ---
@@ -116,14 +116,14 @@ Given the total size of the datasets, they haven't been included in this model's
 3. Run `generate_clean_corpus.sh` to clean the West and East Syriac corpora files and shuffle the datasets
 4. Run `train_tokeniser.py` to train the tokeniser on the cleaned corpora
-The model training process follows the curriculum learning format and is comprised of 5 stages:
 - **Stage 1:** Fine-tune a baseline T5 model on ~20k records (by default) consisting of shorter sentences. By the end of stage 1, the model should be uploaded to Hugging Face (HF) Hub
 - **Stage 2:** Fine-tune the HF Hub-hosted model on ~40k records (by default) consisting of a mix of shorter and longer sentences. This new model should overwrite the one available on HF Hub, or uploaded to a new repository
 - **Stage 3:** Fine-tune the HF Hub-hosted model on ~60k records (by default) with a considerable sentence length limit, which also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
-- **Stage 4:** Fine-tune the HF Hub-hosted model on ~70k records (by default) with a long sentence limit, also oversampling middle-range sequences of the training set. This stage acts as a bridge between stages 3 and 5. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
-- **Stage 5:** Fine-tune the HF Hub-hosted model on ~80k records (by default) without any sentence limit, which exposes the model to huge sentences in the corpus. Additionally, it also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
 To do a stage 1-based training run, just run the script directly from your IDE or use the following command:
@@ -131,14 +131,14 @@ To do a stage 1-based training run, just run the script directly from your IDE o
 uv run python src/train_t5.py --stage 1
 ```
-For stages 2 to 5, use the following command instead:
 ```python
 uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
 ```
-\* *Remember to replace the '2' in the command with '3' for stage 3, '4' for stage 4, or '5' for stage 5*
 > **Observation:** Model files are saved in the `src/checkpoints/stage{n}-final`, where `n` corresponds to the stage used in model fine-tuning
@@ -151,4 +151,6 @@ uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
 * **AramT5 v1 (May 16, 2026):** AramT5 fine-tuned on 40k records, across 20 epochs, leveraging the stage 2 configuration. The first proper model iteration, it exhibited noticeable improvements in training and evaluation losses, as well as CER and exact word match percentage. Furthermore, this version was the first that was shown to be capable of capturing some word roots and morphemes
-* **AramT5 v2 (May 17, 2026):** AramT5 fine-tuned on 60k records, across 15 epochs, leveraging the stage 3 configuration. A noticeable evolution from v1, it showcased better transliteration capabilities by handling more complex sentences, as well as lower CER and losses, and a noticeably higher exact word match percentage

       metrics:
         - name: Training Loss
           type: loss
+          value: 3.5204
         - name: Evaluation Loss
           type: loss
+          value: 3.7236
         - name: CER
           type: cer
+          value: 0.2760
         - name: Exact Match
           type: accuracy
+          value: 0.3655
 ---
 # AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration ♰
 - Less stable on very long or morphologically complex words
 > Development information
+> - 🚧 **Current version:** v3 (stage 4)
+> - ⏳ **Upcoming release:** v4 (stage 5)
 ---
 3. Run `generate_clean_corpus.sh` to clean the West and East Syriac corpora files and shuffle the datasets
 4. Run `train_tokeniser.py` to train the tokeniser on the cleaned corpora
+The model training process follows the curriculum learning format and is comprised of 6 stages:
 - **Stage 1:** Fine-tune a baseline T5 model on ~20k records (by default) consisting of shorter sentences. By the end of stage 1, the model should be uploaded to Hugging Face (HF) Hub
 - **Stage 2:** Fine-tune the HF Hub-hosted model on ~40k records (by default) consisting of a mix of shorter and longer sentences. This new model should overwrite the one available on HF Hub, or uploaded to a new repository
 - **Stage 3:** Fine-tune the HF Hub-hosted model on ~60k records (by default) with a considerable sentence length limit, which also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
+- **Stage 4:** Fine-tune the HF Hub-hosted model on ~80k records (by default) with a long sentence limit (≤100 characters). This stage acts as a bridge between stages 3 and 5. Once more, this new model should overwrite the one available on HF Hub, or be uploaded to a new repository
+- **Stage 5:** Fine-tune the HF Hub-hosted model on ~100k records (by default) with a very long sentence limit (≤200 characters), which exposes the model to huge sentences in the corpus. It also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or be uploaded to a new repository
+- **Stage 6:** Fine-tune the HF Hub-hosted model on ~120k records (by default) with the longest sentence limit (≤300 characters), which exposes the model to the full practical corpus. It also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or be uploaded to a new repository
 To do a stage 1-based training run, just run the script directly from your IDE or use the following command:
 uv run python src/train_t5.py --stage 1
 ```
+For stages 2 to 6, use the following command instead:
 ```python
 uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
 ```
+\* *Remember to replace the '2' in the command with '3' for stage 3 etc.*
 > **Observation:** Model files are saved in the `src/checkpoints/stage{n}-final`, where `n` corresponds to the stage used in model fine-tuning
 * **AramT5 v1 (May 16, 2026):** AramT5 fine-tuned on 40k records, across 20 epochs, leveraging the stage 2 configuration. The first proper model iteration, it exhibited noticeable improvements in training and evaluation losses, as well as CER and exact word match percentage. Furthermore, this version was the first that was shown to be capable of capturing some word roots and morphemes
+* **AramT5 v2 (May 17, 2026):** AramT5 fine-tuned on 60k records, across 15 epochs, leveraging the stage 3 configuration. A noticeable evolution from v1, it showcased better transliteration capabilities by handling more complex sentences and words, also exhibiting lower CER and losses, and a noticeably higher exact word match percentage
+* **AramT5 v3 (May 17, 2026):** AramT5 fine-tuned on 80k records, across 15 epochs, leveraging the stage 4 configuration. A further refinement of the model and an evolution of v2, this version was able to capture several more complex word morphologies, with most mistakes occurring due to bad vowel placement or incorrect word endings. Additionally, loss and CER values decreased further, while exact word match percentage increased
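
The staged curriculum described in this README (a per-stage record budget, a sentence length limit, and a mix of shorter and longer sentences) can be sketched roughly as follows. This is only an illustration, not the repository's sampling code: `build_stage_subset`, the 15-character "short" threshold, and the seeded shuffle are assumptions, while the 10% mix mirrors the `short_mix_ratio` value in `src/train_t5.py`.

```python
import random


def build_stage_subset(pairs, num_samples, max_src_length,
                       short_mix_ratio=0.10, short_threshold=15, seed=0):
    """Hypothetical helper: pick up to num_samples (source, target) pairs.

    Sources longer than max_src_length are dropped, and a fixed share of the
    budget is reserved for short sources so earlier-stage behaviour is revisited.
    """
    rng = random.Random(seed)
    eligible = [p for p in pairs if len(p[0]) <= max_src_length]
    short = [p for p in eligible if len(p[0]) <= short_threshold]
    longer = [p for p in eligible if len(p[0]) > short_threshold]

    # Reserve part of the budget for short examples, fill the rest with longer ones.
    n_short = min(len(short), int(num_samples * short_mix_ratio))
    n_long = min(len(longer), num_samples - n_short)

    subset = rng.sample(short, n_short) + rng.sample(longer, n_long)
    rng.shuffle(subset)
    return subset


if __name__ == "__main__":
    # Toy corpus: sources of length 1..200 with a placeholder target.
    toy_pairs = [("a" * n, "x") for n in range(1, 201)]
    stage4 = build_stage_subset(toy_pairs, num_samples=50, max_src_length=100)
    print(len(stage4), max(len(src) for src, _ in stage4))
```

Reserving the short share before filling the rest keeps the mix ratio stable even when long sentences dominate the eligible pool.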

generation_config.json CHANGED

@@ -1,9 +1,9 @@
 {
   "decoder_start_token_id": 0,
   "eos_token_id": 3,
-  "max_length": 24,
   "no_repeat_ngram_size": 2,
   "pad_token_id": 0,
-  "repetition_penalty": 1.5,
   "transformers_version": "4.51.2"
 }

 {
   "decoder_start_token_id": 0,
   "eos_token_id": 3,
+  "max_length": 64,
   "no_repeat_ngram_size": 2,
   "pad_token_id": 0,
+  "repetition_penalty": 1.2,
   "transformers_version": "4.51.2"
 }

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:962d79a0aa3e35366038407b9d71a0f6383af67cce6a1d533204134de7c611e2
 size 258368552

 version https://git-lfs.github.com/spec/v1
+oid sha256:0f0c999955d4e7358b5d14b20313b9f7a8b3a7529def26a7a39c4dbbcf92b02c
 size 258368552
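
Since only the `oid` changes in the pointer above while `size` stays the same, a downloaded `model.safetensors` can be told apart from the previous version only by its hash. A minimal sketch of such a check, assuming the three-line `version` / `oid` / `size` pointer layout shown in this diff; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any Git LFS tooling.

```python
import hashlib
import io


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    return fields


def matches_pointer(stream, pointer_text, chunk_size=1 << 20):
    """Return True if the binary stream has the size and sha256 named in the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algo, _, expected = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")

    digest = hashlib.sha256()
    total = 0
    # Read in chunks so a large weights file never has to fit in memory.
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
        total += len(chunk)
    return total == int(fields["size"]) and digest.hexdigest() == expected


if __name__ == "__main__":
    payload = b"example payload"
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{hashlib.sha256(payload).hexdigest()}\n"
        f"size {len(payload)}\n"
    )
    print(matches_pointer(io.BytesIO(payload), pointer))
```

For a real file, pass `open(path, "rb")` instead of the in-memory stream.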

src/train_t5.py CHANGED

@@ -5,13 +5,13 @@ Implements a staged training approach inspired by human learning:
 - Stage 1: Train on short sequences (20k records, ≤15 chars) → baseline model
 - Stage 2: Medium-short sequences (40k records, ≤30 chars)
 - Stage 3: Medium-long sequences (60k records, ≤50 chars)
-- Stage 4: Bridge stage (70k records, ≤100 chars) with oversampling of rare middle-range
-- Stage 5: Full corpus with all sequence lengths (80k records)
 Features:
 - Curriculum learning: short → long sequences
 - Catastrophic forgetting mitigation: mixes short examples into later stages
-- Middle-range oversampling in stage 4 to bridge word→sentence gap
 - Character Error Rate (CER) evaluation for transliteration quality
 - Early stopping based on validation loss improvement threshold
 """
@@ -66,21 +66,30 @@ STAGE_CONFIGS = {
         "learning_rate": 9e-5,
     },
     4: {
-        "description": "Bridge: oversample middle-range sequences",
-        "num_samples": 70_000,
         "max_src_length": 100,
         "short_mix_ratio": 0.10,  # 10% short examples
-        "middle_oversample": True,  # Oversample 15-100 char range
-        "num_epochs": 12,
         "learning_rate": 7e-5,
     },
     5: {
-        "description": "Full corpus: all sequence lengths",
-        "num_samples": 80_000,
-        "max_src_length": None,  # No length filter
         "short_mix_ratio": 0.10,  # 10% short examples
-        "num_epochs": 10,
         "learning_rate": 5e-5,
     },
 }
@@ -95,8 +104,8 @@ def parse_args():
         "--stage",
         type=int,
         default=1,
-        choices=[1, 2, 3, 4, 5],
-        help="Training stage (1=baseline, 2=medium-short, 3=medium-long, 4=bridge, 5=full)",
     )
     parser.add_argument(
         "--hf-model",
@@ -472,9 +481,26 @@ def create_compute_metrics(tokeniser):
             f"  Avg pred len: {np.mean(pred_lens):.1f}, Avg label len: {np.mean(label_lens):.1f}"
         )
         return {
             "cer": np.mean(cer_scores),
             "exact_match": np.mean(exact_matches),
         }
     return compute_metrics
@@ -547,16 +573,26 @@ def train(args):
         greater_is_better=False,
         report_to="none",
         predict_with_generate=True,
-        generation_max_length=24,  # Short for word transliteration (avg target ~7)
     )
     # Configure generation to prevent repetition and control length
     # Using greedy decoding with repetition penalty (no beam search)
-    model.generation_config.max_length = 24
     model.generation_config.no_repeat_ngram_size = 2  # Prevent bigram repetition
-    model.generation_config.repetition_penalty = 1.5  # Penalise repeated tokens
     model.generation_config.eos_token_id = tokeniser.eos_token_id
     model.generation_config.pad_token_id = tokeniser.pad_token_id
     # Callbacks
     callbacks = []

 - Stage 1: Train on short sequences (20k records, ≤15 chars) → baseline model
 - Stage 2: Medium-short sequences (40k records, ≤30 chars)
 - Stage 3: Medium-long sequences (60k records, ≤50 chars)
+- Stage 4: Longer sequences (80k records, ≤100 chars)
+- Stage 5: Extended sequences (100k records, ≤200 chars)
+- Stage 6: Full practical corpus (120k records, ≤300 chars) — filters document outliers
 Features:
 - Curriculum learning: short → long sequences
 - Catastrophic forgetting mitigation: mixes short examples into later stages
 - Character Error Rate (CER) evaluation for transliteration quality
 - Early stopping based on validation loss improvement threshold
 """
         "learning_rate": 9e-5,
     },
     4: {
+        "description": "Extension: longer sequences (natural distribution)",
+        "num_samples": 80_000,
         "max_src_length": 100,
         "short_mix_ratio": 0.10,  # 10% short examples
+        "num_epochs": 15,
         "learning_rate": 7e-5,
     },
     5: {
+        "description": "Extension: even longer sequences (almost full practical corpus)",
+        "num_samples": 100_000,
+        "max_src_length": 200,  # Filter out document-length outliers
+        "short_mix_ratio": 0.10,  # 10% short examples
+        "num_epochs": 15,
+        "learning_rate": 6e-5,  # Gradual decrease from stage 4's 7e-5
+        "repetition_penalty": 1.2,  # Prevent repetitive outputs
+    },
+    6: {
+        "description": "Full practical corpus: sentences and short paragraphs",
+        "num_samples": 120_000,
+        "max_src_length": 300,  # Filter out document-length outliers
         "short_mix_ratio": 0.10,  # 10% short examples
+        "num_epochs": 12,
         "learning_rate": 5e-5,
+        "repetition_penalty": 1.2,  # Prevent repetitive outputs
     },
 }
         "--stage",
         type=int,
         default=1,
+        choices=[1, 2, 3, 4, 5, 6],
+        help="Training stage (1=baseline, 2=medium-short, 3=medium-long, 4=longer, 5=extended, 6=full practical)",
     )
     parser.add_argument(
         "--hf-model",
             f"  Avg pred len: {np.mean(pred_lens):.1f}, Avg label len: {np.mean(label_lens):.1f}"
         )
+        # Compute length ratio penalty (penalise under-generation)
+        # Ratio < 1 means output is shorter than target
+        length_ratios = [
+            len(pred) / max(len(target), 1)
+            for pred, target in zip(pred_strs, label_strs)
+        ]
+        # Penalty: how much shorter outputs are on average (0 = perfect, higher = worse)
+        # Only penalise under-generation (ratio < 1), not over-generation
+        length_penalties = [
+            max(0, 1 - ratio) for ratio in length_ratios
+        ]
+        avg_length_penalty = np.mean(length_penalties)
+        avg_length_ratio = np.mean(length_ratios)
+        print(f"  Avg length ratio: {avg_length_ratio:.3f}, Avg length penalty: {avg_length_penalty:.3f}")
         return {
             "cer": np.mean(cer_scores),
             "exact_match": np.mean(exact_matches),
+            "length_ratio": avg_length_ratio,
+            "length_penalty": avg_length_penalty,
         }
     return compute_metrics
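
The two length metrics added above can be tried in isolation. The sketch below follows the same arithmetic on plain strings, without NumPy or the tokeniser; the standalone `length_metrics` function is an assumption made for illustration and does not exist in `src/train_t5.py`.

```python
def length_metrics(pred_strs, label_strs):
    """Mean prediction/target length ratio and mean under-generation penalty."""
    # Ratio < 1 means the prediction is shorter than its target.
    ratios = [
        len(pred) / max(len(target), 1)
        for pred, target in zip(pred_strs, label_strs)
    ]
    # Only under-generation is penalised; over-long predictions contribute 0.
    penalties = [max(0.0, 1.0 - ratio) for ratio in ratios]
    count = max(len(ratios), 1)
    return {
        "length_ratio": sum(ratios) / count,
        "length_penalty": sum(penalties) / count,
    }


if __name__ == "__main__":
    # One prediction is half as long as its target, the other twice as long.
    print(length_metrics(["ab", "abcd"], ["abcd", "ab"]))
```

In this toy case the ratios are 0.5 and 2.0, so the mean ratio is 1.25 while the mean penalty is 0.25. The average ratio alone can therefore hide under-generation, which is presumably why both values are reported.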
         greater_is_better=False,
         report_to="none",
         predict_with_generate=True,
+        generation_max_length=128 if args.stage >= 5 else (64 if args.stage >= 4 else 24),
     )
     # Configure generation to prevent repetition and control length
     # Using greedy decoding with repetition penalty (no beam search)
+    gen_max_len = 128 if args.stage >= 5 else (64 if args.stage >= 4 else 24)
+    model.generation_config.max_length = gen_max_len
     model.generation_config.no_repeat_ngram_size = 2  # Prevent bigram repetition
+    # Use stage-specific repetition penalty if defined, else default
+    rep_penalty = stage_config.get("repetition_penalty", 1.5 if args.stage < 4 else 1.3)
+    model.generation_config.repetition_penalty = rep_penalty
     model.generation_config.eos_token_id = tokeniser.eos_token_id
     model.generation_config.pad_token_id = tokeniser.pad_token_id
+    # Minimum length to discourage under-generation in later stages
+    if args.stage >= 5:
+        model.generation_config.min_length = 3  # At least a few tokens
+        # Use beam search with length_penalty to encourage full-length outputs
+        model.generation_config.num_beams = 4
+        model.generation_config.length_penalty = 1.2  # >1.0 encourages longer sequences
+        model.generation_config.early_stopping = True
     # Callbacks
     callbacks = []
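
The stage-dependent generation settings in this hunk can be summarised as a pure function, which makes the thresholds easier to check. This is a sketch only: `generation_settings` is a hypothetical helper that returns a plain dict rather than mutating `model.generation_config`, and it copies the constants from the diff as they appear.

```python
def generation_settings(stage, stage_config=None):
    """Hypothetical summary of the per-stage generation settings in the diff."""
    stage_config = stage_config or {}
    settings = {
        # 24 tokens up to stage 3, 64 for stage 4, 128 from stage 5 onwards.
        "max_length": 128 if stage >= 5 else (64 if stage >= 4 else 24),
        "no_repeat_ngram_size": 2,
        # A stage-specific value wins; otherwise 1.5 before stage 4 and 1.3 after.
        "repetition_penalty": stage_config.get(
            "repetition_penalty", 1.5 if stage < 4 else 1.3
        ),
    }
    if stage >= 5:
        # Later stages switch to beam search and discourage very short outputs.
        settings.update(
            min_length=3, num_beams=4, length_penalty=1.2, early_stopping=True
        )
    return settings


if __name__ == "__main__":
    for stage, config in [(3, {}), (4, {}), (5, {"repetition_penalty": 1.2})]:
        print(stage, generation_settings(stage, config))
```

Under this reading, stage 4 resolves to a repetition penalty of 1.3 unless its stage config supplies one, whereas the committed `generation_config.json` lists 1.2. The diff does not show where that value comes from.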

tokenizer.json CHANGED

@@ -1,7 +1,21 @@
 {
   "version": "1.0",
-  "truncation": null,
-  "padding": null,
   "added_tokens": [
     {
       "id": 0,

 {
   "version": "1.0",
+  "truncation": {
+    "direction": "Right",
+    "max_length": 128,
+    "strategy": "LongestFirst",
+    "stride": 0
+  },
+  "padding": {
+    "strategy": {
+      "Fixed": 128
+    },
+    "direction": "Right",
+    "pad_to_multiple_of": null,
+    "pad_id": 0,
+    "pad_type_id": 0,
+    "pad_token": "<pad>"
+  },
   "added_tokens": [
     {
       "id": 0,