crossroderick/aramt5

@@ -118,14 +118,16 @@ Given the total size of the datasets, they haven't been included in this model's
 3. Run `generate_clean_corpus.sh` to clean the West and East Syriac corpora files and shuffle the datasets
 4. Run `train_tokeniser.py` to train the tokeniser on the cleaned corpora
-The model training process follows the curriculum learning format and is comprised of 6 stages:
-- **Stage 1:** Fine-tune a baseline T5 model on ~20k max records (by default) consisting of shorter sentences. By the end of stage 1, the model should be uploaded to Hugging Face (HF) Hub
-- **Stage 2:** Fine-tune the HF Hub-hosted model on ~40k max records (by default) consisting of a mix of shorter and longer sentences. This new model should overwrite the one available on HF Hub, or uploaded to a new repository
-- **Stage 3:** Fine-tune the HF Hub-hosted model on ~60k max records (by default) with a considerable sentence length limit, which also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
-- **Stage 4:** Fine-tune the HF Hub-hosted model on ~80k max records (by default) with a long sentence limit. This stage acts as a bridge between stages 3 and 5. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
-- **Stage 5:** Fine-tune the HF Hub-hosted model on ~100k max records (by default) with a very long sentence limit, which exposes the model to huge sentences in the corpus. Additionally, it also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
-- **Stage 6:** Fine-tune the HF Hub-hosted model on ~120k max records (by default) with a very long sentence limit, which exposes the model to the full practical corpus. Additionally, it also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
 To do a stage 1-based training run, just run the script directly from your IDE or use the following command:

 3. Run `generate_clean_corpus.sh` to clean the West and East Syriac corpora files and shuffle the datasets
 4. Run `train_tokeniser.py` to train the tokeniser on the cleaned corpora
+The model training process follows a curriculum learning format and comprises 6 stages:
+| Stage | Max. samples | Max. sentence length (chars) | Mixes shorter sentences | Objective |
+|-------|--------------|------------------------------|-------------------------|-----------|
+| 1     | 20000        | 15                           | No                      | Expose the base T5 model to Syriac morphology |
+| 2     | 40000        | 30                           | Yes                     | Introduce short sentences to AramT5 |
+| 3     | 60000        | 50                           | Yes                     | Introduce medium sentences to AramT5 |
+| 4     | 80000        | 70                           | Yes                     | Introduce longer sentences to AramT5 |
+| 5     | 100000       | 100                          | Yes                     | Reinforce longer sentences in AramT5 |
+| 6     | 120000       | 150                          | Yes                     | Introduce the full practical corpus to AramT5 |
 To do a stage 1-based training run, just run the script directly from your IDE or use the following command:

src/train_t5.py

@@ -76,35 +76,35 @@ STAGE_CONFIGS = {
         "description": "Extension: longer phrases",
         "num_samples": 80_000,
         "max_src_length": 70,
-        "short_mix_ratio": 0.10,  # 10% short examples from previous stages
         "short_threshold": 50,  # ≤50 chars (Stage 1+2+3)
-        "new_range_ratio": 0.50,  # 50% from new range (51-70 chars)
         "new_range_min": 51,
         "num_epochs": 20,
-        "learning_rate": 8e-5,
     },
     5: {
         "description": "Extension: longer sentences",
         "num_samples": 100_000,
         "max_src_length": 100,
-        "short_mix_ratio": 0.10,  # 10% short examples from previous stages
         "short_threshold": 70,  # ≤70 chars (Stage 1+2+3+4)
-        "new_range_ratio": 0.50,  # 50% from new range (71-100 chars)
         "new_range_min": 71,
         "num_epochs": 20,
-        "learning_rate": 6e-5,
         "repetition_penalty": 1.2,
     },
     6: {
         "description": "Full practical corpus: sentences and short paragraphs",
         "num_samples": 120_000,
         "max_src_length": 150,
-        "short_mix_ratio": 0.10,  # 10% short examples from previous stages
         "short_threshold": 100,  # ≤100 chars (Stage 1+2+3+4+5)
-        "new_range_ratio": 0.50,  # 50% from new range (101-150 chars)
         "new_range_min": 101,
         "num_epochs": 15,
-        "learning_rate": 4e-5,
         "repetition_penalty": 1.2,
     },
 }

         "description": "Extension: longer phrases",
         "num_samples": 80_000,
         "max_src_length": 70,
+        "short_mix_ratio": 0.18,  # 18% short examples from previous stages (boosted for retention)
         "short_threshold": 50,  # ≤50 chars (Stage 1+2+3)
+        "new_range_ratio": 0.45,  # 45% from new range (51-70 chars)
         "new_range_min": 51,
         "num_epochs": 20,
+        "learning_rate": 6e-5,  # Lower LR to prevent forgetting
     },
     5: {
         "description": "Extension: longer sentences",
         "num_samples": 100_000,
         "max_src_length": 100,
+        "short_mix_ratio": 0.18,  # 18% short examples from previous stages (boosted for retention)
         "short_threshold": 70,  # ≤70 chars (Stage 1+2+3+4)
+        "new_range_ratio": 0.45,  # 45% from new range (71-100 chars)
         "new_range_min": 71,
         "num_epochs": 20,
+        "learning_rate": 4e-5,  # Lower LR to prevent forgetting
         "repetition_penalty": 1.2,
     },
     6: {
         "description": "Full practical corpus: sentences and short paragraphs",
         "num_samples": 120_000,
         "max_src_length": 150,
+        "short_mix_ratio": 0.20,  # 20% short examples from previous stages (highest retention)
         "short_threshold": 100,  # ≤100 chars (Stage 1+2+3+4+5)
+        "new_range_ratio": 0.40,  # 40% from new range (101-150 chars)
         "new_range_min": 101,
         "num_epochs": 15,
+        "learning_rate": 3e-5,  # Lower LR to prevent forgetting
         "repetition_penalty": 1.2,
     },
 }
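
The diff does not show how `short_mix_ratio`, `short_threshold`, `new_range_ratio` and `new_range_min` are consumed, so the following is only a minimal sketch of one plausible reading: reserve a share of the stage's samples for short sentences from earlier stages, another share for the newly unlocked length range, and fill the remainder from any sentence within `max_src_length`. The function name `build_stage_mix` and its signature are hypothetical, lengths are assumed to be measured in characters (as the config comments suggest), and the ratios are assumed to sum to at most 1.

```python
import random

# Stage 4 values copied from the new side of the diff (only the keys used here).
STAGE_4 = {
    "num_samples": 80_000,
    "max_src_length": 70,
    "short_mix_ratio": 0.18,
    "short_threshold": 50,
    "new_range_ratio": 0.45,
    "new_range_min": 51,
}


def build_stage_mix(sources, cfg, seed=0):
    """Return indices into `sources` forming one stage's training mix.

    Hypothetical helper: the real sampling logic in train_t5.py is not
    visible in the diff and may differ.
    """
    rng = random.Random(seed)
    max_len = cfg["max_src_length"]

    # Sentences that fit within this stage's length limit.
    eligible = [i for i, s in enumerate(sources) if len(s) <= max_len]
    # Short sentences already covered by earlier stages.
    short = [i for i in eligible if len(sources[i]) <= cfg["short_threshold"]]
    # Sentences in the length range this stage newly unlocks.
    new = [i for i in eligible if len(sources[i]) >= cfg["new_range_min"]]

    total = min(cfg["num_samples"], len(eligible))
    n_short = min(round(total * cfg["short_mix_ratio"]), len(short))
    n_new = min(round(total * cfg["new_range_ratio"]), len(new))

    # Guaranteed quotas first, then fill the remainder without duplicates.
    picked = rng.sample(short, n_short) + rng.sample(new, n_new)
    taken = set(picked)
    rest = [i for i in eligible if i not in taken]
    picked += rng.sample(rest, max(0, total - len(picked)))

    rng.shuffle(picked)
    return picked
```

Under this reading, raising `short_mix_ratio` from 0.10 to 0.18 while lowering `new_range_ratio` from 0.50 to 0.45 increases the guaranteed share of previously seen short sentences, which matches the "boosted for retention" comments in the new config.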