crossroderick committed on
Commit
0e528c7
·
1 Parent(s): cf5dfba

v3 update

Files changed (5)
  1. README.md +15 -13
  2. generation_config.json +2 -2
  3. model.safetensors +1 -1
  4. src/train_t5.py +52 -16
  5. tokenizer.json +16 -2
README.md CHANGED
@@ -22,16 +22,16 @@ model-index:
  metrics:
  - name: Training Loss
  type: loss
- value: 3.4987
+ value: 3.5204
  - name: Evaluation Loss
  type: loss
- value: 3.8796
+ value: 3.7236
  - name: CER
  type: cer
- value: 0.3543
+ value: 0.2760
  - name: Exact Match
  type: accuracy
- value: 0.2562
+ value: 0.3655
  ---
  # AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration ♰

@@ -44,8 +44,8 @@ model-index:
  - Less stable on very long or morphologically complex words

  > Development information
- > - 🚧 **Current version:** v2 (stage 3)
- > - ⏳ **Upcoming release:** v3 (stage 4)
+ > - 🚧 **Current version:** v3 (stage 4)
+ > - ⏳ **Upcoming release:** v4 (stage 5)

  ---

@@ -116,14 +116,14 @@ Given the total size of the datasets, they haven't been included in this model's
  3. Run `generate_clean_corpus.sh` to clean the West and East Syriac corpora files and shuffle the datasets
  4. Run `train_tokeniser.py` to train the tokeniser on the cleaned corpora

- The model training process follows the curriculum learning format and is comprised of 5 stages:
+ The model training process follows the curriculum learning format and is comprised of 6 stages:

  - **Stage 1:** Fine-tune a baseline T5 model on ~20k records (by default) consisting of shorter sentences. By the end of stage 1, the model should be uploaded to Hugging Face (HF) Hub
  - **Stage 2:** Fine-tune the HF Hub-hosted model on ~40k records (by default) consisting of a mix of shorter and longer sentences. This new model should overwrite the one available on HF Hub, or uploaded to a new repository
  - **Stage 3:** Fine-tune the HF Hub-hosted model on ~60k records (by default) with a considerable sentence length limit, which also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
- - **Stage 4:** Fine-tune the HF Hub-hosted model on ~70k records (by default) with a long sentence limit, also oversampling middle-range sequences of the training set. This stage acts as a bridge between stages 3 and 5. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
- - **Stage 5:** Fine-tune the HF Hub-hosted model on ~80k records (by default) without any sentence limit, which exposes the model to huge sentences in the corpus. Additionally, it also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
-
+ - **Stage 4:** Fine-tune the HF Hub-hosted model on ~80k records (by default) with a long sentence limit. This stage acts as a bridge between stages 3 and 5. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
+ - **Stage 5:** Fine-tune the HF Hub-hosted model on ~100k records (by default) with a very long sentence limit, which exposes the model to huge sentences in the corpus. Additionally, it also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository
+ - **Stage 6:** Fine-tune the HF Hub-hosted model on ~120k records (by default) with a very long sentence limit, which exposes the model to the full practical corpus. Additionally, it also contains a mix of shorter and longer sentences. Once more, this new model should overwrite the one available on HF Hub, or uploaded to a new repository.

  To do a stage 1-based training run, just run the script directly from your IDE or use the following command:

@@ -131,14 +131,14 @@ To do a stage 1-based training run, just run the script directly from your IDE o
  uv run python src/train_t5.py --stage 1
  ```

- For stages 2 to 5, use the following command instead:
+ For stages 2 to 6, use the following command instead:


  ```python
  uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
  ```

- \* *Remember to replace the '2' in the command with '3' for stage 3, '4' for stage 4, or '5' for stage 5*
+ \* *Remember to replace the '2' in the command with '3' for stage 3 etc.*


  > **Observation:** Model files are saved in the `src/checkpoints/stage{n}-final`, where `n` corresponds to the stage used in model fine-tuning
@@ -151,4 +151,6 @@ uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name

  * **AramT5 v1 (May 16, 2026):** AramT5 fine-tuned on 40k records, across 20 epochs, leveraging the stage 2 configuration. The first proper model iteration, it exhibited noticeable improvements in training and evaluation losses, as well as CER and exact word match percentage. Furthermore, this version was the first that was shown to be capable of capturing some word roots and morphemes

- * **AramT5 v2 (May 17, 2026):** AramT5 fine-tuned on 60k records, across 15 epochs, leveraging the stage 3 configuration. A noticeable evolution from v1, it showcased better transliteration capabilities by handling more complex sentences, as well as lower CER and losses, and a noticeably higher exact word match percentage
+ * **AramT5 v2 (May 17, 2026):** AramT5 fine-tuned on 60k records, across 15 epochs, leveraging the stage 3 configuration. A noticeable evolution from v1, it showcased better transliteration capabilities by handling more complex sentences and words, also exhibiting lower CER and losses, and a noticeably higher exact word match percentage
+
+ * **AramT5 v3 (May 17, 2026):** AramT5 fine-tuned on 80k records, across 15 epochs, leveraging the stage 4 configuration. A further refinement of the model and an evolution of v3, this version was able to capture several more complex word morphologies, with most mistakes occurring due to bad vowel placement or incorrect word endings. Additionally, loss and CER values decreased further, while exact word match percentage increased
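
As a rough illustration of the staged curriculum the README changes describe, the selection step for one stage might look like the sketch below. It is not the repository's code. The record counts and length limits are taken from the stage list and the `src/train_t5.py` docstring in this commit. The names `select_stage_records` and `SHORT_LIMIT`, the seed, applying the 10% short-example share to every stage, and the way short examples are mixed in are all assumptions made for the example.

```python
import random

# Record counts and source-length limits as listed in this commit.
# Everything else in this sketch is an assumption, not the project's code.
STAGES = {
    1: {"num_samples": 20_000, "max_src_length": 15},
    2: {"num_samples": 40_000, "max_src_length": 30},
    3: {"num_samples": 60_000, "max_src_length": 50},
    4: {"num_samples": 80_000, "max_src_length": 100},
    5: {"num_samples": 100_000, "max_src_length": 200},
    6: {"num_samples": 120_000, "max_src_length": 300},
}
SHORT_LIMIT = 15  # hypothetical cut-off for "short" examples (the stage 1 limit)


def select_stage_records(pairs, stage, short_mix_ratio=0.10, seed=0):
    """Pick (source, target) pairs for a stage, keeping a share of short ones."""
    cfg = STAGES[stage]
    rng = random.Random(seed)

    # Drop anything longer than the stage's source-length limit.
    eligible = [p for p in pairs if len(p[0]) <= cfg["max_src_length"]]
    short = [p for p in eligible if len(p[0]) <= SHORT_LIMIT]
    longer = [p for p in eligible if len(p[0]) > SHORT_LIMIT]

    total = min(cfg["num_samples"], len(eligible))
    n_short = min(len(short), round(total * short_mix_ratio))
    n_long = min(len(longer), total - n_short)
    # If there are too few longer examples, top up with short ones.
    n_short = total - n_long

    chosen = rng.sample(short, n_short) + rng.sample(longer, n_long)
    rng.shuffle(chosen)
    return chosen
```

Calling `select_stage_records(pairs, 4)` on a list of `(syriac, latin)` string pairs would return at most 80,000 pairs whose source side is no longer than 100 characters. Keeping some short pairs in every stage is one simple way to approximate the forgetting mitigation the training script's docstring mentions.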
generation_config.json CHANGED
@@ -1,9 +1,9 @@
  {
  "decoder_start_token_id": 0,
  "eos_token_id": 3,
- "max_length": 24,
+ "max_length": 64,
  "no_repeat_ngram_size": 2,
  "pad_token_id": 0,
- "repetition_penalty": 1.5,
+ "repetition_penalty": 1.2,
  "transformers_version": "4.51.2"
  }
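
The generation settings changed here are chosen per stage in the `src/train_t5.py` diff of this commit. The sketch below restates that selection as a standalone function so the stage thresholds are easier to see. The function name `generation_settings` and the dictionary return value are assumptions for illustration. It mirrors only the code path shown in the diff and makes no claim about how the committed JSON file was produced.

```python
def generation_settings(stage, stage_config=None):
    """Return generation settings for a stage, following the diff's conditions."""
    stage_config = stage_config or {}

    # Longer outputs are allowed from stage 4 onwards, and longer still from stage 5.
    max_length = 128 if stage >= 5 else (64 if stage >= 4 else 24)

    # A stage-specific penalty wins; otherwise fall back to a stage-based default.
    default_penalty = 1.5 if stage < 4 else 1.3
    settings = {
        "max_length": max_length,
        "no_repeat_ngram_size": 2,
        "repetition_penalty": stage_config.get("repetition_penalty", default_penalty),
    }

    # Stages 5 and 6 switch to beam search and discourage very short outputs.
    if stage >= 5:
        settings.update(min_length=3, num_beams=4, length_penalty=1.2, early_stopping=True)
    return settings
```

For example, `generation_settings(5, {"repetition_penalty": 1.2})` yields a maximum length of 128 with four beams, while `generation_settings(3)` keeps the earlier limit of 24 and a penalty of 1.5.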
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:962d79a0aa3e35366038407b9d71a0f6383af67cce6a1d533204134de7c611e2
+ oid sha256:0f0c999955d4e7358b5d14b20313b9f7a8b3a7529def26a7a39c4dbbcf92b02c
  size 258368552
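
Only the `oid` line changes in this pointer, while the size stays the same, which is what one would expect when weights are retrained without changing the architecture. A downloaded file can be checked against such a pointer with a few lines of standard-library Python. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the sketch reads the whole payload into memory, so a file of this size would more sensibly be hashed in chunks.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.strip().split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Check raw bytes against the pointer's recorded size and SHA-256 digest."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    same_size = len(data) == pointer["size"]
    same_hash = hashlib.sha256(data).hexdigest() == pointer["digest"]
    return same_size and same_hash
```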
src/train_t5.py CHANGED
@@ -5,13 +5,13 @@ Implements a staged training approach inspired by human learning:
  - Stage 1: Train on short sequences (20k records, ≤15 chars) → baseline model
  - Stage 2: Medium-short sequences (40k records, ≤30 chars)
  - Stage 3: Medium-long sequences (60k records, ≤50 chars)
- - Stage 4: Bridge stage (70k records, ≤100 chars) with oversampling of rare middle-range
- - Stage 5: Full corpus with all sequence lengths (80k records)
+ - Stage 4: Longer sequences (80k records, ≤100 chars)
+ - Stage 5: Extended sequences (100k records, ≤200 chars)
+ - Stage 6: Full practical corpus (120k records, ≤300 chars) — filters document outliers

  Features:
  - Curriculum learning: short → long sequences
  - Catastrophic forgetting mitigation: mixes short examples into later stages
- - Middle-range oversampling in stage 4 to bridge word→sentence gap
  - Character Error Rate (CER) evaluation for transliteration quality
  - Early stopping based on validation loss improvement threshold
  """
@@ -66,21 +66,30 @@ STAGE_CONFIGS = {
  "learning_rate": 9e-5,
  },
  4: {
- "description": "Bridge: oversample middle-range sequences",
- "num_samples": 70_000,
+ "description": "Extension: longer sequences (natural distribution)",
+ "num_samples": 80_000,
  "max_src_length": 100,
  "short_mix_ratio": 0.10, # 10% short examples
- "middle_oversample": True, # Oversample 15-100 char range
- "num_epochs": 12,
+ "num_epochs": 15,
  "learning_rate": 7e-5,
  },
  5: {
- "description": "Full corpus: all sequence lengths",
- "num_samples": 80_000,
- "max_src_length": None, # No length filter
+ "description": "Extension: even longer sequences (almost full practical corpus)",
+ "num_samples": 100_000,
+ "max_src_length": 200, # Filter out document-length outliers
+ "short_mix_ratio": 0.10, # 10% short examples
+ "num_epochs": 15,
+ "learning_rate": 6e-5, # Gradual decrease from stage 4's 7e-5
+ "repetition_penalty": 1.2, # Prevent repetitive outputs
+ },
+ 6: {
+ "description": "Full practical corpus: sentences and short paragraphs",
+ "num_samples": 120_000,
+ "max_src_length": 300, # Filter out document-length outliers
  "short_mix_ratio": 0.10, # 10% short examples
- "num_epochs": 10,
+ "num_epochs": 12,
  "learning_rate": 5e-5,
+ "repetition_penalty": 1.2, # Prevent repetitive outputs
  },
  }

@@ -95,8 +104,8 @@ def parse_args():
  "--stage",
  type=int,
  default=1,
- choices=[1, 2, 3, 4, 5],
- help="Training stage (1=baseline, 2=medium-short, 3=medium-long, 4=bridge, 5=full)",
+ choices=[1, 2, 3, 4, 5, 6],
+ help="Training stage (1=baseline, 2=medium-short, 3=medium-long, 4=longer, 5=extended, 6=full practical)",
  )
  parser.add_argument(
  "--hf-model",
@@ -472,9 +481,26 @@ def create_compute_metrics(tokeniser):
  f" Avg pred len: {np.mean(pred_lens):.1f}, Avg label len: {np.mean(label_lens):.1f}"
  )

+ # Compute length ratio penalty (penalise under-generation)
+ # Ratio < 1 means output is shorter than target
+ length_ratios = [
+ len(pred) / max(len(target), 1)
+ for pred, target in zip(pred_strs, label_strs)
+ ]
+ # Penalty: how much shorter outputs are on average (0 = perfect, higher = worse)
+ # Only penalise under-generation (ratio < 1), not over-generation
+ length_penalties = [
+ max(0, 1 - ratio) for ratio in length_ratios
+ ]
+ avg_length_penalty = np.mean(length_penalties)
+ avg_length_ratio = np.mean(length_ratios)
+ print(f" Avg length ratio: {avg_length_ratio:.3f}, Avg length penalty: {avg_length_penalty:.3f}")
+
  return {
  "cer": np.mean(cer_scores),
  "exact_match": np.mean(exact_matches),
+ "length_ratio": avg_length_ratio,
+ "length_penalty": avg_length_penalty,
  }

  return compute_metrics
@@ -547,16 +573,26 @@ def train(args):
  greater_is_better=False,
  report_to="none",
  predict_with_generate=True,
- generation_max_length=24, # Short for word transliteration (avg target ~7)
+ generation_max_length=128 if args.stage >= 5 else (64 if args.stage >= 4 else 24),
  )

  # Configure generation to prevent repetition and control length
  # Using greedy decoding with repetition penalty (no beam search)
- model.generation_config.max_length = 24
+ gen_max_len = 128 if args.stage >= 5 else (64 if args.stage >= 4 else 24)
+ model.generation_config.max_length = gen_max_len
  model.generation_config.no_repeat_ngram_size = 2 # Prevent bigram repetition
- model.generation_config.repetition_penalty = 1.5 # Penalise repeated tokens
+ # Use stage-specific repetition penalty if defined, else default
+ rep_penalty = stage_config.get("repetition_penalty", 1.5 if args.stage < 4 else 1.3)
+ model.generation_config.repetition_penalty = rep_penalty
  model.generation_config.eos_token_id = tokeniser.eos_token_id
  model.generation_config.pad_token_id = tokeniser.pad_token_id
+ # Minimum length to discourage under-generation in later stages
+ if args.stage >= 5:
+ model.generation_config.min_length = 3 # At least a few tokens
+ # Use beam search with length_penalty to encourage full-length outputs
+ model.generation_config.num_beams = 4
+ model.generation_config.length_penalty = 1.2 # >1.0 encourages longer sequences
+ model.generation_config.early_stopping = True

  # Callbacks
  callbacks = []
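
The length metrics added to `compute_metrics` in this diff can be tried in isolation. The sketch below repeats the same arithmetic with the standard library instead of NumPy; the function name `length_metrics` and the sample strings are assumptions for illustration.

```python
from statistics import mean


def length_metrics(pred_strs, label_strs):
    """Average length ratio and under-generation penalty, as in the diff."""
    # Ratio < 1 means the prediction is shorter than its target.
    ratios = [len(pred) / max(len(target), 1) for pred, target in zip(pred_strs, label_strs)]
    # Only predictions that are too short contribute to the penalty.
    penalties = [max(0.0, 1 - ratio) for ratio in ratios]
    return {"length_ratio": mean(ratios), "length_penalty": mean(penalties)}


metrics = length_metrics(["ab", "abcd", "abcdef"], ["abcd", "abcd", "abcd"])
# length_ratio is 1.0 and length_penalty is about 0.167
```

The example shows why both values are reported: one prediction at half length and one at one and a half times the target length average out to a ratio of 1.0, yet the penalty is still non-zero because it ignores over-generation.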
tokenizer.json CHANGED
@@ -1,7 +1,21 @@
  {
  "version": "1.0",
- "truncation": null,
- "padding": null,
+ "truncation": {
+ "direction": "Right",
+ "max_length": 128,
+ "strategy": "LongestFirst",
+ "stride": 0
+ },
+ "padding": {
+ "strategy": {
+ "Fixed": 128
+ },
+ "direction": "Right",
+ "pad_to_multiple_of": null,
+ "pad_id": 0,
+ "pad_type_id": 0,
+ "pad_token": "<pad>"
+ },
  "added_tokens": [
  {
  "id": 0,