Commit 2912552 by CidQu · verified · 1 parent: 15059d4

v0.2: chrF 26.97 on 200 TR->LZ test pairs (LoRA r=64, 18k steps, A100)

Files changed (3)
  1. README.md +20 -19
  2. adapter_config.json +5 -3
  3. adapter_model.safetensors +2 -2
README.md CHANGED
@@ -27,11 +27,11 @@ pipeline_tag: translation
 
  ---
 
- LoRA adapter for Gemma 4 E4B. **v0.1 research preview.**
 
  ## ⚠️ Status: research preview, not production-quality
 
- - **chrF on 200 held-out test pairs (TR→LZ): 24.66**
  - Real Laz output for natural sentences, but uneven on rare vocabulary and dialect conditioning.
  - Built for endangered-language preservation, research, and community use.
  - Full training pipeline + iteration log: <https://github.com/CidQu/lazca_ai>
@@ -64,10 +64,10 @@ def translate(text, to="lzz"):
  print(translate("Su içmek istiyorum."))
  ```
 
- Pin to a specific release with `revision="v0.1"`:
 
  ```python
- model = PeftModel.from_pretrained(model, "CidQuLimited/LazuriMT", revision="v0.1")
  ```
 
  ## Performance
@@ -77,7 +77,8 @@ chrF computed on 200 held-out TR→LZ pairs from the corpus's test split (5%), w
  | Version | chrF (TR→LZ) | Notes |
  |---|---:|---|
  | baseline Gemma 4 E4B (no adapter) | ≈ 0 | does not translate Laz |
- | v0.1 (this release) | **24.66** | LoRA r=32, 10,500 masked-loss steps (~2.15 epochs) |
 
  For context, chrF roughly maps:
  - ~10: garbled
@@ -85,29 +86,29 @@ For context, chrF roughly maps:
  - ~40+: useful translations
  - ~50+: professional-level
 
- LazuriMT v0.1 is in the "readable but flawed" range — a real but early baseline for a language with almost no prior MT.
 
  ## Training setup
 
  - **Base model**: `unsloth/gemma-4-e4b-it-unsloth-bnb-4bit` (Gemma 4 E4B, pre-quantized to 4-bit)
- - **Adapter**: LoRA on language layers (attention + MLP), `r=32`, `α=32`, dropout 0
- - **Trainable params**: 73,400,320 of 8,069,556,768 (0.91 %)
  - **Loss masking**: response-only (loss computed on Laz output tokens, instruction prompt masked)
- - **Optimizer**: 8-bit AdamW, `lr=2e-4`, linear decay, warmup_ratio 0.03
- - **Batch**: 16 effective (8 per-device × 2 grad-accum, set by Unsloth auto-tuning)
- - **Steps**: 10,500 (~2.15 epochs over ~78K bidirectional conversations)
- - **Hardware**: 1× NVIDIA Tesla T4 (Kaggle), Unsloth runtime
- - **Training time**: ~12 h (run was cut by Kaggle's 12 h limit at step 10,500 of an intended 12,000; the resulting checkpoint is what's released)
  - **Bidirectional**: every TR↔LZ pair is presented in both directions during training
 
- ## Known limitations (and v0.2 roadmap)
 
- 1. **Dialect conditioning doesn't differentiate output yet.**
- "Atina (Pazar)" vs "Xopa (Hopa)" prompts currently produce the same translation. A dialect audit confirmed the *data signal exists* (66-80 % of dialect-tagged pairs have a different LZZ than the "general" entry for the same TR). v0.2 will upweight these pairs ~3× and front-load the dialect label in the prompt.
  2. **Short single-word queries collapse onto plausible-wrong tokens** (e.g. dictionary-style TR words sometimes yield a wrong Laz lemma). The corpus's still-dominant vocab slice teaches vocabulary lookup imperfectly.
- 3. **Long sentences occasionally exhibit list-style repetition.** `no_repeat_ngram_size=3` mitigates this but doesn't fully eliminate it.
  4. **Vocabulary edge cases** — some real Laz words are mistranslated (model emits a wrong-but-plausible Laz word).
- 5. **Single dialect bias in output** — the corpus is mostly general-form Laz with the largest single-dialect contribution being Atina (Pazar) at ~3,000 pairs; expect output to lean general / Atina.
 
  ## Bias and intended use
 
@@ -130,7 +131,7 @@ The training corpus mixes open-license sources (Wikipedia CC-BY-SA, Mozilla Comm
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/CidQuLimited/LazuriMT}},
- note = {v0.1 research preview, chrF 24.66 on 200 TR→LZ test pairs}
  }
  ```
 
 
 
  ---
 
+ LoRA adapter for Gemma 4 E4B. **v0.2 research preview.**
 
  ## ⚠️ Status: research preview, not production-quality
 
+ - **chrF on 200 held-out test pairs (TR→LZ): 26.97** (v0.1 was 24.66)
  - Real Laz output for natural sentences, but uneven on rare vocabulary and dialect conditioning.
  - Built for endangered-language preservation, research, and community use.
  - Full training pipeline + iteration log: <https://github.com/CidQu/lazca_ai>
 
  print(translate("Su içmek istiyorum."))
  ```
 
+ Pin to a specific release with `revision="v0.2"` (or `"v0.1"` for the older one):
 
  ```python
+ model = PeftModel.from_pretrained(model, "CidQuLimited/LazuriMT", revision="v0.2")
  ```
 
  ## Performance
 
  | Version | chrF (TR→LZ) | Notes |
  |---|---:|---|
  | baseline Gemma 4 E4B (no adapter) | ≈ 0 | does not translate Laz |
+ | v0.1 | 24.66 | LoRA r=32, 10,500 masked-loss steps (~2.15 epochs), Kaggle T4 |
+ | **v0.2 (this release)** | **26.97** | LoRA r=64, 18,000 steps (3 epochs), A100, cosine-restart LR, 3× dialect upweight |
 
  For context, chrF roughly maps:
  - ~10: garbled
 
  - ~40+: useful translations
  - ~50+: professional-level
 
+ LazuriMT v0.2 is in the "readable but flawed" range — a real but early baseline for a language with almost no prior MT.
 
  ## Training setup
 
  - **Base model**: `unsloth/gemma-4-e4b-it-unsloth-bnb-4bit` (Gemma 4 E4B, pre-quantized to 4-bit)
+ - **Adapter**: LoRA on language layers (attention + MLP), `r=64`, `α=64`, dropout 0
+ - **Trainable params**: 146,800,640 of 8,142,957,088 (1.80 %)
  - **Loss masking**: response-only (loss computed on Laz output tokens, instruction prompt masked)
+ - **Optimizer**: 8-bit AdamW, `lr=2e-4`, cosine-with-restarts (2 cycles), warmup_ratio 0.03, bf16
+ - **Batch**: 16 effective (8 per-device × 2 grad-accum)
+ - **Steps**: 18,000 (3 epochs over 102,461 conversations, incl. 3× dialect upweighting + grammar examples)
+ - **Hardware**: 1× NVIDIA A100-40GB (Modal), Unsloth runtime
+ - **Training time**: ~8 h (full run, no timeout)
  - **Bidirectional**: every TR↔LZ pair is presented in both directions during training
 
+ ## Known limitations (and v0.3 roadmap)
 
+ 1. **Dialect conditioning still doesn't differentiate output.**
+ "Atina (Pazar)" vs "Xopa (Hopa)" prompts produce near-identical translations. v0.2 *attempted* a fix (3× upweighting of dialect-tagged pairs plus a front-loaded `[Laz dialect: X]` label in the prompt), but it did not meaningfully change behavior. The likely cause: even at 3×, dialect-tagged pairs are only ~9 % of the training mix, so the model defaults to general-form Laz. v0.3 will try a **dialect-balanced sampler** (equal exposure per dialect rather than blunt upweighting) plus additional dialect-tagged parallel data.
  2. **Short single-word queries collapse onto plausible-wrong tokens** (e.g. dictionary-style TR words sometimes yield a wrong Laz lemma). The corpus's still-dominant vocab slice teaches vocabulary lookup imperfectly.
+ 3. **Long, content-dense sentences degrade**: they can diverge substantially from the reference (more a coverage/data-volume issue than a decoding one).
  4. **Vocabulary edge cases** — some real Laz words are mistranslated (model emits a wrong-but-plausible Laz word).
+ 5. **Single dialect bias in output** — the corpus is mostly general-form Laz with the largest single-dialect contribution being Atina (Pazar); expect output to lean general / Atina.
 
  ## Bias and intended use
 
 
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/CidQuLimited/LazuriMT}},
+ note = {v0.2 research preview, chrF 26.97 on 200 TR→LZ test pairs}
  }
  ```
 
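The README reports chrF scores (24.66 for v0.1, 26.97 for v0.2) without showing how the metric works. The sketch below is a simplified, sentence-level chrF-style score: the average character n-gram precision and recall for n = 1..6, combined into an F-score with β = 2. It is an illustration only. The function names are hypothetical, whitespace handling is simplified, and it is not guaranteed to reproduce the numbers from whatever evaluation tool the author used.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after dropping whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 100] for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this n-gram order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An identical hypothesis and reference score 100, and strings with no shared characters score 0. Because the score is computed over characters rather than words, partially correct word forms still earn credit, which may make it a reasonable fit for a morphologically rich language like Laz.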
adapter_config.json CHANGED
@@ -20,22 +20,24 @@
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
- "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
- "peft_version": "0.18.1",
  "qalora_group_size": 16,
- "r": 32,
  "rank_pattern": {},
  "revision": null,
  "target_modules": "(?:.*?(?:language|text).*?(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense).*?(?:k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|per_layer_input_gate|per_layer_projection|linear|embedding_projection|relative_k_proj).*?)|(?:\\bmodel\\.layers\\.[\\d]{1,}\\.(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense)\\.(?:(?:k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|per_layer_input_gate|per_layer_projection|linear|embedding_projection|relative_k_proj)))",
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
 
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
+ "lora_alpha": 64,
  "lora_bias": false,
  "lora_dropout": 0,
+ "lora_ga_config": null,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
+ "peft_version": "0.19.1",
  "qalora_group_size": 16,
+ "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": "(?:.*?(?:language|text).*?(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense).*?(?:k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|per_layer_input_gate|per_layer_projection|linear|embedding_projection|relative_k_proj).*?)|(?:\\bmodel\\.layers\\.[\\d]{1,}\\.(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense)\\.(?:(?:k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|per_layer_input_gate|per_layer_projection|linear|embedding_projection|relative_k_proj)))",
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
  "trainable_token_indices": null,
+ "use_bdlora": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
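The config change doubles both `r` and `lora_alpha`. With `use_rslora` false, standard LoRA scales the low-rank update by `lora_alpha / r`, so that factor stays at 1.0 while the adapter's capacity doubles. The sketch below checks this arithmetic against the figures in the README; `lora_scale` is a hypothetical helper name, not a PEFT function.

```python
def lora_scale(lora_alpha: int, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the LoRA update: alpha/r, or alpha/sqrt(r) for rsLoRA."""
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r


# Figures copied from the README and adapter_config.json in this commit.
v01 = {"r": 32, "lora_alpha": 32, "trainable": 73_400_320, "total": 8_069_556_768}
v02 = {"r": 64, "lora_alpha": 64, "trainable": 146_800_640, "total": 8_142_957_088}

for name, cfg in (("v0.1", v01), ("v0.2", v02)):
    scale = lora_scale(cfg["lora_alpha"], cfg["r"])
    pct = 100 * cfg["trainable"] / cfg["total"]
    print(f"{name}: scale={scale}, trainable={pct:.2f} %")

# Doubling the rank doubles the adapter's parameter count.
print(v02["trainable"] / v01["trainable"])
```

The two totals differ by exactly the extra 73,400,320 adapter parameters, which is consistent with the base model being unchanged between releases.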
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2b71e6e5be27cf600db658605e0e19fe8e2a52614ef2fd58ea908beab014c1f5
- size 293689248
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:2b55e8108a806a355feaea4b55cba23d55ab7261cc1ee3b75b3c56960a66c3e1
+ size 587290752
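As a sanity check on the LFS pointer, the new file size can be compared with the trainable-parameter counts from the README. The sketch below assumes the adapter weights are stored at 4 bytes per parameter (float32). That is an assumption, not something the commit states, though the sizes are consistent with it. Under that assumption, the small remainder would be file header and metadata.

```python
# Figures copied from the README and the git-lfs pointers in this commit.
TRAINABLE_PARAMS = {"v0.1": 73_400_320, "v0.2": 146_800_640}
FILE_SIZE_BYTES = {"v0.1": 293_689_248, "v0.2": 587_290_752}
BYTES_PER_PARAM = 4  # assumption: float32 storage


def overhead_bytes(version: str) -> int:
    """Bytes in the file beyond the raw weights, under the float32 assumption."""
    weights = TRAINABLE_PARAMS[version] * BYTES_PER_PARAM
    return FILE_SIZE_BYTES[version] - weights


for version in ("v0.1", "v0.2"):
    print(version, overhead_bytes(version))
```

Both releases leave under 100 KB unaccounted for, so the roughly 2× growth in file size matches the doubling of the LoRA rank.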