v0.2: chrF 26.97 on 200 TR->LZ test pairs (LoRA r=64, 18k steps, A100)

Files changed (3)

README.md +20 -19
adapter_config.json +5 -3
adapter_model.safetensors +2 -2

README.md CHANGED

@@ -27,11 +27,11 @@ pipeline_tag: translation
 ---
-LoRA adapter for Gemma 4 E4B. **v0.1 research preview.**
 ## ⚠️ Status: research preview, not production-quality
-- **chrF on 200 held-out test pairs (TR→LZ): 24.66**
 - Real Laz output for natural sentences, but uneven on rare vocabulary and dialect conditioning.
 - Built for endangered-language preservation, research, and community use.
 - Full training pipeline + iteration log: <https://github.com/CidQu/lazca_ai>
@@ -64,10 +64,10 @@ def translate(text, to="lzz"):
 print(translate("Su içmek istiyorum."))
 ```
-Pin to a specific release with `revision="v0.1"`:
 ```python
-model = PeftModel.from_pretrained(model, "CidQuLimited/LazuriMT", revision="v0.1")
 ```
 ## Performance
@@ -77,7 +77,8 @@ chrF computed on 200 held-out TR→LZ pairs from the corpus's test split (5%), w
 | Version | chrF (TR→LZ) | Notes |
 |---|---:|---|
 | baseline Gemma 4 E4B (no adapter) | ≈ 0 | does not translate Laz |
-| v0.1 (this release) | **24.66** | LoRA r=32, 10,500 masked-loss steps (~2.15 epochs) |
 For context, chrF roughly maps:
 - ~10: garbled
@@ -85,29 +86,29 @@ For context, chrF roughly maps:
 - ~40+: useful translations
 - ~50+: professional-level
-LazuriMT v0.1 is in the "readable but flawed" range — a real but early baseline for a language with almost no prior MT.
 ## Training setup
 - **Base model**: `unsloth/gemma-4-e4b-it-unsloth-bnb-4bit` (Gemma 4 E4B, pre-quantized to 4-bit)
-- **Adapter**: LoRA on language layers (attention + MLP), `r=32`, `α=32`, dropout 0
-- **Trainable params**: 73,400,320 of 8,069,556,768 (0.91 %)
 - **Loss masking**: response-only (loss computed on Laz output tokens, instruction prompt masked)
-- **Optimizer**: 8-bit AdamW, `lr=2e-4`, linear decay, warmup_ratio 0.03
-- **Batch**: 16 effective (8 per-device × 2 grad-accum, set by Unsloth auto-tuning)
-- **Steps**: 10,500 (≈ 2.15 epochs over ~78K bidirectional conversations)
-- **Hardware**: 1× NVIDIA Tesla T4 (Kaggle), Unsloth runtime
-- **Training time**: ~12 h (run was cut by Kaggle's 12 h limit at step 10,500 of an intended 12,000; the resulting checkpoint is what's released)
 - **Bidirectional**: every TR↔LZ pair is presented in both directions during training
-## Known limitations (and v0.2 roadmap)
-1. **Dialect conditioning doesn't differentiate output yet.**
-   "Atina (Pazar)" vs "Xopa (Hopa)" prompts currently produce the same translation. A dialect audit confirmed the *data signal exists* (66-80 % of dialect-tagged pairs have a different LZZ than the "general" entry for the same TR) — v0.2 will upweight these pairs ~3× and front-load the dialect label in the prompt.
 2. **Short single-word queries collapse onto plausible-wrong tokens** (e.g. dictionary-style TR words sometimes yield a wrong Laz lemma). The corpus's still-dominant vocab slice teaches vocabulary lookup imperfectly.
-3. **Long sentences occasionally exhibit list-style repetition.** `no_repeat_ngram_size=3` mitigates this but doesn't fully eliminate it.
 4. **Vocabulary edge cases** — some real Laz words are mistranslated (model emits a wrong-but-plausible Laz word).
-5. **Single dialect bias in output** — the corpus is mostly general-form Laz with the largest single-dialect contribution being Atina (Pazar) at ~3,000 pairs; expect output to lean general / Atina.
 ## Bias and intended use
@@ -130,7 +131,7 @@ The training corpus mixes open-license sources (Wikipedia CC-BY-SA, Mozilla Comm
   year   = {2026},
   publisher = {Hugging Face},
   howpublished = {\url{https://huggingface.co/CidQuLimited/LazuriMT}},
-  note   = {v0.1 research preview, chrF 24.66 on 200 TR→LZ test pairs}
 }
 ```

 ---
+LoRA adapter for Gemma 4 E4B. **v0.2 research preview.**
 ## ⚠️ Status: research preview, not production-quality
+- **chrF on 200 held-out test pairs (TR→LZ): 26.97** (v0.1 was 24.66)
 - Real Laz output for natural sentences, but uneven on rare vocabulary and dialect conditioning.
 - Built for endangered-language preservation, research, and community use.
 - Full training pipeline + iteration log: <https://github.com/CidQu/lazca_ai>
 print(translate("Su içmek istiyorum."))
 ```
+Pin to a specific release with `revision="v0.2"` (or `"v0.1"` for the older one):
 ```python
+model = PeftModel.from_pretrained(model, "CidQuLimited/LazuriMT", revision="v0.2")
 ```
 ## Performance
 | Version | chrF (TR→LZ) | Notes |
 |---|---:|---|
 | baseline Gemma 4 E4B (no adapter) | ≈ 0 | does not translate Laz |
+| v0.1 | 24.66 | LoRA r=32, 10,500 masked-loss steps (~2.15 epochs), Kaggle T4 |
+| **v0.2 (this release)** | **26.97** | LoRA r=64, 18,000 steps (3 epochs), A100, cosine-restart LR, 3× dialect upweight |
 For context, chrF roughly maps:
 - ~10: garbled
 - ~40+: useful translations
 - ~50+: professional-level
+LazuriMT v0.2 is in the "readable but flawed" range — a real but early baseline for a language with almost no prior MT.
 ## Training setup
 - **Base model**: `unsloth/gemma-4-e4b-it-unsloth-bnb-4bit` (Gemma 4 E4B, pre-quantized to 4-bit)
+- **Adapter**: LoRA on language layers (attention + MLP), `r=64`, `α=64`, dropout 0
+- **Trainable params**: 146,800,640 of 8,142,957,088 (1.80 %)
 - **Loss masking**: response-only (loss computed on Laz output tokens, instruction prompt masked)
+- **Optimizer**: 8-bit AdamW, `lr=2e-4`, cosine-with-restarts (2 cycles), warmup_ratio 0.03, bf16
+- **Batch**: 16 effective (8 per-device × 2 grad-accum)
+- **Steps**: 18,000 (3 epochs over 102,461 conversations, incl. 3× dialect upweighting + grammar examples)
+- **Hardware**: 1× NVIDIA A100-40GB (Modal), Unsloth runtime
+- **Training time**: ~8 h (full run, no timeout)
 - **Bidirectional**: every TR↔LZ pair is presented in both directions during training
+## Known limitations (and v0.3 roadmap)
+1. **Dialect conditioning still doesn't differentiate output.**
+   "Atina (Pazar)" vs "Xopa (Hopa)" prompts produce near-identical translations. v0.2 *attempted* a fix — 3× upweighting of dialect-tagged pairs plus a front-loaded `[Laz dialect: X]` label in the prompt — but it did not meaningfully change behavior. The likely cause: even at 3×, dialect-tagged pairs are only ~9 % of the training mix, so the model defaults to general-form Laz. v0.3 will try a **dialect-balanced sampler** (equal exposure per dialect rather than blunt upweighting) plus additional dialect-tagged parallel data.
 2. **Short single-word queries collapse onto plausible-wrong tokens** (e.g. dictionary-style TR words sometimes yield a wrong Laz lemma). The corpus's still-dominant vocab slice teaches vocabulary lookup imperfectly.
+3. **Long, content-dense sentences degrade** — they can diverge substantially from the reference (more a coverage/data-volume issue than a decoding one).
 4. **Vocabulary edge cases** — some real Laz words are mistranslated (model emits a wrong-but-plausible Laz word).
+5. **Single dialect bias in output** — the corpus is mostly general-form Laz with the largest single-dialect contribution being Atina (Pazar); expect output to lean general / Atina.
 ## Bias and intended use
   year   = {2026},
   publisher = {Hugging Face},
   howpublished = {\url{https://huggingface.co/CidQuLimited/LazuriMT}},
+  note   = {v0.2 research preview, chrF 26.97 on 200 TR→LZ test pairs}
 }
 ```
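The headline figures in the two "Training setup" lists can be cross-checked with a few lines of arithmetic. The sketch below is not part of the repository. It assumes that one optimizer step consumes exactly one effective batch of 16 conversations, and it rounds the v0.1 corpus size to 78,000 from the README's "~78K". The dictionary layout and function names are hypothetical.

```python
# Figures copied from the v0.1 and v0.2 "Training setup" lists in the README diff.
RUNS = {
    "v0.1": {"trainable": 73_400_320, "total": 8_069_556_768,
             "steps": 10_500, "conversations": 78_000},  # "~78K" in the README
    "v0.2": {"trainable": 146_800_640, "total": 8_142_957_088,
             "steps": 18_000, "conversations": 102_461},
}
# 8 per-device x 2 grad-accum. Assumption: one step = 16 conversations.
EFFECTIVE_BATCH = 16


def trainable_percent(run):
    """Share of parameters that are trainable, in percent."""
    return 100.0 * run["trainable"] / run["total"]


def approx_epochs(run, batch=EFFECTIVE_BATCH):
    """Passes over the corpus implied by the step count and batch size."""
    return run["steps"] * batch / run["conversations"]


def frozen_params(run):
    """Parameters left over once the adapter is subtracted from the total."""
    return run["total"] - run["trainable"]


for name, run in RUNS.items():
    print(f"{name}: {trainable_percent(run):.2f} % trainable, "
          f"~{approx_epochs(run):.2f} epochs, "
          f"{frozen_params(run):,} frozen parameters")
```

Under these assumptions the percentages come out at 0.91 % and 1.80 %, matching the README, and the frozen parameter count is identical for both versions, as expected when only the LoRA rank changes. The epoch estimate is about 2.15 for v0.1 and about 2.81 for v0.2, so the "3 epochs" in the v0.2 notes reads as a rounded figure if the one-step-per-batch assumption holds.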

adapter_config.json CHANGED

@@ -20,22 +20,24 @@
   "layers_pattern": null,
   "layers_to_transform": null,
   "loftq_config": {},
-  "lora_alpha": 32,
   "lora_bias": false,
   "lora_dropout": 0,
   "megatron_config": null,
   "megatron_core": "megatron.core",
   "modules_to_save": null,
   "peft_type": "LORA",
-  "peft_version": "0.18.1",
   "qalora_group_size": 16,
-  "r": 32,
   "rank_pattern": {},
   "revision": null,
   "target_modules": "(?:.*?(?:language|text).*?(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense).*?(?:k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|per_layer_input_gate|per_layer_projection|linear|embedding_projection|relative_k_proj).*?)|(?:\\bmodel\\.layers\\.[\\d]{1,}\\.(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense)\\.(?:(?:k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|per_layer_input_gate|per_layer_projection|linear|embedding_projection|relative_k_proj)))",
   "target_parameters": null,
   "task_type": "CAUSAL_LM",
   "trainable_token_indices": null,
   "use_dora": false,
   "use_qalora": false,
   "use_rslora": false

   "layers_pattern": null,
   "layers_to_transform": null,
   "loftq_config": {},
+  "lora_alpha": 64,
   "lora_bias": false,
   "lora_dropout": 0,
+  "lora_ga_config": null,
   "megatron_config": null,
   "megatron_core": "megatron.core",
   "modules_to_save": null,
   "peft_type": "LORA",
+  "peft_version": "0.19.1",
   "qalora_group_size": 16,
+  "r": 64,
   "rank_pattern": {},
   "revision": null,
   "target_modules": "(?:.*?(?:language|text).*?(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense).*?(?:k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|per_layer_input_gate|per_layer_projection|linear|embedding_projection|relative_k_proj).*?)|(?:\\bmodel\\.layers\\.[\\d]{1,}\\.(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense)\\.(?:(?:k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|per_layer_input_gate|per_layer_projection|linear|embedding_projection|relative_k_proj)))",
   "target_parameters": null,
   "task_type": "CAUSAL_LM",
   "trainable_token_indices": null,
+  "use_bdlora": null,
   "use_dora": false,
   "use_qalora": false,
   "use_rslora": false

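The `target_modules` value is unchanged between the two versions and is a single regular expression rather than a list. The sketch below reassembles that pattern from its repeated pieces and tries it on a few module names. The module names are hypothetical examples, not read from the model, and the use of `re.fullmatch` reflects an assumption about how PEFT applies a string-valued `target_modules` (a match against the full module name).

```python
import re

# Pieces of the "target_modules" pattern from adapter_config.json.
# In the JSON file, "\\b" and "\\." decode to the regex escapes \b and \.
BLOCKS = "self_attn|attention|attn|mlp|feed_forward|ffn|dense"
PROJS = ("k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|"
         "per_layer_input_gate|per_layer_projection|linear|"
         "embedding_projection|relative_k_proj")

TARGET_MODULES = (
    rf"(?:.*?(?:language|text).*?(?:{BLOCKS}).*?(?:{PROJS}).*?)"
    rf"|(?:\bmodel\.layers\.[\d]{{1,}}\.(?:{BLOCKS})\.(?:(?:{PROJS})))"
)


def is_targeted(module_name):
    """True if the whole module name matches the pattern.

    Assumption: the pattern must match the full name (re.fullmatch).
    """
    return re.fullmatch(TARGET_MODULES, module_name) is not None


# Hypothetical module names, for illustration only.
EXAMPLES = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.layers.3.mlp.gate_proj",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.embed_tokens",
]

for name in EXAMPLES:
    print(f"{name}: {is_targeted(name)}")
```

With these example names, the attention and MLP projections under a `language` path and under a bare `model.layers.N` path are selected, while the vision-tower projection and the embedding table are not. That is consistent with the README's description of "LoRA on language layers (attention + MLP)".
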
adapter_model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2b71e6e5be27cf600db658605e0e19fe8e2a52614ef2fd58ea908beab014c1f5
-size 293689248

 version https://git-lfs.github.com/spec/v1
+oid sha256:2b55e8108a806a355feaea4b55cba23d55ab7261cc1ee3b75b3c56960a66c3e1
+size 587290752
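
The size change in the LFS pointer can be compared with the trainable-parameter counts quoted in the README diff. This is a rough consistency check, not something taken from the repository. It assumes the adapter tensors are stored as 4-byte floats, which the diff does not state, and it treats whatever remains as header and metadata. The names below are hypothetical.

```python
# Trainable-parameter counts from the README diff.
TRAINABLE = {"v0.1": 73_400_320, "v0.2": 146_800_640}
# File sizes in bytes from the old and new LFS pointers.
FILE_SIZE = {"v0.1": 293_689_248, "v0.2": 587_290_752}
# Assumption: adapter weights are saved as float32 (4 bytes each).
BYTES_PER_PARAM = 4


def non_tensor_bytes(version):
    """Bytes left after subtracting the assumed tensor payload."""
    return FILE_SIZE[version] - TRAINABLE[version] * BYTES_PER_PARAM


for version in ("v0.1", "v0.2"):
    print(f"{version}: {non_tensor_bytes(version):,} bytes not explained by tensors")
```

Under the float32 assumption the remainder is under 100 KB for both files, so the roughly doubled file size lines up with the move from `r=32` to `r=64`.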