CidQu committed · commit e3423e2 (verified) · 1 parent: 666e223

v0.1: research preview, chrF 24.66 on 200 TR->LZ test pairs

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,161 @@
+ ---
+ license: gemma
+ library_name: peft
+ base_model: unsloth/gemma-4-e4b-it-unsloth-bnb-4bit
+ tags:
+ - translation
+ - laz
+ - lazuri
+ - turkish
+ - lzz
+ - endangered-language
+ - kartvelian
+ - low-resource
+ language:
+ - tr
+ - lzz
+ pipeline_tag: translation
+ ---
+
+ # LazuriMT: Turkish ↔ Laz (Lazuri) Translation
+
+ LoRA adapter for Gemma 4 E4B that translates between Turkish (`tr`) and Laz / Lazuri (`lzz`), an endangered Kartvelian language with ~30,000–250,000 speakers in northeastern Türkiye and parts of Georgia. **v0.1 research preview.**
+
+ ## ⚠️ Status: research preview, not production-quality
+
+ - **chrF on 200 held-out test pairs (TR→LZ): 24.66**
+ - Produces real Laz for natural sentences, but quality is uneven on rare vocabulary and dialect conditioning.
+ - Built for endangered-language preservation, research, and community use.
+ - Full training pipeline + iteration log: <https://github.com/CidQu/lazca_ai>
+
+ ## Quick start
+
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ base = "unsloth/gemma-4-e4b-it-unsloth-bnb-4bit"
+ # The base checkpoint is already quantized to 4-bit, so its quantization
+ # config is read from the repo and no load_in_4bit flag is passed here.
+ model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
+ model = PeftModel.from_pretrained(model, "CidQuLimited/LazuriMT")
+ tok = AutoTokenizer.from_pretrained("CidQuLimited/LazuriMT")
+
+ def translate(text, to="lzz"):
+     prompt = (f"Translate this Turkish sentence into Laz (Lazuri):\n\n{text}"
+               if to == "lzz"
+               else f"Translate this Laz (Lazuri) sentence into Turkish:\n\n{text}")
+     # return_dict=True returns input_ids and attention_mask together.
+     inputs = tok.apply_chat_template(
+         [{"role": "user", "content": prompt}],
+         tokenize=True, add_generation_prompt=True,
+         return_tensors="pt", return_dict=True,
+     ).to(model.device)
+     out = model.generate(
+         **inputs, max_new_tokens=128, do_sample=False,
+         no_repeat_ngram_size=3, repetition_penalty=1.15, num_beams=4,
+     )
+     # Decode only the newly generated tokens, not the prompt.
+     prompt_len = inputs["input_ids"].shape[1]
+     return tok.decode(out[0][prompt_len:], skip_special_tokens=True).strip()
+
+ print(translate("Su içmek istiyorum."))
+ ```
+
+ Pin to a specific release with `revision="v0.1"`:
+
+ ```python
+ model = PeftModel.from_pretrained(model, "CidQuLimited/LazuriMT", revision="v0.1")
+ ```
+
+ ## Performance
+
+ chrF computed on 200 held-out TR→LZ pairs from the corpus's test split (5%), with beam-search decoding (`no_repeat_ngram_size=3`, `repetition_penalty=1.15`, `num_beams=4`).
+
+ | Version | chrF (TR→LZ) | Notes |
+ |---|---:|---|
+ | baseline Gemma 4 E4B (no adapter) | ≈ 0 | does not translate Laz |
+ | v0.1 (this release) | **24.66** | LoRA r=32, 10,500 masked-loss steps (~2.15 epochs) |
+
+ For context, chrF scores map roughly as follows:
+ - ~10: garbled
+ - ~20–30: readable but flawed
+ - ~40+: useful translations
+ - ~50+: professional-level
+
+ LazuriMT v0.1 is in the "readable but flawed" range: a real but early baseline for a language with almost no prior MT.
+
+ ## Training data
+
+ ~42,500 unique sentence pairs after deduplication, from a mix of openly-licensed and academically-attributed sources. The mix is roughly 45% sentence-level prose and 55% dictionary entries.
+
+ | Source | Pairs (approx.) | Attribution / License |
+ |---|---:|---|
+ | Bucaklişi Lazuri Nenapuna (online) | 12,800 | İ. A. Bucaklişi & H. Uzunhasanoğlu / lazuri.com |
+ | Lazuri.Com Sözlük 1.0 (2003-2005) | 12,500 | İ. A. Bucaklişi & H. Uzunhasanoğlu |
+ | lazcasozluk.org dictionary | 10,800 | Bucaklişi (online dictionary) |
+ | MEB Lazuri 5/6/7 textbooks | 5,100 | Türkiye Ministry of National Education (vision-OCR) |
+ | Lazuri Wiktionary GitHub | 4,300 | GPL-3.0, Mass-Upload-Lazuri-Wiktionary repo |
+ | Aksamaz folktales + lessons + interviews | 2,800 | Ali İhsan Aksamaz, via sonhaber.ch |
+ | lazuri.com Didinana memoirs (sentence-split) | 1,400 | Lazuri.Com |
+ | Anadolu Dillerinde Küçük Prens (Le Petit Prince in Laz) | 1,250 | Translated by Özlem Durmaz |
+ | Mozilla Common Voice UI strings | 1,080 | CC-BY-3.0, contributors of Common Voice |
+ | Doviguram lessons + grammar | 1,030 | Lazuri.Com / Atelya Lazuri |
+ | lazuri.com paramitepe (folktales, sentence-split) | 590 | Lazuri Paramitepe (Tbilisi, 1982); public-domain Soviet-era folktale collection |
+ | siir (Laz poetry, parallel translations) | 460 | Lazuri.Com siir collection |
+ | Tatoeba + Glosbe crowdsourced | 200 | CC-BY |
+ | Wikipedia Lazuri Incubator (first sentences) | 56 | CC-BY-SA-3.0 |
+ | Other (riddles, prayers, etc.) | < 100 | various |
+
+ Plus 525 grammar-instruction examples extracted from textbook lesson pages and mixed into training as instruction-tuning conversations.
+
+ **This is a research / preservation release.** Training data was collected and processed under fair-use academic principles for endangered-language MT research. The adapter encodes patterns from the corpus but does not redistribute source texts. If you are a rights-holder for any listed source and have concerns, please open an issue at <https://github.com/CidQu/lazca_ai/issues>.
+
+ ## Training setup
+
+ - **Base model**: `unsloth/gemma-4-e4b-it-unsloth-bnb-4bit` (Gemma 4 E4B, pre-quantized to 4-bit)
+ - **Adapter**: LoRA on language layers (attention + MLP), `r=32`, `α=32`, dropout 0
+ - **Trainable params**: 73,400,320 of 8,069,556,768 (0.91%)
+ - **Loss masking**: response-only (loss computed on Laz output tokens, instruction prompt masked)
+ - **Optimizer**: 8-bit AdamW, `lr=2e-4`, linear decay, warmup_ratio 0.03
+ - **Batch**: 16 effective (8 per-device × 2 grad-accum, set by Unsloth auto-tuning)
+ - **Steps**: 10,500 (≈ 2.15 epochs over ~78K bidirectional conversations)
+ - **Hardware**: 1× NVIDIA Tesla T4 (Kaggle), Unsloth runtime
+ - **Training time**: ~12 h (run was cut by Kaggle's 12 h limit at step 10,500 of an intended 12,000; the resulting checkpoint is what's released)
+ - **Bidirectional**: every TR↔LZ pair is presented in both directions during training
+
+ ## Known limitations (and v0.2 roadmap)
+
+ 1. **Dialect conditioning doesn't differentiate output yet.**
+    "Atina (Pazar)" vs "Xopa (Hopa)" prompts currently produce the same translation. A dialect audit confirmed the *data signal exists*: 66–80% of dialect-tagged pairs have a different LZZ than the "general" entry for the same TR. v0.2 will upweight these pairs ~3× and front-load the dialect label in the prompt.
+ 2. **Short single-word queries collapse onto plausible-wrong tokens** (e.g. dictionary-style TR words sometimes yield a wrong Laz lemma). Dictionary entries still dominate the corpus, yet they teach vocabulary lookup only imperfectly.
+ 3. **Long sentences occasionally exhibit list-style repetition.** `no_repeat_ngram_size=3` mitigates this but doesn't fully eliminate it.
+ 4. **Vocabulary edge cases.** Some real Laz words are mistranslated (the model emits a wrong-but-plausible Laz word).
+ 5. **Single-dialect bias in output.** The corpus is mostly general-form Laz, and the largest single-dialect contribution is Atina (Pazar) at ~3,000 pairs; expect output to lean general / Atina.
+
+ ## Bias and intended use
+
+ - **Intended for**: Laz language preservation, research, language-learning aids, accessibility tools, community projects. Not a replacement for human translators in any setting where accuracy matters (legal, medical, etc.).
+ - **Bias**: trained on a mix of written sources (dictionaries, school textbooks, folktales, news articles). Output will reflect the registers and dialects of those sources.
+ - **Out of scope**: code translation, modern colloquial / internet Laz (the corpus is mostly literary/educational).
+
+ ## License
+
+ The adapter is a derivative work of [Gemma 4](https://ai.google.dev/gemma) and inherits the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), which are commercial-friendly but carry acceptable-use restrictions. Downstream users must comply with Gemma's terms.
+
+ The training corpus mixes open-license sources (Wikipedia CC-BY-SA, Mozilla Common Voice CC-BY, GPL-3.0 Wiktionary, public-domain Lazuri Paramitepe 1982) with academically-attributed sources used under fair-use for endangered-language research. The adapter weights are released for research and community use under these combined terms.
+
+ ## Citation
+
+ ```bibtex
+ @misc{lazurimt2026,
+   title = {LazuriMT: A Turkish-Laz Machine Translation Adapter for an Endangered Kartvelian Language},
+   author = {Yavuz Selimhan Kaya},
+   year = {2026},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/CidQuLimited/LazuriMT}},
+   note = {v0.1 research preview, chrF 24.66 on 200 TR→LZ test pairs}
+ }
+ ```
+
+ ## Acknowledgments
+
+ The Lazuri community: İsmail Avcı Bucaklişi, Hasan Uzunhasanoğlu (Lazuri.Com), Ali İhsan Aksamaz, Özlem Durmaz (translator of *Anadolu Dillerinde Küçük Prens*), the Laz Institute, contributors to lazcasozluk.org and the Lazuri Wiktionary GitHub project, the Ministry of National Education of Türkiye, the broader Laz language preservation community, and every Laz speaker who has kept this language alive.
+
+ Tools: [Unsloth](https://unsloth.ai) for the QLoRA training stack, [Google's Gemma 4](https://ai.google.dev/gemma) as the base model, the Hugging Face ecosystem, and Kaggle for the GPU compute.
+
+ Full reproduction code, training data sources, and iteration history: <https://github.com/CidQu/lazca_ai>
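Review note on the chrF figure: the README reports chrF but does not say which tool produced it. To make the metric's mechanics concrete, here is a minimal pure-Python sketch of a sentence-level character n-gram F-score, assuming n-gram orders 1 to 6 and β = 2 (the usual chrF defaults). The names `char_ngrams` and `simple_chrf` are illustrative and not part of this repository, and the sketch simplifies whitespace handling and averaging, so it should not be expected to reproduce the reported 24.66.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale.

    Averages character n-gram precision and recall over n = 1..max_n,
    then combines them with an F-beta score (beta=2 favors recall).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because the score is built from character n-grams rather than whole words, partially correct word forms still earn credit, which is why chrF is a common choice for morphologically rich, low-resource languages.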
adapter_config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": {
+     "base_model_class": "Gemma4ForConditionalGeneration",
+     "parent_library": "transformers.models.gemma4.modeling_gemma4",
+     "unsloth_fixed": true
+   },
+   "base_model_name_or_path": "unsloth/gemma-4-e4b-it-unsloth-bnb-4bit",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_bias": false,
+   "lora_dropout": 0,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.18.1",
+   "qalora_group_size": 16,
+   "r": 32,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": "(?:.*?(?:language|text).*?(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense).*?(?:k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|per_layer_input_gate|per_layer_projection|linear|embedding_projection|relative_k_proj).*?)|(?:\\bmodel\\.layers\\.[\\d]{1,}\\.(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense)\\.(?:(?:k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj|per_layer_input_gate|per_layer_projection|linear|embedding_projection|relative_k_proj)))",
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
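Review note on `target_modules`: the value is a single regular expression, which is hard to read at a glance. The sketch below rebuilds it (with the JSON escapes decoded) and checks a few module names against it. It assumes that PEFT matches a string `target_modules` against the full module name, which is my understanding of its behavior; the module names are hypothetical examples, not taken from the actual Gemma 4 model, and `is_targeted` is an illustrative helper.

```python
import re

# Building blocks of the pattern in adapter_config.json.
BLOCK = r"self_attn|attention|attn|mlp|feed_forward|ffn|dense"
PROJ = (
    r"k_proj|q_proj|v_proj|o_proj|gate_proj|up_proj|down_proj"
    r"|per_layer_input_gate|per_layer_projection|linear"
    r"|embedding_projection|relative_k_proj"
)

# Alternative 1: any name mentioning "language" or "text", then an
# attention/MLP block, then a projection layer.
# Alternative 2: a plain "model.layers.<N>.<block>.<projection>" name.
TARGET_MODULES = (
    rf"(?:.*?(?:language|text).*?(?:{BLOCK}).*?(?:{PROJ}).*?)"
    rf"|(?:\bmodel\.layers\.[\d]{{1,}}\.(?:{BLOCK})\.(?:(?:{PROJ})))"
)


def is_targeted(module_name: str) -> bool:
    """Return True if the whole module name matches the pattern."""
    return re.fullmatch(TARGET_MODULES, module_name) is not None


# Hypothetical module names, for illustration only.
for name in [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.layers.3.mlp.gate_proj",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.input_layernorm",
]:
    print(f"{name}: {is_targeted(name)}")
```

Under these assumptions the pattern selects attention and MLP projections in the language stack and skips vision-side modules and normalization layers, which agrees with the README's "LoRA on language layers (attention + MLP)".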
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2b71e6e5be27cf600db658605e0e19fe8e2a52614ef2fd58ea908beab014c1f5
+ size 293689248
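Review note on the adapter size: the three lines above are a Git LFS pointer, not the weights themselves. As a sanity check, the sketch below parses the pointer and divides the stated size by the trainable-parameter count from the README. `parse_lfs_pointer` is an illustrative helper, and reading the result as 32-bit storage is an inference from the arithmetic, not something the commit states.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:2b71e6e5be27cf600db658605e0e19fe8e2a52614ef2fd58ea908beab014c1f5
size 293689248
"""

TRAINABLE_PARAMS = 73_400_320  # "Trainable params" figure from the README

info = parse_lfs_pointer(POINTER)
bytes_per_param = info["size"] / TRAINABLE_PARAMS
print(f"{info['size']:,} bytes / {TRAINABLE_PARAMS:,} params "
      f"= {bytes_per_param:.4f} bytes per parameter")
```

The ratio comes out at just over 4 bytes per parameter. That is what one would expect if the LoRA matrices are stored as 32-bit floats, with the small remainder going to the safetensors header.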
chat_template.jinja ADDED
@@ -0,0 +1,351 @@
+ {%- macro format_parameters(properties, required, filter_keys=false) -%}
+ {%- set standard_keys = ['description', 'type', 'properties', 'required', 'nullable'] -%}
+ {%- set ns = namespace(found_first=false) -%}
+ {%- for key, value in properties | dictsort -%}
+ {%- set add_comma = false -%}
+ {%- if not filter_keys or key not in standard_keys -%}
+ {%- if ns.found_first %},{% endif -%}
+ {%- set ns.found_first = true -%}
+ {{ key }}:{
+ {%- if value['description'] -%}
+ description:<|"|>{{ value['description'] }}<|"|>
+ {%- set add_comma = true -%}
+ {%- endif -%}
+ {%- if value['type'] | upper == 'STRING' -%}
+ {%- if value['enum'] -%}
+ {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
+ enum:{{ format_argument(value['enum']) }}
+ {%- endif -%}
+ {%- elif value['type'] | upper == 'ARRAY' -%}
+ {%- if value['items'] is mapping and value['items'] -%}
+ {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
+ items:{
+ {%- set ns_items = namespace(found_first=false) -%}
+ {%- for item_key, item_value in value['items'] | dictsort -%}
+ {%- if item_value is not none -%}
+ {%- if ns_items.found_first %},{% endif -%}
+ {%- set ns_items.found_first = true -%}
+ {%- if item_key == 'properties' -%}
+ properties:{
+ {%- if item_value is mapping -%}
+ {{- format_parameters(item_value, value['items']['required'] | default([])) -}}
+ {%- endif -%}
+ }
+ {%- elif item_key == 'required' -%}
+ required:[
+ {%- for req_item in item_value -%}
+ <|"|>{{- req_item -}}<|"|>
+ {%- if not loop.last %},{% endif -%}
+ {%- endfor -%}
+ ]
+ {%- elif item_key == 'type' -%}
+ {%- if item_value is string -%}
+ type:{{ format_argument(item_value | upper) }}
+ {%- else -%}
+ type:{{ format_argument(item_value | map('upper') | list) }}
+ {%- endif -%}
+ {%- else -%}
+ {{ item_key }}:{{ format_argument(item_value) }}
+ {%- endif -%}
+ {%- endif -%}
+ {%- endfor -%}
+ }
+ {%- endif -%}
+ {%- endif -%}
+ {%- if value['nullable'] %}
+ {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
+ nullable:true
+ {%- endif -%}
+ {%- if value['type'] | upper == 'OBJECT' -%}
+ {%- if value['properties'] is defined and value['properties'] is mapping -%}
+ {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
+ properties:{
+ {{- format_parameters(value['properties'], value['required'] | default([])) -}}
+ }
+ {%- elif value is mapping -%}
+ {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
+ properties:{
+ {{- format_parameters(value, value['required'] | default([]), filter_keys=true) -}}
+ }
+ {%- endif -%}
+ {%- if value['required'] -%}
+ {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
+ required:[
+ {%- for item in value['required'] | default([]) -%}
+ <|"|>{{- item -}}<|"|>
+ {%- if not loop.last %},{% endif -%}
+ {%- endfor -%}
+ ]
+ {%- endif -%}
+ {%- endif -%}
+ {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
+ type:<|"|>{{ value['type'] | upper }}<|"|>}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- endmacro -%}
+ {%- macro format_function_declaration(tool_data) -%}
+ declaration:{{- tool_data['function']['name'] -}}{description:<|"|>{{- tool_data['function']['description'] -}}<|"|>
+ {%- set params = tool_data['function']['parameters'] -%}
+ {%- if params -%}
+ ,parameters:{
+ {%- if params['properties'] -%}
+ properties:{ {{- format_parameters(params['properties'], params['required']) -}} },
+ {%- endif -%}
+ {%- if params['required'] -%}
+ required:[
+ {%- for item in params['required'] -%}
+ <|"|>{{- item -}}<|"|>
+ {{- ',' if not loop.last -}}
+ {%- endfor -%}
+ ],
+ {%- endif -%}
+ {%- if params['type'] -%}
+ type:<|"|>{{- params['type'] | upper -}}<|"|>}
+ {%- endif -%}
+ {%- endif -%}
+ {%- if 'response' in tool_data['function'] -%}
+ {%- set response_declaration = tool_data['function']['response'] -%}
+ ,response:{
+ {%- if response_declaration['description'] -%}
+ description:<|"|>{{- response_declaration['description'] -}}<|"|>,
+ {%- endif -%}
+ {%- if response_declaration['type'] | upper == 'OBJECT' -%}
+ type:<|"|>{{- response_declaration['type'] | upper -}}<|"|>}
+ {%- endif -%}
+ {%- endif -%}
+ }
+ {%- endmacro -%}
+ {%- macro format_argument(argument, escape_keys=True) -%}
+ {%- if argument is string -%}
+ {{- '<|"|>' + argument + '<|"|>' -}}
+ {%- elif argument is boolean -%}
+ {{- 'true' if argument else 'false' -}}
+ {%- elif argument is mapping -%}
+ {{- '{' -}}
+ {%- set ns = namespace(found_first=false) -%}
+ {%- for key, value in argument | dictsort -%}
+ {%- if ns.found_first %},{% endif -%}
+ {%- set ns.found_first = true -%}
+ {%- if escape_keys -%}
+ {{- '<|"|>' + key + '<|"|>' -}}
+ {%- else -%}
+ {{- key -}}
+ {%- endif -%}
+ :{{- format_argument(value, escape_keys=escape_keys) -}}
+ {%- endfor -%}
+ {{- '}' -}}
+ {%- elif argument is sequence -%}
+ {{- '[' -}}
+ {%- for item in argument -%}
+ {{- format_argument(item, escape_keys=escape_keys) -}}
+ {%- if not loop.last %},{% endif -%}
+ {%- endfor -%}
+ {{- ']' -}}
+ {%- else -%}
+ {{- argument -}}
+ {%- endif -%}
+ {%- endmacro -%}
+ {%- macro strip_thinking(text) -%}
+ {%- set ns = namespace(result='') -%}
+ {%- for part in text.split('<channel|>') -%}
+ {%- if '<|channel>' in part -%}
+ {%- set ns.result = ns.result + part.split('<|channel>')[0] -%}
+ {%- else -%}
+ {%- set ns.result = ns.result + part -%}
+ {%- endif -%}
+ {%- endfor -%}
+ {{- ns.result | trim -}}
+ {%- endmacro -%}
+
+ {%- macro format_tool_response_block(tool_name, response) -%}
+ {{- '<|tool_response>' -}}
+ {%- if response is mapping -%}
+ {{- 'response:' + tool_name + '{' -}}
+ {%- for key, value in response | dictsort -%}
+ {{- key -}}:{{- format_argument(value, escape_keys=False) -}}
+ {%- if not loop.last %},{% endif -%}
+ {%- endfor -%}
+ {{- '}' -}}
+ {%- else -%}
+ {{- 'response:' + tool_name + '{value:' + format_argument(response, escape_keys=False) + '}' -}}
+ {%- endif -%}
+ {{- '<tool_response|>' -}}
+ {%- endmacro -%}
+
+ {%- set ns = namespace(prev_message_type=None) -%}
+ {%- set loop_messages = messages -%}
+ {{- bos_token -}}
+ {#- Handle System/Tool Definitions Block -#}
+ {%- if (enable_thinking is defined and enable_thinking) or tools or messages[0]['role'] in ['system', 'developer'] -%}
+ {{- '<|turn>system\n' -}}
+ {#- Inject Thinking token at the very top of the FIRST system turn -#}
+ {%- if enable_thinking is defined and enable_thinking -%}
+ {{- '<|think|>\n' -}}
+ {%- set ns.prev_message_type = 'think' -%}
+ {%- endif -%}
+ {%- if messages[0]['role'] in ['system', 'developer'] -%}
+ {%- if messages[0]['content'] is string -%}
+ {{- messages[0]['content'] | trim -}}
+ {%- elif messages[0]['content'] is sequence -%}
+ {%- for item in messages[0]['content'] -%}
+ {{- item['text'] | trim + ' '-}}
+ {%- endfor -%}
+ {%- endif -%}
+ {%- set loop_messages = messages[1:] -%}
+ {%- endif -%}
+ {%- if tools -%}
+ {%- for tool in tools %}
+ {{- '<|tool>' -}}
+ {{- format_function_declaration(tool) | trim -}}
+ {{- '<tool|>' -}}
+ {%- endfor %}
+ {%- set ns.prev_message_type = 'tool' -%}
+ {%- endif -%}
+ {{- '<turn|>\n' -}}
+ {%- endif %}
+
+ {#- Pre-scan: find last user message index for reasoning guard -#}
+ {%- set ns_turn = namespace(last_user_idx=-1) -%}
+ {%- for i in range(loop_messages | length) -%}
+ {%- if loop_messages[i]['role'] == 'user' -%}
+ {%- set ns_turn.last_user_idx = i -%}
+ {%- endif -%}
+ {%- endfor -%}
+
+ {#- Loop through messages -#}
+ {%- for message in loop_messages -%}
+ {%- if message['role'] != 'tool' -%}
+ {%- set ns.prev_message_type = None -%}
+ {%- set role = 'model' if message['role'] == 'assistant' else message['role'] -%}
+ {#- Detect continuation: suppress duplicate <|turn>model when previous non-tool message was also assistant -#}
+ {%- set prev_nt = namespace(role=None, found=false) -%}
+ {%- if loop.index0 > 0 -%}
+ {%- for j in range(loop.index0 - 1, -1, -1) -%}
+ {%- if not prev_nt.found -%}
+ {%- if loop_messages[j]['role'] != 'tool' -%}
+ {%- set prev_nt.role = loop_messages[j]['role'] -%}
+ {%- set prev_nt.found = true -%}
+ {%- endif -%}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- endif -%}
+ {%- set continue_same_model_turn = (role == 'model' and prev_nt.role == 'assistant') -%}
+ {%- if not continue_same_model_turn -%}
+ {{- '<|turn>' + role + '\n' }}
+ {%- endif -%}
+
+ {#- Render reasoning/reasoning_content as thinking channel -#}
+ {%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
+ {%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
+ {{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
+ {%- endif -%}
+
+ {%- if message['tool_calls'] -%}
+ {%- for tool_call in message['tool_calls'] -%}
+ {%- set function = tool_call['function'] -%}
+ {{- '<|tool_call>call:' + function['name'] + '{' -}}
+ {%- if function['arguments'] is mapping -%}
+ {%- set ns_args = namespace(found_first=false) -%}
+ {%- for key, value in function['arguments'] | dictsort -%}
+ {%- if ns_args.found_first %},{% endif -%}
+ {%- set ns_args.found_first = true -%}
+ {{- key -}}:{{- format_argument(value, escape_keys=False) -}}
+ {%- endfor -%}
+ {%- elif function['arguments'] is string -%}
+ {{- function['arguments'] -}}
+ {%- endif -%}
+ {{- '}<tool_call|>' -}}
+ {%- endfor -%}
+ {%- set ns.prev_message_type = 'tool_call' -%}
+ {%- endif -%}
+
+ {%- set ns_tr_out = namespace(flag=false) -%}
+ {%- if message.get('tool_responses') -%}
+ {#- Legacy: tool_responses embedded on the assistant message (Google/Gemma native) -#}
+ {%- for tool_response in message['tool_responses'] -%}
+ {{- format_tool_response_block(tool_response['name'] | default('unknown'), tool_response['response']) -}}
+ {%- set ns_tr_out.flag = true -%}
+ {%- set ns.prev_message_type = 'tool_response' -%}
+ {%- endfor -%}
+ {%- elif message.get('tool_calls') -%}
+ {#- OpenAI Chat Completions: forward-scan consecutive role:tool messages -#}
+ {%- set ns_tool_scan = namespace(stopped=false) -%}
+ {%- for k in range(loop.index0 + 1, loop_messages | length) -%}
+ {%- if ns_tool_scan.stopped -%}
+ {%- elif loop_messages[k]['role'] != 'tool' -%}
+ {%- set ns_tool_scan.stopped = true -%}
+ {%- else -%}
+ {%- set follow = loop_messages[k] -%}
+ {#- Resolve tool_call_id to function name -#}
+ {%- set ns_tname = namespace(name=follow.get('name') | default('unknown')) -%}
+ {%- for tc in message['tool_calls'] -%}
+ {%- if tc.get('id') == follow.get('tool_call_id') -%}
+ {%- set ns_tname.name = tc['function']['name'] -%}
+ {%- endif -%}
+ {%- endfor -%}
+ {#- Handle content as string or content-parts array -#}
+ {%- set tool_body = follow.get('content') -%}
+ {%- if tool_body is string -%}
+ {{- format_tool_response_block(ns_tname.name, tool_body) -}}
+ {%- elif tool_body is sequence and tool_body is not string -%}
+ {%- set ns_txt = namespace(s='') -%}
+ {%- for part in tool_body -%}
+ {%- if part.get('type') == 'text' -%}
+ {%- set ns_txt.s = ns_txt.s + (part.get('text') | default('')) -%}
+ {%- endif -%}
+ {%- endfor -%}
+ {{- format_tool_response_block(ns_tname.name, ns_txt.s) -}}
+ {%- else -%}
+ {{- format_tool_response_block(ns_tname.name, tool_body) -}}
+ {%- endif -%}
+ {%- set ns_tr_out.flag = true -%}
+ {%- set ns.prev_message_type = 'tool_response' -%}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- endif -%}
+
+ {%- set captured_content -%}
+ {%- if message['content'] is string -%}
+ {%- if role == 'model' -%}
+ {{- strip_thinking(message['content']) -}}
+ {%- else -%}
+ {{- message['content'] | trim -}}
+ {%- endif -%}
+ {%- elif message['content'] is sequence -%}
+ {%- for item in message['content'] -%}
+ {%- if item['type'] == 'text' -%}
+ {%- if role == 'model' -%}
+ {{- strip_thinking(item['text']) -}}
+ {%- else -%}
+ {{- item['text'] | trim -}}
+ {%- endif -%}
+ {%- elif item['type'] == 'image' -%}
+ {{- '<|image|>' -}}
+ {%- set ns.prev_message_type = 'image' -%}
+ {%- elif item['type'] == 'audio' -%}
+ {{- '<|audio|>' -}}
+ {%- set ns.prev_message_type = 'audio' -%}
+ {%- elif item['type'] == 'video' -%}
+ {{- '<|video|>' -}}
+ {%- set ns.prev_message_type = 'video' -%}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- endif -%}
+ {%- endset -%}
+
+ {{- captured_content -}}
+ {%- set has_content = captured_content | trim | length > 0 -%}
+
+ {%- if ns.prev_message_type == 'tool_call' and not ns_tr_out.flag -%}
+ {{- '<|tool_response>' -}}
+ {%- elif not (ns_tr_out.flag and not has_content) -%}
+ {{- '<turn|>\n' -}}
+ {%- endif -%}
+ {%- endif -%}
+ {%- endfor -%}
+
+ {%- if add_generation_prompt -%}
+ {%- if ns.prev_message_type != 'tool_response' and ns.prev_message_type != 'tool_call' -%}
+ {{- '<|turn>model\n' -}}
+ {%- endif -%}
+ {%- endif -%}
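Review note on the template: most of its 351 lines handle tools, thinking, and multimodal content, none of which the translation prompt uses. The sketch below mirrors, in plain Python, what I read the template as emitting for a single text-only user message with `add_generation_prompt=True` and no system prompt, tools, or thinking. `render_single_turn` is an illustrative helper rather than part of the repository, and the default `<bos>` comes from `tokenizer_config.json` in this commit.

```python
def render_single_turn(user_text: str, bos_token: str = "<bos>") -> str:
    """Mirror the template's output for one text-only user message.

    Assumes no system/developer message, no tools, and thinking disabled,
    so the template skips the system block entirely.
    """
    return (
        f"{bos_token}"
        f"<|turn>user\n{user_text.strip()}<turn|>\n"  # user turn, content trimmed
        "<|turn>model\n"                              # generation prompt
    )


prompt = "Translate this Turkish sentence into Laz (Lazuri):\n\nSu içmek istiyorum."
print(render_single_turn(prompt))
```

Comparing this string with the output of `tok.apply_chat_template(..., tokenize=False)` would be a quick way to confirm the reading above before relying on it.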
example.py ADDED
@@ -0,0 +1,44 @@
+ """Minimal example: load LazuriMT and translate Turkish → Laz.
+
+ pip install transformers peft bitsandbytes accelerate
+ python example.py
+ """
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ BASE = "unsloth/gemma-4-e4b-it-unsloth-bnb-4bit"
+ ADAPTER = "CidQuLimited/LazuriMT"
+
+ print(f"Loading base model: {BASE}")
+ # The base checkpoint is already quantized to 4-bit, so its quantization
+ # config is read from the repo and no load_in_4bit flag is passed here.
+ model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
+ print(f"Loading adapter: {ADAPTER}")
+ model = PeftModel.from_pretrained(model, ADAPTER)
+ tok = AutoTokenizer.from_pretrained(ADAPTER)
+ model.eval()
+
+
+ def translate(text: str, to: str = "lzz") -> str:
+     """Translate text. `to='lzz'` (Turkish → Laz) or `to='tr'` (Laz → Turkish)."""
+     if to == "lzz":
+         prompt = f"Translate this Turkish sentence into Laz (Lazuri):\n\n{text}"
+     else:
+         prompt = f"Translate this Laz (Lazuri) sentence into Turkish:\n\n{text}"
+     # return_dict=True returns input_ids and attention_mask together.
+     inputs = tok.apply_chat_template(
+         [{"role": "user", "content": prompt}],
+         tokenize=True, add_generation_prompt=True,
+         return_tensors="pt", return_dict=True,
+     ).to(model.device)
+     out = model.generate(
+         **inputs, max_new_tokens=128, do_sample=False,
+         no_repeat_ngram_size=3, repetition_penalty=1.15, num_beams=4,
+     )
+     # Decode only the newly generated tokens, not the prompt.
+     prompt_len = inputs["input_ids"].shape[1]
+     return tok.decode(out[0][prompt_len:], skip_special_tokens=True).strip()
+
+
+ if __name__ == "__main__":
+     for source in [
+         "Merhaba, nasılsın?",
+         "Bugün hava çok güzel.",
+         "Su içmek istiyorum.",
+     ]:
+         print(f"\n  TR: {source}")
+         print(f"  LZ: {translate(source)}")
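Review note on `translate`: any `to` value other than `"lzz"` silently falls through to the Laz → Turkish prompt, so a typo such as `to="lz"` would translate in the wrong direction without warning. A stricter prompt builder could look like the sketch below. `PROMPTS` and `build_prompt` are hypothetical names introduced here; the prompt strings are copied from the example, since the model was presumably trained on that exact wording.

```python
# Prompt templates copied verbatim from example.py, keyed by target language.
PROMPTS = {
    "lzz": "Translate this Turkish sentence into Laz (Lazuri):\n\n{text}",
    "tr": "Translate this Laz (Lazuri) sentence into Turkish:\n\n{text}",
}


def build_prompt(text: str, to: str = "lzz") -> str:
    """Return the instruction prompt for the requested target language.

    Raises ValueError for unknown targets instead of silently defaulting.
    """
    if to not in PROMPTS:
        raise ValueError(
            f"unsupported target {to!r}; expected one of {sorted(PROMPTS)}"
        )
    return PROMPTS[to].format(text=text)


print(build_prompt("Su içmek istiyorum."))
```

Keeping the templates in one dictionary also makes it easier to add the dialect-conditioned prompts that the README's v0.2 roadmap mentions, once their wording is settled.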
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f
+ size 32169626
tokenizer_config.json ADDED
@@ -0,0 +1,289 @@
+ {
+   "audio_token": "<|audio|>",
+   "backend": "tokenizers",
+   "boa_token": "<|audio>",
+   "boi_token": "<|image>",
+   "bos_token": "<bos>",
+   "eoa_token": "<audio|>",
+   "eoc_token": "<channel|>",
+   "eoi_token": "<image|>",
+   "eos_token": "<eos>",
+   "eot_token": "<turn|>",
+   "escape_token": "<|\"|>",
+   "etc_token": "<tool_call|>",
+   "etd_token": "<tool|>",
+   "etr_token": "<tool_response|>",
+   "extra_special_tokens": [
+     "<|video|>"
+   ],
+   "image_token": "<|image|>",
+   "is_local": false,
+   "mask_token": "<mask>",
+   "model_max_length": 131072,
+   "model_specific_special_tokens": {
+     "audio_token": "<|audio|>",
+     "boa_token": "<|audio>",
+     "boi_token": "<|image>",
+     "eoa_token": "<audio|>",
+     "eoc_token": "<channel|>",
+     "eoi_token": "<image|>",
+     "eot_token": "<turn|>",
+     "escape_token": "<|\"|>",
+     "etc_token": "<tool_call|>",
+     "etd_token": "<tool|>",
+     "etr_token": "<tool_response|>",
+     "image_token": "<|image|>",
+     "soc_token": "<|channel>",
+     "sot_token": "<|turn>",
+     "stc_token": "<|tool_call>",
+     "std_token": "<|tool>",
+     "str_token": "<|tool_response>",
+     "think_token": "<|think|>"
+   },
+   "pad_token": "<pad>",
+   "padding_side": "left",
+   "processor_class": "Gemma4Processor",
+   "response_schema": {
+     "properties": {
+       "content": {
+         "type": "string"
+       },
+       "role": {
+         "const": "assistant"
+       },
+       "thinking": {
+         "type": "string"
+       },
+       "tool_calls": {
+         "items": {
+           "properties": {
+             "function": {
+               "properties": {
+                 "arguments": {
+                   "additionalProperties": {},
+                   "type": "object",
+                   "x-parser": "gemma4-tool-call"
+                 },
+                 "name": {
+                   "type": "string"
+                 }
+               },
+               "type": "object",
+               "x-regex": "call\\:(?P<name>\\w+)(?P<arguments>\\{.*\\})"
+             },
+             "type": {
+               "const": "function"
+             }
+           },
+           "type": "object"
+         },
+         "type": "array",
+         "x-regex-iterator": "<\\|tool_call>(.*?)<tool_call\\|>"
+       }
+     },
+     "type": "object",
+     "x-regex": "(\\<\\|channel\\>thought\\n(?P<thinking>.*?)\\<channel\\|\\>)?(?P<tool_calls>\\<\\|tool_call\\>.*\\<tool_call\\|\\>)?(?P<content>(?:(?!\\<turn\\|\\>)(?!\\<\\|tool_response\\>).)+)?(?:\\<turn\\|\\>|\\<\\|tool_response\\>)?"
+   },
+   "soc_token": "<|channel>",
+   "sot_token": "<|turn>",
+   "stc_token": "<|tool_call>",
+   "std_token": "<|tool>",
+   "str_token": "<|tool_response>",
+   "think_token": "<|think|>",
+   "tokenizer_class": "GemmaTokenizer",
+   "unk_token": "<unk>",
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<pad>",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "1": {
+       "content": "<eos>",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "2": {
+       "content": "<bos>",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "4": {
+       "content": "<mask>",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "46": {
+       "content": "<|tool>",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "47": {
+       "content": "<tool|>",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "48": {
+       "content": "<|tool_call>",
+       "single_word": false,
+       "lstrip": false,
156
+ "rstrip": false,
157
+ "normalized": false,
158
+ "special": true
159
+ },
160
+ "49": {
161
+ "content": "<tool_call|>",
162
+ "single_word": false,
163
+ "lstrip": false,
164
+ "rstrip": false,
165
+ "normalized": false,
166
+ "special": true
167
+ },
168
+ "50": {
169
+ "content": "<|tool_response>",
170
+ "single_word": false,
171
+ "lstrip": false,
172
+ "rstrip": false,
173
+ "normalized": false,
174
+ "special": true
175
+ },
176
+ "51": {
177
+ "content": "<tool_response|>",
178
+ "single_word": false,
179
+ "lstrip": false,
180
+ "rstrip": false,
181
+ "normalized": false,
182
+ "special": true
183
+ },
184
+ "52": {
185
+ "content": "<|\"|>",
186
+ "single_word": false,
187
+ "lstrip": false,
188
+ "rstrip": false,
189
+ "normalized": false,
190
+ "special": true
191
+ },
192
+ "98": {
193
+ "content": "<|think|>",
194
+ "single_word": false,
195
+ "lstrip": false,
196
+ "rstrip": false,
197
+ "normalized": false,
198
+ "special": true
199
+ },
200
+ "100": {
201
+ "content": "<|channel>",
202
+ "single_word": false,
203
+ "lstrip": false,
204
+ "rstrip": false,
205
+ "normalized": false,
206
+ "special": true
207
+ },
208
+ "101": {
209
+ "content": "<channel|>",
210
+ "single_word": false,
211
+ "lstrip": false,
212
+ "rstrip": false,
213
+ "normalized": false,
214
+ "special": true
215
+ },
216
+ "105": {
217
+ "content": "<|turn>",
218
+ "single_word": false,
219
+ "lstrip": false,
220
+ "rstrip": false,
221
+ "normalized": false,
222
+ "special": true
223
+ },
224
+ "106": {
225
+ "content": "<turn|>",
226
+ "single_word": false,
227
+ "lstrip": false,
228
+ "rstrip": false,
229
+ "normalized": false,
230
+ "special": true
231
+ },
232
+ "255999": {
233
+ "content": "<|image>",
234
+ "single_word": false,
235
+ "lstrip": false,
236
+ "rstrip": false,
237
+ "normalized": false,
238
+ "special": true
239
+ },
240
+ "256000": {
241
+ "content": "<|audio>",
242
+ "single_word": false,
243
+ "lstrip": false,
244
+ "rstrip": false,
245
+ "normalized": false,
246
+ "special": true
247
+ },
248
+ "258880": {
249
+ "content": "<|image|>",
250
+ "single_word": false,
251
+ "lstrip": false,
252
+ "rstrip": false,
253
+ "normalized": false,
254
+ "special": true
255
+ },
256
+ "258881": {
257
+ "content": "<|audio|>",
258
+ "single_word": false,
259
+ "lstrip": false,
260
+ "rstrip": false,
261
+ "normalized": false,
262
+ "special": true
263
+ },
264
+ "258882": {
265
+ "content": "<image|>",
266
+ "single_word": false,
267
+ "lstrip": false,
268
+ "rstrip": false,
269
+ "normalized": false,
270
+ "special": true
271
+ },
272
+ "258883": {
273
+ "content": "<audio|>",
274
+ "single_word": false,
275
+ "lstrip": false,
276
+ "rstrip": false,
277
+ "normalized": false,
278
+ "special": true
279
+ },
280
+ "258884": {
281
+ "content": "<|video|>",
282
+ "single_word": false,
283
+ "lstrip": false,
284
+ "rstrip": false,
285
+ "normalized": false,
286
+ "special": true
287
+ }
288
+ }
289
+ }
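
The top-level `x-regex` in `response_schema` describes how a raw generation splits into `thinking`, `tool_calls`, and `content`. The following is a minimal sketch of applying that pattern by hand with Python's `re` module. It is not the library's own parser: the helper name `parse_response` and the sample string are hypothetical, and the `re.DOTALL` flag is an assumption made so that multi-line thinking text can match.

```python
import re

# The top-level "x-regex" from response_schema, with JSON escaping removed.
RESPONSE_RE = re.compile(
    r"(\<\|channel\>thought\n(?P<thinking>.*?)\<channel\|\>)?"
    r"(?P<tool_calls>\<\|tool_call\>.*\<tool_call\|\>)?"
    r"(?P<content>(?:(?!\<turn\|\>)(?!\<\|tool_response\>).)+)?"
    r"(?:\<turn\|\>|\<\|tool_response\>)?",
    re.DOTALL,  # assumption: "." should also match newlines
)


def parse_response(text: str) -> dict:
    """Hypothetical helper: split raw model text into the schema's fields."""
    # Every part of the pattern is optional, so match() always succeeds;
    # fields that are absent from the text come back as None.
    return RESPONSE_RE.match(text).groupdict()


# Made-up sample text for illustration, not real model output.
sample = "<|channel>thought\nShort greeting.<channel|>Merhaba<turn|>"
print(parse_response(sample))
```

Named groups keep the result aligned with the schema's property names, so the returned dictionary can be compared directly against the `properties` block above.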