Hishambarakat committed on
Commit 061c5f0 · verified · 1 Parent(s): be8e09e

Upload README.md with huggingface_hub

Files changed (1): README.md (+52 −42)
README.md CHANGED
# Bahraini_Dialect_LLM (Research Fine-Tune on ALLaM-7B Instruct)

## Research Summary

**Bahraini_Dialect_LLM** is a research-oriented fine-tune of **humain-ai/ALLaM-7B-Instruct-preview** aimed at studying **Bahraini Arabic dialect controllability** and **low-resource dialect modeling**.

The core goal is not to present a “new model built from scratch,” but to explore how far we can push a strong Arabic instruction model toward **more natural Bahraini conversational behavior** using:

- limited dialect-specific data,
- structured data cleaning,
- and controlled synthetic augmentation (rule-guided generation) that stays close to real conversational patterns.
This repo contains **merged** weights (base + LoRA adapter merged into a standalone model) so it can be loaded like a standard `transformers` model.

## Motivation (Low-Resource Dialect Setting)

Bahraini dialect is a **low-resource** variety compared to MSA and many high-resource English tasks. This project is a practical experiment in:

- capturing dialectal phrasing and pragmatics (tone, brevity, everyday wording),
- reducing drift into Modern Standard Arabic,
- and testing whether **rule-based style constraints + LLM-based paraphrasing** can produce training data that improves dialect fidelity without requiring large-scale native corpora.
This work is intended as a **research prototype** to understand the training dynamics, limitations, and trade-offs of dialect steering.

## Model Details

- **Fine-tuned by:** Hisham Barakat (research fine-tune; base model ownership remains with the original authors)
- **Base model:** `humain-ai/ALLaM-7B-Instruct-preview`
- **Model type:** Causal LM (LLaMA-family architecture via ALLaM)
- **Intended pipeline:** `text-generation`

## Intended Behavior (Research Target)

The target behavior for evaluation is:

- Bahraini dialect phrasing (minimize MSA)
- concise, practical assistant-like answers
- natural everyday tone (avoid overly formal scaffolding unless requested)
## Use & Scope

### Direct Use (Recommended)

- Research and experimentation on:
  - dialect controllability
  - low-resource data bootstrapping
  - evaluating drift, register, and consistency

### Commercial Use

This repository is shared primarily for **research and reproducibility**. If you intend commercial use, review the **base model license** and verify compatibility with your intended deployment.

### Out-of-Scope Use

- Medical/legal/financial advice beyond general informational guidance
- High-stakes decision-making without expert oversight
- Requests for sensitive personal data, illegal instructions, or harmful content

## Bias, Risks, and Limitations

- Dialect coverage is strongest for a **Bahraini conversational assistant** style; it may still drift into Gulf-general or more formal Arabic in edge cases.
- Rule-guided synthetic data can imprint patterns (e.g., structure repetition, over-regular phrasing).
- The model may inherit biases from the base model and any source material used to build/augment the dataset.

## How to Get Started

### Load (merged model)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# … (model/tokenizer loading and prompt encoding elided)

enc = {k: v.to(model.device) for k, v in enc.items()}

out = model.generate(**enc, max_new_tokens=80, do_sample=True, temperature=0.7, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())
```
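For a self-contained version of the loading flow, here is a minimal sketch; the repo id, system prompt, and user prompt are illustrative assumptions, not values confirmed by this card.

```python
def build_messages(system_prompt: str, user_prompt: str) -> list:
    # Chat-format messages in the shape tokenizer.apply_chat_template expects.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

if __name__ == "__main__":
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    repo_id = "Hishambarakat/Bahraini_Dialect_LLM"  # assumed repo id
    tok = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Apply the ALLaM chat template, then move tensors to the model device.
    messages = build_messages(
        "جاوب باللهجة البحرينية وباختصار.",  # system: dialect/style constraint
        "شلون أرسل اعتذار قصير على الواتساب؟",  # user prompt (illustrative)
    )
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    enc = tok(prompt, return_tensors="pt")
    enc = {k: v.to(model.device) for k, v in enc.items()}

    out = model.generate(
        **enc, max_new_tokens=80, do_sample=True,
        temperature=0.7, pad_token_id=tok.eos_token_id,
    )
    print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())
```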

## Training Details

### Base Model

- `humain-ai/ALLaM-7B-Instruct-preview`

### Training Data (high-level)

Training was done on a curated Bahraini SFT-style corpus built from:

- **Single-speaker Bahraini transcript corpus** (cleaned and normalized)
- **Synthetic-but-close-to-real conversational expansions**, generated from the base style/voice and guided by strict rules to stay Bahraini
- **Domain-targeted assistant Q&A** (customer support, troubleshooting, daily admin writing) produced with controlled generation constraints

### Data Construction Approach

The dataset was produced through a structured pipeline:

- Cleaning + normalization of real transcript text (removing noise, artifacts, inconsistent punctuation)
- Prompt/response structuring into instruction-style pairs
- Controlled synthetic generation to expand coverage while keeping the same voice
- A dialect rule-set (positive/negative constraints) to:
  - encourage Bahraini lexical markers (e.g., وايد، جذي، هني، شلون، عقبها/بعدها)
  - discourage MSA scaffolding and overly formal connectors
  - keep responses short and practical
- Template correctness via the ALLaM chat template, with EOS enforcement

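As an illustration of how such positive/negative constraints can be applied mechanically, here is a minimal rule-check sketch; the marker lists and the length threshold are illustrative assumptions, not the project's actual rule-set.

```python
# Illustrative marker lists; the project's actual rule-set is not published here.
BAHRAINI_MARKERS = ["وايد", "جذي", "هني", "شلون", "عقبها", "بعدها"]
MSA_CONNECTORS = ["علاوة على ذلك", "وبالتالي", "إضافة إلى ذلك", "لذلك"]

def passes_dialect_rules(text: str, max_words: int = 60) -> bool:
    """Accept a candidate response only if it carries at least one Bahraini
    lexical marker, avoids formal MSA connectors, and stays short."""
    has_marker = any(m in text for m in BAHRAINI_MARKERS)
    has_msa = any(c in text for c in MSA_CONNECTORS)
    short_enough = len(text.split()) <= max_words
    return has_marker and not has_msa and short_enough
```

Candidates failing such a check can be regenerated or paraphrased before entering the SFT corpus.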
### Prompt Format

Data was formatted using ALLaM’s chat template:

- system: dialect/style constraints
- user: prompt
- assistant: target response

EOS was enforced at the end of each sample.

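To make the formatting concrete, here is a sketch of turning one pair into a template-ready sample. The actual token layout comes from the ALLaM tokenizer's chat template; the EOS string below is an assumption (in practice it is taken from `tokenizer.eos_token`).

```python
EOS_TOKEN = "</s>"  # assumed EOS string; in practice use tokenizer.eos_token

def enforce_eos(text: str, eos: str = EOS_TOKEN) -> str:
    # EOS enforcement: every target response ends with the EOS string exactly once.
    return text if text.endswith(eos) else text + eos

def to_chat_sample(system: str, user: str, assistant: str) -> list:
    # Role-tagged messages; tokenizer.apply_chat_template renders these
    # into ALLaM's actual prompt format during training.
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": enforce_eos(assistant)},
    ]
```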
### Training Procedure

- **Method:** SFT with TRL `SFTTrainer`
- **Parameter-efficient fine-tuning:** LoRA via PEFT
- **Final artifact:** the LoRA adapter was merged into the base model (`merge_and_unload`) and saved as a standalone model for standard loading.

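The merge step described above can be sketched as follows; the adapter path and the `merged_dir` helper are hypothetical names for illustration.

```python
def merged_dir(adapter_dir: str, suffix: str = "merged") -> str:
    # Hypothetical helper: directory name for the standalone merged checkpoint.
    return f"{adapter_dir.rstrip('/')}-{suffix}"

if __name__ == "__main__":
    from peft import AutoPeftModelForCausalLM

    adapter_dir = "outputs/bahraini-lora"  # hypothetical adapter path
    model = AutoPeftModelForCausalLM.from_pretrained(adapter_dir)

    # Fold the LoRA deltas into the base weights so the result loads as a
    # plain transformers model, with no PEFT dependency at inference time.
    merged = model.merge_and_unload()
    merged.save_pretrained(merged_dir(adapter_dir))
```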
### Training Hyperparameters (exact run)

The run aligned model config with tokenizer special tokens when needed (pad/bos/…).

Evaluation was primarily qualitative via prompt suites comparing:

- base model outputs vs. fine-tuned outputs
- dialect strength, conciseness, task completion, and reduction of MSA drift

The example prompt suite included:

- smalltalk
- sleep routine advice (short)
- WhatsApp apology message
- semi-formal request to a university
- home internet troubleshooting
- APN setup guidance
- online card rejection reasons
- electricity bill troubleshooting
- late-order customer-service ticket phrasing
- clarification-question behavior
- dialect rewriting (“ما أقدر الحين بس برجع لك بعدين”)
- mixed Arabic/English phrasing (refund/invoice)

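A side-by-side comparison like the one described can be scripted; `base_fn` and `tuned_fn` are hypothetical callables wrapping base and fine-tuned generation.

```python
def compare_outputs(prompts, base_fn, tuned_fn):
    # Pair base vs. fine-tuned outputs per prompt for qualitative review.
    return [
        {"prompt": p, "base": base_fn(p), "tuned": tuned_fn(p)}
        for p in prompts
    ]
```

Each row can then be rated by hand for dialect strength, conciseness, and MSA drift.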
## Compute / Infrastructure

- **Training stack:** `transformers`, `trl`, `peft`
- **Hardware:** single RTX 4090 GPU
- **Framework versions:** PEFT 0.18.1 (per metadata)
 
213
## Citation

If you cite this model or derivative work, cite the dataset and include the base model.

```bibtex
@dataset{barakat_bahraini_speech_2026,
  author    = {Hisham Barakat},
  title     = {Hishambarakat/Bahraini_Dialect_LLM},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Hishambarakat/Bahraini_Dialect_LLM},
  note      = {LinkedIn: https://www.linkedin.com/in/hishambarakat/}
}
```
231
 
232
  ## Contact
233
 
234
- * **Author:** Hisham Barakat
235
- * **LinkedIn:** [https://www.linkedin.com/in/hishambarakat/](https://www.linkedin.com/in/hishambarakat/)
 