---
base_model: humain-ai/ALLaM-7B-Instruct-preview
library_name: peft
pipeline_tag: text-generation
tags:
  - base_model:adapter:humain-ai/ALLaM-7B-Instruct-preview
  - lora
  - sft
  - transformers
  - trl
language:
  - ar
license: other
---

# Bahraini_Dialect_LLM (Research Fine-Tune on ALLaM-7B Instruct)

## Research Summary

**Bahraini_Dialect_LLM** is a research-oriented fine-tune of **humain-ai/ALLaM-7B-Instruct-preview** aimed at studying **Bahraini Arabic dialect controllability** and **low-resource dialect modeling**.

The core goal is not to present a “new model built from scratch,” but to explore how far we can push a strong Arabic instruction model toward **more natural Bahraini conversational behavior** using:

- limited dialect-specific data,
- structured data cleaning,
- and controlled synthetic augmentation (rule-guided generation) that stays close to real conversational patterns.

This repo contains **merged** weights (base + LoRA adapter merged into a standalone model) so it can be loaded like a standard `transformers` model.

## Motivation (Low-Resource Dialect Setting)

Bahraini dialect is a **low-resource** variety compared to MSA and to high-resource languages such as English. This project is a practical experiment in:

- capturing dialectal phrasing and pragmatics (tone, brevity, everyday wording),
- reducing drift into Modern Standard Arabic,
- and testing whether **rule-based style constraints + LLM-based paraphrasing** can produce training data that improves dialect fidelity without requiring large-scale native corpora.

This work is intended as a **research prototype** to understand the training dynamics, limitations, and trade-offs of dialect steering.

## Model Details

- **Fine-tuned by:** Hisham Barakat (research fine-tune; base model ownership remains with original authors)
- **Base model:** `humain-ai/ALLaM-7B-Instruct-preview`
- **Model type:** Causal LM (LLaMA-family architecture via ALLaM)
- **Language:** Arabic (Bahraini dialect focus)
- **Training method:** SFT with LoRA (PEFT), then merged
- **Intended pipeline:** `text-generation`

## Intended Behavior (Research Target)

The target behavior for evaluation is:

- Bahraini dialect phrasing (minimize MSA)
- concise, practical assistant-like answers
- natural everyday tone (avoid overly formal scaffolding unless requested)
- broad everyday domains (customer-service style replies, basic troubleshooting, admin writing when asked)

## Use & Scope

### Direct Use (Recommended)

- Research and experimentation on:
  - dialect controllability
  - low-resource data bootstrapping
  - prompt/style constraints for dialect steering
  - evaluating drift, register, and consistency

### Commercial Use

This repository is shared primarily for **research and reproducibility**. If you intend commercial use, review the **base model license** and verify compatibility with your intended deployment.

### Out-of-Scope Use

- Medical/legal/financial advice beyond general informational guidance
- High-stakes decision-making without expert oversight
- Requests for sensitive personal data, illegal instructions, or harmful content

## Bias, Risks, and Limitations

- Dialect coverage is strongest for a **Bahraini conversational assistant** style; it may still drift into Gulf-general or more formal Arabic in edge cases.
- Rule-guided synthetic data can imprint patterns (e.g., structure repetition, over-regular phrasing).
- The model may inherit biases from the base model and any source material used to build/augment the dataset.

## How to Get Started

### Load (merged model)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

REPO_ID = "Hishambarakat/Bahraini_Dialect_LLM"
# bf16 on Ampere+ GPUs; fall back to fp16 elsewhere (avoids crashing on CPU-only machines)
DTYPE = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
    else torch.float16
)

tok = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, trust_remote_code=True, torch_dtype=DTYPE, device_map="auto")
model.eval()

# English gloss of the system prompt: "You are an assistant who speaks Bahraini
# dialect naturally. Be brief and practical, and avoid MSA and formal language
# unless the user asks for it. If the question needs clarification to answer
# correctly, ask at most one or two questions. Keep the style Bahraini, not
# generic Gulf. Assume the addressee is male unless the user's wording clearly
# indicates otherwise."
SYSTEM = "أنت مساعد يتكلم باللهجة البحرينية بشكل طبيعي. خلك مختصر وعملي، وتجنب الفصحى واللغة الرسمية إلا إذا المستخدم طلب. إذا السؤال يحتاج توضيح عشان تجاوب صح، اسأل سؤال واحد أو اثنين بالكثير. حاول تخلي الأسلوب بحريني مو خليجي عام. افترض المخاطب ذكر إلا إذا واضح من كلام المستخدم غير جذي."

messages = [
  {"role":"system","content":SYSTEM},
  {"role":"user","content":"إذا نومي خربان شسوي؟"}
]
enc = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
enc = {k:v.to(model.device) for k,v in enc.items()}

out = model.generate(**enc, max_new_tokens=80, do_sample=True, temperature=0.7, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())
```

## Training Details

### Base Model

- `humain-ai/ALLaM-7B-Instruct-preview`

### Training Data (high-level)

Training was done on a curated Bahraini SFT-style corpus built from:

- **Single-speaker Bahraini transcript corpus** (cleaned and normalized)
- **Synthetic-but-close-to-real conversational expansions**, generated from the base style/voice and guided by strict rules to stay Bahraini
- **Domain-targeted assistant Q&A** (customer support, troubleshooting, daily admin writing) produced with controlled generation constraints

### Data Construction Approach

The dataset was produced through a structured pipeline:

- Cleaning + normalization on real transcript text (removing noise, artifacts, inconsistent punctuation)
- Prompt/response structuring into instruction-style pairs
- Controlled synthetic generation to expand coverage while keeping the same voice
- A dialect rule-set (positive/negative constraints) to:
  - encourage Bahraini lexical markers (e.g., وايد، جذي، هني، شلون، عقبها/بعدها)
  - discourage MSA scaffolding and overly formal connectors
  - keep responses short and practical

- Template correctness via the ALLaM chat template, with EOS enforcement
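A minimal sketch of what such a rule-set check might look like. The marker lists, scoring, and length threshold below are illustrative stand-ins, not the actual constraints used in training:

```python
# Illustrative dialect rule-set check. The marker lists and scoring are
# hypothetical examples, not the project's actual rules.

# Positive markers: Bahraini colloquial words (from the list above).
BAHRAINI_MARKERS = {"وايد", "جذي", "هني", "شلون", "عقبها", "بعدها"}
# Negative markers: formal MSA connectors to discourage.
MSA_CONNECTORS = {"علاوة على ذلك", "بالإضافة إلى", "وبالتالي", "حيث أن"}

def dialect_score(text: str) -> int:
    """Count Bahraini markers minus MSA connectors in a candidate response."""
    tokens = text.split()
    pos = sum(1 for t in tokens if t in BAHRAINI_MARKERS)
    neg = sum(1 for phrase in MSA_CONNECTORS if phrase in text)
    return pos - neg

def passes_rules(text: str, max_words: int = 60) -> bool:
    """Keep a synthetic sample only if it leans dialectal and stays short."""
    return dialect_score(text) > 0 and len(text.split()) <= max_words

print(passes_rules("شلون حالك؟ وايد زين"))  # True: two positive markers, short
```

A filter like this can gate synthetic generations before they enter the training set, rejecting outputs that drift formal or run long.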

### Prompt Format

Data was formatted using ALLaM’s chat template:

- system: dialect/style constraints
- user: prompt
- assistant: target response

An EOS token was enforced at the end of each sample.
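
The per-sample structure can be sketched as follows. The EOS string `"</s>"` here is an illustrative placeholder; in the actual pipeline it comes from the ALLaM tokenizer (`tokenizer.eos_token`), and `apply_chat_template` renders the messages into the template:

```python
# Sketch of how a training sample was structured. The EOS string ("</s>")
# is a placeholder; the real pipeline uses tokenizer.eos_token.
def build_sample(system: str, user: str, assistant: str, eos: str = "</s>") -> dict:
    """Return an instruction-style sample with EOS enforced on the target."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            # EOS appended so the model learns to stop after each reply.
            {"role": "assistant", "content": assistant + eos},
        ]
    }

sample = build_sample("dialect/style constraints", "prompt", "target response")
print(sample["messages"][-1]["content"])  # target response</s>
```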

### Training Procedure

- **Method:** SFT with TRL `SFTTrainer`
- **Parameter-efficient fine-tuning:** LoRA via PEFT
- **Final artifact:** LoRA adapter was merged into the base model (`merge_and_unload`) and saved as a standalone model for standard loading.
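
The merge step can be sketched roughly as below. Paths and repo IDs are placeholders, and the imports are deferred inside the function so the sketch can be defined without `transformers`/`peft` installed or GPU access:

```python
def merge_lora_and_save(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Load the base model, apply the LoRA adapter, merge, and save.

    Sketch only: base_id, adapter_dir, and out_dir are placeholders.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()  # fold LoRA deltas into the base weights
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)

# Example invocation (not executed here; adapter path is hypothetical):
# merge_lora_and_save("humain-ai/ALLaM-7B-Instruct-preview", "outputs/lora", "outputs/merged")
```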

### Training Hyperparameters (exact run)

Base configuration used during the run:

```yaml
max_seq_length: 2048
optimizer: adamw_torch
learning_rate: 2e-5
lr_scheduler: cosine
warmup_ratio: 0.1
weight_decay: 0.01
max_grad_norm: 1.0
per_device_train_batch_size: 4
gradient_accumulation_steps: 16
num_train_epochs: 4
packing: false
seed: 42
precision: bf16
attention_implementation: eager
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
lora:
  r: 16
  alpha: 32
  dropout: 0.05
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
```

### Notes on Tokenizer / Special Tokens

The run aligned the model config with the tokenizer's special tokens (pad/bos/eos) when needed. Generation commonly uses `pad_token_id = eos_token_id` with explicit attention masks during inference, which avoids warnings and instability when the pad and EOS tokens coincide.
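
The alignment step can be sketched like this, demonstrated with stand-in objects rather than a real tokenizer (a real Hugging Face tokenizer and model config are used the same way):

```python
from types import SimpleNamespace

def align_pad_token(tokenizer, model_config) -> None:
    """If no pad token is set, reuse EOS and mirror the id into the config."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    model_config.pad_token_id = tokenizer.pad_token_id

# Demo with stand-in objects; "</s>" / id 2 are illustrative values.
tok = SimpleNamespace(pad_token=None, pad_token_id=None, eos_token="</s>", eos_token_id=2)
cfg = SimpleNamespace(pad_token_id=None)
align_pad_token(tok, cfg)
print(tok.pad_token, cfg.pad_token_id)  # </s> 2
```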

## Evaluation

Evaluation was primarily qualitative via prompt suites comparing:

- base model outputs vs fine-tuned outputs
- dialect strength, conciseness, task completion, and reduction of MSA drift

Example prompt suite included:

- smalltalk
- sleep routine advice (short)
- WhatsApp apology message
- semi-formal request to university
- home internet troubleshooting
- APN setup guidance
- online card rejection reasons
- electricity bill troubleshooting
- late order customer-service ticket phrasing
- clarification questions behavior
- dialect rewriting (“ما أقدر الحين بس برجع لك بعدين”)
- mixed Arabic/English phrasing (refund/invoice)
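
A minimal harness for this kind of side-by-side comparison might look like the following. The generator callables are stand-ins for wrappers around `model.generate` on the base and fine-tuned models:

```python
def compare_outputs(prompts, generate_base, generate_ft):
    """Run each prompt through both models and pair the outputs for review."""
    return [
        {"prompt": p, "base": generate_base(p), "finetuned": generate_ft(p)}
        for p in prompts
    ]

# Demo with stub generators standing in for the two models.
rows = compare_outputs(["شلون حالك؟"], lambda p: "base reply", lambda p: "ft reply")
print(rows[0]["finetuned"])  # ft reply
```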

## Compute / Infrastructure

- **Training stack:** `transformers`, `trl`, `peft`
- **Hardware:** Single GPU RTX 4090
- **Framework versions:** PEFT 0.18.1 (per metadata)

## Citation

### Model

If you cite this model or derivative work, cite the dataset and include the base model reference.

### Dataset (provided by author)

```bibtex
@dataset{barakat_bahraini_speech_2026,
  author       = {Hisham Barakat},
  title        = {Hishambarakat/Bahraini_Dialect_LLM},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/datasets/Hishambarakat/Bahraini_Dialect_LLM},
  note         = {LinkedIn: https://www.linkedin.com/in/hishambarakat/}
}
```

## Contact

- **Author:** Hisham Barakat
- **LinkedIn:** [https://www.linkedin.com/in/hishambarakat/](https://www.linkedin.com/in/hishambarakat/)