Hishambarakat committed on
Commit 061c5f0 · verified · 1 Parent(s): be8e09e

Upload README.md with huggingface_hub

Files changed (1): README.md (+52 −42)
README.md CHANGED
# Bahraini_Dialect_LLM (Research Fine-Tune on ALLaM-7B Instruct)

## Research Summary

**Bahraini_Dialect_LLM** is a research-oriented fine-tune of **humain-ai/ALLaM-7B-Instruct-preview** aimed at studying **Bahraini Arabic dialect controllability** and **low-resource dialect modeling**.

The core goal is not to present a “new model built from scratch,” but to explore how far we can push a strong Arabic instruction model toward **more natural Bahraini conversational behavior** using:

- limited dialect-specific data,
- structured data cleaning,
- and controlled synthetic augmentation (rule-guided generation) that stays close to real conversational patterns.
This repo contains **merged** weights (base + LoRA adapter merged into a standalone model) so it can be loaded like a standard `transformers` model.

## Motivation (Low-Resource Dialect Setting)

Bahraini dialect is a **low-resource** variety compared to MSA and many high-resource English tasks. This project is a practical experiment in:

- capturing dialectal phrasing and pragmatics (tone, brevity, everyday wording),
- reducing drift into Modern Standard Arabic,
- and testing whether **rule-based style constraints + LLM-based paraphrasing** can produce training data that improves dialect fidelity without requiring large-scale native corpora.
This work is intended as a **research prototype** to understand the training dynamics, limitations, and trade-offs of dialect steering.

## Model Details

- **Fine-tuned by:** Hisham Barakat (research fine-tune; base model ownership remains with the original authors)
- **Base model:** `humain-ai/ALLaM-7B-Instruct-preview`
- **Model type:** Causal LM (LLaMA-family architecture via ALLaM)
- **Intended pipeline:** `text-generation`

## Intended Behavior (Research Target)

The target behavior for evaluation is:

- Bahraini dialect phrasing (minimize MSA)
- concise, practical assistant-like answers
- natural everyday tone (avoid overly formal scaffolding unless requested)
## Use & Scope

### Direct Use (Recommended)

- Research and experimentation on:
  - dialect controllability
  - low-resource data bootstrapping
  - evaluating drift, register, and consistency

### Commercial Use

This repository is shared primarily for **research and reproducibility**. If you intend commercial use, review the **base model license** and verify compatibility with your intended deployment.

### Out-of-Scope Use

- Medical/legal/financial advice beyond general informational guidance
- High-stakes decision-making without expert oversight
- Requests for sensitive personal data, illegal instructions, or harmful content

## Bias, Risks, and Limitations

- Dialect coverage is strongest for a **Bahraini conversational assistant** style; it may still drift into Gulf-general or more formal Arabic in edge cases.
- Rule-guided synthetic data can imprint patterns (e.g., structure repetition, over-regular phrasing).
- The model may inherit biases from the base model and any source material used to build/augment the dataset.

## How to Get Started

### Load (merged model)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# … (model/tokenizer loading and prompt encoding elided)

enc = {k: v.to(model.device) for k, v in enc.items()}

out = model.generate(**enc, max_new_tokens=80, do_sample=True, temperature=0.7, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())
```
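For a self-contained version of the loading flow, here is a minimal sketch; the repo id, system prompt, and user prompt are illustrative assumptions, not values confirmed by this card.

```python
def build_messages(system_prompt: str, user_prompt: str) -> list:
    # Chat-format messages in the shape tokenizer.apply_chat_template expects.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

if __name__ == "__main__":
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    repo_id = "Hishambarakat/Bahraini_Dialect_LLM"  # assumed repo id
    tok = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Apply the ALLaM chat template, then move tensors to the model device.
    messages = build_messages(
        "جاوب باللهجة البحرينية وباختصار.",  # system: dialect/style constraint
        "شلون أرسل اعتذار قصير على الواتساب؟",  # user prompt (illustrative)
    )
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    enc = tok(prompt, return_tensors="pt")
    enc = {k: v.to(model.device) for k, v in enc.items()}

    out = model.generate(
        **enc, max_new_tokens=80, do_sample=True,
        temperature=0.7, pad_token_id=tok.eos_token_id,
    )
    print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())
```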

## Training Details

### Base Model

- `humain-ai/ALLaM-7B-Instruct-preview`

### Training Data (high-level)

Training was done on a curated Bahraini SFT-style corpus built from:

- **Single-speaker Bahraini transcript corpus** (cleaned and normalized)
- **Synthetic-but-close-to-real conversational expansions**, generated from the base style/voice and guided by strict rules to stay Bahraini
- **Domain-targeted assistant Q&A** (customer support, troubleshooting, daily admin writing) produced with controlled generation constraints

### Data Construction Approach

The dataset was produced through a structured pipeline:

- Cleaning + normalization of real transcript text (removing noise, artifacts, inconsistent punctuation)
- Prompt/response structuring into instruction-style pairs
- Controlled synthetic generation to expand coverage while keeping the same voice
- A dialect rule-set (positive/negative constraints) to:
  - encourage Bahraini lexical markers (e.g., وايد، جذي، هني، شلون، عقبها/بعدها)
  - discourage MSA scaffolding and overly formal connectors
  - keep responses short and practical
- Template correctness via the ALLaM chat template, with EOS enforcement

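As an illustration of how such positive/negative constraints can be applied mechanically, here is a minimal rule-check sketch; the marker lists and the length threshold are illustrative assumptions, not the project's actual rule-set.

```python
# Illustrative marker lists; the project's actual rule-set is not published here.
BAHRAINI_MARKERS = ["وايد", "جذي", "هني", "شلون", "عقبها", "بعدها"]
MSA_CONNECTORS = ["علاوة على ذلك", "وبالتالي", "إضافة إلى ذلك", "لذلك"]

def passes_dialect_rules(text: str, max_words: int = 60) -> bool:
    """Accept a candidate response only if it carries at least one Bahraini
    lexical marker, avoids formal MSA connectors, and stays short."""
    has_marker = any(m in text for m in BAHRAINI_MARKERS)
    has_msa = any(c in text for c in MSA_CONNECTORS)
    short_enough = len(text.split()) <= max_words
    return has_marker and not has_msa and short_enough
```

Candidates failing such a check can be regenerated or paraphrased before entering the SFT corpus.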
### Prompt Format

Data was formatted using ALLaM’s chat template:

- system: dialect/style constraints
- user: prompt
- assistant: target response

EOS was enforced at the end of each sample.

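To make the formatting concrete, here is a sketch of turning one pair into a template-ready sample. The actual token layout comes from the ALLaM tokenizer's chat template; the EOS string below is an assumption (in practice it is taken from `tokenizer.eos_token`).

```python
EOS_TOKEN = "</s>"  # assumed EOS string; in practice use tokenizer.eos_token

def enforce_eos(text: str, eos: str = EOS_TOKEN) -> str:
    # EOS enforcement: every target response ends with the EOS string exactly once.
    return text if text.endswith(eos) else text + eos

def to_chat_sample(system: str, user: str, assistant: str) -> list:
    # Role-tagged messages; tokenizer.apply_chat_template renders these
    # into ALLaM's actual prompt format during training.
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": enforce_eos(assistant)},
    ]
```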
### Training Procedure

- **Method:** SFT with TRL `SFTTrainer`
- **Parameter-efficient fine-tuning:** LoRA via PEFT
- **Final artifact:** the LoRA adapter was merged into the base model (`merge_and_unload`) and saved as a standalone model for standard loading.

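The merge step described above can be sketched as follows; the adapter path and the `merged_dir` helper are hypothetical names for illustration.

```python
def merged_dir(adapter_dir: str, suffix: str = "merged") -> str:
    # Hypothetical helper: directory name for the standalone merged checkpoint.
    return f"{adapter_dir.rstrip('/')}-{suffix}"

if __name__ == "__main__":
    from peft import AutoPeftModelForCausalLM

    adapter_dir = "outputs/bahraini-lora"  # hypothetical adapter path
    model = AutoPeftModelForCausalLM.from_pretrained(adapter_dir)

    # Fold the LoRA deltas into the base weights so the result loads as a
    # plain transformers model, with no PEFT dependency at inference time.
    merged = model.merge_and_unload()
    merged.save_pretrained(merged_dir(adapter_dir))
```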
### Training Hyperparameters (exact run)

The run aligned model config with tokenizer special tokens when needed (pad/bos/…).

Evaluation was primarily qualitative via prompt suites comparing:

- base model outputs vs. fine-tuned outputs
- dialect strength, conciseness, task completion, and reduction of MSA drift

The example prompt suite included:

- smalltalk
- sleep routine advice (short)
- WhatsApp apology message
- semi-formal request to a university
- home internet troubleshooting
- APN setup guidance
- online card rejection reasons
- electricity bill troubleshooting
- late-order customer-service ticket phrasing
- clarification-question behavior
- dialect rewriting (“ما أقدر الحين بس برجع لك بعدين”)
- mixed Arabic/English phrasing (refund/invoice)

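A side-by-side comparison like the one described can be scripted; `base_fn` and `tuned_fn` are hypothetical callables wrapping base and fine-tuned generation.

```python
def compare_outputs(prompts, base_fn, tuned_fn):
    # Pair base vs. fine-tuned outputs per prompt for qualitative review.
    return [
        {"prompt": p, "base": base_fn(p), "tuned": tuned_fn(p)}
        for p in prompts
    ]
```

Each row can then be rated by hand for dialect strength, conciseness, and MSA drift.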
## Compute / Infrastructure

- **Training stack:** `transformers`, `trl`, `peft`
- **Hardware:** single RTX 4090 GPU
- **Framework versions:** PEFT 0.18.1 (per metadata)
 
213
## Citation

If you cite this model or derivative work, cite the dataset and include the base model.

```bibtex
@dataset{barakat_bahraini_speech_2026,
  author    = {Hisham Barakat},
  title     = {Hishambarakat/Bahraini_Dialect_LLM},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Hishambarakat/Bahraini_Dialect_LLM},
  note      = {LinkedIn: https://www.linkedin.com/in/hishambarakat/}
}
```
231
 
232
  ## Contact
233
 
234
- * **Author:** Hisham Barakat
235
- * **LinkedIn:** [https://www.linkedin.com/in/hishambarakat/](https://www.linkedin.com/in/hishambarakat/)
 