# Bahraini_Dialect_LLM (Research Fine-Tune on ALLaM-7B Instruct)

## Research Summary

**Bahraini_Dialect_LLM** is a research-oriented fine-tune of **humain-ai/ALLaM-7B-Instruct-preview** aimed at studying **Bahraini Arabic dialect controllability** and **low-resource dialect modeling**.

The core goal is not to present a “new model built from scratch,” but to explore how far we can push a strong Arabic instruction model toward **more natural Bahraini conversational behavior** using:

- limited dialect-specific data,
- structured data cleaning,
- and controlled synthetic augmentation (rule-guided generation) that stays close to real conversational patterns.

This repo contains **merged** weights (base + LoRA adapter merged into a standalone model), so it can be loaded like a standard `transformers` model.
## Motivation (Low-Resource Dialect Setting)

Bahraini dialect is a **low-resource** variety compared to MSA and many high-resource English tasks. This project is a practical experiment in:

- capturing dialectal phrasing and pragmatics (tone, brevity, everyday wording),
- reducing drift into Modern Standard Arabic,
- and testing whether **rule-based style constraints + LLM-based paraphrasing** can produce training data that improves dialect fidelity without requiring large-scale native corpora.

This work is intended as a **research prototype** to understand the training dynamics, limitations, and trade-offs of dialect steering.

## Model Details

- **Fine-tuned by:** Hisham Barakat (research fine-tune; base model ownership remains with the original authors)
- **Base model:** `humain-ai/ALLaM-7B-Instruct-preview`
- **Model type:** Causal LM (LLaMA-family architecture via ALLaM)
- **Intended pipeline:** `text-generation`
## Intended Behavior (Research Target)

The target behavior for evaluation is:

- Bahraini dialect phrasing (minimize MSA)
- concise, practical assistant-like answers
- natural everyday tone (avoid overly formal scaffolding unless requested)

## Use & Scope

### Direct Use (Recommended)

- Research and experimentation on:
  - dialect controllability
  - low-resource data bootstrapping
  - evaluating drift, register, and consistency
### Commercial Use

This repository is shared primarily for **research and reproducibility**. If you intend commercial use, review the **base model license** and verify compatibility with your intended deployment.

### Out-of-Scope Use

- Medical/legal/financial advice beyond general informational guidance
- High-stakes decision-making without expert oversight
- Requests for sensitive personal data, illegal instructions, or harmful content

## Bias, Risks, and Limitations

- Dialect coverage is strongest for a **Bahraini conversational assistant** style; it may still drift into Gulf-general or more formal Arabic in edge cases.
- Rule-guided synthetic data can imprint patterns (e.g., structure repetition, over-regular phrasing).
- The model may inherit biases from the base model and any source material used to build/augment the dataset.
## How to Get Started

### Load (merged model)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Repo id assumed from this card's citation; adjust if the merged weights live elsewhere.
model_id = "Hishambarakat/Bahraini_Dialect_LLM"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Format the prompt with the model's chat template, then move tensors to the model device.
messages = [{"role": "user", "content": "شلونك اليوم؟"}]  # "How are you today?"
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
enc = tok(prompt, return_tensors="pt")
enc = {k: v.to(model.device) for k, v in enc.items()}

out = model.generate(**enc, max_new_tokens=80, do_sample=True, temperature=0.7, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())
```
## Training Details

### Base Model

- `humain-ai/ALLaM-7B-Instruct-preview`

### Training Data (high-level)

Training was done on a curated Bahraini SFT-style corpus built from:

- **Single-speaker Bahraini transcript corpus** (cleaned and normalized)
- **Synthetic-but-close-to-real conversational expansions**, generated from the base style/voice and guided by strict rules to stay Bahraini
- **Domain-targeted assistant Q&A** (customer support, troubleshooting, daily admin writing) produced with controlled generation constraints
### Data Construction Approach

The dataset was produced through a structured pipeline:

- Cleaning + normalization of real transcript text (removing noise, artifacts, inconsistent punctuation)
- Prompt/response structuring into instruction-style pairs
- Controlled synthetic generation to expand coverage while keeping the same voice
- A dialect rule-set (positive/negative constraints) to:
  - encourage Bahraini lexical markers (e.g., وايد “a lot”, جذي “like this”, هني “here”, شلون “how”, عقبها/بعدها “after that”)
  - discourage MSA scaffolding and overly formal connectors
  - keep responses short and practical
- Template correctness via the ALLaM chat template, with EOS enforcement
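The rule-set step above can be approximated as a simple lexical scorer used to filter synthetic candidates. This is a minimal sketch: the marker and connector lists and the weights are illustrative, not the project's actual rule-set.

```python
# Illustrative lists -- NOT the project's actual rule-set.
BAHRAINI_MARKERS = ["وايد", "جذي", "هني", "شلون", "عقبها", "بعدها"]
MSA_CONNECTORS = ["علاوة على ذلك", "وبالتالي", "لا سيما"]  # formal scaffolding to discourage

def dialect_score(text: str) -> int:
    """Crude dialect-fidelity score: +1 per Bahraini marker, -2 per MSA connector."""
    score = sum(text.count(m) for m in BAHRAINI_MARKERS)
    score -= 2 * sum(text.count(c) for c in MSA_CONNECTORS)
    return score

def keep_sample(text: str, min_score: int = 1, max_words: int = 60) -> bool:
    """Positive/negative constraints plus the 'short and practical' length rule."""
    return dialect_score(text) >= min_score and len(text.split()) <= max_words
```

A real filter would also need morphological awareness (substring counting over-matches in Arabic), but even a crude gate like this can cheaply reject obviously MSA-flavored generations before human review.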
### Prompt Format

Data was formatted using ALLaM’s chat template:

- system: dialect/style constraints
- user: prompt
- assistant: target response

and EOS was enforced at the end of each sample.
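The three-role structure above can be sketched as a plain sample builder. The `SYSTEM_RULES` text and the `"</s>"` default are illustrative; in the actual pipeline the message list would be rendered with the tokenizer's `apply_chat_template` and the tokenizer's real EOS token.

```python
# Illustrative system constraint -- the project's actual rule text is not published here.
SYSTEM_RULES = "جاوب باللهجة البحرينية، باختصار وبأسلوب يومي."  # "Answer in Bahraini dialect, briefly, in an everyday style."

def build_sample(user_prompt: str, assistant_response: str, eos_token: str = "</s>") -> list:
    """Structure one SFT sample as system/user/assistant messages, EOS-terminated."""
    if not assistant_response.endswith(eos_token):
        assistant_response += eos_token  # enforce EOS so the model learns to stop
    return [
        {"role": "system", "content": SYSTEM_RULES},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_response},
    ]

# With a Hugging Face tokenizer, the rendered training text would then be:
#   tok.apply_chat_template(build_sample(q, a, tok.eos_token), tokenize=False)
```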
### Training Procedure

- **Method:** SFT with TRL `SFTTrainer`
- **Parameter-efficient fine-tuning:** LoRA via PEFT
- **Final artifact:** the LoRA adapter was merged into the base model (`merge_and_unload`) and saved as a standalone model for standard loading

### Training Hyperparameters (exact run)

The run aligned the model config with the tokenizer's special tokens when needed.
## Evaluation

Evaluation was primarily qualitative via prompt suites comparing:

- base model outputs vs. fine-tuned outputs
- dialect strength, conciseness, task completion, and reduction of MSA drift

Example prompt suite included:

- small talk
- sleep routine advice (short)
- WhatsApp apology message
- semi-formal request to a university
- home internet troubleshooting
- APN setup guidance
- online card rejection reasons
- electricity bill troubleshooting
- late-order customer-service ticket phrasing
- clarification-questions behavior
- dialect rewriting (“ما أقدر الحين بس برجع لك بعدين”, i.e., “I can’t right now, but I’ll get back to you later”)
- mixed Arabic/English phrasing (refund/invoice)
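A qualitative side-by-side pass over a suite like this can be scripted. This is a minimal sketch: the abbreviated prompt list, the marker list, and the `base_fn`/`tuned_fn` callables are placeholders for the real suite, the real rule-set, and real model calls.

```python
# Placeholder prompt suite (abbreviated); the callables stand in for model.generate pipelines.
PROMPT_SUITE = [
    "سولف معاي شوي",            # "chat with me a bit" (small talk)
    "عطني نصيحة قصيرة للنوم",    # "give me short sleep advice"
    "اكتب رسالة اعتذار واتساب",  # "write a WhatsApp apology message"
]
BAHRAINI_MARKERS = ["وايد", "جذي", "هني", "شلون"]

def marker_count(text: str) -> int:
    """Count Bahraini lexical markers as a rough dialect-strength signal."""
    return sum(text.count(m) for m in BAHRAINI_MARKERS)

def compare(base_fn, tuned_fn, prompts=PROMPT_SUITE):
    """Return (prompt, base_markers, tuned_markers) rows for eyeballing dialect drift."""
    return [(p, marker_count(base_fn(p)), marker_count(tuned_fn(p))) for p in prompts]
```

Marker counts are only a screening signal; conciseness, task completion, and register still need human judgment per prompt.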
## Compute / Infrastructure

- **Training stack:** `transformers`, `trl`, `peft`
- **Hardware:** single NVIDIA RTX 4090 GPU
- **Framework versions:** PEFT 0.18.1 (per metadata)
## Citation

If you cite this model or derivative work, cite the dataset and include the base model.

```bibtex
@dataset{barakat_bahraini_speech_2026,
  author    = {Hisham Barakat},
  title     = {Hishambarakat/Bahraini_Dialect_LLM},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Hishambarakat/Bahraini_Dialect_LLM},
  note      = {LinkedIn: https://www.linkedin.com/in/hishambarakat/}
}
```
## Contact

- **Author:** Hisham Barakat
- **LinkedIn:** [https://www.linkedin.com/in/hishambarakat/](https://www.linkedin.com/in/hishambarakat/)