Bahraini_Dialect_LLM (Research Fine-Tune on ALLaM-7B Instruct)
Research Summary
Bahraini_Dialect_LLM is a research-oriented fine-tune of humain-ai/ALLaM-7B-Instruct-preview aimed at studying Bahraini Arabic dialect controllability and low-resource dialect modeling.
The core goal is not to present a “new model built from scratch,” but to explore how far we can push a strong Arabic instruction model toward more natural Bahraini conversational behavior using:
- limited dialect-specific data,
- structured data cleaning,
- and controlled synthetic augmentation (rule-guided generation) that stays close to real conversational patterns.
This repo contains merged weights (base + LoRA adapter merged into a standalone model) so it can be loaded like a standard transformers model.
Motivation (Low-Resource Dialect Setting)
Bahraini Arabic is a low-resource variety compared to Modern Standard Arabic (MSA) and high-resource languages such as English. This project is a practical experiment in:
- capturing dialectal phrasing and pragmatics (tone, brevity, everyday wording),
- reducing drift into Modern Standard Arabic,
- and testing whether rule-based style constraints + LLM-based paraphrasing can produce training data that improves dialect fidelity without requiring large-scale native corpora.
This work is intended as a research prototype to understand the training dynamics, limitations, and trade-offs of dialect steering.
Model Details
- Fine-tuned by: Hisham Barakat (research fine-tune; base model ownership remains with the original authors)
- Base model: humain-ai/ALLaM-7B-Instruct-preview
- Model type: Causal LM (LLaMA-family architecture via ALLaM)
- Language: Arabic (Bahraini dialect focus)
- Training method: SFT with LoRA (PEFT), then merged
- Intended pipeline: text-generation
Intended Behavior (Research Target)
The target behavior for evaluation is:
- Bahraini dialect phrasing (minimize MSA)
- concise, practical assistant-like answers
- natural everyday tone (avoid overly formal scaffolding unless requested)
- broad everyday domains (customer-service style replies, basic troubleshooting, admin writing when asked)
Use & Scope
Direct Use (Recommended)
- Research and experimentation on:
- dialect controllability
- low-resource data bootstrapping
- prompt/style constraints for dialect steering
- evaluating drift, register, and consistency
Commercial Use
This repository is shared primarily for research and reproducibility. If you intend commercial use, review the base model license and verify compatibility with your intended deployment.
Out-of-Scope Use
- Medical/legal/financial advice beyond general informational guidance
- High-stakes decision-making without expert oversight
- Requests for sensitive personal data, illegal instructions, or harmful content
Bias, Risks, and Limitations
- Dialect coverage is strongest for a Bahraini conversational assistant style; it may still drift into Gulf-general or more formal Arabic in edge cases.
- Rule-guided synthetic data can imprint patterns (e.g., structure repetition, over-regular phrasing).
- The model may inherit biases from the base model and any source material used to build/augment the dataset.
How to Get Started
Load (merged model)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

REPO_ID = "Hishambarakat/Bahraini_Dialect_LLM"

# Prefer bf16 on Ampere+ GPUs; fall back to fp16 on older GPUs and fp32 on CPU.
if torch.cuda.is_available():
    DTYPE = torch.bfloat16 if torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16
else:
    DTYPE = torch.float32

tok = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO_ID, trust_remote_code=True, torch_dtype=DTYPE, device_map="auto"
)
model.eval()

# System prompt (Arabic): "You are an assistant who speaks Bahraini dialect naturally.
# Keep it brief and practical, and avoid MSA and formal language unless the user asks.
# If the question needs clarification to answer correctly, ask at most one or two
# questions. Keep the style Bahraini, not generic Gulf. Assume the addressee is male
# unless the user's wording clearly indicates otherwise."
SYSTEM = "أنت مساعد يتكلم باللهجة البحرينية بشكل طبيعي. خلك مختصر وعملي، وتجنب الفصحى واللغة الرسمية إلا إذا المستخدم طلب. إذا السؤال يحتاج توضيح عشان تجاوب صح، اسأل سؤال واحد أو اثنين بالكثير. حاول تخلي الأسلوب بحريني مو خليجي عام. افترض المخاطب ذكر إلا إذا واضح من كلام المستخدم غير جذي."

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "إذا نومي خربان شسوي؟"},  # "My sleep is messed up, what should I do?"
]

enc = tok.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
)
enc = {k: v.to(model.device) for k, v in enc.items()}

out = model.generate(
    **enc, max_new_tokens=80, do_sample=True, temperature=0.7,
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())
```
Training Details
Base Model
humain-ai/ALLaM-7B-Instruct-preview
Training Data (high-level)
Training was done on a curated Bahraini SFT-style corpus built from:
- Single-speaker Bahraini transcript corpus (cleaned and normalized)
- Synthetic-but-close-to-real conversational expansions, generated from the base style/voice and guided by strict rules to stay Bahraini
- Domain-targeted assistant Q&A (customer support, troubleshooting, daily admin writing) produced with controlled generation constraints
Data Construction Approach
The dataset was produced through a structured pipeline:
1. Cleaning + normalization of real transcript text (removing noise, artifacts, inconsistent punctuation)
2. Prompt/response structuring into instruction-style pairs
3. Controlled synthetic generation to expand coverage while keeping the same voice
4. A dialect rule-set (positive/negative constraints) to:
   - encourage Bahraini lexical markers (e.g., وايد، جذي، هني، شلون، عقبها/بعدها)
   - discourage MSA scaffolding and overly formal connectors
   - keep responses short and practical
5. Template correctness via the ALLaM chat template, with EOS enforcement
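The positive/negative constraints above can be sketched as a simple lexical filter over candidate training responses. The marker lists and word limit below are illustrative placeholders, not the project's actual rule-set:

```python
# Illustrative sketch of a dialect rule filter; the marker lists and the
# rejection criteria are hypothetical examples, not the project's actual rules.

BAHRAINI_MARKERS = ["وايد", "جذي", "هني", "شلون", "عقبها", "بعدها"]  # encourage
MSA_SCAFFOLDING = ["وبالتالي", "علاوة على ذلك", "من الجدير بالذكر", "إضافة إلى ذلك"]  # discourage
MAX_WORDS = 60  # keep responses short and practical


def passes_dialect_rules(response: str) -> bool:
    """Accept a candidate response only if it contains at least one Bahraini
    marker, avoids MSA scaffolding, and stays short."""
    has_marker = any(m in response for m in BAHRAINI_MARKERS)
    has_msa = any(m in response for m in MSA_SCAFFOLDING)
    short_enough = len(response.split()) <= MAX_WORDS
    return has_marker and not has_msa and short_enough


# A short Bahraini-flavored reply passes; an MSA-style one is rejected.
print(passes_dialect_rules("شلون؟ جرب تنام وايد بدري وعقبها شوف النتيجة"))   # True
print(passes_dialect_rules("علاوة على ذلك، ينبغي عليك تنظيم مواعيد نومك"))  # False
```

In a full pipeline such a filter would sit between the generation and curation steps, discarding synthetic candidates before they enter the SFT corpus.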
Prompt Format
Data was formatted using ALLaM’s chat template:
- system: dialect/style constraints
- user: prompt
- assistant: target response

EOS was enforced at the end of each sample.
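The formatting step can be illustrated with a minimal sketch. The actual run used the tokenizer's `apply_chat_template` with ALLaM's own template; the `<|...|>` tags and EOS string below are placeholders for illustration only:

```python
# Minimal illustration of turning a (system, user, assistant) triple into one
# training sample with EOS enforced. The tag strings and EOS are placeholders;
# the real run used tokenizer.apply_chat_template with ALLaM's template.

EOS = "</s>"  # placeholder EOS token


def format_sample(system: str, user: str, assistant: str) -> str:
    parts = [
        f"<|system|>\n{system}",
        f"<|user|>\n{user}",
        f"<|assistant|>\n{assistant}",
    ]
    # EOS is appended so the model learns where the assistant turn ends.
    return "\n".join(parts) + EOS


sample = format_sample("تكلم بحريني ومختصر", "شلونك؟", "زين الحمدالله، شخبارك؟")
print(sample.endswith(EOS))  # True
```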
Training Procedure
- Method: SFT with TRL SFTTrainer
- Parameter-efficient fine-tuning: LoRA via PEFT
- Final artifact: the LoRA adapter was merged into the base model (merge_and_unload) and saved as a standalone model for standard loading.
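The merge step can be sketched with PEFT's `merge_and_unload`; the adapter and output paths below are placeholders:

```python
# Sketch of merging a LoRA adapter into its base model with PEFT.
# "path/to/lora-adapter" and the output directory are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "humain-ai/ALLaM-7B-Instruct-preview", torch_dtype=torch.bfloat16
)

# Attach the trained adapter, then fold its weights into the base layers.
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()

# Save as a standalone model that loads without PEFT installed.
merged.save_pretrained("Bahraini_Dialect_LLM-merged")
AutoTokenizer.from_pretrained("humain-ai/ALLaM-7B-Instruct-preview").save_pretrained(
    "Bahraini_Dialect_LLM-merged"
)
```

After this step the merged directory loads directly via `AutoModelForCausalLM.from_pretrained`, as in the getting-started snippet.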
Training Hyperparameters (exact run)
Base configuration used during the run:
```yaml
max_seq_length: 2048
optimizer: adamw_torch
learning_rate: 2e-5
lr_scheduler: cosine
warmup_ratio: 0.1
weight_decay: 0.01
max_grad_norm: 1.0
per_device_train_batch_size: 4
gradient_accumulation_steps: 16
num_train_epochs: 4
packing: false
seed: 42
precision: bf16
attention_implementation: eager
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
lora:
  r: 16
  alpha: 32
  dropout: 0.05
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
```
Notes on Tokenizer / Special Tokens
The run aligned the model config with the tokenizer's special tokens (pad/bos/eos) when needed. At inference, generation commonly sets pad_token_id = eos_token_id and passes explicit attention masks, which avoids warnings and instability when pad == eos.
Evaluation
Evaluation was primarily qualitative via prompt suites comparing:
- base model outputs vs fine-tuned outputs
- dialect strength, conciseness, task completion, and reduction of MSA drift
Example prompt suite included:
- smalltalk
- sleep routine advice (short)
- WhatsApp apology message
- semi-formal request to university
- home internet troubleshooting
- APN setup guidance
- online card rejection reasons
- electricity bill troubleshooting
- late order customer-service ticket phrasing
- clarification questions behavior
- dialect rewriting ("ما أقدر الحين بس برجع لك بعدين" — "I can't right now, but I'll get back to you later")
- mixed Arabic/English phrasing (refund/invoice)
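A qualitative comparison like the one above can be organized as a small harness. The prompt suite is abbreviated, and the two `generate_*` callables are placeholders that would wrap base-model and fine-tuned generation:

```python
# Sketch of a side-by-side qualitative comparison harness. The suite is
# abbreviated and the generate_* functions are placeholders for real
# model calls (e.g. via model.generate on each checkpoint).

PROMPT_SUITE = [
    "إذا نومي خربان شسوي؟",          # sleep routine advice (short)
    "اكتب لي رسالة اعتذار واتساب",     # WhatsApp apology message
    "النت البيتي مقطع، شنو أسوي؟",     # home internet troubleshooting
]


def generate_base(prompt: str) -> str:
    return f"[base output for: {prompt}]"        # placeholder


def generate_ft(prompt: str) -> str:
    return f"[fine-tuned output for: {prompt}]"  # placeholder


def run_suite(prompts):
    """Collect paired outputs for manual review of dialect strength,
    conciseness, task completion, and MSA drift."""
    return [
        {"prompt": p, "base": generate_base(p), "fine_tuned": generate_ft(p)}
        for p in prompts
    ]


rows = run_suite(PROMPT_SUITE)
print(len(rows))  # 3
```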
Compute / Infrastructure
- Training stack: transformers, trl, peft
- Hardware: single RTX 4090 GPU
- Framework versions: PEFT 0.18.1 (per metadata)
Citation
Model
If you cite this model or derivative work, cite the dataset and include the base model reference.
Dataset (provided by author)
```bibtex
@dataset{barakat_bahraini_speech_2026,
  author    = {Hisham Barakat},
  title     = {Hishambarakat/Bahraini_Dialect_LLM},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Hishambarakat/Bahraini_Dialect_LLM},
  note      = {LinkedIn: https://www.linkedin.com/in/hishambarakat/}
}
```
Contact
- Author: Hisham Barakat
- LinkedIn: https://www.linkedin.com/in/hishambarakat/