Bahraini_Dialect_LLM (Research Fine-Tune on ALLaM-7B Instruct)

Research Summary

Bahraini_Dialect_LLM is a research-oriented fine-tune of humain-ai/ALLaM-7B-Instruct-preview aimed at studying Bahraini Arabic dialect controllability and low-resource dialect modeling.

The core goal is not to present a “new model built from scratch,” but to explore how far we can push a strong Arabic instruction model toward more natural Bahraini conversational behavior using:

  • limited dialect-specific data,
  • structured data cleaning,
  • and controlled synthetic augmentation (rule-guided generation) that stays close to real conversational patterns.

This repo contains merged weights (base + LoRA adapter merged into a standalone model) so it can be loaded like a standard transformers model.

Motivation (Low-Resource Dialect Setting)

Bahraini Arabic is a low-resource variety compared to Modern Standard Arabic (MSA) and high-resource languages such as English. This project is a practical experiment in:

  • capturing dialectal phrasing and pragmatics (tone, brevity, everyday wording),
  • reducing drift into Modern Standard Arabic,
  • and testing whether rule-based style constraints + LLM-based paraphrasing can produce training data that improves dialect fidelity without requiring large-scale native corpora.

This work is intended as a research prototype to understand the training dynamics, limitations, and trade-offs of dialect steering.

Model Details

  • Fine-tuned by: Hisham Barakat (research fine-tune; base model ownership remains with original authors)
  • Base model: humain-ai/ALLaM-7B-Instruct-preview
  • Model type: Causal LM (LLaMA-family architecture via ALLaM)
  • Language: Arabic (Bahraini dialect focus)
  • Training method: SFT with LoRA (PEFT), then merged
  • Intended pipeline: text-generation

Intended Behavior (Research Target)

The target behavior for evaluation is:

  • Bahraini dialect phrasing (minimize MSA)
  • concise, practical assistant-like answers
  • natural everyday tone (avoid overly formal scaffolding unless requested)
  • broad everyday domains (customer-service style replies, basic troubleshooting, admin writing when asked)

Use & Scope

Direct Use (Recommended)

  • Research and experimentation on:
    • dialect controllability
    • low-resource data bootstrapping
    • prompt/style constraints for dialect steering
    • evaluating drift, register, and consistency

Commercial Use

This repository is shared primarily for research and reproducibility. If you intend commercial use, review the base model license and verify compatibility with your intended deployment.

Out-of-Scope Use

  • Medical/legal/financial advice beyond general informational guidance
  • High-stakes decision-making without expert oversight
  • Requests for sensitive personal data, illegal instructions, or harmful content

Bias, Risks, and Limitations

  • Dialect coverage is strongest for a Bahraini conversational assistant style; it may still drift into Gulf-general or more formal Arabic in edge cases.
  • Rule-guided synthetic data can imprint patterns (e.g., structure repetition, over-regular phrasing).
  • The model may inherit biases from the base model and any source material used to build/augment the dataset.

How to Get Started

Load (merged model)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

REPO_ID = "Hishambarakat/Bahraini_Dialect_LLM"
DTYPE = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
    else torch.float16
)

tok = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, trust_remote_code=True, torch_dtype=DTYPE, device_map="auto")
model.eval()

# System prompt (Bahraini Arabic). Rough translation: "You are an assistant who speaks
# Bahraini dialect naturally. Keep it short and practical, and avoid MSA and formal
# language unless the user asks for it. If the question needs clarification to answer
# correctly, ask one or two questions at most. Keep the style Bahraini, not generic
# Gulf. Assume the addressee is male unless the user's wording clearly says otherwise."
SYSTEM = "أنت مساعد يتكلم باللهجة البحرينية بشكل طبيعي. خلك مختصر وعملي، وتجنب الفصحى واللغة الرسمية إلا إذا المستخدم طلب. إذا السؤال يحتاج توضيح عشان تجاوب صح، اسأل سؤال واحد أو اثنين بالكثير. حاول تخلي الأسلوب بحريني مو خليجي عام. افترض المخاطب ذكر إلا إذا واضح من كلام المستخدم غير جذي."

messages = [
  {"role":"system","content":SYSTEM},
  {"role":"user","content":"إذا نومي خربان شسوي؟"}  # "My sleep is messed up, what should I do?"
]
enc = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
enc = {k:v.to(model.device) for k,v in enc.items()}

out = model.generate(**enc, max_new_tokens=80, do_sample=True, temperature=0.7, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())

Training Details

Base Model

  • humain-ai/ALLaM-7B-Instruct-preview

Training Data (high-level)

Training was done on a curated Bahraini SFT-style corpus built from:

  • Single-speaker Bahraini transcript corpus (cleaned and normalized)
  • Synthetic-but-close-to-real conversational expansions, generated from the base style/voice and guided by strict rules to stay Bahraini
  • Domain-targeted assistant Q&A (customer support, troubleshooting, daily admin writing) produced with controlled generation constraints

Data Construction Approach

The dataset was produced through a structured pipeline:

  • Cleaning + normalization on real transcript text (removing noise, artifacts, inconsistent punctuation)

  • Prompt/response structuring into instruction-style pairs

  • Controlled synthetic generation to expand coverage while keeping the same voice

  • A dialect rule-set (positive/negative constraints) to:

    • encourage Bahraini lexical markers (e.g., وايد، جذي، هني، شلون، عقبها/بعدها)
    • discourage MSA scaffolding and overly formal connectors
    • keep responses short and practical
  • Template correctness via the ALLaM chat template, with EOS enforcement
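The positive/negative constraint idea can be sketched as a simple lexical filter. The marker lists and the `dialect_score` helper below are illustrative, not the exact rule-set used to build this dataset:

```python
import re

# Illustrative marker lists (not the exact rule-set used for this model).
BAHRAINI_MARKERS = ["وايد", "جذي", "هني", "شلون", "عقبها", "بعدها"]
MSA_SCAFFOLDING = ["بالإضافة إلى ذلك", "علاوة على ذلك", "وبالتالي", "ومن ثم"]

def dialect_score(text: str) -> dict:
    """Count positive (Bahraini) and negative (formal MSA) markers in a response."""
    pos = sum(len(re.findall(re.escape(m), text)) for m in BAHRAINI_MARKERS)
    neg = sum(len(re.findall(re.escape(m), text)) for m in MSA_SCAFFOLDING)
    # A sample passes the filter when it carries dialect markers and no MSA scaffolding.
    return {"bahraini_markers": pos, "msa_markers": neg, "keep": pos > 0 and neg == 0}

sample = "شلون الوضع؟ تعال هني عقبها نكمل، الموضوع وايد بسيط."
print(dialect_score(sample))  # {'bahraini_markers': 4, 'msa_markers': 0, 'keep': True}
```

In a real pipeline a check like this would run on candidate synthetic generations, keeping only those that stay inside the target register.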

Prompt Format

Data was formatted using ALLaM’s chat template:

  • system: dialect/style constraints
  • user: prompt
  • assistant: target response

An EOS token was enforced at the end of each sample.
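EOS enforcement can be as simple as the helper below. The `</s>` token here is an assumption for illustration only; in practice the real token comes from the model's tokenizer (`tok.eos_token`):

```python
def ensure_eos(sample_text: str, eos_token: str) -> str:
    """Append the tokenizer's EOS token if the sample does not already end with it."""
    stripped = sample_text.rstrip()
    return stripped if stripped.endswith(eos_token) else stripped + eos_token

# "</s>" is illustrative; with the actual tokenizer, pass tok.eos_token instead.
print(ensure_eos("رد المساعد هنا", "</s>"))  # رد المساعد هنا</s>
```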

Training Procedure

  • Method: SFT with TRL SFTTrainer
  • Parameter-efficient fine-tuning: LoRA via PEFT
  • Final artifact: LoRA adapter was merged into the base model (merge_and_unload) and saved as a standalone model for standard loading.

Training Hyperparameters (exact run)

Base configuration used during the run:

max_seq_length: 2048
optimizer: adamw_torch
learning_rate: 2e-5
lr_scheduler: cosine
warmup_ratio: 0.1
weight_decay: 0.01
max_grad_norm: 1.0
per_device_train_batch_size: 4
gradient_accumulation_steps: 16
num_train_epochs: 4
packing: false
seed: 42
precision: bf16
attention_implementation: eager
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
lora:
  r: 16
  alpha: 32
  dropout: 0.05
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
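Two quantities implied by the configuration above are worth spelling out: the effective batch size per optimizer step (per-device batch × gradient accumulation) and the LoRA scaling factor (alpha / r):

```python
# Values copied from the run configuration above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
lora_r, lora_alpha = 16, 32

# Effective (global) batch size per optimizer step on a single GPU.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 64

# LoRA updates are scaled by alpha / r before being added to the base weights.
lora_scaling = lora_alpha / lora_r
print(lora_scaling)  # 2.0
```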

Notes on Tokenizer / Special Tokens

The run aligned model config with tokenizer special tokens when needed (pad/bos/eos). Generation commonly uses pad_token_id = eos_token_id with explicit attention masks during inference to avoid warnings and instability when pad==eos.
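The pad/eos convention described above can be captured in a small helper. This is a sketch of the convention, not library code; the function name is hypothetical:

```python
from typing import Optional

def resolve_pad_token_id(pad_token_id: Optional[int], eos_token_id: int) -> int:
    """Fall back to EOS as the padding id when the tokenizer defines no pad token,
    which is the convention used for generation in this repo."""
    return eos_token_id if pad_token_id is None else pad_token_id

# Typical case for LLaMA-family tokenizers: no dedicated pad token is defined.
print(resolve_pad_token_id(None, 2))  # 2
```

Passing an explicit attention mask alongside this (as the loading example above does via the tokenized chat template) is what avoids the pad==eos ambiguity at inference time.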

Evaluation

Evaluation was primarily qualitative via prompt suites comparing:

  • base model outputs vs fine-tuned outputs
  • dialect strength, conciseness, task completion, and reduction of MSA drift

Example prompt suite included:

  • smalltalk
  • sleep routine advice (short)
  • WhatsApp apology message
  • semi-formal request to university
  • home internet troubleshooting
  • APN setup guidance
  • online card rejection reasons
  • electricity bill troubleshooting
  • late order customer-service ticket phrasing
  • clarification questions behavior
  • dialect rewriting (“ما أقدر الحين بس برجع لك بعدين”, roughly "I can't right now, but I'll get back to you later")
  • mixed Arabic/English phrasing (refund/invoice)
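A minimal harness for the base-vs-fine-tuned comparison might look like the sketch below. `run_suite` and `generate_fn` are illustrative names; `generate_fn` stands in for any callable that wraps `model.generate` for one of the two models:

```python
from typing import Callable

def run_suite(prompts: list[str], generate_fn: Callable[[str], str]) -> dict[str, str]:
    """Run every prompt in the suite through one model and collect its replies."""
    return {p: generate_fn(p) for p in prompts}

# Illustrative smalltalk / sleep-advice prompts from the suite.
suite = ["سولف معاي شوي", "نومي خربان، شسوي؟"]

# Stub generators for demonstration; swap in wrappers around each model's generate().
base_out = run_suite(suite, lambda p: f"[base] {p}")
tuned_out = run_suite(suite, lambda p: f"[tuned] {p}")

for p in suite:
    print(p, "|", base_out[p], "|", tuned_out[p])
```

Qualitative judging (dialect strength, conciseness, task completion) then happens on the paired outputs side by side.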

Compute / Infrastructure

  • Training stack: transformers, trl, peft
  • Hardware: Single GPU RTX 4090
  • Framework versions: PEFT 0.18.1 (per metadata)

Citation

Model

If you cite this model or derivative work, cite the dataset and include the base model reference.

Dataset (provided by author)

@dataset{barakat_bahraini_speech_2026,
  author       = {Hisham Barakat},
  title        = {Hishambarakat/Bahraini_Dialect_LLM},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/datasets/Hishambarakat/Bahraini_Dialect_LLM},
  note         = {LinkedIn: https://www.linkedin.com/in/hishambarakat/}
}

Contact
