---
base_model: humain-ai/ALLaM-7B-Instruct-preview
library_name: peft
pipeline_tag: text-generation
tags:
  - base_model:adapter:humain-ai/ALLaM-7B-Instruct-preview
  - lora
  - sft
  - transformers
  - trl
language:
  - ar
license: other
---

# Bahraini_Dialect_LLM (Research Fine-Tune on ALLaM-7B Instruct)

## Research Summary

**Bahraini_Dialect_LLM** is a research-oriented fine-tune of **humain-ai/ALLaM-7B-Instruct-preview** aimed at studying **Bahraini Arabic dialect controllability** and **low-resource dialect modeling**.

The core goal is not to present a “new model built from scratch,” but to explore how far we can push a strong Arabic instruction model toward **more natural Bahraini conversational behavior** using:

- limited dialect-specific data,
- structured data cleaning,
- and controlled synthetic augmentation (rule-guided generation) that stays close to real conversational patterns.

This repo contains **merged** weights (base + LoRA adapter merged into a standalone model) so it can be loaded like a standard `transformers` model.

## Motivation (Low-Resource Dialect Setting)

Bahraini dialect is a **low-resource** variety compared to MSA and to high-resource languages such as English. This project is a practical experiment in:

- capturing dialectal phrasing and pragmatics (tone, brevity, everyday wording),
- reducing drift into Modern Standard Arabic,
- and testing whether **rule-based style constraints + LLM-based paraphrasing** can produce training data that improves dialect fidelity without requiring large-scale native corpora.

This work is intended as a **research prototype** to understand the training dynamics, limitations, and trade-offs of dialect steering.

## Model Details

- **Fine-tuned by:** Hisham Barakat (research fine-tune; base model ownership remains with original authors)
- **Base model:** `humain-ai/ALLaM-7B-Instruct-preview`
- **Model type:** Causal LM (LLaMA-family architecture via ALLaM)
- **Language:** Arabic (Bahraini dialect focus)
- **Training method:** SFT with LoRA (PEFT), then merged
- **Intended pipeline:** `text-generation`

## Intended Behavior (Research Target)

The target behavior for evaluation is:

- Bahraini dialect phrasing (minimize MSA)
- concise, practical assistant-like answers
- natural everyday tone (avoid overly formal scaffolding unless requested)
- broad everyday domains (customer-service style replies, basic troubleshooting, admin writing when asked)

## Use & Scope

### Direct Use (Recommended)

- Research and experimentation on:
  - dialect controllability
  - low-resource data bootstrapping
  - prompt/style constraints for dialect steering
  - evaluating drift, register, and consistency

### Commercial Use

This repository is shared primarily for **research and reproducibility**. If you intend commercial use, review the **base model license** and verify compatibility with your intended deployment.

### Out-of-Scope Use

- Medical/legal/financial advice beyond general informational guidance
- High-stakes decision-making without expert oversight
- Requests for sensitive personal data, illegal instructions, or harmful content

## Bias, Risks, and Limitations

- Dialect coverage is strongest for a **Bahraini conversational assistant** style; it may still drift into Gulf-general or more formal Arabic in edge cases.
- Rule-guided synthetic data can imprint patterns (e.g., structure repetition, over-regular phrasing).
- The model may inherit biases from the base model and any source material used to build/augment the dataset.

## How to Get Started

### Load (merged model)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

REPO_ID = "Hishambarakat/Bahraini_Dialect_LLM"
# bf16 on Ampere+ GPUs; fall back to fp16 elsewhere (avoids crashing on CPU-only machines)
DTYPE = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
    else torch.float16
)

tok = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, trust_remote_code=True, torch_dtype=DTYPE, device_map="auto")
model.eval()

# English gloss of the system prompt: "You are an assistant who speaks Bahraini
# dialect naturally. Be brief and practical, and avoid MSA and formal language
# unless the user asks for it. If the question needs clarification to answer
# correctly, ask at most one or two questions. Keep the style Bahraini, not
# generic Gulf. Assume the addressee is male unless the user's wording clearly
# indicates otherwise."
SYSTEM = "أنت مساعد يتكلم باللهجة البحرينية بشكل طبيعي. خلك مختصر وعملي، وتجنب الفصحى واللغة الرسمية إلا إذا المستخدم طلب. إذا السؤال يحتاج توضيح عشان تجاوب صح، اسأل سؤال واحد أو اثنين بالكثير. حاول تخلي الأسلوب بحريني مو خليجي عام. افترض المخاطب ذكر إلا إذا واضح من كلام المستخدم غير جذي."

messages = [
  {"role":"system","content":SYSTEM},
  {"role":"user","content":"إذا نومي خربان شسوي؟"}
]
enc = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
enc = {k:v.to(model.device) for k,v in enc.items()}

out = model.generate(**enc, max_new_tokens=80, do_sample=True, temperature=0.7, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())
```

## Training Details

### Base Model

- `humain-ai/ALLaM-7B-Instruct-preview`

### Training Data (high-level)

Training was done on a curated Bahraini SFT-style corpus built from:

- **Single-speaker Bahraini transcript corpus** (cleaned and normalized)
- **Synthetic-but-close-to-real conversational expansions**, generated from the base style/voice and guided by strict rules to stay Bahraini
- **Domain-targeted assistant Q&A** (customer support, troubleshooting, daily admin writing) produced with controlled generation constraints

### Data Construction Approach

The dataset was produced through a structured pipeline:

- Cleaning + normalization on real transcript text (removing noise, artifacts, inconsistent punctuation)
- Prompt/response structuring into instruction-style pairs
- Controlled synthetic generation to expand coverage while keeping the same voice
- A dialect rule-set (positive/negative constraints) to:
  - encourage Bahraini lexical markers (e.g., وايد، جذي، هني، شلون، عقبها/بعدها)
  - discourage MSA scaffolding and overly formal connectors
  - keep responses short and practical

- Template correctness via the ALLaM chat template, with EOS enforcement
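A minimal sketch of what such a rule-set check might look like. The marker lists, scoring, and length threshold below are illustrative stand-ins, not the actual constraints used in training:

```python
# Illustrative dialect rule-set check. The marker lists and scoring are
# hypothetical examples, not the project's actual rules.

# Positive markers: Bahraini colloquial words (from the list above).
BAHRAINI_MARKERS = {"وايد", "جذي", "هني", "شلون", "عقبها", "بعدها"}
# Negative markers: formal MSA connectors to discourage.
MSA_CONNECTORS = {"علاوة على ذلك", "بالإضافة إلى", "وبالتالي", "حيث أن"}

def dialect_score(text: str) -> int:
    """Count Bahraini markers minus MSA connectors in a candidate response."""
    tokens = text.split()
    pos = sum(1 for t in tokens if t in BAHRAINI_MARKERS)
    neg = sum(1 for phrase in MSA_CONNECTORS if phrase in text)
    return pos - neg

def passes_rules(text: str, max_words: int = 60) -> bool:
    """Keep a synthetic sample only if it leans dialectal and stays short."""
    return dialect_score(text) > 0 and len(text.split()) <= max_words

print(passes_rules("شلون حالك؟ وايد زين"))  # True: two positive markers, short
```

A filter like this can gate synthetic generations before they enter the training set, rejecting outputs that drift formal or run long.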

### Prompt Format

Data was formatted using ALLaM’s chat template:

- system: dialect/style constraints
- user: prompt
- assistant: target response

An EOS token was enforced at the end of each sample.
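
The per-sample structure can be sketched as follows. The EOS string `"</s>"` here is an illustrative placeholder; in the actual pipeline it comes from the ALLaM tokenizer (`tokenizer.eos_token`), and `apply_chat_template` renders the messages into the template:

```python
# Sketch of how a training sample was structured. The EOS string ("</s>")
# is a placeholder; the real pipeline uses tokenizer.eos_token.
def build_sample(system: str, user: str, assistant: str, eos: str = "</s>") -> dict:
    """Return an instruction-style sample with EOS enforced on the target."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            # EOS appended so the model learns to stop after each reply.
            {"role": "assistant", "content": assistant + eos},
        ]
    }

sample = build_sample("dialect/style constraints", "prompt", "target response")
print(sample["messages"][-1]["content"])  # target response</s>
```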

### Training Procedure

- **Method:** SFT with TRL `SFTTrainer`
- **Parameter-efficient fine-tuning:** LoRA via PEFT
- **Final artifact:** LoRA adapter was merged into the base model (`merge_and_unload`) and saved as a standalone model for standard loading.
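
The merge step can be sketched roughly as below. Paths and repo IDs are placeholders, and the imports are deferred inside the function so the sketch can be defined without `transformers`/`peft` installed or GPU access:

```python
def merge_lora_and_save(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Load the base model, apply the LoRA adapter, merge, and save.

    Sketch only: base_id, adapter_dir, and out_dir are placeholders.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()  # fold LoRA deltas into the base weights
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)

# Example invocation (not executed here; adapter path is hypothetical):
# merge_lora_and_save("humain-ai/ALLaM-7B-Instruct-preview", "outputs/lora", "outputs/merged")
```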

### Training Hyperparameters (exact run)

Base configuration used during the run:

```yaml
max_seq_length: 2048
optimizer: adamw_torch
learning_rate: 2e-5
lr_scheduler: cosine
warmup_ratio: 0.1
weight_decay: 0.01
max_grad_norm: 1.0
per_device_train_batch_size: 4
gradient_accumulation_steps: 16
num_train_epochs: 4
packing: false
seed: 42
precision: bf16
attention_implementation: eager
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
lora:
  r: 16
  alpha: 32
  dropout: 0.05
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
```

### Notes on Tokenizer / Special Tokens

The run aligned the model config with the tokenizer's special tokens (pad/bos/eos) when needed. Generation commonly uses `pad_token_id = eos_token_id` with explicit attention masks during inference, which avoids warnings and instability when the pad and EOS tokens coincide.
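
The alignment step can be sketched like this, demonstrated with stand-in objects rather than a real tokenizer (a real Hugging Face tokenizer and model config are used the same way):

```python
from types import SimpleNamespace

def align_pad_token(tokenizer, model_config) -> None:
    """If no pad token is set, reuse EOS and mirror the id into the config."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    model_config.pad_token_id = tokenizer.pad_token_id

# Demo with stand-in objects; "</s>" / id 2 are illustrative values.
tok = SimpleNamespace(pad_token=None, pad_token_id=None, eos_token="</s>", eos_token_id=2)
cfg = SimpleNamespace(pad_token_id=None)
align_pad_token(tok, cfg)
print(tok.pad_token, cfg.pad_token_id)  # </s> 2
```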

## Evaluation

Evaluation was primarily qualitative via prompt suites comparing:

- base model outputs vs fine-tuned outputs
- dialect strength, conciseness, task completion, and reduction of MSA drift

Example prompt suite included:

- smalltalk
- sleep routine advice (short)
- WhatsApp apology message
- semi-formal request to university
- home internet troubleshooting
- APN setup guidance
- online card rejection reasons
- electricity bill troubleshooting
- late order customer-service ticket phrasing
- clarification questions behavior
- dialect rewriting (“ما أقدر الحين بس برجع لك بعدين”)
- mixed Arabic/English phrasing (refund/invoice)
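
A minimal harness for this kind of side-by-side comparison might look like the following. The generator callables are stand-ins for wrappers around `model.generate` on the base and fine-tuned models:

```python
def compare_outputs(prompts, generate_base, generate_ft):
    """Run each prompt through both models and pair the outputs for review."""
    return [
        {"prompt": p, "base": generate_base(p), "finetuned": generate_ft(p)}
        for p in prompts
    ]

# Demo with stub generators standing in for the two models.
rows = compare_outputs(["شلون حالك؟"], lambda p: "base reply", lambda p: "ft reply")
print(rows[0]["finetuned"])  # ft reply
```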

## Compute / Infrastructure

- **Training stack:** `transformers`, `trl`, `peft`
- **Hardware:** Single GPU RTX 4090
- **Framework versions:** PEFT 0.18.1 (per metadata)

## Citation

### Model

If you cite this model or derivative work, cite the dataset and include the base model reference.

### Dataset (provided by author)

```bibtex
@dataset{barakat_bahraini_speech_2026,
  author       = {Hisham Barakat},
  title        = {Hishambarakat/Bahraini_Dialect_LLM},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/datasets/Hishambarakat/Bahraini_Dialect_LLM},
  note         = {LinkedIn: https://www.linkedin.com/in/hishambarakat/}
}
```

## Contact

- **Author:** Hisham Barakat
- **LinkedIn:** [https://www.linkedin.com/in/hishambarakat/](https://www.linkedin.com/in/hishambarakat/)