Hishambarakat/Bahraini_Dialect_LLM

@@ -3,207 +3,206 @@ base_model: humain-ai/ALLaM-7B-Instruct-preview
 library_name: peft
 pipeline_tag: text-generation
 tags:
-- base_model:adapter:humain-ai/ALLaM-7B-Instruct-preview
-- lora
-- sft
-- transformers
-- trl
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
 ## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 ### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
 ### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
 ## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
 ## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]
-### Framework versions
-- PEFT 0.18.1

 library_name: peft
 pipeline_tag: text-generation
 tags:
+  - base_model:adapter:humain-ai/ALLaM-7B-Instruct-preview
+  - lora
+  - sft
+  - transformers
+  - trl
+language:
+  - ar
+license: other
 ---
+# Bahraini_Dialect_LLM (ALLaM-7B Instruct + Bahraini SFT)
+## Model Summary
+**Bahraini_Dialect_LLM** is a fine-tune of **humain-ai/ALLaM-7B-Instruct-preview** for the Bahraini Arabic dialect. It is trained to give **short, natural Bahraini responses** rather than Modern Standard Arabic (MSA), with stronger dialectal phrasing and broader coverage of everyday Q&A and practical assistant-style tasks.
+This repo contains the **merged** weights (base + LoRA adapter merged into a standalone model) suitable for standard `transformers` loading.
 ## Model Details
+- **Developed by:** Hisham Barakat
+- **Base model:** `humain-ai/ALLaM-7B-Instruct-preview`
+- **Model type:** Causal LM (LLaMA-family architecture via ALLaM)
+- **Language:** Arabic (Bahraini dialect focus)
+- **Training method:** Supervised Fine-Tuning (SFT) with LoRA (PEFT), then **merged**
+- **Intended pipeline:** `text-generation`
+## Intended Behavior
+The target behavior is:
+- Bahraini dialect (not MSA)
+- concise and clear
+- practical and grounded answers for daily-life and assistant-like queries
+- avoids overly formal greetings/phrasing unless explicitly requested
 ## Uses
 ### Direct Use
+- Bahraini dialect assistant-style responses for:
+  - everyday chat / smalltalk
+  - short customer-service style replies
+  - practical troubleshooting (internet/APN/basic devices)
+  - simple admin writing (short “semi-formal” when requested)
+  - general Q&A
 ### Out-of-Scope Use
+- Medical/legal/financial advice beyond general informational guidance
+- Generating sensitive personal data, illegal instructions, or harmful content
+- High-stakes decision making without human review
 ## Bias, Risks, and Limitations
+- Dialect coverage is strongest for **Bahraini conversational assistant** style; it may still drift to Gulf-general or more formal Arabic in some cases.
+- Synthetic paraphrasing and rule-driven generation can imprint patterns (over-structured answers, repeated phrasing).
+- The model may inherit biases present in the base model and any source material used to generate/clean the dataset.
+## How to Get Started
+### Load (merged model)
+```python
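+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+REPO_ID = "Hishambarakat/Bahraini_Dialect_LLM"
+# bf16 on Ampere+ GPUs (compute capability >= 8), fp16 on older GPUs such as T4.
+# The is_available() guard avoids an error on machines without CUDA.
+if torch.cuda.is_available():
+    DTYPE = torch.bfloat16 if torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16
+else:
+    DTYPE = torch.float32  # CPU fallback
+
+tok = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(REPO_ID, trust_remote_code=True, torch_dtype=DTYPE, device_map="auto")
+model.eval()
+
+# System prompt, kept in Arabic because it is model input. In English:
+# "Speak natural Bahraini. Avoid formal Arabic and (تمام/أرجو/عادة). Use: وايد، جذي، هني، شلون، عقبها/بعدها، ما ضبط.
+#  Assume a male addressee unless there are signs the addressee is female."
+SYSTEM = "تكلم بحريني طبيعي. تجنب الفصحى و(تمام/أرجو/عادة). استخدم: وايد، جذي، هني، شلون، عقبها/بعدها، ما ضبط. افترض مخاطب ذكر إلا إذا في مؤشرات أنثى."
+messages = [
+  {"role":"system","content":SYSTEM},
+  {"role":"user","content":"إذا نومي خربان شسوي؟"}  # "My sleep is messed up, what should I do?"
+]
+# Render the chat template to tensors and move them to the model's device.
+enc = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
+enc = {k:v.to(model.device) for k,v in enc.items()}
+out = model.generate(**enc, max_new_tokens=80, do_sample=True, temperature=0.7, pad_token_id=tok.eos_token_id)
+# Decode only the newly generated tokens, skipping the prompt.
+print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())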
+```
 ## Training Details
+### Base Model
+* `humain-ai/ALLaM-7B-Instruct-preview`
+### Training Data (high-level)
+Training was done on a curated Bahraini SFT-style corpus built from:
+* **Single-speaker Bahraini transcript corpus** (cleaned and normalized)
+* **Synthetic-but-close-to-real conversational expansions**, generated from the base style/voice and guided by strict rules to stay Bahraini
+* **Domain-targeted assistant Q&A** (customer support, troubleshooting, daily admin writing) produced with controlled generation constraints
+### Data Construction Approach (what makes it “Bahraini”)
+The dataset was produced through a structured pipeline:
+* Cleaning + normalization on real transcript text (removing noise, artifacts, inconsistent punctuation)
+* Prompt/response structuring into instruction-style pairs
+* Controlled synthetic generation to expand coverage while keeping the same voice
+* A dialect rule-set (positive/negative constraints) to:
+  * encourage Bahraini lexical markers (e.g., وايد، جذي، هني، شلون، عقبها/بعدها، ما ضبط)
+  * discourage MSA scaffolding and overly formal connectors
+  * keep responses short and practical
+* Template correctness via the ALLaM chat template, with EOS enforcement
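The positive/negative constraints in the rule-set above can be pictured as a simple lexical filter. The sketch below is not the author's pipeline: the function name `dialect_report`, the `keep` rule, and the word limit are assumptions for illustration. The marker list is taken from this card, and the avoid list reuses the words named in the system prompt in the Getting Started example.

```python
# Hypothetical lexical filter illustrating positive/negative dialect constraints.
# The two word lists come from the model card; the scoring logic is assumed.

BAHRAINI_MARKERS = ["وايد", "جذي", "هني", "شلون", "عقبها", "بعدها", "ما ضبط"]
AVOID_WORDS = ["تمام", "أرجو", "عادة"]


def dialect_report(text: str, max_words: int = 60) -> dict:
    """Report which markers and avoided words appear, and whether the text is short.

    Matching is plain substring search, so it is crude: a listed word that
    happens to sit inside a longer word will also count as a hit.
    """
    found_markers = [m for m in BAHRAINI_MARKERS if m in text]
    found_avoided = [w for w in AVOID_WORDS if w in text]
    return {
        "bahraini_markers": found_markers,
        "avoided_words": found_avoided,
        "is_short": len(text.split()) <= max_words,
        # Assumed rule: keep a sample only if it shows at least one dialect
        # marker and none of the avoided words.
        "keep": bool(found_markers) and not found_avoided,
    }
```

A filter like this only catches surface vocabulary; it says nothing about grammar or tone, so it would complement rather than replace manual review.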
+### Prompt Format
+Data was formatted using ALLaM’s chat template, with EOS enforced at the end of each sample:
+* system: dialect/style constraints
+* user: prompt
+* assistant: target response
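The card does not show how samples were assembled or how EOS was enforced, so the following is only a minimal sketch of the idea. The helper names `build_messages` and `ensure_eos` are hypothetical, and the rendered text and EOS string are passed in as plain strings rather than read from a tokenizer.

```python
def build_messages(system: str, user: str, assistant: str) -> list:
    """Arrange one training sample in the system/user/assistant order described above."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]


def ensure_eos(rendered: str, eos_token: str) -> str:
    """Append eos_token unless the rendered sample already ends with it.

    Trailing whitespace is stripped first so the EOS marker is the last thing
    in the sample.
    """
    stripped = rendered.rstrip()
    if stripped.endswith(eos_token):
        return stripped
    return stripped + eos_token
```

With a real tokenizer, `rendered` would typically come from `tokenizer.apply_chat_template(messages, tokenize=False)` and `eos_token` from `tokenizer.eos_token`.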
+### Training Procedure
+* **Method:** SFT with TRL `SFTTrainer`
+* **Parameter-efficient fine-tuning:** LoRA via PEFT
+* **Final artifact:** LoRA adapter was merged into the base model (`merge_and_unload`) and saved as a standalone model for standard loading.
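The merge step can be sketched as below. This is an illustration under assumptions, not the author's script: the function name, the fp16 load, and the directory arguments are hypothetical, while `PeftModel.from_pretrained`, `merge_and_unload`, and `save_pretrained` are the standard PEFT/`transformers` calls.

```python
def merge_lora_adapter(base_id: str, adapter_dir: str, output_dir: str) -> None:
    """Merge a LoRA adapter into its base model and save a standalone checkpoint."""
    # Imported lazily so this file can be imported without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model in half precision (an assumption; pick what fits your hardware).
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
    # Attach the trained adapter, then fold the LoRA deltas into the base weights.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()
    # Save weights and tokenizer together so the folder loads without PEFT.
    merged.save_pretrained(output_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(output_dir)


# Example call (both paths are placeholders):
# merge_lora_adapter("humain-ai/ALLaM-7B-Instruct-preview", "path/to/adapter", "path/to/merged")
```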
+### Training Hyperparameters (exact run)
+Base configuration used during the run:
+* **Max sequence length:** 2048
+* **Optimizer:** `adamw_torch`
+* **LR:** 2e-5
+* **Scheduler:** cosine
+* **Warmup:** 0.1 of optimizer steps (computed as `warmup_steps`)
+* **Weight decay:** 0.01
+* **Max grad norm:** 1.0
+* **Batching:** `per_device_train_batch_size=4`, `gradient_accumulation_steps=16`
+* **Epochs:** 4
+* **Packing:** False
+* **Seed:** 42
+* **Precision:** fp16 on T4; bf16 on Ampere+
+* **Attention impl:** eager
+* **Gradient checkpointing:** enabled (`use_reentrant=False`)
+* **LoRA:**
+  * r=16
+  * alpha=32
+  * dropout=0.05
+  * target modules: `q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
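To make the batching and warmup figures concrete, here is a small arithmetic sketch. The batch size, accumulation steps, epochs, and warmup fraction are from the list above; the sample count passed in, the single-GPU multiplier, and the rounding choices are assumptions, and the trainer's own step count may round differently.

```python
import math

PER_DEVICE_BATCH = 4     # from the card
GRAD_ACCUM_STEPS = 16    # from the card
EPOCHS = 4               # from the card
WARMUP_FRACTION = 0.1    # from the card
NUM_GPUS = 1             # assumed from the single-GPU note in the card


def schedule(num_samples: int) -> dict:
    """Approximate optimizer-step counts for a dataset of num_samples examples."""
    # Samples consumed per optimizer step.
    effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
    # Rounding up is an assumption about how a partial final batch is handled.
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * EPOCHS
    # Truncating to an integer is also an assumption.
    warmup_steps = int(WARMUP_FRACTION * total_steps)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }
```

For a hypothetical dataset of 6,400 samples, `schedule(6400)` gives an effective batch of 64, 400 optimizer steps in total, and 40 warmup steps.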
+### Notes on Tokenizer / Special Tokens
+The run aligned the model config with the tokenizer’s special tokens (pad/bos/eos) where they differed. At inference, generation typically sets `pad_token_id = eos_token_id` and passes an explicit attention mask, which avoids warnings and unstable output when the pad and EOS tokens are the same.
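The exact alignment logic is not shown in this card. One common way to do it, sketched here with a hypothetical helper name, is to fall back to EOS when the tokenizer has no pad token and then copy the three ids onto the model config.

```python
def align_special_tokens(tokenizer, model) -> None:
    """Copy pad/bos/eos ids from the tokenizer onto the model config.

    Works with any objects exposing the usual Hugging Face attribute names
    (pad_token, pad_token_id, eos_token, eos_token_id, bos_token_id, config).
    """
    if tokenizer.pad_token_id is None:
        # Assumed fallback: reuse EOS as the padding token.
        tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id
    model.config.bos_token_id = tokenizer.bos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
```

When pad and EOS share an id, the attention mask is the only way for the model to tell padding from a real end-of-sequence token, which is why passing the mask explicitly matters.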
+## Evaluation
+Evaluation was primarily qualitative via prompt suites comparing:
+* base model outputs vs fine-tuned outputs
+* dialect strength, conciseness, task completion, and reduction of MSA drift
+Example prompt suite included:
+* smalltalk
+* sleep routine advice (short)
+* WhatsApp apology message
+* semi-formal request to university
+* home internet troubleshooting
+* APN setup guidance
+* online card rejection reasons
+* electricity bill troubleshooting
+* late order customer-service ticket phrasing
+* clarification questions behavior
+* dialect rewriting (“ما أقدر الحين بس برجع لك بعدين”, roughly “I can’t right now, but I’ll get back to you later”)
+* mixed Arabic/English phrasing (refund/invoice)
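A side-by-side comparison of this kind can be organized with a small harness like the one below. It is a sketch only: `compare_models` and its arguments are hypothetical, and the two generation functions are stand-ins for whatever wraps the base and fine-tuned models.

```python
from typing import Callable, Dict, List


def compare_models(
    prompts: List[str],
    base_generate: Callable[[str], str],
    tuned_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Run every prompt through both models and collect the outputs in pairs."""
    rows = []
    for prompt in prompts:
        rows.append(
            {
                "prompt": prompt,
                "base": base_generate(prompt),
                "tuned": tuned_generate(prompt),
            }
        )
    return rows
```

The resulting rows can be dumped to a spreadsheet or JSON file and reviewed by hand for dialect strength, conciseness, and MSA drift.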
+## Compute / Infrastructure
+* **Training stack:** `transformers`, `trl`, `peft`
+* **Hardware:** single GPU (T4-class during development), fp16 used on T4
+* **Framework versions:** PEFT 0.18.1 (per metadata)
+## Citation
+### Model
+If you cite this model or work derived from it, cite the dataset below and reference the base model (`humain-ai/ALLaM-7B-Instruct-preview`).
+### Dataset (provided by author)
+```bibtex
+@dataset{barakat_bahraini_speech_2026,
+  author       = {Hisham Barakat},
+  title        = {Hishambarakat/Bahraini_Speech_Dataset},
+  year         = {2026},
+  publisher    = {Hugging Face},
+  url          = {https://huggingface.co/datasets/Hishambarakat/Bahraini_Speech_Dataset},
+  note         = {LinkedIn: https://www.linkedin.com/in/hishambarakat/}
+}
+```
+## Contact
+* **Author:** Hisham Barakat
+* **LinkedIn:** [https://www.linkedin.com/in/hishambarakat/](https://www.linkedin.com/in/hishambarakat/)