---
library_name: transformers
datasets:
- HeshamHaroon/saudi-dialect-conversations
base_model:
- LiquidAI/LFM2.5-1.2B-Instruct
---

# Saudi Dialect LFM2.5 — Instruction-Tuned Arabic Dialect Model

## Model Description

This model is a fine-tuned version of **Liquid AI**'s **LFM2.5-1.2B-Instruct**, adapted for Saudi dialect conversational generation.

The base model belongs to the LFM2.5 family — hybrid state-space + attention language models designed for **fast on-device inference**, low memory usage, and strong performance relative to size. It has ~1.17B parameters, a 32,768-token context length, and supports multilingual generation, including Arabic.

This fine-tuned variant specializes the model for **Saudi dialect conversational patterns**, improving fluency, dialect authenticity, and instruction following for regional Arabic use cases.

---

## Intended Use

### Primary Use Cases

* Saudi dialect chatbots
* Customer support assistants
* Conversational agents
* Arabic NLP research
* Dialect-aware RAG pipelines
* Dialogue generation systems

### Out-of-Scope Uses

* Legal/medical advice
* Safety-critical decision making
* High-precision knowledge tasks without retrieval
* Sensitive content generation

---

## Training Details

### Base Model

* Architecture: Hybrid state-space + attention
* Parameters: ~1.17B
* Context length: 32,768 tokens
* Training tokens: ~28T
* Languages: Multilingual, including Arabic

---

### Dataset

Fine-tuned on:

* **Dataset:** `HeshamHaroon/saudi-dialect-conversations`
* **Domain:** Conversational dialogue
* **Language:** Saudi dialect Arabic
* **Format:** Instruction → Response pairs
* **Purpose:** Increase dialect authenticity and conversational naturalness.
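The instruction → response pairs above are typically mapped into chat-style messages before supervised fine-tuning. A minimal sketch, assuming the dataset exposes `instruction` and `response` fields (the field names are assumptions, not confirmed by the dataset card):

```python
# Hypothetical sketch: convert an instruction -> response pair into the
# chat-message format commonly used for SFT. Field names "instruction"
# and "response" are assumptions about this dataset.
def to_chat(example: dict) -> dict:
    return {
        "messages": [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }

row = {
    "instruction": "تكلم باللهجة السعودية عن القهوة",  # "Talk about coffee in the Saudi dialect"
    "response": "القهوة عندنا مو بس مشروب، هي عادة وكرم.",  # illustrative reply
}
print(to_chat(row)["messages"][1]["role"])  # assistant
```

In recent TRL versions, `SFTTrainer` can consume a dataset in this `messages` format directly and apply the model's chat template during tokenization.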
---

### Training Configuration

(Extracted from the training notebook)

| Parameter             | Value                        |
| --------------------- | ---------------------------- |
| Epochs                | 4                            |
| Learning Rate         | 2e-4                         |
| Batch Size            | 16                           |
| Gradient Accumulation | 4                            |
| Optimizer             | AdamW                        |
| LR Scheduler          | Linear                       |
| Warmup Ratio          | 0.03                         |
| Sequence Length       | 8096                         |
| Precision             | FP16                         |
| Training Type         | Supervised Fine-Tuning (SFT) |

---

### Training Procedure

Training was performed using:

* Transformers
* TRL `SFTTrainer`
* LoRA fine-tuning
* Mixed precision
* Gradient accumulation

The base model weights were adapted rather than retrained from scratch.

---

## Evaluation

Qualitative evaluation indicates:

* Improved dialect fluency
* Reduced MSA leakage
* Better conversational tone
* Higher lexical authenticity

Dialect-specific fine-tuning is known to significantly increase dialect generation accuracy and reduce drift toward Modern Standard Arabic in Arabic LLMs.

---

## Performance Characteristics

**Strengths**

* Very fast inference
* Low memory footprint
* Strong conversational coherence
* Good instruction following

**Limitations**

* Smaller model → limited factual depth
* May hallucinate
* Less capable at complex reasoning than larger models
* Dialect bias toward Saudi Arabic

---

## Bias, Risks, and Safety

Potential risks:

* Dialect bias
* Cultural bias from the dataset
* Toxic outputs if prompted maliciously
* Hallucinated facts

Mitigations:

* Dataset filtering
* Instruction alignment
* Moderation layers recommended

---

## Hardware Requirements

Runs efficiently on:

* CPU inference (<1 GB memory when quantized)
* Mobile NPUs
* Edge devices

---

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "AyoubChLin/lfm2.5-saudi-dialect"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "تكلم باللهجة السعودية عن القهوة"  # "Talk about coffee in the Saudi dialect"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
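# Greedy decoding (above) can sound flat for conversational dialect text;
# enabling sampling often gives more varied output. The values below are
# illustrative defaults, not taken from the training notebook:
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)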
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Training Compute

* **GPU:** 1 × NVIDIA A100 (40 GB VRAM)
* **CPU:** 8 cores
* **RAM:** 16 GiB
* **Compute Environment:** Cloud training instance

---

## License

Same as the base model's license unless otherwise specified.

---

## Citation

If you use this model:

```bibtex
@misc{saudi-dialect-lfm2.5,
  author    = {Cherguelaine Ayoub},
  title     = {Saudi Dialect LFM2.5},
  year      = {2026},
  publisher = {Hugging Face}
}
```

---

## Acknowledgments

* Liquid AI for the base model
* Dataset creators
* The open-source tooling ecosystem

---