AyoubChLin's picture
Update README.md
3d628c3 verified
---
library_name: transformers
datasets:
- HeshamHaroon/saudi-dialect-conversations
base_model:
- LiquidAI/LFM2.5-1.2B-Instruct
---
# Saudi Dialect LFM2.5 — Instruction-Tuned Arabic Dialect Model
## Model Description
This model is a fine-tuned version of **Liquid AI**’s **LFM2.5‑1.2B‑Instruct**, adapted for Saudi dialect conversational generation.
The base model belongs to the LFM2.5 family — hybrid state-space + attention language models designed for **fast on-device inference**,low memory usage, and strong performance relative to size. It contains ~1.17B parameters, 32k context length, and supports multilingual generation including Arabic.
This fine-tuned variant specializes the model for **Saudi dialect conversational patterns**, improving fluency, dialect authenticity, and instruction following for regional Arabic use cases.
---
## Intended Use
### Primary Use Cases
* Saudi dialect chatbots
* Customer support assistants
* Conversational agents
* Arabic NLP research
* Dialect-aware RAG pipelines
* Dialogue generation systems
### Out-of-Scope Uses
* Legal/medical advice
* Safety-critical decision making
* High-precision knowledge tasks without retrieval
* Sensitive content generation
---
## Training Details
### Base Model
* Architecture: Hybrid state-space + attention
* Parameters: ~1.17B
* Context length: 32,768 tokens
* Training tokens: ~28T
* Languages: Multilingual including Arabic
---
### Dataset
Fine-tuned on:
**Dataset:**
`HeshamHaroon/saudi-dialect-conversations`
**Domain:**
Conversational dialogue
**Language:**
Saudi dialect Arabic
**Format:**
Instruction → Response pairs
**Purpose:**
Increase dialect authenticity and conversational naturalness.
---
### Training Configuration
(Extracted from training notebook)
| Parameter | Value |
| --------------------- | ---------------------------- |
| Epochs | 4 |
| Learning Rate | 2e-4 |
| Batch Size | 16 |
| Gradient Accumulation | 4 |
| Optimizer | AdamW |
| LR Scheduler | Linear |
| Warmup Ratio | 0.03 |
| Sequence Length | 8096 |
| Precision | FP16 |
| Training Type | Supervised Fine-Tuning (SFT) |
---
### Training Procedure
Training was performed using:
* Transformers
* TRL SFTTrainer
* LoRA fine-tuning
* Mixed precision
* Gradient accumulation
The base model weights were adapted rather than retrained from scratch.
---
## Evaluation
Qualitative evaluation indicates:
* Improved dialect fluency
* Reduced MSA leakage
* Better conversational tone
* Higher lexical authenticity
Dialect-specific fine-tuning is known to significantly increase dialect generation accuracy and reduce standard-Arabic drift in Arabic LLMs.
---
## Performance Characteristics
**Strengths**
* Very fast inference
* Low memory footprint
* Strong conversational coherence
* Good instruction following
**Limitations**
* Smaller model → limited factual depth
* May hallucinate
* Less capable for complex reasoning vs larger models
* Dialect bias toward Saudi Arabic
---
## Bias, Risks, and Safety
Potential risks:
* Dialect bias
* Cultural bias from dataset
* Toxic outputs if prompted maliciously
* Hallucinated facts
Mitigations:
* Filtering dataset
* Instruction alignment
* Moderation layers recommended
---
## Hardware Requirements
Runs efficiently on:
* CPU inference (<1GB memory quantized)
* Mobile NPUs
* Edge devices
---
## Example Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "AyoubChLin/lfm2.5-saudi-dialect"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
prompt = "تكلم باللهجة السعودية عن القهوة"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## Training Compute
* **GPU:** 1 × NVIDIA A100 (40 GB VRAM)
* **CPU:** 8 cores
* **RAM:** 16 GiB
* **Compute Environment:** Cloud training instance
---
## License
Same as base model license unless otherwise specified.
---
## Citation
If you use this model:
```
@misc{saudi-dialect-lfm2.5,
author = {Cherguelaine Ayoub},
title = {Saudi Dialect LFM2.5},
year = {2026},
publisher = {Hugging Face}
}
```
---
## Acknowledgments
* Liquid AI for base model
* Dataset creators
* Open-source tooling ecosystem
---