Saudi Dialect LFM2.5 — Instruction-Tuned Arabic Dialect Model
Model Description
This model is a fine-tuned version of Liquid AI’s LFM2.5‑1.2B‑Instruct, adapted for Saudi dialect conversational generation.
The base model belongs to the LFM2.5 family of hybrid state-space + attention language models designed for fast on-device inference, low memory usage, and strong performance relative to size. It has ~1.17B parameters, a 32,768-token context length, and supports multilingual generation, including Arabic.
This fine-tuned variant specializes the model for Saudi dialect conversational patterns, improving fluency, dialect authenticity, and instruction following for regional Arabic use cases.
Intended Use
Primary Use Cases
- Saudi dialect chatbots
- Customer support assistants
- Conversational agents
- Arabic NLP research
- Dialect-aware RAG pipelines
- Dialogue generation systems
Out-of-Scope Uses
- Legal/medical advice
- Safety-critical decision making
- High-precision knowledge tasks without retrieval
- Sensitive content generation
Training Details
Base Model
- Architecture: Hybrid state-space + attention
- Parameters: ~1.17B
- Context length: 32,768 tokens
- Training tokens: ~28T
- Languages: Multilingual including Arabic
Dataset
Fine-tuned on:
- Dataset: HeshamHaroon/saudi-dialect-conversations
- Domain: Conversational dialogue
- Language: Saudi dialect Arabic
- Format: Instruction → Response pairs
- Purpose: Increase dialect authenticity and conversational naturalness
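As a rough illustration of the instruction → response format, a pair from the dataset can be mapped into chat-style messages before SFT. This is a minimal sketch: the field names `instruction` and `response` are assumptions and may differ from the actual column names in HeshamHaroon/saudi-dialect-conversations.

```python
def to_chat_messages(example: dict) -> dict:
    """Convert one instruction/response pair into chat-format messages.

    Field names "instruction" and "response" are assumed; check the
    actual dataset schema before using this in a training pipeline.
    """
    return {
        "messages": [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }

# Illustrative pair: "Greet me in the Saudi dialect" → a dialect greeting.
pair = {"instruction": "حيّني باللهجة السعودية", "response": "هلا والله، كيفك؟"}
print(to_chat_messages(pair)["messages"][0]["role"])  # user
```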
Training Configuration
(Extracted from training notebook)
| Parameter | Value |
|---|---|
| Epochs | 4 |
| Learning Rate | 2e-4 |
| Batch Size | 16 |
| Gradient Accumulation | 4 |
| Optimizer | AdamW |
| LR Scheduler | Linear |
| Warmup Ratio | 0.03 |
| Sequence Length | 8096 |
| Precision | FP16 |
| Training Type | Supervised Fine-Tuning (SFT) |
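The table above maps roughly onto a Hugging Face `TrainingArguments` configuration. This is a hedged config fragment, not the actual notebook: the output directory and exact AdamW variant are assumptions, and the sequence length (8096) is configured on the tokenizer/trainer side rather than here.

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters from the table above.
args = TrainingArguments(
    output_dir="lfm2.5-saudi-dialect-sft",  # assumed path
    num_train_epochs=4,
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    optim="adamw_torch",           # exact AdamW variant is an assumption
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    fp16=True,
)
```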
Training Procedure
Training was performed using:
- Transformers
- TRL SFTTrainer
- LoRA fine-tuning
- Mixed precision
- Gradient accumulation
The base model weights were adapted rather than retrained from scratch.
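A LoRA adapter setup of the kind described above might look like the following PEFT config fragment. The rank, alpha, dropout, and target modules are assumptions chosen as typical SFT defaults; the training notebook's actual values are not published here, and target module names may differ for LFM2.5's hybrid architecture.

```python
from peft import LoraConfig

# Hedged sketch of a LoRA adapter configuration for SFT.
lora_config = LoraConfig(
    r=16,                      # assumed rank
    lora_alpha=32,             # assumed scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
```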
Evaluation
Qualitative evaluation indicates:
- Improved dialect fluency
- Reduced MSA leakage
- Better conversational tone
- Higher lexical authenticity
Dialect-specific fine-tuning generally improves dialect generation accuracy and reduces drift toward Modern Standard Arabic (MSA) in Arabic LLMs.
Performance Characteristics
Strengths
- Very fast inference
- Low memory footprint
- Strong conversational coherence
- Good instruction following
Limitations
- Smaller model → limited factual depth
- May hallucinate
- Less capable for complex reasoning vs larger models
- Dialect bias toward Saudi Arabic
Bias, Risks, and Safety
Potential risks:
- Dialect bias
- Cultural bias from dataset
- Toxic outputs if prompted maliciously
- Hallucinated facts
Mitigations:
- Filtering dataset
- Instruction alignment
- Moderation layers recommended
Hardware Requirements
Runs efficiently on:
- CPU inference (<1GB memory quantized)
- Mobile NPUs
- Edge devices
Example Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "AyoubChLin/lfm2.5-saudi-dialect"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Prompt: "Talk about coffee in the Saudi dialect"
messages = [{"role": "user", "content": "تكلم باللهجة السعودية عن القهوة"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Compute
- GPU: 1 × NVIDIA A100 (40 GB VRAM)
- CPU: 8 cores
- RAM: 16 GiB
- Compute Environment: Cloud training instance
License
Same as base model license unless otherwise specified.
Citation
If you use this model:
```bibtex
@misc{saudi-dialect-lfm2.5,
  author    = {Cherguelaine Ayoub},
  title     = {Saudi Dialect LFM2.5},
  year      = {2026},
  publisher = {Hugging Face}
}
```
Acknowledgments
- Liquid AI for base model
- Dataset creators
- Open-source tooling ecosystem