|
|
--- |
|
|
library_name: transformers |
|
|
datasets: |
|
|
- HeshamHaroon/saudi-dialect-conversations |
|
|
base_model: |
|
|
- LiquidAI/LFM2.5-1.2B-Instruct |
|
|
--- |
|
|
|
|
|
# Saudi Dialect LFM2.5 — Instruction-Tuned Arabic Dialect Model |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a fine-tuned version of **Liquid AI**’s **LFM2.5‑1.2B‑Instruct**, adapted for Saudi dialect conversational generation. |
|
|
|
|
|
The base model belongs to the LFM2.5 family of hybrid state-space + attention language models designed for **fast on-device inference**, low memory usage, and strong performance relative to size. It has ~1.17B parameters, a 32,768-token context window, and supports multilingual generation, including Arabic.
|
|
|
|
|
This fine-tuned variant specializes the model for **Saudi dialect conversational patterns**, improving fluency, dialect authenticity, and instruction following for regional Arabic use cases. |
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
* Saudi dialect chatbots |
|
|
* Customer support assistants |
|
|
* Conversational agents |
|
|
* Arabic NLP research |
|
|
* Dialect-aware RAG pipelines |
|
|
* Dialogue generation systems |
|
|
|
|
|
### Out-of-Scope Uses |
|
|
|
|
|
* Legal/medical advice |
|
|
* Safety-critical decision making |
|
|
* High-precision knowledge tasks without retrieval |
|
|
* Sensitive content generation |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Base Model |
|
|
|
|
|
* Architecture: Hybrid state-space + attention |
|
|
* Parameters: ~1.17B |
|
|
* Context length: 32,768 tokens |
|
|
* Training tokens: ~28T |
|
|
* Languages: Multilingual including Arabic |
|
|
|
|
|
--- |
|
|
|
|
|
### Dataset |
|
|
|
|
|
Fine-tuned on: |
|
|
|
|
|
**Dataset:** |
|
|
`HeshamHaroon/saudi-dialect-conversations` |
|
|
|
|
|
**Domain:** |
|
|
Conversational dialogue |
|
|
|
|
|
**Language:** |
|
|
Saudi dialect Arabic |
|
|
|
|
|
**Format:** |
|
|
Instruction → Response pairs |
|
|
|
|
|
**Purpose:** |
|
|
Increase dialect authenticity and conversational naturalness. |
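For SFT, instruction → response pairs are typically rendered into a single training string. The sketch below is illustrative only: the field names (`instruction`, `response`) and the prompt template are assumptions, not the dataset's documented schema; check the dataset card for the real format.

```python
# Hypothetical example: render one instruction/response pair into SFT training text.
# Field names and template are assumptions, not the dataset's documented schema.

def format_pair(example: dict) -> str:
    """Join an instruction and its response into one training string."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )

pair = {
    "instruction": "تكلم باللهجة السعودية عن القهوة",  # "Talk about coffee in Saudi dialect"
    "response": "القهوة العربية عندنا رمز للكرم والضيافة.",  # "Arabic coffee is a symbol of generosity and hospitality."
}
text = format_pair(pair)
print(text)
```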
|
|
|
|
|
--- |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
(Hyperparameters extracted from the training notebook.)
|
|
|
|
|
| Parameter | Value | |
|
|
| --------------------- | ---------------------------- | |
|
|
| Epochs | 4 | |
|
|
| Learning Rate | 2e-4 | |
|
|
| Batch Size | 16 | |
|
|
| Gradient Accumulation | 4 | |
|
|
| Optimizer | AdamW | |
|
|
| LR Scheduler | Linear | |
|
|
| Warmup Ratio | 0.03 | |
|
|
| Sequence Length | 8096 | |
|
|
| Precision | FP16 | |
|
|
| Training Type | Supervised Fine-Tuning (SFT) | |
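From the table above, the effective batch size per optimizer update is the per-step batch size times the gradient-accumulation steps; a quick sanity check:

```python
# Derive the effective batch size from the configuration table above.
batch_size = 16        # per-step batch size
grad_accum_steps = 4   # gradient accumulation steps

effective_batch = batch_size * grad_accum_steps
print(effective_batch)  # 64 sequences per optimizer update
```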
|
|
|
|
|
--- |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
Training was performed using: |
|
|
|
|
|
* Transformers |
|
|
* TRL SFTTrainer |
|
|
* LoRA fine-tuning |
|
|
* Mixed precision |
|
|
* Gradient accumulation |
|
|
|
|
|
The base model weights were adapted rather than retrained from scratch. |
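LoRA adapts the frozen base weights by learning a low-rank additive update rather than retraining the full matrices. A minimal numeric sketch of the idea follows; the rank and scaling values here are illustrative, not the values used in this training run:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 8, 2, 16            # hidden size, LoRA rank, scaling (illustrative values)
W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # trainable low-rank factor
B = np.zeros((d, r))              # zero-initialized so the update starts as a no-op

delta = (alpha / r) * (B @ A)     # low-rank update added to the frozen weight
W_adapted = W + delta

# Only A and B are trained: 2*d*r parameters instead of d*d.
trainable = A.size + B.size
print(trainable, W.size)          # 32 vs. 64
```

Because `B` starts at zero, `W_adapted` initially equals `W`, so fine-tuning begins from exactly the base model's behavior.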
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
|
|
|
|
|
|
Qualitative evaluation indicates: |
|
|
|
|
|
* Improved dialect fluency |
|
|
* Reduced MSA leakage |
|
|
* Better conversational tone |
|
|
* Higher lexical authenticity |
|
|
|
|
|
Dialect-specific fine-tuning is known to substantially improve dialect generation accuracy and reduce drift toward Modern Standard Arabic in Arabic LLMs.
|
|
|
|
|
--- |
|
|
|
|
|
## Performance Characteristics |
|
|
|
|
|
**Strengths** |
|
|
|
|
|
* Very fast inference |
|
|
* Low memory footprint |
|
|
* Strong conversational coherence |
|
|
* Good instruction following |
|
|
|
|
|
**Limitations** |
|
|
|
|
|
* Smaller model → limited factual depth |
|
|
* May hallucinate |
|
|
* Less capable for complex reasoning vs larger models |
|
|
* Dialect bias toward Saudi Arabic |
|
|
|
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Safety |
|
|
|
|
|
Potential risks: |
|
|
|
|
|
* Dialect bias |
|
|
* Cultural bias from dataset |
|
|
* Toxic outputs if prompted maliciously |
|
|
* Hallucinated facts |
|
|
|
|
|
Mitigations: |
|
|
|
|
|
* Filtering dataset |
|
|
* Instruction alignment |
|
|
* Moderation layers recommended |
|
|
|
|
|
--- |
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
Runs efficiently on: |
|
|
|
|
|
* CPU inference (<1GB memory quantized) |
|
|
* Mobile NPUs |
|
|
* Edge devices |
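The "<1 GB quantized" figure is consistent with simple arithmetic on the parameter count: at 4-bit precision, ~1.17B parameters need roughly half a byte each for the weights (overhead for activations and the KV cache is ignored here):

```python
# Back-of-the-envelope weight memory for ~1.17B parameters at common precisions.
params = 1.17e9

sizes = {}
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    sizes[name] = params * bytes_per_param / 1e9  # gigabytes
    print(f"{name}: ~{sizes[name]:.2f} GB")
# int4 weights come to ~0.59 GB, under the 1 GB figure quoted above.
```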
|
|
|
|
|
--- |
|
|
|
|
|
## Example Usage |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "AyoubChLin/lfm2.5-saudi-dialect"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Format the request with the model's chat template (this is an instruct model).
messages = [
    {"role": "user", "content": "تكلم باللهجة السعودية عن القهوة"}  # "Talk about coffee in Saudi dialect"
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
--- |
|
|
|
|
|
## Training Compute |
|
|
|
|
|
* **GPU:** 1 × NVIDIA A100 (40 GB VRAM) |
|
|
* **CPU:** 8 cores |
|
|
* **RAM:** 16 GiB |
|
|
* **Compute Environment:** Cloud training instance |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
Same as the base model's license unless otherwise specified.
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model: |
|
|
|
|
|
```bibtex
|
|
@misc{saudi-dialect-lfm2.5, |
|
|
author = {Cherguelaine Ayoub}, |
|
|
title = {Saudi Dialect LFM2.5}, |
|
|
year = {2026}, |
|
|
publisher = {Hugging Face} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
* Liquid AI for base model |
|
|
* Dataset creators |
|
|
* Open-source tooling ecosystem |
|
|
|
|
|
--- |