---
library_name: transformers
datasets:
- HeshamHaroon/saudi-dialect-conversations
base_model:
- LiquidAI/LFM2.5-1.2B-Instruct
---

# Saudi Dialect LFM2.5 — Instruction-Tuned Arabic Dialect Model

## Model Description

This model is a fine-tuned version of **Liquid AI**’s **LFM2.5-1.2B-Instruct**, adapted for Saudi-dialect conversational generation.

The base model belongs to the LFM2.5 family of hybrid state-space + attention language models, designed for **fast on-device inference**, low memory usage, and strong performance relative to size. It has ~1.17B parameters, a 32,768-token context length, and supports multilingual generation, including Arabic.

This fine-tuned variant specializes the model for **Saudi-dialect conversational patterns**, improving fluency, dialect authenticity, and instruction following for regional Arabic use cases.

---

## Intended Use

### Primary Use Cases

* Saudi dialect chatbots
* Customer support assistants
* Conversational agents
* Arabic NLP research
* Dialect-aware RAG pipelines
* Dialogue generation systems

### Out-of-Scope Uses

* Legal or medical advice
* Safety-critical decision making
* High-precision knowledge tasks without retrieval
* Sensitive content generation

---

## Training Details

### Base Model

* Architecture: Hybrid state-space + attention
* Parameters: ~1.17B
* Context length: 32,768 tokens
* Training tokens: ~28T
* Languages: Multilingual, including Arabic

---

### Dataset

Fine-tuned on `HeshamHaroon/saudi-dialect-conversations`.

* **Domain:** Conversational dialogue
* **Language:** Saudi dialect Arabic
* **Format:** Instruction → Response pairs
* **Purpose:** Increase dialect authenticity and conversational naturalness

---

### Training Configuration

(Extracted from the training notebook)

| Parameter             | Value                        |
| --------------------- | ---------------------------- |
| Epochs                | 4                            |
| Learning Rate         | 2e-4                         |
| Batch Size            | 16                           |
| Gradient Accumulation | 4                            |
| Optimizer             | AdamW                        |
| LR Scheduler          | Linear                       |
| Warmup Ratio          | 0.03                         |
| Sequence Length       | 8096                         |
| Precision             | FP16                         |
| Training Type         | Supervised Fine-Tuning (SFT) |
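
For reference, the per-device batch size of 16 combined with 4 gradient-accumulation steps gives an effective batch of 64 sequences per optimizer step (assuming the single-GPU setup listed under Training Compute):

```python
# Effective batch size per optimizer step.
per_device_batch = 16   # "Batch Size" in the table above
grad_accum_steps = 4    # "Gradient Accumulation" in the table above
num_devices = 1         # 1x A100, per the Training Compute section

effective_batch = per_device_batch * grad_accum_steps * num_devices
print(effective_batch)  # 64
```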

---

### Training Procedure

Training was performed using:

* Transformers
* TRL's `SFTTrainer`
* LoRA fine-tuning
* Mixed precision
* Gradient accumulation

The base model weights were adapted with low-rank adapters rather than retrained from scratch.
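
The LoRA idea can be sketched in a few lines: a frozen base weight matrix `W` is augmented with a trainable low-rank product `(alpha / r) * B @ A`, which drastically cuts the number of trained parameters. A minimal NumPy illustration — the dimensions, rank, and scaling below are illustrative, not the run's actual hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 64, 8   # illustrative dims; rank r is much smaller than d
alpha = 16                    # LoRA scaling numerator (illustrative)

W = rng.standard_normal((d_out, d_in))     # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

# Merged weight after training: W' = W + (alpha / r) * B @ A
W_merged = W + (alpha / r) * (B @ A)

# Parameter comparison: full fine-tuning vs. LoRA adapters
full_params = W.size           # 4096
lora_params = A.size + B.size  # 1024
print(full_params, lora_params)
```

Because `B` starts at zero, the merged weight equals the base weight before any training, so the adapter only gradually shifts the model's behavior.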

---

## Evaluation

Qualitative evaluation indicates:

* Improved dialect fluency
* Reduced MSA (Modern Standard Arabic) leakage
* Better conversational tone
* Higher lexical authenticity

Dialect-specific fine-tuning is known to substantially improve dialect-generation accuracy and reduce drift toward Modern Standard Arabic in Arabic LLMs.

---

## Performance Characteristics

**Strengths**

* Very fast inference
* Low memory footprint
* Strong conversational coherence
* Good instruction following

**Limitations**

* Smaller model, so limited factual depth
* May hallucinate
* Less capable at complex reasoning than larger models
* Dialect bias toward Saudi Arabic

---

## Bias, Risks, and Safety

Potential risks:

* Dialect bias
* Cultural bias inherited from the dataset
* Toxic outputs if prompted maliciously
* Hallucinated facts

Mitigations:

* Dataset filtering
* Instruction alignment
* Moderation layers (recommended for deployment)

---

## Hardware Requirements

Runs efficiently on:

* CPU inference (<1 GB memory when quantized)
* Mobile NPUs
* Edge devices
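
The sub-1 GB figure follows from the parameter count: weights-only memory is roughly parameters × bits-per-parameter / 8. A back-of-the-envelope sketch (KV cache and activations add overhead on top):

```python
PARAMS = 1.17e9  # approximate parameter count of the model

def weight_memory_gb(bits_per_param: int) -> float:
    """Weights-only memory in GB; KV cache and activations are extra."""
    return PARAMS * bits_per_param / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(bits):.2f} GB")
# FP16: 2.34 GB, INT8: 1.17 GB, INT4: 0.59 GB; 4-bit fits under 1 GB
```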

---

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "AyoubChLin/lfm2.5-saudi-dialect"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# "Talk about coffee in the Saudi dialect"
prompt = "تكلم باللهجة السعودية عن القهوة"

# Apply the chat template, since this is an instruction-tuned model
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Training Compute

* **GPU:** 1 × NVIDIA A100 (40 GB VRAM)
* **CPU:** 8 cores
* **RAM:** 16 GiB
* **Compute Environment:** Cloud training instance

---

## License

Same as the base model's license unless otherwise specified.

---

## Citation

If you use this model:

```bibtex
@misc{saudi-dialect-lfm2.5,
  author    = {Cherguelaine Ayoub},
  title     = {Saudi Dialect LFM2.5},
  year      = {2026},
  publisher = {Hugging Face}
}
```

---

## Acknowledgments

* Liquid AI for the base model
* The dataset creators
* The open-source tooling ecosystem

---