---
library_name: transformers
datasets:
- HeshamHaroon/saudi-dialect-conversations
base_model:
- LiquidAI/LFM2.5-1.2B-Instruct
---

# Saudi Dialect LFM2.5 — Instruction-Tuned Arabic Dialect Model

## Model Description

This model is a fine-tuned version of **Liquid AI**'s **LFM2.5-1.2B-Instruct**, adapted for Saudi dialect conversational generation.

The base model belongs to the LFM2.5 family — hybrid state-space + attention language models designed for **fast on-device inference**, low memory usage, and strong performance relative to size. It has ~1.17B parameters, a 32,768-token context length, and supports multilingual generation, including Arabic.

This fine-tuned variant specializes the model for **Saudi dialect conversational patterns**, improving fluency, dialect authenticity, and instruction following for regional Arabic use cases.

---

## Intended Use

### Primary Use Cases

* Saudi dialect chatbots
* Customer support assistants
* Conversational agents
* Arabic NLP research
* Dialect-aware RAG pipelines
* Dialogue generation systems

### Out-of-Scope Uses

* Legal/medical advice
* Safety-critical decision making
* High-precision knowledge tasks without retrieval
* Sensitive content generation

---

## Training Details

### Base Model

* Architecture: Hybrid state-space + attention
* Parameters: ~1.17B
* Context length: 32,768 tokens
* Training tokens: ~28T
* Languages: Multilingual, including Arabic

---

### Dataset

Fine-tuned on:

* **Dataset:** `HeshamHaroon/saudi-dialect-conversations`
* **Domain:** Conversational dialogue
* **Language:** Saudi dialect Arabic
* **Format:** Instruction → Response pairs
* **Purpose:** Increase dialect authenticity and conversational naturalness.
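The instruction → response pairs above are typically mapped into chat-style messages before supervised fine-tuning. A minimal sketch, assuming the dataset exposes `instruction` and `response` fields (the field names are assumptions, not confirmed by the dataset card):

```python
# Hypothetical sketch: convert an instruction -> response pair into the
# chat-message format commonly used for SFT. Field names "instruction"
# and "response" are assumptions about this dataset.
def to_chat(example: dict) -> dict:
    return {
        "messages": [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }

row = {
    "instruction": "تكلم باللهجة السعودية عن القهوة",  # "Talk about coffee in the Saudi dialect"
    "response": "القهوة عندنا مو بس مشروب، هي عادة وكرم.",  # illustrative reply
}
print(to_chat(row)["messages"][1]["role"])  # assistant
```

In recent TRL versions, `SFTTrainer` can consume a dataset in this `messages` format directly and apply the model's chat template during tokenization.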
---

### Training Configuration

(Extracted from the training notebook)

| Parameter             | Value                        |
| --------------------- | ---------------------------- |
| Epochs                | 4                            |
| Learning Rate         | 2e-4                         |
| Batch Size            | 16                           |
| Gradient Accumulation | 4                            |
| Optimizer             | AdamW                        |
| LR Scheduler          | Linear                       |
| Warmup Ratio          | 0.03                         |
| Sequence Length       | 8096                         |
| Precision             | FP16                         |
| Training Type         | Supervised Fine-Tuning (SFT) |

---

### Training Procedure

Training was performed using:

* Transformers
* TRL `SFTTrainer`
* LoRA fine-tuning
* Mixed precision
* Gradient accumulation

The base model weights were adapted rather than retrained from scratch.

---

## Evaluation

Qualitative evaluation indicates:

* Improved dialect fluency
* Reduced MSA leakage
* Better conversational tone
* Higher lexical authenticity

Dialect-specific fine-tuning is known to significantly increase dialect generation accuracy and reduce drift toward Modern Standard Arabic in Arabic LLMs.

---

## Performance Characteristics

**Strengths**

* Very fast inference
* Low memory footprint
* Strong conversational coherence
* Good instruction following

**Limitations**

* Smaller model → limited factual depth
* May hallucinate
* Less capable at complex reasoning than larger models
* Dialect bias toward Saudi Arabic

---

## Bias, Risks, and Safety

Potential risks:

* Dialect bias
* Cultural bias from the dataset
* Toxic outputs if prompted maliciously
* Hallucinated facts

Mitigations:

* Dataset filtering
* Instruction alignment
* Moderation layers recommended

---

## Hardware Requirements

Runs efficiently on:

* CPU inference (<1 GB memory when quantized)
* Mobile NPUs
* Edge devices

---

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "AyoubChLin/lfm2.5-saudi-dialect"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "تكلم باللهجة السعودية عن القهوة"  # "Talk about coffee in the Saudi dialect"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
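# Greedy decoding (above) can sound flat for conversational dialect text;
# enabling sampling often gives more varied output. The values below are
# illustrative defaults, not taken from the training notebook:
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)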
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Training Compute

* **GPU:** 1 × NVIDIA A100 (40 GB VRAM)
* **CPU:** 8 cores
* **RAM:** 16 GiB
* **Compute Environment:** Cloud training instance

---

## License

Same as the base model's license unless otherwise specified.

---

## Citation

If you use this model:

```bibtex
@misc{saudi-dialect-lfm2.5,
  author    = {Cherguelaine Ayoub},
  title     = {Saudi Dialect LFM2.5},
  year      = {2026},
  publisher = {Hugging Face}
}
```

---

## Acknowledgments

* Liquid AI for the base model
* Dataset creators
* The open-source tooling ecosystem

---