Saudi Dialect LFM2.5 — Instruction-Tuned Arabic Dialect Model

Model Description

This model is a fine-tuned version of Liquid AI’s LFM2.5‑1.2B‑Instruct, adapted for Saudi dialect conversational generation.

The base model belongs to the LFM2.5 family — hybrid state-space + attention language models designed for fast on-device inference, low memory usage, and strong performance relative to size. It has ~1.17B parameters, a 32k-token context window, and supports multilingual generation, including Arabic.

This fine-tuned variant specializes the model for Saudi dialect conversational patterns, improving fluency, dialect authenticity, and instruction following for regional Arabic use cases.


Intended Use

Primary Use Cases

  • Saudi dialect chatbots
  • Customer support assistants
  • Conversational agents
  • Arabic NLP research
  • Dialect-aware RAG pipelines
  • Dialogue generation systems

Out-of-Scope Uses

  • Legal/medical advice
  • Safety-critical decision making
  • High-precision knowledge tasks without retrieval
  • Sensitive content generation

Training Details

Base Model

  • Architecture: Hybrid state-space + attention
  • Parameters: ~1.17B
  • Context length: 32,768 tokens
  • Training tokens: ~28T
  • Languages: Multilingual including Arabic

Dataset

Fine-tuned on:

  • Dataset: HeshamHaroon/saudi-dialect-conversations
  • Domain: Conversational dialogue
  • Language: Saudi dialect Arabic
  • Format: Instruction → Response pairs
  • Purpose: Increase dialect authenticity and conversational naturalness
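Instruction → Response pairs are typically serialized into chat-format messages before tokenization. A minimal sketch of that step — note the field names ("instruction", "response") and the helper `to_chat_example` are illustrative assumptions, not the dataset's confirmed schema:

```python
# Convert one instruction -> response pair into a chat-style training
# example. The field names ("instruction", "response") are assumptions
# about the dataset schema, not confirmed by this card.
def to_chat_example(pair: dict) -> list:
    return [
        {"role": "user", "content": pair["instruction"]},
        {"role": "assistant", "content": pair["response"]},
    ]

pair = {
    "instruction": "وش رايك في القهوة العربية؟",  # "What do you think of Arabic coffee?"
    "response": "القهوة العربية مرة حلوة",        # "Arabic coffee is really nice"
}
messages = to_chat_example(pair)
print(messages[0]["role"], "->", messages[1]["role"])
```

A chat template (e.g. the tokenizer's own) would then render such message lists into the model's expected prompt format.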


Training Configuration

(Extracted from training notebook)

  Parameter               Value
  Epochs                  4
  Learning Rate           2e-4
  Batch Size              16
  Gradient Accumulation   4
  Optimizer               AdamW
  LR Scheduler            Linear
  Warmup Ratio            0.03
  Sequence Length         8096
  Precision               FP16
  Training Type           Supervised Fine-Tuning (SFT)
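With a per-device batch size of 16 and gradient accumulation over 4 steps, each optimizer update aggregates an effective batch of 64 sequences. A quick check of that arithmetic:

```python
# Effective batch size under gradient accumulation: weights are updated
# once every `grad_accum` forward/backward passes, so each update
# aggregates batch_size * grad_accum sequences.
batch_size = 16
grad_accum = 4
effective_batch = batch_size * grad_accum
print(effective_batch)  # 64
```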

Training Procedure

Training was performed using:

  • Transformers
  • TRL SFTTrainer
  • LoRA fine-tuning
  • Mixed precision
  • Gradient accumulation

The base model weights were adapted rather than retrained from scratch.
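LoRA keeps the base weight matrix W frozen and adds a trainable low-rank update, y = W·x + (α/r)·B·A·x, so only the small A and B matrices receive gradients. A toy forward pass illustrating the idea (dimensions, values, and scaling here are illustrative; the actual run used standard LoRA tooling, not this code):

```python
# Toy LoRA forward pass: y = W x + (alpha / r) * B (A x).
# W is the frozen base weight; A (r x d_in) and B (d_out x r) are the
# small trainable adapter matrices. All values are illustrative.
def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank adapter path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 identity weight
A = [[1.0, 1.0]]               # rank r=1 adapter, shape (1, 2)
B = [[0.5], [0.5]]             # shape (2, 1)
x = [2.0, 4.0]

y = lora_forward(W, A, B, x, alpha=2.0, r=1)
print(y)  # [8.0, 10.0]
```

With B (or A) initialized to zero, the adapter contributes nothing and the output equals the frozen model's — which is why LoRA training starts from the base model's behavior.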


Evaluation

Qualitative evaluation indicates:

  • Improved dialect fluency
  • Reduced MSA leakage
  • Better conversational tone
  • Higher lexical authenticity

Dialect-specific fine-tuning is known to significantly increase dialect generation accuracy and reduce standard-Arabic drift in Arabic LLMs.
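One lightweight way to quantify MSA leakage is to compare counts of dialect-marker tokens against their MSA counterparts in generated text. A toy sketch — the marker lists below are a tiny illustrative sample, not a validated lexicon, and real evaluation would need proper normalization and a much larger word list:

```python
# Toy lexical check: fraction of marker hits that are dialectal.
# Marker lists are illustrative examples only, not a validated lexicon.
SAUDI_MARKERS = {"وش", "ابغى", "مرة"}   # dialectal "what", "I want", "very"
MSA_MARKERS = {"ماذا", "أريد", "جدا"}    # MSA counterparts

def dialect_ratio(text: str) -> float:
    tokens = text.split()
    dialect = sum(t in SAUDI_MARKERS for t in tokens)
    msa = sum(t in MSA_MARKERS for t in tokens)
    total = dialect + msa
    return dialect / total if total else 0.0

print(dialect_ratio("وش رايك القهوة مرة حلوة"))  # 1.0 -> only dialect markers hit
```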


Performance Characteristics

Strengths

  • Very fast inference
  • Low memory footprint
  • Strong conversational coherence
  • Good instruction following

Limitations

  • Smaller model → limited factual depth
  • May hallucinate
  • Less capable at complex reasoning than larger models
  • Dialect bias toward Saudi Arabic

Bias, Risks, and Safety

Potential risks:

  • Dialect bias
  • Cultural bias from dataset
  • Toxic outputs if prompted maliciously
  • Hallucinated facts

Mitigations:

  • Dataset filtering
  • Instruction alignment
  • Moderation layers recommended

Hardware Requirements

Runs efficiently on:

  • CPU inference (<1 GB memory when quantized)
  • Mobile NPUs
  • Edge devices
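The <1 GB quantized figure is consistent with simple arithmetic: 1.17B parameters at 4 bits each is roughly 0.59 GB of weights, before KV cache and runtime overhead. A quick sanity check:

```python
# Rough weight-memory estimate for a ~1.17B-parameter model at
# different precisions. Ignores KV cache, activations, and runtime
# overhead, so real memory usage is somewhat higher.
params = 1.17e9

def weight_gb(bits_per_param: float) -> float:
    return params * bits_per_param / 8 / 1e9

print(f"FP16 weights: {weight_gb(16):.2f} GB")  # about 2.34 GB
print(f"INT4 weights: {weight_gb(4):.2f} GB")   # well under 1 GB
```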

Example Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "AyoubChLin/lfm2.5-saudi-dialect"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# "Talk about coffee in the Saudi dialect"
prompt = "تكلم باللهجة السعودية عن القهوة"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Compute

  • GPU: 1 × NVIDIA A100 (40 GB VRAM)
  • CPU: 8 cores
  • RAM: 16 GiB
  • Compute Environment: Cloud training instance

License

Same as base model license unless otherwise specified.


Citation

If you use this model:

@misc{saudi-dialect-lfm2.5,
  author = {Cherguelaine Ayoub},
  title = {Saudi Dialect LFM2.5},
  year = {2026},
  publisher = {Hugging Face}
}

Acknowledgments

  • Liquid AI for base model
  • Dataset creators
  • Open-source tooling ecosystem
