---
library_name: transformers
datasets:
- HeshamHaroon/saudi-dialect-conversations
base_model:
- LiquidAI/LFM2.5-1.2B-Instruct
---

# Saudi Dialect LFM2.5 — Instruction-Tuned Arabic Dialect Model

## Model Description

This model is a fine-tuned version of **Liquid AI**’s **LFM2.5-1.2B-Instruct**, adapted for Saudi-dialect conversational generation.

The base model belongs to the LFM2.5 family of hybrid state-space + attention language models, designed for **fast on-device inference**, low memory usage, and strong performance relative to size. It has ~1.17B parameters, a 32,768-token context length, and supports multilingual generation, including Arabic.

This fine-tuned variant specializes the model for **Saudi-dialect conversational patterns**, improving fluency, dialect authenticity, and instruction following for regional Arabic use cases.

---

## Intended Use

### Primary Use Cases

* Saudi dialect chatbots
* Customer support assistants
* Conversational agents
* Arabic NLP research
* Dialect-aware RAG pipelines
* Dialogue generation systems

### Out-of-Scope Uses

* Legal or medical advice
* Safety-critical decision making
* High-precision knowledge tasks without retrieval
* Sensitive content generation

---

## Training Details

### Base Model

* Architecture: Hybrid state-space + attention
* Parameters: ~1.17B
* Context length: 32,768 tokens
* Training tokens: ~28T
* Languages: Multilingual, including Arabic

---

### Dataset

Fine-tuned on `HeshamHaroon/saudi-dialect-conversations`.

* **Domain:** Conversational dialogue
* **Language:** Saudi dialect Arabic
* **Format:** Instruction → Response pairs
* **Purpose:** Increase dialect authenticity and conversational naturalness

---

### Training Configuration

(Extracted from the training notebook)

| Parameter             | Value                        |
| --------------------- | ---------------------------- |
| Epochs                | 4                            |
| Learning Rate         | 2e-4                         |
| Batch Size            | 16                           |
| Gradient Accumulation | 4                            |
| Optimizer             | AdamW                        |
| LR Scheduler          | Linear                       |
| Warmup Ratio          | 0.03                         |
| Sequence Length       | 8096                         |
| Precision             | FP16                         |
| Training Type         | Supervised Fine-Tuning (SFT) |
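
For reference, the per-device batch size of 16 combined with 4 gradient-accumulation steps gives an effective batch of 64 sequences per optimizer step (assuming the single-GPU setup listed under Training Compute):

```python
# Effective batch size per optimizer step.
per_device_batch = 16   # "Batch Size" in the table above
grad_accum_steps = 4    # "Gradient Accumulation" in the table above
num_devices = 1         # 1x A100, per the Training Compute section

effective_batch = per_device_batch * grad_accum_steps * num_devices
print(effective_batch)  # 64
```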

---

### Training Procedure

Training was performed using:

* Transformers
* TRL's `SFTTrainer`
* LoRA fine-tuning
* Mixed precision
* Gradient accumulation

The base model weights were adapted with low-rank adapters rather than retrained from scratch.
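
The LoRA idea can be sketched in a few lines: a frozen base weight matrix `W` is augmented with a trainable low-rank product `(alpha / r) * B @ A`, which drastically cuts the number of trained parameters. A minimal NumPy illustration — the dimensions, rank, and scaling below are illustrative, not the run's actual hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 64, 8   # illustrative dims; rank r is much smaller than d
alpha = 16                    # LoRA scaling numerator (illustrative)

W = rng.standard_normal((d_out, d_in))     # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

# Merged weight after training: W' = W + (alpha / r) * B @ A
W_merged = W + (alpha / r) * (B @ A)

# Parameter comparison: full fine-tuning vs. LoRA adapters
full_params = W.size           # 4096
lora_params = A.size + B.size  # 1024
print(full_params, lora_params)
```

Because `B` starts at zero, the merged weight equals the base weight before any training, so the adapter only gradually shifts the model's behavior.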

---

## Evaluation

Qualitative evaluation indicates:

* Improved dialect fluency
* Reduced MSA (Modern Standard Arabic) leakage
* Better conversational tone
* Higher lexical authenticity

Dialect-specific fine-tuning is known to substantially improve dialect-generation accuracy and reduce drift toward Modern Standard Arabic in Arabic LLMs.

---

## Performance Characteristics

**Strengths**

* Very fast inference
* Low memory footprint
* Strong conversational coherence
* Good instruction following

**Limitations**

* Smaller model, so limited factual depth
* May hallucinate
* Less capable at complex reasoning than larger models
* Dialect bias toward Saudi Arabic

---

## Bias, Risks, and Safety

Potential risks:

* Dialect bias
* Cultural bias inherited from the dataset
* Toxic outputs if prompted maliciously
* Hallucinated facts

Mitigations:

* Dataset filtering
* Instruction alignment
* Moderation layers (recommended for deployment)

---

## Hardware Requirements

Runs efficiently on:

* CPU inference (<1 GB memory when quantized)
* Mobile NPUs
* Edge devices
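
The sub-1 GB figure follows from the parameter count: weights-only memory is roughly parameters × bits-per-parameter / 8. A back-of-the-envelope sketch (KV cache and activations add overhead on top):

```python
PARAMS = 1.17e9  # approximate parameter count of the model

def weight_memory_gb(bits_per_param: int) -> float:
    """Weights-only memory in GB; KV cache and activations are extra."""
    return PARAMS * bits_per_param / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(bits):.2f} GB")
# FP16: 2.34 GB, INT8: 1.17 GB, INT4: 0.59 GB; 4-bit fits under 1 GB
```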

---

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "AyoubChLin/lfm2.5-saudi-dialect"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# "Talk about coffee in the Saudi dialect"
prompt = "تكلم باللهجة السعودية عن القهوة"

# Apply the chat template, since this is an instruction-tuned model
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Training Compute

* **GPU:** 1 × NVIDIA A100 (40 GB VRAM)
* **CPU:** 8 cores
* **RAM:** 16 GiB
* **Compute Environment:** Cloud training instance

---

## License

Same as the base model's license unless otherwise specified.

---

## Citation

If you use this model:

```bibtex
@misc{saudi-dialect-lfm2.5,
  author    = {Cherguelaine Ayoub},
  title     = {Saudi Dialect LFM2.5},
  year      = {2026},
  publisher = {Hugging Face}
}
```

---

## Acknowledgments

* Liquid AI for the base model
* The dataset creators
* The open-source tooling ecosystem

---