
NileChat-3B (Moroccan & Egyptian Arabic Dialectal LLM)

NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

This is the continued pre-trained (base) version of NileChat-3B. For the instruction-tuned version, see the NileChat-3B model.

NileChat is a 3-billion-parameter Large Language Model (LLM) adapted for the Egyptian and Moroccan communities, designed to incorporate their dialects, cultural heritage, and values. The model is proficient in both Egyptian and Moroccan dialectal Arabic (in Arabic script and Arabizi), while maintaining strong performance in Modern Standard Arabic (MSA), French, and English.

This model is the proof-of-concept resulting from the research paper "NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities".

Model Description

NileChat was developed to address the underrepresentation of low-resource languages and local cultures in existing LLMs. Current models often rely on translated English corpora, which aligns them with the source language's culture rather than with that of the target local communities.

The NileChat methodology focuses on creating synthetic and retrieval-based pre-training data tailored to a specific community by considering its:

  • (i) Language: dialectal nuances, idiomatic expressions, and unique linguistic structures.
  • (ii) Cultural Heritage: customs, traditions, social norms, historical context, and common knowledge.
  • (iii) Cultural Values: ethical standpoints, belief systems, and societal priorities.

Together, these are referred to as the Language-Heritage-Values (LHV) dimensions.

The project provides:

  • A novel framework for augmenting pre-training corpora for local communities.
  • New datasets for Egyptian and Moroccan Arabic dialects.
  • The NileChat model itself.

Intended Uses

NileChat is intended to improve LLM accessibility and relevance for Egyptian and Moroccan Arabic-speaking communities. It can be used for tasks requiring:

  • Understanding and generation in Egyptian and Moroccan dialects (Arabic script and Arabizi).
  • Translation between these dialects, MSA, English, and French.
  • Culturally aware interactions and content generation relevant to Egyptian and Moroccan contexts.
  • Applications requiring alignment with local societal values.

How to Use

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and model (bfloat16 weights, automatic device placement)
model_id = "UBC-NLP/NileChat-3B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Moroccan Arabic (Darija) prompt, roughly: "To make tagine, we need"
prompt = "باش نصاوبو الطاجين خاص"

# Encode the input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.1
)

response = tokenizer.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)

print(f"Prompt: {prompt}")
print(f"Completion: {response}")

Training Data

NileChat's pre-training and fine-tuning datasets were specifically curated to build linguistic and cultural competence in Egyptian (EGY) and Moroccan (MOR) Arabic.

Pre-training Data

A novel data augmentation pipeline was used, combining:

  1. Machine Translation (MT) for Knowledge and Fluency:
    • English educational content (5.5 million texts from Fineweb-edu) was translated into the EGY and MOR dialects using the Command R+ teacher model; the educational domain was chosen for its topical breadth.
  2. Controlled Generation for Cultural Heritage and Values:
    • Diverse texts (stories, personal essays, blog posts, reviews, conversations) were generated in the target language.
    • Components:
      • Local Contextual Information: From local news websites (approx. 1.5M EGY, 800k MOR articles in MSA).
      • Core Cultural Heritage Concepts: Extracted from country-specific Wikipedia portals (25k EGY, 49k MOR articles).
      • Linguistic and Cultural Expressions: Common expressions, proverbs, idioms, TV dialogues (600 utterances), and local terminology (4,000 dialect-to-English word pairs per dialect from Gatitos dictionary).
      • Representative Personas: 1,200 descriptions based on World Values Survey (WVS) data for Egyptian and Moroccan participants.
    • Generated ~300k samples per genre for EGY, and ~150k samples per genre for MOR.
  3. Retrieval for Local Cultural Heritage:
    • The Brave Search API was queried with 6,500 Moroccan and 4,500 Egyptian cultural concepts across ten categories (food, clothes, landmarks, etc.); a minimal retrieval sketch follows this list.
    • Collected 110k articles for EGY and 30k for MOR.
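
A minimal sketch of the retrieval step referenced above, assuming a Brave Search API subscription token; the endpoint and header follow Brave's public web-search API, and the concept list is an illustrative stand-in for the curated cultural concepts:

import requests

API_KEY = "YOUR_BRAVE_API_KEY"  # assumption: a Brave Search API subscription token
ENDPOINT = "https://api.search.brave.com/res/v1/web/search"

# Illustrative stand-ins for the curated cultural concept lists.
concepts = ["طاجين", "جلابة", "كسكس"]

hits = []
for concept in concepts:
    resp = requests.get(
        ENDPOINT,
        headers={"X-Subscription-Token": API_KEY, "Accept": "application/json"},
        params={"q": concept, "count": 10},
    )
    resp.raise_for_status()
    # Keep title/URL pairs; fetching and cleaning the page text would follow.
    for result in resp.json().get("web", {}).get("results", []):
        hits.append({"concept": concept, "title": result["title"], "url": result["url"]})

print(f"Collected {len(hits)} search results for {len(concepts)} concepts")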

Arabizi Data: 1.5M generated educational/LHV samples for EGY and 0.5M for MOR were converted to Arabizi.
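
The exact conversion procedure is not reproduced here; as an illustration only, a character-level transliteration following common Arabizi conventions (e.g., ع→3, ح→7, ء→2, ق→9) might look like this:

# Illustrative Arabic-script -> Arabizi mapping using common community
# conventions; the actual conversion used for NileChat may differ.
ARABIZI_MAP = {
    "ا": "a", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "7",
    "خ": "kh", "د": "d", "ذ": "dh", "ر": "r", "ز": "z", "س": "s",
    "ش": "sh", "ص": "s", "ض": "d", "ط": "t", "ظ": "dh", "ع": "3",
    "غ": "gh", "ف": "f", "ق": "9", "ك": "k", "ل": "l", "م": "m",
    "ن": "n", "ه": "h", "و": "w", "ي": "y", "ء": "2", "ة": "a",
    "أ": "a", "إ": "i", "آ": "a", "ى": "a", "ؤ": "2", "ئ": "2",
}

def to_arabizi(text: str) -> str:
    """Map Arabic-script characters to Latin/digit Arabizi equivalents."""
    return "".join(ARABIZI_MAP.get(ch, ch) for ch in text)

print(to_arabizi("صباح الخير"))  # -> "sba7 alkhyr" (roughly)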

Final Pre-training Mixture: The generated and retrieved data were combined with pre-existing public data for EGY, MOR, MSA, English, French, Math, and Code to mitigate catastrophic forgetting. The resulting dataset comprises 98.57 billion words.
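
As an illustration of assembling such a mixture, the component corpora could be interleaved by sampling probability with the datasets library. The dataset ids echo the table below but are unverified, the proportions are placeholder assumptions rather than the paper's actual mixture weights, and the sketch assumes each corpus exposes a compatible text column:

from datasets import load_dataset, interleave_datasets

# Hypothetical id-to-proportion mapping; verify names on the UBC-NLP hub page.
components = {
    "UBC-NLP/Fineweb-edu-Egypt": 0.25,
    "UBC-NLP/Fineweb-edu-Morocco": 0.25,
    "UBC-NLP/LHV-Egypt": 0.15,
    "UBC-NLP/LHV-Morocco": 0.15,
    "HuggingFaceFW/fineweb-edu": 0.20,  # stand-in for the English replay data
}

corpora = [load_dataset(name, split="train", streaming=True) for name in components]

# Sampling general-domain data alongside the new dialectal data is what
# mitigates catastrophic forgetting during continued pre-training.
mixture = interleave_datasets(
    corpora,
    probabilities=list(components.values()),
    seed=42,
    stopping_strategy="all_exhausted",
)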

| Type  | Name                | Hugging Face Link |
|-------|---------------------|-------------------|
| Data  | Fineweb-edu-Morocco | Open In HF        |
| Data  | Fineweb-edu-Egypt   | Open In HF        |
| Data  | Arabizi-Egypt       | Open In HF        |
| Data  | Arabizi-Morocco     | Open In HF        |
| Data  | LHV-Egypt           | Open In HF        |
| Data  | LHV-Morocco         | Open In HF        |
| Model | NileChat-3B         | Open In HF        |

Training Procedure

Teacher Model

  • Command R+ (104B) was used as the teacher model for translation and controlled generation, owing to its reasonable text-generation capabilities in the target dialects and its open weights.
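
A rough sketch of prompting such a teacher for dialectal translation is below. The checkpoint id is Cohere's publicly released Command R+ weights (assumed to match the paper's setup), the instruction wording is illustrative, and serving a 104B model requires multi-GPU hardware:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

teacher_id = "CohereForAI/c4ai-command-r-plus"  # public Command R+ weights
teacher_tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForCausalLM.from_pretrained(
    teacher_id, device_map="auto", torch_dtype=torch.bfloat16
)

# Illustrative translation instruction; the paper's prompt templates may differ.
messages = [{
    "role": "user",
    "content": (
        "Translate the following English text into Moroccan Arabic (Darija), "
        "keeping the meaning and tone:\n\n"
        "Plants make their food using sunlight, water, and air."
    ),
}]
input_ids = teacher_tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(teacher.device)
out = teacher.generate(input_ids, max_new_tokens=256, do_sample=False)
print(teacher_tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))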

Continued Pre-training

  • Base Model: Qwen2.5-3B was selected for its competitive performance and good tokenizer compression on MSA.
  • The full 3.1B-parameter model was continually pre-trained for one epoch on the curated dataset (98.57 billion words).
  • Sequence Length: 4,096 tokens.
  • Learning Rate: linearly decayed from $5 \times 10^{-6}$ to $5 \times 10^{-7}$.
  • Weight Decay: 0.1.
  • Gradient Clipping: norms clipped at 1.0.
  • Compute: data augmentation took 1,096 hours on 4x A100 80GB GPUs; continued pre-training took 750 hours on the same hardware.
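
For orientation, these hyperparameters map onto a standard PyTorch loop roughly as follows. The optimizer choice (AdamW), the step count, and train_loader are assumptions for the sketch; only the learning-rate range, weight decay, sequence length, and clipping norm come from this card:

import torch

# Assumption: AdamW; the card reports LR, weight decay, and clipping,
# but not the optimizer itself.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6, weight_decay=0.1)

total_steps = 100_000  # placeholder; depends on batch size over the 98.57B-word corpus

# Linear decay from 5e-6 down to 5e-7 (i.e., to 0.1x the initial LR).
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.1, total_iters=total_steps
)

# train_loader is a hypothetical DataLoader yielding 4,096-token batches
# with input_ids and labels.
for batch in train_loader:
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()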

Ethical Considerations

  • The work aims to develop inclusive, linguistically and culturally diverse LLMs.
  • Pre-training and instruction-tuning data generation, while using a teacher LLM, was critically informed by ground-truth cultural values survey data (WVS) and local context.
  • Evaluations show reasonable alignment with the cultural heritage and values of the target communities.
  • No explicit safety alignment procedures were conducted. The authors strongly recommend thorough testing and further safety evaluations before any real-world deployment.

Limitations

As a smaller Large Language Model, NileChat-3B shares common limitations with other LLMs. These can include generating plausible yet incorrect information (hallucinations), sensitivity to prompt phrasing, and inconsistent performance with very long inputs. Although NileChat-3B aims to mitigate these issues, particularly for Arabic tasks, users should exercise critical judgment when evaluating its outputs, especially in crucial or fact-dependent situations.

Citation

If you use NileChat or the associated methodology, please cite the original paper:

@inproceedings{el-mekki-etal-2025-nilechat,
    title = "{N}ile{C}hat: Towards Linguistically Diverse and Culturally Aware {LLM}s for Local Communities",
    author = "El Mekki, Abdellah  and
      Atou, Houdaifa  and
      Nacar, Omer  and
      Shehata, Shady  and
      Abdul-Mageed, Muhammad",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.556/",
    doi = "10.18653/v1/2025.emnlp-main.556",
    pages = "10978--11002",
    ISBN = "979-8-89176-332-6"
}