TounsiLM-8b

TounsiLM-8b is a Tunisian Arabic supervised fine-tuning (SFT) adapter built on top of CohereLabs/aya-expanse-8b. It is trained to understand and answer in Tunisian دارجة (Darija). Answers are direct, on topic, and sized to the question: short when brevity is enough, detailed when the topic requires it.

The adapter was fine-tuned on top of a prior CPT checkpoint: alabenayed/improved-aya-expanse-8b-cpt-tunisian, which itself extends the base model with continued pre-training on raw Tunisian dialect text.


Model details

| Property | Value |
| --- | --- |
| Base model | CohereLabs/aya-expanse-8b |
| CPT checkpoint | alabenayed/improved-aya-expanse-8b-cpt-tunisian |
| Fine-tuning method | PEFT / LoRA SFT adapter |
| Format | Adapter only, not a merged standalone model |

Training details

| Property | Value |
| --- | --- |
| Dataset | Syrinesmati/tunisian-question-response-dataset |
| Train rows | 25,340 |
| Eval rows | 6,336 |
| Input fields | instruction → user turn, response → assistant turn |
| Trainer | TRL SFTTrainer |
| Epochs | 2 |
| Max sequence length | 1,024 |
| Learning rate | 1e-5 |
| Batch size (per device) | 8 |
| Gradient accumulation | 4 |
| Effective batch size | 32 |
| Precision | bf16 |
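
The "Input fields" mapping can be sketched as a small conversion function. This is a minimal illustration, assuming the rows were converted to the conversational `messages` format that TRL's SFTTrainer accepts; the function name `row_to_messages`, the whitespace stripping, and the absence of a system turn are assumptions, not details confirmed by this card.

```python
from typing import Dict, List


def row_to_messages(row: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    """Map one dataset row to a two-turn conversation.

    Hypothetical helper: `instruction` becomes the user turn and
    `response` becomes the assistant turn, as stated in the table above.
    """
    return {
        "messages": [
            {"role": "user", "content": row["instruction"].strip()},
            {"role": "assistant", "content": row["response"].strip()},
        ]
    }


# Illustrative row, not taken from the actual dataset.
example = row_to_messages(
    {"instruction": "شنوة أحوالك؟ ", "response": " لاباس، الحمد لله."}
)
print([m["role"] for m in example["messages"]])
```

With the Datasets library, a function of this shape would typically be applied with `dataset.map(row_to_messages)` before the data is handed to the trainer.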

Training metrics

| Metric | Value |
| --- | --- |
| Training loss | 1.1876 |
| Mean token accuracy | 0.7578 |
| Training runtime | 50,353 seconds (~14 hours) |
| Total steps | 1,584 |
| Total tokens seen | 9,585,534 |
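
The reported figures are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below assumes a single training device (the card does not state the device count) and that the last partial batch of each epoch is kept rather than dropped.

```python
import math

# Values copied from the training tables in this card.
train_rows = 25_340
per_device_batch = 8
grad_accum = 4
epochs = 2
runtime_seconds = 50_353
total_tokens = 9_585_534

num_devices = 1  # assumption: not stated in the card

# Effective batch size = per-device batch x accumulation steps x devices.
effective_batch = per_device_batch * grad_accum * num_devices

# Optimizer steps per epoch, keeping the final partial batch.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

runtime_hours = runtime_seconds / 3600
seconds_per_step = runtime_seconds / total_steps
tokens_per_step = total_tokens / total_steps

print(effective_batch)   # 32
print(steps_per_epoch)   # 792
print(total_steps)       # 1584
print(round(runtime_hours, 2), round(seconds_per_step, 1), round(tokens_per_step))
```

The computed 32 and 1,584 match the "Effective batch size" and "Total steps" rows, and the runtime works out to just under 14 hours, or roughly 32 seconds per optimizer step.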

How to use

Load the adapter on the base model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_name = "CohereLabs/aya-expanse-8b"
adapter_dir = "alabenayed/TounsiLM-8b"  # update with your HF repo path

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_dir)

messages = [
    {"role": "system", "content": "أنت مساعد تونسي تجاوب بالتونسي الدارج فقط."},
    {"role": "user", "content": "شنوة تعمل كان الواحد يحس روحو تعبان؟"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

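# Greedy decoding with a mild repetition penalty.
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    repetition_penalty=1.1,
)
# Decode only the newly generated tokens so the prompt is not echoed back.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True))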
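
Because this repository ships only the adapter, anyone who needs a standalone checkpoint has to merge the weights locally. The following is a rough sketch of one way to do that with PEFT's `merge_and_unload()`. The function name `merge_adapter` and the output directory are hypothetical, and the card does not say whether the author tested a merged copy, so treat the result as unverified.

```python
def merge_adapter(base_model_name: str, adapter_dir: str, output_dir: str) -> str:
    """Fold the LoRA weights into the base model and save a standalone copy.

    Hypothetical helper. Heavy libraries are imported lazily so that
    defining the function does not require them to be installed.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
    )
    model = PeftModel.from_pretrained(base, adapter_dir)

    # merge_and_unload() returns the base model with the LoRA deltas applied.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)

    # Assumption: the tokenizer files listed in this repository load from adapter_dir.
    AutoTokenizer.from_pretrained(adapter_dir).save_pretrained(output_dir)
    return output_dir


# Example call (downloads the full 8B base model, so it is left commented out):
# merge_adapter("CohereLabs/aya-expanse-8b", "alabenayed/TounsiLM-8b", "./tounsilm-8b-merged")
```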

Recommended inference settings

  • do_sample=False: greedy decoding; output is deterministic and more stable, with less hallucination
  • max_new_tokens=128: keeps answers short and on topic; raise it for questions that need a detailed answer
  • repetition_penalty=1.1: reduces repetitive output
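
The recommended settings can be kept in one place and overridden per call. This is a minimal sketch; the names `RECOMMENDED` and `generation_kwargs` are hypothetical, and the value 512 is only an illustrative override, not a setting recommended by the card.

```python
# Defaults copied from the list above.
RECOMMENDED = {
    "do_sample": False,
    "max_new_tokens": 128,
    "repetition_penalty": 1.1,
}


def generation_kwargs(**overrides):
    """Return a copy of the recommended settings with optional overrides."""
    merged = dict(RECOMMENDED)  # copy so the defaults are never mutated
    merged.update(overrides)
    return merged


default_settings = generation_kwargs()
long_answer_settings = generation_kwargs(max_new_tokens=512)  # illustrative value
print(default_settings)
print(long_answer_settings)
```

The resulting dictionary can be unpacked into the generation call, for example `model.generate(**inputs, **generation_kwargs())`.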

Intended use

Suitable for:

  • Tunisian Arabic question answering
  • Chat-style assistant replies in Tunisian دارجة
  • Everyday conversational responses
  • Translation to and from the Tunisian Arabic dialect
  • Questions asked in other languages, answered in Tunisian Arabic
  • Medical, legal, and religious questions
  • General knowledge about Tunisian food, places, history, proverbs, and similar topics

Files in this repository

  • adapter_model.safetensors — fine-tuned LoRA weights
  • adapter_config.json — LoRA configuration
  • chat_template.jinja — patched chat template used during training
  • Tokenizer files
  • training_metrics.json — full training log history
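
To see how the adapter was configured without loading any model weights, `adapter_config.json` can be read directly. A small sketch, assuming the file follows PEFT's usual LoRA config layout (keys such as `r`, `lora_alpha`, and `target_modules`); the helper name and the local path in the usage comment are hypothetical, and the actual values are not stated in this card.

```python
import json
from pathlib import Path

# Keys that PEFT normally writes for a LoRA adapter.
LORA_KEYS = (
    "base_model_name_or_path",
    "peft_type",
    "r",
    "lora_alpha",
    "lora_dropout",
    "target_modules",
)


def summarize_adapter_config(path):
    """Return the main LoRA settings from an adapter_config.json file.

    Keys missing from the file are reported as None.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return {key: config.get(key) for key in LORA_KEYS}


# Usage, after downloading the repository locally (path is hypothetical):
# print(summarize_adapter_config("TounsiLM-8b/adapter_config.json"))
```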

Framework versions

| Library | Version |
| --- | --- |
| PEFT | 0.19.1 |
| TRL | 1.3.0 |
| Transformers | 4.57.6 |
| PyTorch | 2.11.0 |
| Datasets | 4.8.5 |
| Tokenizers | 0.22.2 |

Citation

If you use this model, please cite the base model and the TRL training framework.

@software{vonwerra2020trl,
  title   = {{TRL: Transformers Reinforcement Learning}},
  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url     = {https://github.com/huggingface/trl},
  year    = {2020}
}