# IdioleX-Llama-3.1-8B-AR — Dialectally Aligned Arabic LLM
IdioleX-Llama-3.1-8B-AR is a fine-tuned version of Llama-3.1-8B-Instruct trained with an additional IdioleX embedding alignment objective to improve dialectal fidelity across Arabic varieties.
> Kantharuban et al., *IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation* (preprint, under review). Code: [github.com/AnjaliRuban/IdioleX](https://github.com/AnjaliRuban/IdioleX)
## Training
The standard supervised fine-tuning (SFT) cross-entropy loss is augmented with a cosine-similarity alignment term between the LLM's pooled response hidden states (projected into IdioleX space) and the IdioleX embedding of the ground-truth response:

```
L = L_CE + α · (1 − cosine_sim(proj(h̄), e_idiolex))
```
The IdioleX encoder (`your-username/idiolex-arabert-ar`) is kept frozen throughout training. A two-layer projection head maps from the LLM's hidden dimension (4096) to the IdioleX embedding dimension (768) and is also frozen after the first epoch.
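The combined objective above can be sketched in PyTorch as follows. This is a minimal illustration, not the training code: the class name, the GELU activation inside the projection head, and the mean-pooling convention are assumptions; only the dimensions (4096 → 768), the weight α = 10.0, and the `L_CE + α · (1 − cos)` form come from this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IdiolexAlignmentLoss(nn.Module):
    """Sketch of the SFT + IdioleX alignment objective (hypothetical class name)."""

    def __init__(self, llm_dim: int = 4096, idiolex_dim: int = 768, alpha: float = 10.0):
        super().__init__()
        # Two-layer projection head: LLM hidden dim -> IdioleX embedding dim.
        # (Inner width and activation are assumptions; the card only specifies "two-layer".)
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, idiolex_dim),
        )
        self.alpha = alpha

    def forward(
        self,
        ce_loss: torch.Tensor,        # scalar cross-entropy loss from the LLM
        pooled_hidden: torch.Tensor,  # (batch, llm_dim) pooled response hidden states
        idiolex_emb: torch.Tensor,    # (batch, idiolex_dim) frozen-encoder embeddings
    ) -> torch.Tensor:
        projected = self.proj(pooled_hidden)
        cos = F.cosine_similarity(projected, idiolex_emb, dim=-1)
        align_loss = (1.0 - cos).mean()
        # L = L_CE + alpha * (1 - cosine_sim(proj(h), e_idiolex))
        return ce_loss + self.alpha * align_loss
```

In an actual training loop, `pooled_hidden` would be derived from the response-token positions of the LLM's last hidden layer, and `idiolex_emb` from the frozen IdioleX encoder applied to the ground-truth response.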
## Hyperparameters
| Parameter | Value |
|---|---|
| Base model | meta-llama/Llama-3.1-8B-Instruct |
| IdioleX encoder | AraBERT v2 trained under IDIOLEX framework |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| LoRA target modules | all-linear |
| IdioleX alignment weight (α) | 10.0 |
| Batch size (per device) | 32 |
| Gradient accumulation steps | 1 |
| Learning rate | 2 × 10⁻⁴ |
| LR schedule | Cosine annealing |
| Max sequence length | 512 |
| Epochs | 20 |
| Optimizer | AdamW (weight decay 0.01) |
| Precision | bfloat16 |
| Distributed training | DeepSpeed ZeRO-2, 8 GPUs |
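The LoRA settings from the table could be expressed with the `peft` library roughly as below. This is a sketch under the assumption that the original run used `peft`; the dropout value and `task_type` are assumptions not stated in the table.

```python
from peft import LoraConfig

# Mirrors the LoRA rows of the hyperparameter table above.
lora_config = LoraConfig(
    r=32,                        # LoRA rank
    lora_alpha=32,               # LoRA alpha
    target_modules="all-linear", # apply adapters to all linear layers
    lora_dropout=0.0,            # assumption: not specified in the table
    task_type="CAUSAL_LM",       # assumption: standard for decoder-only LMs
)
```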
## Training data
Instruction–response pairs in dialectal Arabic constructed from publicly available corpora spanning Egyptian, Moroccan, Palestinian, Saudi, and Syrian Arabic, augmented into instruction-following format using GPT-5-mini. See the paper (§6.1, Appendix B.2.3) for full data details.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "your-username/idiolex-llama-3.1-8b-ar",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-username/idiolex-llama-3.1-8b-ar")

# Prompt: "Translate into Egyptian Arabic: How are you doing today?"
messages = [{"role": "user", "content": "ترجم للعربية المصرية: How are you doing today?"}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

with torch.no_grad():
    output = model.generate(
        inputs, max_new_tokens=256, temperature=0.7, top_p=0.95, do_sample=True
    )

# Decode only the newly generated tokens
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Citation
```bibtex
@article{kantharuban2025idiolex,
  title  = {IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation},
  author = {Kantharuban, Anjali and Srivastava, Aarohi and Faisal, Fahim and Ahia, Orevaoghene
            and Anastasopoulos, Antonios and Chiang, David and Tsvetkov, Yulia and Neubig, Graham},
  year   = {2025},
  note   = {Preprint, under review}
}
```