# IdioleX-Llama-3.1-8B-AR — Dialectally Aligned Arabic LLM
IdioleX-Llama-3.1-8B-AR is a fine-tuned version of Llama-3.1-8B-Instruct trained with an additional IdioleX embedding alignment objective to improve dialectal fidelity across Arabic varieties.
> Kantharuban et al., *IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation* (preprint, under review). Code: [github.com/AnjaliRuban/IdioleX](https://github.com/AnjaliRuban/IdioleX)
## Training
The standard supervised fine-tuning (SFT) cross-entropy loss is augmented with a cosine-similarity alignment term between the LLM's pooled response hidden states (projected into IdioleX space) and the IdioleX embedding of the ground-truth response:

```
L = L_CE + α · (1 − cosine_sim(proj(h̄), e_idiolex))
```
The IdioleX encoder (`your-username/idiolex-arabert-ar`) is kept frozen throughout training. A two-layer projection head maps from the LLM's hidden dimension (4096) to the IdioleX embedding dimension (768) and is also frozen after the first epoch.
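The combined objective above can be sketched in PyTorch as follows. This is a minimal illustration, not the training code: the class name, the GELU activation inside the projection head, and the mean-pooling convention are assumptions; only the dimensions (4096 → 768), the weight α = 10.0, and the `L_CE + α · (1 − cos)` form come from this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IdiolexAlignmentLoss(nn.Module):
    """Sketch of the SFT + IdioleX alignment objective (hypothetical class name)."""

    def __init__(self, llm_dim: int = 4096, idiolex_dim: int = 768, alpha: float = 10.0):
        super().__init__()
        # Two-layer projection head: LLM hidden dim -> IdioleX embedding dim.
        # (Inner width and activation are assumptions; the card only specifies "two-layer".)
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, idiolex_dim),
        )
        self.alpha = alpha

    def forward(
        self,
        ce_loss: torch.Tensor,        # scalar cross-entropy loss from the LLM
        pooled_hidden: torch.Tensor,  # (batch, llm_dim) pooled response hidden states
        idiolex_emb: torch.Tensor,    # (batch, idiolex_dim) frozen-encoder embeddings
    ) -> torch.Tensor:
        projected = self.proj(pooled_hidden)
        cos = F.cosine_similarity(projected, idiolex_emb, dim=-1)
        align_loss = (1.0 - cos).mean()
        # L = L_CE + alpha * (1 - cosine_sim(proj(h), e_idiolex))
        return ce_loss + self.alpha * align_loss
```

In an actual training loop, `pooled_hidden` would be derived from the response-token positions of the LLM's last hidden layer, and `idiolex_emb` from the frozen IdioleX encoder applied to the ground-truth response.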
## Hyperparameters
| Parameter | Value |
|---|---|
| Base model | meta-llama/Llama-3.1-8B-Instruct |
| IdioleX encoder | AraBERT v2 trained under IDIOLEX framework |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| LoRA target modules | all-linear |
| IdioleX alignment weight (α) | 10.0 |
| Batch size (per device) | 32 |
| Gradient accumulation steps | 1 |
| Learning rate | 2 × 10⁻⁴ |
| LR schedule | Cosine annealing |
| Max sequence length | 512 |
| Epochs | 20 |
| Optimizer | AdamW (weight decay 0.01) |
| Precision | bfloat16 |
| Distributed training | DeepSpeed ZeRO-2, 8 GPUs |
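The LoRA settings from the table could be expressed with the `peft` library roughly as below. This is a sketch under the assumption that the original run used `peft`; the dropout value and `task_type` are assumptions not stated in the table.

```python
from peft import LoraConfig

# Mirrors the LoRA rows of the hyperparameter table above.
lora_config = LoraConfig(
    r=32,                        # LoRA rank
    lora_alpha=32,               # LoRA alpha
    target_modules="all-linear", # apply adapters to all linear layers
    lora_dropout=0.0,            # assumption: not specified in the table
    task_type="CAUSAL_LM",       # assumption: standard for decoder-only LMs
)
```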
## Training data
Instruction–response pairs in dialectal Arabic constructed from publicly available corpora spanning Egyptian, Moroccan, Palestinian, Saudi, and Syrian Arabic, augmented into instruction-following format using GPT-5-mini. See the paper (§6.1, Appendix B.2.3) for full data details.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "your-username/idiolex-llama-3.1-8b-ar",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-username/idiolex-llama-3.1-8b-ar")

# Prompt: "Translate into Egyptian Arabic: How are you doing today?"
messages = [{"role": "user", "content": "ترجم للعربية المصرية: How are you doing today?"}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

with torch.no_grad():
    output = model.generate(
        inputs, max_new_tokens=256, temperature=0.7, top_p=0.95, do_sample=True
    )

# Decode only the newly generated tokens
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Citation
```bibtex
@article{kantharuban2025idiolex,
  title  = {IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation},
  author = {Kantharuban, Anjali and Srivastava, Aarohi and Faisal, Fahim and Ahia, Orevaoghene
            and Anastasopoulos, Antonios and Chiang, David and Tsvetkov, Yulia and Neubig, Graham},
  year   = {2025},
  note   = {Preprint, under review}
}
```