Llama 3.1 8B Ita: Italian Cultural Alignment [V1]

Llama 3.1 8B Ita [V1] is a LoRA adapter fine-tuned on top of DeepMount00/Llama-3.1-8b-ITA to improve Italian cultural alignment. It was trained on the Mult-IT dataset and evaluated on the ITALIC benchmark. Unlike Qwen3, Llama 3.1 is a standard causal language model without a hybrid reasoning architecture, so no thinking-mode considerations apply.

Author: Maruf Bepary, King's College London
Research report: Alignment in Large Language Models


Model Summary

| Property | Value |
| --- | --- |
| Base model | DeepMount00/Llama-3.1-8b-ITA |
| PEFT type | LoRA |
| Task | Causal language modelling (Italian Q&A / instruction following) |
| Training dataset | Mult-IT (~86,929 samples) |
| Evaluation benchmark | ITALIC (10,000 questions) |
| ITALIC accuracy (V1) | 73.91% (+3.42 pp over baseline) |
| Trainable parameters | See research report |

Intended Use

This model is intended for:

  • Italian language understanding: multiple-choice Q&A, cultural knowledge, and general instruction following in Italian.
  • Research: comparing the effect of SFT on Italian cultural alignment across model families.
  • Benchmarking: comparing Italian-specific models against multilingual and fine-tuned baselines.

Not recommended for:

  • High-stakes or safety-critical applications.
  • Languages other than Italian.

Key Finding: Cultural Alignment

Training on the Italian cultural Q&A dataset (Mult-IT) improves accuracy in 11 of the 12 ITALIC categories; only Events is unchanged:

| Metric | Baseline | V1 | Delta |
| --- | --- | --- | --- |
| Total | 70.49% | 73.91% | +3.42 pp |
| Culture | 72.96% | 75.45% | +2.49 pp |
| Language | 66.83% | 71.63% | +4.80 pp |

Language competence improved more than cultural knowledge (+4.80 pp vs +2.49 pp). The largest gains were in Synonyms (+8.76 pp), Morphology (+8.29 pp), Orthography (+7.03 pp), and Civic (+6.07 pp); Events was unchanged (0.00 pp). Because Llama 3.1 has no hybrid reasoning architecture, fine-tuning carries no risk of degrading a reasoning mode.
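To put the headline delta in concrete terms, the sketch below converts the reported accuracies into question counts and a relative error reduction. It assumes each of the 10,000 ITALIC questions is scored once as right or wrong; the function name `summarise_gain` is illustrative and does not come from the research report.

```python
def summarise_gain(baseline_pct: float, tuned_pct: float, n_questions: int) -> dict:
    """Express an accuracy change in percentage points, questions, and error reduction."""
    delta_pp = round(tuned_pct - baseline_pct, 2)
    # On a fixed-size benchmark, percentage points map directly to question counts.
    extra_correct = round(delta_pp / 100 * n_questions)
    # Relative error reduction: the share of the baseline's mistakes that disappeared.
    baseline_error_pct = 100 - baseline_pct
    error_reduction_pct = round(delta_pp / baseline_error_pct * 100, 1)
    return {
        "delta_pp": delta_pp,
        "extra_correct": extra_correct,
        "error_reduction_pct": error_reduction_pct,
    }


# Total ITALIC accuracy: baseline vs V1, as reported in this card.
gain = summarise_gain(70.49, 73.91, 10_000)
print(gain)
```

Under these assumptions, the +3.42 pp gain corresponds to 342 additional questions answered correctly, or roughly an 11.6% reduction in the baseline's errors.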


Training Details

LoRA Configuration

| Parameter | Value |
| --- | --- |
| LoRA rank (r) | 24 |
| LoRA alpha | 48 |
| LoRA dropout | 0.1 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | none |
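The configuration above implies a rough trainable-parameter count, sketched below. The layer dimensions are an assumption taken from the public Llama 3.1 8B architecture (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), not from this card, so the figure in the research report remains authoritative.

```python
# Assumed Llama 3.1 8B dimensions (public architecture, not stated in this card).
HIDDEN = 4096         # hidden_size
INTERMEDIATE = 14336  # intermediate_size of the MLP
KV_DIM = 1024         # 8 key/value heads * head_dim 128 (grouped-query attention)
NUM_LAYERS = 32

# (in_features, out_features) of each linear layer targeted by the adapter.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """Each targeted layer gains A (rank x in) and B (out x rank); bias is 'none'."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


RANK, ALPHA = 24, 48
scaling = ALPHA / RANK  # standard LoRA scaling factor applied to the B @ A update
total = lora_trainable_params(RANK)
print(f"scaling = {scaling}, estimated trainable LoRA parameters = {total:,}")
```

With PEFT, the same table maps onto `LoraConfig(r=24, lora_alpha=48, lora_dropout=0.1, target_modules=[...], bias="none", task_type="CAUSAL_LM")`; whether the author set any further options is not stated here.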

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Sequence packing | Yes (max 2,048 tokens per slot) |
| Max sequence length | 2,048 tokens |

Note: full training hyperparameters are detailed in the research report.
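For readers unfamiliar with sequence packing, the sketch below shows the general idea: several short tokenised samples share one 2,048-token slot so that less of each batch is padding. This is a simple greedy first-fit illustration with a hypothetical `pack_sequences` helper; it is not the packing strategy used by TRL or necessarily by the author.

```python
from typing import List


def pack_sequences(sequences: List[List[int]], max_len: int = 2048) -> List[List[int]]:
    """Greedy first-fit packing: place each sequence in the first slot with room."""
    slots: List[List[int]] = []
    for seq in sequences:
        seq = seq[:max_len]  # truncate anything longer than a single slot
        for slot in slots:
            if len(slot) + len(seq) <= max_len:
                slot.extend(seq)
                break
        else:
            # No existing slot had enough room, so open a new one.
            slots.append(list(seq))
    return slots


# Toy samples made of fake token IDs, with lengths 1500, 600, 400 and 2048.
samples = [[1] * 1500, [2] * 600, [3] * 400, [4] * 2048]
packed = pack_sequences(samples)
slot_lengths = [len(slot) for slot in packed]
print(slot_lengths)
```

A real training pipeline also has to keep packed samples from attending to one another (for example via position IDs or attention masking), which this sketch leaves out.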

Framework & Hardware

| Component | Version / Spec |
| --- | --- |
| TRL | 0.21.0 |
| PEFT | 0.17.0 |
| Transformers | 4.55.0 |
| PyTorch | 2.5.1+cu121 |
| Hardware | NVIDIA GeForce RTX 3090 |

Training Dataset: Mult-IT

  • Dataset: Mult-IT (Multiple Choice Questions on Multiple Topics in Italian)
  • Source: CALAMITA Shared Task @ CLiC-it 2024
  • Language: Italian
  • Size: ~86,929 training samples
  • Format: JSONL, multiple-choice Q&A
  • Reference: Mult-IT: Multiple Choice Questions on Multiple Topics in Italian (2024)
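As an illustration of how a JSONL multiple-choice record could be turned into a supervised chat example, consider the sketch below. The field names `question`, `choices` and `answer` are hypothetical, since the Mult-IT schema is not given here, and reusing the evaluation system prompt and instruction wording for training is likewise an assumption.

```python
import json
from typing import Dict, List

SYSTEM_PROMPT = "Sei un assistente utile."  # prompt reported for the ITALIC evaluation
INSTRUCTION = "Rispondi con la lettera della risposta corretta."


def record_to_messages(jsonl_line: str) -> List[Dict[str, str]]:
    """Turn one JSONL record (hypothetical schema) into system/user/assistant messages."""
    record = json.loads(jsonl_line)
    # Label the options A), B), C), ... in the order they appear in the record.
    options = "\n".join(
        f"{chr(ord('A') + i)}) {choice}" for i, choice in enumerate(record["choices"])
    )
    user_content = f"{record['question']}\n{options}\n\n{INSTRUCTION}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["answer"]},
    ]


# A made-up record, serialised the way a JSONL line would be.
example_line = json.dumps(
    {
        "question": "Qual è la capitale d'Italia?",
        "choices": ["Milano", "Roma", "Napoli", "Torino"],
        "answer": "B",
    },
    ensure_ascii=False,
)
messages = record_to_messages(example_line)
for message in messages:
    print(f"{message['role']}: {message['content']!r}")
```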

ITALIC Benchmark Results

Benchmark: ITALIC (NAACL 2025), the Italian Culture-Aware Natural Language Benchmark
Format: Zero-shot, multiple-choice (12 categories, 10,000 questions)
System prompt: "Sei un assistente utile." (English: "You are a helpful assistant.")
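Zero-shot multiple-choice scoring needs some rule for mapping a free-text response to an option letter. The sketch below shows one plausible heuristic; the helpers `extract_choice` and `accuracy` are hypothetical, the default option set A to D is an assumption, and the scoring protocol actually used for the reported numbers is not described here.

```python
import re
from typing import List, Optional


def extract_choice(response: str, valid: str = "ABCD") -> Optional[str]:
    """Return the first standalone upper-case option letter, or None if absent."""
    # \b keeps letters inside words (e.g. the "A" in "Art") from matching.
    match = re.search(rf"\b([{valid}])\b", response)
    return match.group(1) if match else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Percentage of responses whose extracted letter equals the gold letter."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return 100 * correct / len(gold)


# Made-up responses covering a bare letter, a letter with text, a sentence, and a refusal.
responses = ["B", "B) Roma", "La risposta corretta è C.", "Non lo so"]
gold = ["B", "A", "C", "D"]
score = accuracy(responses, gold)
print(f"accuracy = {score:.1f}%")
```

A heuristic like this can misfire, for instance on a sentence that opens with the Italian preposition "A", so constraining the model to answer with a single letter, as the usage example in this card does, keeps extraction simpler.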

V1 vs Baseline

| Category | Baseline (%) | V1 (%) | Δ (pp) |
| --- | --- | --- | --- |
| Art | 70.10 | 71.31 | +1.21 |
| Civic | 71.22 | 77.29 | +6.07 |
| Events | 82.61 | 82.61 | 0.00 |
| Geography | 79.26 | 80.90 | +1.64 |
| History | 77.40 | 79.28 | +1.88 |
| Literature | 67.17 | 71.24 | +4.07 |
| Tourism | 71.73 | 72.04 | +0.31 |
| Lexicon | 81.51 | 83.76 | +2.25 |
| Morphology | 52.14 | 60.43 | +8.29 |
| Orthography | 53.04 | 60.07 | +7.03 |
| Synonyms | 81.15 | 89.91 | +8.76 |
| Syntax | 53.65 | 54.31 | +0.66 |
| Culture (subtotal) | 72.96 | 75.45 | +2.49 |
| Language (subtotal) | 66.83 | 71.63 | +4.80 |
| Total | 70.49 | 73.91 | +3.42 |
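The per-category deltas can be recomputed and ranked directly from the table above. The short sketch below does that using only the twelve category rows (subtotals excluded); the variable names are illustrative.

```python
# (baseline, V1) accuracy in percent per ITALIC category, copied from the table above.
SCORES = {
    "Art": (70.10, 71.31),
    "Civic": (71.22, 77.29),
    "Events": (82.61, 82.61),
    "Geography": (79.26, 80.90),
    "History": (77.40, 79.28),
    "Literature": (67.17, 71.24),
    "Tourism": (71.73, 72.04),
    "Lexicon": (81.51, 83.76),
    "Morphology": (52.14, 60.43),
    "Orthography": (53.04, 60.07),
    "Synonyms": (81.15, 89.91),
    "Syntax": (53.65, 54.31),
}

# Delta in percentage points, rounded to the table's two decimal places.
deltas = {name: round(v1 - base, 2) for name, (base, v1) in SCORES.items()}

# Rank categories from largest to smallest gain.
ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
top_four = [name for name, _ in ranked[:4]]

for name, delta in ranked:
    print(f"{name:<12}{delta:+.2f} pp")
```

Ranking this way also makes it easy to see which categories remain lowest in absolute terms after fine-tuning, which is a separate question from which improved most.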

Comparison with Other Models (ITALIC Total)

| Model | Total | Parameters |
| --- | --- | --- |
| Llama 3.1 70B | 83.61% | 70B |
| GPT-4o Mini | 82.22% | ~8B (unofficial estimate; not disclosed by OpenAI) |
| Magistral Small (No Thinking) | 76.06% | 24B |
| Llama 3.1 8B Ita [V1] | 73.91% | 8B |
| Qwen3 8B (No Thinking) [V3] | 73.81% | 8B |
| Qwen3 8B (No Thinking) [V1] | 73.77% | 8B |
| Llama 3.1 8B Ita (baseline) | 70.49% | 8B |
| Qwen3 8B (No Thinking) baseline | 70.17% | 8B |
| LLaMAntino-3 8B | 68.37% | 8B |
| Llama 3.1 8B | 66.38% | 8B |

All scores evaluated under identical zero-shot conditions on the ITALIC benchmark.
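When comparing totals that differ by a fraction of a point, it helps to have a rough sense of sampling uncertainty. The sketch below uses a normal approximation to the binomial standard error, treating the 10,000 questions as independent trials; that is a simplifying assumption, and a paired test on per-question results would be more appropriate for comparing two models on the same questions.

```python
import math


def accuracy_standard_error(accuracy_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy, returned in percentage points."""
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_questions)


# Llama 3.1 8B Ita [V1] total accuracy on the 10,000-question benchmark.
se = accuracy_standard_error(73.91, 10_000)
margin_95 = 1.96 * se  # half-width of an approximate 95% confidence interval
print(f"standard error = {se:.2f} pp, 95% margin = +/-{margin_95:.2f} pp")
```

Under this approximation the margin is a little under one percentage point, so gaps of about 0.1 pp between the fine-tuned 8B models should be read as ties rather than as a ranking, while the +3.42 pp gain over the baseline is well outside it.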


Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "DeepMount00/Llama-3.1-8b-ITA"
adapter_id = "maruf-bepary/llama-3.1-8b-ita-italian-v1"

# Load tokeniser and base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# Example: Italian multiple-choice question
messages = [
    {"role": "system", "content": "Sei un assistente utile."},
    {
        "role": "user",
        "content": (
            "Qual è la capitale d'Italia?\n"
            "A) Milano\nB) Roma\nC) Napoli\nD) Torino\n\n"
            "Rispondi con la lettera della risposta corretta."
        ),
    },
]

# Apply LLaMA-3 chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        temperature=None,
        top_p=None,
    )

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(response)
# Expected output: "B"

Limitations

  • Weakest categories: Syntax (54.31%), Orthography (60.07%), and Morphology (60.43%) remain the lowest-scoring categories despite improvement.
  • Benchmark scope: evaluation was conducted solely on ITALIC; performance on other Italian benchmarks is unverified.
  • Single-GPU training: training used one RTX 3090; multi-GPU configurations may yield different results.
  • Dataset bias: Mult-IT is a multiple-choice dataset; generalisation to open-ended Italian generation tasks is unverified.
  • Events: this category showed no improvement (0.00 pp), suggesting the training data may lack current-events coverage.

