Llama 3.1 8B Ita — Italian Cultural Alignment [V1]

Llama 3.1 8B Ita [V1] is a LoRA adapter fine-tuned on top of DeepMount00/Llama-3.1-8b-ITA to improve Italian cultural alignment. It was trained on the Mult-IT dataset and evaluated on the ITALIC benchmark. Unlike Qwen3, Llama 3.1 is a standard causal language model without a hybrid reasoning architecture, so no thinking-mode considerations apply.

Author: Maruf Bepary, King's College London
Research report: Alignment in Large Language Models

Model Summary

| Property | Value |
| --- | --- |
| Base model | `DeepMount00/Llama-3.1-8b-ITA` |
| PEFT type | LoRA |
| Task | Causal language modelling (Italian Q&A / instruction following) |
| Training dataset | Mult-IT (~86,929 samples) |
| Evaluation benchmark | ITALIC (10,000 questions) |
| ITALIC accuracy (V1) | 73.91% (+3.42 pp over baseline) |
| Trainable parameters | See research report |

Intended Use

This model is intended for:

- Italian language understanding: multiple-choice Q&A, cultural knowledge, and general instruction following in Italian.
- Research: comparing the effect of SFT on Italian cultural alignment across model families.
- Benchmarking: comparing Italian-specific models against multilingual and fine-tuned baselines.

Not recommended for:

- High-stakes or safety-critical applications.
- Languages other than Italian.

Key Finding — Cultural Alignment

Training on the Italian cultural Q&A dataset (Mult-IT) improves accuracy in 11 of the 12 ITALIC categories; only Events is unchanged. The aggregate scores are:

| Metric | Baseline | V1 | Delta |
| --- | --- | --- | --- |
| Total | 70.49% | 73.91% | +3.42 pp |
| Culture | 72.96% | 75.45% | +2.49 pp |
| Language | 66.83% | 71.63% | +4.80 pp |

Language competence improved more than culture knowledge. The largest gains were in Synonyms (+8.76 pp), Morphology (+8.29 pp), Orthography (+7.03 pp), and Civic (+6.07 pp). Events remained flat (0.00 pp change). As Llama 3.1 does not have a hybrid reasoning architecture, fine-tuning carries no risk of reasoning-mode degradation.

Training Details

LoRA Configuration

| Parameter | Value |
| --- | --- |
| LoRA rank (`r`) | 24 |
| LoRA alpha | 48 |
| LoRA dropout | 0.1 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Bias | none |
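As a rough illustration, the table values can be written as keyword arguments for PEFT's `LoraConfig`. The sketch below also derives the LoRA scaling factor (`lora_alpha / r`) and an estimate of the adapter size. The `task_type` value and the layer shapes (the publicly documented Llama 3.1 8B dimensions) are assumptions that this card does not state, so the parameter count is only an estimate; the research report remains the authoritative source.

```python
# Values copied from the LoRA configuration table above.
lora_settings = {
    "r": 24,
    "lora_alpha": 48,
    "lora_dropout": 0.1,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "bias": "none",
    "task_type": "CAUSAL_LM",  # ASSUMPTION: not stated in the table
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = lora_settings["lora_alpha"] / lora_settings["r"]

# ASSUMPTION: standard Llama 3.1 8B shapes (hidden size 4096, MLP size 14336,
# key/value projection width 1024, 32 decoder layers).
HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32
module_shapes = {  # module name -> (in_features, out_features)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def estimate_lora_params(settings, shapes, num_layers):
    """Each adapted linear layer adds r * (in_features + out_features) weights."""
    r = settings["r"]
    per_layer = sum(
        r * (shapes[name][0] + shapes[name][1])
        for name in settings["target_modules"]
    )
    return per_layer * num_layers


estimated_params = estimate_lora_params(lora_settings, module_shapes, LAYERS)

# Build the actual PEFT config only if the library is installed.
try:
    from peft import LoraConfig
    lora_config = LoraConfig(**lora_settings)
except ImportError:
    lora_config = None

print(f"scaling factor: {scaling}")
print(f"estimated adapter parameters: {estimated_params:,}")
```

Under these assumptions the adapter adds roughly 63M weights, a small fraction of the 8B base model.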

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Sequence packing | Yes (max 2,048 tokens per slot) |
| Max sequence length | 2,048 tokens |

Note: full training hyperparameters are detailed in the research report.
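To give an intuition for what sequence packing into 2,048-token slots does, here is a toy sketch using a greedy first-fit-decreasing strategy. It is not the packing code used for training (TRL ships its own implementation), and the function name, the truncation rule, and the sample lengths are made up for illustration.

```python
def pack_sequences(lengths, slot_size=2048):
    """Greedy first-fit-decreasing packing of token counts into fixed-size slots."""
    slots = []  # each slot is a list of the sequence lengths placed in it
    for length in sorted(lengths, reverse=True):
        # ASSUMPTION: samples longer than a slot are truncated to the slot size.
        length = min(length, slot_size)
        for slot in slots:
            if sum(slot) + length <= slot_size:
                slot.append(length)
                break
        else:
            # No existing slot has room, so open a new one.
            slots.append([length])
    return slots


# Hypothetical token counts for seven short Q&A samples.
example_lengths = [1500, 900, 600, 500, 400, 300, 120]
packed = pack_sequences(example_lengths)

# Fraction of slot capacity holding real tokens rather than padding.
utilisation = sum(sum(slot) for slot in packed) / (len(packed) * 2048)

print(f"{len(example_lengths)} samples packed into {len(packed)} slots")
print(f"slot utilisation: {utilisation:.1%}")
```

Without packing, each of the seven samples would occupy its own padded sequence; with packing, short multiple-choice items share a slot, which reduces the compute spent on padding.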

Framework & Hardware

| Component | Version / Spec |
| --- | --- |
| TRL | 0.21.0 |
| PEFT | 0.17.0 |
| Transformers | 4.55.0 |
| PyTorch | 2.5.1+cu121 |
| Hardware | NVIDIA GeForce RTX 3090 |

Training Dataset — Mult-IT

Dataset: Mult-IT — Multiple Choice Questions on Multiple Topics in Italian
Source: CALAMITA Shared Task @ CLiC-it 2024
Language: Italian
Size: ~86,929 training samples
Format: JSONL, multiple-choice Q&A
Reference: Mult-IT: Multiple Choice Questions on Multiple Topics in Italian (2024)

ITALIC Benchmark Results

Benchmark: ITALIC (NAACL 2025) — Italian Culture-Aware Natural Language Benchmark
Format: Zero-shot, multiple-choice (12 categories, 10,000 questions)
System prompt: "Sei un assistente utile."
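A zero-shot multiple-choice evaluation of this kind could be scored roughly as follows. This is a minimal sketch, not the ITALIC evaluation harness: the helper names, the instruction line appended to each question, the answer-extraction rule, and the sample responses are all assumptions made for illustration. Only the system prompt comes from this card.

```python
import re

SYSTEM_PROMPT = "Sei un assistente utile."  # system prompt stated in this card


def build_messages(question, options):
    """Format one multiple-choice item as zero-shot chat messages."""
    letters = "ABCDEFGH"
    lines = [question]
    lines += [f"{letters[i]}) {option}" for i, option in enumerate(options)]
    # ASSUMPTION: the instruction wording is illustrative, not the benchmark's.
    lines += ["", "Rispondi con la lettera della risposta corretta."]
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "\n".join(lines)},
    ]


def extract_letter(response, num_options):
    """Return the first standalone option letter in the response, or None."""
    valid = "ABCDEFGH"[:num_options]
    match = re.search(rf"\b([{valid}])\b", response.strip().upper())
    return match.group(1) if match else None


def accuracy(predictions, gold):
    """Percentage of predictions that equal the gold letter."""
    correct = sum(pred == ref for pred, ref in zip(predictions, gold))
    return 100.0 * correct / len(gold)


messages = build_messages(
    "Qual è la capitale d'Italia?", ["Milano", "Roma", "Napoli", "Torino"]
)

# Hypothetical model responses and gold answers, for illustration only.
responses = ["B", "B) Roma", "La risposta corretta è C.", "Non lo so"]
gold = ["B", "B", "A", "D"]

predictions = [extract_letter(r, num_options=4) for r in responses]
print(predictions)
print(f"accuracy: {accuracy(predictions, gold):.2f}%")
```

A single-letter regex is a simple heuristic: a free-form answer containing the Italian preposition "a" would be read as option A, so a real harness would likely constrain the output format more tightly.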

V1 vs Baseline

| Category | Baseline (%) | V1 (%) | Δ (pp) |
| --- | --- | --- | --- |
| Art | 70.10 | 71.31 | +1.21 |
| Civic | 71.22 | 77.29 | +6.07 |
| Events | 82.61 | 82.61 | 0.00 |
| Geography | 79.26 | 80.90 | +1.64 |
| History | 77.40 | 79.28 | +1.88 |
| Literature | 67.17 | 71.24 | +4.07 |
| Tourism | 71.73 | 72.04 | +0.31 |
| Lexicon | 81.51 | 83.76 | +2.25 |
| Morphology | 52.14 | 60.43 | +8.29 |
| Orthography | 53.04 | 60.07 | +7.03 |
| Synonyms | 81.15 | 89.91 | +8.76 |
| Syntax | 53.65 | 54.31 | +0.66 |
| Culture (subtotal) | 72.96 | 75.45 | +2.49 |
| Language (subtotal) | 66.83 | 71.63 | +4.80 |
| Total | 70.49 | 73.91 | +3.42 |
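The per-category deltas can be recomputed and ranked with a few lines of Python. The scores below are copied from the table; the variable names are arbitrary, and the subtotals are left out because the per-category question counts needed to weight them are not given here.

```python
# (baseline %, V1 %) per ITALIC category, copied from the table above.
scores = {
    "Art": (70.10, 71.31),
    "Civic": (71.22, 77.29),
    "Events": (82.61, 82.61),
    "Geography": (79.26, 80.90),
    "History": (77.40, 79.28),
    "Literature": (67.17, 71.24),
    "Tourism": (71.73, 72.04),
    "Lexicon": (81.51, 83.76),
    "Morphology": (52.14, 60.43),
    "Orthography": (53.04, 60.07),
    "Synonyms": (81.15, 89.91),
    "Syntax": (53.65, 54.31),
}

# Gain in percentage points, rounded to the table's two decimals.
deltas = {name: round(v1 - base, 2) for name, (base, v1) in scores.items()}

# Categories ordered from largest to smallest gain.
ranked = sorted(deltas, key=deltas.get, reverse=True)
top_gains = ranked[:4]

improved = [name for name, delta in deltas.items() if delta > 0]
unchanged = [name for name, delta in deltas.items() if delta == 0]

# Categories with the lowest V1 accuracy after fine-tuning.
weakest = sorted(scores, key=lambda name: scores[name][1])[:3]

for name in ranked:
    print(f"{name:<12} {deltas[name]:+.2f} pp")
print("weakest after fine-tuning:", weakest)
```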

Comparison with Other Models (ITALIC Total)

| Model | Total | Parameters |
| --- | --- | --- |
| Llama 3.1 70B | 83.61% | 70B |
| GPT-4o Mini | 82.22% | Undisclosed (~8B is an unofficial estimate) |
| Magistral Small (No Thinking) | 76.06% | 24B |
| Llama 3.1 8B Ita [V1] | 73.91% | 8B |
| Qwen3 8B (No Thinking) [V3] | 73.81% | 8B |
| Qwen3 8B (No Thinking) [V1] | 73.77% | 8B |
| Llama 3.1 8B Ita (baseline) | 70.49% | 8B |
| Qwen3 8B (No Thinking) baseline | 70.17% | 8B |
| LLaMAntino-3 8B | 68.37% | 8B |
| Llama 3.1 8B | 66.38% | 8B |

All scores were obtained under identical zero-shot conditions on the ITALIC benchmark.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "DeepMount00/Llama-3.1-8b-ITA"
adapter_id = "m-beps/llama31-8b-finetune-multit"  # Hub repository ID of this adapter

# Load tokeniser and base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# Example: Italian multiple-choice question
messages = [
    {"role": "system", "content": "Sei un assistente utile."},
    {
        "role": "user",
        "content": (
            "Qual è la capitale d'Italia?\n"
            "A) Milano\nB) Roma\nC) Napoli\nD) Torino\n\n"
            "Rispondi con la lettera della risposta corretta."
        ),
    },
]

# Apply LLaMA-3 chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        temperature=None,
        top_p=None,
    )

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(response)
# Expected output: "B"

Limitations

- Weakest categories: Syntax (54.31%), Orthography (60.07%), and Morphology (60.43%) remain the lowest-scoring categories despite improvement.
- Benchmark scope: evaluation was conducted solely on ITALIC; performance on other Italian benchmarks is unverified.
- Single-GPU training: training used one RTX 3090; multi-GPU configurations may yield different results.
- Dataset bias: Mult-IT is a multiple-choice dataset; generalisation to open-ended Italian generation tasks is unverified.
- Events: this category showed no improvement (0.00 pp), suggesting the training data may lack current-events coverage.

References

Related resources:

- Research report: Alignment in Large Language Models
- Base model: DeepMount00/Llama-3.1-8b-ITA
- ITALIC benchmark: RiTA-nlp/ITALIC
- Mult-IT dataset: sapienzanlp/Mult-IT
- PEFT documentation: huggingface.co/docs/peft
