XLM-RoBERTa Base Fine-tuned on XNLI for Natural Language Inference

This model is a fine-tuned version of FacebookAI/xlm-roberta-base for 3-way Natural Language Inference using the facebook/xnli dataset.

The model predicts the logical relationship between a premise and a hypothesis:

| Label ID | Label         |
|----------|---------------|
| 0        | entailment    |
| 1        | neutral       |
| 2        | contradiction |

Model Details

  • Base model: FacebookAI/xlm-roberta-base
  • Architecture: XLM-RoBERTa for sequence classification
  • Task: Natural Language Inference / Textual Entailment
  • Number of labels: 3
  • Parameters: 278,045,955
  • Tokenizer: Fast XLM-RoBERTa tokenizer
  • Maximum sequence length: 512
  • Fine-tuning dataset: facebook/xnli
  • Dataset config used: all_languages

Although the all_languages configuration was loaded, the training and validation/test preprocessing selected the English text whenever a multilingual translation dictionary was present. The primary fine-tuning run is therefore best described as English XNLI fine-tuning from the XNLI all_languages source, with additional per-language evaluation performed afterward.
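For illustration, the sketch below shows one way this English selection can be implemented. It assumes the standard facebook/xnli all_languages schema (premise as a language-to-text dictionary, hypothesis as parallel language/translation lists) and is not the exact notebook code.

def select_english(example):
    # premise in "all_languages" is a dict of language code -> text
    premise = example["premise"]
    if isinstance(premise, dict):
        premise = premise["en"]

    # hypothesis is a translation dict with parallel "language"/"translation" lists
    hypothesis = example["hypothesis"]
    if isinstance(hypothesis, dict) and "translation" in hypothesis:
        idx = hypothesis["language"].index("en")
        hypothesis = hypothesis["translation"][idx]

    return {"premise": premise, "hypothesis": hypothesis}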

Intended Use

This model is intended for Natural Language Inference. Given a premise and a hypothesis, it predicts whether the hypothesis is:

  • entailed by the premise,
  • neutral with respect to the premise, or
  • contradicted by the premise.

Example use cases include:

  • entailment detection,
  • contradiction detection,
  • claim verification pipelines,
  • semantic consistency checking,
  • zero-shot-style classification pipelines that rely on NLI (see the sketch below).
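As a concrete illustration of the last use case, the sketch below wraps the model in a simple NLI-based zero-shot classifier. The candidate labels and hypothesis template are illustrative assumptions; only the model id and the label order come from this card.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "AyoubChLin/xlm-roberta-base-xnli-nli"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

def zero_shot(text, candidate_labels, template="This text is about {}."):
    # Score each candidate label by the entailment probability of the pair
    # (text, hypothesis) and return the label with the highest score.
    scores = []
    for label in candidate_labels:
        inputs = tokenizer(text, template.format(label),
                           return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
        scores.append(probs[0].item())  # index 0 = entailment (see label table above)
    best = max(range(len(candidate_labels)), key=scores.__getitem__)
    return candidate_labels[best], scores

print(zero_shot("The match went to extra time before the home side won.",
                ["sports", "finance", "politics"]))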

Training Data

The model was fine-tuned on the facebook/xnli dataset.
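For reference, the dataset can be loaded with the datasets library as follows (a minimal sketch; the split names are the standard XNLI ones):

from datasets import load_dataset

# Multilingual XNLI source used for fine-tuning.
xnli = load_dataset("facebook/xnli", "all_languages")
print(xnli)  # DatasetDict with "train", "validation" and "test" splits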

Dataset split sizes used in the notebook:

| Split      | Rows    |
|------------|---------|
| Train      | 392,702 |
| Validation | 2,490   |
| Test       | 5,010   |

The dataset is balanced across the three labels.

Training split label distribution:

| Label         | Count   |
|---------------|---------|
| entailment    | 130,899 |
| neutral       | 130,900 |
| contradiction | 130,903 |

Validation split label distribution:

| Label         | Count |
|---------------|-------|
| entailment    | 830   |
| neutral       | 830   |
| contradiction | 830   |

Test split label distribution:

| Label         | Count |
|---------------|-------|
| entailment    | 1,670 |
| neutral       | 1,670 |
| contradiction | 1,670 |

Training Procedure

The model was fine-tuned using Hugging Face Trainer.

Hyperparameters

| Hyperparameter              | Value    |
|-----------------------------|----------|
| Epochs                      | 3        |
| Learning rate               | 2e-5     |
| Train batch size            | 64       |
| Eval batch size             | 64       |
| Gradient accumulation steps | 1        |
| Weight decay                | 0.01     |
| Warmup ratio                | 0.06     |
| LR scheduler                | linear   |
| Max sequence length         | 512      |
| Max gradient norm           | 1.0      |
| Seed                        | 42       |
| Mixed precision             | bf16     |
| Optimizer                   | AdamW    |
| Metric for best model       | macro F1 |
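A minimal sketch of how these hyperparameters map onto Hugging Face TrainingArguments is shown below. The output directory, evaluation/save strategy, and the exact metric key are assumptions rather than values taken from the original notebook.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-xnli",   # assumed output path
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    gradient_accumulation_steps=1,
    weight_decay=0.01,
    warmup_ratio=0.06,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    seed=42,
    bf16=True,                            # mixed precision
    eval_strategy="epoch",                # assumed; could also be step-based
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",     # assumes compute_metrics returns this key
)
# AdamW is the Trainer's default optimizer, so it does not need to be set explicitly.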

Runtime Environment

| Component    | Value                 |
|--------------|-----------------------|
| Python       | 3.12.6                |
| PyTorch      | 2.8.0+cu129           |
| Transformers | 4.56.0                |
| Datasets     | 4.8.5                 |
| GPU          | NVIDIA A100-SXM4-40GB |

Evaluation Results

Validation Set

| Metric          | Score  |
|-----------------|--------|
| Loss            | 0.4421 |
| Accuracy        | 0.8349 |
| Macro F1        | 0.8358 |
| Weighted F1     | 0.8358 |
| Macro Precision | 0.8411 |
| Macro Recall    | 0.8349 |

Test Set

| Metric          | Score  |
|-----------------|--------|
| Loss            | 0.4383 |
| Accuracy        | 0.8421 |
| Macro F1        | 0.8426 |
| Weighted F1     | 0.8426 |
| Macro Precision | 0.8481 |
| Macro Recall    | 0.8421 |

Test Classification Report

| Label         | Precision | Recall | F1     | Support |
|---------------|-----------|--------|--------|---------|
| entailment    | 0.9130    | 0.7850 | 0.8442 | 1,670   |
| neutral       | 0.7733    | 0.8659 | 0.8169 | 1,670   |
| contradiction | 0.8580    | 0.8754 | 0.8666 | 1,670   |
| accuracy      |           |        | 0.8421 | 5,010   |
| macro avg     | 0.8481    | 0.8421 | 0.8426 | 5,010   |
| weighted avg  | 0.8481    | 0.8421 | 0.8426 | 5,010   |
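The report above follows the scikit-learn classification_report layout. A minimal sketch of computing these metrics from test-set predictions is shown below; scikit-learn is an assumption, as it is not listed in the runtime table.

from sklearn.metrics import accuracy_score, classification_report, f1_score

label_names = ["entailment", "neutral", "contradiction"]

def report(y_true, y_pred):
    # y_true and y_pred are lists of integer label ids (0/1/2) over the test set.
    print("accuracy:", accuracy_score(y_true, y_pred))
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))
    print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
    print(classification_report(y_true, y_pred, target_names=label_names, digits=4))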

Per-Language Evaluation

The model was also evaluated on the 15 XNLI languages by retokenizing the validation/test pairs for each language.

| Language | Accuracy | Macro F1 | Weighted F1 |
|----------|----------|----------|-------------|
| en       | 0.8421   | 0.8426   | 0.8426      |
| es       | 0.7495   | 0.7454   | 0.7454      |
| fr       | 0.7489   | 0.7452   | 0.7452      |
| ru       | 0.7253   | 0.7162   | 0.7162      |
| de       | 0.7228   | 0.7135   | 0.7135      |
| bg       | 0.7130   | 0.7022   | 0.7022      |
| vi       | 0.6902   | 0.6751   | 0.6751      |
| zh       | 0.6926   | 0.6750   | 0.6750      |
| th       | 0.6727   | 0.6502   | 0.6502      |
| el       | 0.6727   | 0.6497   | 0.6497      |
| tr       | 0.6429   | 0.6149   | 0.6149      |
| hi       | 0.6100   | 0.5697   | 0.5697      |
| ar       | 0.6004   | 0.5491   | 0.5491      |
| ur       | 0.5569   | 0.4946   | 0.4946      |
| sw       | 0.5343   | 0.4724   | 0.4724      |

The strongest performance is on English, which is expected because the main training preprocessing selected English text from the multilingual examples.
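A minimal sketch of this per-language loop is shown below. It assumes the all_languages schema described earlier, loads the model the same way as in the How to Use section, and is not the exact notebook code.

from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "AyoubChLin/xlm-roberta-base-xnli-nli"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

test = load_dataset("facebook/xnli", "all_languages", split="test")

def eval_language(lang, batch_size=64):
    # Re-select the premise/hypothesis text for `lang`, retokenize, and score the test split.
    y_true, y_pred = [], []
    for i in range(0, len(test), batch_size):
        batch = test[i : i + batch_size]
        premises = [p[lang] for p in batch["premise"]]
        hypotheses = [h["translation"][h["language"].index(lang)] for h in batch["hypothesis"]]
        inputs = tokenizer(premises, hypotheses, return_tensors="pt",
                           padding=True, truncation=True, max_length=512)
        with torch.no_grad():
            preds = model(**inputs).logits.argmax(dim=-1)
        y_pred.extend(preds.tolist())
        y_true.extend(batch["label"])
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro")

for lang in ["en", "fr", "sw"]:  # subset for illustration
    acc, macro_f1 = eval_language(lang)
    print(f"{lang}: accuracy={acc:.4f}, macro F1={macro_f1:.4f}")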

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "AyoubChLin/xlm-roberta-base-xnli-nli"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# Encode the premise-hypothesis pair as a single sequence-pair input.
inputs = tokenizer(
    premise,
    hypothesis,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

# Run inference without gradient tracking and convert logits to probabilities.
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    pred_id = int(torch.argmax(probs, dim=-1))

# Label mapping, matching the table at the top of this card.
id2label = {
    0: "entailment",
    1: "neutral",
    2: "contradiction",
}

print(id2label[pred_id])
print(probs)