XLM-RoBERTa Base Fine-tuned on XNLI for Natural Language Inference

This model is a fine-tuned version of FacebookAI/xlm-roberta-base for 3-way Natural Language Inference using the facebook/xnli dataset.

The model predicts the logical relationship between a premise and a hypothesis:

| Label ID | Label         |
|----------|---------------|
| 0        | entailment    |
| 1        | neutral       |
| 2        | contradiction |

Model Details

  • Base model: FacebookAI/xlm-roberta-base
  • Architecture: XLM-RoBERTa for sequence classification
  • Task: Natural Language Inference / Textual Entailment
  • Number of labels: 3
  • Parameters: 278,045,955
  • Tokenizer: Fast XLM-RoBERTa tokenizer
  • Maximum sequence length: 512
  • Fine-tuning dataset: facebook/xnli
  • Dataset config used: all_languages

Although the all_languages configuration was loaded, the training and validation/test preprocessing selected the English text whenever a multilingual translation dictionary was present. The primary fine-tuning run is therefore best described as English XNLI fine-tuning from the XNLI all_languages source, with additional per-language evaluation performed afterward.
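For illustration, the sketch below shows one way this English selection can be implemented. It assumes the standard facebook/xnli all_languages schema (premise as a language-to-text dictionary, hypothesis as parallel language/translation lists) and is not the exact notebook code.

def select_english(example):
    # premise in "all_languages" is a dict of language code -> text
    premise = example["premise"]
    if isinstance(premise, dict):
        premise = premise["en"]

    # hypothesis is a translation dict with parallel "language"/"translation" lists
    hypothesis = example["hypothesis"]
    if isinstance(hypothesis, dict) and "translation" in hypothesis:
        idx = hypothesis["language"].index("en")
        hypothesis = hypothesis["translation"][idx]

    return {"premise": premise, "hypothesis": hypothesis}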

Intended Use

This model is intended for Natural Language Inference. Given a premise and a hypothesis, it predicts whether the hypothesis is:

  • entailed by the premise,
  • neutral with respect to the premise, or
  • contradicted by the premise.

Example use cases include:

  • entailment detection,
  • contradiction detection,
  • claim verification pipelines,
  • semantic consistency checking,
  • zero-shot-style classification pipelines that rely on NLI (see the sketch below).
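As a concrete illustration of the last use case, the sketch below wraps the model in a simple NLI-based zero-shot classifier. The candidate labels and hypothesis template are illustrative assumptions; only the model id and the label order come from this card.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "AyoubChLin/xlm-roberta-base-xnli-nli"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

def zero_shot(text, candidate_labels, template="This text is about {}."):
    # Score each candidate label by the entailment probability of the pair
    # (text, hypothesis) and return the label with the highest score.
    scores = []
    for label in candidate_labels:
        inputs = tokenizer(text, template.format(label),
                           return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
        scores.append(probs[0].item())  # index 0 = entailment (see label table above)
    best = max(range(len(candidate_labels)), key=scores.__getitem__)
    return candidate_labels[best], scores

print(zero_shot("The match went to extra time before the home side won.",
                ["sports", "finance", "politics"]))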

Training Data

The model was fine-tuned on the facebook/xnli dataset.
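For reference, the dataset can be loaded with the datasets library as follows (a minimal sketch; the split names are the standard XNLI ones):

from datasets import load_dataset

# Multilingual XNLI source used for fine-tuning.
xnli = load_dataset("facebook/xnli", "all_languages")
print(xnli)  # DatasetDict with "train", "validation" and "test" splits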

Dataset split sizes used in the notebook:

| Split      | Rows    |
|------------|---------|
| Train      | 392,702 |
| Validation | 2,490   |
| Test       | 5,010   |

The dataset is balanced across the three labels.

Training split label distribution:

| Label         | Count   |
|---------------|---------|
| entailment    | 130,899 |
| neutral       | 130,900 |
| contradiction | 130,903 |

Validation split label distribution:

| Label         | Count |
|---------------|-------|
| entailment    | 830   |
| neutral       | 830   |
| contradiction | 830   |

Test split label distribution:

| Label         | Count |
|---------------|-------|
| entailment    | 1,670 |
| neutral       | 1,670 |
| contradiction | 1,670 |

Training Procedure

The model was fine-tuned using Hugging Face Trainer.

Hyperparameters

| Hyperparameter              | Value    |
|-----------------------------|----------|
| Epochs                      | 3        |
| Learning rate               | 2e-5     |
| Train batch size            | 64       |
| Eval batch size             | 64       |
| Gradient accumulation steps | 1        |
| Weight decay                | 0.01     |
| Warmup ratio                | 0.06     |
| LR scheduler                | linear   |
| Max sequence length         | 512      |
| Max gradient norm           | 1.0      |
| Seed                        | 42       |
| Mixed precision             | bf16     |
| Optimizer                   | AdamW    |
| Metric for best model       | macro F1 |
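A minimal sketch of how these hyperparameters map onto Hugging Face TrainingArguments is shown below. The output directory, evaluation/save strategy, and the exact metric key are assumptions rather than values taken from the original notebook.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-xnli",   # assumed output path
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    gradient_accumulation_steps=1,
    weight_decay=0.01,
    warmup_ratio=0.06,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    seed=42,
    bf16=True,                            # mixed precision
    eval_strategy="epoch",                # assumed; could also be step-based
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",     # assumes compute_metrics returns this key
)
# AdamW is the Trainer's default optimizer, so it does not need to be set explicitly.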

Runtime Environment

| Component    | Value                 |
|--------------|-----------------------|
| Python       | 3.12.6                |
| PyTorch      | 2.8.0+cu129           |
| Transformers | 4.56.0                |
| Datasets     | 4.8.5                 |
| GPU          | NVIDIA A100-SXM4-40GB |

Evaluation Results

Validation Set

| Metric          | Score  |
|-----------------|--------|
| Loss            | 0.4421 |
| Accuracy        | 0.8349 |
| Macro F1        | 0.8358 |
| Weighted F1     | 0.8358 |
| Macro Precision | 0.8411 |
| Macro Recall    | 0.8349 |

Test Set

| Metric          | Score  |
|-----------------|--------|
| Loss            | 0.4383 |
| Accuracy        | 0.8421 |
| Macro F1        | 0.8426 |
| Weighted F1     | 0.8426 |
| Macro Precision | 0.8481 |
| Macro Recall    | 0.8421 |

Test Classification Report

| Label         | Precision | Recall | F1     | Support |
|---------------|-----------|--------|--------|---------|
| entailment    | 0.9130    | 0.7850 | 0.8442 | 1,670   |
| neutral       | 0.7733    | 0.8659 | 0.8169 | 1,670   |
| contradiction | 0.8580    | 0.8754 | 0.8666 | 1,670   |
| accuracy      |           |        | 0.8421 | 5,010   |
| macro avg     | 0.8481    | 0.8421 | 0.8426 | 5,010   |
| weighted avg  | 0.8481    | 0.8421 | 0.8426 | 5,010   |
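The report above follows the scikit-learn classification_report layout. A minimal sketch of computing these metrics from test-set predictions is shown below; scikit-learn is an assumption, as it is not listed in the runtime table.

from sklearn.metrics import accuracy_score, classification_report, f1_score

label_names = ["entailment", "neutral", "contradiction"]

def report(y_true, y_pred):
    # y_true and y_pred are lists of integer label ids (0/1/2) over the test set.
    print("accuracy:", accuracy_score(y_true, y_pred))
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))
    print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
    print(classification_report(y_true, y_pred, target_names=label_names, digits=4))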

Per-Language Evaluation

The model was also evaluated on the 15 XNLI languages by retokenizing the validation/test pairs for each language.

| Language | Accuracy | Macro F1 | Weighted F1 |
|----------|----------|----------|-------------|
| en       | 0.8421   | 0.8426   | 0.8426      |
| es       | 0.7495   | 0.7454   | 0.7454      |
| fr       | 0.7489   | 0.7452   | 0.7452      |
| ru       | 0.7253   | 0.7162   | 0.7162      |
| de       | 0.7228   | 0.7135   | 0.7135      |
| bg       | 0.7130   | 0.7022   | 0.7022      |
| vi       | 0.6902   | 0.6751   | 0.6751      |
| zh       | 0.6926   | 0.6750   | 0.6750      |
| th       | 0.6727   | 0.6502   | 0.6502      |
| el       | 0.6727   | 0.6497   | 0.6497      |
| tr       | 0.6429   | 0.6149   | 0.6149      |
| hi       | 0.6100   | 0.5697   | 0.5697      |
| ar       | 0.6004   | 0.5491   | 0.5491      |
| ur       | 0.5569   | 0.4946   | 0.4946      |
| sw       | 0.5343   | 0.4724   | 0.4724      |

The strongest performance is on English, which is expected because the main training preprocessing selected English text from the multilingual examples.
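A minimal sketch of this per-language loop is shown below. It assumes the all_languages schema described earlier, loads the model the same way as in the How to Use section, and is not the exact notebook code.

from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "AyoubChLin/xlm-roberta-base-xnli-nli"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

test = load_dataset("facebook/xnli", "all_languages", split="test")

def eval_language(lang, batch_size=64):
    # Re-select the premise/hypothesis text for `lang`, retokenize, and score the test split.
    y_true, y_pred = [], []
    for i in range(0, len(test), batch_size):
        batch = test[i : i + batch_size]
        premises = [p[lang] for p in batch["premise"]]
        hypotheses = [h["translation"][h["language"].index(lang)] for h in batch["hypothesis"]]
        inputs = tokenizer(premises, hypotheses, return_tensors="pt",
                           padding=True, truncation=True, max_length=512)
        with torch.no_grad():
            preds = model(**inputs).logits.argmax(dim=-1)
        y_pred.extend(preds.tolist())
        y_true.extend(batch["label"])
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro")

for lang in ["en", "fr", "sw"]:  # subset for illustration
    acc, macro_f1 = eval_language(lang)
    print(f"{lang}: accuracy={acc:.4f}, macro F1={macro_f1:.4f}")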

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "AyoubChLin/xlm-roberta-base-xnli-nli"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# Encode the premise-hypothesis pair as a single sequence-pair input.
inputs = tokenizer(
    premise,
    hypothesis,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

# Run inference without gradient tracking and convert logits to probabilities.
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    pred_id = int(torch.argmax(probs, dim=-1))

# Label mapping, matching the table at the top of this card.
id2label = {
    0: "entailment",
    1: "neutral",
    2: "contradiction",
}

print(id2label[pred_id])
print(probs)