mmBERT-small-NLI

A multilingual Natural Language Inference (NLI) model fine-tuned from jhu-clsp/mmBERT-small, which supports 1833 languages. This model was fine-tuned on a comprehensive combination of 9 NLI datasets to enable strong NLI and zero-shot classification across a massive range of languages.

What is this model?

The base model jhu-clsp/mmBERT-small was pre-trained by Johns Hopkins University on 1833 languages for general language understanding. We fine-tuned it specifically for the Natural Language Inference (NLI) task — teaching it to classify the relationship between a premise and a hypothesis as one of:

  • Entailment — the hypothesis follows from the premise
  • Neutral — the hypothesis may or may not follow
  • Contradiction — the hypothesis contradicts the premise
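To make the three labels concrete, here is a small hand-written illustration (these pairs are invented for this card, not drawn from the training data):

```python
# Illustrative premise/hypothesis pairs for each NLI label
# (hand-written examples, not drawn from the training data).
premise = "The cat is sleeping on the sofa."

examples = {
    "entailment":    "An animal is resting indoors.",       # follows from the premise
    "neutral":       "The cat belongs to a young family.",  # may or may not be true
    "contradiction": "The sofa is empty.",                  # conflicts with the premise
}

for label, hypothesis in examples.items():
    print(f"{label:13s} {premise!r} -> {hypothesis!r}")
```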

Training Data

This model was fine-tuned on 9 NLI datasets combining roughly 2.1 million training examples across multiple languages:

| Dataset | Examples | Languages | Description |
|---|---|---|---|
| MultiNLI (MNLI) | 393K | English | Diverse genres: speech, fiction, government |
| SNLI | 550K | English | Image-caption-based NLI |
| ANLI (R1+R2+R3) | 162K | English | Adversarial NLI (hardest benchmark) |
| FEVER-NLI | 185K | English | Fact-verification-based NLI |
| WANLI | 103K | English | Worker-AI collaborative NLI |
| LingNLI | 26K | English | Linguistically challenging NLI |
| SICK | 4.4K | English | Compositional NLI |
| XNLI | 392K | 15 languages | Cross-lingual NLI benchmark |
| Multilingual-NLI-26lang | 300K (sampled) | 26 languages | Machine-translated multilingual NLI |

Total training examples: ~2.1 million pairs across 26+ languages

Benchmark Results

Evaluated on standard NLI test sets after training:

| Benchmark | Accuracy | F1 (macro) |
|---|---|---|
| MNLI-matched | 85.56% | 0.8549 |
| MNLI-mismatched | 85.36% | 0.8527 |
| SNLI-test | 88.27% | 0.8820 |
| ANLI-R1-test | 53.50% | 0.5327 |
| ANLI-R2-test | 40.80% | 0.3966 |
| ANLI-R3-test | 39.58% | 0.3875 |
| WANLI-test | 69.18% | 0.6703 |
| XNLI-test (15 langs) | 77.72% | 0.7771 |

Note on ANLI scores: ANLI is intentionally adversarial — its examples were written specifically to fool strong NLI models. Even large models like RoBERTa-large score ~47% on ANLI, so low ANLI scores are expected for a small model.

Comparison with Other NLI Models

| Model | Size | MNLI | SNLI | XNLI | Languages |
|---|---|---|---|---|---|
| mmBERT-small-NLI (ours) | ~117M | 85.5% | 88.3% | 77.7% | 1833 |
| BERT-base | 110M | 84.6% | 90.6% | 74.0% | 1 |
| RoBERTa-large-MNLI | 355M | 90.2% | 91.8% | – | 1 |
| DeBERTa-v3-base-MNLI | 184M | 90.3% | – | – | 1 |
| mDeBERTa-v3-base (multilingual) | 278M | 89.5% | – | 80.2% | 100 |

Key advantage: To our knowledge, this is the only NLI model covering 1833 languages; the next-broadest multilingual NLI model in the table (mDeBERTa) covers 100.

How to Use

Zero-Shot Classification

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="BalaRajesh1/mmbert-small-nli"
)

# English
result = classifier(
    "The Federal Reserve raised interest rates today.",
    candidate_labels=["economics", "politics", "sports"]
)
print(result)

# Hindi: "The government announced a new education policy."
result = classifier(
    "सरकार ने नई शिक्षा नीति की घोषणा की।",
    candidate_labels=["education", "politics", "sports"]
)
print(result)

# Arabic: "The government announced a new renewable-energy plan."
result = classifier(
    "أعلنت الحكومة عن خطة جديدة للطاقة المتجددة.",
    candidate_labels=["environment", "politics", "technology"]
)
print(result)
```
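Under the hood, the zero-shot pipeline recasts each candidate label as an NLI hypothesis (transformers defaults to the template "This example is {}.") and, roughly, ranks labels by their entailment probability. A minimal sketch of that reduction, with a stub word-overlap scorer standing in for the real model so it runs self-contained:

```python
import math

def tokens(s):
    """Lowercased word set with basic punctuation stripped."""
    return {w.strip(".,!?").lower() for w in s.split()}

def entailment_logit(premise, hypothesis):
    """Stub standing in for the NLI model's entailment logit.
    It simply rewards word overlap so the sketch runs without the model."""
    return float(len(tokens(premise) & tokens(hypothesis)))

def zero_shot(text, candidate_labels, template="This example is {}."):
    # Each candidate label becomes an NLI hypothesis scored against the text.
    logits = [entailment_logit(text, template.format(lbl))
              for lbl in candidate_labels]
    exps = [math.exp(x) for x in logits]  # softmax over candidate labels
    return {lbl: e / sum(exps) for lbl, e in zip(candidate_labels, exps)}

scores = zero_shot("Interest rates and inflation dominated the economics news.",
                   ["economics", "sports"])
print(scores)
```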

Direct NLI

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "BalaRajesh1/mmbert-small-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The cat is sitting on the mat."
hypothesis = "There is an animal on the mat."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
labels = ["entailment", "neutral", "contradiction"]  # matches the label mapping below
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.3f}")
```
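The same post-processing can be spelled out without torch. Assuming made-up example logits of [3.2, 0.1, -1.5] (not real model outputs), mapping the argmax through the id-to-label table gives the prediction:

```python
import math

# Label order used by this model (see the Label Mapping section).
id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}

logits = [3.2, 0.1, -1.5]  # made-up logits for illustration

# Softmax: exponentiate (shifted by the max for numerical stability), normalize.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
probs = [e / sum(exps) for e in exps]

pred = id2label[max(range(len(probs)), key=probs.__getitem__)]
print(pred, [round(p, 3) for p in probs])
```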

Training Details

| Parameter | Value |
|---|---|
| Base model | jhu-clsp/mmBERT-small |
| Learning rate | 2e-5 |
| Batch size | 32 per GPU |
| Max sequence length | 128 tokens |
| Warmup ratio | 6% |
| Training epochs | 3 (with early stopping) |
| Early stopping patience | 10 evals |
| Precision | FP16 |
| Training time | 5.38 hours |

Training was stopped early at ~19% of maximum steps because the model converged and validation F1 stopped improving — this is expected behavior, not an error.
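The early-stopping behavior described above amounts to a simple patience loop over validation F1 scores. A sketch of that logic (the F1 trajectory is invented, and the patience is shortened to 3 to keep the example small; the actual run used patience 10):

```python
def train_with_early_stopping(eval_f1s, patience=3):
    """Stop once validation F1 has not improved for `patience` evals.
    Returns the number of evals actually run and the best F1 seen."""
    best, since_best, steps = float("-inf"), 0, 0
    for f1 in eval_f1s:
        steps += 1
        if f1 > best:
            best, since_best = f1, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # converged: no improvement for `patience` evals
    return steps, best

# Invented eval-F1 trajectory: improves, then plateaus.
f1s = [0.70, 0.78, 0.83, 0.85, 0.85, 0.84, 0.85]
print(train_with_early_stopping(f1s))  # stops before consuming all evals
```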

Label Mapping

| ID | Label | Meaning |
|---|---|---|
| 0 | entailment | Hypothesis follows from the premise |
| 1 | neutral | Hypothesis may or may not follow |
| 2 | contradiction | Hypothesis contradicts the premise |

Limitations

  • ANLI performance is low (~40%) — expected for small models on adversarial data
  • Performance may vary across the 1833 languages depending on how well represented they are in the base mmBERT pre-training
  • Max sequence length of 128 tokens — very long premise+hypothesis pairs will be truncated

Citation

If you use this model, please cite the original mmBERT paper:

```bibtex
@misc{mmbert2021,
  title={mmBERT: Multilingual BERT for 1000+ Languages},
  author={Johns Hopkins University CLSP},
  year={2021}
}
```