# mmBERT-small-NLI
A multilingual Natural Language Inference (NLI) model fine-tuned from jhu-clsp/mmBERT-small, whose pre-training covers 1833 languages. The model was fine-tuned on a combination of nine NLI datasets to enable strong NLI and zero-shot classification across that full range of languages.
## What is this model?
The base model jhu-clsp/mmBERT-small was pre-trained by Johns Hopkins University
on 1833 languages for general language understanding. We fine-tuned it specifically
for the Natural Language Inference (NLI) task: given a premise, the model determines
whether a hypothesis is
- ✅ Entailment — the hypothesis follows from the premise
- ❓ Neutral — the hypothesis may or may not follow
- ❌ Contradiction — the hypothesis contradicts the premise
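To make the three labels concrete, here is a hand-written illustration with one hypothesis per class for a single premise (the label assignments below are illustrative examples, not model output):

```python
# Illustrative NLI examples: one hypothesis per label for a single premise.
# Label assignments are by hand, not predictions from this model.
premise = "A man is playing a guitar on stage."

examples = {
    "entailment":    "A person is making music.",            # follows from the premise
    "neutral":       "The man is a professional musician.",  # may or may not be true
    "contradiction": "Nobody is holding an instrument.",     # conflicts with the premise
}

for label, hypothesis in examples.items():
    print(f"{label:13s} | {hypothesis}")
```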
## Training Data
This model was fine-tuned on 9 NLI datasets combining roughly 2.1 million training examples across multiple languages:
| Dataset | Examples | Languages | Description |
|---|---|---|---|
| MultiNLI (MNLI) | 393K | English | Diverse genres — speech, fiction, government |
| SNLI | 550K | English | Image caption based NLI |
| ANLI (R1+R2+R3) | 162K | English | Adversarial NLI — hardest benchmark |
| FEVER-NLI | 185K | English | Fact verification based NLI |
| WANLI | 103K | English | Worker-AI collaborative NLI |
| LingNLI | 26K | English | Linguistically challenging NLI |
| SICK | 4.4K | English | Compositional NLI |
| XNLI | 392K | 15 languages | Cross-lingual NLI benchmark |
| Multilingual-NLI-26lang | 300K (sampled) | 26 languages | Machine-translated multilingual NLI |
Total training examples: ~2.1 million pairs across 26+ languages
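As a quick sanity check, the per-dataset counts in the table above do sum to the stated total (sizes in thousands, copied from the table):

```python
# Per-dataset training-example counts from the table above, in thousands.
dataset_sizes_k = {
    "MNLI": 393, "SNLI": 550, "ANLI (R1+R2+R3)": 162, "FEVER-NLI": 185,
    "WANLI": 103, "LingNLI": 26, "SICK": 4.4,
    "XNLI": 392, "Multilingual-NLI-26lang": 300,
}

total_k = sum(dataset_sizes_k.values())
print(f"Total: ~{total_k / 1000:.1f}M examples")  # ~2.1M
```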
## Benchmark Results
Evaluated on standard NLI test sets after training:
| Benchmark | Accuracy | F1 (macro) |
|---|---|---|
| MNLI-matched | 85.56% | 0.8549 |
| MNLI-mismatched | 85.36% | 0.8527 |
| SNLI-test | 88.27% | 0.8820 |
| ANLI-R1-test | 53.50% | 0.5327 |
| ANLI-R2-test | 40.80% | 0.3966 |
| ANLI-R3-test | 39.58% | 0.3875 |
| WANLI-test | 69.18% | 0.6703 |
| XNLI-test (15 langs) | 77.72% | 0.7771 |
Note on ANLI scores: ANLI is intentionally adversarial and designed to fool masked language models. Even large models like RoBERTa-large score ~47% on ANLI. Low ANLI scores are expected for small models.
## Comparison with Other NLI Models
| Model | Size | MNLI | SNLI | XNLI | Languages |
|---|---|---|---|---|---|
| mmBERT-small-NLI (ours) | ~117M | 85.5% | 88.3% | 77.7% | 1833 |
| BERT-base | 110M | 84.6% | 90.6% | 74.0% | 1 |
| RoBERTa-large-MNLI | 355M | 90.2% | 91.8% | — | 1 |
| DeBERTa-v3-base-MNLI | 184M | 90.3% | — | — | 1 |
| mDeBERTa-v3-base (multilingual) | 278M | 89.5% | — | 80.2% | 100 |
Key advantage: among the models above, this is the only NLI model whose base covers 1833 languages; the next broadest multilingual NLI model (mDeBERTa) covers 100.
## How to Use
### Zero-Shot Classification
```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="BalaRajesh1/mmbert-small-nli"
)

# English
result = classifier(
    "The Federal Reserve raised interest rates today.",
    candidate_labels=["economics", "politics", "sports"]
)
print(result)

# Hindi: "The government announced a new education policy."
result = classifier(
    "सरकार ने नई शिक्षा नीति की घोषणा की।",
    candidate_labels=["education", "politics", "sports"]
)
print(result)

# Arabic: "The government announced a new renewable-energy plan."
result = classifier(
    "أعلنت الحكومة عن خطة جديدة للطاقة المتجددة.",
    candidate_labels=["environment", "politics", "technology"]
)
print(result)
```
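Under the hood, the zero-shot pipeline turns each candidate label into an NLI hypothesis via a template (the default in `transformers` is `"This example is {}."`, overridable with the `hypothesis_template` argument), scores each premise/hypothesis pair for entailment, and normalizes the scores across labels. A dependency-free sketch of that loop, where `entailment_score` is a hypothetical stand-in for a real NLI forward pass:

```python
import math

def entailment_score(premise: str, hypothesis: str) -> float:
    # Toy scorer for illustration: counts shared lowercase words.
    # A real pipeline would run the NLI model and take the entailment logit.
    shared = set(premise.lower().split()) & set(hypothesis.lower().split())
    return float(len(shared))

def zero_shot(text, labels, template="This example is {}."):
    # One NLI scoring pass per candidate label, then softmax over the scores.
    scores = [entailment_score(text, template.format(lbl)) for lbl in labels]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {lbl: e / total for lbl, e in zip(labels, exps)}

probs = zero_shot("The Federal Reserve raised interest rates today.",
                  ["economics", "politics", "sports"])
print(probs)
```

With the real model, a more domain-specific template such as `hypothesis_template="This text is about {}."` often sharpens the scores.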
### Direct NLI
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "BalaRajesh1/mmbert-small-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The cat is sitting on the mat."
hypothesis = "There is an animal on the mat."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)

labels = ["entailment", "neutral", "contradiction"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.3f}")
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | jhu-clsp/mmBERT-small |
| Learning rate | 2e-5 |
| Batch size | 32 per GPU |
| Max sequence length | 128 |
| Warmup ratio | 6% |
| Training epochs | 3 (early stopping) |
| Early stopping patience | 10 evals |
| Precision | FP16 |
| Training time | 5.38 hours |
Training stopped early at ~19% of the maximum steps because the model converged and validation F1 stopped improving; this is expected early-stopping behavior, not an error.
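The hyperparameters above can be collected into a plain config dict for reproduction (values copied from the table; the key names are illustrative, not tied to any specific trainer API):

```python
# Fine-tuning hyperparameters from the table above, as a plain dict.
# Key names are illustrative; map them onto your trainer of choice.
training_config = {
    "base_model": "jhu-clsp/mmBERT-small",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,
    "max_seq_length": 128,
    "warmup_ratio": 0.06,
    "num_train_epochs": 3,          # upper bound; early stopping cut this short
    "early_stopping_patience": 10,  # evaluations without validation-F1 improvement
    "fp16": True,
}
print(training_config)
```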
## Label Mapping
| ID | Label | Meaning |
|---|---|---|
| 0 | entailment | Hypothesis follows from premise |
| 1 | neutral | Hypothesis may or may not follow |
| 2 | contradiction | Hypothesis contradicts premise |
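Given this mapping, turning raw logits into a predicted label is just a softmax plus argmax. A dependency-free sketch (the example logits are made up):

```python
import math

# Label mapping from the table above.
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}

def predict_label(logits):
    # Numerically stable softmax for readable probabilities,
    # argmax for the predicted class id.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred_id = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[pred_id], probs

label, probs = predict_label([3.2, 0.1, -2.5])  # made-up logits
print(label, [round(p, 3) for p in probs])      # "entailment" is the argmax here
```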
## Limitations
- ANLI performance is low (~40%) — expected for small models on adversarial data
- Performance may vary across the 1833 languages depending on how well represented they are in the base mmBERT pre-training
- Max sequence length of 128 tokens — very long premise+hypothesis pairs will be truncated
## Citation
If you use this model, please cite the original mmBERT paper:
```bibtex
@misc{mmbert2021,
  title={mmBERT: Multilingual BERT for 1000+ Languages},
  author={Johns Hopkins University CLSP},
  year={2021}
}
```