Scientific Evidence Relevance Classifier

A lightweight scikit-learn `HistGradientBoostingClassifier` trained on top of frozen `intfloat/e5-small-v2` sentence embeddings to decide whether a scientific paper is relevant evidence for a given scientific claim.

This model accompanies the scifact-relevance-classifier project, the Lab 3 / Assignment 1 deliverable for Information Retrieval 5LN712 (Master's in Language Technology, Uppsala University, 2026).

Headline numbers

| Metric | Value |
|---|---|
| Macro-F1 (test) | 0.872 |
| Accuracy (test) | 0.883 |
| Test set size | 939 pairs |
| Class balance (test) | 64% `not_relevant`, 36% `relevant` |

TL;DR

| | |
|---|---|
| Task | Binary classification: `relevant` / `not_relevant` |
| Embedding | `intfloat/e5-small-v2` (frozen; asymmetric `query:` / `passage:` prefixes) |
| Pair features | `[q, d, abs(q − d), q * d, cos(q, d)]` (1537 dims; InferSent recipe) |
| Classifier | `sklearn.ensemble.HistGradientBoostingClassifier` (`max_iter=400`, `learning_rate=0.05`, `class_weight="balanced"`) |
| Best variant | `title_abstract` (concatenated title + abstract) |
| Training data | `andreiaalexa/scifact-relevance-pairs`, config `title_abstract`, split `train` |
| Code | https://github.com/alexandreia/scifact-relevance-classifier |
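
In scikit-learn terms, the classifier row above corresponds roughly to the sketch below. `X_train` and `y_train` are hypothetical names for a pair-feature matrix and label vector built as in the Usage section.

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Hyperparameters from the table above.
# Note: class_weight requires scikit-learn >= 1.2.
clf = HistGradientBoostingClassifier(
    max_iter=400,              # boosting iterations
    learning_rate=0.05,
    class_weight="balanced",   # compensates the 64/36 class imbalance
)
clf.fit(X_train, y_train)      # X_train: (n, 1537) pair features; y_train: 0/1 labels
```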

Usage

```python
import pickle
import numpy as np
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

REPO = "andreiaalexa/scifact-relevance-classifier"

# 1) Download the trained classifier
clf_path = hf_hub_download(REPO, "classifier.pkl")
with open(clf_path, "rb") as f:
    clf = pickle.load(f)

# 2) Load the (public) embedding model used at training time
encoder = SentenceTransformer("intfloat/e5-small-v2")

# 3) Build pair features (E5 asymmetric prefixes, then InferSent recipe)
def pair_features(claims, documents):
    q = encoder.encode([f"query: {c}" for c in claims], normalize_embeddings=True)
    d = encoder.encode([f"passage: {p}" for p in documents], normalize_embeddings=True)
    cos = np.sum(q * d, axis=1, keepdims=True)
    return np.hstack([q, d, np.abs(q - d), q * d, cos])

# 4) Predict
claim = "Vitamin D supplementation reduces respiratory infections."
title = "Vitamin D supplementation to prevent acute respiratory tract infections."
abstract = "Randomized trials have evaluated whether vitamin D supplementation prevents acute respiratory tract infections in diverse populations."
document = f"{title}. {abstract}"

X = pair_features([claim], [document])
pred_id = int(clf.predict(X)[0])
proba = clf.predict_proba(X)[0]

label = {0: "not_relevant", 1: "relevant"}[pred_id]
print(f"prediction = {label} | P(relevant) = {proba[1]:.3f}")

The helper `scifact_features.pair_features(...)` is also bundled in this repo; you can download and import it directly instead of re-implementing the function above.
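
A minimal way to do that, assuming the module is self-contained (i.e., it loads its own encoder):

```python
import importlib.util
from huggingface_hub import hf_hub_download

# Download the bundled helper module and import it from its local path.
path = hf_hub_download("andreiaalexa/scifact-relevance-classifier", "scifact_features.py")
spec = importlib.util.spec_from_file_location("scifact_features", path)
scifact_features = importlib.util.module_from_spec(spec)
spec.loader.exec_module(scifact_features)

X = scifact_features.pair_features([claim], [document])
```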

Classifier comparison (full ablation)

Five classifier families were compared; the best score in each row is in bold.

Macro-F1:

| Variant | LogReg | LinSVM | RF | HGB | MLP |
|---|---|---|---|---|---|
| title | 0.778 | 0.762 | 0.738 | 0.846 | **0.852** |
| abstract | 0.799 | 0.779 | 0.753 | **0.863** | 0.859 |
| title + abstract | 0.804 | 0.785 | 0.744 | **0.872** | 0.844 |

Accuracy:

| Variant | LogReg | LinSVM | RF | HGB | MLP |
|---|---|---|---|---|---|
| title | 0.790 | 0.774 | 0.792 | 0.860 | **0.865** |
| abstract | 0.817 | 0.798 | 0.805 | **0.874** | 0.871 |
| title + abstract | 0.818 | 0.800 | 0.797 | **0.883** | 0.857 |

Full per-class precision/recall is in experiment_results.json in this repo.

Key finding: the non-linear models (HGB, MLP) clearly beat the linear ones (LogReg, LinSVM), indicating that useful pairwise interactions between embedding dimensions exist beyond what a linear weighting can capture. Random Forest underperformed on macro-F1 despite balanced class weighting.
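
For reference, the comparison loop looks roughly like the sketch below. Hyperparameters for the non-HGB models are illustrative assumptions (the exact settings live in the training code); `X_train`, `y_train`, `X_test`, and `y_test` are pair-feature splits built as in the Usage section.

```python
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

models = {
    "LogReg": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "LinSVM": LinearSVC(class_weight="balanced"),
    "RF": RandomForestClassifier(n_estimators=300, class_weight="balanced"),
    "HGB": HistGradientBoostingClassifier(max_iter=400, learning_rate=0.05,
                                          class_weight="balanced"),
    "MLP": MLPClassifier(hidden_layer_sizes=(256,), max_iter=300),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: macro-F1={f1_score(y_test, pred, average='macro'):.3f}  "
          f"acc={accuracy_score(y_test, pred):.3f}")
```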

Files in this repo

| File | Purpose |
|---|---|
| `classifier.pkl` | Trained `HistGradientBoostingClassifier`. No `StandardScaler` is needed (HGB handles raw features), so the pickle is the bare classifier rather than a pipeline. |
| `metadata.json` | Best (variant, classifier) pair chosen during training; label maps; embedding model id. |
| `experiment_results.json` | Per-classifier, per-variant metrics, including per-class precision/recall. |
| `scifact_features.py` | The exact `pair_features()` function used at training time. Import this for inference. |
| `README.md` | This card. |

Intended use

  • Re-ranking candidate scientific evidence retrieved by an upstream IR system (e.g., BM25 or DPR top-k → this classifier scores each candidate); see the sketch after this list.
  • Filtering noisy retrieval results.
  • Educational use illustrating embedding-based pairwise classification.
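
A sketch of the re-ranking use case, reusing `pair_features` and `clf` from the Usage section; the candidate list and the upstream retriever are hypothetical:

```python
import numpy as np

claim = "Vitamin D supplementation reduces respiratory infections."
# Hypothetical "title. abstract" strings returned by an upstream retriever (e.g. BM25 top-k).
candidates = [
    "Vitamin D supplementation to prevent acute respiratory tract infections. ...",
    "Dietary calcium intake and bone mineral density in adolescents. ...",
]

X = pair_features([claim] * len(candidates), candidates)
p_rel = clf.predict_proba(X)[:, 1]   # P(relevant) per candidate
order = np.argsort(-p_rel)           # highest-scoring first
for rank, i in enumerate(order, start=1):
    print(f"{rank}. P(relevant)={p_rel[i]:.3f}  {candidates[i][:60]}")
```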

Out of scope / limitations

  • Not a clinical decision tool. The training data is BEIR SciFact (biomedical claims with expert-annotated evidence). Predictions on non-biomedical scientific content may be unreliable.
  • English only. e5-small-v2 is an English-only embedding model and the training data is English; cross-lingual performance is untested.
  • Frozen-encoder ceiling. Because we don't fine-tune e5-small-v2, performance is bounded by what its general-purpose embeddings already capture about scientific text.
  • Hard-negative mining is TF-IDF based, so the model learns to distinguish lexically similar but irrelevant documents; it may still be overconfident on completely off-topic documents.

Citations

If you use this model, please cite the underlying dataset (SciFact / BEIR), the embedding model, and the InferSent pair-feature recipe:

```bibtex
@inproceedings{wadden-etal-2020-fact,
  title = "Fact or Fiction: Verifying Scientific Claims",
  author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh",
  booktitle = "EMNLP", year = "2020",
  url = "https://aclanthology.org/2020.emnlp-main.609/"
}

@article{wang-etal-2022-text-embeddings,
  title = "Text Embeddings by Weakly-Supervised Contrastive Pre-training",
  author = "Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu",
  journal = "arXiv:2212.03533", year = "2022",
  url = "https://arxiv.org/abs/2212.03533"
}

@inproceedings{conneau-etal-2017-supervised,
  title = "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data",
  author = "Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Lo{\"i}c and Bordes, Antoine",
  booktitle = "EMNLP", year = "2017",
  url = "https://aclanthology.org/D17-1070/"
}
```

License

MIT.

