PubGuard Review Classifier

A SetFit text classifier that distinguishes literature review articles from original research papers based on abstract text. Designed as a supplementary filter for PubGuard, adding finer-grained document type detection beyond standard metadata-based filtering.

Purpose

Metadata-based review detection (e.g., OpenAlex type:review or PubMed PublicationType tags) misses many review-like publications that are tagged as regular articles. This classifier operates on abstract text to catch:

  • Narrative and scoping reviews not tagged as reviews
  • Meta-analyses and systematic reviews with ambiguous metadata
  • Survey papers comparing existing methods
  • Clinical guidelines and consensus statements
  • Comprehensive overviews disguised as research articles

Intended as an additional granularity layer on top of PubGuard's existing document type classification, specifically targeting the review/non-review boundary where metadata filters underperform.

Performance

Evaluated on a held-out test set of 9,000 abstracts (4,500 per class):

Class              Precision  Recall  F1
Research Paper     0.916      0.822   0.867
Literature Review  0.839      0.925   0.880
Macro Average      0.877      0.873   0.873

Accuracy: 87.3%

Confusion matrix (rows = true, columns = predicted):

                 Pred: Research  Pred: Review
True: Research            3,700           800
True: Review                339         4,161

The model favors recall on the review class (92.5%) over precision (83.9%), which is appropriate for filtering applications where missing a review is more costly than occasionally flagging a borderline research paper.
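The reported metrics follow directly from the confusion matrix above; a quick sanity check:

```python
# Reproduce the reported review-class metrics from the confusion matrix.
# Rows = true class, columns = predicted class: [research, review].
cm = [[3700, 800],
      [339, 4161]]

tp_review = cm[1][1]  # reviews correctly flagged
fp_review = cm[0][1]  # research papers flagged as reviews
fn_review = cm[1][0]  # reviews missed

precision_review = tp_review / (tp_review + fp_review)  # 4161 / 4961
recall_review = tp_review / (tp_review + fn_review)     # 4161 / 4500
accuracy = (cm[0][0] + cm[1][1]) / 9000

print(round(precision_review, 3), round(recall_review, 3), round(accuracy, 3))
# 0.839 0.925 0.873
```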

Training

  • Method: SetFit (few-shot contrastive learning + logistic regression head)
  • Base model: BAAI/bge-base-en-v1.5 (768-dim)
  • Contrastive phase: 256 samples per class, 20 iterations, 2 epochs, batch size 64
  • Head training: Logistic regression on full training set, 3 epochs
  • Training time: ~9 minutes on NVIDIA RTX PRO 6000
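The head-training step above can be illustrated in isolation. This is a minimal sketch of fitting a logistic-regression head on sentence embeddings; it uses random 768-dim vectors in place of real bge-base-en-v1.5 output, so the data and labels here are synthetic stand-ins, not the actual training pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# SetFit first fine-tunes the sentence encoder contrastively, then fits a
# simple classifier on the resulting embeddings. Synthetic 768-dim vectors
# stand in for encoder output here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))          # stand-in for bge-base-en-v1.5 embeddings
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in labels (0=research, 1=review)

# The classification head: plain logistic regression on the embeddings.
head = LogisticRegression(max_iter=1000).fit(X, y)
print(head.score(X, y))  # near-perfect on this separable synthetic data
```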

Dataset

60,000 abstracts balanced across two classes (30,000 each):

Literature Review (positive class):

  • 15,000 abstracts from OpenAlex type:review articles
  • 15,000 abstracts from publications with review-indicating titles (systematic review, meta-analysis, scoping review, narrative review, comprehensive review, critical review, state of the art, survey of methods, comparison of methods, overview of approaches), sourced from a 230M-publication PostgreSQL database

Research Paper (negative class):

  • 20,000 abstracts from Immunology & Microbiology publications with >10 citations, excluding titles containing review/meta-analysis/survey keywords
  • 10,000 abstracts from general scientific publications with >5 citations, same title exclusions
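The title-based selection and exclusion described above amounts to a keyword heuristic. A minimal sketch, using the review-indicating phrases listed for the positive class (the exact matching logic used to build the dataset is an assumption):

```python
# Phrases taken from the review-indicating titles listed above.
REVIEW_TITLE_KEYWORDS = (
    "systematic review", "meta-analysis", "scoping review", "narrative review",
    "comprehensive review", "critical review", "state of the art",
    "survey of methods", "comparison of methods", "overview of approaches",
)

def looks_like_review_title(title: str) -> bool:
    """Case-insensitive check for review-indicating phrases in a title."""
    t = title.lower()
    return any(kw in t for kw in REVIEW_TITLE_KEYWORDS)

print(looks_like_review_title("A Systematic Review of Phage Therapy"))     # True
print(looks_like_review_title("A novel bacteriophage against Klebsiella")) # False
```

For the negative class, the same check is applied in reverse: titles matching any keyword are excluded from the research-paper pool.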

Usage

from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer
import joblib

# Load the encoder and the classification head (model_head.pkl from the repo files)
st = SentenceTransformer("jimnoneill/pubguard-review-classifier")
head = joblib.load(
    hf_hub_download("jimnoneill/pubguard-review-classifier", "model_head.pkl")
)

# Predict
abstracts = [
    "We systematically reviewed 47 studies on gut microbiome interventions...",
    "Here we report a novel bacteriophage that specifically lyses carbapenem-resistant Klebsiella...",
]
embeddings = st.encode(abstracts)
predictions = head.predict(embeddings)
# 0 = research_paper, 1 = literature_review

Or with the SetFit API (loading compatibility depends on your installed setfit version):

from setfit import SetFitModel
model = SetFitModel.from_pretrained("jimnoneill/pubguard-review-classifier")
predictions = model.predict(abstracts)
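If you need to push review recall even higher for filtering, the head exposes predicted probabilities, so you can lower the decision threshold instead of using the default 0.5. A sketch with a synthetic stand-in head; the real model_head.pkl should behave the same way, assuming it is a scikit-learn logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in head fitted on synthetic 768-dim "embeddings"; in practice you
# would use the head loaded from the repo via joblib.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 768))
y = (X[:, 0] > 0).astype(int)
head = LogisticRegression(max_iter=1000).fit(X, y)

# predict() thresholds P(review) at 0.5; lowering the threshold trades
# review precision for recall.
proba_review = head.predict_proba(X)[:, 1]
preds = (proba_review >= 0.3).astype(int)  # flag anything >=30% review-likely
```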

Labels

  • 0 = research_paper: Original research reporting new findings, methods, or data
  • 1 = literature_review: Reviews, meta-analyses, surveys, guidelines, or comprehensive overviews of existing work

Limitations

  • Trained on English-language abstracts only
  • Performs best on biomedical and life sciences text; may underperform on humanities or social sciences
  • Title-only input (without abstract text) has lower accuracy; use full abstracts when possible
  • Some borderline cases (e.g., methods comparison papers that also present new data) may be classified either way
  • Not designed to detect other non-research document types (posters, editorials, errata); use PubGuard's full pipeline for comprehensive document type filtering

Citation

If you use this model, please cite:

@misc{pubguard-review-classifier,
  title={PubGuard Review Classifier: SetFit-based Literature Review Detection for Scientific Abstracts},
  author={O'Neill, James},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/jimnoneill/pubguard-review-classifier}
}