# PubGuard Review Classifier
A SetFit text classifier that distinguishes literature review articles from original research papers based on abstract text. Designed as a supplementary filter for PubGuard, adding finer-grained document type detection beyond standard metadata-based filtering.
## Purpose
Metadata-based review detection (e.g., OpenAlex type:review or PubMed PublicationType tags) misses many review-like publications that are tagged as regular articles. This classifier operates on abstract text to catch:
- Narrative and scoping reviews not tagged as reviews
- Meta-analyses and systematic reviews with ambiguous metadata
- Survey papers comparing existing methods
- Clinical guidelines and consensus statements
- Comprehensive overviews disguised as research articles
Intended as an additional granularity layer on top of PubGuard's existing document type classification, specifically targeting the review/non-review boundary where metadata filters underperform.
## Performance
Evaluated on a held-out test set of 9,000 abstracts (4,500 per class):
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Research Paper | 0.916 | 0.822 | 0.867 |
| Literature Review | 0.839 | 0.925 | 0.880 |
| Macro Average | 0.877 | 0.873 | 0.873 |
Accuracy: 87.3%
Confusion matrix (rows = true, columns = predicted):
| | Pred: Research | Pred: Review |
|---|---|---|
| True: Research | 3,700 | 800 |
| True: Review | 339 | 4,161 |
The model favors recall on the review class (92.5%) over precision (83.9%), which is appropriate for filtering applications where missing a review is more costly than occasionally flagging a borderline research paper.
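The reported per-class metrics can be recomputed directly from the confusion matrix above:

```python
# Recompute the reported metrics from the confusion matrix
# (rows = true class, columns = predicted class).
tp_res, fn_res = 3700, 800   # true research: predicted research / review
fp_res, tp_rev = 339, 4161   # true review:   predicted research / review

precision_res = tp_res / (tp_res + fp_res)   # 3700 / 4039
recall_res = tp_res / (tp_res + fn_res)      # 3700 / 4500
precision_rev = tp_rev / (tp_rev + fn_res)   # 4161 / 4961
recall_rev = tp_rev / (tp_rev + fp_res)      # 4161 / 4500
accuracy = (tp_res + tp_rev) / 9000

print(round(precision_res, 3), round(recall_res, 3))  # 0.916 0.822
print(round(precision_rev, 3), round(recall_rev, 3))  # 0.839 0.925
print(round(accuracy, 3))                             # 0.873
```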
## Training
- Method: SetFit (few-shot contrastive learning + logistic regression head)
- Base model: BAAI/bge-base-en-v1.5 (768-dim)
- Contrastive phase: 256 samples per class, 20 iterations, 2 epochs, batch size 64
- Head training: Logistic regression on full training set, 3 epochs
- Training time: ~9 minutes on NVIDIA RTX PRO 6000
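The second phase fits a plain scikit-learn logistic regression head on sentence embeddings. A minimal sketch of that head-training step, with synthetic 768-dim vectors standing in for the contrastively fine-tuned BGE embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for embedded abstracts (768-dim, like bge-base-en-v1.5).
# Two well-separated clusters play the role of the two classes.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.5, 1.0, (100, 768)),   # "review-like" embeddings
    rng.normal(-0.5, 1.0, (100, 768)),  # "research-like" embeddings
])
y = np.array([1] * 100 + [0] * 100)     # 1 = literature_review, 0 = research_paper

# The classification head: logistic regression over the embeddings.
head = LogisticRegression(max_iter=1000).fit(X, y)
print(head.predict(X[:2]))  # both samples are in the review cluster -> [1 1]
```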
## Dataset
60,000 abstracts balanced across two classes (30,000 each):
Literature Review (positive class):
- 15,000 abstracts from OpenAlex `type:review` articles
- 15,000 abstracts from publications with review-indicating titles (systematic review, meta-analysis, scoping review, narrative review, comprehensive review, critical review, state of the art, survey of methods, comparison of methods, overview of approaches), sourced from a 230M-publication PostgreSQL database
Research Paper (negative class):
- 20,000 abstracts from Immunology & Microbiology publications with >10 citations, excluding titles containing review/meta-analysis/survey keywords
- 10,000 abstracts from general scientific publications with >5 citations, same title exclusions
## Usage

```python
from sentence_transformers import SentenceTransformer
import joblib

# Load the embedding model and the logistic regression head
st = SentenceTransformer("jimnoneill/pubguard-review-classifier")
head = joblib.load("model_head.pkl")  # from the repo files

# Predict
abstracts = [
    "We systematically reviewed 47 studies on gut microbiome interventions...",
    "Here we report a novel bacteriophage that specifically lyses carbapenem-resistant Klebsiella...",
]
embeddings = st.encode(abstracts)
predictions = head.predict(embeddings)
# 0 = research_paper, 1 = literature_review
```
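Since the head is a plain scikit-learn `LogisticRegression`, you can push review recall even higher by thresholding `predict_proba` instead of using the default 0.5 cutoff. A minimal sketch with a small synthetic stand-in head (in practice, use the head loaded via joblib and real sentence embeddings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in head and features, for illustration only.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1, 1, (50, 8)), rng.normal(-1, 1, (50, 8))])
y = np.array([1] * 50 + [0] * 50)
head = LogisticRegression(max_iter=1000).fit(X, y)

# Column 1 of predict_proba is P(label == 1), i.e. P(literature_review).
proba_review = head.predict_proba(X)[:, 1]

# Lowering the cutoff below 0.5 trades review precision for review recall.
preds = (proba_review >= 0.3).astype(int)
```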
Or with the SetFit API (if loading works with your version):

```python
from setfit import SetFitModel

model = SetFitModel.from_pretrained("jimnoneill/pubguard-review-classifier")
predictions = model.predict(abstracts)
```
## Labels

- 0 = research_paper: original research reporting new findings, methods, or data
- 1 = literature_review: reviews, meta-analyses, surveys, guidelines, or comprehensive overviews of existing work
## Limitations
- Trained on English-language abstracts only
- Performs best on biomedical and life sciences text; may underperform on humanities or social sciences
- Short titles without abstract context have lower accuracy (use with full abstracts when possible)
- Some borderline cases (e.g., methods comparison papers that also present new data) may be classified either way
- Not designed to detect other non-research document types (posters, editorials, errata); use PubGuard's full pipeline for comprehensive document type filtering
## Citation
If you use this model, please cite:
```bibtex
@misc{pubguard-review-classifier,
  title={PubGuard Review Classifier: SetFit-based Literature Review Detection for Scientific Abstracts},
  author={O'Neill, James},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/jimnoneill/pubguard-review-classifier}
}
```