SIREN Screening Cross-encoder


A 3-class cross-encoder for systematic review screening that classifies query-document pairs as Relevant, Partial, or Irrelevant. Designed to rerank candidates from the siren-screening-biencoder.

Model Details

| Property | Value |
|---|---|
| Base Model | GTE-reranker-ModernBERT-base |
| Architecture | ModernBertForSequenceClassification (22 layers, 768 hidden) |
| Parameters | ~149M |
| Max Sequence Length | 8192 tokens |
| Output | 3-class probabilities (Irrelevant, Partial, Relevant) |
| Training | Fine-tuned on siren-screening + SLERP merged (t=0.2) |

Label Definitions

| Label | ID | Definition |
|---|---|---|
| Irrelevant | 0 | Document matches NONE of the eligibility criteria |
| Partial | 1 | Document matches SOME but not ALL criteria |
| Relevant | 2 | Document matches ALL criteria |

Intended Use

Primary use case: Second-stage reranking in systematic review screening pipelines.

After retrieving candidates with a bi-encoder, use this cross-encoder to:

  1. Rerank documents for better precision at top ranks
  2. Classify relevance for triage (prioritize Relevant, defer Partial, skip Irrelevant)
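The triage step above can be sketched as a small helper. This is illustrative code, not part of the model's API: it assumes each document comes with the model's three class probabilities in `[Irrelevant, Partial, Relevant]` order, and the `triage` name and bucket labels are made up for this example.

```python
# Illustrative triage helper (not part of the model API). Each row of
# `prob_rows` holds [P(Irrelevant), P(Partial), P(Relevant)] for one document.
LABELS = ("Irrelevant", "Partial", "Relevant")

def triage(prob_rows):
    """Bucket document indices by predicted class:
    Relevant -> prioritize, Partial -> defer, Irrelevant -> skip."""
    buckets = {"prioritize": [], "defer": [], "skip": []}
    for i, probs in enumerate(prob_rows):
        label = LABELS[max(range(len(probs)), key=probs.__getitem__)]
        if label == "Relevant":
            buckets["prioritize"].append(i)
        elif label == "Partial":
            buckets["defer"].append(i)
        else:
            buckets["skip"].append(i)
    return buckets

# Example: doc 0 looks Relevant, doc 1 Irrelevant, doc 2 Partial
print(triage([[0.02, 0.15, 0.83], [0.91, 0.07, 0.02], [0.1, 0.8, 0.1]]))
# {'prioritize': [0], 'defer': [2], 'skip': [1]}
```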

Recommended pipeline:

  1. Retrieve top-100 candidates with siren-screening-biencoder
  2. Rerank with this cross-encoder
  3. Use relevance labels to prioritize human screening

Usage

Sentence-Transformers CrossEncoder

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("Praise2112/siren-screening-crossencoder")

# Pairs of (query, document)
pairs = [
    ("RCTs of aspirin in diabetic adults", "A randomized trial of aspirin in 5,000 diabetic patients showed..."),
    ("RCTs of aspirin in diabetic adults", "This cohort study examined statin use in elderly populations..."),
]

# Get 3-class probabilities (softmax over the three logits)
scores = model.predict(pairs, apply_softmax=True)
print(scores)
# Output: array([[ 0.02,  0.15,  0.83],   # Relevant
#                [ 0.91,  0.07,  0.02]])  # Irrelevant
```

Transformers (Direct)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Praise2112/siren-screening-crossencoder")
model = AutoModelForSequenceClassification.from_pretrained("Praise2112/siren-screening-crossencoder")

query = "RCTs of aspirin in diabetic adults"
document = "A randomized trial of aspirin in 5,000 diabetic patients showed reduced MI risk..."

inputs = tokenizer(
    query, document,
    padding=True,
    truncation=True,
    max_length=768,
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

print(f"Irrelevant: {probs[0, 0]:.3f}")
print(f"Partial: {probs[0, 1]:.3f}")
print(f"Relevant: {probs[0, 2]:.3f}")

# Get predicted label
label_id = probs.argmax().item()
labels = {0: "Irrelevant", 1: "Partial", 2: "Relevant"}
print(f"Prediction: {labels[label_id]}")
```

Scoring for Reranking

For reranking, convert 3-class probabilities to a single score:

```python
def rerank_score(probs):
    """Convert 3-class probs to a ranking score.

    Higher score = more relevant.
    Partial gets partial credit (1x), Relevant gets full credit (2x).
    """
    return probs[1] + 2 * probs[2]  # P(Partial) + 2 * P(Relevant)

# Example
probs = [0.02, 0.15, 0.83]  # [Irrelevant, Partial, Relevant]
score = rerank_score(probs)  # 0.15 + 2 * 0.83 = 1.81
```
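Putting the scoring rule to work, a minimal reranking helper might look like the sketch below. The `rerank` function is illustrative (not part of the released code) and assumes `prob_rows[i]` holds the cross-encoder's `[Irrelevant, Partial, Relevant]` probabilities for `docs[i]`.

```python
# Sketch of a reranking step built on the scoring rule above.
def rerank_score(probs):
    # P(Partial) + 2 * P(Relevant): higher = more relevant
    return probs[1] + 2 * probs[2]

def rerank(docs, prob_rows):
    """Return docs sorted by rerank_score, most relevant first."""
    order = sorted(range(len(docs)),
                   key=lambda i: rerank_score(prob_rows[i]),
                   reverse=True)
    return [docs[i] for i in order]

docs = ["doc_a", "doc_b", "doc_c"]
prob_rows = [[0.9, 0.05, 0.05],   # doc_a: score 0.15
             [0.02, 0.15, 0.83],  # doc_b: score 1.81
             [0.1, 0.6, 0.3]]     # doc_c: score 1.20
print(rerank(docs, prob_rows))
# ['doc_b', 'doc_c', 'doc_a']
```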

Performance

Classification Accuracy

| Metric | Value |
|---|---|
| Accuracy | 90.6% |
| F1 (Macro) | 90.6% |
| Irrelevant F1 | 92.2% |
| Partial F1 | 87.4% |
| Relevant F1 | 92.3% |

Reranking Impact (MRR@10)

| Configuration | MRR@10 | Delta |
|---|---|---|
| SIREN bi-encoder alone | 0.937 | – |
| + SIREN cross-encoder | 0.952 | +1.5pp |
| + BGE-reranker (general) | 0.846 | -9.2pp |

General-purpose rerankers like BGE actually hurt performance on screening queries because they're optimized for topical relevance, not criteria matching.

Cross-encoder Transfer

This cross-encoder also improves other retrievers:

| Bi-encoder | Cross-encoder | MRR@10 | Delta |
|---|---|---|---|
| MedCPT | – | 0.697 | – |
| MedCPT | MedCPT-CE | 0.826 | +12.9pp |
| MedCPT | SIREN-CE | 0.931 | +23.4pp |

Training

This model was created by:

  1. Fine-tuning on the siren-screening dataset with 3-class labels
  2. SLERP merging encoder layers with the base model (t=0.2) to preserve generalization
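For reference, the SLERP operation used in step 2 can be sketched on a pair of flattened weight vectors. This is the standard spherical-linear-interpolation formula, not the exact merge script used for this model; with t=0.2 the result stays close to the first endpoint.

```python
import math

def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between two weight vectors.

    t=0 returns v0, t=1 returns v1; intermediate t follows the arc
    between the two directions rather than the straight line.
    """
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    if theta < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Halfway between two orthogonal unit vectors lands on the 45-degree point
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
# [0.7071..., 0.7071...]
```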

Training details:

  • Loss: Cross-entropy
  • Batch size: 32 (16 x 2 gradient accumulation)
  • Learning rate: 2e-5
  • Epochs: 1
  • Max length: 768 tokens

Limitations

  • Synthetic queries, real documents: The queries and relevance labels are LLM-generated, but the documents are real PubMed articles
  • English only: Trained on English PubMed content

Citation

```bibtex
@misc{oketola2026siren,
  title={SIREN: Improving Systematic Review Screening with Synthetic Training Data for Neural Retrievers},
  author={Praise Oketola},
  year={2026},
  howpublished={\url{https://huggingface.co/Praise2112/siren-screening-crossencoder}},
  note={Cross-encoder model}
}
```

License

Apache 2.0
