---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- cross-encoder
- text-classification
- transformers
- modernbert
- biomedical
- systematic-review
- relevance-screening
- reranking
- pubmed
datasets:
- Praise2112/siren-screening
base_model:
- Alibaba-NLP/gte-modernbert-base
pipeline_tag: text-classification
---

# SIREN Screening Cross-encoder
<p align="center">
  <a href="https://huggingface.co/datasets/Praise2112/siren-screening">
    <img src="https://img.shields.io/badge/Dataset-siren--screening-yellow.svg" alt="Dataset"/>
  </a>
  <a href="https://huggingface.co/Praise2112/siren-screening-biencoder">
    <img src="https://img.shields.io/badge/Retriever-siren--screening--biencoder-blue.svg" alt="Bi-encoder"/>
  </a>
  <img src="https://img.shields.io/badge/License-Apache_2.0-green.svg" alt="License"/>
</p>

A **3-class cross-encoder** for systematic review screening that classifies query-document pairs as **Relevant**, **Partial**, or **Irrelevant**. Designed to rerank candidates from the [siren-screening-biencoder](https://huggingface.co/Praise2112/siren-screening-biencoder).
## Model Details

| Property | Value |
|----------|-------|
| Base Model | GTE-reranker-ModernBERT-base |
| Architecture | ModernBertForSequenceClassification (22 layers, 768 hidden) |
| Parameters | ~149M |
| Max Sequence Length | 8192 tokens |
| Output | 3-class probabilities (Irrelevant, Partial, Relevant) |
| Training | Fine-tuned on [siren-screening](https://huggingface.co/datasets/Praise2112/siren-screening) + SLERP merged (t=0.2) |
### Label Definitions

| Label | ID | Definition |
|-------|----|------------|
| **Irrelevant** | 0 | Document matches NONE of the eligibility criteria |
| **Partial** | 1 | Document matches SOME but not ALL criteria |
| **Relevant** | 2 | Document matches ALL criteria |
## Intended Use

**Primary use case:** second-stage reranking in systematic review screening pipelines.

After retrieving candidates with a bi-encoder, use this cross-encoder to:

1. **Rerank** documents for better precision at top ranks
2. **Classify** relevance for triage (prioritize Relevant, defer Partial, skip Irrelevant)

**Recommended pipeline:**

1. Retrieve the top-100 candidates with [siren-screening-biencoder](https://huggingface.co/Praise2112/siren-screening-biencoder)
2. Rerank with this cross-encoder
3. Use the relevance labels to prioritize human screening
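The recommended pipeline can be sketched end-to-end with sentence-transformers. This is a minimal sketch, not released code: the `screen` and `triage` helpers, the corpus handling, and the bucketing scheme are illustrative; only the model names come from this card.

```python
# Sketch of the two-stage screening pipeline (assumes sentence-transformers
# is installed; everything except the model names is illustrative).
LABELS = {0: "Irrelevant", 1: "Partial", 2: "Relevant"}

def triage(label_ids):
    """Bucket candidate indices by predicted class for human screening."""
    buckets = {"Relevant": [], "Partial": [], "Irrelevant": []}
    for idx, label_id in enumerate(label_ids):
        buckets[LABELS[int(label_id)]].append(idx)
    return buckets

def screen(query, corpus, top_k=100):
    from sentence_transformers import CrossEncoder, SentenceTransformer, util

    # Stage 1: dense retrieval with the bi-encoder.
    biencoder = SentenceTransformer("Praise2112/siren-screening-biencoder")
    doc_emb = biencoder.encode(corpus, convert_to_tensor=True)
    query_emb = biencoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, doc_emb, top_k=top_k)[0]

    # Stage 2: rerank the candidates with this cross-encoder.
    crossencoder = CrossEncoder("Praise2112/siren-screening-crossencoder")
    pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
    probs = crossencoder.predict(pairs, apply_softmax=True)

    # Stage 3: triage by predicted class to prioritize human screening.
    return triage(probs.argmax(axis=1))

# e.g. screen("RCTs of aspirin in diabetic adults", corpus)
# -> {"Relevant": [...], "Partial": [...], "Irrelevant": [...]}
```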
## Usage

### Sentence-Transformers CrossEncoder

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("Praise2112/siren-screening-crossencoder")

# Pairs of (query, document)
pairs = [
    ("RCTs of aspirin in diabetic adults", "A randomized trial of aspirin in 5,000 diabetic patients showed..."),
    ("RCTs of aspirin in diabetic adults", "This cohort study examined statin use in elderly populations..."),
]

# Get 3-class probabilities (without apply_softmax=True, predict returns raw logits)
scores = model.predict(pairs, apply_softmax=True)
print(scores)
# Example output: array([[ 0.02, 0.15, 0.83],   # Relevant
#                        [ 0.91, 0.07, 0.02]])  # Irrelevant
```
### Transformers (Direct)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Praise2112/siren-screening-crossencoder")
model = AutoModelForSequenceClassification.from_pretrained("Praise2112/siren-screening-crossencoder")

query = "RCTs of aspirin in diabetic adults"
document = "A randomized trial of aspirin in 5,000 diabetic patients showed reduced MI risk..."

inputs = tokenizer(
    query, document,
    padding=True,
    truncation=True,
    max_length=768,
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

print(f"Irrelevant: {probs[0, 0]:.3f}")
print(f"Partial: {probs[0, 1]:.3f}")
print(f"Relevant: {probs[0, 2]:.3f}")

# Get the predicted label
label_id = probs.argmax().item()
labels = {0: "Irrelevant", 1: "Partial", 2: "Relevant"}
print(f"Prediction: {labels[label_id]}")
```
### Scoring for Reranking

For reranking, convert the 3-class probabilities to a single score:

```python
def rerank_score(probs):
    """Convert 3-class probabilities to a ranking score.

    Higher score = more relevant.
    Partial gets partial credit (1x), Relevant gets full credit (2x).
    """
    return probs[1] + 2 * probs[2]  # P(Partial) + 2 * P(Relevant)

# Example
probs = [0.02, 0.15, 0.83]  # [Irrelevant, Partial, Relevant]
score = rerank_score(probs)  # 0.15 + 2 * 0.83 = 1.81
```
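Applied to a batch of cross-encoder outputs, this score induces a ranking. A small self-contained sketch (the probability rows are illustrative values, not real model outputs):

```python
def rerank_score(probs):
    # P(Partial) + 2 * P(Relevant); higher = more relevant.
    return probs[1] + 2 * probs[2]

# Illustrative softmax outputs, one row per candidate: [Irrelevant, Partial, Relevant]
candidate_probs = [
    [0.02, 0.15, 0.83],  # mostly Relevant   -> score 1.81
    [0.91, 0.07, 0.02],  # mostly Irrelevant -> score 0.11
    [0.20, 0.70, 0.10],  # mostly Partial    -> score 0.90
]

# Sort candidate indices by descending score.
order = sorted(range(len(candidate_probs)),
               key=lambda i: rerank_score(candidate_probs[i]),
               reverse=True)
print(order)  # [0, 2, 1]
```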
## Performance

### Classification Accuracy

| Metric | Value |
|--------|-------|
| Accuracy | 90.6% |
| F1 (Macro) | 90.6% |
| Irrelevant F1 | 92.2% |
| Partial F1 | 87.4% |
| Relevant F1 | 92.3% |

### Reranking Impact (MRR@10)

| Configuration | MRR@10 | Delta |
|---------------|--------|-------|
| SIREN bi-encoder alone | 0.937 | - |
| + SIREN cross-encoder | **0.952** | +1.5pp |
| + [BGE-reranker](https://huggingface.co/BAAI/bge-reranker-base) (general) | 0.846 | -9.2pp |

General-purpose rerankers such as [BGE](https://huggingface.co/BAAI/bge-reranker-base) actually hurt performance on screening queries because they are optimized for topical relevance, not criteria matching.
### Cross-encoder Transfer

This cross-encoder also improves other retrievers:

| Bi-encoder | Cross-encoder | MRR@10 | Delta |
|------------|---------------|--------|-------|
| [MedCPT](https://huggingface.co/ncbi/MedCPT-Query-Encoder) | - | 0.697 | - |
| [MedCPT](https://huggingface.co/ncbi/MedCPT-Query-Encoder) | [MedCPT-CE](https://huggingface.co/ncbi/MedCPT-Cross-Encoder) | 0.826 | +12.9pp |
| [MedCPT](https://huggingface.co/ncbi/MedCPT-Query-Encoder) | **SIREN-CE** | **0.931** | **+23.4pp** |
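For reference, MRR@10 (the metric in the tables above) is the reciprocal rank of the first relevant document within the top 10, averaged over queries. A minimal sketch with hypothetical rankings (the document ids are made up for illustration):

```python
def mrr_at_10(rankings, relevant_sets):
    """Mean reciprocal rank at cutoff 10.

    rankings: per-query lists of document ids in ranked order.
    relevant_sets: per-query sets of relevant document ids.
    A query contributes 0 if no relevant document appears in its top 10.
    """
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for rank, doc_id in enumerate(ranking[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Hypothetical: first relevant hit at rank 1 for query 1, rank 2 for query 2.
print(mrr_at_10([[3, 5, 7], [9, 3, 5]], [{3}, {3}]))  # (1.0 + 0.5) / 2 = 0.75
```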
## Training

This model was created by:

1. Fine-tuning on the [siren-screening](https://huggingface.co/datasets/Praise2112/siren-screening) dataset with 3-class labels
2. SLERP-merging the encoder layers with the base model (t=0.2) to preserve generalization

**Training details:**

- Loss: cross-entropy
- Batch size: 32 (16 x 2 gradient accumulation)
- Learning rate: 2e-5
- Epochs: 1
- Max length: 768 tokens
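The SLERP merge in step 2 interpolates each weight tensor along the arc between the two checkpoints rather than linearly. A minimal numpy sketch of the operation, treating each tensor as a flat vector (which endpoint t=0 refers to is an assumption here, not specified by the card):

```python
import numpy as np

def slerp(w_base, w_tuned, t=0.2, eps=1e-8):
    """Spherical linear interpolation between two weight tensors.

    t=0 returns w_base, t=1 returns w_tuned; falls back to linear
    interpolation when the tensors are nearly parallel.
    """
    a, b = w_base.ravel(), w_tuned.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < eps:  # nearly parallel: LERP is the stable limit
        merged = (1 - t) * a + t * b
    else:
        merged = (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
    return merged.reshape(w_base.shape)
```

With t=0.2 the merged weights stay close to one endpoint, which is consistent with the stated goal of preserving the base model's generalization while keeping most of the fine-tuned behavior.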
## Limitations

- **Synthetic queries, real documents**: the queries and relevance labels are LLM-generated, but the documents are real PubMed articles
- **English only**: trained on English PubMed content
## Citation

```bibtex
@misc{oketola2026siren,
  title={SIREN: Improving Systematic Review Screening with Synthetic Training Data for Neural Retrievers},
  author={Praise Oketola},
  year={2026},
  howpublished={\url{https://huggingface.co/Praise2112/siren-screening-crossencoder}},
  note={Cross-encoder model}
}
```
## License

Apache 2.0