# RadLITE-Reranker

**Radiology Late Interaction Transformer Enhanced**: a cross-encoder reranker for radiology search.

A domain-specialized cross-encoder for reranking radiology search results. The model takes a query-document pair and predicts a relevance score, providing more accurate ranking than bi-encoder similarity alone.

**Recommended:** Use this reranker together with RadLITE-Encoder in a two-stage pipeline for optimal performance. The bi-encoder handles fast retrieval over large corpora; this cross-encoder then reranks the top candidates for precision. The combination achieves MRR 0.829 on radiology retrieval benchmarks.
## Model Description
| Property | Value |
|---|---|
| Model Type | Cross-Encoder (Reranker) |
| Base Model | ms-marco-MiniLM-L-12-v2 |
| Domain | Radiology / Medical Imaging |
| Hidden Size | 384 |
| Max Sequence Length | 512 tokens |
| Output | Single relevance score |
| License | Apache 2.0 |
## Why Use a Reranker?
Bi-encoders (like RadLITE-Encoder) are fast but encode query and document independently. Cross-encoders process them together, capturing fine-grained interactions:
| Approach | Speed | Accuracy | Use Case |
|---|---|---|---|
| Bi-Encoder | Fast (1000s docs/sec) | Good | First-stage retrieval |
| Cross-Encoder | Slow (10s docs/sec) | Excellent | Reranking top candidates |
**Two-stage pipeline:** Use the bi-encoder to retrieve the top 50-100 candidates, then rerank them with the cross-encoder for best results.
## Performance

### Impact on RadLIT-9 Benchmark
| Configuration | MRR | Improvement |
|---|---|---|
| Bi-Encoder only | 0.78 | baseline |
| Bi-Encoder + Reranker | 0.829 | +6.3% |
### ABR Core Exam (Board-Style Questions)

Comparing the two-stage pipeline (bi-encoder + reranker) against the bi-encoder alone:
| Dataset | Two-Stage MRR | Bi-Encoder Only | Improvement |
|---|---|---|---|
| Core Exam Chest | 0.533 | 0.409 | +30.3% |
| Core Exam Combined | 0.466 | 0.381 | +22.3% |
The reranker provides significant gains on complex, multi-part queries typical of board exam questions.
### Published Benchmark Results

From Matulich & Mason (2026):
| Benchmark | RadLIT Result | Key Finding |
|---|---|---|
| NFCorpus nDCG@10 | 0.268 | 17.9x improvement over RadBERT bi-encoder (0.015) |
| VQA-RAD MRR | 0.972 | Near-perfect retrieval on radiology Q&A |
| RadLIT-9 Thoracic | 0.736 nDCG@10 | Best-in-class (beat BGE-large, ColBERTv2) |
| RadLIT-9 Pediatric | 0.625 nDCG@10 | Best-in-class (beat BGE-large, ColBERTv2) |
| Zebra Test | 92% found rate | 2.1x improvement on rare conditions vs ColBERTv2 |
**Vocabulary Alignment Hypothesis:** Domain training provides a measurable advantage when queries use radiology-specific terminology that aligns with the training domain.
## Quick Start

### Installation

```bash
pip install "sentence-transformers>=2.2.0"
```

(The quotes keep the shell from treating `>` in the version constraint as a redirect.)
### Basic Usage

```python
from sentence_transformers import CrossEncoder

# Load the reranker
reranker = CrossEncoder("matulichpt/radlit-crossencoder", max_length=512)

# Query and candidate documents
query = "What are the imaging features of hepatocellular carcinoma?"
documents = [
    "HCC typically shows arterial enhancement with portal venous washout on CT.",
    "Fatty liver disease presents as decreased attenuation on non-contrast CT.",
    "Hepatic hemangiomas show peripheral nodular enhancement.",
]

# Create query-document pairs
pairs = [[query, doc] for doc in documents]

# Get relevance scores
scores = reranker.predict(pairs)

# Apply temperature calibration (recommended)
calibrated_scores = scores / 1.5

print("Scores:", calibrated_scores)
# The document about HCC will have the highest score
```
## Temperature Calibration

**Important:** This model outputs scores with high variance. Apply temperature scaling for better fusion with other signals:

```python
TEMPERATURE = 1.5  # Recommended value

def calibrated_predict(reranker, pairs):
    # Raw scores might be: [4.2, -1.5, 0.8]
    # After calibration:   [2.8, -1.0, 0.53]
    raw_scores = reranker.predict(pairs)
    return raw_scores / TEMPERATURE
```
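If you need bounded scores for thresholding or display, the calibrated logits can additionally be squashed through a sigmoid. This is a common convention for MS MARCO-style cross-encoders, but treating the result as a calibrated probability for this particular model is an assumption; a minimal sketch:

```python
import numpy as np

def to_pseudo_probability(raw_scores, temperature=1.5):
    """Map raw cross-encoder logits to (0, 1) via a temperature-scaled sigmoid."""
    scaled = np.asarray(raw_scores, dtype=float) / temperature
    return 1.0 / (1.0 + np.exp(-scaled))

probs = to_pseudo_probability([4.2, -1.5, 0.8])
# Order is preserved; only the scale changes.
```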
## Full Two-Stage Search Pipeline

```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

class RadLITESearch:
    def __init__(self, device="cuda"):
        # Stage 1: fast bi-encoder
        self.encoder = SentenceTransformer(
            "matulichpt/radlit-biencoder",
            device=device,
        )
        # Stage 2: precise reranker
        self.reranker = CrossEncoder(
            "matulichpt/radlit-crossencoder",
            max_length=512,
            device=device,
        )
        self.temperature = 1.5
        self.corpus_embeddings = None
        self.corpus = None

    def index_corpus(self, documents: list):
        """Pre-compute embeddings for your corpus."""
        self.corpus = documents
        self.corpus_embeddings = self.encoder.encode(
            documents,
            normalize_embeddings=True,
            show_progress_bar=True,
            batch_size=32,
        )

    def search(self, query: str, top_k: int = 10, candidates: int = 50):
        """Two-stage search: retrieve, then rerank."""
        # Stage 1: bi-encoder retrieval
        query_emb = self.encoder.encode(query, normalize_embeddings=True)
        scores = query_emb @ self.corpus_embeddings.T
        top_indices = np.argsort(scores)[-candidates:][::-1]

        # Stage 2: cross-encoder reranking
        candidate_docs = [self.corpus[i] for i in top_indices]
        pairs = [[query, doc] for doc in candidate_docs]
        rerank_scores = self.reranker.predict(pairs) / self.temperature

        # Sort by reranked scores
        sorted_indices = np.argsort(rerank_scores)[::-1]
        results = []
        for idx in sorted_indices[:top_k]:
            results.append({
                "document": candidate_docs[idx],
                "corpus_index": int(top_indices[idx]),
                "score": float(rerank_scores[idx]),
                "biencoder_score": float(scores[top_indices[idx]]),
            })
        return results

# Usage
searcher = RadLITESearch()
searcher.index_corpus(your_radiology_documents)
results = searcher.search("pneumothorax CT findings")
```
## Integration with Any Corpus

### Radiopaedia / Educational Content

```python
import json

# Load your content (e.g., Radiopaedia articles)
with open("radiopaedia_articles.json") as f:
    articles = json.load(f)

corpus = [article["content"] for article in articles]

# Initialize search
searcher = RadLITESearch()
searcher.index_corpus(corpus)

# Search
results = searcher.search("classic findings of pulmonary embolism on CTPA")
for r in results[:5]:
    print(f"Score: {r['score']:.3f}")
    print(f"Content: {r['document'][:200]}...")
    print()
```
### Integration with Elasticsearch/OpenSearch

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("matulichpt/radlit-crossencoder", max_length=512)

def rerank_elasticsearch_results(query: str, es_results: list, top_k: int = 10):
    """Rerank Elasticsearch BM25 results."""
    documents = [hit["_source"]["content"] for hit in es_results]
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs) / 1.5  # Temperature calibration

    # Combine with ES scores (optional)
    for i, hit in enumerate(es_results):
        hit["rerank_score"] = float(scores[i])
        hit["combined_score"] = 0.3 * hit["_score"] + 0.7 * scores[i]

    # Sort by combined score
    reranked = sorted(es_results, key=lambda x: x["combined_score"], reverse=True)
    return reranked[:top_k]
```
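One caveat with a fixed 0.3/0.7 blend: Elasticsearch `_score` values are unbounded and corpus-dependent, so on some indexes BM25 can dominate the calibrated logits (or vice versa). A common remedy is to min-max normalize each signal over the candidate set before weighting; a minimal sketch (weights illustrative, not tuned):

```python
def min_max_normalize(values):
    """Rescale scores to [0, 1] over the current candidate set."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all candidates scored identically
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def combine_signals(bm25_scores, rerank_scores, w_bm25=0.3, w_rerank=0.7):
    """Weighted fusion of two per-candidate score lists after normalization."""
    b = min_max_normalize(bm25_scores)
    r = min_max_normalize(rerank_scores)
    return [w_bm25 * bs + w_rerank * rs for bs, rs in zip(b, r)]
```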
## Optimal Fusion Weights

When combining multiple signals (bi-encoder, cross-encoder, BM25), use these weights:

```python
# Optimal weights from grid search on RadLIT-9
FUSION_WEIGHTS = {
    "biencoder": 0.5,     # RadLITE-Encoder similarity
    "crossencoder": 0.2,  # RadLITE-Reranker (after temperature calibration)
    "bm25": 0.3,          # Lexical matching (if available)
}

def fused_score(bienc_score, ce_score, bm25_score=0):
    return (
        FUSION_WEIGHTS["biencoder"] * bienc_score
        + FUSION_WEIGHTS["crossencoder"] * ce_score
        + FUSION_WEIGHTS["bm25"] * bm25_score
    )
```
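A quick worked example of the fusion arithmetic, with `fused_score` restated so the snippet runs standalone; the per-document signals are invented for illustration and assumed to be pre-normalized to a comparable range:

```python
FUSION_WEIGHTS = {"biencoder": 0.5, "crossencoder": 0.2, "bm25": 0.3}

def fused_score(bienc_score, ce_score, bm25_score=0.0):
    return (FUSION_WEIGHTS["biencoder"] * bienc_score
            + FUSION_WEIGHTS["crossencoder"] * ce_score
            + FUSION_WEIGHTS["bm25"] * bm25_score)

# Hypothetical signals per document: (bi-encoder, cross-encoder, BM25)
candidates = {
    "doc_a": (0.82, 0.94, 0.71),
    "doc_b": (0.79, 0.27, 1.00),
    "doc_c": (0.61, 0.63, 0.35),
}

ranked = sorted(candidates, key=lambda d: fused_score(*candidates[d]), reverse=True)
```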
## Architecture

```
[Query] + [SEP] + [Document]
            |
            v
     [BERT Tokenizer]
            |
            v
  [MiniLM Encoder] (12 layers, 384 hidden)
            |
            v
  [Classification Head]
            |
            v
   Relevance Score (float)
```
## Training Details
- Base Model: ms-marco-MiniLM-L-12-v2 (trained on MS MARCO passage ranking)
- Fine-tuning: Radiology query-document relevance pairs
- Training Steps: 5,626
- Best Validation Loss: 0.691
- Learning Rate: 2e-5
- Batch Size: 32
- Category Weighting: Yes (balanced across radiology subspecialties)
## Best Practices

### 1. Always Use Temperature Calibration

Raw cross-encoder scores can be extreme. Temperature scaling (1.5) produces better fusion:

```python
calibrated = raw_score / 1.5
```
### 2. Limit Candidates for Reranking

Cross-encoders are slow. Only rerank the top 50-100 candidates from the bi-encoder:

```python
# Good: rerank the top 50
rerank_candidates = 50

# Bad: rerank the entire corpus
rerank_candidates = len(corpus)  # Too slow!
```
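To see why, a back-of-envelope latency estimate, assuming 30 pairs/second as a midpoint of the ~10-50 pairs/second range noted under Limitations:

```python
THROUGHPUT = 30.0  # assumed pairs/second (midpoint of the ~10-50 range)

def rerank_seconds(num_candidates, throughput=THROUGHPUT):
    """Rough wall-clock cost of reranking num_candidates pairs."""
    return num_candidates / throughput

for n in (50, 1_000, 100_000):
    print(f"{n:>7} candidates -> ~{rerank_seconds(n):,.0f} s")
```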
### 3. Batch Predictions

```python
# Efficient: single batched call
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs, batch_size=32)

# Inefficient: individual calls
scores = [reranker.predict([[query, doc]])[0] for doc in candidates]
```
### 4. GPU Acceleration

```python
reranker = CrossEncoder(
    "matulichpt/radlit-crossencoder",
    max_length=512,
    device="cuda",  # Use GPU
)
```
## Limitations

- **English only:** Trained on English radiology text
- **Speed:** ~10-50 pairs/second (use for reranking, not the full corpus)
- **512-token limit:** Longer documents are truncated
- **Domain-specific:** Optimized for radiology; may underperform on general medical content
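For documents longer than 512 tokens, one common workaround is to score overlapping chunks and keep the best chunk score as the document score. A rough sketch, using word windows as a crude proxy for tokens; `score_fn` stands in for `reranker.predict`, and the helper names are illustrative:

```python
def chunk_words(text, window=350, stride=250):
    """Split text into overlapping word windows (a rough proxy for tokens)."""
    words = text.split()
    if len(words) <= window:
        return [text]
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks

def score_long_document(query, document, score_fn):
    """Max-pool chunk scores; score_fn takes a list of [query, chunk] pairs."""
    pairs = [[query, chunk] for chunk in chunk_words(document)]
    return max(score_fn(pairs))
```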
## Citation

If you use RadLITE in your work, please cite:

```bibtex
@article{matulich2026radlit,
  title   = {Late Interaction Retrieval Unlocks Domain Knowledge in Radiology Language Models},
  author  = {Matulich, Patrick and Mason, Dan},
  year    = {2026},
  journal = {Radiology: Artificial Intelligence},
  note    = {17.9x improvement over RadBERT; best-in-class on Thoracic/Pediatric subspecialties},
  url     = {https://huggingface.co/matulichpt/radlit-biencoder}
}
```
## Related Models
- RadLITE-Encoder - Bi-encoder for first-stage retrieval
- RadBERT-RoBERTa-4m - Base radiology language model
## License

Apache 2.0 - Free for commercial and research use.