---
license: apache-2.0
language:
- en
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- radiology
- medical
- retrieval
- embeddings
- healthcare
- clinical
base_model: zzxslp/RadBERT-RoBERTa-4m
pipeline_tag: sentence-similarity
library_name: sentence-transformers
datasets:
- radiology-education-corpus
metrics:
- mrr
- ndcg
model-index:
- name: RadLITE-Encoder
  results:
  - task:
      type: retrieval
      name: Information Retrieval
    dataset:
      name: RadLIT-9 (Radiology Retrieval Benchmark)
      type: radiology-retrieval
    metrics:
    - type: mrr
      value: 0.829
      name: MRR (with full pipeline)
    - type: ndcg@10
      value: 0.863
      name: nDCG@10
    - type: recall@10
      value: 0.90
      name: Recall@10
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: Radiology Similarity Evaluation
      type: radiology-similarity
    metrics:
    - type: spearman_cosine
      value: 0.8454
      name: Spearman Correlation
    - type: pearson_cosine
      value: 0.8504
      name: Pearson Correlation
---

# RadLITE-Encoder

**Radiology Late Interaction Transformer Enhanced - Bi-Encoder Component**

A domain-specialized sentence transformer for radiology and medical imaging content. This model encodes radiology text (reports, articles, educational content) into 768-dimensional dense vectors optimized for semantic search and retrieval.

> **Recommended:** For optimal retrieval performance, use this encoder with [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker) in a two-stage pipeline. The bi-encoder provides fast candidate retrieval, while the cross-encoder reranker delivers precision. This combination achieves **MRR 0.829** on radiology benchmarks.

## Model Description

| Property | Value |
|----------|-------|
| **Model Type** | Sentence Transformer (Bi-Encoder) |
| **Base Model** | [RadBERT-RoBERTa-4m](https://huggingface.co/zzxslp/RadBERT-RoBERTa-4m) |
| **Domain** | Radiology / Medical Imaging |
| **Vector Dimensions** | 768 |
| **Max Sequence Length** | 512 tokens |
| **Similarity Function** | Cosine Similarity |
| **License** | Apache 2.0 |

### Why RadLITE-Encoder?

Standard embedding models (BGE, E5, OpenAI) are trained on general web text and struggle with radiology-specific terminology:

- **Anatomical terms**: "hepatic flexure", "foramen magnum", "costophrenic angle"
- **Imaging sequences**: "T2 FLAIR", "DWI/ADC mismatch", "post-gadolinium"
- **Pathology descriptions**: "ground-glass opacity", "cortical ribbon sign", "double duct sign"
- **Abbreviations**: "HCC", "RCC", "NSCLC", "BI-RADS"

RadLITE-Encoder is fine-tuned on 6.7M radiology query-document pairs to handle this specialized vocabulary.

## Performance

### RadLIT-9 Benchmark (Radiology Retrieval)

| Model | MRR | nDCG@10 | Notes |
|-------|-----|---------|-------|
| **RadLITE-Encoder + RadLITE-Reranker** | **0.829** | **0.863** | Full pipeline with reranker |
| RadLITE-Encoder (standalone) | 0.78 | 0.81 | Bi-encoder only |
| BGE-large-en-v1.5 | 0.72 | 0.76 | General-purpose |
| RadBERT (baseline) | 0.45 | 0.52 | No retrieval training |

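For readers less familiar with the two metrics in the table, the sketch below shows how MRR and nDCG@10 are commonly computed from ranked results. It is an illustration only, not the RadLIT-9 evaluation code: the function names `mean_reciprocal_rank` and `ndcg_at_k` are hypothetical, and the ideal ranking is derived from the labels of the retrieved list, which is a simplifying assumption.

```python
import math

def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: one list per query of 0/1 labels in ranked order."""
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)

def ndcg_at_k(labels, k=10):
    """nDCG@k for one query; labels are relevance values in ranked order."""
    def dcg(values):
        return sum(v / math.log2(i + 2) for i, v in enumerate(values))
    ideal = dcg(sorted(labels, reverse=True)[:k])
    return dcg(labels[:k]) / ideal if ideal > 0 else 0.0

# Three toy queries: first relevant document at rank 2, 1, and 3
runs = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(mean_reciprocal_rank(runs))  # (1/2 + 1 + 1/3) / 3
print(ndcg_at_k([0, 1, 0]))        # 1 / log2(3)
```

MRR rewards placing the first relevant document near the top, while nDCG@10 also accounts for the position of every relevant document in the top 10.
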
### Subspecialty Performance

| Subspecialty | MRR | Notes |
|--------------|-----|-------|
| Physics/Nuclear Medicine | 0.936 | Excellent |
| Pediatric Radiology | 0.931 | Excellent |
| Thoracic Imaging | 0.913 | Excellent |
| Cardiac Imaging | 0.862 | Good |
| Neuroradiology | 0.860 | Good |
| Gastrointestinal | 0.800 | Good |
| Breast Imaging | 0.722 | Moderate |
| Musculoskeletal | 0.695 | Moderate |
| Genitourinary | 0.694 | Moderate |

## Quick Start

### Installation

```bash
# Quote the requirement so the shell does not treat ">=" as a redirect
pip install "sentence-transformers>=2.2.0"
```

### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Encode radiology text
documents = [
    "Hepatocellular carcinoma typically shows arterial enhancement with washout on portal venous phase.",
    "Ground-glass opacities in the bilateral lower lobes, concerning for viral pneumonia.",
    "No acute intracranial abnormality. Age-appropriate cerebral volume loss.",
]

queries = [
    "HCC imaging characteristics on CT",
    "COVID-19 chest CT findings",
]

# Generate embeddings
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embeddings = model.encode(queries, normalize_embeddings=True)

# Compute similarities
similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# Query 1 (HCC) will score highest with Document 1
# Query 2 (COVID) will score highest with Document 2
```

### Semantic Search over Your Corpus

```python
from sentence_transformers import SentenceTransformer, util
import torch

# Load model
model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Your radiology corpus (articles, reports, educational content)
corpus = [
    {"id": "doc1", "text": "Pancoast tumor: apical lung mass with rib destruction..."},
    {"id": "doc2", "text": "Hepatic hemangioma shows peripheral nodular enhancement..."},
    {"id": "doc3", "text": "Acoustic neuroma appears as enhancing CP angle mass..."},
    # ... your documents
]

# Pre-compute corpus embeddings (do this once, save for reuse)
corpus_texts = [doc["text"] for doc in corpus]
corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True, show_progress_bar=True)

# Save embeddings for later
torch.save(corpus_embeddings, "corpus_embeddings.pt")

# Search function
def search(query: str, top_k: int = 10):
    query_embedding = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(scores, k=min(top_k, len(corpus)))

    results = []
    for score, idx in zip(top_results.values, top_results.indices):
        results.append({
            "document": corpus[int(idx)],  # convert the tensor index to a plain int
            "score": float(score)
        })
    return results

# Example search
results = search("superior sulcus tumor with Horner syndrome")
for r in results[:3]:
    print(f"Score: {r['score']:.3f} - {r['document']['text'][:100]}...")
```

### Integration with FAISS (Large-Scale)

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Encode your corpus (corpus_texts is a list of strings, as in the previous example)
corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True)
corpus_embeddings = np.array(corpus_embeddings).astype('float32')

# Build FAISS index
dimension = 768
index = faiss.IndexFlatIP(dimension)  # Inner product = cosine for normalized vectors
index.add(corpus_embeddings)

# Save index
faiss.write_index(index, "radiology_index.faiss")

# Search
def faiss_search(query: str, top_k: int = 10):
    query_embedding = model.encode(query, normalize_embeddings=True)
    query_embedding = np.array([query_embedding]).astype('float32')
    scores, indices = index.search(query_embedding, top_k)
    # FAISS pads results with index -1 when fewer than top_k vectors are indexed
    return [(int(idx), float(score)) for idx, score in zip(indices[0], scores[0]) if idx != -1]
```

## Best Practices

### 1. Normalize Embeddings

Always use `normalize_embeddings=True` for retrieval tasks. This enables efficient cosine similarity via dot product.

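The equivalence between cosine similarity and the dot product of unit-length vectors can be checked directly. The sketch below uses random NumPy vectors in place of real embeddings (an assumption made so that it runs without downloading the model):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)  # stand-ins for two 768-dim embeddings
b = rng.normal(size=768)

# Cosine similarity computed from the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize once, then a plain dot product gives the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = a_unit @ b_unit

print(np.isclose(cosine, dot))
```

Because normalization happens once at encoding time, each search needs only a matrix product, which is also why an inner-product index can stand in for cosine similarity.
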
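### 2. Chunk Long Documents

The model has a 512-token limit, so split long articles into overlapping chunks before encoding:

```python
def chunk_text(text: str, max_length: int = 400, overlap: int = 50):
    """Chunk text into overlapping word windows for better retrieval.

    Lengths are counted in whitespace-separated words, not model tokens,
    so a chunk can still exceed 512 tokens; lower max_length if needed.
    """
    if overlap >= max_length:
        raise ValueError("overlap must be smaller than max_length")
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_length - overlap):
        chunks.append(" ".join(words[i:i + max_length]))
        if i + max_length >= len(words):
            break  # the remaining words are already covered by this chunk
    return chunks
```
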
### 3. Batch Processing

For large corpora, use batching:

```python
embeddings = model.encode(
    texts,
    batch_size=32,
    normalize_embeddings=True,
    show_progress_bar=True
)
```

### 4. GPU Acceleration

```python
model = SentenceTransformer("matulichpt/RadLITE-Encoder", device="cuda")
```

## Two-Stage Retrieval (Recommended)

For best results, combine RadLITE-Encoder with the [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

# Stage 1: Fast bi-encoder retrieval
encoder = SentenceTransformer("matulichpt/RadLITE-Encoder")
# Stage 2: Precise cross-encoder reranking
reranker = CrossEncoder("matulichpt/RadLITE-Reranker", max_length=512)

def two_stage_search(query: str, corpus: list, top_k: int = 10):
    # Stage 1: Get top candidates (fast)
    query_emb = encoder.encode(query, normalize_embeddings=True)
    # For repeated searches, encode the corpus once and reuse the embeddings
    corpus_embs = encoder.encode(corpus, normalize_embeddings=True)
    scores = query_emb @ corpus_embs.T
    top_indices = scores.argsort()[-50:][::-1]  # Top 50 candidates

    # Stage 2: Rerank with cross-encoder (precise)
    candidates = [corpus[i] for i in top_indices]
    pairs = [[query, doc] for doc in candidates]
    rerank_scores = reranker.predict(pairs)

    # Apply temperature calibration (recommended: 1.5)
    # Dividing by a positive constant rescales the scores but leaves the ranking unchanged
    rerank_scores = rerank_scores / 1.5

    # Sort by reranked scores
    reranked = sorted(zip(top_indices, rerank_scores), key=lambda x: x[1], reverse=True)
    return reranked[:top_k]
```

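Temperature scaling matters mainly when reranker scores are turned into probabilities, for example to apply a confidence threshold. The sketch below is a minimal illustration under two assumptions that this model card does not confirm: that the reranker returns raw logits, and that a sigmoid is applied afterward. The helper `calibrate` is hypothetical.

```python
import math

def calibrate(logits, temperature=1.5):
    """Map raw logits to probabilities with a temperature-scaled sigmoid."""
    return [1.0 / (1.0 + math.exp(-x / temperature)) for x in logits]

logits = [4.0, 1.0, -2.0]  # made-up reranker outputs, already sorted

sharp = calibrate(logits, temperature=1.0)  # plain sigmoid
soft = calibrate(logits, temperature=1.5)   # pulled toward 0.5

print(sharp)
print(soft)
```

A temperature above 1 moves every probability toward 0.5 without changing the order of the candidates, which makes the scores less overconfident.
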
## Architecture

```
Input Text
    |
    v
[RadBERT Tokenizer] --> tokens (max 512)
    |
    v
[RoBERTa Encoder] --> 12 layers, 768 hidden
    |
    v
[Mean Pooling] --> aggregate token embeddings
    |
    v
768-dim embedding vector
```

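The mean pooling step in the diagram can be sketched in a few lines of NumPy. This is an illustration of the general technique, not code taken from the model: `mean_pool` is a hypothetical helper, and the toy tensor uses a hidden size of 2 instead of 768 to keep the numbers readable.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sequence with two real tokens and one padding position
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # [[2. 3.]]
```

Masking before averaging keeps padding tokens from diluting the sentence vector when texts of different lengths are batched together.
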
## Training Details

- **Base Model**: RadBERT-RoBERTa-4m (pre-trained on 4.42M VA radiology reports)
- **Fine-tuning**: Contrastive learning on radiology education corpus
- **Training Samples**: 6.7M query-document pairs
- **Loss Function**: Multiple Negatives Ranking Loss
- **Epochs**: 2 (8,400 steps)
- **Final Spearman**: 0.8454

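To make the loss function above concrete, here is a NumPy sketch of the idea behind Multiple Negatives Ranking Loss: each query's paired document is the positive, and the other documents in the batch serve as negatives. It illustrates the technique and is not the training code. `mnr_loss` is a hypothetical helper, and `scale=20.0` mirrors the sentence-transformers default; the value used to train this model is not stated here.

```python
import numpy as np

def mnr_loss(query_embs, doc_embs, scale=20.0):
    """In-batch-negatives cross-entropy over scaled cosine similarities.

    Row i of doc_embs is the positive for row i of query_embs; every
    other row in the batch acts as a negative.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                           # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # positives sit on the diagonal

aligned = np.eye(4)  # four toy "embeddings", mutually orthogonal
print(mnr_loss(aligned, aligned))        # pairs match: loss near 0
print(mnr_loss(aligned, aligned[::-1]))  # pairs shuffled: loss near 20
```

Larger batches supply more negatives per query, which is one reason this loss suits large collections of query-document pairs.
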
## Limitations

- **English only**: Trained on English radiology text
- **Domain-specific**: May underperform on non-radiology medical content
- **Subspecialty variance**: Genitourinary, musculoskeletal, and breast imaging content scores lower (MRR 0.69-0.72) than physics/nuclear medicine, pediatric, and thoracic content (MRR above 0.91)
- **512 token limit**: Long documents require chunking

## Citation

If you use RadLITE in your work, please cite both RadLITE and the underlying RadBERT model:

```bibtex
@software{radlite_2026,
  title = {RadLITE: Calibrated Multi-Stage Retrieval for Radiology Education},
  author = {Grai Team},
  year = {2026},
  month = {January},
  url = {https://huggingface.co/matulichpt/RadLITE-Encoder},
  note = {MRR 0.829 on RadLIT-9 benchmark}
}

@article{yan2022radbert,
  title = {RadBERT: Adapting Transformer-based Language Models to Radiology},
  author = {Yan, An and McAuley, Julian and Lu, Xing and Du, Jiang and Chang, Eric Y and Gentili, Amilcare and Hsu, Chun-Nan},
  journal = {Radiology: Artificial Intelligence},
  volume = {4},
  number = {4},
  pages = {e210258},
  year = {2022},
  publisher = {Radiological Society of North America},
  doi = {10.1148/ryai.210258}
}
```

## Related Models

- [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker) - Cross-encoder for reranking
- [RadBERT-RoBERTa-4m](https://huggingface.co/zzxslp/RadBERT-RoBERTa-4m) - Base model

## License

Apache 2.0 - Free for commercial and research use.