---
license: apache-2.0
language:
- en
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- radiology
- medical
- retrieval
- embeddings
- healthcare
- clinical
base_model: zzxslp/RadBERT-RoBERTa-4m
pipeline_tag: sentence-similarity
library_name: sentence-transformers
datasets:
- radiology-education-corpus
metrics:
- mrr
- ndcg
model-index:
- name: RadLITE-Encoder
  results:
  - task:
      type: retrieval
      name: Information Retrieval
    dataset:
      name: RadLIT-9 (Radiology Retrieval Benchmark)
      type: radiology-retrieval
    metrics:
    - type: mrr
      value: 0.829
      name: MRR (with full pipeline)
    - type: ndcg@10
      value: 0.863
      name: nDCG@10
    - type: recall@10
      value: 0.90
      name: Recall@10
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: Radiology Similarity Evaluation
      type: radiology-similarity
    metrics:
    - type: spearman_cosine
      value: 0.8454
      name: Spearman Correlation
    - type: pearson_cosine
      value: 0.8504
      name: Pearson Correlation
---

# RadLITE-Encoder

**Radiology Late Interaction Transformer Enhanced - Bi-Encoder Component**

A domain-specialized sentence transformer for radiology and medical imaging content. This model encodes radiology text (reports, articles, educational content) into 768-dimensional dense vectors optimized for semantic search and retrieval.

> **Recommended:** For optimal retrieval performance, use this encoder with [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker) in a two-stage pipeline. The bi-encoder provides fast candidate retrieval, while the cross-encoder reranker delivers precision. This combination achieves **MRR 0.829** on radiology benchmarks.

## Model Description

| Property | Value |
|----------|-------|
| **Model Type** | Sentence Transformer (Bi-Encoder) |
| **Base Model** | [RadBERT-RoBERTa-4m](https://huggingface.co/zzxslp/RadBERT-RoBERTa-4m) |
| **Domain** | Radiology / Medical Imaging |
| **Vector Dimensions** | 768 |
| **Max Sequence Length** | 512 tokens |
| **Similarity Function** | Cosine Similarity |
| **License** | Apache 2.0 |

### Why RadLITE-Encoder?

Standard embedding models (BGE, E5, OpenAI) are trained on general web text and struggle with radiology-specific terminology:

- **Anatomical terms**: "hepatic flexure", "foramen magnum", "costophrenic angle"
- **Imaging sequences**: "T2 FLAIR", "DWI/ADC mismatch", "post-gadolinium"
- **Pathology descriptions**: "ground-glass opacity", "cortical ribbon sign", "double duct sign"
- **Abbreviations**: "HCC", "RCC", "NSCLC", "BI-RADS"

RadLITE-Encoder is fine-tuned on 6.7M radiology query-document pairs (see Training Details below) to understand this specialized vocabulary.
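The effect is easy to sanity-check: a domain encoder should place an abbreviation close to its expansion. Below is a minimal sketch of such a check; the query and candidate sentences are illustrative examples, not benchmark items.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("matulichpt/RadLITE-Encoder")

query = "HCC"  # abbreviation for hepatocellular carcinoma
candidates = [
    "Hepatocellular carcinoma is the most common primary liver malignancy.",
    "The costophrenic angles are sharp bilaterally.",
]

# With normalized embeddings, the dot product equals cosine similarity
query_emb = model.encode(query, normalize_embeddings=True)
cand_embs = model.encode(candidates, normalize_embeddings=True)

for text, score in zip(candidates, cand_embs @ query_emb):
    print(f"{score:.3f}  {text}")
# Expect the hepatocellular carcinoma sentence to score well above the unrelated one
```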
## Performance

### RadLIT-9 Benchmark (Radiology Retrieval)

| Model | MRR | nDCG@10 | Notes |
|-------|-----|---------|-------|
| **RadLITE-Encoder** | **0.829** | **0.863** | Full pipeline with reranker |
| RadLITE-Encoder (standalone) | 0.78 | 0.81 | Bi-encoder only |
| BGE-large-en-v1.5 | 0.72 | 0.76 | General-purpose |
| RadBERT (baseline) | 0.45 | 0.52 | No retrieval training |

### Subspecialty Performance

| Subspecialty | MRR | Notes |
|--------------|-----|-------|
| Physics/Nuclear Medicine | 0.936 | Excellent |
| Pediatric Radiology | 0.931 | Excellent |
| Thoracic Imaging | 0.913 | Excellent |
| Cardiac Imaging | 0.862 | Good |
| Neuroradiology | 0.860 | Good |
| Gastrointestinal | 0.800 | Good |
| Breast Imaging | 0.722 | Moderate |
| Musculoskeletal | 0.695 | Moderate |
| Genitourinary | 0.694 | Moderate |

## Quick Start

### Installation

```bash
pip install "sentence-transformers>=2.2.0"
```

### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Encode radiology text
documents = [
    "Hepatocellular carcinoma typically shows arterial enhancement with washout on portal venous phase.",
    "Ground-glass opacities in the bilateral lower lobes, concerning for viral pneumonia.",
    "No acute intracranial abnormality. Age-appropriate cerebral volume loss.",
]

queries = [
    "HCC imaging characteristics on CT",
    "COVID-19 chest CT findings",
]

# Generate embeddings
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embeddings = model.encode(queries, normalize_embeddings=True)

# Compute similarities (dot product = cosine for normalized embeddings)
similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# Query 1 (HCC) should score highest with Document 1
# Query 2 (COVID) should score highest with Document 2
```

### Semantic Search over Your Corpus

```python
from sentence_transformers import SentenceTransformer, util
import torch

# Load model
model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Your radiology corpus (articles, reports, educational content)
corpus = [
    {"id": "doc1", "text": "Pancoast tumor: apical lung mass with rib destruction..."},
    {"id": "doc2", "text": "Hepatic hemangioma shows peripheral nodular enhancement..."},
    {"id": "doc3", "text": "Acoustic neuroma appears as enhancing CP angle mass..."},
    # ... your documents
]

# Pre-compute corpus embeddings (do this once, save for reuse)
corpus_texts = [doc["text"] for doc in corpus]
corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True, show_progress_bar=True)

# Save embeddings for later
torch.save(corpus_embeddings, "corpus_embeddings.pt")

# Search function
def search(query: str, top_k: int = 10):
    query_embedding = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(scores, k=min(top_k, len(corpus)))

    results = []
    for score, idx in zip(top_results.values, top_results.indices):
        results.append({
            "document": corpus[idx],
            "score": float(score)
        })
    return results

# Example search
results = search("superior sulcus tumor with Horner syndrome")
for r in results[:3]:
    print(f"Score: {r['score']:.3f} - {r['document']['text'][:100]}...")
```
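The snippet above saves `corpus_embeddings.pt` for reuse. A minimal sketch of the reload path, assuming the `corpus` list from that snippet is unchanged and in the same order:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Reload the cached embeddings instead of re-encoding the corpus.
# weights_only=False is needed on recent PyTorch versions because encode()
# returns a pickled NumPy array rather than a plain tensor.
corpus_embeddings = torch.load("corpus_embeddings.pt", weights_only=False)

query_embedding = model.encode("adrenal adenoma washout criteria", normalize_embeddings=True)
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]

best_idx = int(scores.argmax())
print(f"Score: {float(scores[best_idx]):.3f} - {corpus[best_idx]['text'][:100]}...")
```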
### Integration with FAISS (Large-Scale)

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Encode your corpus
corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True)
corpus_embeddings = np.array(corpus_embeddings).astype('float32')

# Build FAISS index
dimension = 768
index = faiss.IndexFlatIP(dimension)  # Inner product = cosine for normalized vectors
index.add(corpus_embeddings)

# Save index
faiss.write_index(index, "radiology_index.faiss")

# Search
def faiss_search(query: str, top_k: int = 10):
    query_embedding = model.encode(query, normalize_embeddings=True)
    query_embedding = np.array([query_embedding]).astype('float32')
    scores, indices = index.search(query_embedding, top_k)
    return [(int(idx), float(score)) for idx, score in zip(indices[0], scores[0])]
```

## Best Practices

### 1. Normalize Embeddings

Always use `normalize_embeddings=True` for retrieval tasks. This enables efficient cosine similarity via dot product.

### 2. Chunk Long Documents

The model has a 512-token limit. For long articles:

```python
def chunk_text(text: str, max_length: int = 400, overlap: int = 50):
    """Chunk text with overlap for better retrieval.

    Note: lengths are in words, which only approximate token counts;
    400 words typically stays under the 512-token limit.
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_length - overlap):
        chunk = " ".join(words[i:i + max_length])
        chunks.append(chunk)
    return chunks
```
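Chunking raises the question of how to rank whole documents rather than chunks. One common approach, sketched below using the `chunk_text` helper above (the document texts and query are illustrative), is to score every chunk and keep each document's best chunk score:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("matulichpt/RadLITE-Encoder")

documents = [
    "Long radiology review article text ...",
    "Another long article ...",
]

# Flatten documents into chunks, remembering each chunk's parent document
chunks, parent_ids = [], []
for doc_id, text in enumerate(documents):
    for chunk in chunk_text(text):
        chunks.append(chunk)
        parent_ids.append(doc_id)

chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
query_embedding = model.encode("double duct sign differential", normalize_embeddings=True)

# Cosine similarity per chunk (both sides normalized)
scores = chunk_embeddings @ query_embedding

# Score each document by its best-matching chunk (max-pooling)
doc_scores = {}
for doc_id, score in zip(parent_ids, scores):
    doc_scores[doc_id] = max(doc_scores.get(doc_id, float("-inf")), float(score))

ranked = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
print(ranked)
```

Max-pooling favors documents containing at least one highly relevant passage; averaging chunk scores is an alternative when overall topicality matters more than a single strong match.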
### 3. Batch Processing

For large corpora, use batching:

```python
embeddings = model.encode(
    texts,
    batch_size=32,
    normalize_embeddings=True,
    show_progress_bar=True
)
```

### 4. GPU Acceleration

```python
model = SentenceTransformer("matulichpt/RadLITE-Encoder", device="cuda")
```

## Two-Stage Retrieval (Recommended)

For best results, combine RadLITE-Encoder with the [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

# Stage 1: Fast bi-encoder retrieval
encoder = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Stage 2: Precise cross-encoder reranking
reranker = CrossEncoder("matulichpt/RadLITE-Reranker", max_length=512)

def two_stage_search(query: str, corpus: list, top_k: int = 10):
    # Stage 1: Get top candidates (fast)
    # (In production, pre-compute and cache the corpus embeddings as shown above)
    query_emb = encoder.encode(query, normalize_embeddings=True)
    corpus_embs = encoder.encode(corpus, normalize_embeddings=True)
    scores = query_emb @ corpus_embs.T
    top_indices = scores.argsort()[-50:][::-1]  # Top 50 candidates

    # Stage 2: Rerank with cross-encoder (precise)
    candidates = [corpus[i] for i in top_indices]
    pairs = [[query, doc] for doc in candidates]
    rerank_scores = reranker.predict(pairs)

    # Apply temperature calibration (recommended: 1.5); this rescales scores
    # for downstream use but, being monotonic, does not change the ranking
    rerank_scores = rerank_scores / 1.5

    # Sort by reranked scores
    reranked = sorted(zip(top_indices, rerank_scores), key=lambda x: x[1], reverse=True)
    return reranked[:top_k]
```

## Architecture

```
Input Text
    |
    v
[RadBERT Tokenizer] --> tokens (max 512)
    |
    v
[RoBERTa Encoder] --> 12 layers, 768 hidden
    |
    v
[Mean Pooling] --> aggregate token embeddings
    |
    v
768-dim embedding vector
```

## Training Details

- **Base Model**: RadBERT-RoBERTa-4m (pre-trained on 4.42M VA radiology reports)
- **Fine-tuning**: Contrastive learning on radiology education corpus
- **Training Samples**: 6.7M query-document pairs
- **Loss Function**: Multiple Negatives Ranking Loss
- **Epochs**: 2 (8,400 steps)
- **Final Spearman**: 0.8454

## Limitations

- **English only**: Trained on English radiology text
- **Domain-specific**: May underperform on non-radiology medical content
- **Subspecialty variance**: GU/MSK content has lower performance than Physics/Neuro
- **512 token limit**: Long documents require chunking

## Citation

If you use RadLITE in your work, please cite both RadLITE and the underlying RadBERT model:

```bibtex
@software{radlite_2026,
  title  = {RadLITE: Calibrated Multi-Stage Retrieval for Radiology Education},
  author = {Grai Team},
  year   = {2026},
  month  = {January},
  url    = {https://huggingface.co/matulichpt/RadLITE-Encoder},
  note   = {MRR 0.829 on RadLIT-9 benchmark}
}

@article{yan2022radbert,
  title     = {RadBERT: Adapting Transformer-based Language Models to Radiology},
  author    = {Yan, An and McAuley, Julian and Lu, Xing and Du, Jiang and Chang, Eric Y and Gentili, Amilcare and Hsu, Chun-Nan},
  journal   = {Radiology: Artificial Intelligence},
  volume    = {4},
  number    = {4},
  pages     = {e210258},
  year      = {2022},
  publisher = {Radiological Society of North America},
  doi       = {10.1148/ryai.210258}
}
```

## Related Models

- [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker) - Cross-encoder for reranking
- [RadBERT-RoBERTa-4m](https://huggingface.co/zzxslp/RadBERT-RoBERTa-4m) - Base model

## License

Apache 2.0 - Free for commercial and research use.