---
license: apache-2.0
language:
- en
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- radiology
- medical
- retrieval
- embeddings
- healthcare
- clinical
base_model: zzxslp/RadBERT-RoBERTa-4m
pipeline_tag: sentence-similarity
library_name: sentence-transformers
datasets:
- radiology-education-corpus
metrics:
- mrr
- ndcg
model-index:
- name: RadLITE-Encoder
  results:
  - task:
      type: retrieval
      name: Information Retrieval
    dataset:
      name: RadLIT-9 (Radiology Retrieval Benchmark)
      type: radiology-retrieval
    metrics:
    - type: mrr
      value: 0.829
      name: MRR (with full pipeline)
    - type: ndcg@10
      value: 0.863
      name: nDCG@10
    - type: recall@10
      value: 0.90
      name: Recall@10
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: Radiology Similarity Evaluation
      type: radiology-similarity
    metrics:
    - type: spearman_cosine
      value: 0.8454
      name: Spearman Correlation
    - type: pearson_cosine
      value: 0.8504
      name: Pearson Correlation
---
# RadLITE-Encoder
**Radiology Late Interaction Transformer Enhanced - Bi-Encoder Component**
A domain-specialized sentence transformer for radiology and medical imaging content. This model encodes radiology text (reports, articles, educational content) into 768-dimensional dense vectors optimized for semantic search and retrieval.
> **Recommended:** For optimal retrieval performance, use this encoder with [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker) in a two-stage pipeline. The bi-encoder provides fast candidate retrieval, while the cross-encoder reranker delivers precision. This combination achieves **MRR 0.829** on radiology benchmarks.
## Model Description
| Property | Value |
|----------|-------|
| **Model Type** | Sentence Transformer (Bi-Encoder) |
| **Base Model** | [RadBERT-RoBERTa-4m](https://huggingface.co/zzxslp/RadBERT-RoBERTa-4m) |
| **Domain** | Radiology / Medical Imaging |
| **Vector Dimensions** | 768 |
| **Max Sequence Length** | 512 tokens |
| **Similarity Function** | Cosine Similarity |
| **License** | Apache 2.0 |
### Why RadLITE-Encoder?
Standard embedding models (BGE, E5, OpenAI) are trained on general web text and struggle with radiology-specific terminology:
- **Anatomical terms**: "hepatic flexure", "foramen magnum", "costophrenic angle"
- **Imaging sequences**: "T2 FLAIR", "DWI/ADC mismatch", "post-gadolinium"
- **Pathology descriptions**: "ground-glass opacity", "cortical ribbon sign", "double duct sign"
- **Abbreviations**: "HCC", "RCC", "NSCLC", "BI-RADS"
RadLITE-Encoder is fine-tuned on 6.7 million radiology query-document pairs (see Training Details) to capture this specialized vocabulary.
## Performance
### RadLIT-9 Benchmark (Radiology Retrieval)
| Model | MRR | nDCG@10 | Notes |
|-------|-----|---------|-------|
| **RadLITE-Encoder + Reranker** | **0.829** | **0.863** | Full two-stage pipeline |
| RadLITE-Encoder (standalone) | 0.78 | 0.81 | Bi-encoder only |
| BGE-large-en-v1.5 | 0.72 | 0.76 | General-purpose |
| RadBERT (baseline) | 0.45 | 0.52 | No retrieval training |
### Subspecialty Performance
| Subspecialty | MRR | Notes |
|--------------|-----|-------|
| Physics/Nuclear Medicine | 0.936 | Excellent |
| Pediatric Radiology | 0.931 | Excellent |
| Thoracic Imaging | 0.913 | Excellent |
| Cardiac Imaging | 0.862 | Good |
| Neuroradiology | 0.860 | Good |
| Gastrointestinal | 0.800 | Good |
| Breast Imaging | 0.722 | Moderate |
| Musculoskeletal | 0.695 | Moderate |
| Genitourinary | 0.694 | Moderate |
## Quick Start
### Installation
```bash
pip install "sentence-transformers>=2.2.0"
```
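The FAISS example further down also requires the `faiss` package; install `faiss-cpu` (or `faiss-gpu` on CUDA systems) only if you plan to use that section:
```bash
pip install faiss-cpu
```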
### Basic Usage
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("matulichpt/RadLITE-Encoder")
# Encode radiology text
documents = [
    "Hepatocellular carcinoma typically shows arterial enhancement with washout on portal venous phase.",
    "Ground-glass opacities in the bilateral lower lobes, concerning for viral pneumonia.",
    "No acute intracranial abnormality. Age-appropriate cerebral volume loss.",
]

queries = [
    "HCC imaging characteristics on CT",
    "COVID-19 chest CT findings",
]
# Generate embeddings
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embeddings = model.encode(queries, normalize_embeddings=True)
# Compute similarities
similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# Query 1 (HCC) will score highest with Document 1
# Query 2 (COVID) will score highest with Document 2
```
### Semantic Search over Your Corpus
```python
from sentence_transformers import SentenceTransformer, util
import torch
# Load model
model = SentenceTransformer("matulichpt/RadLITE-Encoder")
# Your radiology corpus (articles, reports, educational content)
corpus = [
    {"id": "doc1", "text": "Pancoast tumor: apical lung mass with rib destruction..."},
    {"id": "doc2", "text": "Hepatic hemangioma shows peripheral nodular enhancement..."},
    {"id": "doc3", "text": "Acoustic neuroma appears as enhancing CP angle mass..."},
    # ... your documents
]
# Pre-compute corpus embeddings (do this once, save for reuse)
corpus_texts = [doc["text"] for doc in corpus]
corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True, show_progress_bar=True)
# Save embeddings for later
torch.save(corpus_embeddings, "corpus_embeddings.pt")
# Search function
def search(query: str, top_k: int = 10):
    query_embedding = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(scores, k=min(top_k, len(corpus)))
    results = []
    for score, idx in zip(top_results.values, top_results.indices):
        results.append({
            "document": corpus[idx],
            "score": float(score),
        })
    return results
# Example search
results = search("superior sulcus tumor with Horner syndrome")
for r in results[:3]:
    print(f"Score: {r['score']:.3f} - {r['document']['text'][:100]}...")
```
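Because the corpus embeddings were saved with `torch.save` above, a later session can reload them instead of re-encoding the whole corpus; a minimal sketch:
```python
import torch

# Reload pre-computed embeddings; keep the corpus list in the same order
corpus_embeddings = torch.load("corpus_embeddings.pt")
```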
### Integration with FAISS (Large-Scale)
```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("matulichpt/RadLITE-Encoder")
# Encode your corpus
corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True)
corpus_embeddings = np.array(corpus_embeddings).astype('float32')
# Build FAISS index
dimension = 768
index = faiss.IndexFlatIP(dimension) # Inner product = cosine for normalized vectors
index.add(corpus_embeddings)
# Save index
faiss.write_index(index, "radiology_index.faiss")
# Search
def faiss_search(query: str, top_k: int = 10):
    query_embedding = model.encode(query, normalize_embeddings=True)
    query_embedding = np.array([query_embedding]).astype('float32')
    scores, indices = index.search(query_embedding, top_k)
    return [(int(idx), float(score)) for idx, score in zip(indices[0], scores[0])]
```
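The persisted index can likewise be reloaded in a later process, so the corpus only needs to be encoded once:
```python
import faiss

# Reload the saved index; the corpus list must stay in the order it was indexed
index = faiss.read_index("radiology_index.faiss")
hits = faiss_search("apical lung mass with rib destruction", top_k=5)
```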
## Best Practices
### 1. Normalize Embeddings
Always use `normalize_embeddings=True` for retrieval tasks. This enables efficient cosine similarity via dot product.
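To see why: for unit-length vectors, the dot product equals cosine similarity, so normalized embeddings allow fast matrix products (and FAISS inner-product indexes) in place of explicit cosine computations. A quick sanity check, reusing the `model` loaded above:
```python
import numpy as np

a = model.encode("pleural effusion", normalize_embeddings=True)
b = model.encode("fluid in the pleural space", normalize_embeddings=True)

# Dot product and cosine similarity agree because both vectors have unit norm
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(np.dot(a, b), cosine)
```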
### 2. Chunk Long Documents
The model has a 512 token limit. For long articles:
```python
def chunk_text(text: str, max_length: int = 400, overlap: int = 50):
    """Chunk text with overlap for better retrieval (word count as a rough proxy for tokens)."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_length - overlap):
        chunk = " ".join(words[i:i + max_length])
        chunks.append(chunk)
    return chunks
```
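For example, a long article can be encoded chunk by chunk and each chunk mapped back to its source document (the `article_text` variable here is a hypothetical stand-in for your content):
```python
chunks = chunk_text(article_text)
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
# At query time, score every chunk and keep the best-scoring chunk per document
```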
### 3. Batch Processing
For large corpora, use batching:
```python
embeddings = model.encode(
    texts,
    batch_size=32,
    normalize_embeddings=True,
    show_progress_bar=True,
)
```
### 4. GPU Acceleration
```python
model = SentenceTransformer("matulichpt/RadLITE-Encoder", device="cuda")
```
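If the deployment environment may or may not have a GPU, a small fallback keeps a single code path (a sketch, using standard PyTorch device detection):
```python
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("matulichpt/RadLITE-Encoder", device=device)
```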
## Two-Stage Retrieval (Recommended)
For best results, combine RadLITE-Encoder with the [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker):
```python
from sentence_transformers import SentenceTransformer, CrossEncoder
# Stage 1: Fast bi-encoder retrieval
encoder = SentenceTransformer("matulichpt/RadLITE-Encoder")
# Stage 2: Precise cross-encoder reranking
reranker = CrossEncoder("matulichpt/RadLITE-Reranker", max_length=512)
def two_stage_search(query: str, corpus: list, top_k: int = 10):
    # Stage 1: Get top candidates (fast)
    query_emb = encoder.encode(query, normalize_embeddings=True)
    corpus_embs = encoder.encode(corpus, normalize_embeddings=True)
    scores = query_emb @ corpus_embs.T
    top_indices = scores.argsort()[-50:][::-1]  # Top 50 candidates

    # Stage 2: Rerank with cross-encoder (precise)
    candidates = [corpus[i] for i in top_indices]
    pairs = [[query, doc] for doc in candidates]
    rerank_scores = reranker.predict(pairs)

    # Apply temperature calibration (recommended: 1.5). This rescales scores for
    # better-calibrated confidence; being monotonic, it does not change the ranking.
    rerank_scores = rerank_scores / 1.5

    # Sort by reranked scores
    reranked = sorted(zip(top_indices, rerank_scores), key=lambda x: x[1], reverse=True)
    return reranked[:top_k]
```
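Example usage, reusing the `corpus_texts` list from the FAISS section (the function returns `(corpus_index, score)` pairs):
```python
results = two_stage_search("arterial enhancement with portal venous washout", corpus_texts)
for idx, score in results[:3]:
    print(f"{score:.3f}  {corpus_texts[idx][:80]}")
```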
## Architecture
```
Input Text
    |
    v
[RadBERT Tokenizer] --> tokens (max 512)
    |
    v
[RoBERTa Encoder]   --> 12 layers, 768 hidden
    |
    v
[Mean Pooling]      --> aggregate token embeddings
    |
    v
768-dim embedding vector
```
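For reference, the mean-pooling step can be reproduced with plain `transformers` (a sketch assuming the repository ships standard RoBERTa weights, as sentence-transformers models do):
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("matulichpt/RadLITE-Encoder")
encoder = AutoModel.from_pretrained("matulichpt/RadLITE-Encoder")

batch = tokenizer(
    ["No acute intracranial abnormality."],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling: average token embeddings, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```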
## Training Details
- **Base Model**: RadBERT-RoBERTa-4m (pre-trained on 4.42M VA radiology reports)
- **Fine-tuning**: Contrastive learning on radiology education corpus
- **Training Samples**: 6.7M query-document pairs
- **Loss Function**: Multiple Negatives Ranking Loss
- **Epochs**: 2 (8,400 steps)
- **Final Spearman**: 0.8454
## Limitations
- **English only**: Trained on English radiology text
- **Domain-specific**: May underperform on non-radiology medical content
- **Subspecialty variance**: GU/MSK content has lower performance than Physics/Neuro
- **512 token limit**: Long documents require chunking
## Citation
If you use RadLITE in your work, please cite both RadLITE and the underlying RadBERT model:
```bibtex
@software{radlite_2026,
  title  = {RadLITE: Calibrated Multi-Stage Retrieval for Radiology Education},
  author = {Grai Team},
  year   = {2026},
  month  = {January},
  url    = {https://huggingface.co/matulichpt/RadLITE-Encoder},
  note   = {MRR 0.829 on RadLIT-9 benchmark}
}

@article{yan2022radbert,
  title     = {RadBERT: Adapting Transformer-based Language Models to Radiology},
  author    = {Yan, An and McAuley, Julian and Lu, Xing and Du, Jiang and Chang, Eric Y and Gentili, Amilcare and Hsu, Chun-Nan},
  journal   = {Radiology: Artificial Intelligence},
  volume    = {4},
  number    = {4},
  pages     = {e210258},
  year      = {2022},
  publisher = {Radiological Society of North America},
  doi       = {10.1148/ryai.210258}
}
```
## Related Models
- [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker) - Cross-encoder for reranking
- [RadBERT-RoBERTa-4m](https://huggingface.co/zzxslp/RadBERT-RoBERTa-4m) - Base model
## License
Apache 2.0 - Free for commercial and research use.