---
license: apache-2.0
language:
- en
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- radiology
- medical
- retrieval
- embeddings
- healthcare
- clinical
base_model: zzxslp/RadBERT-RoBERTa-4m
pipeline_tag: sentence-similarity
library_name: sentence-transformers
datasets:
- radiology-education-corpus
metrics:
- mrr
- ndcg
model-index:
- name: RadLITE-Encoder
  results:
  - task:
      type: retrieval
      name: Information Retrieval
    dataset:
      name: RadLIT-9 (Radiology Retrieval Benchmark)
      type: radiology-retrieval
    metrics:
    - type: mrr
      value: 0.829
      name: MRR (with full pipeline)
    - type: ndcg@10
      value: 0.863
      name: nDCG@10
    - type: recall@10
      value: 0.90
      name: Recall@10
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: Radiology Similarity Evaluation
      type: radiology-similarity
    metrics:
    - type: spearman_cosine
      value: 0.8454
      name: Spearman Correlation
    - type: pearson_cosine
      value: 0.8504
      name: Pearson Correlation
---

# RadLITE-Encoder

**Radiology Late Interaction Transformer Enhanced - Bi-Encoder Component**

A domain-specialized sentence transformer for radiology and medical imaging content. This model encodes radiology text (reports, articles, educational content) into 768-dimensional dense vectors optimized for semantic search and retrieval.

> **Recommended:** For optimal retrieval performance, use this encoder with [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker) in a two-stage pipeline. The bi-encoder provides fast candidate retrieval, while the cross-encoder reranker delivers precision. This combination achieves **MRR 0.829** on radiology benchmarks.

## Model Description

| Property | Value |
|----------|-------|
| **Model Type** | Sentence Transformer (Bi-Encoder) |
| **Base Model** | [RadBERT-RoBERTa-4m](https://huggingface.co/zzxslp/RadBERT-RoBERTa-4m) |
| **Domain** | Radiology / Medical Imaging |
| **Vector Dimensions** | 768 |
| **Max Sequence Length** | 512 tokens |
| **Similarity Function** | Cosine Similarity |
| **License** | Apache 2.0 |

### Why RadLITE-Encoder?

Standard embedding models (BGE, E5, OpenAI) are trained on general web text and struggle with radiology-specific terminology:

- **Anatomical terms**: "hepatic flexure", "foramen magnum", "costophrenic angle"
- **Imaging sequences**: "T2 FLAIR", "DWI/ADC mismatch", "post-gadolinium"
- **Pathology descriptions**: "ground-glass opacity", "cortical ribbon sign", "double duct sign"
- **Abbreviations**: "HCC", "RCC", "NSCLC", "BI-RADS"

RadLITE-Encoder is fine-tuned on 6.7M radiology query-document pairs (see Training Details below) to understand this specialized vocabulary.
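The effect is easy to sanity-check: a domain encoder should place an abbreviation close to its expansion. Below is a minimal sketch of such a check; the query and candidate sentences are illustrative examples, not benchmark items.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("matulichpt/RadLITE-Encoder")

query = "HCC"  # abbreviation for hepatocellular carcinoma
candidates = [
    "Hepatocellular carcinoma is the most common primary liver malignancy.",
    "The costophrenic angles are sharp bilaterally.",
]

# With normalized embeddings, the dot product equals cosine similarity
query_emb = model.encode(query, normalize_embeddings=True)
cand_embs = model.encode(candidates, normalize_embeddings=True)

for text, score in zip(candidates, cand_embs @ query_emb):
    print(f"{score:.3f}  {text}")
# Expect the hepatocellular carcinoma sentence to score well above the unrelated one
```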
## Performance

### RadLIT-9 Benchmark (Radiology Retrieval)

| Model | MRR | nDCG@10 | Notes |
|-------|-----|---------|-------|
| **RadLITE-Encoder** | **0.829** | **0.863** | Full pipeline with reranker |
| RadLITE-Encoder (standalone) | 0.78 | 0.81 | Bi-encoder only |
| BGE-large-en-v1.5 | 0.72 | 0.76 | General-purpose |
| RadBERT (baseline) | 0.45 | 0.52 | No retrieval training |

### Subspecialty Performance

| Subspecialty | MRR | Notes |
|--------------|-----|-------|
| Physics/Nuclear Medicine | 0.936 | Excellent |
| Pediatric Radiology | 0.931 | Excellent |
| Thoracic Imaging | 0.913 | Excellent |
| Cardiac Imaging | 0.862 | Good |
| Neuroradiology | 0.860 | Good |
| Gastrointestinal | 0.800 | Good |
| Breast Imaging | 0.722 | Moderate |
| Musculoskeletal | 0.695 | Moderate |
| Genitourinary | 0.694 | Moderate |

## Quick Start

### Installation

```bash
pip install "sentence-transformers>=2.2.0"
```

### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Encode radiology text
documents = [
    "Hepatocellular carcinoma typically shows arterial enhancement with washout on portal venous phase.",
    "Ground-glass opacities in the bilateral lower lobes, concerning for viral pneumonia.",
    "No acute intracranial abnormality. Age-appropriate cerebral volume loss.",
]

queries = [
    "HCC imaging characteristics on CT",
    "COVID-19 chest CT findings",
]

# Generate embeddings
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embeddings = model.encode(queries, normalize_embeddings=True)

# Compute similarities (dot product = cosine for normalized embeddings)
similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# Query 1 (HCC) should score highest with Document 1
# Query 2 (COVID) should score highest with Document 2
```

### Semantic Search over Your Corpus

```python
from sentence_transformers import SentenceTransformer, util
import torch

# Load model
model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Your radiology corpus (articles, reports, educational content)
corpus = [
    {"id": "doc1", "text": "Pancoast tumor: apical lung mass with rib destruction..."},
    {"id": "doc2", "text": "Hepatic hemangioma shows peripheral nodular enhancement..."},
    {"id": "doc3", "text": "Acoustic neuroma appears as enhancing CP angle mass..."},
    # ... your documents
]

# Pre-compute corpus embeddings (do this once, save for reuse)
corpus_texts = [doc["text"] for doc in corpus]
corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True, show_progress_bar=True)

# Save embeddings for later
torch.save(corpus_embeddings, "corpus_embeddings.pt")

# Search function
def search(query: str, top_k: int = 10):
    query_embedding = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(scores, k=min(top_k, len(corpus)))

    results = []
    for score, idx in zip(top_results.values, top_results.indices):
        results.append({
            "document": corpus[idx],
            "score": float(score)
        })
    return results

# Example search
results = search("superior sulcus tumor with Horner syndrome")
for r in results[:3]:
    print(f"Score: {r['score']:.3f} - {r['document']['text'][:100]}...")
```
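The snippet above saves `corpus_embeddings.pt` for reuse. A minimal sketch of the reload path, assuming the `corpus` list from that snippet is unchanged and in the same order:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Reload the cached embeddings instead of re-encoding the corpus.
# weights_only=False is needed on recent PyTorch versions because encode()
# returns a pickled NumPy array rather than a plain tensor.
corpus_embeddings = torch.load("corpus_embeddings.pt", weights_only=False)

query_embedding = model.encode("adrenal adenoma washout criteria", normalize_embeddings=True)
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]

best_idx = int(scores.argmax())
print(f"Score: {float(scores[best_idx]):.3f} - {corpus[best_idx]['text'][:100]}...")
```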
### Integration with FAISS (Large-Scale)

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Encode your corpus
corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True)
corpus_embeddings = np.array(corpus_embeddings).astype('float32')

# Build FAISS index
dimension = 768
index = faiss.IndexFlatIP(dimension)  # Inner product = cosine for normalized vectors
index.add(corpus_embeddings)

# Save index
faiss.write_index(index, "radiology_index.faiss")

# Search
def faiss_search(query: str, top_k: int = 10):
    query_embedding = model.encode(query, normalize_embeddings=True)
    query_embedding = np.array([query_embedding]).astype('float32')
    scores, indices = index.search(query_embedding, top_k)
    return [(int(idx), float(score)) for idx, score in zip(indices[0], scores[0])]
```

## Best Practices

### 1. Normalize Embeddings

Always use `normalize_embeddings=True` for retrieval tasks. This enables efficient cosine similarity via dot product.

### 2. Chunk Long Documents

The model has a 512-token limit. For long articles:

```python
def chunk_text(text: str, max_length: int = 400, overlap: int = 50):
    """Chunk text with overlap for better retrieval.

    Note: lengths are in words, which only approximate token counts;
    400 words typically stays under the 512-token limit.
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_length - overlap):
        chunk = " ".join(words[i:i + max_length])
        chunks.append(chunk)
    return chunks
```
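Chunking raises the question of how to rank whole documents rather than chunks. One common approach, sketched below using the `chunk_text` helper above (the document texts and query are illustrative), is to score every chunk and keep each document's best chunk score:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("matulichpt/RadLITE-Encoder")

documents = [
    "Long radiology review article text ...",
    "Another long article ...",
]

# Flatten documents into chunks, remembering each chunk's parent document
chunks, parent_ids = [], []
for doc_id, text in enumerate(documents):
    for chunk in chunk_text(text):
        chunks.append(chunk)
        parent_ids.append(doc_id)

chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
query_embedding = model.encode("double duct sign differential", normalize_embeddings=True)

# Cosine similarity per chunk (both sides normalized)
scores = chunk_embeddings @ query_embedding

# Score each document by its best-matching chunk (max-pooling)
doc_scores = {}
for doc_id, score in zip(parent_ids, scores):
    doc_scores[doc_id] = max(doc_scores.get(doc_id, float("-inf")), float(score))

ranked = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
print(ranked)
```

Max-pooling favors documents containing at least one highly relevant passage; averaging chunk scores is an alternative when overall topicality matters more than a single strong match.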
### 3. Batch Processing

For large corpora, use batching:

```python
embeddings = model.encode(
    texts,
    batch_size=32,
    normalize_embeddings=True,
    show_progress_bar=True
)
```

### 4. GPU Acceleration

```python
model = SentenceTransformer("matulichpt/RadLITE-Encoder", device="cuda")
```

## Two-Stage Retrieval (Recommended)

For best results, combine RadLITE-Encoder with the [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

# Stage 1: Fast bi-encoder retrieval
encoder = SentenceTransformer("matulichpt/RadLITE-Encoder")

# Stage 2: Precise cross-encoder reranking
reranker = CrossEncoder("matulichpt/RadLITE-Reranker", max_length=512)

def two_stage_search(query: str, corpus: list, top_k: int = 10):
    # Stage 1: Get top candidates (fast)
    # (In production, pre-compute and cache the corpus embeddings as shown above)
    query_emb = encoder.encode(query, normalize_embeddings=True)
    corpus_embs = encoder.encode(corpus, normalize_embeddings=True)
    scores = query_emb @ corpus_embs.T
    top_indices = scores.argsort()[-50:][::-1]  # Top 50 candidates

    # Stage 2: Rerank with cross-encoder (precise)
    candidates = [corpus[i] for i in top_indices]
    pairs = [[query, doc] for doc in candidates]
    rerank_scores = reranker.predict(pairs)

    # Apply temperature calibration (recommended: 1.5); this rescales scores
    # for downstream use but, being monotonic, does not change the ranking
    rerank_scores = rerank_scores / 1.5

    # Sort by reranked scores
    reranked = sorted(zip(top_indices, rerank_scores), key=lambda x: x[1], reverse=True)
    return reranked[:top_k]
```

## Architecture

```
Input Text
    |
    v
[RadBERT Tokenizer] --> tokens (max 512)
    |
    v
[RoBERTa Encoder] --> 12 layers, 768 hidden
    |
    v
[Mean Pooling] --> aggregate token embeddings
    |
    v
768-dim embedding vector
```

## Training Details

- **Base Model**: RadBERT-RoBERTa-4m (pre-trained on 4.42M VA radiology reports)
- **Fine-tuning**: Contrastive learning on radiology education corpus
- **Training Samples**: 6.7M query-document pairs
- **Loss Function**: Multiple Negatives Ranking Loss
- **Epochs**: 2 (8,400 steps)
- **Final Spearman**: 0.8454

## Limitations

- **English only**: Trained on English radiology text
- **Domain-specific**: May underperform on non-radiology medical content
- **Subspecialty variance**: GU/MSK content has lower performance than Physics/Neuro
- **512 token limit**: Long documents require chunking

## Citation

If you use RadLITE in your work, please cite both RadLITE and the underlying RadBERT model:

```bibtex
@software{radlite_2026,
  title  = {RadLITE: Calibrated Multi-Stage Retrieval for Radiology Education},
  author = {Grai Team},
  year   = {2026},
  month  = {January},
  url    = {https://huggingface.co/matulichpt/RadLITE-Encoder},
  note   = {MRR 0.829 on RadLIT-9 benchmark}
}

@article{yan2022radbert,
  title     = {RadBERT: Adapting Transformer-based Language Models to Radiology},
  author    = {Yan, An and McAuley, Julian and Lu, Xing and Du, Jiang and Chang, Eric Y and Gentili, Amilcare and Hsu, Chun-Nan},
  journal   = {Radiology: Artificial Intelligence},
  volume    = {4},
  number    = {4},
  pages     = {e210258},
  year      = {2022},
  publisher = {Radiological Society of North America},
  doi       = {10.1148/ryai.210258}
}
```

## Related Models

- [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker) - Cross-encoder for reranking
- [RadBERT-RoBERTa-4m](https://huggingface.co/zzxslp/RadBERT-RoBERTa-4m) - Base model

## License

Apache 2.0 - Free for commercial and research use.