	---
	license: apache-2.0
	language:
	- en
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- radiology
	- medical
	- retrieval
	- embeddings
	- healthcare
	- clinical
	base_model: zzxslp/RadBERT-RoBERTa-4m
	pipeline_tag: sentence-similarity
	library_name: sentence-transformers
	datasets:
	- radiology-education-corpus
	metrics:
	- mrr
	- ndcg
	model-index:
	- name: RadLITE-Encoder
	  results:
	  - task:
	      type: retrieval
	      name: Information Retrieval
	    dataset:
	      name: RadLIT-9 (Radiology Retrieval Benchmark)
	      type: radiology-retrieval
	    metrics:
	    - type: mrr
	      value: 0.829
	      name: MRR (with full pipeline)
	    - type: ndcg@10
	      value: 0.863
	      name: nDCG@10
	    - type: recall@10
	      value: 0.90
	      name: Recall@10
	  - task:
	      type: semantic-similarity
	      name: Semantic Similarity
	    dataset:
	      name: Radiology Similarity Evaluation
	      type: radiology-similarity
	    metrics:
	    - type: spearman_cosine
	      value: 0.8454
	      name: Spearman Correlation
	    - type: pearson_cosine
	      value: 0.8504
	      name: Pearson Correlation
	---

	# RadLITE-Encoder

	Radiology Late Interaction Transformer Enhanced - Bi-Encoder Component

	A domain-specialized sentence transformer for radiology and medical imaging content. This model encodes radiology text (reports, articles, educational content) into 768-dimensional dense vectors optimized for semantic search and retrieval.

	> Recommended: For best retrieval performance, use this encoder with [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker) in a two-stage pipeline. The bi-encoder retrieves candidates quickly, and the cross-encoder reranker then reorders them for precision. This combination achieves MRR 0.829 on the RadLIT-9 benchmark.

	## Model Description

	| Property | Value |
	|----------|-------|
	| Model Type | Sentence Transformer (Bi-Encoder) |
	| Base Model | [RadBERT-RoBERTa-4m](https://huggingface.co/zzxslp/RadBERT-RoBERTa-4m) |
	| Domain | Radiology / Medical Imaging |
	| Vector Dimensions | 768 |
	| Max Sequence Length | 512 tokens |
	| Similarity Function | Cosine Similarity |
	| License | Apache 2.0 |

	### Why RadLITE-Encoder?

	Standard embedding models (BGE, E5, OpenAI) are trained on general web text and struggle with radiology-specific terminology:

	- Anatomical terms: "hepatic flexure", "foramen magnum", "costophrenic angle"
	- Imaging sequences: "T2 FLAIR", "DWI/ADC mismatch", "post-gadolinium"
	- Pathology descriptions: "ground-glass opacity", "cortical ribbon sign", "double duct sign"
	- Abbreviations: "HCC", "RCC", "NSCLC", "BI-RADS"

	RadLITE-Encoder builds on RadBERT (pre-trained on 4.42M radiology reports) and is fine-tuned on 6.7M query-document pairs to handle this specialized vocabulary.

	## Performance

	### RadLIT-9 Benchmark (Radiology Retrieval)

	| Model | MRR | nDCG@10 | Notes |
	|-------|-----|---------|-------|
	| RadLITE-Encoder | 0.829 | 0.863 | Full pipeline with reranker |
	| RadLITE-Encoder (standalone) | 0.78 | 0.81 | Bi-encoder only |
	| BGE-large-en-v1.5 | 0.72 | 0.76 | General-purpose |
	| RadBERT (baseline) | 0.45 | 0.52 | No retrieval training |
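
For readers unfamiliar with the metrics in this table, the following is a minimal sketch of MRR, nDCG@10, and Recall@10 under the standard binary-relevance definitions. The README does not state how RadLIT-9 scores were computed, so the function names and the toy data here are illustrative assumptions, not the benchmark's evaluation code.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# Toy data (hypothetical document ids): two queries with one relevant document each
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant hit at rank 2
    (["d2", "d9", "d4"], {"d2"}),  # first relevant hit at rank 1
]
mrr = mean_reciprocal_rank(runs)
print(f"MRR: {mrr:.3f}")
```

An MRR of 0.829 therefore means that, averaged over queries, the first relevant document sits close to the top of the ranking.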

	### Subspecialty Performance

	| Subspecialty | MRR | Notes |
	|--------------|-----|-------|
	| Physics/Nuclear Medicine | 0.936 | Excellent |
	| Pediatric Radiology | 0.931 | Excellent |
	| Thoracic Imaging | 0.913 | Excellent |
	| Cardiac Imaging | 0.862 | Good |
	| Neuroradiology | 0.860 | Good |
	| Gastrointestinal | 0.800 | Good |
	| Breast Imaging | 0.722 | Moderate |
	| Musculoskeletal | 0.695 | Moderate |
	| Genitourinary | 0.694 | Moderate |
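
As a quick arithmetic check on this table, the sketch below computes the unweighted (macro) average of the nine subspecialty MRR values. The README does not say how the overall figure is aggregated or whether these rows use the full pipeline, so treating the rows as equally weighted is an assumption made only for illustration.

```python
# MRR values copied from the subspecialty table
subspecialty_mrr = {
    "Physics/Nuclear Medicine": 0.936,
    "Pediatric Radiology": 0.931,
    "Thoracic Imaging": 0.913,
    "Cardiac Imaging": 0.862,
    "Neuroradiology": 0.860,
    "Gastrointestinal": 0.800,
    "Breast Imaging": 0.722,
    "Musculoskeletal": 0.695,
    "Genitourinary": 0.694,
}

# Unweighted mean: every subspecialty counts equally, regardless of query count
macro_mrr = sum(subspecialty_mrr.values()) / len(subspecialty_mrr)
weakest = min(subspecialty_mrr, key=subspecialty_mrr.get)

print(f"Macro-averaged MRR: {macro_mrr:.3f}")
print(f"Lowest-scoring subspecialty: {weakest}")
```

The macro average comes out slightly below the reported overall MRR of 0.829. One possible explanation is that the overall figure is averaged per query rather than per subspecialty, but the source does not confirm this.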

	## Quick Start

	### Installation

	```bash
	pip install "sentence-transformers>=2.2.0"
	```

	### Basic Usage

	```python
	from sentence_transformers import SentenceTransformer

	# Load the model
	model = SentenceTransformer("matulichpt/RadLITE-Encoder")

	# Encode radiology text
	documents = [
	    "Hepatocellular carcinoma typically shows arterial enhancement with washout on portal venous phase.",
	    "Ground-glass opacities in the bilateral lower lobes, concerning for viral pneumonia.",
	    "No acute intracranial abnormality. Age-appropriate cerebral volume loss.",
	]

	queries = [
	    "HCC imaging characteristics on CT",
	    "COVID-19 chest CT findings",
	]

	# Generate embeddings
	doc_embeddings = model.encode(documents, normalize_embeddings=True)
	query_embeddings = model.encode(queries, normalize_embeddings=True)

	# Compute similarities (one row per query, one column per document)
	similarities = query_embeddings @ doc_embeddings.T
	print(similarities)
	# Query 1 (HCC) should score highest with Document 1
	# Query 2 (COVID) should score highest with Document 2
	```

	### Semantic Search over Your Corpus

	```python
	from sentence_transformers import SentenceTransformer, util
	import torch

	# Load model
	model = SentenceTransformer("matulichpt/RadLITE-Encoder")

	# Your radiology corpus (articles, reports, educational content)
	corpus = [
	    {"id": "doc1", "text": "Pancoast tumor: apical lung mass with rib destruction..."},
	    {"id": "doc2", "text": "Hepatic hemangioma shows peripheral nodular enhancement..."},
	    {"id": "doc3", "text": "Acoustic neuroma appears as enhancing CP angle mass..."},
	    # ... your documents
	]

	# Pre-compute corpus embeddings (do this once, save for reuse)
	corpus_texts = [doc["text"] for doc in corpus]
	corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True, show_progress_bar=True)

	# Save embeddings for later
	torch.save(corpus_embeddings, "corpus_embeddings.pt")

	# Search function
	def search(query: str, top_k: int = 10):
	    query_embedding = model.encode(query, normalize_embeddings=True)
	    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
	    top_results = torch.topk(scores, k=min(top_k, len(corpus)))

	    results = []
	    for score, idx in zip(top_results.values, top_results.indices):
	        results.append({
	            "document": corpus[int(idx)],  # convert the tensor index to a plain int
	            "score": float(score)
	        })
	    return results

	# Example search
	results = search("superior sulcus tumor with Horner syndrome")
	for r in results[:3]:
	    print(f"Score: {r['score']:.3f} - {r['document']['text'][:100]}...")
	```

	### Integration with FAISS (Large-Scale)

	```python
	import faiss
	import numpy as np
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("matulichpt/RadLITE-Encoder")

	# Encode your corpus (corpus_texts is a list of strings, as in the previous example)
	corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True)
	corpus_embeddings = np.asarray(corpus_embeddings, dtype="float32")

	# Build FAISS index
	dimension = corpus_embeddings.shape[1]  # 768 for this model
	index = faiss.IndexFlatIP(dimension)  # Inner product = cosine for normalized vectors
	index.add(corpus_embeddings)

	# Save index
	faiss.write_index(index, "radiology_index.faiss")

	# Search
	def faiss_search(query: str, top_k: int = 10):
	    query_embedding = model.encode([query], normalize_embeddings=True)
	    query_embedding = np.asarray(query_embedding, dtype="float32")
	    # Cap top_k at the index size; FAISS pads missing results with index -1
	    scores, indices = index.search(query_embedding, min(top_k, index.ntotal))
	    return [(int(idx), float(score)) for idx, score in zip(indices[0], scores[0])]
	```

	## Best Practices

	### 1. Normalize Embeddings

	Always pass `normalize_embeddings=True` for retrieval tasks. Normalized vectors have unit length, so cosine similarity reduces to a plain dot product, which is what the `@` and `IndexFlatIP` examples above rely on.
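
The equivalence between cosine similarity and the dot product of normalized vectors can be checked on toy data. This is a minimal pure-Python sketch with made-up 3-dimensional vectors; the helper names are illustrative and not part of `sentence-transformers`.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for 768-dimensional embeddings
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))

# Both routes give the same value, so normalizing once at encoding time
# lets every later comparison use the cheaper dot product.
print(cos_ab, dot_normalized)
```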

	### 2. Chunk Long Documents

	The model has a 512 token limit. For long articles:

	```python
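	def chunk_text(text: str, max_length: int = 400, overlap: int = 50):
	    """Split text into overlapping chunks of at most max_length words."""
	    if not 0 <= overlap < max_length:
	        raise ValueError("overlap must be non-negative and smaller than max_length")
	    # Lengths are counted in whitespace-separated words, not tokens; lower
	    # max_length if chunks still exceed the 512-token limit after tokenization.
	    words = text.split()
	    chunks = []
	    for i in range(0, len(words), max_length - overlap):
	        chunks.append(" ".join(words[i:i + max_length]))
	        if i + max_length >= len(words):
	            break  # the remaining words are already covered by this chunk
	    return chunks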
	```
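
Once documents are chunked, a search returns chunk-level hits that usually need to be mapped back to their parent documents. The sketch below keeps the best chunk score per document; this max-over-chunks rule is one common choice and an assumption here, not something the model card prescribes. The function name and the `(doc_id, score)` input format are hypothetical.

```python
def aggregate_chunk_scores(chunk_hits, top_k=10):
    """Collapse (doc_id, chunk_score) hits into one score per document.

    Each document is scored by its best-matching chunk, then documents
    are returned in descending score order.
    """
    best = {}
    for doc_id, score in chunk_hits:
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Toy chunk-level results: doc1 appears twice because two of its chunks matched
hits = [("doc1", 0.62), ("doc2", 0.71), ("doc1", 0.80), ("doc3", 0.55)]
ranked_docs = aggregate_chunk_scores(hits, top_k=2)
print(ranked_docs)
```

Taking the maximum favors documents with one highly relevant passage; averaging chunk scores instead would favor documents that are relevant throughout.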

	### 3. Batch Processing

	For large corpora, use batching:

	```python
	embeddings = model.encode(
	    texts,
	    batch_size=32,
	    normalize_embeddings=True,
	    show_progress_bar=True
	)
	```

	### 4. GPU Acceleration

	```python
	model = SentenceTransformer("matulichpt/RadLITE-Encoder", device="cuda")
	```

	## Two-Stage Retrieval (Recommended)

	For best results, combine RadLITE-Encoder with the [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker):

	```python
	from sentence_transformers import SentenceTransformer, CrossEncoder

	# Stage 1: Fast bi-encoder retrieval
	encoder = SentenceTransformer("matulichpt/RadLITE-Encoder")
	# Stage 2: Precise cross-encoder reranking
	reranker = CrossEncoder("matulichpt/RadLITE-Reranker", max_length=512)

	def two_stage_search(query: str, corpus: list, top_k: int = 10):
	    # Stage 1: Get top candidates (fast)
	    query_emb = encoder.encode(query, normalize_embeddings=True)
	    # For a fixed corpus, compute these embeddings once and reuse them across queries
	    corpus_embs = encoder.encode(corpus, normalize_embeddings=True)
	    scores = query_emb @ corpus_embs.T
	    top_indices = scores.argsort()[-50:][::-1]  # Top 50 candidates

	    # Stage 2: Rerank with cross-encoder (precise)
	    candidates = [corpus[i] for i in top_indices]
	    pairs = [[query, doc] for doc in candidates]
	    rerank_scores = reranker.predict(pairs)

	    # Apply temperature calibration (recommended: 1.5)
	    rerank_scores = rerank_scores / 1.5

	    # Sort by reranked scores
	    reranked = sorted(zip(top_indices, rerank_scores), key=lambda x: x[1], reverse=True)
	    return reranked[:top_k]
	```
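
To illustrate what the temperature step does, here is a small sketch that assumes the reranker scores are raw logits passed through a sigmoid. Whether `CrossEncoder.predict` returns logits or already-activated scores depends on the model configuration, so verify this before relying on it; the `calibrate` helper and the toy logits are hypothetical.

```python
import math

def calibrate(logits, temperature=1.5):
    """Divide each logit by the temperature, then squash to (0, 1) with a sigmoid."""
    return [1.0 / (1.0 + math.exp(-z / temperature)) for z in logits]

def ranking(scores):
    """Indices of the scores sorted from highest to lowest."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy logits for three candidate documents
raw_logits = [4.0, 1.0, -2.0]

uncalibrated = calibrate(raw_logits, temperature=1.0)
calibrated = calibrate(raw_logits, temperature=1.5)

# Dividing by a positive temperature never changes the order of candidates;
# it only pulls the resulting probabilities toward 0.5.
print(ranking(uncalibrated) == ranking(calibrated))
print([round(p, 3) for p in uncalibrated])
print([round(p, 3) for p in calibrated])
```

Under this assumption, calibration matters when scores are thresholded or combined with other signals, not when candidates are simply sorted.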

	## Architecture

	```
	Input Text
	    |
	    v
	[RadBERT Tokenizer] --> tokens (max 512)
	    |
	    v
	[RoBERTa Encoder] --> 12 layers, 768 hidden
	    |
	    v
	[Mean Pooling] --> aggregate token embeddings
	    |
	    v
	768-dim embedding vector
	```
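
The mean pooling step in the diagram can be sketched in a few lines. This pure-Python version uses toy 2-dimensional token vectors and an attention mask to skip padding; the real model performs the same averaging on 768-dimensional tensors inside the library, and the `mean_pool` name is only illustrative.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [value / count for value in total]

# Two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)
```

Masking matters because padded positions would otherwise drag the sentence vector toward whatever the encoder emits for padding.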

	## Training Details

	- Base Model: RadBERT-RoBERTa-4m (pre-trained on 4.42M VA radiology reports)
	- Fine-tuning: Contrastive learning on radiology education corpus
	- Training Samples: 6.7M query-document pairs
	- Loss Function: Multiple Negatives Ranking Loss
	- Epochs: 2 (8,400 steps)
	- Final Spearman: 0.8454
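
For intuition about the loss named above, here is a minimal sketch of Multiple Negatives Ranking Loss on toy vectors: for each query, its paired document is the positive and every other document in the batch serves as a negative, scored with cross-entropy over scaled cosine similarities. The `scale=20.0` value is a commonly used setting and an assumption here; the README does not state the training hyperparameters.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc_vecs[i] is the positive for query_vecs[i]."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two query-document pairs
queries = [[1.0, 0.0], [0.0, 1.0]]
docs_aligned = [[0.9, 0.1], [0.2, 0.8]]   # each query is closest to its own document
docs_swapped = [[0.2, 0.8], [0.9, 0.1]]   # each query is closest to the wrong document

loss_aligned = multiple_negatives_ranking_loss(queries, docs_aligned)
loss_swapped = multiple_negatives_ranking_loss(queries, docs_swapped)
print(loss_aligned < loss_swapped)
```

Because negatives come from the rest of the batch, larger batches give each query more negatives to discriminate against.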

	## Limitations

	- English only: Trained on English radiology text
	- Domain-specific: May underperform on non-radiology medical content
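	- Subspecialty variance: Genitourinary, musculoskeletal, and breast imaging content (MRR 0.69 to 0.72) scores lower than physics/nuclear medicine, pediatric, and thoracic content (MRR above 0.91)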
	- 512 token limit: Long documents require chunking

	## Citation

	If you use RadLITE in your work, please cite both RadLITE and the underlying RadBERT model:

	```bibtex
	@software{radlite_2026,
	title = {RadLITE: Calibrated Multi-Stage Retrieval for Radiology Education},
	author = {Grai Team},
	year = {2026},
	month = {January},
	url = {https://huggingface.co/matulichpt/RadLITE-Encoder},
	note = {MRR 0.829 on RadLIT-9 benchmark}
	}

	@article{yan2022radbert,
	title = {RadBERT: Adapting Transformer-based Language Models to Radiology},
	author = {Yan, An and McAuley, Julian and Lu, Xing and Du, Jiang and Chang, Eric Y and Gentili, Amilcare and Hsu, Chun-Nan},
	journal = {Radiology: Artificial Intelligence},
	volume = {4},
	number = {4},
	pages = {e210258},
	year = {2022},
	publisher = {Radiological Society of North America},
	doi = {10.1148/ryai.210258}
	}
	```

	## Related Models

	- [RadLITE-Reranker](https://huggingface.co/matulichpt/RadLITE-Reranker) - Cross-encoder for reranking
	- [RadBERT-RoBERTa-4m](https://huggingface.co/zzxslp/RadBERT-RoBERTa-4m) - Base model

	## License

	Apache 2.0 - Free for commercial and research use.