Initial model upload with benchmarks

Files changed (12)

1_Pooling/config.json +10 -0
README.md +242 -0
config.json +27 -0
config_sentence_transformers.json +14 -0
merges.txt +0 -0
model.safetensors +3 -0
modules.json +14 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +51 -0
tokenizer.json +0 -0
tokenizer_config.json +65 -0
vocab.json +0 -0

1_Pooling/config.json ADDED

	@@ -0,0 +1,10 @@

+{
+    "word_embedding_dimension": 768,
+    "pooling_mode_cls_token": false,
+    "pooling_mode_mean_tokens": true,
+    "pooling_mode_max_tokens": false,
+    "pooling_mode_mean_sqrt_len_tokens": false,
+    "pooling_mode_weightedmean_tokens": false,
+    "pooling_mode_lasttoken": false,
+    "include_prompt": true
+}
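
The pooling config above enables only `pooling_mode_mean_tokens`, so the sentence embedding is the average of the token embeddings. As an illustration only, here is a dependency-free sketch of masked mean pooling. The function name `mean_pool` and the toy vectors are hypothetical, and the real sentence-transformers `Pooling` module operates on batched tensors rather than Python lists.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in total]


# Toy example: two real tokens and one padding token (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

The padding vector does not influence the result because its mask entry is 0. For this model each token vector would have 768 entries (`word_embedding_dimension`).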

README.md ADDED

	@@ -0,0 +1,242 @@

+---
+language:
+- en
+license: apache-2.0
+library_name: sentence-transformers
+tags:
+- sentence-transformers
+- feature-extraction
+- sentence-similarity
+- radiology
+- medical
+- retrieval
+- embedding
+datasets:
+- custom
+metrics:
+- mrr
+- recall
+pipeline_tag: sentence-similarity
+model-index:
+- name: radlit-biencoder
+  results:
+  - task:
+      type: retrieval
+      name: Radiology Document Retrieval
+    dataset:
+      type: custom
+      name: RadLIT-9
+      config: radlit9-v1.1-balanced
+    metrics:
+    - type: mrr
+      value: 0.829
+      name: MRR
+    - type: recall@10
+      value: 0.971
+      name: Recall@10
+    - type: ndcg@10
+      value: 0.863
+      name: nDCG@10
+---
+# RadLIT-BiEncoder: Radiology Late Interaction Transformer
+A domain-specialized bi-encoder model for radiology document retrieval, trained to understand medical imaging terminology, clinical reasoning patterns, and radiology-specific queries.
+## Model Description
+RadLIT-BiEncoder is the first stage of the RadLITE retrieval pipeline. It generates dense embeddings optimized for radiology content retrieval and, on the RadLIT-9 benchmark, outperforms a general-purpose bi-encoder baseline (MRR 0.829 vs. 0.703).
+### Architecture
+- **Base Model**: RoBERTa-base architecture
+- **Hidden Size**: 768
+- **Layers**: 12
+- **Attention Heads**: 12
+- **Parameters**: ~125M
+- **Max Sequence Length**: 512 tokens
+- **Embedding Dimension**: 768
+### Training
+The model was trained using contrastive learning with hard negative mining on a large corpus of radiology educational content. Training details:
+- **Training Objective**: Multiple Negatives Ranking Loss with hard negatives
+- **Batch Size**: 32
+- **Learning Rate**: 2e-5 with warmup
+- **Training Epochs**: 4
+- **Hard Negatives**: Mined from top-k retrieval failures
+**Note**: Training data consisted of radiology educational materials. Specific sources are not disclosed due to variable licensing, but the model is released under Apache 2.0 for research and commercial use.
+## Performance
+### RadLIT-9 Benchmark
+RadLIT-9 is a radiology retrieval benchmark covering 9 subspecialties. Overall results:
+| Metric | Score |
+|--------|-------|
+| **MRR** | 0.829 |
+| **nDCG@10** | 0.863 |
+| **Recall@10** | 97.1% |
+| **Recall@5** | 93.8% |
+| **Recall@1** | 74.3% |
+### Subspecialty Performance
+| Subspecialty | MRR | Recall@10 |
+|--------------|-----|-----------|
+| Physics/Nuclear | 0.936 | 100% |
+| Pediatric | 0.931 | 100% |
+| Thoracic | 0.913 | 98% |
+| Cardiac | 0.862 | 98% |
+| Neuroradiology | 0.860 | 98% |
+| Gastrointestinal | 0.800 | 96% |
+| Breast | 0.722 | 93% |
+| Musculoskeletal | 0.695 | 89% |
+| Genitourinary | 0.694 | 100% |
+### Comparison with Baselines
+| Model | MRR | Relative MRR change vs RadLIT |
+|-------|-----|-------------------------------|
+| **RadLIT-BiEncoder** | **0.829** | -- |
+| ColBERT-v2 | 0.750 | -9.5% |
+| General bi-encoder | 0.703 | -15.2% |
+| BM25 | ~0.55 | ~-34% |
+## Usage
+### Installation
+```bash
+pip install sentence-transformers
+```
+### Basic Usage
+```python
+from sentence_transformers import SentenceTransformer
+# Load model
+model = SentenceTransformer('matulichpt/radlit-biencoder')
+# Encode queries and documents
+queries = [
+    "What are the imaging features of hepatocellular carcinoma on MRI?",
+    "How do you differentiate glioblastoma from metastasis?"
+]
+documents = [
+    "HCC typically shows arterial enhancement with washout on portal venous phase...",
+    "GBM and metastases can be differentiated by their location and multiplicity..."
+]
+query_embeddings = model.encode(queries, convert_to_tensor=True)
+doc_embeddings = model.encode(documents, convert_to_tensor=True)
+# Compute similarity
+from sentence_transformers.util import cos_sim
+similarities = cos_sim(query_embeddings, doc_embeddings)
+print(similarities)
+```
+### For Retrieval Pipeline
+```python
+from sentence_transformers import SentenceTransformer, util
+import torch
+model = SentenceTransformer('matulichpt/radlit-biencoder')
+# Pre-encode your document corpus (replace the placeholders with your own documents)
+corpus = ["document 1...", "document 2..."]
+corpus_embeddings = model.encode(corpus, convert_to_tensor=True, show_progress_bar=True)
+# At query time
+query = "What are the CT findings in pulmonary embolism?"
+query_embedding = model.encode(query, convert_to_tensor=True)
+# Find top-k similar documents (k cannot exceed the corpus size)
+cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
+top_results = torch.topk(cos_scores, k=min(10, len(corpus)))
+for score, idx in zip(top_results[0], top_results[1]):
+    print(f"Score: {score:.4f} - {corpus[idx][:100]}...")
+```
+## Recommended: Full RadLITE Pipeline
+For best results, use RadLIT-BiEncoder as the first stage followed by RadLIT-CrossEncoder for reranking:
+```python
+from sentence_transformers import SentenceTransformer, CrossEncoder, util
+# Stage 1: Bi-encoder retrieval
+biencoder = SentenceTransformer('matulichpt/radlit-biencoder')
+# Stage 2: Cross-encoder reranking
+crossencoder = CrossEncoder('matulichpt/radlit-crossencoder')
+# Retrieve candidates (`corpus` is your list of document strings)
+query = "What are the MRI findings in anterior cruciate ligament tear?"
+hits = util.semantic_search(biencoder.encode(query, convert_to_tensor=True), biencoder.encode(corpus, convert_to_tensor=True), top_k=50)[0]
+candidates = [corpus[hit["corpus_id"]] for hit in hits]
+# Rerank with cross-encoder
+pairs = [[query, doc] for doc in candidates]
+scores = crossencoder.predict(pairs)
+# Apply temperature calibration (recommended: T=1.5)
+calibrated_scores = scores / 1.5
+reranked = sorted(zip(candidates, calibrated_scores), key=lambda x: x[1], reverse=True)  # highest score first
+```
+## Intended Use
+### Primary Use Cases
+- Radiology educational content retrieval
+- Medical imaging literature search
+- Clinical decision support (retrieval component)
+- Radiology question-answering systems
+### Out-of-Scope Uses
+- General web search
+- Non-medical document retrieval
+- Clinical diagnosis (this is a retrieval model, not a diagnostic tool)
+## Limitations
+1. **Domain Specificity**: Optimized for radiology; may underperform on general medical or non-medical content
+2. **Language**: English only
+3. **Subspecialty Variance**: Performance varies by subspecialty (0.69-0.94 MRR range)
+4. **Not a Diagnostic Tool**: This model retrieves relevant documents; it does not provide medical diagnoses
+## Ethical Considerations
+- This model should not be used as a sole source for clinical decision-making
+- Retrieved documents should be reviewed by qualified medical professionals
+- The model may reflect biases present in radiology educational literature
+## Citation
+```bibtex
+@software{radlit_biencoder_2026,
+  title = {RadLIT-BiEncoder: Domain-Specialized Embeddings for Radiology Retrieval},
+  author = {Grai Team},
+  year = {2026},
+  url = {https://huggingface.co/matulichpt/radlit-biencoder},
+  note = {MRR 0.829 on RadLIT-9 benchmark}
+}
+```
+## License
+Apache 2.0 - Free for research and commercial use.
+## Contact
+For questions or collaboration: Open an issue on the model repository
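
The README reports MRR, Recall@k, and nDCG@10 without saying how they were computed. The sketch below shows one common way to obtain such numbers. It assumes each query has exactly one relevant document, which the README does not confirm. The function names and the toy `ranks` list are hypothetical. `relative_change` reproduces the percentage column of the baseline comparison from the published MRR values.

```python
import math


def mrr(ranks):
    """Mean reciprocal rank; ranks are 1-based, None means not retrieved."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose relevant document appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def ndcg_at_k(ranks, k):
    """nDCG with one relevant document per query (the ideal DCG is 1)."""
    gains = (1.0 / math.log2(1 + r) for r in ranks if r is not None and r <= k)
    return sum(gains) / len(ranks)


def relative_change(baseline, reference):
    """Relative difference of a baseline score to a reference score, in percent."""
    return (baseline - reference) / reference * 100


# Toy data: rank of the relevant document for five hypothetical queries.
ranks = [1, 2, 1, None, 4]
print(f"MRR={mrr(ranks):.3f}  Recall@1={recall_at_k(ranks, 1):.2f}  "
      f"Recall@10={recall_at_k(ranks, 10):.2f}  nDCG@10={ndcg_at_k(ranks, 10):.3f}")
print(f"ColBERT-v2 vs RadLIT: {relative_change(0.750, 0.829):.1f}%")
```

If RadLIT-9 allows several relevant documents per query, Recall@k and nDCG would need the multi-label definitions instead.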

config.json ADDED

	@@ -0,0 +1,27 @@

+{
+  "architectures": [
+    "RobertaModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "dtype": "float32",
+  "eos_token_id": 2,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.56.2",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 50265
+}
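
As a sanity check, the README's "~125M" parameter figure can be derived from the values in this config. The sketch below counts the weights and biases of a standard RoBERTa encoder. It assumes the checkpoint stores the pooler layer and uses float32 (4 bytes per value, per `"dtype": "float32"`). The function name `count_parameters` is hypothetical.

```python
config = {
    "vocab_size": 50265,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_hidden_layers": 12,
    "max_position_embeddings": 514,
    "type_vocab_size": 1,
}


def count_parameters(cfg, with_pooler=True):
    """Count weights and biases of a RoBERTa-style encoder from its config."""
    h, ff = cfg["hidden_size"], cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and bias
    embeddings = (cfg["vocab_size"] + cfg["max_position_embeddings"]
                  + cfg["type_vocab_size"]) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # query, key, value, output
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    pooler = h * h + h if with_pooler else 0  # assumption: pooler is stored
    return embeddings + encoder + pooler


total = count_parameters(config)
print(f"{total:,} parameters, {total * 4:,} bytes as float32")
```

Under these assumptions the count is 124,645,632 parameters, or 498,582,528 bytes, which is 22,376 bytes less than the `size 498604904` recorded for `model.safetensors`. The remainder is presumably the safetensors header.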

config_sentence_transformers.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "model_type": "SentenceTransformer",
+  "__version__": {
+    "sentence_transformers": "5.1.1",
+    "transformers": "4.56.2",
+    "pytorch": "2.10.0.dev20251011+cu128"
+  },
+  "prompts": {
+    "query": "",
+    "document": ""
+  },
+  "default_prompt_name": null,
+  "similarity_fn_name": "cosine"
+}

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f1e5e54f4a42b7e4a337b631bf88c517650f8e9cbb569b56f8f9c92b83b43e8a
+size 498604904

modules.json ADDED

	@@ -0,0 +1,14 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]

sentence_bert_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+    "max_seq_length": 512,
+    "do_lower_case": false
+}
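
The `max_seq_length` of 512 here can look inconsistent with `max_position_embeddings: 514` in `config.json`. Assuming the usual Hugging Face RoBERTa convention, in which position ids start at `pad_token_id + 1`, the two values agree, as this small calculation shows (variable names are illustrative):

```python
max_position_embeddings = 514  # from config.json
pad_token_id = 1               # from config.json

# RoBERTa numbers real tokens starting at pad_token_id + 1, so the
# position ids up to and including pad_token_id are not used for them.
first_position_id = pad_token_id + 1
usable_positions = max_position_embeddings - first_position_id
print(usable_positions)
```

The 512 positions include the `<s>` and `</s>` special tokens, so roughly 510 content tokens fit before truncation.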

special_tokens_map.json ADDED

	@@ -0,0 +1,51 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,65 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50264": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "max_length": 512,
+  "model_max_length": 512,
+  "pad_to_multiple_of": null,
+  "pad_token": "<pad>",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "</s>",
+  "stride": 0,
+  "tokenizer_class": "RobertaTokenizer",
+  "trim_offsets": true,
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "<unk>"
+}

vocab.json ADDED

The diff for this file is too large to render. See raw diff