---
license: apache-2.0
tags:
- medical
- neurology
- neurosurgery
- search
- rag
- query-rewriting
datasets:
- miriad/miriad-4.4M
base_model:
- google/flan-t5-small
language: en
---

# NeuroRewriter: Neurology & Neurosurgery Query Optimizer

## 🩺 Model Description

**NeuroRewriter** is a fine-tuned version of `google/flan-t5-small` specialized for the medical domains of **Neurology** and **Neurosurgery**.

Its primary function is to act as a **Query Rewriter** in RAG (Retrieval-Augmented Generation) pipelines. It transforms verbose, natural-language user questions into concise, keyword-rich search strings. This "denoising" step strips away conversational fluff and focuses on high-value medical entities (symptoms, anatomy, drug names, procedures).

## 🚀 Intended Use & Best Practices

### 1. RAG Pipeline Integration

This model is designed to sit between the user and your vector database or search engine.

* **Input:** "What are the common complications after a craniotomy?"
* **Output:** "craniotomy complications post-op"
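
The rewrite step sits directly in front of the retriever. A minimal, runnable sketch of that flow, with `rewrite_query` stubbed in place of the actual model call (see the How to Use section below) and a toy lookup standing in for the search engine — all names here are illustrative:

```python
# Sketch of the rewriter's position in a RAG pipeline (names hypothetical).

def rewrite_query(question: str) -> str:
    # Stub: in a real pipeline this calls NeuroRewriter (see "How to Use").
    return "craniotomy complications post-op"

def retrieve(question: str, search_fn):
    # Rewrite first, then hand the keyword string to the sparse retriever.
    keywords = rewrite_query(question)
    return search_fn(keywords)

# Toy "search engine": a dict keyed by keyword strings.
corpus = {"craniotomy complications post-op": "doc-17: post-operative craniotomy risks"}
hit = retrieve("What are the common complications after a craniotomy?",
               lambda q: corpus.get(q, "no match"))
print(hit)  # doc-17: post-operative craniotomy risks
```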

### 2. Retrieval Strategy (Important)

This model is optimized for **keyword-based (sparse) retrieval** methods such as:

* **BM25**
* **TF-IDF**
* **SPLADE**
* **Elasticsearch / OpenSearch**

> **Note:** Because this model removes grammatical connectors ("stop words") to boost keyword density, it is **less effective** for pure dense vector retrieval (e.g., OpenAI embeddings), which often relies on full-sentence context. For best results, use a hybrid approach or pure BM25.
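
The effect is easy to see with a toy term-overlap scorer — a crude stand-in for BM25; the scorer and example strings below are illustrative, not part of the model:

```python
# Toy term-overlap scorer illustrating why the keyword-dense rewrite
# matches a medical document better than the verbose original question.

def overlap_score(query: str, doc: str) -> float:
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

doc = "craniotomy post-op complications include infection and hematoma"
verbose = "What are the common complications after a craniotomy?"
rewritten = "craniotomy complications post-op"

print(overlap_score(verbose, doc))    # 0.125 — stop words and punctuation dilute the match
print(overlap_score(rewritten, doc))  # 1.0 — every term is a content keyword
```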

## ⚠️ Limitations & Medical Disclaimer

**NOT FOR CLINICAL DIAGNOSIS.**

This model is intended for **informational retrieval purposes only**.

* It is not a doctor and should not be used to make medical decisions.
* While it improves search relevance, it may occasionally generate keywords that slightly alter the medical intent (e.g., confusing "acute" vs. "chronic" contexts).
* Always verify results against trusted medical sources.

## 📚 Training Data

This model was fine-tuned on a curated subset of the **MIRIAD dataset** (MIRIAD: A Large-Scale Dataset for Medical Information Retrieval and Answer Discovery).

* **License:** ODC-By 1.0
* **Attribution:** Zheng et al. (2025)

## 💻 How to Use

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model
model_name = "HugSena13/neroRewriter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 2. Prepare the input (include the prefix used in training!)
input_text = "extract search keywords: What are the treatment options for glioblastoma multiforme?"
inputs = tokenizer(input_text, return_tensors="pt")

# 3. Generate (adjust max_new_tokens if output is cut off)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True,
)

# 4. Decode the keyword string
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```