---
license: apache-2.0
tags:
- medical
- neurology
- neurosurgery
- search
- rag
- query-rewriting
datasets:
- miriad/miriad-4.4M
base_model:
- google/flan-t5-small
language: en
---

# NeuroRewriter: Neurology & Neurosurgery Query Optimizer

## 🩺 Model Description
**NeuroRewriter** is a fine-tuned version of `google/flan-t5-small` specialized for the medical domains of **Neurology** and **Neurosurgery**.

Its primary function is to act as a **Query Rewriter** in RAG (Retrieval-Augmented Generation) pipelines. It transforms verbose, natural-language user questions into concise, keyword-rich search strings. This "denoising" process strips away conversational fluff to focus on high-value medical entities (symptoms, anatomy, drug names, procedures).

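The "denoising" idea can be pictured with a toy stopword filter. This is a conceptual sketch only: the actual model learns the mapping from data rather than applying a hand-written word list.

```python
# Toy illustration of query "denoising": drop conversational filler,
# keep high-value terms. NeuroRewriter learns this behavior; this
# hand-rolled filter only sketches the concept.
STOP_WORDS = {"what", "are", "the", "a", "an", "of", "for", "is",
              "after", "common", "to", "my", "i", "me", "please"}

def denoise(query: str) -> str:
    tokens = query.lower().rstrip("?!.").split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(denoise("What are the common complications after a craniotomy?"))
# -> complications craniotomy
```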
## 🚀 Intended Use & Best Practices

### 1. RAG Pipeline Integration
This model is designed to sit between the user and your vector database or search engine.
* **Input:** "What are the common complications after a craniotomy?"
* **Output:** "craniotomy complications post-op"

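Schematically, integration amounts to inserting one rewrite call before retrieval. A minimal sketch, with both the model call and the search backend stubbed out (the lambdas below are hypothetical placeholders, not real APIs):

```python
# Where the rewriter sits in a RAG pipeline: user question -> keyword
# string -> sparse retriever. Both components are stubbed below.
def retrieve(user_question, rewrite, search):
    keywords = rewrite(user_question)  # e.g. a call to NeuroRewriter
    return search(keywords)            # e.g. a BM25 / Elasticsearch query

docs = retrieve(
    "What are the common complications after a craniotomy?",
    rewrite=lambda q: "craniotomy complications post-op",  # stand-in for the model
    search=lambda kw: [f"doc matched by: {kw}"],           # stand-in for the index
)
print(docs)  # -> ['doc matched by: craniotomy complications post-op']
```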
### 2. Retrieval Strategy (Important)
This model is optimized for **keyword-based (sparse) retrieval** methods such as:
* **BM25**
* **TF-IDF**
* **SPLADE**
* **Elasticsearch / OpenSearch**

> **Note:** Because this model removes grammatical connectors ("stop words") to boost keyword density, it is **less effective** for pure dense-vector retrieval (e.g., OpenAI embeddings), which often relies on full-sentence context. For best results, use a hybrid approach or pure BM25.

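The effect on sparse scoring can be seen with a toy term-overlap measure, a crude stand-in for BM25 (the document and numbers below are purely illustrative):

```python
# Keyword-dense queries match term-indexed documents more sharply than
# verbose ones. Simple overlap ratio as a stand-in for BM25 scoring.
doc_terms = set("craniotomy post-op complications infection csf leak".split())

def overlap_ratio(query: str) -> float:
    terms = [t.strip("?").lower() for t in query.split()]
    return sum(t in doc_terms for t in terms) / len(terms)

verbose = "What are the common complications after a craniotomy"
rewritten = "craniotomy complications post-op"
print(overlap_ratio(verbose), overlap_ratio(rewritten))  # -> 0.25 1.0
```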
## ⚠️ Limitations & Medical Disclaimer
**NOT FOR CLINICAL DIAGNOSIS.**
This model is intended for **informational retrieval purposes only**.
* It is not a doctor and should not be used to make medical decisions.
* While it improves search relevance, it may occasionally generate keywords that slightly alter the medical intent (e.g., confusing "acute" vs. "chronic" contexts).
* Always verify results against trusted medical sources.

## 📚 Training Data
This model was fine-tuned on a curated subset of the **MIRIAD** dataset (MIRIAD: A Large-Scale Dataset for Medical Information Retrieval and Answer Discovery).
* **License:** ODC-By 1.0
* **Attribution:** Zheng et al. (2025)

## 💻 How to Use
```python
# pip install transformers torch

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model and tokenizer
model_name = "HugSena13/neroRewriter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 2. Prepare the input (include the prefix used in training!)
input_text = "extract search keywords: What are the treatment options for glioblastoma multiforme?"
inputs = tokenizer(input_text, return_tensors="pt")

# 3. Generate (raise max_new_tokens if the output is cut off)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True,
)

# 4. Decode the keyword string
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```