matthewlewis06 committed on
Commit b69b364 · Parent(s): c217741

First commit

Files changed (8):
  1. .gitattributes +1 -0
  2. Dockerfile +2 -1
  3. README.md +169 -10
  4. requirements.txt +5 -1
  5. src/config.py +36 -0
  6. src/query_rag.py +309 -0
  7. src/search_engine.py +46 -0
  8. src/streamlit_app.py +242 -38
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.db filter=lfs diff=lfs merge=lfs -text
Dockerfile CHANGED
@@ -5,8 +5,9 @@ WORKDIR /app
  RUN apt-get update && apt-get install -y \
      build-essential \
      curl \
+     software-properties-common \
      git \
-     && rm -rf /var/lib/apt/lists/*
+     && rm -rf /var/lib/apt/lists/*

  COPY requirements.txt ./
  COPY src/ ./src/
README.md CHANGED
@@ -1,20 +1,179 @@
  ---
- title: NHS CHAT
- emoji: 🚀
- colorFrom: red
- colorTo: red
+ title: NHS Clinical Assistant
+ emoji: 🩺
+ colorFrom: blue
+ colorTo: green
  sdk: docker
  app_port: 8501
  tags:
  - streamlit
+ - healthcare
+ - nhs
+ - rag
+ - llm
+ - medical
  pinned: false
- short_description: Chabot for querying NHS medical information.
- license: agpl-3.0
+ short_description: RAG-powered NHS health information chatbot
  ---

- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
+ # NHS Clinical Assistant
+
+ A RAG-based chatbot for querying NHS health condition information. This application uses Retrieval-Augmented Generation to provide accurate, evidence-based responses from official NHS health documentation.
+
+ ## 🌟 Features
+
+ - **NHS Health Information Search**: Search through NHS health conditions using semantic search powered by Voyage AI embeddings
+ - **RAG-powered Chat**: Ask questions and get contextually relevant answers from NHS health information with source citations
+ - **Multiple LLM Support**: Choose between Gemini models (2.5-flash, 2.5-flash-lite, 2.5-pro) for generating responses
+ - **Source Attribution**: All responses include links to original NHS web pages
+ - **Streaming Responses**: Real-time response generation for a better user experience
+ - **Interactive Interface**: Clean Streamlit frontend optimized for healthcare information queries
+
+ ## 📁 Project Structure
+
+ ### Core Application Files
+
+ #### [`src/streamlit_app.py`](src/streamlit_app.py)
+ Main Streamlit application interface providing:
+ - User-friendly web interface for NHS health information queries
+ - Chat interface with conversation history
+ - Model selection (Gemini variants)
+ - Source attribution display with NHS links
+ - Suggested queries for common health topics
+
+ #### [`src/query_rag.py`](src/query_rag.py)
+ RAG (Retrieval-Augmented Generation) system that handles:
+ - Query processing and validation
+ - Integration with search engine and LLM clients
+ - Context generation from NHS health documents
+ - Streaming response generation
+ - Source extraction and formatting
+ - Can be used as a standalone CLI tool for testing
+
+ #### [`src/search_engine.py`](src/search_engine.py)
+ Search functionality using the Pinecone vector database:
+ - Similarity search using Voyage AI embeddings (voyage-context-3 model)
+ - Integration with Pinecone vector database
+ - NHS health information retrieval
+
+ ### Configuration
+
+ #### [`src/config.py`](src/config.py)
+ Centralized configuration management:
+ - NHS source configuration
+ - System prompts and error messages
+ - Default search parameters
+
+ ### Infrastructure
+
+ #### [`requirements.txt`](requirements.txt)
+ Python dependencies:
+ - `streamlit==1.40.1` - Web application framework
+ - `openai` - LLM client (used for Gemini API access)
+ - `voyageai` - Embedding generation
+ - `pinecone` - Vector database client
+ - `pandas` - Data manipulation
+ - `altair` - Visualization support
+
+ #### [`Dockerfile`](Dockerfile)
+ Container configuration for deployment:
+ - Python 3.9 base image
+ - Production-ready setup
+ - Health check configuration
+ - Streamlit server configuration
+
+ ## 🚀 Getting Started
+
+ ### Prerequisites
+ - Python 3.9+
+ - Gemini API key (for LLM responses)
+ - Voyage AI API key (for embeddings)
+ - Pinecone API key (for vector search)
+
+ ### Environment Variables
+ Set the following environment variables:
+ ```bash
+ export GEMINI_API_KEY=your_gemini_api_key
+ export VOYAGE_API_KEY=your_voyage_api_key
+ export PINECONE_API_KEY=your_pinecone_api_key
+ ```
+
+ ### Installation
+ 1. Clone the repository
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### Run the application
+ ```bash
+ streamlit run src/streamlit_app.py
+ ```
+
+ The application will be available at `http://localhost:8501`.
+
+ ### Docker Deployment
+ ```bash
+ docker build -t nhs-clinical-assistant .
+ docker run -p 8501:8501 \
+   -e GEMINI_API_KEY=your_gemini_api_key \
+   -e VOYAGE_API_KEY=your_voyage_api_key \
+   -e PINECONE_API_KEY=your_pinecone_api_key \
+   nhs-clinical-assistant
+ ```
+
+ ## 🔧 Usage
+
+ ### Web Interface
+ 1. Open the application in your browser
+ 2. Select your preferred Gemini model from the sidebar
+ 3. Type your NHS health-related question in the chat input
+ 4. View the response with source attribution
+ 5. Click "View Sources" to see NHS page references
+
+ ### CLI Usage
+ Test the RAG system directly:
+ ```bash
+ python src/query_rag.py --query_text "What are the symptoms of ADHD in adults?" --llm_model "gemini-2.5-flash"
+ ```
+
+ ### Example Queries
+ - "What are the symptoms of ADHD in adults?"
+ - "How is type 2 diabetes diagnosed?"
+ - "What are the treatment options for depression?"
+
+ ## 🏗️ Architecture
+
+ The system uses a simple but effective RAG architecture:
+
+ 1. **Query Processing**: The user query is validated and processed
+ 2. **Vector Search**: The query is embedded using Voyage AI and searched against a Pinecone vector database containing NHS health information
+ 3. **Context Generation**: Retrieved NHS documents are formatted into context
+ 4. **LLM Response**: Gemini generates a response based strictly on the NHS context
+ 5. **Source Attribution**: Original NHS page links are provided with responses
+
+ ## 📊 Data Sources
+
+ The system is built on NHS health condition information, stored in a Pinecone vector database under the namespace `nhs_guidelines_voyage_3_large`. All responses include proper attribution to NHS sources with direct links to official NHS web pages.
+
+ ## ⚠️ Important Notes
+
+ - **Medical Disclaimer**: This tool provides information from NHS sources but should not replace professional medical advice
+ - **Data Accuracy**: Always consult official NHS sources for the most current information
+ - **Context Limitation**: The system only responds based on information available in the indexed NHS documents
+
+ ## 📄 License
+
+ This project is licensed under the **GNU Affero General Public License v3.0 (AGPL-3.0)**.
+
+ ### Code License
+ The source code of this application is released under AGPL-3.0, which means:
+ - You can freely use, modify, and distribute this software
+ - Any modifications or derivative works must also be released under AGPL-3.0
+ - If you run this software as a network service, you must provide the source code to users
+ - See the [LICENSE](LICENSE) file for full terms
+
+ ### NHS Data Usage
+ This tool utilizes NHS health information under the [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/). All NHS content remains subject to their original terms and conditions and is used for informational purposes in compliance with UK public sector information licensing.
+
+ **Note**: While the application code is AGPL-3.0 licensed, the NHS health information content accessed through this application remains under Crown Copyright and the Open Government Licence.
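The five-step architecture described in the README can be sketched end to end with toy stand-ins: a keyword-overlap retriever in place of Voyage/Pinecone and a canned formatter in place of Gemini. Everything below except the step names is illustrative, not part of the repository:

```python
# Toy corpus standing in for the Pinecone NHS index (hypothetical content).
CORPUS = [
    {"text": "ADHD in adults can cause inattention and restlessness.",
     "url": "https://www.nhs.uk/conditions/attention-deficit-hyperactivity-disorder-adhd/"},
    {"text": "Type 2 diabetes is diagnosed with blood tests such as HbA1c.",
     "url": "https://www.nhs.uk/conditions/type-2-diabetes/"},
]

def retrieve(query: str, k: int = 1) -> list:
    """Step 2 stand-in: keyword overlap instead of vector similarity search."""
    q = set(query.lower().split())
    scored = sorted(CORPUS,
                    key=lambda d: len(q & set(d["text"].lower().split())),
                    reverse=True)
    return scored[:k]

def build_context(docs: list) -> str:
    """Step 3: format retrieved documents into an LLM context block."""
    return "\n\n---\n\n".join(
        f"Context: {d['text']} Available at: {d['url']}" for d in docs
    )

def answer(query: str) -> tuple:
    docs = retrieve(query)            # steps 1-2: validate + search
    context = build_context(docs)     # step 3: context generation
    # Step 4 stand-in: a real system would send `context` to the LLM here.
    response = f"Based on NHS guidance: {docs[0]['text']}"
    sources = [d["url"] for d in docs]  # step 5: source attribution
    return response, sources

resp, srcs = answer("What are the symptoms of ADHD in adults?")
```

The actual commit replaces `retrieve` with Voyage embeddings plus a Pinecone query, and the canned `response` with a streamed Gemini completion.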
requirements.txt CHANGED
@@ -1,3 +1,7 @@
  altair
  pandas
- streamlit
+ streamlit==1.40.1
+ openai
+ pandas
+ voyageai
+ pinecone
src/config.py ADDED
@@ -0,0 +1,36 @@
+ import os
+ from enum import Enum
+ from typing import Dict, NamedTuple
+ from dataclasses import dataclass
+
+ class InfoSource(Enum):
+     NHS = "nhs"
+
+ @dataclass
+ class SourceConfig:
+     context_description: str
+     not_found_message: str
+
+ class Config:
+     """Configuration settings for the RAG system"""
+
+     # Default similarity search parameters
+     DEFAULT_SIMILARITY_K = 5
+
+     SOURCE_CONFIGS = {
+         InfoSource.NHS: SourceConfig(
+             context_description="NHS health conditions and medical information",
+             not_found_message="no relevant NHS health information is available to answer this question"
+         )
+     }
+
+     @classmethod
+     def get_source_config(cls, source: str) -> SourceConfig:
+         """Get configuration for a source"""
+         try:
+             source_enum = InfoSource(source.lower())
+             return cls.SOURCE_CONFIGS[source_enum]
+         except ValueError:
+             raise ValueError(f"Unknown source: {source}. Valid sources: {[s.value for s in InfoSource]}")
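The enum-keyed lookup in `src/config.py` is easy to exercise on its own. The snippet below inlines the same `InfoSource`/`SourceConfig`/`Config` pattern (trimmed to its essentials) so it runs stand-alone; it is a sketch, not the module itself:

```python
from enum import Enum
from dataclasses import dataclass

class InfoSource(Enum):
    NHS = "nhs"

@dataclass
class SourceConfig:
    context_description: str
    not_found_message: str

class Config:
    SOURCE_CONFIGS = {
        InfoSource.NHS: SourceConfig(
            context_description="NHS health conditions and medical information",
            not_found_message="no relevant NHS health information is available to answer this question",
        )
    }

    @classmethod
    def get_source_config(cls, source: str) -> SourceConfig:
        # Lookup is case-insensitive: "NHS", "nhs", and "Nhs" all resolve
        try:
            return cls.SOURCE_CONFIGS[InfoSource(source.lower())]
        except ValueError:
            raise ValueError(f"Unknown source: {source}")

cfg = Config.get_source_config("NHS")
print(cfg.context_description)  # NHS health conditions and medical information
```

Routing lookups through the `InfoSource` enum means adding a new corpus later is a two-line change: a new enum member and a new `SOURCE_CONFIGS` entry.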
src/query_rag.py ADDED
@@ -0,0 +1,309 @@
+ import os
+ import time
+ import argparse
+ import logging
+ import re
+ from typing import Dict, List, Optional, Generator, Tuple
+ from openai import OpenAI
+ from config import Config, InfoSource
+ from search_engine import SearchEngine
+ import voyageai
+
+ # Setup logging
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+ logger = logging.getLogger(__name__)
+
+ class RAGSystem:
+     """Main RAG system class"""
+
+     def __init__(self, shared_data=None):
+         self.config = Config()
+
+         # Initialize clients
+         gemini_api_key = os.getenv("GEMINI_API_KEY")
+         if gemini_api_key:
+             self.gemini_client = OpenAI(
+                 api_key=gemini_api_key,
+                 base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
+             )
+         else:
+             self.gemini_client = None
+
+         openai_api_key = os.getenv("OPENAI_API_KEY")
+         if openai_api_key:
+             self.openai_client = OpenAI(api_key=openai_api_key)
+         else:
+             self.openai_client = None
+
+         self.voyage_client = voyageai.Client(api_key=os.getenv("VOYAGE_API_KEY"))
+         self.search_engine = SearchEngine(self.voyage_client)
+
+     def _validate_inputs(self, query_text: str, similarity_k: int, info_source: str):
+         """Validate input parameters"""
+         if not query_text or not query_text.strip():
+             raise ValueError("Query text cannot be empty")
+
+         if similarity_k <= 0:
+             raise ValueError("similarity_k must be a positive integer")
+
+         try:
+             InfoSource(info_source.lower())
+         except ValueError:
+             valid_sources = [s.value for s in InfoSource]
+             raise ValueError(f"Invalid info_source '{info_source}'. Must be one of: {valid_sources}")
+
+     def _clean_section_id(self, section_id: str) -> str:
+         """Clean section ID for display - NHS format: condition__section__part"""
+         if not section_id or section_id == 'Unknown section':
+             return section_id
+
+         # Handle NHS format: "adhd-adults__Overview__Part_1"
+         if '__' in section_id:
+             parts = section_id.split('__')
+             if len(parts) >= 2:
+                 # Get condition and section, ignore part number
+                 condition = parts[0].replace('-', ' ').replace('_', ' ').title()
+                 section = parts[1].replace('_', ' ').title()
+                 return f"{condition} - {section}"
+
+         # Fallback: just clean up underscores and dashes
+         clean_section = section_id.replace('_', ' ').replace('-', ' ').title()
+         return clean_section
+
+     def _get_context_text(self, results: List[Dict]) -> str:
+         """Generate context text from search results"""
+         context_text_sections = []
+
+         for doc in results:
+             section_id = doc['metadata'].get('original_id', 'Unknown section')
+             url = doc['metadata'].get('url', '')
+             document_text = doc['metadata'].get('document', '')
+
+             # Clean up section_id for display
+             clean_section_id = self._clean_section_id(section_id)
+
+             # Create formatted section without showing URL explicitly
+             # The URL will be available in the document_text if it was part of the original content
+             formatted_section = (
+                 f"Source Information: [Section: {clean_section_id}]\n"
+                 f"Context: {document_text}"
+                 f"{f' Available at: {url}' if url else ''}"  # Include URL for LLM to use
+             )
+             context_text_sections.append(formatted_section)
+
+         return "\n\n---\n\n".join(context_text_sections)
+
+     def _create_system_prompt(self, context_text: str, context_description: str,
+                               not_found_message: str, query_text: str) -> List[Dict]:
+         """Create system prompt for LLM"""
+         return [
+             {
+                 "role": "system",
+                 "content": (
+                     f"You are a medical AI assistant tasked with answering clinical questions strictly based on the provided {context_description} context. Follow the requirements below to ensure accurate, consistent, and professional responses.\n\n"
+                     "# Response Rules\n\n"
+                     "1. **Context Restriction**:\n"
+                     "   - Only use information given in the provided NHS health information context.\n"
+                     "   - Do not generate or speculate with information not explicitly found in the given context.\n\n"
+                     "2. **Answer Format**:\n"
+                     "   - Provide a clear and concise response based solely on the context.\n"
+                     "   - When including a list, use standard markdown bullet points (`*` or `-`).\n"
+                     "   - If a list follows introductory text, insert a line break before the first bullet point.\n"
+                     "   - Each bullet point must be on its own line.\n\n"
+                     "3. **Preserve Tables**:\n"
+                     "   - If relevant markdown tables appear in the context, reproduce them in your answer.\n"
+                     "   - Maintain the original structure, formatting, and content of any included tables.\n\n"
+                     "4. **Links and URLs**:\n"
+                     "   - Include any URLs or web links from the context directly in your response when relevant.\n"
+                     "   - Integrate links naturally within sentences, using markdown syntax for clickable text links.\n"
+                     "   - DO NOT generate or invent any URLs not explicitly present in the context.\n\n"
+                     "5. **Markdown Link Formatting**:\n"
+                     "   - In responses, only the descriptive text in brackets should be visible and clickable (e.g., `[NHS ADHD information](https://www.nhs.uk/conditions/attention-deficit-hyperactivity-disorder-adhd/)`).\n"
+                     "   - Readers should never see raw URLs in the text.\n"
+                     "   - Use descriptive link text like 'NHS ADHD information' or 'NHS depression guide' rather than generic terms.\n\n"
+                     "6. **If No Relevant Information**:\n"
+                     "   - If the context contains no relevant information, state clearly:\n"
+                     f"     *\"{not_found_message}\"*\n\n"
+                     "# Output Format\n\n"
+                     "- All responses should be in plain text, using markdown formatting for lists and links as required.\n"
+                     "- Do not use code blocks.\n"
+                     "- Answers should be concise, accurate, and formatted according to the rules above.\n\n"
+                     "# Examples\n\n"
+                     "**Example 1: Integration of markdown link in context**\n"
+                     "Question: \"What are the symptoms of ADHD?\"\n"
+                     "Context snippet: ...see the NHS information on ADHD symptoms...\n"
+                     "Output:\n"
+                     "According to the [NHS ADHD information](https://www.nhs.uk/conditions/attention-deficit-hyperactivity-disorder-adhd/), symptoms include...\n\n"
+                     "**Example 2: Multiple condition references**\n"
+                     "According to NHS guidance:\n"
+                     "* Initial symptoms may include difficulty concentrating.\n"
+                     "* For detailed information, see the [NHS ADHD guide](https://www.nhs.uk/conditions/adhd/).\n\n"
+                     "**Example 3: No relevant context**\n"
+                     f"{not_found_message}\n\n"
+                     "# Notes\n\n"
+                     "- Never output information beyond what is provided in the supplied context.\n"
+                     "- Always use markdown for lists and links.\n"
+                     "- Make sure all markdown tables from context are preserved in your answer if relevant.\n"
+                     "- Present links only as clickable text, not as bare URLs.\n"
+                     "- Use descriptive link text that indicates the specific NHS condition or topic.\n\n"
+                     "**REMINDER:**\n"
+                     "Strictly adhere to all formatting and content rules above for every response."
+                 ),
+             },
+             {
+                 "role": "assistant",
+                 "content": (
+                     f"Here is the context from {context_description} that you should use to answer the following question:\n\n{context_text}\n\n"
+                 ),
+             },
+             {
+                 "role": "user",
+                 "content": query_text,
+             },
+         ]
+
+     def get_sources_from_results(self, results: List[Dict], info_source: str) -> List[Dict]:
+         """Extract formatted sources from search results"""
+         sources = []
+         for doc in results:
+             metadata = doc.get('metadata', {})
+             section_id = metadata.get('original_id', 'Unknown section')
+             source = metadata.get('source', 'Unknown')
+             url = metadata.get('url', '')
+
+             # Clean section ID for display
+             clean_section_id = self._clean_section_id(section_id)
+
+             source_info = {
+                 'metadata': {
+                     'source': source,
+                     'original_id': section_id,
+                     'url': url,
+                     'clean_section': clean_section_id
+                 }
+             }
+             sources.append(source_info)
+         return sources
+
+     def query_rag_stream(self, query_text: str, llm_model: str, similarity_k: int = 25, info_source: str = "NHS",
+                          filename_filter: Optional[str] = None) -> Generator[Tuple[str, List[Dict]], None, None]:
+         """Query RAG system with streaming response"""
+         try:
+             self._validate_inputs(query_text, similarity_k, info_source)
+             source_config = self.config.get_source_config(info_source)
+
+             # Namespace under which the NHS guideline embeddings were indexed
+             namespace = "nhs_guidelines_voyage_3_large"
+
+             # Get similar documents using only similarity search
+             results = self.search_engine.similarity_search(
+                 query_text,
+                 namespace=namespace,
+                 top_k=similarity_k
+             )
+
+             if not results:
+                 yield "I couldn't find any relevant information to answer your question.", []
+                 return
+
+             # Generate context and system prompt
+             context_text = self._get_context_text(results)
+             system_messages = self._create_system_prompt(
+                 context_text,
+                 source_config.context_description,
+                 source_config.not_found_message,
+                 query_text
+             )
+
+             # Get sources for response
+             sources_data = self.get_sources_from_results(results, info_source)
+
+             # Stream LLM response
+             yield from self._stream_llm_response(system_messages, query_text, llm_model, sources_data)
+
+         except Exception as e:
+             logger.error(f"Error in query_rag_stream: {e}")
+             yield f"An error occurred while processing your query: {str(e)}", []
+
+     def _stream_llm_response(self, system_messages: List[Dict], query_text: str,
+                              llm_model: str, sources_data: List[Dict]) -> Generator[Tuple[str, List[Dict]], None, None]:
+         """Stream LLM response"""
+         try:
+             if "gemini" in llm_model.lower() and self.gemini_client:
+                 stream = self.gemini_client.chat.completions.create(
+                     model=llm_model,
+                     messages=system_messages,
+                     temperature=0,
+                     stream=True
+                 )
+
+                 for chunk in stream:
+                     if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:
+                         content = chunk.choices[0].delta.content
+                         yield content, sources_data
+
+             else:
+                 error_msg = f"Unsupported LLM model or client not available: {llm_model}"
+                 logger.error(error_msg)
+                 yield error_msg, []
+                 return
+
+         except Exception as e:
+             logger.error(f"Error in LLM completion: {e}")
+             yield f"Error generating response: {str(e)}", []
+
+
+ def main():
+     """Main function for CLI usage"""
+     parser = argparse.ArgumentParser(description="RAG System Query Interface")
+     parser.add_argument("--query_text", type=str, default="What are the symptoms of ADHD in adults?",
+                         help="The query text.")
+     parser.add_argument("--llm_model", type=str, default="gemini-2.0-flash",
+                         help="The LLM model to use.")
+     parser.add_argument("--similarity_k", type=int, default=5,
+                         help="Number of results to retrieve in similarity search.")
+     parser.add_argument("--info_source", type=str, default="NHS",
+                         choices=["nhs", "NHS"],
+                         help="Information source to query.")
+
+     args = parser.parse_args()
+
+     try:
+         print("Initializing RAG system...")
+         rag_system = RAGSystem()
+
+         print(f"\n=== Query: {args.query_text} ===")
+         print(f"Source: {args.info_source}")
+         print(f"LLM Model: {args.llm_model}")
+         print("\n=== LLM Response ===\n")
+
+         response_text, sources_data = "", []
+
+         for chunk, sources in rag_system.query_rag_stream(
+             query_text=args.query_text,
+             llm_model=args.llm_model,
+             similarity_k=args.similarity_k,
+             info_source=args.info_source
+         ):
+             print(chunk, end="", flush=True)
+             response_text += chunk
+             sources_data = sources
+
+         print("\n\n=== Sources Data ===\n")
+         for i, source in enumerate(sources_data, 1):
+             metadata = source.get('metadata', {})
+             print(f"Source {i}:")
+             print(f"  Clean Section: {metadata.get('clean_section', 'Unknown')}")
+             print(f"  URL: {metadata.get('url', 'No URL')}")
+             print()
+
+     except Exception as e:
+         logger.error(f"Error in main: {e}")
+         print(f"Error: {e}")
+
+
+ if __name__ == "__main__":
+     main()
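The `_clean_section_id` transformation is worth sanity-checking in isolation, since it drives both the context labels and the source display. The snippet below lifts the same logic into a standalone function so it runs without the rest of the class:

```python
def clean_section_id(section_id: str) -> str:
    """Standalone copy of RAGSystem._clean_section_id, for illustration."""
    if not section_id or section_id == 'Unknown section':
        return section_id
    # NHS format: "condition__Section__Part_N" -> "Condition - Section"
    if '__' in section_id:
        parts = section_id.split('__')
        if len(parts) >= 2:
            condition = parts[0].replace('-', ' ').replace('_', ' ').title()
            section = parts[1].replace('_', ' ').title()
            return f"{condition} - {section}"
    # Fallback: just clean up underscores and dashes
    return section_id.replace('_', ' ').replace('-', ' ').title()

print(clean_section_id("adhd-adults__Overview__Part_1"))  # Adhd Adults - Overview
print(clean_section_id("type-2-diabetes"))                # Type 2 Diabetes
```

Note that `str.title()` capitalizes every word, so acronyms come out as "Adhd" rather than "ADHD"; a display-name override table would be needed if that matters.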
src/search_engine.py ADDED
@@ -0,0 +1,46 @@
+ import numpy as np
+ import pandas as pd
+ import voyageai
+ from typing import List, Dict, Tuple, Optional
+ from collections import defaultdict
+ import logging
+ import os
+ from pinecone import Pinecone
+
+ pinecone_api_key = os.getenv("PINECONE_API_KEY")
+
+ class SearchEngine:
+     """Handles similarity search"""
+
+     def __init__(self, voyage_client: voyageai.Client):
+         self.vo = voyage_client
+         self.logger = logging.getLogger(__name__)
+         self.pc = Pinecone(api_key=pinecone_api_key)
+         self.index = self.pc.Index("nhs-conditions")
+
+     def similarity_search(self, query_text: str, namespace: str, top_k: int = 25) -> List[dict]:
+         """Perform similarity search using Pinecone"""
+         try:
+             # Embed the query with the same model used to index the documents
+             query_embedding = self.vo.contextualized_embed(
+                 inputs=[[query_text]],
+                 model="voyage-context-3",
+                 input_type="query",
+                 output_dimension=2048
+             ).results[0].embeddings[0]
+
+             # Search Pinecone
+             results = self.index.query(
+                 vector=query_embedding,
+                 top_k=top_k,
+                 namespace=namespace,
+                 include_metadata=True
+             )
+
+             matches = results['matches']
+             self.logger.info(f"Pinecone search found {len(matches)} results")
+             return matches
+
+         except Exception as e:
+             self.logger.error(f"Error in Pinecone similarity search: {e}")
+             return []
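For development without Pinecone or Voyage credentials, the same top-k semantic search can be mimicked with an in-memory cosine-similarity index. This is a toy stand-in with hand-made 2-D vectors in place of real 2048-dimensional embeddings, returning results in roughly the Pinecone match shape:

```python
import numpy as np

class InMemorySearch:
    """Toy stand-in for SearchEngine.similarity_search: exact cosine top-k."""

    def __init__(self, embeddings: np.ndarray, metadata: list):
        # Normalize rows once so dot products equal cosine similarities
        self.embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.metadata = metadata

    def query(self, vector: np.ndarray, top_k: int = 5) -> list:
        v = vector / np.linalg.norm(vector)
        scores = self.embeddings @ v
        order = np.argsort(scores)[::-1][:top_k]
        # Mirror the Pinecone match shape: score plus metadata
        return [{"score": float(scores[i]), "metadata": self.metadata[i]} for i in order]

docs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
meta = [{"original_id": "adhd-adults__Overview__Part_1"},
        {"original_id": "adhd-adults__Symptoms__Part_1"},
        {"original_id": "depression__Overview__Part_1"}]
engine = InMemorySearch(docs, meta)
matches = engine.query(np.array([1.0, 0.1]), top_k=2)
print(matches[0]["metadata"]["original_id"])  # adhd-adults__Overview__Part_1
```

Because the interface matches what `query_rag.py` consumes (a list of dicts with `metadata`), this can be dropped in behind `SearchEngine` for offline testing of the rest of the pipeline.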
src/streamlit_app.py CHANGED
@@ -1,40 +1,244 @@
- import altair as alt
- import numpy as np
- import pandas as pd
  import streamlit as st
-
- """
- # Welcome to Streamlit!
-
- Edit `/streamlit_app.py` to customize this app to your heart's desire :heart:.
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
-
- In the meantime, below is an example of what you can do with just a few lines of code:
- """
-
- num_points = st.slider("Number of points in spiral", 1, 10000, 1100)
- num_turns = st.slider("Number of turns in spiral", 1, 300, 31)
-
- indices = np.linspace(0, 1, num_points)
- theta = 2 * np.pi * num_turns * indices
- radius = indices
-
- x = radius * np.cos(theta)
- y = radius * np.sin(theta)
-
- df = pd.DataFrame({
-     "x": x,
-     "y": y,
-     "idx": indices,
-     "rand": np.random.randn(num_points),
- })
-
- st.altair_chart(alt.Chart(df, height=700, width=700)
-     .mark_point(filled=True)
-     .encode(
-         x=alt.X("x", axis=None),
-         y=alt.Y("y", axis=None),
-         color=alt.Color("idx", legend=None, scale=alt.Scale()),
-         size=alt.Size("rand", legend=None, scale=alt.Scale(range=[1, 150])),
-     ))
+ from typing import Dict, List
+
+ try:
+     from query_rag import RAGSystem
+ except ImportError as e:
+     st.error(f"Import error: {e}. Please ensure all required modules are available.")
+     st.stop()
+
+
+ # --- Page Configuration and Initialization ---
+ st.set_page_config(page_title="NHS Clinical Assistant", layout="wide")
+
+
+ # Initialize RAG System
+ def get_rag_system():
+     """Initialize the RAG system"""
+     try:
+         return RAGSystem()
+     except Exception as e:
+         st.error(f"Failed to initialize RAG system: {e}")
+         return None
+
+ # Initialize RAG system once at startup
+ if 'rag_system' not in st.session_state:
+     st.session_state.rag_system = get_rag_system()
+
+ rag_system = st.session_state.rag_system
+ if rag_system is None:
+     st.error("RAG system failed to initialize. Please check your configuration.")
+     st.stop()
+
+ # --- Helper Functions ---
+ def display_sources(sources_data: List[Dict]):
+     """Display sources with clean NHS formatting"""
+     if not sources_data:
+         st.markdown("No sources available for this response.")
+         return
+
+     for idx, source_info in enumerate(sources_data):
+         # Get metadata from source_info
+         metadata = source_info.get('metadata', {})
+         clean_section = metadata.get('clean_section', 'Unknown Section')
+         url = metadata.get('url', '')
+
+         source_text = f"**Source {idx+1}:** {clean_section}"
+         st.markdown(source_text)
+
+         if url:
+             st.markdown(f"🔗 [View Online]({url})")
+
+         st.markdown("---")
+
+
+ def initialize_session_state():
+     # Common state
+     if "app_mode" not in st.session_state:
+         st.session_state.app_mode = "NHS Chat"
+
+     # Chat specific state
+     if "chat_history" not in st.session_state:
+         st.session_state.chat_history = []
+     if "query" not in st.session_state:
+         st.session_state.query = ""
+     if "processing_query" not in st.session_state:
+         st.session_state.processing_query = False
+     if "query_to_run_next" not in st.session_state:
+         st.session_state.query_to_run_next = None
+     if "similarity_k" not in st.session_state:
+         st.session_state.similarity_k = 5
+     if "llm_model" not in st.session_state:
+         st.session_state.llm_model = "gemini-2.5-flash"
+
+
+ initialize_session_state()
+
+ # --- STYLING ---
+ st.markdown("""
+ <style>
+ .main {background-color: #f9f9f9; font-family: Arial, sans-serif;}
+ h1, h2, h3, h4, h5, h6 {color: #2b6777;}
+ h1 {font-weight: bold;}
+ [data-testid="stSidebar"] {background-color: #e8f0fe; padding: 10px;}
+ .result-box {
+     border-left: 4px solid #4CAF50;
+     padding: 10px;
+     background-color: #fff;
+     margin-bottom: 10px;
+     border-radius: 4px;
+     box-shadow: 0 1px 3px rgba(0,0,0,0.1);
+ }
+ div.stTextArea > div { border-radius: 8px; }
+ textarea { font-family: Arial, sans-serif; font-size: 16px; color: #333; resize: vertical; }
+ .stButton>button { border-radius: 5px; }
+ div.stSelectbox > label {
+     font-size: 16px !important;
+     font-weight: bold !important;
+ }
+ </style>
+ """, unsafe_allow_html=True)
+
+ # --- SIDEBAR ---
+ with st.sidebar:
+     st.header("🩺 NHS Clinical Assistant")
+
+     st.header("⚙️ Settings")
+
+     llm_options = ["gemini-2.5-flash", "gemini-2.5-flash-lite", "gemini-2.5-pro"]
+     try:
+         current_llm_index = llm_options.index(st.session_state.llm_model)
+     except ValueError:
+         current_llm_index = 0
+         st.session_state.llm_model = llm_options[0]
+
+     selected_llm = st.selectbox(
+         "LLM Model",
+         options=llm_options,
+         key="llm_model_selector",
+         index=current_llm_index
+     )
+     if selected_llm != st.session_state.llm_model:
+         st.session_state.llm_model = selected_llm
+
+     st.markdown("---")
+
+     def new_chat_callback():
+         st.session_state.chat_history = []
+         st.session_state.query = ""
+
+     if st.button("🗑️ New Chat", key="new_chat", on_click=new_chat_callback):
+         pass
+
+
+ # --- MAIN APPLICATION AREA ---
+ st.title("🩺 NHS Clinical Assistant")
+ st.markdown("Ask questions and get relevant information from trusted NHS health condition sources.")
+
+ def submit_and_process_query(query_to_send: str, display_query_text: str):
+     st.session_state.processing_query = True
+
+     try:
+         with st.spinner("Retrieving relevant NHS information..."):
+             response_chunks = []
+             sources_data = []
+             temp_response_placeholder = st.empty()
+
+             for chunk, chunk_sources_data in rag_system.query_rag_stream(
+                 query_to_send,
+                 st.session_state.llm_model,
+                 info_source="NHS",
+                 similarity_k=st.session_state.similarity_k,
+             ):
+                 response_chunks.append(chunk)
+                 sources_data = chunk_sources_data
+
+                 temp_response_placeholder.markdown(
+                     f"<div style='border-left: 4px solid #4CAF50; padding-left: 10px;'>{''.join(response_chunks)}</div>",
+                     unsafe_allow_html=True
+                 )
+
+             final_response = ''.join(response_chunks)
+             temp_response_placeholder.empty()
+
+             st.session_state.chat_history.append({
+                 "query_sent": query_to_send,
+                 "display_query": display_query_text,
+                 "response": final_response,
+                 "sources_data": sources_data,
+                 "llm_model": st.session_state.llm_model
+             })
+
+     except Exception as e:
+         st.error(f"Error processing query: {e}")
+     finally:
+         st.session_state.processing_query = False
+         st.rerun()
+
+ # Display chat history
+ for i, chat_entry in enumerate(st.session_state.chat_history):
+     st.markdown(f"👤 **You:** {chat_entry['display_query']}")
+
+     response_info = f"(LLM: {chat_entry.get('llm_model', 'N/A')})"
+
+     st.markdown(f"🤖 **Assistant** {response_info}:")
+     st.markdown(
+         f"<div style='border-left: 4px solid #4CAF50; padding-left: 10px; margin-bottom: 10px;'>{chat_entry['response']}</div>",
+         unsafe_allow_html=True
+     )
+
+     st.subheader("📚 Sources:")
+     with st.expander("View Sources", expanded=False):
+         sources_data = chat_entry.get("sources_data", [])
+         if sources_data:
+             display_sources(sources_data)
+         else:
+             st.markdown("No sources available for this response.")
+     st.markdown("---")
+
+ # Suggested queries
+ st.markdown("<h6>💡 Suggested Queries:</h6>", unsafe_allow_html=True)
+ suggested_queries_list = [
+     "What are the symptoms of ADHD in adults?",
+     "How is type 2 diabetes diagnosed?",
+     "What are the treatment options for depression?"
+ ]
+ sq_cols = st.columns(len(suggested_queries_list))
+ for idx, sq_text_item in enumerate(suggested_queries_list):
+     if sq_cols[idx].button(
+         sq_text_item,
+         key=f"suggested_{idx}",
+         disabled=st.session_state.processing_query
+     ):
+         st.session_state.processing_query = True
+         st.session_state.query_to_run_next = sq_text_item
+         st.rerun()
+
+
+ # User input section
+ user_query = st.chat_input(
+     "e.g., What are the symptoms of ADHD?",
+     max_chars=1000,
+     disabled=st.session_state.processing_query
+ )
+
+ if user_query:
+     st.session_state.processing_query = True
+     st.session_state.query_to_run_next = user_query
+     st.rerun()
+
+ # Process query if one is set to run next
+ if st.session_state.get("query_to_run_next"):
+     query_to_process = st.session_state.query_to_run_next
+     st.session_state.query_to_run_next = None  # Clear it so it doesn't run again
+     submit_and_process_query(query_to_process, query_to_process)
+
+ # --- Footer with Licensing Information ---
+ st.markdown("---")
+ st.caption("""
+ **Data Usage and Licensing:**
+ This tool utilizes information from NHS sources, which is made available under their respective open licensing terms.
+ - **NHS:** Content is used under the terms of the Open Government Licence. For full details, please refer to the [NHS Terms and Conditions](https://www.nhs.uk/our-policies/terms-and-conditions/).
+
+ Always consult the official sources for the most accurate, complete, and up-to-date information.
+ """)
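The app consumes `query_rag_stream` as a generator of `(chunk, sources)` tuples: the text accumulates chunk by chunk while the sources list, which is identical on every tuple, is simply overwritten each iteration. The pattern is independent of Streamlit; with a hypothetical stand-in generator it looks like this:

```python
from typing import Dict, Generator, List, Tuple

def fake_stream() -> Generator[Tuple[str, List[Dict]], None, None]:
    """Hypothetical stand-in for RAGSystem.query_rag_stream."""
    sources = [{"metadata": {"clean_section": "Adhd Adults - Overview", "url": ""}}]
    for chunk in ["ADHD symptoms ", "include ", "inattention."]:
        yield chunk, sources

response_chunks: List[str] = []
sources_data: List[Dict] = []
for chunk, chunk_sources in fake_stream():
    response_chunks.append(chunk)  # grow the rendered text incrementally
    sources_data = chunk_sources   # same sources every chunk; last one wins

final_response = "".join(response_chunks)
print(final_response)  # ADHD symptoms include inattention.
```

In the real app each iteration also re-renders a placeholder `st.markdown` with the joined chunks, which is what produces the typing effect in the browser.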