ohmygaugh committed
Commit 398a370 · 1 Parent(s): 104d504

planning phase
.cursorindexingignore ADDED
@@ -0,0 +1,3 @@
+
+ # Don't index SpecStory auto-save files, but allow explicit context inclusion via @ references
+ .specstory/**
.specstory/.gitignore ADDED
@@ -0,0 +1,2 @@
+ # SpecStory explanation file
+ /.what-is-this.md
.specstory/clickup-tasks/todo.md ADDED
@@ -0,0 +1,505 @@
+ Intelligent Graph-Based SQL Federation Middleware - Requirements
+ Project Overview
+ A Docker-containerized middleware system that uses graph analytics to intelligently federate queries across multiple relational databases, translating natural language questions into coordinated SQL queries.
+ Example problem to solve:
+ An LLM asks to find all the customers connected to a specific drug and pull pricing information. Answering this request may require querying 3 different databases. Therefore, our intelligent middleware system reads the different schemas from the federated sources and designs a plan to retrieve the information for the user LLM asking the question.
+
+
+
+ Epic 1: Database Connection Management
+ Feature 1.1: Multi-Database Connection Registry
+ As a system administrator
+ I want to configure and manage connections to multiple relational databases
+ So that the middleware can access federated data sources
+ Story 1.1.1: Secure Credential Storage
+ Acceptance Criteria:
+ System reads database credentials from encrypted key files
+ Supports username/password authentication
+ Credentials stored outside container (mounted volume)
+ Support for multiple database types (PostgreSQL, MySQL, SQL Server, Oracle)
+ Connection string validation on startup
+ Story 1.1.2: Connection Health Monitoring
+ Acceptance Criteria:
+ System validates connections on initialization
+ Periodic health checks (configurable interval)
+ Logs connection failures with descriptive errors
+ Automatic retry logic with exponential backoff
+ Status endpoint to query connection health
+ Story 1.1.3: MCP Protocol Support (Optional/Future)
+ Acceptance Criteria:
+ Support Model Context Protocol connections as an alternative to direct DB access
+ Fallback to direct connection if MCP unavailable
+ Configuration flag to toggle MCP vs direct connection
+
+ Epic 2: Schema Discovery & Graph Representation
+ Feature 2.1: Automatic Schema Ingestion
+ As a data architect
+ I want to automatically discover and index database schemas
+ So that the system understands available data structures
+ Story 2.1.1: Schema Metadata Extraction
+ Acceptance Criteria:
+ Extract table names, column names, data types
+ Identify primary keys and foreign keys
+ Capture indexes and constraints
+ Store table/column descriptions if available in DB metadata
+ Support incremental schema updates
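To make the extraction story concrete, here is a minimal sketch using only the stdlib `sqlite3` PRAGMAs; in production the same shape of metadata would come from SQLAlchemy's `inspect()` so it works across PostgreSQL, MySQL, etc. The table names in the test are illustrative, not from the codebase.

```python
import sqlite3

def extract_schema(conn: sqlite3.Connection) -> dict:
    """Pull table, column, primary-key and foreign-key metadata from SQLite."""
    schema = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        fks = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
        schema[table] = {
            # table_info rows: (cid, name, type, notnull, default, pk)
            "columns": {c[1]: c[2] for c in cols},
            "primary_key": [c[1] for c in cols if c[5]],
            # foreign_key_list rows: (id, seq, table, from, to, ...)
            "foreign_keys": [{"column": f[3], "refers_to": f[2]} for f in fks],
        }
    return schema
```

The returned dict is exactly the payload a schema-ingestion job could write into the knowledge graph.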
+ Story 2.1.2: Cross-Database Relationship Discovery
+ Acceptance Criteria:
+ Identify implicit relationships between tables across databases
+ Use naming conventions to suggest relationships (e.g., customer_id fields)
+ Allow manual relationship definition/override
+ Store relationship confidence scores
+ Feature 2.2: Knowledge Graph Construction
+ As a system architect
+ I want to represent database schemas as a knowledge graph
+ So that the LLM can understand data relationships and plan queries
+ Story 2.2.1: Graph Schema Storage
+ Acceptance Criteria:
+ Nodes represent: databases, tables, columns, relationships
+ Edges represent: contains, references, foreign_key relationships
+ Graph stored in embedded graph database (Neo4j, TigerGraph, or similar)
+ Node properties include: data types, cardinality estimates, usage frequency
+ Story 2.2.2: Semantic Annotations
+ Acceptance Criteria:
+ Support manual annotation of tables/columns with business terms
+ Tag tables with domains (e.g., "customer", "pricing", "product")
+ Associate synonyms with columns (e.g., "client" → "customer")
+ Store sample values for key columns to aid LLM understanding
+ Story 2.2.3: Graph Query API
+ Acceptance Criteria:
+ RESTful API to query graph structure
+ Support graph traversal queries (find path between entities)
+ Return subgraphs relevant to query keywords
+ Export graph in standard formats (GraphML, JSON-LD)
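The node/edge model and keyword search above can be sketched in a few lines; this is an in-memory stand-in (a real deployment would store it in Neo4j or Apache AGE, and the node ids, tags and synonyms below are illustrative):

```python
class SchemaGraph:
    """Tiny in-memory knowledge graph: typed nodes plus (src, relation, dst) edges."""

    def __init__(self):
        self.nodes = {}   # node_id -> {"type": ..., "props": {...}}
        self.edges = []   # (src, relation, dst) triples

    def add_node(self, node_id, node_type, **props):
        self.nodes[node_id] = {"type": node_type, "props": props}

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def search(self, keyword):
        """Return node ids whose id, domain tags or synonyms mention the keyword."""
        kw = keyword.lower()
        hits = []
        for nid, node in self.nodes.items():
            haystack = [nid] + node["props"].get("tags", []) + node["props"].get("synonyms", [])
            if any(kw in s.lower() for s in haystack):
                hits.append(nid)
        return hits

g = SchemaGraph()
g.add_node("pharma_db", "database")
g.add_node("pharma_db.customers", "table", tags=["customer"], synonyms=["client"])
g.add_edge("pharma_db", "contains", "pharma_db.customers")
```

With synonyms attached, a query mentioning "client" resolves to the customers table even though no column is named that.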
+
+ Epic 3: Human-in-the-Loop Configuration
+ Feature 3.1: Connection Management Interface
+ As a database administrator
+ I want to manage database connections and priorities
+ So that I can control which data sources are used
+ Story 3.1.1: Connection CRUD Operations
+ Acceptance Criteria:
+ API endpoints to add/edit/remove database connections
+ Update credentials without system restart
+ Enable/disable connections without deletion
+ Validation of connection parameters before saving
+ Story 3.1.2: Table Priority Configuration
+ Acceptance Criteria:
+ Assign priority scores to tables (1-10 scale)
+ Mark certain tables as "preferred" for specific query types
+ Set cost/latency estimates per database
+ Configure table deprecation warnings
+ Feature 3.2: Schema Annotation Interface
+ As a data steward
+ I want to annotate and enrich schema metadata
+ So that query planning is more accurate
+ Story 3.2.1: Business Glossary Management
+ Acceptance Criteria:
+ Add/edit business descriptions for tables and columns
+ Define domain tags and categories
+ Specify which tables contain authoritative data
+ Mark PII/sensitive columns with special flags
+ Story 3.2.2: Relationship Override
+ Acceptance Criteria:
+ Manually define cross-database relationships
+ Override auto-discovered relationships
+ Specify join conditions between tables
+ Document rationale for manual relationships
+
+ Epic 4: Intelligent Query Planning
+ Feature 4.1: Natural Language Query Understanding
+ As a user LLM
+ I want to send natural language queries
+ So that I can retrieve data without writing SQL
+ Story 4.1.1: Query Intent Classification
+ Acceptance Criteria:
+ Parse incoming natural language query
+ Extract entities (customers, drugs, pricing, etc.)
+ Identify required data attributes
+ Classify query type (lookup, aggregation, join, etc.)
+ Return confidence score for understanding
+ Story 4.1.2: Entity-to-Schema Mapping
+ Acceptance Criteria:
+ Map query entities to graph nodes (tables/columns)
+ Use semantic annotations and business glossary
+ Leverage embeddings for fuzzy matching
+ Return top-k candidate mappings with confidence scores
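A minimal sketch of the top-k mapping step. It uses string similarity as a stand-in for embedding cosine similarity (so it stays self-contained); a real implementation would embed the entity and the candidate table/column names plus their synonyms and compare vectors. The candidate names are illustrative.

```python
from difflib import SequenceMatcher

def map_entity(entity: str, candidates: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Return the top-k (candidate, confidence) pairs for a query entity.
    SequenceMatcher.ratio() stands in for embedding similarity here."""
    scored = [
        (c, SequenceMatcher(None, entity.lower(), c.lower()).ratio())
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The confidence scores let the planner ask for human confirmation when the best match is weak.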
+ Feature 4.2: Federated Query Plan Generation
+ As a query planner
+ I want to generate optimal execution plans across databases
+ So that queries are efficient and accurate
+ Story 4.2.1: Graph-Based Query Decomposition
+ Acceptance Criteria:
+ Use graph traversal to find paths between required entities
+ Identify which databases contain needed data
+ Decompose complex queries into sub-queries per database
+ Minimize cross-database joins
+ Generate dependency graph of sub-queries
+ Story 4.2.2: Query Optimization
+ Acceptance Criteria:
+ Consider table priorities in plan selection
+ Estimate query costs (latency, data volume)
+ Choose optimal join strategies
+ Apply predicate pushdown where possible
+ Generate multiple candidate plans with cost estimates
+ Story 4.2.3: Execution Plan Explainability
+ Acceptance Criteria:
+ Generate human-readable explanation of query plan
+ Show which databases will be queried
+ Explain why certain tables were chosen
+ Visualize query execution flow
+ Include estimated execution time
+
+ Epic 5: Query Execution & Result Compilation
+ Feature 5.1: Distributed Query Execution
+ As a query executor
+ I want to execute SQL across multiple databases
+ So that I can retrieve federated results
+ Story 5.1.1: Parallel Sub-Query Execution
+ Acceptance Criteria:
+ Execute independent sub-queries in parallel
+ Respect query dependencies (wait for required data)
+ Handle connection pooling per database
+ Set per-query timeouts
+ Collect execution metrics (time, rows returned)
+ Story 5.1.2: Error Handling & Fallbacks
+ Acceptance Criteria:
+ Gracefully handle query failures
+ Provide partial results if some queries fail
+ Suggest alternative queries on failure
+ Log detailed error information
+ Implement circuit breaker pattern for failing databases
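A minimal circuit-breaker sketch for the last criterion (threshold and cooldown values are illustrative defaults, not from the spec): after enough consecutive failures the breaker "opens" and the executor skips that database until a cooldown elapses, then lets one probe request through.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: reset and permit one probe request.
            self.opened_at, self.failures = None, 0
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

The executor would keep one breaker per registered database and check `allow()` before dispatching each sub-query.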
+ Feature 5.2: Result Integration & Formatting
+ As a result processor
+ I want to combine and format query results
+ So that the user LLM receives coherent answers
+ Story 5.2.1: Cross-Database Join Processing
+ Acceptance Criteria:
+ Perform in-memory joins on results from different databases
+ Support common join types (inner, left, outer)
+ Handle data type mismatches gracefully
+ Optimize memory usage for large result sets
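Since each sub-query returns rows from a different database, the in-memory join is just a merge on the shared key; a pandas sketch (column values and source databases are illustrative):

```python
import pandas as pd

# Rows returned by two different databases, joined in the middleware's memory.
customers = pd.DataFrame({"customer_id": [1, 2],
                          "name": ["John Smith", "Jane Doe"]})      # from pharma_db
prices = pd.DataFrame({"customer_id": [1, 2],
                       "price": [12.99, 15.99]})                     # from pricing_db

# how="left" keeps every customer even when the price lookup misses.
joined = customers.merge(prices, on="customer_id", how="left")
```

For the "optimize memory" criterion, the same merge can be done chunk-by-chunk when one side is large.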
+ Story 5.2.2: LLM-Powered Result Synthesis
+ Acceptance Criteria:
+ Use LLM to format raw results into natural language answers
+ Generate summary statistics where appropriate
+ Create structured JSON responses with metadata
+ Include data provenance (which DB each piece came from)
+ Handle null/missing values intelligently
+ Story 5.2.3: Response Caching
+ Acceptance Criteria:
+ Cache query results with TTL
+ Use query fingerprint for cache key
+ Support cache invalidation on schema changes
+ Configurable cache size limits
+ Cache hit/miss metrics
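A sketch of the fingerprint-keyed TTL cache (the production version would sit in Redis per the stack recommendations; the normalization rule here is an assumption):

```python
import hashlib
import time

class ResultCache:
    """TTL cache keyed by a normalized query fingerprint, with hit/miss counters."""

    def __init__(self, ttl: float = 300.0, max_entries: int = 1024):
        self.ttl, self.max_entries = ttl, max_entries
        self.store = {}            # fingerprint -> (inserted_at, result)
        self.hits = self.misses = 0

    @staticmethod
    def fingerprint(query: str) -> str:
        # Collapse whitespace and case so trivially different queries share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        entry = self.store.get(self.fingerprint(query))
        if entry and time.monotonic() - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        return None

    def put(self, query: str, result) -> None:
        if len(self.store) >= self.max_entries:
            self.store.pop(next(iter(self.store)))   # evict oldest insertion
        self.store[self.fingerprint(query)] = (time.monotonic(), result)
```

Schema-change invalidation would simply clear `store`, and `hits`/`misses` feed the metrics endpoint.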
+
+ Epic 6: System Infrastructure & DevOps
+ Feature 6.1: Docker Containerization
+ As a DevOps engineer
+ I want to deploy the system as a Docker container
+ So that it's portable and easy to manage
+ Story 6.1.1: Container Build & Configuration
+ Acceptance Criteria:
+ Dockerfile with multi-stage build
+ Environment-based configuration
+ Volume mounts for credentials and graph storage
+ Health check endpoint
+ Minimal base image (security hardened)
+ Story 6.1.2: Docker Compose Setup
+ Acceptance Criteria:
+ Compose file with middleware + graph DB
+ Network configuration for service communication
+ Persistent volumes for data
+ Easy local development setup
+ Environment variable templates
+ Feature 6.2: Observability & Monitoring
+ As a system operator
+ I want to monitor system performance and health
+ So that I can ensure reliability
+ Story 6.2.1: Structured Logging
+ Acceptance Criteria:
+ JSON-formatted logs
+ Log levels: DEBUG, INFO, WARN, ERROR
+ Request tracing with correlation IDs
+ Performance timing logs
+ Sensitive data redaction
+ Story 6.2.2: Metrics & Instrumentation
+ Acceptance Criteria:
+ Prometheus-compatible metrics endpoint
+ Track: query latency, error rates, cache hits
+ Database connection pool metrics
+ Graph query performance metrics
+ Export metrics in OpenTelemetry format
+
+ Epic 7: API & Integration Layer
+ Feature 7.1: RESTful Query API
+ As a client LLM
+ I want to submit queries via REST API
+ So that I can integrate with the middleware
+ Story 7.1.1: Query Submission Endpoint
+ Acceptance Criteria:
+ POST /api/v1/query endpoint
+ Accept natural language query in request body
+ Return JSON response with results and metadata
+ Support synchronous and async query modes
+ Include request ID for tracking
+ Story 7.1.2: Query Status & Results Retrieval
+ Acceptance Criteria:
+ GET /api/v1/query/{request_id}/status
+ GET /api/v1/query/{request_id}/results
+ Webhook callback support for async queries
+ Streaming results for large datasets
+ Pagination support
+ Feature 7.2: Administrative API
+ As a system administrator
+ I want to manage system configuration via API
+ So that I can automate operations
+ Story 7.2.1: Schema Management Endpoints
+ Acceptance Criteria:
+ POST /api/v1/databases (register new database)
+ PUT /api/v1/databases/{id}/refresh (update schema)
+ GET /api/v1/graph/search (query knowledge graph)
+ POST /api/v1/annotations (add semantic annotations)
+ Story 7.2.2: System Control Endpoints
+ Acceptance Criteria:
+ GET /health (readiness and liveness)
+ GET /metrics (Prometheus metrics)
+ POST /cache/clear
+ GET /api/v1/connections/test
+
+ Example Workflow: Drug-Customer-Pricing Query
+ Scenario
+ Query: "Find all customers connected to drug 'Aspirin' and pull pricing information"
+ System Flow:
+ Query Understanding:
+ Entities identified: "customers", "drug: Aspirin", "pricing"
+ Intent: Multi-entity lookup with join
+ Graph Traversal:
+ Find path: Customer → Prescription → Drug → Pricing
+ Identify tables: pharma_db.customers, orders_db.prescriptions, pharma_db.drugs, pricing_db.drug_prices
+ Query Plan:
+ Query 1 (pharma_db): SELECT drug_id FROM drugs WHERE name = 'Aspirin'
+ Query 2 (orders_db): SELECT customer_id FROM prescriptions WHERE drug_id = ?
+ Query 3 (pharma_db): SELECT * FROM customers WHERE id IN (?)
+ Query 4 (pricing_db): SELECT * FROM drug_prices WHERE drug_id = ?
+ Execution:
+ Execute Query 1, get drug_id = 12345
+ Execute Query 2 & 4 in parallel using drug_id
+ Execute Query 3 using customer_ids from Query 2
+ Result Synthesis:
+ LLM combines results into natural language response
+ Includes customer names, order details, and pricing
+ Cites source databases in response
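The execution steps above can be sketched with `concurrent.futures`: Query 1 first, Queries 2 and 4 in parallel on its output, then Query 3 on Query 2's output. The per-database query functions are stand-ins returning canned values (a real system would run the SQL against pharma_db, orders_db and pricing_db).

```python
from concurrent.futures import ThreadPoolExecutor

# Stubs for the four sub-queries in the plan; return values are illustrative.
def q1_drug_id(name: str) -> int:
    return 12345

def q2_customer_ids(drug_id: int) -> list[int]:
    return [1, 2, 3]

def q3_customers(customer_ids: list[int]) -> list[dict]:
    return [{"id": c, "name": f"customer-{c}"} for c in customer_ids]

def q4_prices(drug_id: int) -> list[dict]:
    return [{"drug_id": drug_id, "price": 12.99}]

def run_plan(drug_name: str) -> dict:
    drug_id = q1_drug_id(drug_name)                    # step 1: resolve the drug
    with ThreadPoolExecutor() as pool:                 # step 2: independent queries in parallel
        fut_customers = pool.submit(q2_customer_ids, drug_id)
        fut_prices = pool.submit(q4_prices, drug_id)
        customer_ids = fut_customers.result()
        prices = fut_prices.result()
    customers = q3_customers(customer_ids)             # step 3: depends on Query 2's output
    return {"customers": customers, "prices": prices}
```

The dependency structure (1 before {2, 4}, 2 before 3) is exactly the sub-query dependency graph from Story 4.2.1.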
+
+ Non-Functional Requirements
+ Performance
+ Query response time < 5 seconds for 80% of queries
+ Support concurrent query execution (10+ simultaneous queries)
+ Graph traversal queries < 100ms
+ Security
+ Encrypted credential storage (AES-256)
+ TLS for all external API communications
+ Role-based access control for admin APIs
+ SQL injection prevention
+ Audit logging of all query executions
+ Scalability
+ Horizontal scaling of query executors
+ Support for 100+ database connections
+ Handle schemas with 1000+ tables
+ Process result sets up to 100k rows
+ Reliability
+ 99.5% uptime target
+ Graceful degradation on partial failures
+ Automatic retry on transient errors
+ Data consistency validation
+
+ Technical Stack Recommendations
+ Language: Python 3.11+ (for LLM integration, data processing)
+ Graph Database: Neo4j Community Edition or Apache AGE
+ API Framework: FastAPI
+ Database Drivers: SQLAlchemy (multi-DB support)
+ LLM Integration: LangChain or direct API calls
+ Caching: Redis
+ Containerization: Docker + Docker Compose
+ Monitoring: Prometheus + Grafana
+
+
+ ---
+
+
+ Epic 8: Streamlit Demonstration Interface
+ Feature 8.1: Simple Single-Page Query Demo
+ As a product demonstrator
+ I want to showcase the middleware capabilities through a simple web interface
+ So that stakeholders can see the federated query system in action
+
+ Story 8.1.1: Single-Page Query Interface
+ Acceptance Criteria:
+ Single page Streamlit app with three main sections:
+ Query Input (top)
+ Execution Visualization (middle)
+ Results Display (bottom)
+ Text area for natural language query input
+ "Execute Query" button
+ Dropdown with 3-4 example queries for quick testing
+ Clean, minimal design with clear visual separation between sections
+ Example Queries:
+ "Find all customers connected to drug 'Aspirin' and pull pricing information"
+ "Show me customers who have purchased drugs over $500"
+ "List all prescriptions for customer 'John Smith' with pricing"
+
+ Story 8.1.2: Middleware Execution Visualization
+ Acceptance Criteria:
+ Display execution progress in real-time with simple status indicators
+ Show 4 key phases as they happen:
+ 🔍 Understanding Query - Show identified entities (customers, drug: Aspirin, pricing)
+ 🗺️ Finding Tables - Display which tables will be used (e.g., "3 tables across 3 databases")
+ ⚙️ Executing Queries - Show databases being queried with loading animation
+ ✅ Compiling Results - Brief status before showing results
+ Each phase appears sequentially with checkmark when complete
+ Simple text descriptions, no complex visualizations
+ Total execution time displayed at the end
+ Visual Example:
+ Execution Status:
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+
+ ✓ Understanding Query (0.2s)
+ Entities: customers, drug (Aspirin), pricing
+
+ ✓ Finding Tables (0.3s)
+ Located 3 tables across 3 databases
+ • pharma_db.customers
+ • orders_db.prescriptions
+ • pricing_db.drug_prices
+
+ ✓ Executing Queries (1.8s)
+ Queried: pharma_db → orders_db → pricing_db
+
+ ✓ Compiling Results (0.4s)
+ Retrieved 47 records
+
+ Total Time: 2.7s
+
+
+ Story 8.1.3: Results Display
+ Acceptance Criteria:
+ Display results in two formats:
+ Summary: Natural language answer (2-3 sentences from LLM)
+ Data Table: Clean dataframe showing the retrieved data
+ Table includes:
+ Customer names
+ Drug information
+ Pricing details
+ Database source indicator (subtle badge or column)
+ Show row count and key statistics (e.g., "47 customers found, avg price: $24.50")
+ Download CSV button below the table
+ Results are paginated if more than 50 rows
+ Example Output:
+ 📊 Summary:
+ Found 47 customers who have been prescribed Aspirin. The pricing
+ ranges from $5.99 to $15.99 depending on dosage and pharmacy.
+
+ Results (47 rows):
+ ┌──────────────┬────────────┬──────────┬──────────┐
+ │ Customer     │ Drug       │ Price    │ Source   │
+ ├──────────────┼────────────┼──────────┼──────────┤
+ │ John Smith   │ Aspirin    │ $12.99   │ pharma   │
+ │ Jane Doe     │ Aspirin    │ $15.99   │ pharma   │
+ │ ...          │ ...        │ ...      │ ...      │
+ └──────────────┴────────────┴──────────┴──────────┘
+
+ [Download CSV]
+
+
+ Story 8.1.4: Basic Configuration Display
+ Acceptance Criteria:
+ Sidebar shows simple connection status
+ List of connected databases with status indicators (🟢/🔴)
+ Total table count per database
+ No edit functionality needed - just display
+ Collapsible sidebar to maximize screen space
+ Sidebar Content:
+ 📂 Connected Databases
+
+ 🟢 pharma_db
+ 3 tables
+
+ 🟢 orders_db
+ 2 tables
+
+ 🟢 pricing_db
+ 1 table
+
+
+ Story 8.1.5: Error Handling
+ Acceptance Criteria:
+ Display friendly error messages if queries fail
+ Show which phase failed and basic reason
+ Suggest trying example queries if custom query fails
+ Include basic troubleshooting hint (e.g., "Check database connections")
+ Error doesn't break the interface - user can try another query
+
+ Technical Implementation
+ Streamlit App Structure
+ streamlit_app/
+ ├── app.py                 # Single main file (~200-300 lines)
+ ├── middleware_client.py   # API calls to middleware
+ ├── requirements.txt
+ └── config.py              # Database connection display config
+
+ Simple API Integration
+ # Query submission
+ POST /api/v1/query
+ {
+   "query": "Find all customers connected to drug 'Aspirin'..."
+ }
+
+ # Response includes execution phases and results
+ {
+   "request_id": "abc123",
+   "phases": [
+     {"name": "understanding", "duration_ms": 200, "details": {...}},
+     {"name": "planning", "duration_ms": 300, "details": {...}},
+     {"name": "execution", "duration_ms": 1800, "details": {...}},
+     {"name": "synthesis", "duration_ms": 400, "details": {...}}
+   ],
+   "results": {
+     "summary": "Found 47 customers...",
+     "data": [...],
+     "row_count": 47
+   }
+ }
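On the Streamlit side, a `middleware_client.py` helper could split that response into the pieces each UI section needs: phase timings for the visualization, the summary string, and a DataFrame for the table. A sketch (the payload shape follows the example response above; the function name is an assumption):

```python
import pandas as pd

def parse_response(payload: dict) -> tuple[list[tuple[str, float]], float, str, pd.DataFrame]:
    """Split a middleware response into (phases, total_seconds, summary, dataframe)."""
    phases = [(p["name"], p["duration_ms"] / 1000.0) for p in payload["phases"]]
    total_s = sum(seconds for _, seconds in phases)
    results = payload.get("results", {})
    df = pd.DataFrame(results.get("data", []))
    return phases, total_s, results.get("summary", ""), df
```

`phases` drives the sequential checkmark display, `total_s` the "Total Time" line, and `df` feeds `st.dataframe` plus the CSV download.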
+
+ Key Dependencies
+ streamlit>=1.28.0
+ requests>=2.31.0
+ pandas>=2.1.0
+
+
+ Acceptance Criteria for Epic 8
+ [ ] Single page Streamlit app runs in Docker container
+ [ ] User can input natural language queries
+ [ ] Example queries work correctly
+ [ ] Execution phases display sequentially with timing
+ [ ] Results show as summary + data table
+ [ ] Database connection status visible in sidebar
+ [ ] Download CSV functionality works
+ [ ] Error messages are clear and helpful
+ [ ] Complete demo runs in < 5 seconds
+ [ ] UI is clean and uncluttered
+ [ ] Successfully demonstrates federated query across 3 databases
+
+ Total Scope: One simple page, ~300 lines of code, focused entirely on demonstrating the core capability with minimal complexity.
README.md CHANGED
@@ -1,4 +1,5 @@
  # Graph-Driven Agentic System MVP
+ "Keep your data where it is but we will treat it like a graph for you and solve these problems for you"
 
  ## Overview
  An intelligent agent system that reads instructions from Neo4j, queries PostgreSQL databases, pauses for human review, and maintains a complete audit trail. The system demonstrates agentic workflow orchestration with human-in-the-loop controls.
@@ -27,6 +28,16 @@ An intelligent agent system that reads instructions from Neo4j, queries PostgreS
  └─────────────┘ └─────────────┘
  ```
 
+ ###### Any MCP application/API/Agent can be both a client and a server. Clients and servers are a logical separation only, not a physical one. There is a natural idea of chaining/composability between clients and servers, like a fire-bucket chain of context slosh. Use a pydantic graph here as the engine for the orchestrator? I think the point is to create a co-pilot for the analyst that uses GraphRAG to inform itself, given the user's request, to think in GraphRAG before determining how to navigate the MCP tools.
+
+ actually you aren't immediately writing data to Neo4j from the relational DB; instead it's about doing GraphRAG to curate the proper SQL statements and tool calls to make... there may be tools to do this.
+
+ Use MCP Inspector and also have an MCP server that automatically checks the logs in the inspector: https://modelcontextprotocol.io/docs/tools/inspector
+
+ pydantic is key here, and not worrying about the frontend until demo time
+
  ### Components
 
  - **Neo4j**: Graph database storing workflows, instructions, and execution metadata
agent/main.py CHANGED
@@ -22,6 +22,24 @@ if "gpt" in LLM_MODEL:
  else:
      llm_client = Anthropic(api_key=LLM_API_KEY)
 
+ # Define the agents
+ ## Data Procurement Agent
+ ## Graph Analysis Agent
+ ## x Agent etc...
+
+ # Create the Orchestrator with all the agents and their tasks per the orchestrator.py file
+ orchestrator = Orchestrator(
+     llm_factory=EastridgeAugmentedLLM,
+     available_agents=[
+         DataProcurementAgent(),
+         GraphAnalysisAgent(),
+         xAgent(),
+     ],
+     plan_type="full",
+     plan_output_path=Path("output/execution_plan.md"),
+ )
+
 
  # Global flag for interrupt handling
  interrupted = False
agent/orchestrator.py ADDED
File without changes
agent/plan.md ADDED
@@ -0,0 +1,20 @@
+ {
+   "data": {
+     "steps": [
+       {
+         "description": "Procure data from the source",
+         "tasks": {
+           "description": "Go to the relational database and procure the data we need. (1st find out what we need to procure from the KG/graph db? thus using the knowledge graph to curate the semantics used for the SQL query? or can we just hard code the queries like MindsDB? or do we use it to instruct/populate these descriptions for the agent to reference here (is that actually used as system_prompt input/augmentation?))",
+           "agent": "data_procurement_agent"
+         }
+       },
+       {
+         "description": "Transform the data into a graph database",
+         "tasks": {
+           "description": "run analysis on the data and transform it into a graph database",
+           "agent": "graph_transformation_agent"
+         }
+       }
+     ]
+   }
+ }
agent/task.md ADDED
@@ -0,0 +1,24 @@
+ # overall task you want the agent(s) to accomplish:
+
+ # Task title
+
+ Do this step by step:
+
+ 1.
+ 2.
+ 3.
+ 4.
+ 5.
+ 6.
+ 7.
+ 8.
+ 9.
+ 10.
+
+ Create this output in this format:
+
+ Save as this file naming convention -
+
+ Persona/System_Prompt Instructions:
+
+ Read only access to the relational database.
plan.md ADDED
@@ -0,0 +1,66 @@
+ # Implementation Plan: Intelligent Graph-Based SQL Federation Middleware (Revised)
+
+ This document outlines the revised strategy to implement the target features by integrating valuable assets from the `semantic-query-router` codebase into our existing architecture.
+
+ ### Overall Strategy
+ The core task is to evolve the current single-step agent into a multi-step, GraphRAG-powered orchestrator using LangChain. We will enhance the MCP server with advanced core logic, replace PostgreSQL with a rich life sciences SQLite dataset, and transform the Streamlit monitor into a fully conversational chat UI. The `frontend/` Next.js application will be deprecated.
+
+ ---
+
+ ### Phase 1: Integrate New Dataset & Core Logic (Due by Friday, Oct 3rd)
+
+ **Goal**: Replace the existing data foundation with the life sciences dataset and upgrade the MCP server with advanced, reusable logic from the `semantic-query-router` project.
+
+ - **Task 1.1: Adopt Life Sciences Dataset**
+   - Integrate the `generate_sample_databases.py` script into our `ops/scripts/` directory.
+   - Create a new `make seed-db` command in the `Makefile` to generate the `clinical_trials.db`, `laboratory.db`, and `drug_discovery.db` SQLite files.
+   - Update `docker-compose.yml` to remove the PostgreSQL service and mount the new `data/` directory for the SQLite databases.
+
+ - **Task 1.2: Enhance MCP Server with Core Logic**
+   - Create a new `mcp/core/` directory.
+   - Migrate the advanced logic from `semantic-query-router/src/core/` (`discovery.py`, `graph.py`, `intelligence.py`) into our `mcp/core/` directory.
+   - Refactor these modules to fit our project structure and standards.
+
+ - **Task 1.3: Create a Dedicated Ingestion Process**
+   - Create a new script, `ops/scripts/ingest.py`, that uses the new core logic to perform a one-time ingestion of the SQLite database schemas into Neo4j.
+   - Create a `make ingest` command in the `Makefile` to run this script. This separates the schema ingestion process from the agent's runtime duties, making the system more modular.
+   - Remove the schema discovery logic from `agent/main.py`.
+
+ ---
+
+ ### Phase 2: Rebuild Agent with LangChain (Due by Tuesday, Oct 7th)
+
+ **Goal**: Re-architect the agent from a simple script into a robust LangChain-powered orchestrator that leverages the enhanced MCP server.
+
+ - **Task 2.1: Refactor Agent to use LangChain**
+   - Overhaul `agent/main.py` to implement the `AgentExecutor` pattern from `langchain_integration.py`.
+   - Define a formal agent prompt that instructs the LLM on how to use the available tools to answer questions.
+
+ - **Task 2.2: Implement Custom LangChain Tools**
+   - Create a new `agent/tools.py` file.
+   - Implement custom LangChain tools that make authenticated REST API calls to our enhanced MCP server.
+   - The tools will include: `SchemaSearchTool`, `JoinPathFinderTool`, and `QueryExecutorTool`. These tools will act as clients to the powerful logic we integrated into the MCP in Phase 1.
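A framework-agnostic sketch of what one of these tool clients could look like. It is not LangChain's actual `BaseTool` interface; the `transport` callable is injected so the tool can be exercised without a live MCP server, and the endpoint path mirrors the `GET /api/v1/graph/search` route from the requirements (everything else is an assumption):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SchemaSearchTool:
    """Client-side tool wrapping the MCP server's graph-search endpoint.

    `transport` is a (url, params) -> JSON-dict callable; in production it
    would be an authenticated `requests.get(...).json()` wrapper.
    """
    base_url: str
    transport: Callable[[str, dict], dict]

    name: str = "schema_search"
    description: str = "Search the schema knowledge graph for tables matching a keyword."

    def run(self, keyword: str) -> dict:
        return self.transport(f"{self.base_url}/api/v1/graph/search", {"q": keyword})
```

`JoinPathFinderTool` and `QueryExecutorTool` would follow the same shape against their own endpoints, with the `name`/`description` strings feeding the agent prompt.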
+
+ - **Task 2.3: Update Agent's Main Loop**
+   - Modify the agent's main loop to delegate tasks to the LangChain `AgentExecutor` instead of handling instructions directly. The agent's primary role will now be to orchestrate the LangChain agent and log the results.
+
+ ---
+
+ ### Phase 3: Build the Chat UI & Finalize (Due by Thursday, Oct 9th)
+
+ **Goal**: Replace the basic Streamlit monitor with a full-featured conversational chat interface and complete the final integration for the demo.
+
+ - **Task 3.1: Implement Conversational Chat UI**
+   - Replace the entire contents of `streamlit/app.py` with the conversational UI logic from `semantic-query-router/src/chat_app.py`.
+   - Adapt the UI to work with our project's MCP REST API (instead of WebSocket) for submitting questions and fetching results.
+
+ - **Task 3.2: Integrate Demo-Specific Features**
+   - Ensure the new Streamlit UI includes the required demo features:
+     - Display of execution phases (e.g., "Searching Schema," "Finding Join Path," "Executing Query").
+     - A final results view that shows both the natural language summary from the agent and a clean data table (Pandas DataFrame) of the raw results.
+     - A "Download CSV" button for the results table.
+     - A sidebar that displays the connection status of the Neo4j and SQLite databases.
+
+ - **Task 3.3: Final Integration and Testing**
+   - Perform end-to-end testing of the full workflow: from asking a question in the Streamlit app to the agent's orchestration and the final result display.
+   - Clean up any unused files and finalize the `README.md` with updated instructions.