jashdoshi77 committed on
Commit dd8c94f · 1 Parent(s): 84cc717

made AI smarter and quarter comparison

Files changed (2)
  1. QUERY_UNDERSTANDING_REVIEW.md +236 -0
  2. services/rag_service.py +136 -17
QUERY_UNDERSTANDING_REVIEW.md ADDED
@@ -0,0 +1,236 @@
1
+ # Query Understanding & AI Intelligence Review
2
+
3
+ ## Issues Identified
4
+
5
+ ### 1. **Abbreviation & Typo Handling** ("frb" → "feb")
6
+ **Problem**: The system doesn't understand typos or non-standard abbreviations in month names.
7
+ - "frb" should be understood as "february"
8
+ - Current date parser only handles standard abbreviations (jan, feb, mar, etc.)
9
+
10
+ **Root Cause**:
11
+ - Date parser has limited month mappings
12
+ - AI query parser doesn't normalize/expand abbreviations before parsing
13
+ - No fuzzy matching for month names
14
+
15
+ ---
16
+
17
+ ### 2. **Quarter Understanding** ("q1" vs "q2")
18
+ **Problem**: System doesn't understand quarter references and can't perform quarter-based analysis.
19
+ - "q1" should be understood as Q1 (Jan-Mar)
20
+ - "q2" should be understood as Q2 (Apr-Jun)
21
+ - Need to aggregate metadata by quarters and perform calculations
22
+
23
+ **Root Cause**:
24
+ - No quarter detection in query parser
25
+ - No quarter-based date filtering in metadata queries
26
+ - No quarter aggregation logic
27
+ - AI doesn't understand business quarters
28
+
29
+ ---
30
+
31
+ ### 3. **Context Contamination / Hallucination**
32
+ **Problem**: AI mixes information from previous queries (Tata) with new queries (Virat Hospitality).
33
+ - When asking about "Virat Hospitality", it incorrectly says insurer is "Tata" (from previous query)
34
+ - Conversation history is causing data leakage between different entities
35
+
36
+ **Root Cause**:
37
+ - Full conversation history is passed to AI without entity isolation
38
+ - No mechanism to detect when query is about a NEW entity vs follow-up
39
+ - AI is using previous context even when querying a different entity
40
+ - System prompt doesn't emphasize using ONLY current query's retrieved documents
41
+
42
+ ---
43
+
44
+ ### 4. **General Query Understanding**
45
+ **Problem**: AI should be more intelligent about understanding queries, typos, abbreviations, and variations.
46
+
47
+ **Root Cause**:
48
+ - Query parser doesn't do pre-processing/normalization
49
+ - No typo correction
50
+ - No abbreviation expansion
51
+ - Limited entity disambiguation
52
+
53
+ ---
54
+
55
+ ## Proposed Solutions
56
+
57
+ ### Solution 1: Enhanced Query Pre-Processing & Normalization
58
+
59
+ **Implementation**:
60
+ 1. **Query Normalization Layer** (before AI parsing):
61
+ - Typo correction for common words (fuzzy matching)
62
+ - Abbreviation expansion (frb → february, q1 → Q1, etc.)
63
+ - Month name normalization (handle variations: frb, feb, february)
64
+ - Quarter expansion (q1, quarter 1, first quarter → Q1)
65
+
66
+ 2. **Enhanced Date Parser**:
67
+ - Add fuzzy matching for month names (using Levenshtein distance)
68
+ - Support more month abbreviations (frb, fbr, etc.)
69
+ - Add quarter detection and parsing
70
+
71
+ 3. **AI Query Parser Enhancement**:
72
+ - Add instructions to handle typos and abbreviations
73
+ - Add quarter detection rules
74
+ - Add date normalization in system prompt
75
+
76
+ **Files to Modify**:
77
+ - `services/rag_service.py` - Add query normalization function
78
+ - `services/date_parser.py` - Add fuzzy month matching, quarter support
79
+ - `services/rag_service.py` - Enhance AI parser prompt
80
+
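The fuzzy month matching proposed in Solution 1 could look roughly like the sketch below. It assumes Python's standard `difflib` similarity ratio as a stand-in for Levenshtein distance; the names `fuzzy_month` and `CANDIDATES` and the 0.6 cutoff are illustrative and are not taken from `services/date_parser.py`.

```python
import difflib
from typing import Optional

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

# Every accepted spelling (full name and 3-letter abbreviation) -> canonical name.
CANDIDATES = {m: m for m in MONTHS}
CANDIDATES.update({m[:3]: m for m in MONTHS})


def fuzzy_month(token: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the canonical month name closest to `token`, or None.

    Hypothetical helper: uses difflib's similarity ratio, not Levenshtein.
    """
    token = token.strip().lower()
    if not token:
        return None
    matches = difflib.get_close_matches(token, list(CANDIDATES), n=1, cutoff=cutoff)
    return CANDIDATES[matches[0]] if matches else None


print(fuzzy_month("frb"))  # closest candidate is "feb"
```

With three-letter tokens the cutoff is loose: an ordinary word such as "fan" is as close to "jan" as "frb" is to "feb". A helper like this is therefore probably best applied only to tokens that already sit in a date position, for example after "in" or before a year.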
81
+ ---
82
+
83
+ ### Solution 2: Quarter Understanding & Analysis
84
+
85
+ **Implementation**:
86
+ 1. **Quarter Detection in Query Parser**:
87
+ - Detect "q1", "q2", "q3", "q4" in queries
88
+ - Map to date ranges: Q1 (Jan-Mar), Q2 (Apr-Jun), Q3 (Jul-Sep), Q4 (Oct-Dec)
89
+ - Add `quarter` filter to parsed query
90
+
91
+ 2. **Quarter-Based Metadata Filtering**:
92
+ - Filter metadata by quarter date ranges
93
+ - Support quarter comparisons (q1 vs q2)
94
+ - Calculate aggregates by quarter
95
+
96
+ 3. **Quarter Analysis in AI Response**:
97
+ - System prompt should understand quarters
98
+ - Perform calculations: total premium, sum insured, count by quarter
99
+ - Compare quarters with proper analysis
100
+
101
+ **Files to Modify**:
102
+ - `services/rag_service.py` - Add quarter detection in query parser
103
+ - `services/rag_service.py` - Add quarter filtering in metadata handler
104
+ - `services/rag_service.py` - Enhance system prompts for quarter analysis
105
+
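The quarter aggregation described in Solution 2 can also be computed in code rather than left to the model. The sketch below is one possible shape, assuming each metadata record is a dict whose `renewal_date` is already a `datetime.date`. The field names `renewal_date`, `premium_amount` and `sum_insured` come from this commit's prompts; `quarter_of` and `aggregate_by_quarter` are hypothetical helpers.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Optional


def quarter_of(d: date) -> str:
    """Map a date to its calendar quarter label (Q1 = Jan-Mar ... Q4 = Oct-Dec)."""
    return f"Q{(d.month - 1) // 3 + 1}"


def aggregate_by_quarter(
    policies: Iterable[dict], year: Optional[int] = None
) -> Dict[str, dict]:
    """Count policies and sum premium / sum insured per renewal quarter."""
    totals = defaultdict(
        lambda: {"count": 0, "premium_amount": 0.0, "sum_insured": 0.0}
    )
    for policy in policies:
        renewal = policy.get("renewal_date")
        if not isinstance(renewal, date):
            continue  # assumption: records without a parsed date are skipped
        if year is not None and renewal.year != year:
            continue
        bucket = totals[quarter_of(renewal)]
        bucket["count"] += 1
        bucket["premium_amount"] += float(policy.get("premium_amount") or 0)
        bucket["sum_insured"] += float(policy.get("sum_insured") or 0)
    return dict(totals)
```

Passing the resulting per-quarter totals to the model as context would leave it to explain the comparison instead of doing the arithmetic itself.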
106
+ ---
107
+
108
+ ### Solution 3: Context Isolation & Entity Disambiguation
109
+
110
+ **Implementation**:
111
+ 1. **Entity Detection in Query**:
112
+ - Detect when query mentions a NEW entity (company name, person name)
113
+ - Compare with previous query's entity
114
+ - If different entity, isolate context
115
+
116
+ 2. **Context Isolation Strategy**:
117
+ - When new entity detected, only use conversation history for pronoun resolution (it, this, that)
118
+ - DO NOT use previous entity's data
119
+ - Add explicit instruction: "ONLY use information from the current query's retrieved documents"
120
+
121
+ 3. **Enhanced System Prompt**:
122
+ - Add strict rule: "If query mentions a specific entity, ONLY use data for that entity from current documents"
123
+ - Add rule: "Do NOT mix information from different entities mentioned in conversation history"
124
+ - Add rule: "When query mentions a new entity, ignore previous entity's information"
125
+
126
+ 4. **Document Source Validation**:
127
+ - Ensure AI only references documents that were actually retrieved for current query
128
+ - Add source validation in response
129
+
130
+ **Files to Modify**:
131
+ - `services/rag_service.py` - Add entity detection and comparison
132
+ - `services/rag_service.py` - Modify context injection logic
133
+ - `services/rag_service.py` - Enhance system prompts with entity isolation rules
134
+
135
+ ---
136
+
137
+ ### Solution 4: Comprehensive Query Understanding
138
+
139
+ **Implementation**:
140
+ 1. **Multi-Stage Query Processing**:
141
+ ```
142
+ Raw Query → Normalization → Typo Correction → Abbreviation Expansion →
143
+ Entity Detection → AI Parsing → Enhanced Filters
144
+ ```
145
+
146
+ 2. **Query Normalization Function**:
147
+ - Month name typos (frb → february)
148
+ - Quarter expansion (q1 → Q1)
149
+ - Common abbreviation expansion
150
+ - Entity name normalization
151
+
152
+ 3. **Enhanced AI Parser**:
153
+ - Better instructions for understanding variations
154
+ - Typo tolerance
155
+ - Abbreviation understanding
156
+ - Quarter detection
157
+ - Entity disambiguation
158
+
159
+ **Files to Modify**:
160
+ - `services/rag_service.py` - Add `_normalize_query()` function
161
+ - `services/rag_service.py` - Enhance AI parser system prompt
162
+ - `services/date_parser.py` - Add fuzzy month matching
163
+
164
+ ---
165
+
166
+ ## Implementation Priority
167
+
168
+ ### Phase 1: Critical Fixes (Immediate)
169
+ 1. ✅ Context Isolation (Solution 3) - Prevents hallucination
170
+ 2. ✅ Query Normalization (Solution 1) - Fixes "frb" issue
171
+ 3. ✅ Enhanced System Prompts - Better entity isolation
172
+
173
+ ### Phase 2: Enhanced Features (Next)
174
+ 4. ✅ Quarter Understanding (Solution 2) - Q1 vs Q2 analysis
175
+ 5. ✅ Enhanced Date Parser - Fuzzy matching
176
+
177
+ ### Phase 3: Polish (Future)
178
+ 6. ✅ Advanced Typo Correction
179
+ 7. ✅ Entity Disambiguation
180
+ 8. ✅ Query Expansion
181
+
182
+ ---
183
+
184
+ ## Expected Outcomes
185
+
186
+ After implementation:
187
+ 1. ✅ "frb" will be understood as "february"
188
+ 2. ✅ "q1 vs q2" will trigger quarter-based analysis with proper calculations
189
+ 3. ✅ No more mixing data between different entities (Tata vs Virat Hospitality)
190
+ 4. ✅ Better understanding of typos, abbreviations, and variations
191
+ 5. ✅ More intelligent query processing overall
192
+
193
+ ---
194
+
195
+ ## Technical Approach
196
+
197
+ ### Query Normalization Pipeline:
198
+ ```python
199
+ def _normalize_query(self, query: str) -> str:
200
+     """Normalize query before processing."""
201
+     # 1. Month name typos
202
+     # 2. Quarter expansion
203
+     # 3. Common abbreviations
204
+     # 4. Entity name normalization
205
+     return query  # placeholder until steps 1-4 are implemented
206
+ ```
207
+
208
+ ### Entity Isolation:
209
+ ```python
210
+ def _detect_entity_in_query(self, query: str) -> Optional[str]:
211
+     """Detect entity mentioned in query."""
212
+     # Extract company/person names
213
+     return None  # placeholder until name extraction is implemented
214
+
215
+ def _should_isolate_context(self, current_entity: Optional[str], previous_entity: Optional[str]) -> bool:
216
+     """Check if context should be isolated."""
217
+     return current_entity is not None and current_entity != previous_entity  # no entity => follow-up, keep context
218
+ ```
219
+
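A slightly fuller sketch of the entity isolation check follows, under two assumptions that are not in the commit. First, entity names are compared after stripping case, punctuation and a few company suffixes. Second, a query that names no entity is treated as a follow-up and keeps its history. `canonical_entity`, `should_isolate_context` and `history_for_prompt` are hypothetical names.

```python
import re
from typing import List, Optional

# Assumed suffix list; extend to match the naming conventions in the data.
_SUFFIXES = re.compile(r"\b(pvt|private|ltd|limited|inc|corp)\b\.?", re.IGNORECASE)


def canonical_entity(name: Optional[str]) -> Optional[str]:
    """Lower-case a name and drop suffixes/punctuation so variants compare equal."""
    if not name:
        return None
    cleaned = _SUFFIXES.sub(" ", name.lower())
    cleaned = re.sub(r"[^a-z0-9]+", " ", cleaned).strip()
    return cleaned or None


def should_isolate_context(current: Optional[str], previous: Optional[str]) -> bool:
    """True when the query names an entity different from the previous one."""
    cur, prev = canonical_entity(current), canonical_entity(previous)
    if cur is None:
        return False  # no entity in the query: treat as a follow-up
    return cur != prev


def history_for_prompt(
    history: List[dict], current: Optional[str], previous: Optional[str]
) -> List[dict]:
    """Drop prior turns when the entity changes (a simplification of the
    'pronoun resolution only' strategy described above)."""
    return [] if should_isolate_context(current, previous) else history
```

Dropping the history entirely is the bluntest option. Keeping only the user's own prior messages would be a softer alternative if pronoun resolution across entities turns out to matter.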
220
+ ### Quarter Detection:
221
+ ```python
222
+ def _detect_quarters(self, query: str) -> List[str]:
223
+     """Detect quarter references in query."""
223
+     # q1, q2, Q1, Q2, quarter 1, first quarter, etc.
224
+     return ['q1', 'q2']  # example result for "q1 vs q2"
226
+ ```
227
+
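Quarter detection can be sketched with a regular expression plus a mapping to calendar date ranges, as below. This assumes calendar quarters (Q1 = Jan-Mar), as stated in the review; a fiscal-year calendar would need different ranges. `detect_quarters` and `quarter_range` are illustrative names.

```python
import re
from datetime import date, timedelta
from typing import List, Tuple

_ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4}
_QUARTER_RE = re.compile(
    r"\b(?:q\s?([1-4])|quarter\s+([1-4])|(first|second|third|fourth)\s+quarter)\b",
    re.IGNORECASE,
)


def detect_quarters(query: str) -> List[str]:
    """Return quarter labels ("Q1".."Q4") in order of first appearance."""
    found: List[str] = []
    for m in _QUARTER_RE.finditer(query):
        n = m.group(1) or m.group(2) or _ORDINALS[m.group(3).lower()]
        label = f"Q{int(n)}"
        if label not in found:
            found.append(label)
    return found


def quarter_range(label: str, year: int) -> Tuple[date, date]:
    """Inclusive first and last day of a calendar quarter."""
    n = int(label[1])
    start_month = 3 * (n - 1) + 1
    start = date(year, start_month, 1)
    # The last day is the day before the first day of the next quarter.
    next_start = date(year + 1, 1, 1) if n == 4 else date(year, start_month + 3, 1)
    return start, next_start - timedelta(days=1)
```

A query such as "q1 vs q2" would then produce two labels, each of which can be turned into a date filter on `renewal_date` for a chosen year.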
228
+ ---
229
+
230
+ ## Questions for User
231
+
232
+ 1. Should we implement all solutions at once, or prioritize specific ones?
233
+ 2. For quarter analysis, what specific metrics should be calculated? (premium, sum insured, count, etc.)
234
+ 3. For context isolation, should we completely ignore previous entity data, or just emphasize current entity?
235
+ 4. Any other abbreviations or typos we should handle specifically?
236
+
services/rag_service.py CHANGED
@@ -207,6 +207,47 @@ class RAGService:
207
 
208
  return matching_doc_ids
209
 
210
  def _parse_query_with_ai(self, query: str) -> dict:
211
  """
212
  Use DeepSeek AI to understand query intent and extract structured parameters.
@@ -226,17 +267,40 @@ class RAGService:
226
  """
227
  import json
228
 
229
- system_prompt = """You are a query parser for an insurance document system.
230
- Analyze the user's question and extract structured parameters to help retrieve the right data.
231
-
232
- CRITICAL RULES:
233
- 1. ALWAYS extract industry/sector names mentioned in the query into the filters
234
- 2. When multiple industries are mentioned (e.g., "manufacturing and healthcare"), combine them with comma: "manufacturing, healthcare"
235
- 3. When user asks for "top N" of something, set both limit AND sort_by appropriately
236
- 4. Keywords like "manufacturing", "healthcare", "retail", "IT", "construction" are INDUSTRIES - put them in filters
237
- 5. COMPANY NAME EXTRACTION: When user mentions a company name (e.g., "ABC Corp", "XYZ Industries", "Company Name"), extract it to insured_name filter. Extract the company name as mentioned in the query, even if it's partial. The system will handle name variations (case, spacing, suffixes like "Pvt Ltd", singular/plural) automatically.
238
- 6. TYPO HANDLING: If user makes typos (e.g., "policie" -> "policies", "polciy" -> "policy"), still extract the correct intent and filters. The system is forgiving of spelling errors.
239
- 7. COMPANY vs INDIVIDUAL: When user mentions a company name with business keywords (e.g., "ABC Chemical", "XYZ Industries", "Company Corp"), they want COMPANY policies, not individual person policies. The system will automatically filter out individual person names when company keywords are detected.
240
 
241
  FORMAT DETECTION (NEW):
242
  1. Detect if user explicitly asks for a specific format:
@@ -261,6 +325,10 @@ Available fields for filtering:
261
  - renewal_year (integer): 2024, 2025, 2026, etc.
262
  - renewal_month (string): january, february, march, april, may, june, july, august, september, october, november, december
263
  Use this when user asks for policies renewing in a specific month
264
 
265
  Available fields for sorting:
266
  - premium_amount: net premium, gross premium, premium
@@ -308,7 +376,9 @@ Query: "list all ABC Corp policies"
308
  {"intent":"list","needs_metadata":true,"filters":{"insured_name":"ABC Corp"},"sort_by":null,"sort_order":"desc","limit":null,"calculation":null,"calculation_field":null,"format_preference":null,"is_format_change":false}
309
 
310
  Query: "show me policies for XYZ Industries"
311
- {"intent":"list","needs_metadata":true,"filters":{"insured_name":"XYZ Industries"},"sort_by":null,"sort_order":"desc","limit":null,"calculation":null,"calculation_field":null,"format_preference":null,"is_format_change":false}"""
312
 
313
  messages = [
314
  {"role": "system", "content": system_prompt},
@@ -1110,6 +1180,8 @@ Summary: {summary[:300] if summary else 'No summary available'}
1110
  return False
1111
  all_metadata = [m for m in all_metadata if has_month(m)]
1112
  print(f"[METADATA QUERY] Filtered by renewal_month {value} (month={target_month}): {len(all_metadata)} remaining")
1113
 
1114
  # Apply AI-extracted sorting
1115
  if sort_by and sort_by in ['premium_amount', 'sum_insured', 'renewal_date', 'policy_start_date']:
@@ -1488,9 +1560,32 @@ CRITICAL INSTRUCTIONS:
1488
  4. Create a clear comparison highlighting differences and similarities.
1489
  5. Use a table format if comparing multiple attributes.
1490
 
1491
  {format_instructions}
1492
 
1493
- Do NOT say information is missing if it's in the provided context."""
1494
 
1495
  elif intent == 'summarize':
1496
  system_prompt = f"""You are Iribl AI, a document analysis assistant providing a SUMMARY.
@@ -1516,6 +1611,14 @@ CRITICAL INSTRUCTIONS:
1516
  4. Provide a comprehensive answer covering all relevant information.
1517
  5. Format clearly with headers and bullet points.
1518
 
1519
  FINDING NUMBERS AND TOTALS (CRITICAL):
1520
  - When asked about "how many", "total", "sum insured", "students", "count" - search EVERY section
1521
  - The DETAILED DOCUMENT CONTENT section is MORE IMPORTANT than metadata for finding numbers
@@ -1601,6 +1704,17 @@ CRITICAL: This query asks for detailed information (numbers, counts, totals, stu
1601
  - The DETAILED DOCUMENT CONTENT section contains the actual numbers, counts, and totals
1602
  - You MUST search through the DETAILED DOCUMENT CONTENT section to find the answer
1603
  - If metadata doesn't have the answer, the answer is definitely in the detailed content - keep searching!
1604
  """
1605
 
1606
  user_message = f"""{context_injection}Based on the following document data, answer my question comprehensively.
@@ -1608,14 +1722,16 @@ CRITICAL: This query asks for detailed information (numbers, counts, totals, stu
1608
  DOCUMENT DATA:
1609
  {context}
1610
  {detailed_content_emphasis}
1611
  QUESTION: {query}
1612
 
1613
  Instructions:
1614
  - Use both the structured metadata AND detailed content to provide a complete answer
1615
- - If this is a follow-up, use conversation history to understand what I'm referring to
1616
  - Search THOROUGHLY through ALL document sections for numbers, totals, counts, students, sum insured, etc.
1617
  - For questions about numbers/counts/totals: The DETAILED DOCUMENT CONTENT section is more important than metadata
1618
- - NEVER say information is missing unless you've checked every single section{format_reminder}"""
1619
 
1620
  messages.append({"role": "user", "content": user_message})
1621
 
@@ -2573,8 +2689,11 @@ Instructions: Synthesize from multiple documents if relevant. Be detailed but co
2573
  """
2574
  import time
2575
 
2576
- # Step 0: AI-powered query parsing - understand intent and extract structured parameters
2577
- parsed = self._parse_query_with_ai(query)
2578
  print(f"[QUERY ROUTING] AI-parsed query: {parsed}")
2579
 
2580
  # Route based on AI-parsed intent
 
207
 
208
  return matching_doc_ids
209
 
210
+ def _normalize_query_with_ai(self, query: str) -> str:
211
+ """
212
+ Use AI to normalize and understand the query before parsing.
213
+ Handles typos, abbreviations, and variations intelligently.
214
+ This is an ADDITIVE enhancement - if normalization fails or isn't needed, returns original query.
215
+ """
216
+ # Only attempt normalization - if it fails or doesn't help, use original query
217
+ # This ensures existing functionality is preserved
218
+ try:
219
+ normalize_prompt = """You are a query normalization assistant. Your job is to understand what the user means and normalize their query intelligently.
220
+
221
+ CRITICAL RULES:
222
+ 1. Use your natural language understanding to fix typos and expand abbreviations
223
+ 2. Understand context and intent, not just literal text
224
+ 3. Normalize dates, months, quarters, and time references intelligently
225
+ 4. Keep the original meaning and intent
226
+ 5. Only normalize when it helps understanding, don't over-correct
227
+ 6. If the query is already clear, return it unchanged
228
+ 7. Return the normalized query, not an explanation
229
+
230
+ Use your intelligence to understand any typos, abbreviations, or variations the user might use."""
231
+
232
+ messages = [
233
+ {"role": "system", "content": normalize_prompt},
234
+ {"role": "user", "content": f"Normalize this query (return unchanged if already clear): {query}"}
235
+ ]
236
+ response = self._call_deepseek_sync(messages, max_tokens=200)
237
+ normalized = response.strip().strip('"').strip("'")
238
+
239
+ # Only use normalization if it's valid and different (and not just removing quotes)
240
+ if normalized and len(normalized) > 5 and normalized.lower() != query.lower():
241
+ print(f"[QUERY NORMALIZATION] Original: {query} -> Normalized: {normalized}")
242
+ return normalized
243
+ else:
244
+ # Normalization returned same query or invalid - use original
245
+ return query
246
+ except Exception as e:
247
+ # If normalization fails, always return original query (preserves existing functionality)
248
+ print(f"[QUERY NORMALIZATION] Failed: {e}, using original query")
249
+ return query
250
+
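`_normalize_query_with_ai` accepts any reply that is longer than five characters and differs from the query, so an explanation returned in place of a query would be passed on to the parser. One possible stricter acceptance check is sketched below; `accept_normalization` and its `max_growth` threshold are assumptions, not part of this commit.

```python
def accept_normalization(original: str, candidate: str, max_growth: float = 2.0) -> bool:
    """Heuristic check that a model reply looks like a rewritten query.

    Hypothetical guard: rejects empty, multi-line, unchanged, or much longer replies.
    """
    candidate = candidate.strip().strip('"').strip("'")
    if not candidate or "\n" in candidate:
        return False  # empty or multi-line replies are likely explanations
    if candidate.lower() == original.strip().lower():
        return False  # nothing changed, so keep the original query
    # Allow modest growth (abbreviation expansion) but not whole paragraphs.
    return len(candidate) <= max_growth * max(len(original), 10)
```

The length bound is deliberately loose so that expansions like "frb" to "february" still pass, while paragraph-length replies do not.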
251
  def _parse_query_with_ai(self, query: str) -> dict:
252
  """
253
  Use DeepSeek AI to understand query intent and extract structured parameters.
 
267
  """
268
  import json
269
 
270
+ system_prompt = """You are an advanced AI query parser for an insurance document system. You understand queries like ChatGPT or Claude - intelligently handling typos, abbreviations, variations, and complex requests.
271
+
272
+ Your job is to understand the user's intent and extract structured parameters, even when queries have:
273
+ - Typos (frb, fbr -> february)
274
+ - Abbreviations (q1, q2 -> quarters, feb -> february)
275
+ - Variations (upcoming renewals, renewals coming, policies renewing)
276
+ - Complex requests (comparisons, calculations, aggregations)
277
+
278
+ CRITICAL UNDERSTANDING RULES:
279
+ 1. TYPO & ABBREVIATION HANDLING: Use your intelligence to understand what the user means:
280
+ - Correct typos intelligently (e.g., month name typos, common misspellings)
281
+ - Expand abbreviations naturally (e.g., month abbreviations, quarter references)
282
+ - Understand variations in phrasing (e.g., "upcoming renewals", "renewals coming", "policies renewing")
283
+ - Use your natural language understanding to interpret user intent, not just literal text
284
+
285
+ 2. DATE & TIME UNDERSTANDING:
286
+ - Understand dates in any format or variation
287
+ - Extract dates from context even if not explicitly stated
288
+ - Understand time periods, quarters, months, years in natural language
289
+ - Map date references to appropriate filters (renewal_year, renewal_month, etc.)
290
+
291
+ 3. QUARTER & PERIOD UNDERSTANDING:
292
+ - Understand quarter references (Q1, Q2, Q3, Q4, quarter 1, first quarter, etc.)
293
+ - Understand that quarters represent time periods (Q1 = Jan-Mar, Q2 = Apr-Jun, etc.)
294
+ - For comparisons involving quarters or time periods, set appropriate intent and filters
295
+ - Let your intelligence handle all variations and formats
296
+
297
+ 4. COMPANY NAME EXTRACTION: When user mentions a company name (e.g., "ABC Corp", "XYZ Industries", "Company Name"), extract it to insured_name filter. Extract the company name as mentioned in the query, even if it's partial. The system will handle name variations (case, spacing, suffixes like "Pvt Ltd", singular/plural) automatically.
298
+
299
+ 5. ALWAYS extract industry/sector names mentioned in the query into the filters
300
+ 6. When multiple industries are mentioned (e.g., "manufacturing and healthcare"), combine them with comma: "manufacturing, healthcare"
301
+ 7. When user asks for "top N" of something, set both limit AND sort_by appropriately
302
+ 8. Keywords like "manufacturing", "healthcare", "retail", "IT", "construction" are INDUSTRIES - put them in filters
303
+ 9. COMPANY vs INDIVIDUAL: When user mentions a company name with business keywords (e.g., "ABC Chemical", "XYZ Industries", "Company Corp"), they want COMPANY policies, not individual person policies. The system will automatically filter out individual person names when company keywords are detected.
304
 
305
  FORMAT DETECTION (NEW):
306
  1. Detect if user explicitly asks for a specific format:
 
325
  - renewal_year (integer): 2024, 2025, 2026, etc.
326
  - renewal_month (string): january, february, march, april, may, june, july, august, september, october, november, december
327
  Use this when user asks for policies renewing in a specific month
328
+ IMPORTANT: Use your intelligence to understand month names in any format, with typos, or abbreviations
329
+ - quarter (string): Use when user mentions quarters or time periods
330
+ Understand quarters in any format (q1, Q1, quarter 1, first quarter, etc.)
331
+ For comparisons, extract all mentioned quarters
332
 
333
  Available fields for sorting:
334
  - premium_amount: net premium, gross premium, premium
 
376
  {"intent":"list","needs_metadata":true,"filters":{"insured_name":"ABC Corp"},"sort_by":null,"sort_order":"desc","limit":null,"calculation":null,"calculation_field":null,"format_preference":null,"is_format_change":false}
377
 
378
  Query: "show me policies for XYZ Industries"
379
+ {"intent":"list","needs_metadata":true,"filters":{"insured_name":"XYZ Industries"},"sort_by":null,"sort_order":"desc","limit":null,"calculation":null,"calculation_field":null,"format_preference":null,"is_format_change":false}
380
+
381
+ """
382
 
383
  messages = [
384
  {"role": "system", "content": system_prompt},
 
1180
  return False
1181
  all_metadata = [m for m in all_metadata if has_month(m)]
1182
  print(f"[METADATA QUERY] Filtered by renewal_month {value} (month={target_month}): {len(all_metadata)} remaining")
1183
+
1184
+
1185
 
1186
  # Apply AI-extracted sorting
1187
  if sort_by and sort_by in ['premium_amount', 'sum_insured', 'renewal_date', 'policy_start_date']:
 
1560
  4. Create a clear comparison highlighting differences and similarities.
1561
  5. Use a table format if comparing multiple attributes.
1562
 
1563
+ TIME PERIOD COMPARISONS (CRITICAL - e.g., Q1 vs Q2, quarters, months, years):
1564
+ - When comparing time periods like Q1 vs Q2, you MUST calculate quarters from dates yourself
1565
+ - Q1 = January-March (months 1-3), Q2 = April-June (months 4-6), Q3 = July-September (months 7-9), Q4 = October-December (months 10-12)
1566
+ - Look at renewal_date, policy_start_date, or other date fields in the metadata
1567
+ - For each policy, determine which quarter it belongs to based on the month in its date
1568
+ - Group policies by quarter (calculate from dates, don't look for a "quarter" field - it doesn't exist)
1569
+ - Calculate aggregates for each quarter:
1570
+ * Total Premium (sum of premium_amount)
1571
+ * Total Sum Insured (sum of sum_insured)
1572
+ * Number of Policies (count)
1573
+ * Average Premium per policy
1574
+ * Policy types breakdown
1575
+ - Compare the quarters side-by-side with all metrics
1576
+ - Provide insights: which quarter has more business, growth trends, differences
1577
+ - NEVER say "data is not categorized by quarters" - YOU must categorize it by calculating quarters from dates
1578
+
1579
+ CALCULATION REQUIREMENTS:
1580
+ - Use the metadata provided - it has renewal_date, premium_amount, sum_insured for all policies
1581
+ - Extract the month from renewal_date to determine quarter
1582
+ - Sum up premium_amount and sum_insured for each quarter
1583
+ - Count policies in each quarter
1584
+ - Present in a clear comparison table
1585
+
1586
  {format_instructions}
1587
 
1588
+ Do NOT say information is missing or that data isn't categorized by quarters - calculate quarters from dates and perform the analysis."""
1589
 
1590
  elif intent == 'summarize':
1591
  system_prompt = f"""You are Iribl AI, a document analysis assistant providing a SUMMARY.
 
1611
  4. Provide a comprehensive answer covering all relevant information.
1612
  5. Format clearly with headers and bullet points.
1613
 
1614
+ ENTITY ISOLATION (CRITICAL):
1615
+ - If the query mentions a specific entity (company, person, organization), ONLY use information for that entity
1616
+ - Use your intelligence to identify the entity mentioned in the current query
1617
+ - Do NOT mix information from different entities, even if mentioned in conversation history
1618
+ - If conversation history mentions a different entity than the current query, IGNORE that previous entity's information
1619
+ - ONLY use data from the DOCUMENT DATA provided for the current query's entity
1620
+ - Use your natural language understanding to distinguish between entities
1621
+
1622
  FINDING NUMBERS AND TOTALS (CRITICAL):
1623
  - When asked about "how many", "total", "sum insured", "students", "count" - search EVERY section
1624
  - The DETAILED DOCUMENT CONTENT section is MORE IMPORTANT than metadata for finding numbers
 
1704
  - The DETAILED DOCUMENT CONTENT section contains the actual numbers, counts, and totals
1705
  - You MUST search through the DETAILED DOCUMENT CONTENT section to find the answer
1706
  - If metadata doesn't have the answer, the answer is definitely in the detailed content - keep searching!
1707
+ """
1708
+
1709
+ # Entity isolation instruction - ADDITIVE enhancement to prevent mixing entities
1710
+ # This doesn't replace existing context handling, just adds entity isolation awareness
1711
+ entity_isolation = """
1712
+
1713
+ IMPORTANT: Entity Isolation (to prevent mixing data from different entities):
1714
+ - Identify the entity (company, person, organization) mentioned in the current query
1715
+ - ONLY use information from documents that mention this entity
1716
+ - If conversation history mentions a different entity than the current query, focus on the current entity's data
1717
+ - Use your intelligence to distinguish between entities and ensure you're answering about the correct one
1718
  """
1719
 
1720
  user_message = f"""{context_injection}Based on the following document data, answer my question comprehensively.
 
1722
  DOCUMENT DATA:
1723
  {context}
1724
  {detailed_content_emphasis}
1725
+ {entity_isolation}
1726
  QUESTION: {query}
1727
 
1728
  Instructions:
1729
  - Use both the structured metadata AND detailed content to provide a complete answer
1730
+ - If this is a follow-up, use conversation history to understand what I'm referring to (pronouns like "it", "this", "that")
1731
  - Search THOROUGHLY through ALL document sections for numbers, totals, counts, students, sum insured, etc.
1732
  - For questions about numbers/counts/totals: The DETAILED DOCUMENT CONTENT section is more important than metadata
1733
+ - NEVER say information is missing unless you've checked every single section
1734
+ - ONLY use information from the DOCUMENT DATA provided above{format_reminder}"""
1735
 
1736
  messages.append({"role": "user", "content": user_message})
1737
 
 
2689
  """
2690
  import time
2691
 
2692
+ # Step 0: Normalize query with AI (fix typos, expand abbreviations)
2693
+ normalized_query = self._normalize_query_with_ai(query)
2694
+
2695
+ # Step 0.5: AI-powered query parsing - understand intent and extract structured parameters
2696
+ parsed = self._parse_query_with_ai(normalized_query)
2697
  print(f"[QUERY ROUTING] AI-parsed query: {parsed}")
2698
 
2699
  # Route based on AI-parsed intent