Spaces: jashdoshi77/notebooklm-fast

jashdoshi77 committed on Jan 25

Commit dd8c94f (1 Parent(s): 84cc717)

made AI smarter and quarter comparison

Files changed (2)

QUERY_UNDERSTANDING_REVIEW.md +236 -0
services/rag_service.py +136 -17

QUERY_UNDERSTANDING_REVIEW.md ADDED

	@@ -0,0 +1,236 @@

+# Query Understanding & AI Intelligence Review
+## Issues Identified
+### 1. **Abbreviation & Typo Handling** ("frb" → "feb")
+**Problem**: The system doesn't understand typos or non-standard abbreviations in month names.
+- "frb" should be understood as "february"
+- Current date parser only handles standard abbreviations (jan, feb, mar, etc.)
+**Root Cause**:
+- Date parser has limited month mappings
+- AI query parser doesn't normalize/expand abbreviations before parsing
+- No fuzzy matching for month names
+---
+### 2. **Quarter Understanding** ("q1" vs "q2")
+**Problem**: System doesn't understand quarter references and can't perform quarter-based analysis.
+- "q1" should be understood as Q1 (Jan-Mar)
+- "q2" should be understood as Q2 (Apr-Jun)
+- Need to aggregate metadata by quarters and perform calculations
+**Root Cause**:
+- No quarter detection in query parser
+- No quarter-based date filtering in metadata queries
+- No quarter aggregation logic
+- AI doesn't understand business quarters
+---
+### 3. **Context Contamination / Hallucination**
+**Problem**: AI mixes information from previous queries (Tata) with new queries (Virat Hospitality).
+- When asking about "Virat Hospitality", it incorrectly says insurer is "Tata" (from previous query)
+- Conversation history is causing data leakage between different entities
+**Root Cause**:
+- Full conversation history is passed to AI without entity isolation
+- No mechanism to detect when query is about a NEW entity vs follow-up
+- AI is using previous context even when querying a different entity
+- System prompt doesn't emphasize using ONLY current query's retrieved documents
+---
+### 4. **General Query Understanding**
+**Problem**: AI should be more intelligent about understanding queries, typos, abbreviations, and variations.
+**Root Cause**:
+- Query parser doesn't do pre-processing/normalization
+- No typo correction
+- No abbreviation expansion
+- Limited entity disambiguation
+---
+## Proposed Solutions
+### Solution 1: Enhanced Query Pre-Processing & Normalization
+**Implementation**:
+1. **Query Normalization Layer** (before AI parsing):
+   - Typo correction for common words (fuzzy matching)
+   - Abbreviation expansion (frb → february, q1 → Q1, etc.)
+   - Month name normalization (handle variations: frb, feb, february)
+   - Quarter expansion (q1 → Q1, quarter 1, first quarter)
+2. **Enhanced Date Parser**:
+   - Add fuzzy matching for month names (using Levenshtein distance)
+   - Support more month abbreviations (frb, fbr, etc.)
+   - Add quarter detection and parsing
+3. **AI Query Parser Enhancement**:
+   - Add instructions to handle typos and abbreviations
+   - Add quarter detection rules
+   - Add date normalization in system prompt
+**Files to Modify**:
+- `services/rag_service.py` - Add query normalization function
+- `services/date_parser.py` - Add fuzzy month matching, quarter support
+- `services/rag_service.py` - Enhance AI parser prompt
+---
+### Solution 2: Quarter Understanding & Analysis
+**Implementation**:
+1. **Quarter Detection in Query Parser**:
+   - Detect "q1", "q2", "q3", "q4" in queries
+   - Map to date ranges: Q1 (Jan-Mar), Q2 (Apr-Jun), Q3 (Jul-Sep), Q4 (Oct-Dec)
+   - Add `quarter` filter to parsed query
+2. **Quarter-Based Metadata Filtering**:
+   - Filter metadata by quarter date ranges
+   - Support quarter comparisons (q1 vs q2)
+   - Calculate aggregates by quarter
+3. **Quarter Analysis in AI Response**:
+   - System prompt should understand quarters
+   - Perform calculations: total premium, sum insured, count by quarter
+   - Compare quarters with proper analysis
+**Files to Modify**:
+- `services/rag_service.py` - Add quarter detection in query parser
+- `services/rag_service.py` - Add quarter filtering in metadata handler
+- `services/rag_service.py` - Enhance system prompts for quarter analysis
+---
+### Solution 3: Context Isolation & Entity Disambiguation
+**Implementation**:
+1. **Entity Detection in Query**:
+   - Detect when query mentions a NEW entity (company name, person name)
+   - Compare with previous query's entity
+   - If different entity, isolate context
+2. **Context Isolation Strategy**:
+   - When new entity detected, only use conversation history for pronoun resolution (it, this, that)
+   - DO NOT use previous entity's data
+   - Add explicit instruction: "ONLY use information from the current query's retrieved documents"
+3. **Enhanced System Prompt**:
+   - Add strict rule: "If query mentions a specific entity, ONLY use data for that entity from current documents"
+   - Add rule: "Do NOT mix information from different entities mentioned in conversation history"
+   - Add rule: "When query mentions a new entity, ignore previous entity's information"
+4. **Document Source Validation**:
+   - Ensure AI only references documents that were actually retrieved for current query
+   - Add source validation in response
+**Files to Modify**:
+- `services/rag_service.py` - Add entity detection and comparison
+- `services/rag_service.py` - Modify context injection logic
+- `services/rag_service.py` - Enhance system prompts with entity isolation rules
+---
+### Solution 4: Comprehensive Query Understanding
+**Implementation**:
+1. **Multi-Stage Query Processing**:
+   ```
+   Raw Query → Normalization → Typo Correction → Abbreviation Expansion →
+   Entity Detection → AI Parsing → Enhanced Filters
+   ```
+2. **Query Normalization Function**:
+   - Month name typos (frb → february)
+   - Quarter expansion (q1 → Q1)
+   - Common abbreviation expansion
+   - Entity name normalization
+3. **Enhanced AI Parser**:
+   - Better instructions for understanding variations
+   - Typo tolerance
+   - Abbreviation understanding
+   - Quarter detection
+   - Entity disambiguation
+**Files to Modify**:
+- `services/rag_service.py` - Add `_normalize_query()` function
+- `services/rag_service.py` - Enhance AI parser system prompt
+- `services/date_parser.py` - Add fuzzy month matching
+---
+## Implementation Priority
+### Phase 1: Critical Fixes (Immediate)
+1. ✅ Context Isolation (Solution 3) - Prevents hallucination
+2. ✅ Query Normalization (Solution 1) - Fixes "frb" issue
+3. ✅ Enhanced System Prompts - Better entity isolation
+### Phase 2: Enhanced Features (Next)
+4. ✅ Quarter Understanding (Solution 2) - Q1 vs Q2 analysis
+5. ✅ Enhanced Date Parser - Fuzzy matching
+### Phase 3: Polish (Future)
+6. ✅ Advanced Typo Correction
+7. ✅ Entity Disambiguation
+8. ✅ Query Expansion
+---
+## Expected Outcomes
+After implementation:
+1. ✅ "frb" will be understood as "february"
+2. ✅ "q1 vs q2" will trigger quarter-based analysis with proper calculations
+3. ✅ No more mixing data between different entities (Tata vs Virat Hospitality)
+4. ✅ Better understanding of typos, abbreviations, and variations
+5. ✅ More intelligent query processing overall
+---
+## Technical Approach
+### Query Normalization Pipeline:
+```python
+def _normalize_query(self, query: str) -> str:
+    """Normalize query before processing."""
+    # 1. Month name typos
+    # 2. Quarter expansion
+    # 3. Common abbreviations
+    # 4. Entity name normalization
+    return normalized_query
+```
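
The month-typo step of the stub above could be sketched as follows. This is a minimal illustration, not the code in this commit: the names `MONTH_ALIASES` and `normalize_month_token` are hypothetical, and it uses the standard library's `difflib` similarity ratio rather than the Levenshtein distance the review mentions.

```python
import difflib
from typing import Optional

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

# Hypothetical lookup: full names and three-letter abbreviations map to the full name.
MONTH_ALIASES = {m: m for m in MONTHS}
MONTH_ALIASES.update({m[:3]: m for m in MONTHS})


def normalize_month_token(token: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the full month name for a single token, or None if nothing is close."""
    word = token.lower()
    if word in MONTH_ALIASES:
        return MONTH_ALIASES[word]
    # difflib ranks candidates by SequenceMatcher ratio (not Levenshtein distance).
    matches = difflib.get_close_matches(word, list(MONTH_ALIASES), n=1, cutoff=cutoff)
    return MONTH_ALIASES[matches[0]] if matches else None
```

The function is deliberately token-level. Short common words can also clear the cutoff (for example, `can` is close to `jan`), so it is probably safest to call it only on tokens that appear where a month is expected, such as after "in" or before a year.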
+### Entity Isolation:
+```python
+def _detect_entity_in_query(self, query: str) -> Optional[str]:
+    """Detect entity mentioned in query."""
+    # Extract company/person names
+    return entity_name
+def _should_isolate_context(self, current_entity: str, previous_entity: str) -> bool:
+    """Check if context should be isolated."""
+    return current_entity != previous_entity
+```
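
One possible refinement of the isolation check above, shown as a standalone sketch under stated assumptions. Because `_detect_entity_in_query` returns `Optional[str]`, a plain `!=` comparison would also isolate context for a pronoun-only follow-up where no entity is detected. The helper names below are hypothetical, and the name cleanup is limited to case and whitespace.

```python
from typing import Optional


def _canonical(name: str) -> str:
    """Collapse whitespace and ignore case so 'Virat  Hospitality' equals 'virat hospitality'."""
    return " ".join(name.split()).casefold()


def should_isolate_context(current_entity: Optional[str],
                           previous_entity: Optional[str]) -> bool:
    """Isolate only when both turns name an entity and the two names differ."""
    if not current_entity or not previous_entity:
        # No entity in this turn (likely a follow-up) or nothing earlier to leak from.
        return False
    return _canonical(current_entity) != _canonical(previous_entity)
```

With this check, "what is its premium?" after a Tata query keeps the history, while a query naming Virat Hospitality drops it.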
+### Quarter Detection:
+```python
+def _detect_quarters(self, query: str) -> List[str]:
+    """Detect quarter references in query."""
+    # q1, q2, Q1, Q2, quarter 1, first quarter, etc.
+    return ['q1', 'q2']
+```
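
The quarter stub above returns a fixed list. A regex-based version might look like the sketch below. It is an assumption about how detection could work, not the commit's implementation (the commit delegates this to the AI parser), and the names `detect_quarters` and `QUARTER_MONTHS` are hypothetical.

```python
import re
from typing import List

_ORDINALS = {"first": 1, "1st": 1, "second": 2, "2nd": 2,
             "third": 3, "3rd": 3, "fourth": 4, "4th": 4}

# Matches "q1" / "Q 2", "quarter 3", and "first quarter" / "4th quarter".
_QUARTER_RE = re.compile(
    r"\bq\s?([1-4])\b"
    r"|\bquarter\s+([1-4])\b"
    r"|\b(first|second|third|fourth|1st|2nd|3rd|4th)\s+quarter\b",
    re.IGNORECASE,
)

# Calendar-quarter month ranges (inclusive), as listed in the review: Q1 = Jan-Mar, etc.
QUARTER_MONTHS = {1: (1, 3), 2: (4, 6), 3: (7, 9), 4: (10, 12)}


def detect_quarters(query: str) -> List[int]:
    """Return quarter numbers in order of first mention, without duplicates."""
    found: List[int] = []
    for match in _QUARTER_RE.finditer(query):
        short, numbered, ordinal = match.groups()
        if short or numbered:
            quarter = int(short or numbered)
        else:
            quarter = _ORDINALS[ordinal.lower()]
        if quarter not in found:
            found.append(quarter)
    return found
```

Returning integers instead of strings like `'q1'` makes the lookup in `QUARTER_MONTHS` direct when building a date-range filter.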
+---
+## Questions for User
+1. Should we implement all solutions at once, or prioritize specific ones?
+2. For quarter analysis, what specific metrics should be calculated? (premium, sum insured, count, etc.)
+3. For context isolation, should we completely ignore previous entity data, or just emphasize current entity?
+4. Any other abbreviations or typos we should handle specifically?

services/rag_service.py CHANGED Viewed

@@ -207,6 +207,47 @@ class RAGService:
         return matching_doc_ids
     def _parse_query_with_ai(self, query: str) -> dict:
         """
         Use DeepSeek AI to understand query intent and extract structured parameters.
@@ -226,17 +267,40 @@ class RAGService:
         """
         import json
-        system_prompt = """You are a query parser for an insurance document system.
-Analyze the user's question and extract structured parameters to help retrieve the right data.
-CRITICAL RULES:
-1. ALWAYS extract industry/sector names mentioned in the query into the filters
-2. When multiple industries are mentioned (e.g., "manufacturing and healthcare"), combine them with comma: "manufacturing, healthcare"
-3. When user asks for "top N" of something, set both limit AND sort_by appropriately
-4. Keywords like "manufacturing", "healthcare", "retail", "IT", "construction" are INDUSTRIES - put them in filters
-5. COMPANY NAME EXTRACTION: When user mentions a company name (e.g., "ABC Corp", "XYZ Industries", "Company Name"), extract it to insured_name filter. Extract the company name as mentioned in the query, even if it's partial. The system will handle name variations (case, spacing, suffixes like "Pvt Ltd", singular/plural) automatically.
-6. TYPO HANDLING: If user makes typos (e.g., "policie" -> "policies", "polciy" -> "policy"), still extract the correct intent and filters. The system is forgiving of spelling errors.
-7. COMPANY vs INDIVIDUAL: When user mentions a company name with business keywords (e.g., "ABC Chemical", "XYZ Industries", "Company Corp"), they want COMPANY policies, not individual person policies. The system will automatically filter out individual person names when company keywords are detected.
 FORMAT DETECTION (NEW):
 1. Detect if user explicitly asks for a specific format:
@@ -261,6 +325,10 @@ Available fields for filtering:
 - renewal_year (integer): 2024, 2025, 2026, etc.
 - renewal_month (string): january, february, march, april, may, june, july, august, september, october, november, december
   Use this when user asks for policies renewing in a specific month
 Available fields for sorting:
 - premium_amount: net premium, gross premium, premium
@@ -308,7 +376,9 @@ Query: "list all ABC Corp policies"
 {"intent":"list","needs_metadata":true,"filters":{"insured_name":"ABC Corp"},"sort_by":null,"sort_order":"desc","limit":null,"calculation":null,"calculation_field":null,"format_preference":null,"is_format_change":false}
 Query: "show me policies for XYZ Industries"
-{"intent":"list","needs_metadata":true,"filters":{"insured_name":"XYZ Industries"},"sort_by":null,"sort_order":"desc","limit":null,"calculation":null,"calculation_field":null,"format_preference":null,"is_format_change":false}"""
         messages = [
             {"role": "system", "content": system_prompt},
@@ -1110,6 +1180,8 @@ Summary: {summary[:300] if summary else 'No summary available'}
                             return False
                         all_metadata = [m for m in all_metadata if has_month(m)]
                         print(f"[METADATA QUERY] Filtered by renewal_month {value} (month={target_month}): {len(all_metadata)} remaining")
             # Apply AI-extracted sorting
             if sort_by and sort_by in ['premium_amount', 'sum_insured', 'renewal_date', 'policy_start_date']:
@@ -1488,9 +1560,32 @@ CRITICAL INSTRUCTIONS:
 4. Create a clear comparison highlighting differences and similarities.
 5. Use a table format if comparing multiple attributes.
 {format_instructions}
-Do NOT say information is missing if it's in the provided context."""
         elif intent == 'summarize':
             system_prompt = f"""You are Iribl AI, a document analysis assistant providing a SUMMARY.
@@ -1516,6 +1611,14 @@ CRITICAL INSTRUCTIONS:
 4. Provide a comprehensive answer covering all relevant information.
 5. Format clearly with headers and bullet points.
 FINDING NUMBERS AND TOTALS (CRITICAL):
 - When asked about "how many", "total", "sum insured", "students", "count" - search EVERY section
 - The DETAILED DOCUMENT CONTENT section is MORE IMPORTANT than metadata for finding numbers
@@ -1601,6 +1704,17 @@ CRITICAL: This query asks for detailed information (numbers, counts, totals, stu
 - The DETAILED DOCUMENT CONTENT section contains the actual numbers, counts, and totals
 - You MUST search through the DETAILED DOCUMENT CONTENT section to find the answer
 - If metadata doesn't have the answer, the answer is definitely in the detailed content - keep searching!
 """
         user_message = f"""{context_injection}Based on the following document data, answer my question comprehensively.
@@ -1608,14 +1722,16 @@ CRITICAL: This query asks for detailed information (numbers, counts, totals, stu
 DOCUMENT DATA:
 {context}
 {detailed_content_emphasis}
 QUESTION: {query}
 Instructions:
 - Use both the structured metadata AND detailed content to provide a complete answer
-- If this is a follow-up, use conversation history to understand what I'm referring to
 - Search THOROUGHLY through ALL document sections for numbers, totals, counts, students, sum insured, etc.
 - For questions about numbers/counts/totals: The DETAILED DOCUMENT CONTENT section is more important than metadata
-- NEVER say information is missing unless you've checked every single section{format_reminder}"""
         messages.append({"role": "user", "content": user_message})
@@ -2573,8 +2689,11 @@ Instructions: Synthesize from multiple documents if relevant. Be detailed but co
         """
         import time
-        # Step 0: AI-powered query parsing - understand intent and extract structured parameters
-        parsed = self._parse_query_with_ai(query)
         print(f"[QUERY ROUTING] AI-parsed query: {parsed}")
         # Route based on AI-parsed intent

         return matching_doc_ids
+    def _normalize_query_with_ai(self, query: str) -> str:
+        """
+        Use AI to normalize and understand the query before parsing.
+        Handles typos, abbreviations, and variations intelligently.
+        This is an ADDITIVE enhancement - if normalization fails or isn't needed, returns original query.
+        """
+        # Only attempt normalization - if it fails or doesn't help, use original query
+        # This ensures existing functionality is preserved
+        try:
+            normalize_prompt = """You are a query normalization assistant. Your job is to understand what the user means and normalize their query intelligently.
+CRITICAL RULES:
+1. Use your natural language understanding to fix typos and expand abbreviations
+2. Understand context and intent, not just literal text
+3. Normalize dates, months, quarters, and time references intelligently
+4. Keep the original meaning and intent
+5. Only normalize when it helps understanding, don't over-correct
+6. If the query is already clear, return it unchanged
+7. Return the normalized query, not an explanation
+Use your intelligence to understand any typos, abbreviations, or variations the user might use."""
+            messages = [
+                {"role": "system", "content": normalize_prompt},
+                {"role": "user", "content": f"Normalize this query (return unchanged if already clear): {query}"}
+            ]
+            response = self._call_deepseek_sync(messages, max_tokens=200)
+            normalized = response.strip().strip('"').strip("'")
+            # Only use normalization if it's valid and different (and not just removing quotes)
+            if normalized and len(normalized) > 5 and normalized.lower() != query.lower():
+                print(f"[QUERY NORMALIZATION] Original: {query} -> Normalized: {normalized}")
+                return normalized
+            else:
+                # Normalization returned same query or invalid - use original
+                return query
+        except Exception as e:
+            # If normalization fails, always return original query (preserves existing functionality)
+            print(f"[QUERY NORMALIZATION] Failed: {e}, using original query")
+            return query
     def _parse_query_with_ai(self, query: str) -> dict:
         """
         Use DeepSeek AI to understand query intent and extract structured parameters.
         """
         import json
+        system_prompt = """You are an advanced AI query parser for an insurance document system. You understand queries like ChatGPT or Claude - intelligently handling typos, abbreviations, variations, and complex requests.
+Your job is to understand the user's intent and extract structured parameters, even when queries have:
+- Typos (frb, fbr, feb -> february)
+- Abbreviations (q1, q2 -> quarters, frb -> february)
+- Variations (upcoming renewals, renewals coming, policies renewing)
+- Complex requests (comparisons, calculations, aggregations)
+CRITICAL UNDERSTANDING RULES:
+1. TYPO & ABBREVIATION HANDLING: Use your intelligence to understand what the user means:
+   - Correct typos intelligently (e.g., month name typos, common misspellings)
+   - Expand abbreviations naturally (e.g., month abbreviations, quarter references)
+   - Understand variations in phrasing (e.g., "upcoming renewals", "renewals coming", "policies renewing")
+   - Use your natural language understanding to interpret user intent, not just literal text
+2. DATE & TIME UNDERSTANDING:
+   - Understand dates in any format or variation
+   - Extract dates from context even if not explicitly stated
+   - Understand time periods, quarters, months, years in natural language
+   - Map date references to appropriate filters (renewal_year, renewal_month, etc.)
+3. QUARTER & PERIOD UNDERSTANDING:
+   - Understand quarter references (Q1, Q2, Q3, Q4, quarter 1, first quarter, etc.)
+   - Understand that quarters represent time periods (Q1 = Jan-Mar, Q2 = Apr-Jun, etc.)
+   - For comparisons involving quarters or time periods, set appropriate intent and filters
+   - Let your intelligence handle all variations and formats
+4. COMPANY NAME EXTRACTION: When user mentions a company name (e.g., "ABC Corp", "XYZ Industries", "Company Name"), extract it to insured_name filter. Extract the company name as mentioned in the query, even if it's partial. The system will handle name variations (case, spacing, suffixes like "Pvt Ltd", singular/plural) automatically.
+5. ALWAYS extract industry/sector names mentioned in the query into the filters
+6. When multiple industries are mentioned (e.g., "manufacturing and healthcare"), combine them with comma: "manufacturing, healthcare"
+7. When user asks for "top N" of something, set both limit AND sort_by appropriately
+8. Keywords like "manufacturing", "healthcare", "retail", "IT", "construction" are INDUSTRIES - put them in filters
+9. COMPANY vs INDIVIDUAL: When user mentions a company name with business keywords (e.g., "ABC Chemical", "XYZ Industries", "Company Corp"), they want COMPANY policies, not individual person policies. The system will automatically filter out individual person names when company keywords are detected.
 FORMAT DETECTION (NEW):
 1. Detect if user explicitly asks for a specific format:
 - renewal_year (integer): 2024, 2025, 2026, etc.
 - renewal_month (string): january, february, march, april, may, june, july, august, september, october, november, december
   Use this when user asks for policies renewing in a specific month
+  IMPORTANT: Use your intelligence to understand month names in any format, with typos, or abbreviations
+- quarter (string): Use when user mentions quarters or time periods
+  Understand quarters in any format (q1, Q1, quarter 1, first quarter, etc.)
+  For comparisons, extract all mentioned quarters
 Available fields for sorting:
 - premium_amount: net premium, gross premium, premium
 {"intent":"list","needs_metadata":true,"filters":{"insured_name":"ABC Corp"},"sort_by":null,"sort_order":"desc","limit":null,"calculation":null,"calculation_field":null,"format_preference":null,"is_format_change":false}
 Query: "show me policies for XYZ Industries"
+{"intent":"list","needs_metadata":true,"filters":{"insured_name":"XYZ Industries"},"sort_by":null,"sort_order":"desc","limit":null,"calculation":null,"calculation_field":null,"format_preference":null,"is_format_change":false}
+"""
         messages = [
             {"role": "system", "content": system_prompt},
                             return False
                         all_metadata = [m for m in all_metadata if has_month(m)]
                         print(f"[METADATA QUERY] Filtered by renewal_month {value} (month={target_month}): {len(all_metadata)} remaining")
             # Apply AI-extracted sorting
             if sort_by and sort_by in ['premium_amount', 'sum_insured', 'renewal_date', 'policy_start_date']:
 4. Create a clear comparison highlighting differences and similarities.
 5. Use a table format if comparing multiple attributes.
+TIME PERIOD COMPARISONS (CRITICAL - e.g., Q1 vs Q2, quarters, months, years):
+- When comparing time periods like Q1 vs Q2, you MUST calculate quarters from dates yourself
+- Q1 = January-March (months 1-3), Q2 = April-June (months 4-6), Q3 = July-September (months 7-9), Q4 = October-December (months 10-12)
+- Look at renewal_date, policy_start_date, or other date fields in the metadata
+- For each policy, determine which quarter it belongs to based on the month in its date
+- Group policies by quarter (calculate from dates, don't look for a "quarter" field - it doesn't exist)
+- Calculate aggregates for each quarter:
+  * Total Premium (sum of premium_amount)
+  * Total Sum Insured (sum of sum_insured)
+  * Number of Policies (count)
+  * Average Premium per policy
+  * Policy types breakdown
+- Compare the quarters side-by-side with all metrics
+- Provide insights: which quarter has more business, growth trends, differences
+- NEVER say "data is not categorized by quarters" - YOU must categorize it by calculating quarters from dates
+CALCULATION REQUIREMENTS:
+- Use the metadata provided - it has renewal_date, premium_amount, sum_insured for all policies
+- Extract the month from renewal_date to determine quarter
+- Sum up premium_amount and sum_insured for each quarter
+- Count policies in each quarter
+- Present in a clear comparison table
 {format_instructions}
+Do NOT say information is missing or that data isn't categorized by quarters - calculate quarters from dates and perform the analysis."""
         elif intent == 'summarize':
             system_prompt = f"""You are Iribl AI, a document analysis assistant providing a SUMMARY.
 4. Provide a comprehensive answer covering all relevant information.
 5. Format clearly with headers and bullet points.
+ENTITY ISOLATION (CRITICAL):
+- If the query mentions a specific entity (company, person, organization), ONLY use information for that entity
+- Use your intelligence to identify the entity mentioned in the current query
+- Do NOT mix information from different entities, even if mentioned in conversation history
+- If conversation history mentions a different entity than the current query, IGNORE that previous entity's information
+- ONLY use data from the DOCUMENT DATA provided for the current query's entity
+- Use your natural language understanding to distinguish between entities
 FINDING NUMBERS AND TOTALS (CRITICAL):
 - When asked about "how many", "total", "sum insured", "students", "count" - search EVERY section
 - The DETAILED DOCUMENT CONTENT section is MORE IMPORTANT than metadata for finding numbers
 - The DETAILED DOCUMENT CONTENT section contains the actual numbers, counts, and totals
 - You MUST search through the DETAILED DOCUMENT CONTENT section to find the answer
 - If metadata doesn't have the answer, the answer is definitely in the detailed content - keep searching!
+"""
+        # Entity isolation instruction - ADDITIVE enhancement to prevent mixing entities
+        # This doesn't replace existing context handling, just adds entity isolation awareness
+        entity_isolation = """
+IMPORTANT: Entity Isolation (to prevent mixing data from different entities):
+- Identify the entity (company, person, organization) mentioned in the current query
+- ONLY use information from documents that mention this entity
+- If conversation history mentions a different entity than the current query, focus on the current entity's data
+- Use your intelligence to distinguish between entities and ensure you're answering about the correct one
 """
         user_message = f"""{context_injection}Based on the following document data, answer my question comprehensively.
 DOCUMENT DATA:
 {context}
 {detailed_content_emphasis}
+{entity_isolation}
 QUESTION: {query}
 Instructions:
 - Use both the structured metadata AND detailed content to provide a complete answer
+- If this is a follow-up, use conversation history to understand what I'm referring to (pronouns like "it", "this", "that")
 - Search THOROUGHLY through ALL document sections for numbers, totals, counts, students, sum insured, etc.
 - For questions about numbers/counts/totals: The DETAILED DOCUMENT CONTENT section is more important than metadata
+- NEVER say information is missing unless you've checked every single section
+- ONLY use information from the DOCUMENT DATA provided above{format_reminder}"""
         messages.append({"role": "user", "content": user_message})
         """
         import time
+        # Step 0: Normalize query with AI (fix typos, expand abbreviations)
+        normalized_query = self._normalize_query_with_ai(query)
+        # Step 0.5: AI-powered query parsing - understand intent and extract structured parameters
+        parsed = self._parse_query_with_ai(normalized_query)
         print(f"[QUERY ROUTING] AI-parsed query: {parsed}")
         # Route based on AI-parsed intent
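
The comparison prompt in this diff asks the model to bucket policies into quarters and sum the figures itself. As an alternative, the totals could be computed in Python and handed to the model. The sketch below assumes each metadata record is a dict with an ISO-formatted `renewal_date` string and numeric `premium_amount` and `sum_insured` values. The field names come from the prompt, but the date format and the function name `aggregate_by_quarter` are assumptions.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable


def aggregate_by_quarter(policies: Iterable[dict]) -> Dict[int, dict]:
    """Group policies by calendar quarter of renewal_date and total the key metrics."""
    totals = defaultdict(lambda: {"count": 0, "premium": 0.0, "sum_insured": 0.0})
    for policy in policies:
        raw = policy.get("renewal_date")
        try:
            # Assumption: dates are stored as 'YYYY-MM-DD' strings.
            month = date.fromisoformat(raw).month
        except (TypeError, ValueError):
            continue  # Skip records with a missing or unparseable date.
        quarter = (month - 1) // 3 + 1  # 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4
        bucket = totals[quarter]
        bucket["count"] += 1
        bucket["premium"] += float(policy.get("premium_amount") or 0)
        bucket["sum_insured"] += float(policy.get("sum_insured") or 0)
    return dict(totals)
```

Pre-computing the totals would keep the arithmetic deterministic and leave the model to explain the comparison rather than perform it.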