# RAG Strategy: Vectorless Retrieval for Business Digitization

## RAG Overview

Traditional RAG (Retrieval-Augmented Generation) systems use vector embeddings for semantic search. This project instead implements a **vectorless RAG approach** built on an inverted page index, optimized for business document processing.

## Why Vectorless RAG?

### Advantages Over Vector-Based RAG

| Aspect | Vectorless (Page Index) | Vector-Based |
|--------|------------------------|--------------|
| **Setup Cost** | No embedding generation | Expensive embedding computation |
| **Latency** | Fast keyword lookups | Vector similarity computation overhead |
| **Explainability** | Clear why results were returned | Opaque similarity scores |
| **Memory** | Lightweight index | Large vector storage |
| **Deterministic** | Same query = same results | Embedding-model dependent |
| **Debugging** | Easy to trace | Difficult to debug relevance |

### When Vectorless Works Best

✅ **Good Fit:**
- Structured business documents
- Exact keyword matching is important
- Limited document corpus (10-100 docs)
- Need for explainable retrieval
- Cost-sensitive applications

❌ **Not Ideal:**
- Semantic similarity is critical
- Massive document collections (1000s+)
- Multilingual corpora with no keyword overlap
- Query reformulation is needed

## Page Index Architecture

### Core Components

```python
from typing import Dict, List


class PageIndex:
    """Vectorless inverted index for fast document retrieval."""

    def __init__(self):
        # Document storage
        self.documents: Dict[str, ParsedDocument] = {}

        # Inverted indices
        self.page_index: Dict[str, List[PageReference]] = {}
        self.table_index: Dict[TableType, List[TableReference]] = {}
        self.media_index: Dict[ImageCategory, List[MediaReference]] = {}

        # Metadata
        self.index_metadata: IndexMetadata = IndexMetadata()

        # Statistics
        self.stats: IndexStats = IndexStats()
```

### Indexing Process

```python
class IndexingAgent:
    """Build and manage the page index."""

    def build_index(
        self,
        parsed_docs: List[ParsedDocument],
        tables:
        List[StructuredTable],
        media: MediaCollection,
    ) -> PageIndex:
        """Create a comprehensive inverted index."""
        index = PageIndex()

        # Index documents by page
        for doc in parsed_docs:
            index.documents[doc.doc_id] = doc
            self.index_document_pages(doc, index)

        # Index tables by type and content
        for table in tables:
            self.index_table(table, index)

        # Index media by category
        for image in media.images:
            self.index_media(image, index)

        # Build statistics
        index.stats = self.compute_statistics(index)

        return index

    def index_document_pages(self, doc: ParsedDocument, index: PageIndex):
        """Create inverted-index entries for each page."""
        for page in doc.pages:
            # Extract keywords from page text
            keywords = self.extract_keywords(page.text)

            # Create page references
            for keyword in keywords:
                if keyword not in index.page_index:
                    index.page_index[keyword] = []

                index.page_index[keyword].append(PageReference(
                    doc_id=doc.doc_id,
                    page_number=page.number,
                    snippet=self.extract_snippet(page.text, keyword),
                    relevance_score=self.calculate_keyword_relevance(
                        keyword, page.text
                    )
                ))
```

### Keyword Extraction

```python
import re
from typing import List, Set


class KeywordExtractor:
    """Extract searchable keywords from text."""

    def __init__(self):
        # Load stopwords
        self.stopwords = set(ENGLISH_STOPWORDS)

        # Common business terms to always include
        self.business_terms = {
            'price', 'cost', 'product', 'service', 'hours',
            'location', 'contact', 'email', 'phone', 'website',
            'description', 'features', 'specifications'
        }

    def extract_keywords(self, text: str) -> Set[str]:
        """Multi-strategy keyword extraction."""
        keywords = set()

        # Strategy 1: Tokenization with stopword removal
        tokens = self.tokenize(text)
        keywords.update(
            token for token in tokens
            if token not in self.stopwords or token in self.business_terms
        )

        # Strategy 2: Named entity extraction
        entities = self.extract_entities(text)
        keywords.update(entities)

        # Strategy 3: N-grams (bigrams, trigrams)
        bigrams = self.extract_ngrams(tokens, n=2)
        trigrams = self.extract_ngrams(tokens, n=3)
        keywords.update(bigrams +
                        trigrams)

        # Strategy 4: Key phrases (noun phrases)
        phrases = self.extract_noun_phrases(text)
        keywords.update(phrases)

        return keywords

    def tokenize(self, text: str) -> List[str]:
        """Tokenize and normalize text."""
        # Lowercase
        text = text.lower()

        # Remove punctuation except in meaningful contexts (emails, URLs, hyphens)
        text = re.sub(r'[^\w\s\-@.]', ' ', text)

        # Split into tokens
        tokens = text.split()

        # Normalize (lemmatization)
        tokens = [self.lemmatize(token) for token in tokens]

        # Filter short tokens
        tokens = [t for t in tokens if len(t) > 2]

        return tokens

    def extract_entities(self, text: str) -> List[str]:
        """Extract named entities using simple heuristics."""
        entities = []

        # Email addresses
        emails = re.findall(
            r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text
        )
        entities.extend(emails)

        # Phone numbers
        phones = re.findall(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', text)
        entities.extend(phones)

        # URLs
        urls = re.findall(r'https?://\S+', text)
        entities.extend(urls)

        # Capitalized words (potential names/places)
        capitalized = re.findall(r'\b[A-Z][a-z]+\b', text)
        entities.extend(capitalized)

        return [e.lower() for e in entities]

    def extract_ngrams(self, tokens: List[str], n: int) -> List[str]:
        """Extract n-grams from a token list."""
        ngrams = []
        for i in range(len(tokens) - n + 1):
            ngram = '_'.join(tokens[i:i + n])
            ngrams.append(ngram)
        return ngrams
```

## Retrieval Strategies

### Query Processing

```python
class QueryProcessor:
    """Process and expand user queries."""

    def process_query(self, query: str) -> ProcessedQuery:
        """Normalize and expand a query."""
        # Normalize query
        normalized = self.normalize_query(query)

        # Extract query terms
        terms = self.extract_query_terms(normalized)

        # Expand with synonyms
        expanded_terms = self.expand_with_synonyms(terms)

        # Weight terms by importance
        weighted_terms = self.weight_terms(expanded_terms)

        return ProcessedQuery(
            original=query,
            normalized=normalized,
            terms=terms,
            expanded_terms=expanded_terms,
            weighted_terms=weighted_terms
        )

    def expand_with_synonyms(self,
                             terms: List[str]) -> List[str]:
        """Add synonyms for common business terms."""
        synonym_map = {
            'price': ['cost', 'rate', 'fee', 'charge'],
            'hours': ['time', 'schedule', 'timing'],
            'location': ['address', 'place', 'where'],
            'contact': ['phone', 'email', 'reach'],
        }

        expanded = set(terms)
        for term in terms:
            if term in synonym_map:
                expanded.update(synonym_map[term])

        return list(expanded)
```

### Context Retrieval

```python
import copy
from typing import List


class ContextRetriever:
    """Retrieve relevant context from the page index."""

    def retrieve_context(
        self,
        query: ProcessedQuery,
        index: PageIndex,
        max_pages: int = 5
    ) -> RetrievalResult:
        """Get the most relevant pages for a query."""
        # Find pages matching query terms
        matching_pages = self.find_matching_pages(query, index)

        # Rank pages by relevance
        ranked_pages = self.rank_pages(matching_pages, query)

        # Select top pages
        top_pages = ranked_pages[:max_pages]

        # Build context from selected pages
        context = self.build_context(top_pages, index)

        return RetrievalResult(
            query=query,
            matched_pages=matching_pages,
            ranked_pages=ranked_pages,
            context=context,
            confidence=self.calculate_confidence(top_pages)
        )

    def find_matching_pages(
        self,
        query: ProcessedQuery,
        index: PageIndex
    ) -> List[PageReference]:
        """Find all pages containing query terms."""
        matching_pages = {}  # page_key -> PageReference

        for term in query.weighted_terms:
            if term in index.page_index:
                for page_ref in index.page_index[term]:
                    page_key = f"{page_ref.doc_id}:{page_ref.page_number}"

                    if page_key not in matching_pages:
                        # Copy so score aggregation does not mutate the stored index entry
                        matching_pages[page_key] = copy.copy(page_ref)
                    else:
                        # Aggregate relevance scores across matching terms
                        matching_pages[page_key].relevance_score += page_ref.relevance_score

        return list(matching_pages.values())

    def rank_pages(
        self,
        pages: List[PageReference],
        query: ProcessedQuery
    ) -> List[PageReference]:
        """Rank pages by relevance to the query."""
        scored_pages = []

        for page in pages:
            score = self.calculate_page_score(page, query)
            scored_pages.append((score, page))

        # Sort by score (descending)
        scored_pages.sort(key=lambda x: x[0],
                          reverse=True)

        return [page for score, page in scored_pages]

    def calculate_page_score(
        self,
        page: PageReference,
        query: ProcessedQuery
    ) -> float:
        """Score page relevance using multiple signals."""
        score = 0.0

        # Signal 1: Term frequency in page
        score += page.relevance_score * 0.5

        # Signal 2: Query term coverage
        coverage = self.calculate_term_coverage(page, query)
        score += coverage * 0.3

        # Signal 3: Snippet quality
        snippet_quality = self.assess_snippet_quality(page.snippet)
        score += snippet_quality * 0.2

        return score

    def build_context(
        self,
        pages: List[PageReference],
        index: PageIndex
    ) -> str:
        """Construct a context string from the top pages."""
        context_parts = []

        for page in pages:
            # Get full page text
            doc = index.documents.get(page.doc_id)
            if not doc:
                continue

            page_obj = doc.pages[page.page_number - 1]

            # Add page context with source info
            context_parts.append(
                f"[Document: {doc.source_file}, Page {page.page_number}]\n"
                f"{page_obj.text[:1000]}"  # Limit to 1000 chars per page
            )

        return "\n\n".join(context_parts)
```

## Table-Specific Retrieval

```python
class TableRetriever:
    """Specialized retrieval for tables."""

    def retrieve_tables(
        self,
        table_type: TableType,
        index: PageIndex
    ) -> List[StructuredTable]:
        """Get all tables of a specific type."""
        if table_type not in index.table_index:
            return []

        table_refs = index.table_index[table_type]

        # Retrieve full table objects
        tables = []
        for ref in table_refs:
            # Load table from storage
            table = self.load_table(ref.table_id)
            if table:
                tables.append(table)

        return tables

    def retrieve_pricing_tables(self, index: PageIndex) -> List[StructuredTable]:
        """Get all pricing tables (a common need)."""
        return self.retrieve_tables(TableType.PRICING, index)

    def retrieve_itinerary_tables(self, index: PageIndex) -> List[StructuredTable]:
        """Get all itinerary/schedule tables."""
        return self.retrieve_tables(TableType.ITINERARY, index)
```

## Media-Specific Retrieval

```python
class MediaRetriever:
    """Retrieve media by category."""
    def retrieve_product_images(
        self,
        index: PageIndex,
        image_analyses: List[ImageAnalysis]
    ) -> List[ImageAnalysis]:
        """Get all product images."""
        return [img for img in image_analyses if img.is_product]

    def retrieve_service_images(
        self,
        index: PageIndex,
        image_analyses: List[ImageAnalysis]
    ) -> List[ImageAnalysis]:
        """Get all service-related images."""
        return [img for img in image_analyses if img.is_service_related]
```

## LLM Integration with RAG

### Context-Aware Prompting

```python
class RAGPromptBuilder:
    """Build prompts with retrieved context."""

    def build_extraction_prompt(
        self,
        field_name: str,
        context: RetrievalResult,
        schema_description: str
    ) -> str:
        """Create a prompt with RAG context."""
        prompt = f"""
Extract the following field from the provided business documents:

Field: {field_name}
Schema Description: {schema_description}

Relevant Document Context:
{context.context}

Instructions:
1. Extract ONLY information present in the documents
2. Do NOT fabricate or infer missing information
3. If the field cannot be found, return null
4. Return the response as JSON

Response format:
{{
    "{field_name}": "<extracted value or null>",
    "confidence": 0.0-1.0,
    "source": "document name and page number"
}}
"""
        return prompt
```

### RAG-Enhanced Schema Mapping

```python
import json
import os
from typing import Any

from openai import OpenAI


class RAGSchemaMapper:
    """Use RAG for intelligent field extraction with Groq (gpt-oss-120b)."""

    def __init__(self):
        self.retriever = ContextRetriever()
        self.prompt_builder = RAGPromptBuilder()

        # Groq exposes an OpenAI-compatible endpoint
        self.client = OpenAI(
            base_url="https://api.groq.com/openai/v1",
            api_key=os.getenv("GROQ_API_KEY")
        )
        self.model = "gpt-oss-120b"

    def extract_field(
        self,
        field_name: str,
        field_schema: dict,
        index: PageIndex
    ) -> Any:
        """Extract a single field using RAG and Groq."""
        # Build a query for the field; weight expanded terms uniformly
        expanded = self.expand_field_query(field_name)
        query = ProcessedQuery(
            original=field_name,
            normalized=field_name.lower(),
            terms=[field_name.lower()],
            expanded_terms=expanded,
            weighted_terms={term: 1.0 for term in expanded}
        )

        # Retrieve relevant context
        context = self.retriever.retrieve_context(query, index)

        # Build prompt
        prompt = self.prompt_builder.build_extraction_prompt(
            field_name, context, field_schema['description']
        )

        # Call the LLM via Groq
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=1000
        )

        # Parse response
        result = json.loads(response.choices[0].message.content)
        return result[field_name]
```

## Index Optimization

### Compression

```python
def compress_index(index: PageIndex) -> CompressedIndex:
    """Reduce index size while maintaining retrieval quality."""
    compressed = CompressedIndex()

    # Store only the top N references per keyword
    for keyword, refs in index.page_index.items():
        # Sort by relevance
        sorted_refs = sorted(
            refs,
            key=lambda r: r.relevance_score,
            reverse=True
        )

        # Keep the top 10
        compressed.page_index[keyword] = sorted_refs[:10]

    return compressed
```

### Caching

```python
class IndexCache:
    """Cache frequently accessed index lookups."""

    def __init__(self, max_size: int = 1000):
        self.cache = {}
        self.max_size = max_size
        self.access_counts = {}

    def get(self, query_key: str) -> Optional[RetrievalResult]:
        """Get a cached result."""
        if query_key in self.cache:
            self.access_counts[query_key] += 1
            return self.cache[query_key]
        return None

    def put(self, query_key: str, result: RetrievalResult):
        """Cache a result with least-frequently-used (LFU) eviction."""
        if len(self.cache) >= self.max_size:
            # Evict the least frequently accessed entry
            lfu_key = min(
                self.access_counts,
                key=lambda k: self.access_counts[k]
            )
            del self.cache[lfu_key]
            del self.access_counts[lfu_key]

        self.cache[query_key] = result
        self.access_counts[query_key] = 1
```

## Performance Metrics

### Index Statistics

```python
from dataclasses import dataclass


@dataclass
class IndexStats:
    """Index performance metrics."""
    total_documents: int = 0
    total_pages: int = 0
    total_keywords: int = 0
    avg_keywords_per_page: float = 0.0
    index_size_bytes: int = 0
    build_time_seconds: float = 0.0

    def print_summary(self):
        print(f"""
Index Statistics:
- Documents: {self.total_documents}
- Pages: {self.total_pages}
- Keywords: {self.total_keywords}
- Avg keywords/page: {self.avg_keywords_per_page:.1f}
- Index size: {self.index_size_bytes / 1024:.1f} KB
- Build time: {self.build_time_seconds:.2f}s
""")
```

## Conclusion

This vectorless RAG strategy provides:

- **Fast retrieval** through inverted indexing
- **Explainable results** via keyword matching
- **Low overhead** compared to vector embeddings
- **Deterministic behavior** for consistent results
- **Context-aware LLM integration** for intelligent extraction

The approach is optimized for the business digitization use case, where documents are structured, the corpus is manageable, and exact keyword matching is valuable.