EdSummariser / utils /README.md
LiamKhoaLe's picture
Add session-spec within project
d063204
# EdSummariser Utils
Core utilities for the EdSummariser RAG system providing document processing, retrieval, and AI integration.
## Core Modules
### `rag.py` - Enhanced Retrieval System
- **Multi-strategy search**: Flat, hybrid, Atlas, and local vector search
- **Flat index**: Exhaustive search for maximum accuracy
- **MongoDB integration**: Chunk storage and retrieval with vector embeddings
- **Search types**: `flat`, `hybrid`, `atlas`, `local`
### `chunker.py` - Document Segmentation
- **Semantic chunking**: Heading-based text segmentation
- **Overlap strategy**: 50-word overlap between chunks for context preservation
- **Academic patterns**: Enhanced regex for academic document structures
- **Size control**: 150-500 word chunks with intelligent splitting
### `embeddings.py` - Vector Generation
- **Sentence Transformers**: all-MiniLM-L6-v2 model (384 dimensions)
- **Lazy loading**: Model loaded on first use
- **Fallback support**: Random embeddings when model unavailable
### `router.py` - AI Model Routing
- **Multi-provider**: Gemini and NVIDIA API integration
- **Model selection**: Automatic routing based on query complexity
- **Retry logic**: Robust error handling with key rotation
### `parser.py` - Document Processing
- **PDF parsing**: PyMuPDF with image extraction
- **DOCX support**: Microsoft Word document processing
- **Image handling**: PIL integration for document images
## AI Integration
### `summarizer.py` - Content Summarization
- **Cheap summarization**: Lightweight text summarization
- **Content cleaning**: LLM-based chunk text cleaning
- **Topic extraction**: Single-sentence topic generation
### `caption.py` - Image Analysis
- **BLIP integration**: Image captioning for document images
- **Visual context**: Image-to-text conversion for RAG
## Memory & Context
### `memo/` - Memory Management
- **Conversation history**: LRU-based memory system
- **Context retrieval**: Semantic and recent context selection
- **NVIDIA integration**: File relevance classification
- **Session-specific memory**: Isolated memory per chat session
- **Auto-naming**: AI-powered session naming based on first query
- **Memory cleanup**: Session and project-level memory management
## Key Features
### Enhanced RAG Capabilities
- **Chain of Thought**: Query variation generation for better retrieval
- **Multi-query search**: 3-5 query variations per search
- **Smart deduplication**: Result ranking and deduplication
- **Fallback strategies**: 4-tier fallback system for zero results
### Document Processing
- **Academic-aware chunking**: Specialized patterns for academic documents
- **Context preservation**: Overlapping chunks maintain document flow
- **Metadata extraction**: Page spans, topics, and summaries
### Performance Optimizations
- **Lazy loading**: Models loaded only when needed
- **Caching**: API key rotation and retry mechanisms
- **Sampling**: Intelligent document sampling for large datasets
## R&D Areas
### Short-term Improvements
- **Query expansion**: More sophisticated query reformulation
- **Reranking**: Cross-encoder models for result reranking
- **Metadata filtering**: Enhanced metadata-based search
### Long-term Enhancements
- **TreeRAG**: Hierarchical document organization
- **Hybrid retrieval**: Sparse + dense retrieval combination
- **Fine-tuning**: Domain-specific embedding models
- **Evaluation framework**: Retrieval accuracy metrics
## Maintenance
### Dependencies
- **Core**: `sentence-transformers`, `pymongo`, `numpy`
- **PDF**: `PyMuPDF`, `PIL`
- **AI**: `httpx` for API calls
- **Optional**: `weasyprint` for PDF generation
### Configuration
- **Environment variables**: API keys, model names, search preferences
- **MongoDB**: Vector index configuration for Atlas
- **Model settings**: Embedding dimensions, chunk sizes
### Monitoring
- **Logging**: Comprehensive logging across all modules
- **Error handling**: Graceful degradation and fallbacks
- **Performance**: Search strategy selection based on results
## Usage
```python
# Basic RAG usage
from utils.rag.rag import RAGStore
from utils.rag.embeddings import EmbeddingClient
rag = RAGStore(mongo_uri, db_name)
embedder = EmbeddingClient()
# Enhanced search
hits = rag.vector_search(
user_id, project_id, query_vector,
k=6, search_type="flat"
)
```
## File Structure
```
utils/
├── rag.py # Core retrieval system
├── chunker.py # Document segmentation
├── embeddings.py # Vector generation
├── router.py # AI model routing
├── parser.py # Document parsing
├── summarizer.py # Content summarization
├── caption.py # Image analysis
├── common.py # Shared utilities
├── logger.py # Logging configuration
└── memo/ # Memory management
├── core.py # Memory system core
├── history.py # Conversation history
├── nvidia.py # NVIDIA integration
└── session.py # Session-specific memory management
```