Spaces:
Sleeping
Sleeping
Commit Β·
ade6079
1
Parent(s): f270e20
added readmeeeee
Browse files- .gitignore +2 -1
- LICENSE +21 -0
- LLM/README.md +413 -0
- LLM/image_answerer.py +4 -2
- LLM/lite_llm.py +6 -1
- LLM/tabular_answer.py +5 -1
- RAG/README.md +302 -0
- README.md +639 -6
- api/README.md +442 -0
- config/README.md +199 -0
- config/config.py +3 -6
- logger/README.md +118 -0
- preprocessing/README.md +362 -0
.gitignore
CHANGED
|
@@ -7,4 +7,5 @@ test*
|
|
| 7 |
all-MiniLM-L6-v2
|
| 8 |
cross-encoder/ms-marco-MiniLM-L-6-v2
|
| 9 |
test
|
| 10 |
-
RAG/rag_embeddings/[a-z]*
|
|
|
|
|
|
| 7 |
all-MiniLM-L6-v2
|
| 8 |
cross-encoder/ms-marco-MiniLM-L-6-v2
|
| 9 |
test
|
| 10 |
+
RAG/rag_embeddings/[a-z]*
|
| 11 |
+
.cache/
|
LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
MIT License
|
| 2 |
+
|
| 3 |
+
Copyright (c) 2025 Rahul Samedavar and Sambhaji Patil
|
| 4 |
+
|
| 5 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
| 6 |
+
of this software and associated documentation files (the "Software"), to deal
|
| 7 |
+
in the Software without restriction, including without limitation the rights
|
| 8 |
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
| 9 |
+
copies of the Software, and to permit persons to whom the Software is
|
| 10 |
+
furnished to do so, subject to the following conditions:
|
| 11 |
+
|
| 12 |
+
The above copyright notice and this permission notice shall be included in all
|
| 13 |
+
copies or substantial portions of the Software.
|
| 14 |
+
|
| 15 |
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
| 16 |
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
| 17 |
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
| 18 |
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
| 19 |
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
| 20 |
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
| 21 |
+
SOFTWARE.
|
LLM/README.md
ADDED
|
@@ -0,0 +1,413 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# ShastraDocs - LLM Handler Package
|
| 2 |
+
|
| 3 |
+
## π Overview
|
| 4 |
+
|
| 5 |
+
The ShastraDocs LLM Handler Package is a comprehensive, production-ready solution for multi-provider language model management with intelligent rate limiting, specialized processors, and automated fallback mechanisms. This package enables seamless interaction with multiple LLM providers (Groq, Gemini, OpenAI) while handling rate limits gracefully and providing specialized processing for different data types.
|
| 6 |
+
|
| 7 |
+
## π― Key Benefits
|
| 8 |
+
|
| 9 |
+
### **Smart Rate Limit Handling**
|
| 10 |
+
- **Multi-Provider Cycling**: Automatically rotates between Groq, Gemini, and OpenAI instances
|
| 11 |
+
- **Intelligent Cooldown Management**: Tracks rate limits per provider and implements automatic cooldowns
|
| 12 |
+
- **Cost-Effective Operations**: Process 200+ questions through RAG pipeline with **$0** using free tier rotation
|
| 13 |
+
- **Zero Downtime**: Seamless fallback between providers ensures continuous operation
|
| 14 |
+
|
| 15 |
+
### **Specialized Handlers for Specific Tasks**
|
| 16 |
+
- **Modular Architecture**: Choose optimal models, prompts, and formatting per data type
|
| 17 |
+
- **Task-Specific Optimization**: Dedicated processors for images, tables, documents, and general text
|
| 18 |
+
- **Provider Flexibility**: Run with single API key or multiple keys across different providers
|
| 19 |
+
|
| 20 |
+
### **Production-Ready Features**
|
| 21 |
+
- **Async/Await Support**: Full FastAPI compatibility for high-performance applications
|
| 22 |
+
- **Error Recovery**: Robust exception handling with automatic retries
|
| 23 |
+
- **Comprehensive Logging**: Detailed status tracking and performance monitoring
|
| 24 |
+
- **Thread-Safe Operations**: Concurrent request handling with proper synchronization
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
### Error Handling
|
| 28 |
+
|
| 29 |
+
Comprehensive error handling with:
|
| 30 |
+
|
| 31 |
+
- **Automatic Retries**: Built-in retry logic for transient failures
|
| 32 |
+
- **Provider Fallback**: Seamless switching between providers
|
| 33 |
+
- **Graceful Degradation**: Continues operation even with partial failures
|
| 34 |
+
- **Detailed Logging**: Comprehensive error tracking and reporting
|
| 35 |
+
|
| 36 |
+
## π Performance Metrics
|
| 37 |
+
|
| 38 |
+
### Cost Efficiency
|
| 39 |
+
- **Free Tier Optimization**: 200+ questions processed at $0 cost
|
| 40 |
+
- **Smart Provider Selection**: Chooses most cost-effective available provider
|
| 41 |
+
- **Rate Limit Avoidance**: Prevents unnecessary paid API calls
|
| 42 |
+
|
| 43 |
+
### Response Times
|
| 44 |
+
- **Concurrent Processing**: Multiple requests handled simultaneously
|
| 45 |
+
- **Provider Optimization**: Fastest available provider selected first
|
| 46 |
+
- **Caching Support**: LRU cache for frequently used configurations
|
| 47 |
+
|
| 48 |
+
### Reliability
|
| 49 |
+
- **99%+ Uptime**: Multiple provider fallback ensures availability
|
| 50 |
+
- **Error Recovery**: Automatic recovery from rate limits and failures
|
| 51 |
+
- **Status Monitoring**: Real-time health checking of all providers
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
## π¦ Package Components
|
| 55 |
+
|
| 56 |
+
### π§ Core Components
|
| 57 |
+
|
| 58 |
+
#### **1. Unified LLM Handler (`llm_handler.py`)**
|
| 59 |
+
|
| 60 |
+
The heart of the package - a sophisticated multi-provider LLM manager with intelligent routing and rate limit handling.
|
| 61 |
+
|
| 62 |
+
**Key Features:**
|
| 63 |
+
- **Multi-Instance Support**: Handle multiple API keys per provider
|
| 64 |
+
- **Priority-Based Routing**: Groq β Gemini β OpenAI fallback sequence
|
| 65 |
+
- **Automatic Cooldown Management**: 60-second cooldowns for rate-limited providers
|
| 66 |
+
- **Real-Time Status Tracking**: Monitor provider availability and performance
|
| 67 |
+
- **Reasoning Model Support**: Special handling for reasoning models with format options
|
| 68 |
+
|
| 69 |
+
**Usage Example:**
|
| 70 |
+
```python
|
| 71 |
+
from llm_handler import llm_handler
|
| 72 |
+
|
| 73 |
+
# Generate text with automatic provider selection
|
| 74 |
+
result, provider, instance = await llm_handler.generate_text(
|
| 75 |
+
system_prompt="You are a helpful assistant",
|
| 76 |
+
user_prompt="Explain quantum computing",
|
| 77 |
+
temperature=0.7,
|
| 78 |
+
reasoning_format="hidden" # For reasoning models
|
| 79 |
+
)
|
| 80 |
+
|
| 81 |
+
# Get provider status
|
| 82 |
+
status = llm_handler.get_provider_status()
|
| 83 |
+
print(f"Active providers: {len(status)}")
|
| 84 |
+
|
| 85 |
+
# Reset cooldowns if needed
|
| 86 |
+
llm_handler.reset_cooldowns()
|
| 87 |
+
```
|
| 88 |
+
|
| 89 |
+
**Supported Providers:**
|
| 90 |
+
- **Groq**: High-speed inference with reasoning model support
|
| 91 |
+
- **Gemini**: Google's advanced models with vision capabilities
|
| 92 |
+
- **OpenAI**: GPT models with reliable performance
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
### Refer Confiugration section to learn more on how to setupp api keys
|
| 96 |
+
-----
|
| 97 |
+
#### **2. OneShot QA System (`one_shotter.py`)**
|
| 98 |
+
|
| 99 |
+
An advanced question-answering system that combines context analysis, web scraping, and search capabilities for comprehensive responses.
|
| 100 |
+
|
| 101 |
+
**Key Features:**
|
| 102 |
+
- **Intelligent Content Strategy**: Automatically determines need for additional information
|
| 103 |
+
- **Multi-Source Integration**: Combines provided context with scraped web content
|
| 104 |
+
- **Smart URL Detection**: Extracts and validates URLs from context and questions
|
| 105 |
+
- **Async Web Scraping**: High-performance concurrent scraping with rate limiting
|
| 106 |
+
- **Enhanced Answer Generation**: Utilizes all available sources for comprehensive responses
|
| 107 |
+
|
| 108 |
+
**Workflow Process:**
|
| 109 |
+
1. **URL Extraction**: Identifies relevant links in context/questions
|
| 110 |
+
2. **Content Strategy**: Determines if additional information is needed
|
| 111 |
+
3. **Web Scraping**: Fetches content from identified URLs
|
| 112 |
+
4. **Context Integration**: Combines original context with scraped content
|
| 113 |
+
5. **Answer Generation**: Produces comprehensive responses using all sources
|
| 114 |
+
|
| 115 |
+
**Usage Example:**
|
| 116 |
+
```python
|
| 117 |
+
from one_shotter import get_oneshot_answer
|
| 118 |
+
|
| 119 |
+
# Comprehensive QA with automatic content enhancement
|
| 120 |
+
questions = [
|
| 121 |
+
"What are the latest developments in AI?",
|
| 122 |
+
"How do quantum computers work?"
|
| 123 |
+
]
|
| 124 |
+
|
| 125 |
+
context = """
|
| 126 |
+
AI has been advancing rapidly...
|
| 127 |
+
Check out: https://openai.com/research
|
| 128 |
+
"""
|
| 129 |
+
|
| 130 |
+
answers = await get_oneshot_answer(context, questions)
|
| 131 |
+
# Returns detailed answers incorporating scraped web content
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
### π― Specialized Handlers
|
| 135 |
+
|
| 136 |
+
#### **3. Image Analysis Handler (`image_answerer.py`)**
|
| 137 |
+
|
| 138 |
+
Specialized processor for visual question answering using Gemini's vision capabilities.
|
| 139 |
+
|
| 140 |
+
**Features:**
|
| 141 |
+
- **Multi-Format Support**: URLs and local file paths
|
| 142 |
+
- **Structured Responses**: Numbered, detailed explanations
|
| 143 |
+
- **Retry Logic**: Automatic retries with error handling
|
| 144 |
+
- **Image Preprocessing**: Automatic RGB conversion and validation
|
| 145 |
+
|
| 146 |
+
**Usage Example:**
|
| 147 |
+
```python
|
| 148 |
+
from image_answerer import get_answer_for_image
|
| 149 |
+
|
| 150 |
+
questions = [
|
| 151 |
+
"What objects are in this image?",
|
| 152 |
+
"What is the dominant color scheme?"
|
| 153 |
+
]
|
| 154 |
+
|
| 155 |
+
answers = get_answer_for_image(
|
| 156 |
+
"https://example.com/image.jpg",
|
| 157 |
+
questions,
|
| 158 |
+
retries=3
|
| 159 |
+
)
|
| 160 |
+
```
|
| 161 |
+
|
| 162 |
+
#### **4. Tabular Data Handler (`tabular_answer.py`)**
|
| 163 |
+
|
| 164 |
+
Optimized for analyzing structured data with batch processing capabilities.
|
| 165 |
+
|
| 166 |
+
**Features:**
|
| 167 |
+
- **Batch Processing**: Handle multiple questions efficiently
|
| 168 |
+
- **Structured Parsing**: Robust numbered response extraction
|
| 169 |
+
- **Data Validation**: Handles malicious instructions and missing data
|
| 170 |
+
- **Performance Optimization**: Configurable batch sizes
|
| 171 |
+
|
| 172 |
+
**Usage Example:**
|
| 173 |
+
```python
|
| 174 |
+
from tabular_answer import get_answer_for_tabluar
|
| 175 |
+
|
| 176 |
+
data = """
|
| 177 |
+
| Product | Sales | Region |
|
| 178 |
+
|---------|-------|--------|
|
| 179 |
+
| A | 1000 | North |
|
| 180 |
+
| B | 1500 | South |
|
| 181 |
+
"""
|
| 182 |
+
|
| 183 |
+
questions = [
|
| 184 |
+
"Which product has highest sales?",
|
| 185 |
+
"What is the total sales?"
|
| 186 |
+
]
|
| 187 |
+
|
| 188 |
+
answers = get_answer_for_tabluar(data, questions, batch_size=10)
|
| 189 |
+
```
|
| 190 |
+
|
| 191 |
+
#### **5. Lite LLM Handler (`lite_llm.py`)**
|
| 192 |
+
|
| 193 |
+
Lightweight handler for simple, fast responses with minimal overhead.
|
| 194 |
+
|
| 195 |
+
**Features:**
|
| 196 |
+
- **Single Provider**: Focused Groq integration
|
| 197 |
+
- **Minimal Configuration**: Simple prompt-to-response interface
|
| 198 |
+
- **High Performance**: Optimized for speed over complex features
|
| 199 |
+
- **Configurable Parameters**: Adjustable temperature and token limits
|
| 200 |
+
|
| 201 |
+
## βοΈ Configuration Setup
|
| 202 |
+
|
| 203 |
+
### Environment Variables Setup
|
| 204 |
+
|
| 205 |
+
The package uses a flexible configuration system that automatically detects and loads multiple API keys for each provider. Create a `.env` file with your API keys using the following naming convention:
|
| 206 |
+
|
| 207 |
+
#### **Basic Configuration (.env file)**
|
| 208 |
+
|
| 209 |
+
```bash
|
| 210 |
+
# === GROQ PROVIDER ===
|
| 211 |
+
# Multiple Groq API Keys (detects GROQ_API_KEY_1 through GROQ_API_KEY_10)
|
| 212 |
+
GROQ_API_KEY_1=your_first_groq_key_here
|
| 213 |
+
GROQ_API_KEY_2=your_second_groq_key_here
|
| 214 |
+
GROQ_API_KEY_3=your_third_groq_key_here
|
| 215 |
+
# Add more as needed: GROQ_API_KEY_4, GROQ_API_KEY_5, etc.
|
| 216 |
+
|
| 217 |
+
# Optional: Custom models per Groq instance (defaults to qwen/qwen3-32b)
|
| 218 |
+
DEFAULT_GROQ_MODEL=qwen/qwen3-32b
|
| 219 |
+
GROQ_MODEL_1=llama3-70b-8192
|
| 220 |
+
GROQ_MODEL_2=mixtral-8x7b-32768
|
| 221 |
+
# GROQ_MODEL_3 will use DEFAULT_GROQ_MODEL if not specified
|
| 222 |
+
|
| 223 |
+
# === GEMINI PROVIDER ===
|
| 224 |
+
# Multiple Gemini API Keys (detects GEMINI_API_KEY_1 through GEMINI_API_KEY_10)
|
| 225 |
+
GEMINI_API_KEY_1=your_first_gemini_key_here
|
| 226 |
+
GEMINI_API_KEY_2=your_second_gemini_key_here
|
| 227 |
+
GEMINI_API_KEY_3=your_third_gemini_key_here
|
| 228 |
+
# Add more as needed: GEMINI_API_KEY_4, GEMINI_API_KEY_5, etc.
|
| 229 |
+
|
| 230 |
+
# Optional: Custom models per Gemini instance (defaults to gemini-2.0-flash)
|
| 231 |
+
DEFAULT_GEMINI_MODEL=gemini-2.0-flash
|
| 232 |
+
GEMINI_MODEL_1=gemini-1.5-pro
|
| 233 |
+
GEMINI_MODEL_2=gemini-2.0-flash
|
| 234 |
+
# GEMINI_MODEL_3 will use DEFAULT_GEMINI_MODEL if not specified
|
| 235 |
+
|
| 236 |
+
# === OPENAI PROVIDER ===
|
| 237 |
+
# Multiple OpenAI API Keys (detects OPENAI_API_KEY_1 through OPENAI_API_KEY_10)
|
| 238 |
+
OPENAI_API_KEY_1=your_first_openai_key_here
|
| 239 |
+
OPENAI_API_KEY_2=your_second_openai_key_here
|
| 240 |
+
# Add more as needed: OPENAI_API_KEY_3, OPENAI_API_KEY_4, etc.
|
| 241 |
+
|
| 242 |
+
# Optional: Custom models per OpenAI instance (defaults to gpt-4o-mini)
|
| 243 |
+
DEFAULT_OPENAI_MODEL=gpt-4o-mini
|
| 244 |
+
OPENAI_MODEL_1=gpt-4o
|
| 245 |
+
OPENAI_MODEL_2=gpt-4-turbo
|
| 246 |
+
# OPENAI_MODEL_3 will use DEFAULT_OPENAI_MODEL if not specified
|
| 247 |
+
|
| 248 |
+
# === SPECIALIZED HANDLERS ===
|
| 249 |
+
# For specific handlers that need dedicated keys
|
| 250 |
+
GROQ_API_KEY_LITE=your_groq_key_for_lite_handler
|
| 251 |
+
GROQ_API_KEY_TABULAR=your_groq_key_for_tabular_handler
|
| 252 |
+
GEMINI_API_KEY_IMAGE=your_gemini_key_for_image_handler
|
| 253 |
+
|
| 254 |
+
# === GLOBAL DEFAULTS ===
|
| 255 |
+
MAX_TOKENS=2048
|
| 256 |
+
TEMPERATURE=0.7
|
| 257 |
+
```
|
| 258 |
+
|
| 259 |
+
### **Quick Setup Guide**
|
| 260 |
+
|
| 261 |
+
1. **Create `.env` file** in your project root
|
| 262 |
+
2. **Add API keys** using the `PROVIDER_API_KEY_NUMBER` format
|
| 263 |
+
3. **Set default models** (optional) using `DEFAULT_PROVIDER_MODEL`
|
| 264 |
+
4. **Customize specific models** (optional) using `PROVIDER_MODEL_NUMBER`
|
| 265 |
+
5. **Run your application** - the handler will auto-detect all configurations
|
| 266 |
+
|
| 267 |
+
|
| 268 |
+
## π Quick Start
|
| 269 |
+
|
| 270 |
+
### Basic Installation
|
| 271 |
+
|
| 272 |
+
```bash
|
| 273 |
+
pip install -r requirements.txt
|
| 274 |
+
```
|
| 275 |
+
|
| 276 |
+
### Required Dependencies
|
| 277 |
+
|
| 278 |
+
```
|
| 279 |
+
groq
|
| 280 |
+
google-generativeai
|
| 281 |
+
openai
|
| 282 |
+
langchain-groq
|
| 283 |
+
langchain-google-genai
|
| 284 |
+
httpx
|
| 285 |
+
beautifulsoup4
|
| 286 |
+
pydantic
|
| 287 |
+
python-dotenv
|
| 288 |
+
```
|
| 289 |
+
|
| 290 |
+
### Simple Usage
|
| 291 |
+
|
| 292 |
+
```python
|
| 293 |
+
import asyncio
|
| 294 |
+
from llm_handler import llm_handler
|
| 295 |
+
from one_shotter import get_oneshot_answer
|
| 296 |
+
|
| 297 |
+
async def main():
|
| 298 |
+
# Simple text generation
|
| 299 |
+
result, provider, instance = await llm_handler.generate_text(
|
| 300 |
+
system_prompt="You are a helpful assistant",
|
| 301 |
+
user_prompt="Explain machine learning in simple terms"
|
| 302 |
+
)
|
| 303 |
+
print(f"Generated by {provider} ({instance}): {result}")
|
| 304 |
+
|
| 305 |
+
# Advanced QA with content enhancement
|
| 306 |
+
context = "Machine learning is a subset of AI..."
|
| 307 |
+
questions = ["What are the main types of ML?"]
|
| 308 |
+
|
| 309 |
+
answers = await get_oneshot_answer(context, questions)
|
| 310 |
+
print(f"Enhanced answer: {answers[0]}")
|
| 311 |
+
|
| 312 |
+
if __name__ == "__main__":
|
| 313 |
+
asyncio.run(main())
|
| 314 |
+
```
|
| 315 |
+
|
| 316 |
+
## π Advanced Features
|
| 317 |
+
|
| 318 |
+
### Rate Limit Management
|
| 319 |
+
|
| 320 |
+
The package automatically handles rate limits through:
|
| 321 |
+
|
| 322 |
+
- **Provider Cycling**: Rotates between available instances
|
| 323 |
+
- **Cooldown Tracking**: Monitors rate limit windows per provider
|
| 324 |
+
- **Automatic Recovery**: Restores providers when cooldowns expire
|
| 325 |
+
- **Status Monitoring**: Real-time availability tracking
|
| 326 |
+
|
| 327 |
+
### FastAPI Integration
|
| 328 |
+
|
| 329 |
+
Full async/await support for FastAPI applications:
|
| 330 |
+
|
| 331 |
+
```python
|
| 332 |
+
from fastapi import FastAPI
|
| 333 |
+
from one_shotter import get_oneshot_answer
|
| 334 |
+
|
| 335 |
+
app = FastAPI()
|
| 336 |
+
|
| 337 |
+
@app.post("/qa")
|
| 338 |
+
async def question_answer(context: str, questions: list[str]):
|
| 339 |
+
answers = await get_oneshot_answer(context, questions)
|
| 340 |
+
return {"answers": answers}
|
| 341 |
+
```
|
| 342 |
+
|
| 343 |
+
### Error Handling
|
| 344 |
+
|
| 345 |
+
Comprehensive error handling with:
|
| 346 |
+
|
| 347 |
+
- **Automatic Retries**: Built-in retry logic for transient failures
|
| 348 |
+
- **Provider Fallback**: Seamless switching between providers
|
| 349 |
+
- **Graceful Degradation**: Continues operation even with partial failures
|
| 350 |
+
- **Detailed Logging**: Comprehensive error tracking and reporting
|
| 351 |
+
|
| 352 |
+
## π Performance Metrics
|
| 353 |
+
|
| 354 |
+
### Cost Efficiency
|
| 355 |
+
- **Free Tier Optimization**: 200+ questions processed at $0 cost
|
| 356 |
+
- **Smart Provider Selection**: Chooses most cost-effective available provider
|
| 357 |
+
- **Rate Limit Avoidance**: Prevents unnecessary paid API calls
|
| 358 |
+
|
| 359 |
+
### Response Times
|
| 360 |
+
- **Concurrent Processing**: Multiple requests handled simultaneously
|
| 361 |
+
- **Provider Optimization**: Fastest available provider selected first
|
| 362 |
+
- **Caching Support**: LRU cache for frequently used configurations
|
| 363 |
+
|
| 364 |
+
### Reliability
|
| 365 |
+
- **99%+ Uptime**: Multiple provider fallback ensures availability
|
| 366 |
+
- **Error Recovery**: Automatic recovery from rate limits and failures
|
| 367 |
+
- **Status Monitoring**: Real-time health checking of all providers
|
| 368 |
+
|
| 369 |
+
## π οΈ Troubleshooting
|
| 370 |
+
|
| 371 |
+
### Common Issues
|
| 372 |
+
|
| 373 |
+
1. **No Providers Available**
|
| 374 |
+
- Check API key configuration
|
| 375 |
+
- Verify network connectivity
|
| 376 |
+
- Review provider status with `get_provider_status()`
|
| 377 |
+
|
| 378 |
+
2. **Rate Limit Errors**
|
| 379 |
+
- Monitor cooldown status
|
| 380 |
+
- Add more API keys to configuration
|
| 381 |
+
- Use `reset_cooldowns()` for testing
|
| 382 |
+
|
| 383 |
+
3. **Scraping Failures**
|
| 384 |
+
- Check URL accessibility
|
| 385 |
+
- Verify network firewall settings
|
| 386 |
+
- Review timeout configurations
|
| 387 |
+
|
| 388 |
+
### Debug Mode
|
| 389 |
+
|
| 390 |
+
Enable verbose logging for troubleshooting:
|
| 391 |
+
|
| 392 |
+
```python
|
| 393 |
+
import logging
|
| 394 |
+
logging.basicConfig(level=logging.DEBUG)
|
| 395 |
+
|
| 396 |
+
# Get detailed provider information
|
| 397 |
+
info = llm_handler.get_provider_info()
|
| 398 |
+
print(json.dumps(info, indent=2))
|
| 399 |
+
```
|
| 400 |
+
|
| 401 |
+
## π€ Contributing
|
| 402 |
+
|
| 403 |
+
This package is part of the larger ShastraDocs project. For contributions:
|
| 404 |
+
|
| 405 |
+
1. Follow the modular architecture pattern
|
| 406 |
+
2. Maintain async/await compatibility
|
| 407 |
+
3. Add comprehensive error handling
|
| 408 |
+
4. Include type hints and documentation
|
| 409 |
+
5. Test with multiple providers
|
| 410 |
+
|
| 411 |
+
## π License
|
| 412 |
+
|
| 413 |
+
Part of the ShastraDocs project. Refer to the main project license for terms and conditions.
|
LLM/image_answerer.py
CHANGED
|
@@ -9,9 +9,11 @@ from dotenv import load_dotenv
|
|
| 9 |
|
| 10 |
load_dotenv()
|
| 11 |
|
|
|
|
|
|
|
|
|
|
| 12 |
|
| 13 |
-
|
| 14 |
-
genai.configure(api_key=os.getenv("GEMIN_API_KEY_IMAGE"))
|
| 15 |
|
| 16 |
def load_image(image_source: str) -> Image.Image:
|
| 17 |
"""Load image from a URL or local path."""
|
|
|
|
| 9 |
|
| 10 |
load_dotenv()
|
| 11 |
|
| 12 |
+
APIKEY = os.getenv("GEMINI_API_KEY_IMAGE")
|
| 13 |
+
if not APIKEY:
|
| 14 |
+
APIKEY = os.getenv("GEMINI_API_KEY_1")
|
| 15 |
|
| 16 |
+
genai.configure(api_key=APIKEY)
|
|
|
|
| 17 |
|
| 18 |
def load_image(image_source: str) -> Image.Image:
|
| 19 |
"""Load image from a URL or local path."""
|
LLM/lite_llm.py
CHANGED
|
@@ -5,9 +5,14 @@ from typing import Optional
|
|
| 5 |
from dotenv import load_dotenv
|
| 6 |
load_dotenv()
|
| 7 |
|
| 8 |
-
GROQ_API_KEY_LITE = os.getenv("GROQ_API_KEY_LITE"
|
|
|
|
|
|
|
|
|
|
| 9 |
GROQ_MODEL_LITE = "llama3-8b-8192"
|
| 10 |
|
|
|
|
|
|
|
| 11 |
client = Groq(api_key=GROQ_API_KEY_LITE)
|
| 12 |
|
| 13 |
def generate_lite(
|
|
|
|
| 5 |
from dotenv import load_dotenv
|
| 6 |
load_dotenv()
|
| 7 |
|
| 8 |
+
GROQ_API_KEY_LITE = os.getenv("GROQ_API_KEY_LITE", "")
|
| 9 |
+
if GROQ_API_KEY_LITE == "":
|
| 10 |
+
GROQ_API_KEY_LITE = os.getenv("GROQ_API_KEY_LITE")
|
| 11 |
+
|
| 12 |
GROQ_MODEL_LITE = "llama3-8b-8192"
|
| 13 |
|
| 14 |
+
assert GROQ_API_KEY_LITE, "GROQ KEY LITE NOT SET"
|
| 15 |
+
|
| 16 |
client = Groq(api_key=GROQ_API_KEY_LITE)
|
| 17 |
|
| 18 |
def generate_lite(
|
LLM/tabular_answer.py
CHANGED
|
@@ -10,8 +10,12 @@ from dotenv import load_dotenv
|
|
| 10 |
load_dotenv()
|
| 11 |
|
| 12 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
GROQ_LLM = ChatGroq(
|
| 14 |
-
groq_api_key=
|
| 15 |
model_name="qwen/qwen3-32b"
|
| 16 |
)
|
| 17 |
|
|
|
|
| 10 |
load_dotenv()
|
| 11 |
|
| 12 |
|
| 13 |
+
API_KEY = os.environ.get("GROQ_API_KEY_TABULAR")
|
| 14 |
+
if not API_KEY:
|
| 15 |
+
os.environ.get("GROQ_API_KEY_1")
|
| 16 |
+
|
| 17 |
GROQ_LLM = ChatGroq(
|
| 18 |
+
groq_api_key=API_KEY,
|
| 19 |
model_name="qwen/qwen3-32b"
|
| 20 |
)
|
| 21 |
|
RAG/README.md
ADDED
|
@@ -0,0 +1,302 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# RAG Package - Shastra Docs
|
| 2 |
+
|
| 3 |
+
An advanced Retrieval-Augmented Generation (RAG) system designed for intelligent document analysis and question answering, particularly optimized for policy documents and official documentation.
|
| 4 |
+
|
| 5 |
+
## π Overview
|
| 6 |
+
|
| 7 |
+
The RAG package provides a modular, production-ready system that combines multiple retrieval techniques with large language models to deliver accurate, context-aware answers from document collections. It's specifically designed for analyzing official documents, policies, and complex regulatory content.
|
| 8 |
+
|
| 9 |
+
## ποΈ Architecture
|
| 10 |
+
|
| 11 |
+
### Core Components
|
| 12 |
+
|
| 13 |
+
The system follows a modular architecture with six main components:
|
| 14 |
+
|
| 15 |
+
```
|
| 16 |
+
RAG Processor (Orchestrator)
|
| 17 |
+
βββ Query Expansion Manager # Breaks complex queries into focused sub-queries
|
| 18 |
+
βββ Embedding Manager # Handles semantic embeddings using SentenceTransformers
|
| 19 |
+
βββ Search Manager # Hybrid search (BM25 + Semantic) with score fusion
|
| 20 |
+
βββ Reranking Manager # Cross-encoder reranking for relevance refinement
|
| 21 |
+
βββ Context Manager # Multi-perspective context creation
|
| 22 |
+
βββ Answer Generator # LLM-based answer generation with enhanced prompting
|
| 23 |
+
```
|
| 24 |
+
|
| 25 |
+
## π¦ Package Structure
|
| 26 |
+
|
| 27 |
+
```
|
| 28 |
+
rag/
|
| 29 |
+
βββ __init__.py
|
| 30 |
+
βββ advanced_rag_processor.py # Main orchestrator class
|
| 31 |
+
βββ rag_modules/
|
| 32 |
+
βββ __init__.py
|
| 33 |
+
βββ query_expansion.py # Query decomposition and expansion
|
| 34 |
+
βββ embedding_manager.py # Text embedding operations
|
| 35 |
+
βββ search_manager.py # Hybrid search implementation
|
| 36 |
+
βββ reranking_manager.py # Result reranking with cross-encoders
|
| 37 |
+
βββ context_manager.py # Context creation and management
|
| 38 |
+
βββ answer_generator.py # LLM-based answer generation
|
| 39 |
+
```
|
| 40 |
+
|
| 41 |
+
## π§ Key Features
|
| 42 |
+
|
| 43 |
+
### 1. **Advanced Query Processing**
|
| 44 |
+
- **Query Expansion**: Automatically breaks complex questions into focused sub-queries
|
| 45 |
+
- **Multi-aspect Analysis**: Identifies different components (processes, documents, contacts, etc.)
|
| 46 |
+
- **Focused Retrieval**: Each sub-query targets specific information types
|
| 47 |
+
|
| 48 |
+
### 2. **Hybrid Search System**
|
| 49 |
+
- **Semantic Search**: Dense vector similarity using SentenceTransformers
|
| 50 |
+
- **Keyword Search**: BM25 for exact term matching
|
| 51 |
+
- **Score Fusion**: Reciprocal Rank Fusion with weighted combination
|
| 52 |
+
- **Budget Management**: Intelligent distribution of retrieval budget across queries
|
| 53 |
+
|
| 54 |
+
### 3. **Intelligent Reranking**
|
| 55 |
+
- **Cross-encoder Models**: Advanced relevance scoring
|
| 56 |
+
- **Multi-stage Filtering**: Progressive refinement of results
|
| 57 |
+
- **Score Combination**: Weighted fusion of retrieval and reranking scores
|
| 58 |
+
|
| 59 |
+
### 4. **Context-Aware Generation**
|
| 60 |
+
- **Multi-perspective Context**: Equal representation from all sub-queries
|
| 61 |
+
- **Enhanced Prompting**: Specialized prompts for policy and document analysis
|
| 62 |
+
- **Error Handling**: Graceful handling of edge cases and invalid requests
|
| 63 |
+
|
| 64 |
+
### 5. **Production Features**
|
| 65 |
+
- **Resource Management**: Efficient cleanup and memory management
|
| 66 |
+
- **Performance Monitoring**: Detailed timing and usage statistics
|
| 67 |
+
- **Provider Fallback**: Multi-provider LLM support with automatic fallback
|
| 68 |
+
- **Health Monitoring**: System status and component health checks
|
| 69 |
+
|
| 70 |
+
## π¦ Usage
|
| 71 |
+
|
| 72 |
+
### Basic Usage
|
| 73 |
+
|
| 74 |
+
```python
|
| 75 |
+
from rag.advanced_rag_processor import AdvancedRAGProcessor
|
| 76 |
+
|
| 77 |
+
# Initialize the RAG processor
|
| 78 |
+
rag = AdvancedRAGProcessor()
|
| 79 |
+
|
| 80 |
+
# Process a question
|
| 81 |
+
question = "What is the dental claim submission process and required documents?"
|
| 82 |
+
doc_id = "policy_document_2024"
|
| 83 |
+
|
| 84 |
+
answer, timings = await rag.answer_question(question, doc_id)
|
| 85 |
+
print(answer)
|
| 86 |
+
```
|
| 87 |
+
|
| 88 |
+
### Advanced Usage with Monitoring
|
| 89 |
+
|
| 90 |
+
```python
|
| 91 |
+
import logging
|
| 92 |
+
from rag.advanced_rag_processor import AdvancedRAGProcessor
|
| 93 |
+
|
| 94 |
+
# Initialize with logging
|
| 95 |
+
rag = AdvancedRAGProcessor()
|
| 96 |
+
|
| 97 |
+
# Get system information
|
| 98 |
+
system_info = rag.get_system_info()
|
| 99 |
+
print(f"RAG Version: {system_info['version']}")
|
| 100 |
+
|
| 101 |
+
# Process question with detailed tracking
|
| 102 |
+
answer, timings = await rag.answer_question(
|
| 103 |
+
question="How to update surname in policy records?",
|
| 104 |
+
doc_id="hr_policy_2024",
|
| 105 |
+
logger=your_logger,
|
| 106 |
+
request_id="req_123"
|
| 107 |
+
)
|
| 108 |
+
|
| 109 |
+
# Monitor performance
|
| 110 |
+
print(f"Total processing time: {timings['total_pipeline']:.4f}s")
|
| 111 |
+
print(f"Search time: {timings['hybrid_search']:.4f}s")
|
| 112 |
+
print(f"Generation time: {timings['llm_generation']:.4f}s")
|
| 113 |
+
|
| 114 |
+
# Get provider usage statistics
|
| 115 |
+
stats = rag.get_provider_usage_stats()
|
| 116 |
+
print(f"Provider usage: {stats}")
|
| 117 |
+
|
| 118 |
+
# Check system health
|
| 119 |
+
health = rag.get_health_status()
|
| 120 |
+
print(f"System status: {health['status']}")
|
| 121 |
+
```
|
| 122 |
+
|
| 123 |
+
## βοΈ Configuration
|
| 124 |
+
|
| 125 |
+
The system relies on configuration from `config/config.py`:
|
| 126 |
+
|
| 127 |
+
### Key Configuration Options
|
| 128 |
+
|
| 129 |
+
```python
|
| 130 |
+
# Search Configuration
|
| 131 |
+
TOP_K = 9 # Number of chunks to retrieve
|
| 132 |
+
SCORE_THRESHOLD = 0.3 # Minimum relevance score
|
| 133 |
+
ENABLE_HYBRID_SEARCH = True # Enable BM25 + Semantic search
|
| 134 |
+
USE_TOTAL_BUDGET_APPROACH = True # Distribute budget across queries
|
| 135 |
+
|
| 136 |
+
# Query Expansion
|
| 137 |
+
ENABLE_QUERY_EXPANSION = True # Enable query decomposition
|
| 138 |
+
QUERY_EXPANSION_COUNT = 3 # Number of sub-queries to generate
|
| 139 |
+
|
| 140 |
+
# Reranking
|
| 141 |
+
ENABLE_RERANKING = True # Enable cross-encoder reranking
|
| 142 |
+
RERANK_TOP_K = 6 # Number of results to rerank
|
| 143 |
+
|
| 144 |
+
# Models
|
| 145 |
+
EMBEDDING_MODEL = "bge-large-en"
|
| 146 |
+
RERANKER_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
|
| 147 |
+
|
| 148 |
+
# LLM Generation
|
| 149 |
+
TEMPERATURE = 0.1 # LLM temperature
|
| 150 |
+
MAX_TOKENS = 800 # Maximum response tokens
|
| 151 |
+
MAX_CONTEXT_LENGTH = 8000 # Maximum context length
|
| 152 |
+
```
|
| 153 |
+
|
| 154 |
+
### Weight Configuration
|
| 155 |
+
|
| 156 |
+
```python
|
| 157 |
+
# Score fusion weights
|
| 158 |
+
BM25_WEIGHT = 0.3 # Weight for keyword search
|
| 159 |
+
SEMANTIC_WEIGHT = 0.7 # Weight for semantic search
|
| 160 |
+
```
|
| 161 |
+
|
| 162 |
+
## π― Specialized Answer Generation
|
| 163 |
+
|
| 164 |
+
The system includes specialized prompting for different query types:
|
| 165 |
+
|
| 166 |
+
### Supported Query Categories
|
| 167 |
+
|
| 168 |
+
1. **Valid Document Queries**: Comprehensive answers with document references
|
| 169 |
+
2. **Invalid/Out-of-scope**: Polite redirection to document-specific assistance
|
| 170 |
+
3. **Illegal Requests**: Clear refusal with legal context
|
| 171 |
+
4. **Missing Information**: Transparent acknowledgment with available alternatives
|
| 172 |
+
5. **Non-existent Concepts**: Clarification with related valid information
|
| 173 |
+
|
| 174 |
+
## π Performance Monitoring
|
| 175 |
+
|
| 176 |
+
### Timing Breakdown
|
| 177 |
+
|
| 178 |
+
The system provides detailed performance metrics:
|
| 179 |
+
|
| 180 |
+
```python
|
| 181 |
+
timings = {
|
| 182 |
+
'query_expansion': 0.156, # Query decomposition time
|
| 183 |
+
'hybrid_search': 0.423, # Search across all sub-queries
|
| 184 |
+
'reranking': 0.089, # Cross-encoder reranking
|
| 185 |
+
'context_creation': 0.012, # Context assembly
|
| 186 |
+
'llm_generation': 1.245, # Answer generation
|
| 187 |
+
'total_pipeline': 1.925 # End-to-end processing
|
| 188 |
+
}
|
| 189 |
+
```
|
| 190 |
+
|
| 191 |
+
## π Error Handling & Safety
|
| 192 |
+
|
| 193 |
+
### Built-in Safety Features
|
| 194 |
+
|
| 195 |
+
1. **Input Validation**: Comprehensive query validation and sanitization
|
| 196 |
+
2. **Content Filtering**: Detection and handling of inappropriate requests
|
| 197 |
+
3. **Resource Limits**: Protection against excessive resource usage
|
| 198 |
+
4. **Graceful Degradation**: Fallback strategies for component failures
|
| 199 |
+
5. **Provider Fallback**: Automatic switching between LLM providers
|
| 200 |
+
|
| 201 |
+
### Error Recovery
|
| 202 |
+
|
| 203 |
+
```python
|
| 204 |
+
try:
|
| 205 |
+
answer, timings = await rag.answer_question(question, doc_id)
|
| 206 |
+
except Exception as e:
|
| 207 |
+
# System provides graceful error messages
|
| 208 |
+
print(f"Processing failed: {e}")
|
| 209 |
+
|
| 210 |
+
# Check system health
|
| 211 |
+
health = rag.get_health_status()
|
| 212 |
+
if health['status'] == 'degraded':
|
| 213 |
+
# Handle degraded performance
|
| 214 |
+
rag.force_reset_llm_cooldowns()
|
| 215 |
+
```
|
| 216 |
+
|
| 217 |
+
## π§Ή Resource Management
|
| 218 |
+
|
| 219 |
+
### Cleanup Operations
|
| 220 |
+
|
| 221 |
+
```python
|
| 222 |
+
# Cleanup resources when done
|
| 223 |
+
rag.cleanup()
|
| 224 |
+
|
| 225 |
+
# Reset statistics
|
| 226 |
+
rag.reset_provider_stats()
|
| 227 |
+
|
| 228 |
+
# Force reset provider cooldowns (emergency)
|
| 229 |
+
rag.force_reset_llm_cooldowns()
|
| 230 |
+
```
|
| 231 |
+
|
| 232 |
+
## π System Health Monitoring
|
| 233 |
+
|
| 234 |
+
```python
|
| 235 |
+
# Get comprehensive health status
|
| 236 |
+
health = rag.get_health_status()
|
| 237 |
+
|
| 238 |
+
{
|
| 239 |
+
"status": "healthy", # healthy/degraded/error
|
| 240 |
+
"available_llm_providers": 2,
|
| 241 |
+
"total_llm_providers": 3,
|
| 242 |
+
"provider_details": {...},
|
| 243 |
+
"modules_loaded": 6,
|
| 244 |
+
"last_check": 1703123456.789
|
| 245 |
+
}
|
| 246 |
+
```
|
| 247 |
+
|
| 248 |
+
## π§ Dependencies
|
| 249 |
+
|
| 250 |
+
### Core Dependencies
|
| 251 |
+
|
| 252 |
+
- **sentence-transformers**: Embedding and cross-encoder models
|
| 253 |
+
- **qdrant-client**: Vector database operations
|
| 254 |
+
- **rank-bm25**: BM25 implementation for keyword search
|
| 255 |
+
- **numpy**: Numerical operations and score fusion
|
| 256 |
+
|
| 257 |
+
### LLM Integration
|
| 258 |
+
|
| 259 |
+
- Requires configured LLM handler (supports multiple providers)
|
| 260 |
+
- Automatic fallback between providers
|
| 261 |
+
- Configurable temperature and token limits
|
| 262 |
+
|
| 263 |
+
## π Getting Started
|
| 264 |
+
|
| 265 |
+
1. **Install Dependencies**: Ensure all required packages are installed
|
| 266 |
+
2. **Configure Settings**: Update `config/config.py` with your preferences
|
| 267 |
+
3. **Initialize Database**: Ensure document collections are processed and stored
|
| 268 |
+
4. **Initialize RAG**: Create an `AdvancedRAGProcessor` instance
|
| 269 |
+
5. **Process Queries**: Use `answer_question()` method for document Q&A
|
| 270 |
+
|
| 271 |
+
## π Performance Characteristics
|
| 272 |
+
|
| 273 |
+
### Typical Processing Times
|
| 274 |
+
|
| 275 |
+
- **Simple Queries**: 0.5-1.5 seconds
|
| 276 |
+
- **Complex Queries**: 1.5-3.0 seconds
|
| 277 |
+
- **Multi-aspect Queries**: 2.0-4.0 seconds
|
| 278 |
+
|
| 279 |
+
### Resource Usage
|
| 280 |
+
|
| 281 |
+
- **Memory**: ~500MB-1GB (depends on model sizes)
|
| 282 |
+
- **CPU**: Moderate during processing, minimal during idle
|
| 283 |
+
- **Storage**: Vector databases stored locally
|
| 284 |
+
|
| 285 |
+
## π€ Contributing
|
| 286 |
+
|
| 287 |
+
The modular architecture makes it easy to extend and customize:
|
| 288 |
+
|
| 289 |
+
1. **Add New Search Methods**: Extend `SearchManager`
|
| 290 |
+
2. **Custom Rerankers**: Implement new reranking strategies
|
| 291 |
+
3. **Enhanced Prompting**: Modify answer generation prompts
|
| 292 |
+
4. **New Query Types**: Extend query expansion logic
|
| 293 |
+
|
| 294 |
+
---
|
| 295 |
+
|
| 296 |
+
## π License
|
| 297 |
+
|
| 298 |
+
This package is part of the ShastraDocs project. See the main project license for details.
|
| 299 |
+
|
| 300 |
+
|
| 301 |
+
|
| 302 |
+
*This RAG system is optimized for document analysis and policy-related question answering. It provides production-ready performance with comprehensive monitoring and error handling capabilities.*
|
README.md
CHANGED
|
@@ -1,10 +1,643 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: docker
|
| 7 |
-
|
|
|
|
|
|
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: "ShastraDocs"
|
| 3 |
+
emoji: "π"
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: purple
|
| 6 |
sdk: docker
|
| 7 |
+
app_port: 7860
|
| 8 |
+
license: mit
|
| 9 |
+
tags: [rag, document-analysis, llm, enterprise, ai]
|
| 10 |
---
|
| 11 |
|
| 12 |
+
<div align="center">
|
| 13 |
+
|
| 14 |
+
# π ShastraDocs v2
|
| 15 |
+
## Enterprise RAG System for Document Analysis
|
| 16 |
+
|
| 17 |
+

|
| 18 |
+

|
| 19 |
+

|
| 20 |
+

|
| 21 |
+
|
| 22 |
+
**π Production-ready API β’ π 8+ Document Formats β’ π€ Multi-LLM Support β’ β‘ Advanced Retrieval**
|
| 23 |
+
|
| 24 |
+
[**Try the API**](#-quick-start) | [**Full Docs**](https://github.com/Team-DevBytes/ShastraDocs2) | [**GitHub**](https://github.com/Team-DevBytes/ShastraDocs2)
|
| 25 |
+
|
| 26 |
+
</div>
|
| 27 |
+
|
| 28 |
+
---
|
| 29 |
+
|
| 30 |
+
## π Overview
|
| 31 |
+
|
| 32 |
+
ShastraDocs v2 is a production-ready, modular RAG system designed for comprehensive document analysis and intelligent question answering. Built with enterprise requirements in mind, it supports 8+ document formats, features intelligent multi-provider LLM management, and provides advanced retrieval techniques with comprehensive monitoring capabilities.
|
| 33 |
+
|
| 34 |
+
### β¨ Key Highlights
|
| 35 |
+
|
| 36 |
+
- **π― Multi-Format Support**: PDF, DOCX, PPTX, XLSX, Images, Text, CSV, and URLs
|
| 37 |
+
- **β‘ Intelligent Processing**: Automatic format detection with specialized handlers
|
| 38 |
+
- **π Multi-Provider LLM**: Smart rotation between Groq, Gemini, and OpenAI with rate limit handling
|
| 39 |
+
- **π Advanced Retrieval**: Hybrid search with BM25 + semantic search and cross-encoder reranking
|
| 40 |
+
- **π Production Features**: Comprehensive logging, monitoring, and health checks
|
| 41 |
+
- **π³ Docker Ready**: Containerized deployment with HuggingFace Spaces optimization
|
| 42 |
+
- **π° Cost Effective**: Process 200+ questions at $0 cost using free tier rotation
|
| 43 |
+
|
| 44 |
+
## ποΈ System Architecture
|
| 45 |
+
|
| 46 |
+
```
|
| 47 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 48 |
+
β ShastraDocs v2 β
|
| 49 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 50 |
+
β FastAPI REST API (Authentication, Endpoints, Health Checks) β
|
| 51 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 52 |
+
β Multi-Provider LLM Handler (Groq, Gemini, OpenAI) β
|
| 53 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 54 |
+
β Advanced RAG Processor (Query Expansion, Reranking) β
|
| 55 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 56 |
+
β Document Preprocessing (8+ Formats, OCR, Table Extraction) β
|
| 57 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 58 |
+
β Vector Storage & Search (Qdrant, Hybrid Search, Caching) β
|
| 59 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 60 |
+
β Comprehensive Logging & Monitoring (Request Tracking, Stats) β
|
| 61 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
## π¦ Project Structure
|
| 65 |
+
|
| 66 |
+
```
|
| 67 |
+
shastradocs-v2/
|
| 68 |
+
βββ π api/ # FastAPI REST API
|
| 69 |
+
β βββ __init__.py
|
| 70 |
+
β βββ api.py # Main API endpoints and authentication
|
| 71 |
+
βββ π config/ # Centralized configuration
|
| 72 |
+
β βββ __init__.py
|
| 73 |
+
β βββ config.py # Auto-detecting multi-provider configs
|
| 74 |
+
βββ π LLM/ # Multi-provider LLM management
|
| 75 |
+
β βββ __init__.py
|
| 76 |
+
β βββ llm_handler.py # Unified multi-provider handler
|
| 77 |
+
β βββ one_shotter.py # Enhanced QA with web scraping
|
| 78 |
+
β βββ image_answerer.py # Specialized image analysis
|
| 79 |
+
β βββ tabular_answer.py # Structured data handler
|
| 80 |
+
β βββ lite_llm.py # Lightweight handler
|
| 81 |
+
βββ π RAG/ # Advanced retrieval system
|
| 82 |
+
β βββ __init__.py
|
| 83 |
+
β βββ advanced_rag_processor.py # Main RAG orchestrator
|
| 84 |
+
β βββ rag_modules/ # Modular RAG components
|
| 85 |
+
β βββ query_expansion.py # Query decomposition
|
| 86 |
+
β βββ embedding_manager.py # Semantic embeddings
|
| 87 |
+
β βββ search_manager.py # Hybrid search engine
|
| 88 |
+
β βββ reranking_manager.py # Cross-encoder reranking
|
| 89 |
+
β βββ context_manager.py # Context assembly
|
| 90 |
+
β βββ answer_generator.py # LLM answer generation
|
| 91 |
+
βββ π preprocessing/ # Document processing pipeline
|
| 92 |
+
β βββ __init__.py
|
| 93 |
+
β βββ preprocessing.py # Main entry point and CLI
|
| 94 |
+
β βββ preprocessing_modules/ # Specialized extractors
|
| 95 |
+
β βββ modular_preprocessor.py # Main orchestrator
|
| 96 |
+
β βββ file_downloader.py # Universal file downloading
|
| 97 |
+
β βββ pdf_extractor.py # Advanced PDF processing
|
| 98 |
+
β βββ docx_extractor.py # Word document handling
|
| 99 |
+
β βββ pptx_extractor.py # PowerPoint processing
|
| 100 |
+
β βββ xlsx_extractor.py # Excel with OCR support
|
| 101 |
+
β βββ image_extractor.py # Image and table extraction
|
| 102 |
+
β βββ text_chunker.py # Smart text chunking
|
| 103 |
+
β βββ embedding_manager.py # Batch embedding generation
|
| 104 |
+
β βββ vector_storage.py # Qdrant integration
|
| 105 |
+
β βββ metadata_manager.py # Document metadata
|
| 106 |
+
βββ π logger/ # Advanced logging system
|
| 107 |
+
β βββ __init__.py
|
| 108 |
+
β βββ logger.py # In-memory logging with analytics
|
| 109 |
+
βββ π app.py # Application entry point
|
| 110 |
+
βββ π startup.sh # Production startup script
|
| 111 |
+
βββ π Dockerfile # Container configuration
|
| 112 |
+
βββ π requirements.txt # Python dependencies
|
| 113 |
+
βββ π LICENSE # MIT License
|
| 114 |
+
βββ π README.md # This file
|
| 115 |
+
```
|
| 116 |
+
|
| 117 |
+
## π― Core Features
|
| 118 |
+
|
| 119 |
+
### π§ Multi-Provider LLM Management
|
| 120 |
+
|
| 121 |
+
**Smart Rate Limit Handling**
|
| 122 |
+
- Automatic rotation between Groq, Gemini, and OpenAI
|
| 123 |
+
- 60-second cooldown management per provider
|
| 124 |
+
- Intelligent fallback with zero downtime
|
| 125 |
+
- Real-time provider health monitoring
|
| 126 |
+
|
| 127 |
+
**Multi-Instance Support**
|
| 128 |
+
- Up to 10 API keys per provider
|
| 129 |
+
- Custom model assignment per instance
|
| 130 |
+
- Priority-based routing (Groq β Gemini β OpenAI)
|
| 131 |
+
- Cost-effective free tier optimization
|
| 132 |
+
|
| 133 |
+
### π Document Processing Pipeline
|
| 134 |
+
|
| 135 |
+
**Supported Formats**
|
| 136 |
+
| Format | Extensions | Special Features |
|
| 137 |
+
|--------|------------|-----------------|
|
| 138 |
+
| PDF | .pdf | CID font mapping, table extraction, parallel processing |
|
| 139 |
+
| Word | .docx | Text boxes, tables, gridSpan handling |
|
| 140 |
+
| PowerPoint | .pptx | OCR Space API for images, notes extraction |
|
| 141 |
+
| Excel | .xlsx | Cell processing, embedded image OCR |
|
| 142 |
+
| Images | .png, .jpg, .jpeg | Table detection, OCR text extraction |
|
| 143 |
+
| Text | .txt, .csv | Direct processing, structured data handling |
|
| 144 |
+
| URLs | http/https | Google Docs conversion, web scraping |
|
| 145 |
+
|
| 146 |
+
**Advanced Processing**
|
| 147 |
+
- **Smart Chunking**: Sentence-boundary aware with configurable overlap
|
| 148 |
+
- **OCR Integration**: OCR Space API and Tesseract support
|
| 149 |
+
- **Table Extraction**: Automatic detection and markdown formatting
|
| 150 |
+
- **Caching System**: Document-level caching to avoid reprocessing
|
| 151 |
+
- **Parallel Processing**: Multi-threaded operations for efficiency
|
| 152 |
+
|
| 153 |
+
### π Advanced RAG System
|
| 154 |
+
|
| 155 |
+
**Query Processing**
|
| 156 |
+
- **Query Expansion**: Automatic decomposition into focused sub-queries
|
| 157 |
+
- **Multi-aspect Analysis**: Process/document/contact identification
|
| 158 |
+
- **Budget Management**: Intelligent retrieval budget distribution
|
| 159 |
+
|
| 160 |
+
**Hybrid Search Engine**
|
| 161 |
+
- **Semantic Search**: Dense vector similarity (SentenceTransformers)
|
| 162 |
+
- **Keyword Search**: BM25 for exact term matching
|
| 163 |
+
- **Score Fusion**: Reciprocal Rank Fusion with weighted combination
|
| 164 |
+
- **Reranking**: Cross-encoder models for relevance refinement
|
| 165 |
+
|
| 166 |
+
**Context-Aware Generation**
|
| 167 |
+
- **Multi-perspective Context**: Equal representation from sub-queries
|
| 168 |
+
- **Enhanced Prompting**: Specialized prompts for policy documents
|
| 169 |
+
- **Error Handling**: Graceful handling of edge cases
|
| 170 |
+
|
| 171 |
+
### π Production-Ready API
|
| 172 |
+
|
| 173 |
+
**REST Endpoints**
|
| 174 |
+
- `POST /hackrx/run` - Document processing and Q&A
|
| 175 |
+
- `GET /health` - System health monitoring
|
| 176 |
+
- `POST /preprocess` - Batch document preprocessing (admin)
|
| 177 |
+
- `GET /logs` - Request logs export with filtering (admin)
|
| 178 |
+
- `GET /collections` - List processed documents (admin)
|
| 179 |
+
|
| 180 |
+
**Security Features**
|
| 181 |
+
- Bearer token authentication for main endpoints
|
| 182 |
+
- Admin token for administrative functions
|
| 183 |
+
- Request validation using Pydantic models
|
| 184 |
+
- CORS and security headers configuration
|
| 185 |
+
|
| 186 |
+
### π Comprehensive Monitoring
|
| 187 |
+
|
| 188 |
+
**Request Tracking**
|
| 189 |
+
- Unique request ID generation
|
| 190 |
+
- Pipeline stage timing breakdown
|
| 191 |
+
- Per-question performance metrics
|
| 192 |
+
- Success/failure tracking
|
| 193 |
+
|
| 194 |
+
**Performance Analytics**
|
| 195 |
+
- Real-time processing statistics
|
| 196 |
+
- Provider usage distribution
|
| 197 |
+
- Memory and resource monitoring
|
| 198 |
+
- Export capabilities with filtering
|
| 199 |
+
|
| 200 |
+
**Health Monitoring**
|
| 201 |
+
- System component status
|
| 202 |
+
- Provider availability tracking
|
| 203 |
+
- Database connection health
|
| 204 |
+
- Resource usage monitoring
|
| 205 |
+
|
| 206 |
+
## βοΈ Quick Setup
|
| 207 |
+
|
| 208 |
+
### Prerequisites
|
| 209 |
+
|
| 210 |
+
- Python 3.10+
|
| 211 |
+
- Docker (optional)
|
| 212 |
+
- At least one LLM provider API key (Groq/Gemini/OpenAI)
|
| 213 |
+
- OCR Space API key (for PowerPoint images)
|
| 214 |
+
|
| 215 |
+
### π Local Development Setup
|
| 216 |
+
|
| 217 |
+
1. **Clone Repository**
|
| 218 |
+
```bash
|
| 219 |
+
git clone <repository-url>
|
| 220 |
+
cd shastradocs-v2
|
| 221 |
+
```
|
| 222 |
+
|
| 223 |
+
2. **Install Dependencies**
|
| 224 |
+
```bash
|
| 225 |
+
pip install -r requirements.txt
|
| 226 |
+
```
|
| 227 |
+
|
| 228 |
+
3. **Configure Environment**
|
| 229 |
+
Create `.env` file with your API keys:
|
| 230 |
+
```bash
|
| 231 |
+
# === LLM PROVIDERS ===
|
| 232 |
+
# Groq (Primary provider - fastest)
|
| 233 |
+
GROQ_API_KEY_1=your_first_groq_key
|
| 234 |
+
DEFAULT_GROQ_MODEL=qwen/qwen3-32b
|
| 235 |
+
|
| 236 |
+
# Gemini (Secondary provider)
|
| 237 |
+
GEMINI_API_KEY_1=your_gemini_key
|
| 238 |
+
DEFAULT_GEMINI_MODEL=gemini-2.0-flash
|
| 239 |
+
|
| 240 |
+
# OpenAI (Backup provider)
|
| 241 |
+
OPENAI_API_KEY_1=your_openai_key
|
| 242 |
+
DEFAULT_OPENAI_MODEL=gpt-4o-mini
|
| 243 |
+
|
| 244 |
+
#You can add more api keys just change the number
|
| 245 |
+
|
| 246 |
+
|
| 247 |
+
# === Specialized Pipelines ===
|
| 248 |
+
GROQ_API_KEY_TABULAR = "a groq api key" # Optional: If Groq key already exists in handler, but recomended
|
| 249 |
+
GEMINI_API_KEY_IMAGE = "a gemini api" # Optional: If Gemini key already exists in handler, but recomended
|
| 250 |
+
|
| 251 |
+
# === Query Expansion ===
|
| 252 |
+
GROQ_API_KEY_LITE = "a groq api key" # Optional: If Groq key already exists in handler, but recomended
|
| 253 |
+
|
| 254 |
+
# === SERVICES ===
|
| 255 |
+
OCR_SPACE_API_KEY=your_ocr_space_key
|
| 256 |
+
BEARER_TOKEN=your_secure_api_token
|
| 257 |
+
|
| 258 |
+
4. **Run Application**
|
| 259 |
+
```bash
|
| 260 |
+
python app.py
|
| 261 |
+
```
|
| 262 |
+
|
| 263 |
+
### π³ Docker Deployment
|
| 264 |
+
|
| 265 |
+
1. **Build Image**
|
| 266 |
+
```bash
|
| 267 |
+
docker build -t shastradocs-v2 .
|
| 268 |
+
```
|
| 269 |
+
|
| 270 |
+
2. **Run Container**
|
| 271 |
+
```bash
|
| 272 |
+
docker run -p 7860:7860 --env-file .env shastradocs-v2
|
| 273 |
+
```
|
| 274 |
+
|
| 275 |
+
### βοΈ HuggingFace Spaces Deployment
|
| 276 |
+
|
| 277 |
+
The application is optimized for HuggingFace Spaces:
|
| 278 |
+
|
| 279 |
+
1. Upload project files to your Space
|
| 280 |
+
2. Set environment variables in Space settings
|
| 281 |
+
3. The `startup.sh` script handles database initialization
|
| 282 |
+
4. Access via your Space URL
|
| 283 |
+
|
| 284 |
+
## π Usage Examples
|
| 285 |
+
|
| 286 |
+
### Python Client
|
| 287 |
+
|
| 288 |
+
```python
|
| 289 |
+
import httpx
|
| 290 |
+
import asyncio
|
| 291 |
+
|
| 292 |
+
async def analyze_document():
|
| 293 |
+
url = "http://localhost:8000/hackrx/run"
|
| 294 |
+
headers = {"Authorization": "Bearer your_token"}
|
| 295 |
+
|
| 296 |
+
data = {
|
| 297 |
+
"documents": "https://example.com/policy.pdf",
|
| 298 |
+
"questions": [
|
| 299 |
+
"What is the claim submission process?",
|
| 300 |
+
"What documents are required?",
|
| 301 |
+
"Who should I contact for help?"
|
| 302 |
+
]
|
| 303 |
+
}
|
| 304 |
+
|
| 305 |
+
async with httpx.AsyncClient(timeout=600) as client:
|
| 306 |
+
response = await client.post(url, json=data, headers=headers)
|
| 307 |
+
result = response.json()
|
| 308 |
+
|
| 309 |
+
print("π Document Analysis Results:")
|
| 310 |
+
for i, answer in enumerate(result["answers"]):
|
| 311 |
+
print(f"\nQ{i+1}: {data['questions'][i]}")
|
| 312 |
+
print(f"A{i+1}: {answer}")
|
| 313 |
+
|
| 314 |
+
# Performance metrics
|
| 315 |
+
if "pipeline_timings" in result:
|
| 316 |
+
timings = result["pipeline_timings"]
|
| 317 |
+
print(f"\nβ±οΈ Processing Time: {timings.get('total_pipeline', 0):.2f}s")
|
| 318 |
+
|
| 319 |
+
asyncio.run(analyze_document())
|
| 320 |
+
```
|
| 321 |
+
|
| 322 |
+
### cURL Examples
|
| 323 |
+
|
| 324 |
+
```bash
|
| 325 |
+
# Process document with questions
|
| 326 |
+
curl -X POST "http://localhost:8000/hackrx/run" \
|
| 327 |
+
-H "Authorization: Bearer your_token" \
|
| 328 |
+
-H "Content-Type: application/json" \
|
| 329 |
+
-d '{
|
| 330 |
+
"documents": "https://example.com/policy.pdf",
|
| 331 |
+
"questions": [
|
| 332 |
+
"What are the key policy highlights?",
|
| 333 |
+
"How do I submit a claim?"
|
| 334 |
+
]
|
| 335 |
+
}'
|
| 336 |
+
|
| 337 |
+
# Check system health
|
| 338 |
+
curl -X GET "http://localhost:8000/health"
|
| 339 |
+
|
| 340 |
+
# Get request logs (admin)
|
| 341 |
+
curl -X GET "http://localhost:8000/logs?minutes=60&limit=50" \
|
| 342 |
+
-H "Authorization: Bearer 9420689497"
|
| 343 |
+
|
| 344 |
+
# Preprocess document (admin)
|
| 345 |
+
curl -X POST "http://localhost:8000/preprocess" \
|
| 346 |
+
-H "Authorization: Bearer 9420689497" \
|
| 347 |
+
-d "document_url=https://example.com/document.pdf&force=false"
|
| 348 |
+
```
|
| 349 |
+
|
| 350 |
+
### CLI Usage
|
| 351 |
+
|
| 352 |
+
```bash
|
| 353 |
+
# Process single document
|
| 354 |
+
python -m preprocessing --url "https://example.com/document.pdf"
|
| 355 |
+
|
| 356 |
+
# Process multiple documents
|
| 357 |
+
python -m preprocessing --urls-file urls.txt
|
| 358 |
+
|
| 359 |
+
# List processed documents
|
| 360 |
+
python -m preprocessing --list
|
| 361 |
+
|
| 362 |
+
# Show statistics
|
| 363 |
+
python -m preprocessing --stats
|
| 364 |
+
```
|
| 365 |
+
|
| 366 |
+
## ποΈ Configuration Guide
|
| 367 |
+
|
| 368 |
+
### Environment Variables
|
| 369 |
+
|
| 370 |
+
**Required Variables**
|
| 371 |
+
```bash
|
| 372 |
+
# At least one LLM provider
|
| 373 |
+
GROQ_API_KEY_1=your_key # OR
|
| 374 |
+
GEMINI_API_KEY_1=your_key # OR
|
| 375 |
+
OPENAI_API_KEY_1=your_key
|
| 376 |
+
|
| 377 |
+
# Authentication
|
| 378 |
+
BEARER_TOKEN=your_secure_token
|
| 379 |
+
|
| 380 |
+
# OCR for PowerPoint processing
|
| 381 |
+
OCR_SPACE_API_KEY=your_ocr_key
|
| 382 |
+
```
|
| 383 |
+
|
| 384 |
+
**Optional Variables**
|
| 385 |
+
```bash
|
| 386 |
+
# Additional LLM keys (up to 10 per provider)
|
| 387 |
+
GROQ_API_KEY_2=backup_key
|
| 388 |
+
GEMINI_API_KEY_2=backup_key
|
| 389 |
+
|
| 390 |
+
# Custom models per provider
|
| 391 |
+
DEFAULT_GROQ_MODEL=qwen/qwen3-32b
|
| 392 |
+
GROQ_MODEL_1=llama3-70b-8192
|
| 393 |
+
|
| 394 |
+
# API configuration
|
| 395 |
+
API_HOST=0.0.0.0
|
| 396 |
+
API_PORT=8000
|
| 397 |
+
API_RELOAD=false
|
| 398 |
+
|
| 399 |
+
# RAG configuration
|
| 400 |
+
TOP_K=9
|
| 401 |
+
CHUNK_SIZE=1600
|
| 402 |
+
ENABLE_RERANKING=true
|
| 403 |
+
```
|
| 404 |
+
|
| 405 |
+
### Processing Modes
|
| 406 |
+
|
| 407 |
+
The system automatically selects optimal processing modes:
|
| 408 |
+
|
| 409 |
+
**1. Standard RAG Processing**
|
| 410 |
+
- Complex documents requiring full pipeline
|
| 411 |
+
- Vector database storage and hybrid search
|
| 412 |
+
- Best for policy documents, manuals
|
| 413 |
+
|
| 414 |
+
**2. OneShot Processing**
|
| 415 |
+
- Simple text documents
|
| 416 |
+
- Direct LLM processing without vector search
|
| 417 |
+
- Faster for short documents
|
| 418 |
+
|
| 419 |
+
**3. Tabular Analysis**
|
| 420 |
+
- Excel, CSV files with structured data
|
| 421 |
+
- Specialized data analysis prompts
|
| 422 |
+
- Optimized for numerical data
|
| 423 |
+
|
| 424 |
+
**4. Image Processing**
|
| 425 |
+
- Visual content with OCR
|
| 426 |
+
- Table detection in images
|
| 427 |
+
- Automatic cleanup after processing
|
| 428 |
+
|
| 429 |
+
## π Performance Metrics
|
| 430 |
+
|
| 431 |
+
### Processing Speed
|
| 432 |
+
- **Simple Queries**: 0.5-1.5 seconds
|
| 433 |
+
- **Complex Multi-aspect**: 1.5-3.0 seconds
|
| 434 |
+
- **Document Preprocessing**: 2-5 pages/second (PDF)
|
| 435 |
+
- **Embedding Generation**: 100-500 chunks/second
|
| 436 |
+
|
| 437 |
+
### Cost Optimization
|
| 438 |
+
- **Free Tier Usage**: 200+ questions at $0 cost
|
| 439 |
+
- **Provider Rotation**: Automatic cost-effective routing
|
| 440 |
+
- **Rate Limit Avoidance**: Prevents unnecessary paid calls
|
| 441 |
+
- **Intelligent Caching**: Reduces redundant processing
|
| 442 |
+
|
| 443 |
+
### Resource Usage
|
| 444 |
+
- **Memory**: 500MB-1GB (model dependent)
|
| 445 |
+
- **Storage**: Vector databases (~100MB per 1000 documents)
|
| 446 |
+
- **CPU**: Moderate during processing, minimal idle
|
| 447 |
+
|
| 448 |
+
## π οΈ Troubleshooting
|
| 449 |
+
|
| 450 |
+
### Common Issues
|
| 451 |
+
|
| 452 |
+
**1. No LLM Providers Available**
|
| 453 |
+
```python
|
| 454 |
+
# Check provider status
|
| 455 |
+
from LLM.llm_handler import llm_handler
|
| 456 |
+
status = llm_handler.get_provider_status()
|
| 457 |
+
print(f"Available providers: {len(status)}")
|
| 458 |
+
|
| 459 |
+
# Reset cooldowns if needed
|
| 460 |
+
llm_handler.reset_cooldowns()
|
| 461 |
+
```
|
| 462 |
+
|
| 463 |
+
**2. Document Processing Failures**
|
| 464 |
+
```bash
|
| 465 |
+
# Check document accessibility
|
| 466 |
+
curl -I "https://your-document-url.pdf"
|
| 467 |
+
|
| 468 |
+
# Force reprocessing
|
| 469 |
+
curl -X POST "http://localhost:8000/preprocess" \
|
| 470 |
+
-H "Authorization: Bearer admin_token" \
|
| 471 |
+
-d "document_url=your_url&force=true"
|
| 472 |
+
```
|
| 473 |
+
|
| 474 |
+
**3. OCR Space API Issues**
|
| 475 |
+
```bash
|
| 476 |
+
# Verify OCR API key
|
| 477 |
+
export OCR_SPACE_API_KEY="your_key"
|
| 478 |
+
|
| 479 |
+
# Test OCR endpoint
|
| 480 |
+
curl -X POST "https://api.ocr.space/parse/image" \
|
| 481 |
+
-F "apikey=your_key" \
|
| 482 |
+
-F "url=https://example.com/image.jpg"
|
| 483 |
+
```
|
| 484 |
+
|
| 485 |
+
**4. Memory Issues**
|
| 486 |
+
```python
|
| 487 |
+
# Reduce batch sizes in config.py
|
| 488 |
+
BATCH_SIZE = 16
|
| 489 |
+
CHUNK_SIZE = 1200
|
| 490 |
+
```
|
| 491 |
+
|
| 492 |
+
### Debug Mode
|
| 493 |
+
|
| 494 |
+
Enable verbose logging:
|
| 495 |
+
```python
|
| 496 |
+
import logging
|
| 497 |
+
logging.basicConfig(level=logging.DEBUG)
|
| 498 |
+
|
| 499 |
+
# Check system health
|
| 500 |
+
from api.api import app
|
| 501 |
+
# Health check includes detailed component status
|
| 502 |
+
```
|
| 503 |
+
|
| 504 |
+
### Health Monitoring
|
| 505 |
+
|
| 506 |
+
```bash
|
| 507 |
+
# System health check
|
| 508 |
+
curl http://localhost:8000/health
|
| 509 |
+
|
| 510 |
+
# Detailed logs export
|
| 511 |
+
curl -H "Authorization: Bearer admin_token" \
|
| 512 |
+
"http://localhost:8000/logs?minutes=60" > debug_logs.json
|
| 513 |
+
```
|
| 514 |
+
|
| 515 |
+
## π Production Deployment
|
| 516 |
+
|
| 517 |
+
### Docker Production Setup
|
| 518 |
+
|
| 519 |
+
```dockerfile
|
| 520 |
+
# Multi-stage build for optimization
|
| 521 |
+
FROM python:3.10-slim as builder
|
| 522 |
+
WORKDIR /app
|
| 523 |
+
COPY requirements.txt .
|
| 524 |
+
RUN pip install --user -r requirements.txt
|
| 525 |
+
|
| 526 |
+
FROM python:3.10-slim
|
| 527 |
+
COPY --from=builder /root/.local /root/.local
|
| 528 |
+
COPY . /app
|
| 529 |
+
WORKDIR /app
|
| 530 |
+
|
| 531 |
+
# Environment setup
|
| 532 |
+
ENV PATH=/root/.local/bin:$PATH
|
| 533 |
+
ENV HF_HOME=/app/.cache/huggingface
|
| 534 |
+
EXPOSE 7860
|
| 535 |
+
|
| 536 |
+
CMD ["bash", "startup.sh"]
|
| 537 |
+
```
|
| 538 |
+
|
| 539 |
+
### Environment-Specific Configuration
|
| 540 |
+
|
| 541 |
+
**Development**
|
| 542 |
+
```bash
|
| 543 |
+
API_RELOAD=true
|
| 544 |
+
API_HOST=127.0.0.1
|
| 545 |
+
LOG_LEVEL=DEBUG
|
| 546 |
+
```
|
| 547 |
+
|
| 548 |
+
**Staging**
|
| 549 |
+
```bash
|
| 550 |
+
API_RELOAD=false
|
| 551 |
+
API_HOST=0.0.0.0
|
| 552 |
+
LOG_LEVEL=INFO
|
| 553 |
+
```
|
| 554 |
+
|
| 555 |
+
**Production**
|
| 556 |
+
```bash
|
| 557 |
+
API_RELOAD=false
|
| 558 |
+
API_HOST=0.0.0.0
|
| 559 |
+
LOG_LEVEL=WARNING
|
| 560 |
+
# Multiple API keys for redundancy
|
| 561 |
+
GROQ_API_KEY_1=prod_key_1
|
| 562 |
+
GROQ_API_KEY_2=prod_key_2
|
| 563 |
+
```
|
| 564 |
+
|
| 565 |
+
### Monitoring Setup
|
| 566 |
+
|
| 567 |
+
```bash
|
| 568 |
+
# Health check endpoint for load balancers
|
| 569 |
+
curl -f http://localhost:7860/health || exit 1
|
| 570 |
+
|
| 571 |
+
# Prometheus metrics (custom implementation)
|
| 572 |
+
curl http://localhost:7860/metrics
|
| 573 |
+
|
| 574 |
+
# Log aggregation
|
| 575 |
+
curl -H "Authorization: Bearer admin" \
|
| 576 |
+
"http://localhost:7860/logs" | jq '.metadata'
|
| 577 |
+
```
|
| 578 |
+
|
| 579 |
+
## π€ Contributing
|
| 580 |
+
|
| 581 |
+
We welcome contributions! Please follow these guidelines:
|
| 582 |
+
|
| 583 |
+
### Development Setup
|
| 584 |
+
1. Fork the repository
|
| 585 |
+
2. Create feature branch: `git checkout -b feature/amazing-feature`
|
| 586 |
+
3. Follow modular architecture patterns
|
| 587 |
+
4. Maintain async/await compatibility
|
| 588 |
+
5. Add comprehensive error handling
|
| 589 |
+
6. Include type hints and documentation
|
| 590 |
+
|
| 591 |
+
### Code Standards
|
| 592 |
+
- **Python**: Follow PEP 8 style guidelines
|
| 593 |
+
- **Documentation**: Update README for new features
|
| 594 |
+
- **Testing**: Add tests for new components
|
| 595 |
+
- **Error Handling**: Implement graceful error recovery
|
| 596 |
+
|
| 597 |
+
### Pull Request Process
|
| 598 |
+
1. Update documentation
|
| 599 |
+
2. Add tests for new functionality
|
| 600 |
+
3. Ensure all tests pass
|
| 601 |
+
4. Update CHANGELOG.md
|
| 602 |
+
5. Submit PR with detailed description
|
| 603 |
+
|
| 604 |
+
## π Security Considerations
|
| 605 |
+
|
| 606 |
+
### Authentication
|
| 607 |
+
- **Bearer Tokens**: Secure API access with rotation support
|
| 608 |
+
- **Admin Endpoints**: Separate authentication for sensitive operations
|
| 609 |
+
- **Input Validation**: Comprehensive request sanitization
|
| 610 |
+
|
| 611 |
+
### Data Security
|
| 612 |
+
- **No Persistent Storage**: Documents processed in memory only
|
| 613 |
+
- **Automatic Cleanup**: Temporary files removed after processing
|
| 614 |
+
- **Secure Headers**: CORS and security headers configured
|
| 615 |
+
|
| 616 |
+
### Rate Limiting
|
| 617 |
+
- **Request Throttling**: Built-in concurrency limits
|
| 618 |
+
- **Provider Management**: Smart rate limit handling
|
| 619 |
+
- **Graceful Degradation**: Continues operation during issues
|
| 620 |
+
|
| 621 |
+
## π License
|
| 622 |
+
|
| 623 |
+
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
|
| 624 |
+
|
| 625 |
+
**Copyright (c) 2025 Rahul Samedavar and Sambhaji Patil**
|
| 626 |
+
|
| 627 |
+
## π Acknowledgments
|
| 628 |
+
|
| 629 |
+
- **HuggingFace**: For model hosting and Spaces platform
|
| 630 |
+
- **Qdrant**: For vector database capabilities
|
| 631 |
+
- **FastAPI**: For modern API framework
|
| 632 |
+
- **SentenceTransformers**: For embedding models
|
| 633 |
+
- **Community Contributors**: For feedback and improvements
|
| 634 |
+
|
| 635 |
+
---
|
| 636 |
+
|
| 637 |
+
<div align="center">
|
| 638 |
+
|
| 639 |
+
**ShastraDocs v2** - *Enterprise-grade RAG system for intelligent document analysis*
|
| 640 |
+
|
| 641 |
+
[π Star on GitHub](https://github.com/Team-DevBytes/ShastraDocs2)
|
| 642 |
+
|
| 643 |
+
</div>
|
api/README.md
ADDED
|
@@ -0,0 +1,442 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# ShastraDocs API Package
|
| 2 |
+
|
| 3 |
+
A production-ready FastAPI REST API for the ShastraDocs document analysis system. This package provides secure, authenticated endpoints for document processing, question answering, and system management with comprehensive logging and monitoring.
|
| 4 |
+
|
| 5 |
+
## π Overview
|
| 6 |
+
|
| 7 |
+
The API package serves as the main interface for the ShastraDocs RAG system, offering:
|
| 8 |
+
- **Document Processing**: Upload and analyze documents in 8+ formats
|
| 9 |
+
- **Question Answering**: Intelligent responses using advanced RAG techniques
|
| 10 |
+
- **System Management**: Admin endpoints for monitoring and maintenance
|
| 11 |
+
- **Enhanced Logging**: Detailed request tracking and performance analytics
|
| 12 |
+
|
| 13 |
+
## π¦ Package Structure
|
| 14 |
+
|
| 15 |
+
```
|
| 16 |
+
api/
|
| 17 |
+
βββ __init__.py # Package initialization
|
| 18 |
+
βββ api.py # Main FastAPI application with all endpoints
|
| 19 |
+
```
|
| 20 |
+
|
| 21 |
+
## π― Core Features
|
| 22 |
+
|
| 23 |
+
### π Security & Authentication
|
| 24 |
+
- **Bearer Token Authentication**: Secure API access with configurable tokens
|
| 25 |
+
- **Admin Endpoints**: Separate authentication for administrative functions
|
| 26 |
+
- **Request Validation**: Comprehensive input validation using Pydantic models
|
| 27 |
+
|
| 28 |
+
### β‘ Intelligent Document Processing
|
| 29 |
+
- **Optimized Flow**: Checks for pre-processed documents to avoid redundant work
|
| 30 |
+
- **Multi-Format Support**: Handles PDFs, Word docs, presentations, spreadsheets, images
|
| 31 |
+
- **Parallel Processing**: Concurrent question answering with configurable limits
|
| 32 |
+
- **Fallback Handling**: Graceful degradation for unsupported formats
|
| 33 |
+
|
| 34 |
+
### π Advanced Processing Modes
|
| 35 |
+
- **Standard RAG**: Full pipeline for complex documents
|
| 36 |
+
- **OneShot Processing**: Fast processing for simple text documents
|
| 37 |
+
- **Tabular Analysis**: Specialized handling for structured data
|
| 38 |
+
- **Image Analysis**: OCR and visual question answering
|
| 39 |
+
|
| 40 |
+
### π Monitoring & Observability
|
| 41 |
+
- **Real-time Logging**: Detailed request tracking with unique IDs
|
| 42 |
+
- **Performance Metrics**: Pipeline timing breakdown and statistics
|
| 43 |
+
- **Health Monitoring**: System status and component health checks
|
| 44 |
+
- **Export Capabilities**: JSON log export with filtering options
|
| 45 |
+
|
| 46 |
+
## π API Endpoints
|
| 47 |
+
|
| 48 |
+
### Core Processing Endpoints
|
| 49 |
+
|
| 50 |
+
#### `POST /hackrx/run` - Document Processing & QA
|
| 51 |
+
Process documents and answer questions using the advanced RAG pipeline.
|
| 52 |
+
|
| 53 |
+
**Request:**
|
| 54 |
+
```json
|
| 55 |
+
{
|
| 56 |
+
"documents": "https://example.com/policy.pdf",
|
| 57 |
+
"questions": [
|
| 58 |
+
"What is the claim submission process?",
|
| 59 |
+
"What documents are required?",
|
| 60 |
+
"Who should I contact for help?"
|
| 61 |
+
]
|
| 62 |
+
}
|
| 63 |
+
```
|
| 64 |
+
|
| 65 |
+
**Response:**
|
| 66 |
+
```json
|
| 67 |
+
{
|
| 68 |
+
"answers": [
|
| 69 |
+
"The claim submission process involves three main steps...",
|
| 70 |
+
"Required documents include: policy certificate, claim form...",
|
| 71 |
+
"For assistance, contact the customer service team at..."
|
| 72 |
+
]
|
| 73 |
+
}
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
**Features:**
|
| 77 |
+
- β
**Smart Caching**: Reuses pre-processed embeddings
|
| 78 |
+
- β‘ **Parallel Processing**: Handles multiple questions concurrently
|
| 79 |
+
- π **Automatic Fallback**: Switches between processing modes based on document type
|
| 80 |
+
- π **Detailed Timing**: Returns comprehensive performance metrics
|
| 81 |
+
|
| 82 |
+
#### `GET /health` - Health Check
|
| 83 |
+
Simple health check endpoint for monitoring system status.
|
| 84 |
+
|
| 85 |
+
**Response:**
|
| 86 |
+
```json
|
| 87 |
+
{
|
| 88 |
+
"status": "healthy",
|
| 89 |
+
"message": "RAG API is running successfully"
|
| 90 |
+
}
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
### Administrative Endpoints (Admin Token Required)
|
| 94 |
+
|
| 95 |
+
#### `POST /preprocess` - Batch Document Preprocessing
|
| 96 |
+
Pre-process documents for faster future queries.
|
| 97 |
+
|
| 98 |
+
**Parameters:**
|
| 99 |
+
- `document_url`: URL of document to preprocess
|
| 100 |
+
- `force`: Boolean to force reprocessing
|
| 101 |
+
|
| 102 |
+
#### `GET /collections` - List Processed Documents
|
| 103 |
+
Retrieve information about all processed document collections.
|
| 104 |
+
|
| 105 |
+
#### `GET /collections/stats` - Collection Statistics
|
| 106 |
+
Get comprehensive statistics about the document database.
|
| 107 |
+
|
| 108 |
+
### Logging & Monitoring Endpoints (Admin Token Required)
|
| 109 |
+
|
| 110 |
+
#### `GET /logs` - Export Request Logs
|
| 111 |
+
Export detailed API request logs with optional filtering.
|
| 112 |
+
|
| 113 |
+
**Query Parameters:**
|
| 114 |
+
- `limit`: Maximum number of logs to return
|
| 115 |
+
- `minutes`: Get logs from last N minutes
|
| 116 |
+
- `document_url`: Filter by specific document URL
|
| 117 |
+
|
| 118 |
+
**Response:**
|
| 119 |
+
```json
|
| 120 |
+
{
|
| 121 |
+
"export_timestamp": "2024-01-15T10:30:00Z",
|
| 122 |
+
"metadata": {
|
| 123 |
+
"total_requests": 156,
|
| 124 |
+
"successful_requests": 152,
|
| 125 |
+
"success_rate": 97.44,
|
| 126 |
+
"average_processing_time": 2.34
|
| 127 |
+
},
|
| 128 |
+
"logs": [...]
|
| 129 |
+
}
|
| 130 |
+
```
|
| 131 |
+
|
| 132 |
+
#### `GET /logs/summary` - Logs Summary
|
| 133 |
+
Get aggregated statistics and performance metrics.
|
| 134 |
+
|
| 135 |
+
## π§ Configuration
|
| 136 |
+
|
| 137 |
+
### Environment Variables
|
| 138 |
+
|
| 139 |
+
```bash
|
| 140 |
+
# API Configuration
|
| 141 |
+
API_HOST=0.0.0.0
|
| 142 |
+
API_PORT=8000
|
| 143 |
+
API_RELOAD=True
|
| 144 |
+
|
| 145 |
+
# Authentication
|
| 146 |
+
BEARER_TOKEN=your_secure_api_token
|
| 147 |
+
|
| 148 |
+
# LLM Provider Keys (auto-detects multiple keys)
|
| 149 |
+
GROQ_API_KEY_1=your_groq_key_1
|
| 150 |
+
GROQ_API_KEY_2=your_groq_key_2
|
| 151 |
+
GEMINI_API_KEY_1=your_gemini_key_1
|
| 152 |
+
|
| 153 |
+
# OCR Service
|
| 154 |
+
OCR_SPACE_API_KEY=your_ocr_space_key
|
| 155 |
+
```
|
| 156 |
+
|
| 157 |
+
### Key Settings
|
| 158 |
+
|
| 159 |
+
```python
|
| 160 |
+
# Processing Configuration
|
| 161 |
+
SEMAPHORE_COUNT = 5 # Concurrent question processing limit
|
| 162 |
+
TIMEOUT_SECONDS = 600 # Request timeout for large documents
|
| 163 |
+
MAX_RETRIES = 3 # Automatic retry attempts
|
| 164 |
+
|
| 165 |
+
# Authentication
|
| 166 |
+
ADMIN_TOKEN = "9420689497" # Default admin token (change in production)
|
| 167 |
+
BEARER_TOKEN = "your_token" # Main API bearer token
|
| 168 |
+
```
|
| 169 |
+
|
| 170 |
+
## π Usage Examples
|
| 171 |
+
|
| 172 |
+
### Python Client
|
| 173 |
+
|
| 174 |
+
```python
|
| 175 |
+
import httpx
|
| 176 |
+
import asyncio
|
| 177 |
+
|
| 178 |
+
async def process_document():
|
| 179 |
+
url = "http://localhost:8000/hackrx/run"
|
| 180 |
+
headers = {"Authorization": "Bearer your_token"}
|
| 181 |
+
|
| 182 |
+
data = {
|
| 183 |
+
"documents": "https://example.com/policy.pdf",
|
| 184 |
+
"questions": [
|
| 185 |
+
"What is the main policy coverage?",
|
| 186 |
+
"How do I file a claim?"
|
| 187 |
+
]
|
| 188 |
+
}
|
| 189 |
+
|
| 190 |
+
async with httpx.AsyncClient() as client:
|
| 191 |
+
response = await client.post(url, json=data, headers=headers)
|
| 192 |
+
result = response.json()
|
| 193 |
+
|
| 194 |
+
for i, answer in enumerate(result["answers"]):
|
| 195 |
+
print(f"Q{i+1}: {data['questions'][i]}")
|
| 196 |
+
print(f"A{i+1}: {answer}\n")
|
| 197 |
+
|
| 198 |
+
asyncio.run(process_document())
|
| 199 |
+
```
|
| 200 |
+
|
| 201 |
+
### cURL Examples
|
| 202 |
+
|
| 203 |
+
```bash
|
| 204 |
+
# Process document with questions
|
| 205 |
+
curl -X POST "http://localhost:8000/hackrx/run" \
|
| 206 |
+
-H "Authorization: Bearer your_token" \
|
| 207 |
+
-H "Content-Type: application/json" \
|
| 208 |
+
-d '{
|
| 209 |
+
"documents": "https://example.com/document.pdf",
|
| 210 |
+
"questions": ["What is this document about?"]
|
| 211 |
+
}'
|
| 212 |
+
|
| 213 |
+
# Check system health
|
| 214 |
+
curl -X GET "http://localhost:8000/health"
|
| 215 |
+
|
| 216 |
+
# Get recent logs (admin)
|
| 217 |
+
curl -X GET "http://localhost:8000/logs?minutes=60" \
|
| 218 |
+
-H "Authorization: Bearer 9420689497"
|
| 219 |
+
|
| 220 |
+
# Preprocess document (admin)
|
| 221 |
+
curl -X POST "http://localhost:8000/preprocess" \
|
| 222 |
+
-H "Authorization: Bearer 9420689497" \
|
| 223 |
+
-d "document_url=https://example.com/policy.pdf&force=false"
|
| 224 |
+
```
|
| 225 |
+
|
| 226 |
+
## π― Processing Modes
|
| 227 |
+
|
| 228 |
+
### 1. Standard RAG Processing
|
| 229 |
+
For complex documents requiring full pipeline processing:
|
| 230 |
+
- Downloads and processes document
|
| 231 |
+
- Creates embeddings and stores in vector database
|
| 232 |
+
- Uses hybrid search with reranking
|
| 233 |
+
- Returns detailed answers with citations
|
| 234 |
+
|
| 235 |
+
### 2. OneShot Processing
|
| 236 |
+
For simple text documents or when context is sufficient:
|
| 237 |
+
- Processes small documents directly
|
| 238 |
+
- Uses LLM without vector search
|
| 239 |
+
- Faster response times
|
| 240 |
+
- Suitable for short documents or summaries
|
| 241 |
+
|
| 242 |
+
### 3. Tabular Data Processing
|
| 243 |
+
For structured data like spreadsheets and CSV files:
|
| 244 |
+
- Specialized tabular analysis
|
| 245 |
+
- Handles data relationships and calculations
|
| 246 |
+
- Optimized for numerical and categorical data
|
| 247 |
+
- Batch processing for efficiency
|
| 248 |
+
|
| 249 |
+
### 4. Image Processing
|
| 250 |
+
For visual content analysis:
|
| 251 |
+
- OCR text extraction
|
| 252 |
+
- Table detection in images
|
| 253 |
+
- Visual question answering
|
| 254 |
+
- Automatic cleanup of processed images
|
| 255 |
+
|
| 256 |
+
## π Performance Monitoring
|
| 257 |
+
|
| 258 |
+
### Request Lifecycle Tracking
|
| 259 |
+
Each request is tracked with comprehensive metrics:
|
| 260 |
+
|
| 261 |
+
```json
|
| 262 |
+
{
|
| 263 |
+
"request_id": "req_000123",
|
| 264 |
+
"processing_time_seconds": 2.45,
|
| 265 |
+
"pipeline_timings": {
|
| 266 |
+
"query_expansion": 0.156,
|
| 267 |
+
"hybrid_search": 0.423,
|
| 268 |
+
"reranking": 0.089,
|
| 269 |
+
"context_creation": 0.012,
|
| 270 |
+
"llm_generation": 1.245
|
| 271 |
+
},
|
| 272 |
+
"question_timings": [
|
| 273 |
+
{
|
| 274 |
+
"question_index": 0,
|
| 275 |
+
"total_time_seconds": 1.234,
|
| 276 |
+
"pipeline_breakdown": {...}
|
| 277 |
+
}
|
| 278 |
+
]
|
| 279 |
+
}
|
| 280 |
+
```
|
| 281 |
+
|
| 282 |
+
### System Health Metrics
|
| 283 |
+
- **Success Rate**: Percentage of successful requests
|
| 284 |
+
- **Average Response Time**: Mean processing time across requests
|
| 285 |
+
- **Provider Status**: Health of LLM providers
|
| 286 |
+
- **Resource Usage**: Memory and processing statistics
|
| 287 |
+
|
| 288 |
+
## π οΈ Development
|
| 289 |
+
|
| 290 |
+
### Running the API
|
| 291 |
+
|
| 292 |
+
```bash
|
| 293 |
+
# Development mode with auto-reload
|
| 294 |
+
python api/api.py
|
| 295 |
+
|
| 296 |
+
# Production mode with uvicorn
|
| 297 |
+
uvicorn api.api:app --host 0.0.0.0 --port 8000
|
| 298 |
+
|
| 299 |
+
# With specific workers (for production)
|
| 300 |
+
uvicorn api.api:app --host 0.0.0.0 --port 8000 --workers 4
|
| 301 |
+
```
|
| 302 |
+
|
| 303 |
+
### Testing
|
| 304 |
+
|
| 305 |
+
```python
|
| 306 |
+
import pytest
|
| 307 |
+
from fastapi.testclient import TestClient
|
| 308 |
+
from api.api import app
|
| 309 |
+
|
| 310 |
+
client = TestClient(app)
|
| 311 |
+
|
| 312 |
+
def test_health_check():
|
| 313 |
+
response = client.get("/health")
|
| 314 |
+
assert response.status_code == 200
|
| 315 |
+
assert response.json()["status"] == "healthy"
|
| 316 |
+
|
| 317 |
+
def test_process_document():
|
| 318 |
+
headers = {"Authorization": "Bearer your_test_token"}
|
| 319 |
+
data = {
|
| 320 |
+
"documents": "https://example.com/test.pdf",
|
| 321 |
+
"questions": ["What is this about?"]
|
| 322 |
+
}
|
| 323 |
+
|
| 324 |
+
response = client.post("/hackrx/run", json=data, headers=headers)
|
| 325 |
+
assert response.status_code == 200
|
| 326 |
+
assert "answers" in response.json()
|
| 327 |
+
```
|
| 328 |
+
|
| 329 |
+
### Custom Error Handling
|
| 330 |
+
|
| 331 |
+
The API includes comprehensive error handling:
|
| 332 |
+
|
| 333 |
+
```python
|
| 334 |
+
# Example error responses
|
| 335 |
+
{
|
| 336 |
+
"status_code": 401,
|
| 337 |
+
"detail": "Invalid authentication token"
|
| 338 |
+
}
|
| 339 |
+
|
| 340 |
+
{
|
| 341 |
+
"status_code": 500,
|
| 342 |
+
"detail": "Failed to process document: Unsupported file format"
|
| 343 |
+
}
|
| 344 |
+
|
| 345 |
+
{
|
| 346 |
+
"status_code": 503,
|
| 347 |
+
"detail": "RAG system not initialized"
|
| 348 |
+
}
|
| 349 |
+
```
|
| 350 |
+
|
| 351 |
+
## π Security Considerations
|
| 352 |
+
|
| 353 |
+
### Authentication
|
| 354 |
+
- **Bearer Token**: All main endpoints require valid bearer token
|
| 355 |
+
- **Admin Token**: Administrative functions use separate token
|
| 356 |
+
- **Token Validation**: Server-side token verification
|
| 357 |
+
|
| 358 |
+
### Data Security
|
| 359 |
+
- **No Persistent Storage**: Documents processed in memory only
|
| 360 |
+
- **Automatic Cleanup**: Temporary files removed after processing
|
| 361 |
+
- **Secure Headers**: CORS and security headers configured
|
| 362 |
+
|
| 363 |
+
### Rate Limiting
|
| 364 |
+
- **Request Throttling**: Built-in concurrency limits
|
| 365 |
+
- **Provider Management**: Smart rate limit handling for LLM APIs
|
| 366 |
+
- **Graceful Degradation**: Continues operation during provider issues
|
| 367 |
+
|
| 368 |
+
## π Deployment
|
| 369 |
+
|
| 370 |
+
### HuggingFace Spaces
|
| 371 |
+
The API is optimized for HuggingFace Spaces deployment:
|
| 372 |
+
|
| 373 |
+
```python
|
| 374 |
+
# app.py - HuggingFace Spaces entry point
|
| 375 |
+
from api.api import app
|
| 376 |
+
|
| 377 |
+
if __name__ == "__main__":
|
| 378 |
+
import uvicorn
|
| 379 |
+
uvicorn.run(app, host="0.0.0.0", port=7860)
|
| 380 |
+
```
|
| 381 |
+
|
| 382 |
+
### Docker Deployment
|
| 383 |
+
|
| 384 |
+
```dockerfile
|
| 385 |
+
FROM python:3.9-slim
|
| 386 |
+
|
| 387 |
+
WORKDIR /app
|
| 388 |
+
COPY requirements.txt .
|
| 389 |
+
RUN pip install -r requirements.txt
|
| 390 |
+
|
| 391 |
+
COPY . .
|
| 392 |
+
EXPOSE 8000
|
| 393 |
+
|
| 394 |
+
CMD ["uvicorn", "api.api:app", "--host", "0.0.0.0", "--port", "8000"]
|
| 395 |
+
```
|
| 396 |
+
|
| 397 |
+
### Environment-Specific Configuration
|
| 398 |
+
|
| 399 |
+
```bash
|
| 400 |
+
# Development
|
| 401 |
+
export API_RELOAD=true
|
| 402 |
+
export API_HOST=127.0.0.1
|
| 403 |
+
|
| 404 |
+
# Production
|
| 405 |
+
export API_RELOAD=false
|
| 406 |
+
export API_HOST=0.0.0.0
|
| 407 |
+
export API_PORT=8000
|
| 408 |
+
```
|
| 409 |
+
|
| 410 |
+
## π Troubleshooting
|
| 411 |
+
|
| 412 |
+
### Common Issues
|
| 413 |
+
|
| 414 |
+
1. **Authentication Errors**
|
| 415 |
+
- Verify bearer token configuration
|
| 416 |
+
- Check token format in Authorization header
|
| 417 |
+
- Ensure admin token for admin endpoints
|
| 418 |
+
|
| 419 |
+
2. **Processing Failures**
|
| 420 |
+
- Check document URL accessibility
|
| 421 |
+
- Verify file format compatibility
|
| 422 |
+
- Review error logs for specific issues
|
| 423 |
+
|
| 424 |
+
3. **Performance Issues**
|
| 425 |
+
- Monitor semaphore count for concurrency
|
| 426 |
+
- Check LLM provider status
|
| 427 |
+
- Review timeout configurations
|
| 428 |
+
|
| 429 |
+
### Debug Mode
|
| 430 |
+
|
| 431 |
+
```python
|
| 432 |
+
import logging
|
| 433 |
+
logging.basicConfig(level=logging.DEBUG)
|
| 434 |
+
|
| 435 |
+
# Enable detailed logging for troubleshooting
|
| 436 |
+
```
|
| 437 |
+
|
| 438 |
+
---
|
| 439 |
+
|
| 440 |
+
**ShastraDocs API Package** - Production-ready REST API for advanced document analysis and question answering.
|
| 441 |
+
|
| 442 |
+
*Built with FastAPI, featuring comprehensive authentication, monitoring, and error handling for enterprise deployment.*
|
config/README.md
ADDED
|
@@ -0,0 +1,199 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# ShastraDocs Config Package
|
| 2 |
+
|
| 3 |
+
Centralized configuration management for the ShastraDocs RAG system. This package handles all system settings, environment variables, and multi-provider API configurations with automatic detection and validation.
|
| 4 |
+
|
| 5 |
+
## π Overview
|
| 6 |
+
|
| 7 |
+
The Config package provides:
|
| 8 |
+
- **Centralized Configuration**: Single source of truth for all system settings
|
| 9 |
+
- **Auto-Detection**: Automatic discovery of multiple API keys per provider
|
| 10 |
+
- **Environment Management**: Secure handling of API keys and sensitive settings
|
| 11 |
+
- **Provider Configuration**: Smart configuration for Groq, Gemini, and OpenAI providers
|
| 12 |
+
- **Validation**: Built-in validation and fallback mechanisms
|
| 13 |
+
|
| 14 |
+
## π¦ Package Structure
|
| 15 |
+
|
| 16 |
+
```
|
| 17 |
+
config/
|
| 18 |
+
βββ __init__.py # Package initialization
|
| 19 |
+
βββ config.py # Main configuration file with all settings
|
| 20 |
+
```
|
| 21 |
+
|
| 22 |
+
## π― Core Features
|
| 23 |
+
|
| 24 |
+
### π§ Multi-Provider Auto-Detection
|
| 25 |
+
Automatically detects and configures multiple instances of each LLM provider:
|
| 26 |
+
|
| 27 |
+
```python
|
| 28 |
+
# Automatically finds and configures:
|
| 29 |
+
GROQ_API_KEY_1, GROQ_API_KEY_2, ... GROQ_API_KEY_10
|
| 30 |
+
GEMINI_API_KEY_1, GEMINI_API_KEY_2, ... GEMINI_API_KEY_10
|
| 31 |
+
OPENAI_API_KEY_1, OPENAI_API_KEY_2, ... OPENAI_API_KEY_10
|
| 32 |
+
```
|
| 33 |
+
|
| 34 |
+
### βοΈ Intelligent Model Assignment
|
| 35 |
+
- **Default Models**: Configurable default models per provider
|
| 36 |
+
- **Instance-Specific Models**: Custom models for specific API key instances
|
| 37 |
+
- **Fallback Logic**: Automatic fallback to defaults when specific models aren't configured
|
| 38 |
+
|
| 39 |
+
### π Secure Environment Handling
|
| 40 |
+
- **Environment Variable Loading**: Automatic `.env` file processing
|
| 41 |
+
- **Validation**: Required variable checking with clear error messages
|
| 42 |
+
- **Secure Defaults**: Safe fallback values for optional settings
|
| 43 |
+
|
| 44 |
+
## π Configuration Categories
|
| 45 |
+
|
| 46 |
+
### LLM Provider Configuration
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
#### Specialized Pipelines
|
| 50 |
+
```bash
|
| 51 |
+
GROQ_API_KEY_TABULAR = "a groq api key" # Optional: If Groq key already exists in handler, but recomended
|
| 52 |
+
GEMINI_API_KEY_IMAGE = "a gemini api key" # Optional: If Gemini key already exists in handler, but recomended
|
| 53 |
+
```
|
| 54 |
+
|
| 55 |
+
#### Query Expander
|
| 56 |
+
```bash
|
| 57 |
+
GROQ_API_KEY_LITE = "a groq api key" # Optional: If Groq key already exists in handler, but recomended
|
| 58 |
+
```
|
| 59 |
+
|
| 60 |
+
#### Groq Configuration
|
| 61 |
+
```bash
|
| 62 |
+
# Multiple Groq API Keys
|
| 63 |
+
GROQ_API_KEY_1=your_first_groq_key
|
| 64 |
+
GROQ_API_KEY_2=your_second_groq_key
|
| 65 |
+
GROQ_API_KEY_3=your_third_groq_key
|
| 66 |
+
|
| 67 |
+
# Default model for all Groq instances
|
| 68 |
+
DEFAULT_GROQ_MODEL=qwen/qwen3-32b
|
| 69 |
+
|
| 70 |
+
# Instance-specific models (optional)
|
| 71 |
+
GROQ_MODEL_1=llama3-70b-8192
|
| 72 |
+
GROQ_MODEL_2=mixtral-8x7b-32768
|
| 73 |
+
# GROQ_MODEL_3 will use DEFAULT_GROQ_MODEL
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
#### Gemini Configuration
|
| 77 |
+
```bash
|
| 78 |
+
# Multiple Gemini API Keys
|
| 79 |
+
GEMINI_API_KEY_1=your_first_gemini_key
|
| 80 |
+
GEMINI_API_KEY_2=your_second_gemini_key
|
| 81 |
+
|
| 82 |
+
# Default model configuration
|
| 83 |
+
DEFAULT_GEMINI_MODEL=gemini-2.0-flash
|
| 84 |
+
|
| 85 |
+
# Instance-specific models
|
| 86 |
+
GEMINI_MODEL_1=gemini-1.5-pro
|
| 87 |
+
GEMINI_MODEL_2=gemini-2.0-flash
|
| 88 |
+
```
|
| 89 |
+
|
| 90 |
+
#### OpenAI Configuration
|
| 91 |
+
```bash
|
| 92 |
+
# Multiple OpenAI API Keys
|
| 93 |
+
OPENAI_API_KEY_1=your_first_openai_key
|
| 94 |
+
OPENAI_API_KEY_2=your_second_openai_key
|
| 95 |
+
|
| 96 |
+
# Default model configuration
|
| 97 |
+
DEFAULT_OPENAI_MODEL=gpt-4o-mini
|
| 98 |
+
|
| 99 |
+
# Instance-specific models
|
| 100 |
+
OPENAI_MODEL_1=gpt-4o
|
| 101 |
+
OPENAI_MODEL_2=gpt-4-turbo
|
| 102 |
+
```
|
| 103 |
+
|
| 104 |
+
### RAG System Configuration
|
| 105 |
+
|
| 106 |
+
#### Retrieval Settings
|
| 107 |
+
```python
|
| 108 |
+
TOP_K = 9 # Number of chunks to retrieve
|
| 109 |
+
SCORE_THRESHOLD = 0.3 # Minimum relevance score
|
| 110 |
+
RERANK_TOP_K = 7 # Results to rerank
|
| 111 |
+
BM25_WEIGHT = 0.3 # Keyword search weight
|
| 112 |
+
SEMANTIC_WEIGHT = 0.7 # Semantic search weight
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
#### Advanced RAG Features
|
| 116 |
+
```python
|
| 117 |
+
ENABLE_RERANKING = True # Cross-encoder reranking
|
| 118 |
+
ENABLE_HYBRID_SEARCH = True # BM25 + Semantic search
|
| 119 |
+
ENABLE_QUERY_EXPANSION = True # Query decomposition
|
| 120 |
+
QUERY_EXPANSION_COUNT = 3 # Number of sub-queries
|
| 121 |
+
USE_TOTAL_BUDGET_APPROACH = True # Budget distribution
|
| 122 |
+
```
|
| 123 |
+
|
| 124 |
+
#### Processing Configuration
|
| 125 |
+
```python
|
| 126 |
+
CHUNK_SIZE = 1600 # Characters per chunk
|
| 127 |
+
CHUNK_OVERLAP = 400 # Overlap between chunks
|
| 128 |
+
MAX_CONTEXT_LENGTH = 16000 # Maximum context for LLM
|
| 129 |
+
BATCH_SIZE = 4 # Embedding batch size
|
| 130 |
+
```
|
| 131 |
+
|
| 132 |
+
### API Configuration
|
| 133 |
+
```python
|
| 134 |
+
API_HOST = "0.0.0.0" # API server host
|
| 135 |
+
API_PORT = 8000 # API server port
|
| 136 |
+
API_RELOAD = True # Auto-reload in development
|
| 137 |
+
BEARER_TOKEN = "your_token" # API authentication token
|
| 138 |
+
```
|
| 139 |
+
|
| 140 |
+
### External Services
|
| 141 |
+
```python
|
| 142 |
+
OCR_SPACE_API_KEY = "your_ocr_key" # OCR Space API key
|
| 143 |
+
EMBEDDING_MODEL = "bge-large-en" # Sentence transformer model
|
| 144 |
+
RERANKER_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
|
| 145 |
+
```
|
| 146 |
+
|
| 147 |
+
## π― Auto-Detection Logic
|
| 148 |
+
|
| 149 |
+
### Provider Instance Naming
|
| 150 |
+
The system uses a sequence-based naming convention:
|
| 151 |
+
|
| 152 |
+
```python
|
| 153 |
+
sequence = [
|
| 154 |
+
"primary", "secondary", "ternary", "quaternary", "quinary",
|
| 155 |
+
"senary", "septenary", "octonary", "nonary", "denary"
|
| 156 |
+
]
|
| 157 |
+
|
| 158 |
+
# Results in names like:
|
| 159 |
+
# groq-primary, groq-secondary, groq-ternary, ...
|
| 160 |
+
# gemini-primary, gemini-secondary, ...
|
| 161 |
+
# openai-primary, openai-secondary, ...
|
| 162 |
+
```
|
| 163 |
+
|
| 164 |
+
### Configuration Generation Process
|
| 165 |
+
|
| 166 |
+
1. **Scan Environment**: Look for `PROVIDER_API_KEY_1` through `PROVIDER_API_KEY_10`
|
| 167 |
+
2. **Create Instances**: One instance per detected API key
|
| 168 |
+
3. **Assign Models**: Use specific model or fall back to default
|
| 169 |
+
4. **Name Assignment**: Use sequence names for easy identification
|
| 170 |
+
|
| 171 |
+
|
| 172 |
+
## βοΈ Environment Setup Examples
|
| 173 |
+
|
| 174 |
+
### Minimal Configuration (.env)
|
| 175 |
+
```bash
|
| 176 |
+
# Minimum required for basic functionality
|
| 177 |
+
GROQ_API_KEY_1=your_groq_key
|
| 178 |
+
GEMINI_API_KEY_1=your_gemini_key
|
| 179 |
+
OCR_SPACE_API_KEY=your_ocr_key
|
| 180 |
+
BEARER_TOKEN=your_secure_token
|
| 181 |
+
```
|
| 182 |
+
|
| 183 |
+
### Recommended Configuration (.env)
|
| 184 |
+
```bash
|
| 185 |
+
# Development setup with multiple providers
|
| 186 |
+
GROQ_API_KEY_1=your_groq_key_1
|
| 187 |
+
GEMINI_API_KEY_1=your_gemini_key_1
|
| 188 |
+
|
| 189 |
+
GROQ_API_KEY_LITE=groq_api_key_for_query_expansion
|
| 190 |
+
|
| 191 |
+
OCR_SPACE_API_KEY=your_ocr_key
|
| 192 |
+
BEARER_TOKEN=dev_token_123
|
| 193 |
+
```
|
| 194 |
+
|
| 195 |
+
---
|
| 196 |
+
|
| 197 |
+
**ShastraDocs Config Package** - Centralized, secure, and intelligent configuration management for enterprise RAG systems.
|
| 198 |
+
|
| 199 |
+
*Built with auto-detection, validation, and production-ready defaults for seamless deployment across environments.*
|
config/config.py
CHANGED
|
@@ -48,9 +48,6 @@ API_HOST = "0.0.0.0"
|
|
| 48 |
API_PORT = 8000
|
| 49 |
API_RELOAD = True
|
| 50 |
|
| 51 |
-
assert GEMINI_API_KEY, "GEMINI KEY NOT SET"
|
| 52 |
-
assert GROQ_API_KEY, "GROQ KEY NOT SET"
|
| 53 |
-
assert GROQ_API_KEY_LITE, "GROQ KEY LITE NOT SET"
|
| 54 |
|
| 55 |
sequence = ["primary", "secondary", "ternary", "quaternary", "quinary", "senary", "septenary", "octonary", "nonary", "denary"]
|
| 56 |
|
|
@@ -69,7 +66,7 @@ def get_provider_configs():
|
|
| 69 |
# Groq configurations
|
| 70 |
# You can add multiple Groq instances with different API keys
|
| 71 |
# set API KEYS ass GROQ_API_KEY_1, GROQ_API_KEY_2... in your environment variables , .env
|
| 72 |
-
DEFAULT_GROQ_MODEL = "qwen/qwen3-32b"
|
| 73 |
configs["groq"] = [{
|
| 74 |
"name": sequence[i],
|
| 75 |
"api_key": os.getenv(f"GROQ_API_KEY_{i}"),
|
|
@@ -79,7 +76,7 @@ def get_provider_configs():
|
|
| 79 |
# Gemini configurations
|
| 80 |
# You can add multiple Gemini instances with different API keys
|
| 81 |
# set API KEYS ass GEMINI_API_KEY_1, GEMINI_API_KEY_2... in your environment variables , .env
|
| 82 |
-
DEFAULT_GEMINI_MODEL = "gemini-2.0-flash"
|
| 83 |
configs["gemini"] = [{
|
| 84 |
"name": sequence[i],
|
| 85 |
"api_key": os.getenv(f"GEMINI_API_KEY_{i}"),
|
|
@@ -90,7 +87,7 @@ def get_provider_configs():
|
|
| 90 |
# OpenAI configurations
|
| 91 |
# You can add multiple OpenAI instances with different API keys
|
| 92 |
# set API KEYS ass OPENAI_API_KEY_1, OPENAI_API_KEY_2... in your environment variables , .env
|
| 93 |
-
DEFAULT_OPENAI_MODEL = "gpt-4o-mini"
|
| 94 |
configs["openai"] = [{
|
| 95 |
"name": sequence[i],
|
| 96 |
"api_key": os.getenv(f"OPENAI_API_KEY_{i}"),
|
|
|
|
| 48 |
API_PORT = 8000
|
| 49 |
API_RELOAD = True
|
| 50 |
|
|
|
|
|
|
|
|
|
|
| 51 |
|
| 52 |
sequence = ["primary", "secondary", "ternary", "quaternary", "quinary", "senary", "septenary", "octonary", "nonary", "denary"]
|
| 53 |
|
|
|
|
| 66 |
# Groq configurations
|
| 67 |
# You can add multiple Groq instances with different API keys
|
| 68 |
# set API KEYS ass GROQ_API_KEY_1, GROQ_API_KEY_2... in your environment variables , .env
|
| 69 |
+
DEFAULT_GROQ_MODEL = os.getenv("DEFAULT_GROQ_MODEL", "qwen/qwen3-32b")
|
| 70 |
configs["groq"] = [{
|
| 71 |
"name": sequence[i],
|
| 72 |
"api_key": os.getenv(f"GROQ_API_KEY_{i}"),
|
|
|
|
| 76 |
# Gemini configurations
|
| 77 |
# You can add multiple Gemini instances with different API keys
|
| 78 |
# set API KEYS ass GEMINI_API_KEY_1, GEMINI_API_KEY_2... in your environment variables , .env
|
| 79 |
+
DEFAULT_GEMINI_MODEL = os.getenv("DEFAULT_GEMINI_MODEL", "gemini-2.0-flash")
|
| 80 |
configs["gemini"] = [{
|
| 81 |
"name": sequence[i],
|
| 82 |
"api_key": os.getenv(f"GEMINI_API_KEY_{i}"),
|
|
|
|
| 87 |
# OpenAI configurations
|
| 88 |
# You can add multiple OpenAI instances with different API keys
|
| 89 |
# set API KEYS ass OPENAI_API_KEY_1, OPENAI_API_KEY_2... in your environment variables , .env
|
| 90 |
+
DEFAULT_OPENAI_MODEL = os.getenv("DEFAULT_OPENAI_MODEL", "gpt-4o-mini")
|
| 91 |
configs["openai"] = [{
|
| 92 |
"name": sequence[i],
|
| 93 |
"api_key": os.getenv(f"OPENAI_API_KEY_{i}"),
|
logger/README.md
ADDED
|
@@ -0,0 +1,118 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# ShastraDocs Logger Package
|
| 2 |
+
|
| 3 |
+
An advanced in-memory logging system designed for RAG API request tracking with detailed pipeline timing, performance analytics, and comprehensive monitoring capabilities. Built for HuggingFace Spaces and environments without persistent storage.
|
| 4 |
+
|
| 5 |
+
## π Overview
|
| 6 |
+
|
| 7 |
+
The Logger package provides:
|
| 8 |
+
- **Enhanced Request Tracking**: Detailed logging of RAG pipeline stages with precise timing
|
| 9 |
+
- **In-Memory Storage**: No file system dependencies, perfect for HuggingFace Spaces
|
| 10 |
+
- **Performance Analytics**: Comprehensive pipeline performance monitoring
|
| 11 |
+
- **Real-time Monitoring**: Live request tracking with unique identifiers
|
| 12 |
+
- **Export Capabilities**: JSON export with filtering and aggregation options
|
| 13 |
+
|
| 14 |
+
## π¦ Package Structure
|
| 15 |
+
|
| 16 |
+
```
|
| 17 |
+
logger/
|
| 18 |
+
βββ __init__.py # Package initialization
|
| 19 |
+
βββ logger.py # Main logging system with RAGLogger class
|
| 20 |
+
```
|
| 21 |
+
|
| 22 |
+
## π― Core Features
|
| 23 |
+
|
| 24 |
+
### β±οΈ Detailed Pipeline Timing
|
| 25 |
+
Tracks every stage of the RAG pipeline with microsecond precision:
|
| 26 |
+
- Query expansion timing
|
| 27 |
+
- Hybrid search performance
|
| 28 |
+
- Semantic/BM25 search breakdown
|
| 29 |
+
- Reranking duration
|
| 30 |
+
- Context creation time
|
| 31 |
+
- LLM generation timing
|
| 32 |
+
- End-to-end request processing
|
| 33 |
+
|
| 34 |
+
### π Per-Question Analytics
|
| 35 |
+
Individual question processing metrics:
|
| 36 |
+
- Question-specific timing breakdown
|
| 37 |
+
- Pipeline stage performance per question
|
| 38 |
+
- Answer length and complexity tracking
|
| 39 |
+
- Success/failure tracking per question
|
| 40 |
+
|
| 41 |
+
### π Request Lifecycle Management
|
| 42 |
+
Complete request tracking from start to finish:
|
| 43 |
+
- Unique request ID generation
|
| 44 |
+
- Request start/end timestamps
|
| 45 |
+
- Status tracking (success/error/partial)
|
| 46 |
+
- Document preprocessing detection
|
| 47 |
+
- Error message capture
|
| 48 |
+
|
| 49 |
+
## π Core Components
|
| 50 |
+
|
| 51 |
+
### RAGLogger Class
|
| 52 |
+
Main logging orchestrator with comprehensive tracking capabilities.
|
| 53 |
+
|
| 54 |
+
#### Key Methods
|
| 55 |
+
|
| 56 |
+
**Request Lifecycle:**
|
| 57 |
+
```python
|
| 58 |
+
# Start request timing
|
| 59 |
+
request_id = rag_logger.generate_request_id()
|
| 60 |
+
rag_logger.start_request_timing(request_id)
|
| 61 |
+
|
| 62 |
+
# Track pipeline stages
|
| 63 |
+
rag_logger.log_pipeline_stage(request_id, "query_expansion", 0.156)
|
| 64 |
+
rag_logger.log_pipeline_stage(request_id, "hybrid_search", 0.423)
|
| 65 |
+
|
| 66 |
+
# Track individual questions
|
| 67 |
+
rag_logger.log_question_timing(
|
| 68 |
+
request_id, question_index, question, answer,
|
| 69 |
+
duration, pipeline_timings
|
| 70 |
+
)
|
| 71 |
+
|
| 72 |
+
# Complete request
|
| 73 |
+
timing_data = rag_logger.end_request_timing(request_id)
|
| 74 |
+
final_request_id = rag_logger.log_request(
|
| 75 |
+
document_url, questions, answers, processing_time,
|
| 76 |
+
status, error_message, document_id, was_preprocessed, timing_data
|
| 77 |
+
)
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
### LogEntry Dataclass
|
| 81 |
+
Structured data model for log entries:
|
| 82 |
+
|
| 83 |
+
```python
|
| 84 |
+
@dataclass
|
| 85 |
+
class LogEntry:
|
| 86 |
+
timestamp: str # ISO timestamp
|
| 87 |
+
request_id: str # Unique request identifier
|
| 88 |
+
document_url: str # Document URL processed
|
| 89 |
+
questions: List[str] # Questions asked
|
| 90 |
+
answers: List[str] # Answers generated
|
| 91 |
+
processing_time_seconds: float # Total processing time
|
| 92 |
+
total_questions: int # Number of questions
|
| 93 |
+
status: str # success/error/partial
|
| 94 |
+
error_message: Optional[str] # Error details if any
|
| 95 |
+
document_id: Optional[str] # Generated document ID
|
| 96 |
+
was_preprocessed: bool # Whether document was cached
|
| 97 |
+
request_start_time: str # Request start timestamp
|
| 98 |
+
request_end_time: str # Request end timestamp
|
| 99 |
+
pipeline_timings: Dict[str, Any] # Pipeline stage timings
|
| 100 |
+
question_timings: List[Dict[str, Any]] # Per-question timings
|
| 101 |
+
```
|
| 102 |
+
|
| 103 |
+
### PipelineTimings Dataclass
|
| 104 |
+
Detailed timing breakdown for RAG pipeline stages:
|
| 105 |
+
|
| 106 |
+
```python
|
| 107 |
+
@dataclass
|
| 108 |
+
class PipelineTimings:
|
| 109 |
+
query_expansion_time: float = 0.0 # Query decomposition time
|
| 110 |
+
hybrid_search_time: float = 0.0 # Combined search time
|
| 111 |
+
semantic_search_time: float = 0.0 # Vector similarity time
|
| 112 |
+
bm25_search_time: float = 0.0 # Keyword search time
|
| 113 |
+
score_fusion_time: float = 0.0 # Score combination time
|
| 114 |
+
reranking_time: float = 0.0 # Cross-encoder reranking
|
| 115 |
+
context_creation_time: float = 0.0 # Context assembly time
|
| 116 |
+
llm_generation_time: float = 0.0 # Answer generation time
|
| 117 |
+
total_pipeline_time: float = 0.0 # End-to-end pipeline time
|
| 118 |
+
```
|
preprocessing/README.md
ADDED
|
@@ -0,0 +1,362 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# ShastraDocs Preprocessing Package
|
| 2 |
+
|
| 3 |
+
An advanced document preprocessing pipeline for RAG (Retrieval-Augmented Generation) systems. This modular package handles document ingestion, text extraction, chunking, embedding generation, and vector storage for multiple document formats.
|
| 4 |
+
|
| 5 |
+
## π Features
|
| 6 |
+
|
| 7 |
+
### Document Format Support
|
| 8 |
+
- **PDF**: Advanced text extraction with table handling and CID font support (Malayalam, complex scripts)
|
| 9 |
+
- **DOCX**: Complete Word document processing with tables and text boxes
|
| 10 |
+
- **PPTX**: PowerPoint extraction with OCR for images using OCR Space API
|
| 11 |
+
- **XLSX**: Excel spreadsheet processing with image OCR support
|
| 12 |
+
- **Images**: PNG, JPEG, JPG with table detection and OCR
|
| 13 |
+
- **Plain Text**: TXT and CSV file support
|
| 14 |
+
- **URLs**: Direct URL processing and Google Docs conversion
|
| 15 |
+
|
| 16 |
+
### Advanced Processing Capabilities
|
| 17 |
+
- **Smart Text Chunking**: Sentence-boundary aware chunking with configurable overlap
|
| 18 |
+
- **Embedding Generation**: Sentence transformer-based embeddings with batch processing
|
| 19 |
+
- **Vector Storage**: Qdrant integration for efficient similarity search
|
| 20 |
+
- **Table Extraction**: Automated table detection and formatting
|
| 21 |
+
- **OCR Integration**: OCR Space API for image text extraction
|
| 22 |
+
- **Metadata Management**: Comprehensive document metadata tracking
|
| 23 |
+
- **Parallel Processing**: Multi-threaded document processing
|
| 24 |
+
- **Caching**: Intelligent caching to avoid reprocessing
|
| 25 |
+
|
| 26 |
+
## π Package Structure
|
| 27 |
+
|
| 28 |
+
```
|
| 29 |
+
preprocessing/
|
| 30 |
+
βββ __init__.py # Package initialization
|
| 31 |
+
βββ preprocessing.py # Main entry point and CLI
|
| 32 |
+
βββ preprocessing_modules/
|
| 33 |
+
βββ __init__.py
|
| 34 |
+
βββ modular_preprocessor.py # Main orchestrator class
|
| 35 |
+
βββ file_downloader.py # Universal file downloading
|
| 36 |
+
βββ pdf_extractor.py # PDF text extraction
|
| 37 |
+
βββ docx_extractor.py # DOCX processing
|
| 38 |
+
βββ pptx_extractor.py # PowerPoint processing
|
| 39 |
+
βββ xlsx_extractor.py # Excel processing
|
| 40 |
+
βββ image_extractor.py # Image and table extraction
|
| 41 |
+
βββ text_chunker.py # Smart text chunking
|
| 42 |
+
βββ embedding_manager.py # Embedding generation
|
| 43 |
+
βββ vector_storage.py # Qdrant vector database
|
| 44 |
+
βββ metadata_manager.py # Document metadata management
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
## π οΈ Installation
|
| 48 |
+
|
| 49 |
+
### Dependencies
|
| 50 |
+
Note: these packages are already included in requirements.txt of the project
|
| 51 |
+
```bash
|
| 52 |
+
# Core dependencies
|
| 53 |
+
pip install aiohttp asyncio numpy pandas pathlib
|
| 54 |
+
pip install sentence-transformers qdrant-client
|
| 55 |
+
pip install pdfplumber pymupdf python-docx python-pptx openpyxl
|
| 56 |
+
pip install opencv-python pytesseract pillow lxml
|
| 57 |
+
|
| 58 |
+
# For image processing
|
| 59 |
+
pip install opencv-python pytesseract pillow
|
| 60 |
+
|
| 61 |
+
# For document parsing
|
| 62 |
+
pip install pdfplumber pymupdf python-docx python-pptx openpyxl lxml
|
| 63 |
+
```
|
| 64 |
+
|
| 65 |
+
### Environment Variables
|
| 66 |
+
Create a `.env` file with the following:
|
| 67 |
+
```env
|
| 68 |
+
# Required for PowerPoint OCR
|
| 69 |
+
OCR_SPACE_API_KEY=your_ocr_space_api_key
|
| 70 |
+
|
| 71 |
+
# Optional: Custom paths
|
| 72 |
+
OUTPUT_DIR=./vector_db
|
| 73 |
+
EMBEDDING_MODEL=Bge-large-en #or any model
|
| 74 |
+
CHUNK_SIZE=1000
|
| 75 |
+
CHUNK_OVERLAP=200
|
| 76 |
+
BATCH_SIZE=32
|
| 77 |
+
```
|
| 78 |
+
|
| 79 |
+
## π§ Configuration
|
| 80 |
+
|
| 81 |
+
The package uses `config/config.py` for configuration:
|
| 82 |
+
|
| 83 |
+
```python
|
| 84 |
+
# Embedding configuration
|
| 85 |
+
EMBEDDING_MODEL = "Bge-large-en" # Sentence transformer model
|
| 86 |
+
BATCH_SIZE = 32 # Embedding batch size
|
| 87 |
+
|
| 88 |
+
# Chunking configuration
|
| 89 |
+
CHUNK_SIZE = 1600 # Characters per chunk
|
| 90 |
+
CHUNK_OVERLAP = 500 # Overlap between chunks
|
| 91 |
+
|
| 92 |
+
# Storage configuration
|
| 93 |
+
OUTPUT_DIR = "./vector_db" # Vector database directory
|
| 94 |
+
|
| 95 |
+
# OCR configuration (for PPTX images)
|
| 96 |
+
OCR_SPACE_API_KEY = "your_api_key" # OCR Space API key
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
## π Usage
|
| 100 |
+
|
| 101 |
+
### Basic Usage
|
| 102 |
+
|
| 103 |
+
```python
|
| 104 |
+
from preprocessing import ModularDocumentPreprocessor
|
| 105 |
+
|
| 106 |
+
# Initialize preprocessor
|
| 107 |
+
preprocessor = ModularDocumentPreprocessor()
|
| 108 |
+
|
| 109 |
+
# Process a single document
|
| 110 |
+
doc_id = await preprocessor.process_document("https://example.com/document.pdf")
|
| 111 |
+
|
| 112 |
+
# Process multiple documents
|
| 113 |
+
urls = [
|
| 114 |
+
"https://example.com/doc1.pdf",
|
| 115 |
+
"https://example.com/doc2.docx",
|
| 116 |
+
"https://example.com/presentation.pptx"
|
| 117 |
+
]
|
| 118 |
+
results = await preprocessor.process_multiple_documents(urls)
|
| 119 |
+
|
| 120 |
+
# Check processing status
|
| 121 |
+
info = preprocessor.get_document_info("https://example.com/document.pdf")
|
| 122 |
+
print(f"Document processed: {info}")
|
| 123 |
+
```
|
| 124 |
+
|
| 125 |
+
### Document Types and Return Values
|
| 126 |
+
|
| 127 |
+
```python
|
| 128 |
+
# Different document types return different formats
|
| 129 |
+
result = await preprocessor.process_document(url)
|
| 130 |
+
|
| 131 |
+
# Regular documents (PDF, DOCX, TXT)
|
| 132 |
+
if isinstance(result, str):
|
| 133 |
+
doc_id = result # Normal processing, returns document ID
|
| 134 |
+
|
| 135 |
+
# Special cases
|
| 136 |
+
elif isinstance(result, list):
|
| 137 |
+
content, doc_type = result[0], result[1]
|
| 138 |
+
|
| 139 |
+
if doc_type == 'oneshot':
|
| 140 |
+
# Small documents processed as single chunk
|
| 141 |
+
# Use content directly with LLM
|
| 142 |
+
|
| 143 |
+
elif doc_type == 'tabular':
|
| 144 |
+
# Excel/CSV with structured data
|
| 145 |
+
# Use content for data analysis
|
| 146 |
+
|
| 147 |
+
elif doc_type == 'image':
|
| 148 |
+
# Image file - content is file path
|
| 149 |
+
# Process with image_extractor
|
| 150 |
+
|
| 151 |
+
elif doc_type == 'unsupported':
|
| 152 |
+
# File format not supported
|
| 153 |
+
print(f"Unsupported format: {content}")
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
### Advanced Usage
|
| 157 |
+
|
| 158 |
+
```python
|
| 159 |
+
# Force reprocessing
|
| 160 |
+
doc_id = await preprocessor.process_document(url, force_reprocess=True)
|
| 161 |
+
|
| 162 |
+
# Custom timeout for large files
|
| 163 |
+
doc_id = await preprocessor.process_document(url, timeout=600) # 10 minutes
|
| 164 |
+
|
| 165 |
+
# Get system information
|
| 166 |
+
system_info = preprocessor.get_system_info()
|
| 167 |
+
print(f"Embedding model: {system_info['embedding_model']}")
|
| 168 |
+
|
| 169 |
+
# Get collection statistics
|
| 170 |
+
stats = preprocessor.get_collection_stats()
|
| 171 |
+
print(f"Total documents: {stats['total_documents']}")
|
| 172 |
+
print(f"Total chunks: {stats['total_chunks']}")
|
| 173 |
+
|
| 174 |
+
# List all processed documents
|
| 175 |
+
docs = preprocessor.list_processed_documents()
|
| 176 |
+
for doc_id, info in docs.items():
|
| 177 |
+
print(f"{doc_id}: {info['document_url']} ({info['chunk_count']} chunks)")
|
| 178 |
+
|
| 179 |
+
# Cleanup document
|
| 180 |
+
success = preprocessor.cleanup_document(url)
|
| 181 |
+
```
|
| 182 |
+
|
| 183 |
+
### Image Processing
|
| 184 |
+
|
| 185 |
+
```python
|
| 186 |
+
from preprocessing_modules.image_extractor import extract_image
|
| 187 |
+
|
| 188 |
+
# Extract text and tables from images
|
| 189 |
+
text_content = extract_image("path/to/image.png")
|
| 190 |
+
print(text_content)
|
| 191 |
+
|
| 192 |
+
# Output format:
|
| 193 |
+
# ### Non-Table Text:
|
| 194 |
+
# Regular text content from the image
|
| 195 |
+
#
|
| 196 |
+
# ### Table 1 (Markdown):
|
| 197 |
+
# | Column 1 | Column 2 | Column 3 |
|
| 198 |
+
# |----------|----------|----------|
|
| 199 |
+
# | Data 1 | Data 2 | Data 3 |
|
| 200 |
+
```
|
| 201 |
+
|
| 202 |
+
## π― Command Line Interface
|
| 203 |
+
|
| 204 |
+
```bash
|
| 205 |
+
# Process a single document
|
| 206 |
+
python -m preprocessing --url "https://example.com/document.pdf"
|
| 207 |
+
|
| 208 |
+
# Process multiple documents from file
|
| 209 |
+
python -m preprocessing --urls-file urls.txt
|
| 210 |
+
|
| 211 |
+
# Force reprocessing
|
| 212 |
+
python -m preprocessing --url "https://example.com/document.pdf" --force
|
| 213 |
+
|
| 214 |
+
# List processed documents
|
| 215 |
+
python -m preprocessing --list
|
| 216 |
+
|
| 217 |
+
# Show collection statistics
|
| 218 |
+
python -m preprocessing --stats
|
| 219 |
+
```
|
| 220 |
+
|
| 221 |
+
### URLs File Format
|
| 222 |
+
```
|
| 223 |
+
https://example.com/doc1.pdf
|
| 224 |
+
https://example.com/doc2.docx
|
| 225 |
+
https://example.com/presentation.pptx
|
| 226 |
+
https://docs.google.com/document/d/abc123/edit?usp=sharing
|
| 227 |
+
```
|
| 228 |
+
|
| 229 |
+
## ποΈ Architecture
|
| 230 |
+
|
| 231 |
+
### Modular Design
|
| 232 |
+
The package follows a modular architecture with clear separation of concerns:
|
| 233 |
+
|
| 234 |
+
1. **File Downloader**: Handles downloading from various sources with retry logic
|
| 235 |
+
2. **Text Extractors**: Specialized extractors for each document format
|
| 236 |
+
3. **Text Chunker**: Smart chunking with sentence boundary detection
|
| 237 |
+
4. **Embedding Manager**: Generates embeddings using sentence transformers
|
| 238 |
+
5. **Vector Storage**: Manages Qdrant vector database operations
|
| 239 |
+
6. **Metadata Manager**: Tracks document processing metadata
|
| 240 |
+
|
| 241 |
+
### Processing Pipeline
|
| 242 |
+
```
|
| 243 |
+
URL/File β Download β Extract Text β Chunk β Generate Embeddings β Store in Qdrant
|
| 244 |
+
β
|
| 245 |
+
Save Metadata
|
| 246 |
+
```
|
| 247 |
+
|
| 248 |
+
### Document Processing Flow
|
| 249 |
+
|
| 250 |
+
1. **Download**: Securely download document to temporary location
|
| 251 |
+
2. **Format Detection**: Identify document type and select appropriate extractor
|
| 252 |
+
3. **Text Extraction**: Extract text content with format-specific handling
|
| 253 |
+
4. **Chunking**: Split text into overlapping chunks with smart boundaries
|
| 254 |
+
5. **Embedding**: Generate embeddings using sentence transformers
|
| 255 |
+
6. **Storage**: Store embeddings and metadata in Qdrant vector database
|
| 256 |
+
7. **Cleanup**: Remove temporary files and update registries
|
| 257 |
+
|
| 258 |
+
## π Supported Formats
|
| 259 |
+
|
| 260 |
+
| Format | Extension | Features | Special Handling |
|
| 261 |
+
|--------|-----------|----------|------------------|
|
| 262 |
+
| PDF | .pdf | Text, tables, complex scripts | CID font mapping, parallel processing |
|
| 263 |
+
| Word | .docx | Text, tables, text boxes | XML parsing, gridSpan handling |
|
| 264 |
+
| PowerPoint | .pptx | Text, images, tables, notes | OCR Space API for images |
|
| 265 |
+
| Excel | .xlsx | Cells, images | OpenPyXL, OCR for embedded images |
|
| 266 |
+
| Images | .png, .jpg, .jpeg | Text, tables | OpenCV table detection, OCR |
|
| 267 |
+
| Text | .txt, .csv | Plain text | Direct processing |
|
| 268 |
+
| URLs | http/https | Web content | Google Docs conversion |
|
| 269 |
+
|
| 270 |
+
## π Advanced Features
|
| 271 |
+
|
| 272 |
+
### Table Processing
|
| 273 |
+
- Automatic table detection in PDFs and images
|
| 274 |
+
- GridSpan handling for complex table structures
|
| 275 |
+
- Markdown formatting for structured output
|
| 276 |
+
- Cell content extraction with proper spacing
|
| 277 |
+
|
| 278 |
+
### CID Font Support
|
| 279 |
+
- Advanced handling of Malayalam and complex scripts
|
| 280 |
+
- Character mapping resolution
|
| 281 |
+
- Proper spacing and conjunct handling
|
| 282 |
+
- Fallback extraction methods
|
| 283 |
+
|
| 284 |
+
### OCR Integration
|
| 285 |
+
- OCR Space API for PowerPoint images
|
| 286 |
+
- Tesseract OCR for Excel images
|
| 287 |
+
- Batch processing for efficiency
|
| 288 |
+
- Error handling and fallback options
|
| 289 |
+
|
| 290 |
+
### Caching System
|
| 291 |
+
- Document-level caching to avoid reprocessing
|
| 292 |
+
- Chunk caching for repeated operations
|
| 293 |
+
- Temporary file management
|
| 294 |
+
- Automatic cleanup on exit
|
| 295 |
+
|
| 296 |
+
## π‘οΈ Error Handling
|
| 297 |
+
|
| 298 |
+
The package includes comprehensive error handling:
|
| 299 |
+
|
| 300 |
+
- **Network Issues**: Retry logic with exponential backoff
|
| 301 |
+
- **Corrupted Files**: Fallback extraction methods
|
| 302 |
+
- **Memory Issues**: Batch processing and streaming
|
| 303 |
+
- **Format Issues**: Multiple parser fallbacks
|
| 304 |
+
- **OCR Failures**: Graceful degradation with error messages
|
| 305 |
+
|
| 306 |
+
## π Performance
|
| 307 |
+
|
| 308 |
+
### Optimization Features
|
| 309 |
+
- **Parallel Processing**: Multi-threaded document processing
|
| 310 |
+
- **Batch Operations**: Efficient embedding generation
|
| 311 |
+
- **Streaming**: Memory-efficient large file handling
|
| 312 |
+
- **Caching**: Avoid redundant processing
|
| 313 |
+
- **Connection Pooling**: Efficient HTTP operations
|
| 314 |
+
|
| 315 |
+
### Benchmarks
|
| 316 |
+
- **PDF Processing**: ~2-5 pages/second (depends on complexity)
|
| 317 |
+
- **Embedding Generation**: ~100-500 chunks/second (depends on model)
|
| 318 |
+
- **Vector Storage**: ~1000+ vectors/second insertion rate
|
| 319 |
+
|
| 320 |
+
## π§ Troubleshooting
|
| 321 |
+
|
| 322 |
+
### Common Issues
|
| 323 |
+
|
| 324 |
+
1. **OCR Space API Errors**
|
| 325 |
+
```python
|
| 326 |
+
# Ensure API key is set
|
| 327 |
+
export OCR_SPACE_API_KEY="your_key_here"
|
| 328 |
+
```
|
| 329 |
+
|
| 330 |
+
2. **Tesseract Not Found**
|
| 331 |
+
```bash
|
| 332 |
+
# Install tesseract
|
| 333 |
+
apt-get install tesseract-ocr
|
| 334 |
+
# or
|
| 335 |
+
brew install tesseract
|
| 336 |
+
```
|
| 337 |
+
|
| 338 |
+
3. **Memory Issues with Large Files**
|
| 339 |
+
```python
|
| 340 |
+
# Reduce batch size in config
|
| 341 |
+
BATCH_SIZE = 16
|
| 342 |
+
```
|
| 343 |
+
|
| 344 |
+
4. **Vector Database Issues**
|
| 345 |
+
```python
|
| 346 |
+
# Check permissions on OUTPUT_DIR
|
| 347 |
+
# Ensure sufficient disk space
|
| 348 |
+
```
|
| 349 |
+
|
| 350 |
+
### Debug Mode
|
| 351 |
+
```python
|
| 352 |
+
import logging
|
| 353 |
+
logging.basicConfig(level=logging.DEBUG)
|
| 354 |
+
|
| 355 |
+
# Enable detailed logging for troubleshooting
|
| 356 |
+
```
|
| 357 |
+
## π License
|
| 358 |
+
|
| 359 |
+
This package is part of the ShastraDocs project. See the main project license for details.
|
| 360 |
+
|
| 361 |
+
|
| 362 |
+
*This preprocessing package is designed to handle the complex requirements of document processing in RAG systems, with a focus on reliability, performance, and format diversity.*
|