---
title: ShastraDocs
emoji: 📚
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
tags:
  - rag
  - document-analysis
  - llm
  - enterprise
  - ai
---
# ShastraDocs v2

**Enterprise RAG System for Document Analysis**

Production-ready API • 8+ Document Formats • Multi-LLM Support • Advanced Retrieval
## Overview
ShastraDocs v2 is a production-ready, modular RAG system designed for comprehensive document analysis and intelligent question answering. Built with enterprise requirements in mind, it supports 8+ document formats, features intelligent multi-provider LLM management, and provides advanced retrieval techniques with comprehensive monitoring capabilities.
## Key Highlights
- **Multi-Format Support**: PDF, DOCX, PPTX, XLSX, images, text, CSV, and URLs
- **Intelligent Processing**: Automatic format detection with specialized handlers
- **Multi-Provider LLM**: Smart rotation between Groq, Gemini, and OpenAI with rate limit handling
- **Advanced Retrieval**: Hybrid search combining BM25 and semantic search, with cross-encoder reranking
- **Production Features**: Comprehensive logging, monitoring, and health checks
- **Docker Ready**: Containerized deployment optimized for HuggingFace Spaces
- **Cost Effective**: Process 200+ questions at $0 cost using free-tier rotation
## System Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                          ShastraDocs v2                          │
├──────────────────────────────────────────────────────────────────┤
│  FastAPI REST API (Authentication, Endpoints, Health Checks)     │
├──────────────────────────────────────────────────────────────────┤
│  Multi-Provider LLM Handler (Groq, Gemini, OpenAI)               │
├──────────────────────────────────────────────────────────────────┤
│  Advanced RAG Processor (Query Expansion, Reranking)             │
├──────────────────────────────────────────────────────────────────┤
│  Document Preprocessing (8+ Formats, OCR, Table Extraction)      │
├──────────────────────────────────────────────────────────────────┤
│  Vector Storage & Search (Qdrant, Hybrid Search, Caching)        │
├──────────────────────────────────────────────────────────────────┤
│  Comprehensive Logging & Monitoring (Request Tracking, Stats)    │
└──────────────────────────────────────────────────────────────────┘
```
## Project Structure

```
shastradocs-v2/
├── api/                            # FastAPI REST API
│   ├── __init__.py
│   └── api.py                      # Main API endpoints and authentication
├── config/                         # Centralized configuration
│   ├── __init__.py
│   └── config.py                   # Auto-detecting multi-provider configs
├── LLM/                            # Multi-provider LLM management
│   ├── __init__.py
│   ├── llm_handler.py              # Unified multi-provider handler
│   ├── one_shotter.py              # Enhanced QA with web scraping
│   ├── image_answerer.py           # Specialized image analysis
│   ├── tabular_answer.py           # Structured data handler
│   └── lite_llm.py                 # Lightweight handler
├── RAG/                            # Advanced retrieval system
│   ├── __init__.py
│   ├── advanced_rag_processor.py   # Main RAG orchestrator
│   └── rag_modules/                # Modular RAG components
│       ├── query_expansion.py      # Query decomposition
│       ├── embedding_manager.py    # Semantic embeddings
│       ├── search_manager.py       # Hybrid search engine
│       ├── reranking_manager.py    # Cross-encoder reranking
│       ├── context_manager.py      # Context assembly
│       └── answer_generator.py     # LLM answer generation
├── preprocessing/                  # Document processing pipeline
│   ├── __init__.py
│   ├── preprocessing.py            # Main entry point and CLI
│   └── preprocessing_modules/      # Specialized extractors
│       ├── modular_preprocessor.py # Main orchestrator
│       ├── file_downloader.py      # Universal file downloading
│       ├── pdf_extractor.py        # Advanced PDF processing
│       ├── docx_extractor.py       # Word document handling
│       ├── pptx_extractor.py       # PowerPoint processing
│       ├── xlsx_extractor.py       # Excel with OCR support
│       ├── image_extractor.py      # Image and table extraction
│       ├── text_chunker.py         # Smart text chunking
│       ├── embedding_manager.py    # Batch embedding generation
│       ├── vector_storage.py       # Qdrant integration
│       └── metadata_manager.py     # Document metadata
├── logger/                         # Advanced logging system
│   ├── __init__.py
│   └── logger.py                   # In-memory logging with analytics
├── app.py                          # Application entry point
├── startup.sh                      # Production startup script
├── Dockerfile                      # Container configuration
├── requirements.txt                # Python dependencies
├── LICENSE                         # MIT License
└── README.md                       # This file
```
## Core Features

### Multi-Provider LLM Management

**Smart Rate Limit Handling**
- Automatic rotation between Groq, Gemini, and OpenAI
- 60-second cooldown management per provider
- Intelligent fallback with zero downtime
- Real-time provider health monitoring
**Multi-Instance Support**
- Up to 10 API keys per provider
- Custom model assignment per instance
- Priority-based routing (Groq β Gemini β OpenAI)
- Cost-effective free tier optimization
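The rotation and cooldown behaviour described above can be sketched roughly as follows. The class and constant names here are illustrative, not the actual identifiers in `LLM/llm_handler.py`:

```python
import time


class ProviderRotator:
    """Illustrative sketch of priority-based rotation with per-provider cooldowns."""

    COOLDOWN_SECONDS = 60  # the 60-second cooldown mentioned above

    def __init__(self, providers):
        # Providers listed in priority order, e.g. Groq -> Gemini -> OpenAI
        self.providers = providers
        self.cooldown_until = {p: 0.0 for p in providers}

    def mark_rate_limited(self, provider):
        """Put a provider on cooldown after it returns a rate-limit error."""
        self.cooldown_until[provider] = time.time() + self.COOLDOWN_SECONDS

    def next_available(self):
        """Return the highest-priority provider not on cooldown, or None."""
        now = time.time()
        for p in self.providers:
            if now >= self.cooldown_until[p]:
                return p
        return None


rotator = ProviderRotator(["groq", "gemini", "openai"])
rotator.mark_rate_limited("groq")  # e.g. Groq answered with HTTP 429
print(rotator.next_available())    # falls back to "gemini"
```

With all providers on cooldown, `next_available()` returns `None`, which is where a real handler would queue or fail gracefully.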
### Document Processing Pipeline

**Supported Formats**
| Format | Extensions | Special Features |
|---|---|---|
| PDF | .pdf | CID font mapping, table extraction, parallel processing |
| Word | .docx | Text boxes, tables, gridSpan handling |
| PowerPoint | .pptx | OCR Space API for images, notes extraction |
| Excel | .xlsx | Cell processing, embedded image OCR |
| Images | .png, .jpg, .jpeg | Table detection, OCR text extraction |
| Text | .txt, .csv | Direct processing, structured data handling |
| URLs | http/https | Google Docs conversion, web scraping |
**Advanced Processing**
- Smart Chunking: Sentence-boundary aware with configurable overlap
- OCR Integration: OCR Space API and Tesseract support
- Table Extraction: Automatic detection and markdown formatting
- Caching System: Document-level caching to avoid reprocessing
- Parallel Processing: Multi-threaded operations for efficiency
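Sentence-boundary-aware chunking with overlap can look roughly like this. This is a hypothetical helper for illustration, not the actual code in `preprocessing_modules/text_chunker.py`:

```python
import re


def chunk_text(text, chunk_size=1600, overlap_sentences=2):
    """Split text into chunks that respect sentence boundaries.

    A chunk is flushed once adding the next sentence would exceed
    chunk_size characters; the last few sentences are carried into the
    next chunk so context is not lost at the boundary.
    """
    # Naive sentence split on terminal punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        if current and current_len + len(sent) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as overlap
            current = current[-overlap_sentences:] if overlap_sentences else []
            current_len = sum(len(s) + 1 for s in current)
        current.append(sent)
        current_len += len(sent) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A production chunker would also handle abbreviations, very long sentences, and non-prose content, which this sketch deliberately ignores.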
### Advanced RAG System

**Query Processing**
- Query Expansion: Automatic decomposition into focused sub-queries
- Multi-aspect Analysis: Process/document/contact identification
- Budget Management: Intelligent retrieval budget distribution
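A minimal sketch of how a retrieval budget might be split across sub-queries, with earlier (higher-priority) sub-queries absorbing any remainder. This is a hypothetical helper; the distribution logic in `rag_modules/` may weight sub-queries differently:

```python
def distribute_budget(sub_queries, total_top_k=9):
    """Split a total retrieval budget evenly across sub-queries.

    divmod gives the even base share plus a remainder, which is handed
    out one extra slot at a time to the first few sub-queries.
    """
    n = len(sub_queries)
    base, remainder = divmod(total_top_k, n)
    return {
        q: base + (1 if i < remainder else 0)
        for i, q in enumerate(sub_queries)
    }
```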
**Hybrid Search Engine**
- Semantic Search: Dense vector similarity (SentenceTransformers)
- Keyword Search: BM25 for exact term matching
- Score Fusion: Reciprocal Rank Fusion with weighted combination
- Reranking: Cross-encoder models for relevance refinement
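Reciprocal Rank Fusion merges the BM25 and semantic rankings by summing reciprocal-rank scores per document. A minimal sketch; the weighting details of the actual `search_manager.py` may differ, and `k=60` is just the conventional default from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse several ranked lists of doc ids into one ranking.

    `rankings` maps a method name (e.g. "bm25") to an ordered list of
    doc ids, best first. Each occurrence contributes weight / (k + rank)
    to that document's fused score.
    """
    weights = weights or {name: 1.0 for name in rankings}
    scores = {}
    for name, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[name] / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


fused = reciprocal_rank_fusion({
    "semantic": ["d1", "d2", "d3"],
    "bm25":     ["d3", "d1", "d4"],
})
# -> ["d1", "d3", "d2", "d4"]
```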
**Context-Aware Generation**
- Multi-perspective Context: Equal representation from sub-queries
- Enhanced Prompting: Specialized prompts for policy documents
- Error Handling: Graceful handling of edge cases
### Production-Ready API

**REST Endpoints**

- `POST /hackrx/run` - Document processing and Q&A
- `GET /health` - System health monitoring
- `POST /preprocess` - Batch document preprocessing (admin)
- `GET /logs` - Request logs export with filtering (admin)
- `GET /collections` - List processed documents (admin)
**Security Features**
- Bearer token authentication for main endpoints
- Admin token for administrative functions
- Request validation using Pydantic models
- CORS and security headers configuration
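At its core, the bearer-token check behind these endpoints compares the `Authorization` header against the configured token. A minimal sketch, assuming a helper of this shape (the real validation lives in `api/api.py`):

```python
import hmac


def verify_bearer(auth_header, expected_token):
    """Return True when the Authorization header carries the expected token.

    hmac.compare_digest performs a constant-time comparison, which avoids
    leaking information about the token through response-timing differences.
    """
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    supplied = auth_header[len("Bearer "):]
    return hmac.compare_digest(supplied, expected_token)
```

In a FastAPI app this check would typically be wrapped in a dependency so every protected route shares it.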
### Comprehensive Monitoring

**Request Tracking**
- Unique request ID generation
- Pipeline stage timing breakdown
- Per-question performance metrics
- Success/failure tracking
**Performance Analytics**
- Real-time processing statistics
- Provider usage distribution
- Memory and resource monitoring
- Export capabilities with filtering
**Health Monitoring**
- System component status
- Provider availability tracking
- Database connection health
- Resource usage monitoring
## Quick Setup

### Prerequisites
- Python 3.10+
- Docker (optional)
- At least one LLM provider API key (Groq/Gemini/OpenAI)
- OCR Space API key (for PowerPoint images)
### Local Development Setup

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd shastradocs-v2
   ```

2. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Configure the environment.** Create a `.env` file with your API keys:

   ```bash
   # === LLM PROVIDERS ===
   # Groq (primary provider - fastest)
   GROQ_API_KEY_1=your_first_groq_key
   DEFAULT_GROQ_MODEL=qwen/qwen3-32b

   # Gemini (secondary provider)
   GEMINI_API_KEY_1=your_gemini_key
   DEFAULT_GEMINI_MODEL=gemini-2.0-flash

   # OpenAI (backup provider)
   OPENAI_API_KEY_1=your_openai_key
   DEFAULT_OPENAI_MODEL=gpt-4o-mini
   # You can add more API keys by incrementing the number suffix

   # === SPECIALIZED PIPELINES ===
   GROQ_API_KEY_TABULAR=a_groq_api_key    # Optional if a Groq key already exists in the handler, but recommended
   GEMINI_API_KEY_IMAGE=a_gemini_api_key  # Optional if a Gemini key already exists in the handler, but recommended

   # === QUERY EXPANSION ===
   GROQ_API_KEY_LITE=a_groq_api_key       # Optional if a Groq key already exists in the handler, but recommended

   # === SERVICES ===
   OCR_SPACE_API_KEY=your_ocr_space_key
   BEARER_TOKEN=your_secure_api_token
   ```

4. **Run the application**

   ```bash
   python app.py
   ```
### Docker Deployment

**Build the image**

```bash
docker build -t shastradocs-v2 .
```

**Run the container**

```bash
docker run -p 7860:7860 --env-file .env shastradocs-v2
```
### HuggingFace Spaces Deployment

The application is optimized for HuggingFace Spaces:

1. Upload the project files to your Space
2. Set environment variables in the Space settings
3. The `startup.sh` script handles database initialization
4. Access the API via your Space URL
## Usage Examples

### Python Client

```python
import asyncio

import httpx


async def analyze_document():
    url = "http://localhost:8000/hackrx/run"
    headers = {"Authorization": "Bearer your_token"}
    data = {
        "documents": "https://example.com/policy.pdf",
        "questions": [
            "What is the claim submission process?",
            "What documents are required?",
            "Who should I contact for help?",
        ],
    }

    async with httpx.AsyncClient(timeout=600) as client:
        response = await client.post(url, json=data, headers=headers)
    result = response.json()

    print("Document Analysis Results:")
    for i, answer in enumerate(result["answers"]):
        print(f"\nQ{i + 1}: {data['questions'][i]}")
        print(f"A{i + 1}: {answer}")

    # Performance metrics
    if "pipeline_timings" in result:
        timings = result["pipeline_timings"]
        print(f"\nProcessing time: {timings.get('total_pipeline', 0):.2f}s")


asyncio.run(analyze_document())
```
### cURL Examples

```bash
# Process a document with questions
curl -X POST "http://localhost:8000/hackrx/run" \
  -H "Authorization: Bearer your_token" \
  -H "Content-Type: application/json" \
  -d '{
    "documents": "https://example.com/policy.pdf",
    "questions": [
      "What are the key policy highlights?",
      "How do I submit a claim?"
    ]
  }'

# Check system health
curl -X GET "http://localhost:8000/health"

# Get request logs (admin)
curl -X GET "http://localhost:8000/logs?minutes=60&limit=50" \
  -H "Authorization: Bearer your_admin_token"

# Preprocess a document (admin)
curl -X POST "http://localhost:8000/preprocess" \
  -H "Authorization: Bearer your_admin_token" \
  -d "document_url=https://example.com/document.pdf&force=false"
```
### CLI Usage

```bash
# Process a single document
python -m preprocessing --url "https://example.com/document.pdf"

# Process multiple documents
python -m preprocessing --urls-file urls.txt

# List processed documents
python -m preprocessing --list

# Show statistics
python -m preprocessing --stats
```
## Configuration Guide

### Environment Variables

**Required Variables**

```bash
# At least one LLM provider
GROQ_API_KEY_1=your_key    # OR
GEMINI_API_KEY_1=your_key  # OR
OPENAI_API_KEY_1=your_key

# Authentication
BEARER_TOKEN=your_secure_token

# OCR for PowerPoint processing
OCR_SPACE_API_KEY=your_ocr_key
```
**Optional Variables**

```bash
# Additional LLM keys (up to 10 per provider)
GROQ_API_KEY_2=backup_key
GEMINI_API_KEY_2=backup_key

# Custom models per provider
DEFAULT_GROQ_MODEL=qwen/qwen3-32b
GROQ_MODEL_1=llama3-70b-8192

# API configuration
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=false

# RAG configuration
TOP_K=9
CHUNK_SIZE=1600
ENABLE_RERANKING=true
```
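The `*_API_KEY_1` through `*_API_KEY_10` convention can be auto-detected by looping over numbered environment variables. A sketch with hypothetical names; `config/config.py` holds the real logic:

```python
import os
from dataclasses import dataclass, field


@dataclass
class ProviderConfig:
    """Illustrative container for one provider's discovered keys."""
    name: str
    api_keys: list = field(default_factory=list)


def load_provider(name, max_keys=10):
    """Collect GROQ_API_KEY_1..N style variables for one provider."""
    keys = []
    for i in range(1, max_keys + 1):
        key = os.environ.get(f"{name.upper()}_API_KEY_{i}")
        if key:
            keys.append(key)
    return ProviderConfig(name=name, api_keys=keys)
```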
### Processing Modes
The system automatically selects optimal processing modes:
1. **Standard RAG Processing**
   - Complex documents requiring the full pipeline
   - Vector database storage and hybrid search
   - Best for policy documents and manuals

2. **OneShot Processing**
   - Simple text documents
   - Direct LLM processing without vector search
   - Faster for short documents

3. **Tabular Analysis**
   - Excel and CSV files with structured data
   - Specialized data analysis prompts
   - Optimized for numerical data

4. **Image Processing**
   - Visual content with OCR
   - Table detection in images
   - Automatic cleanup after processing
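One simple way to picture the mode selection is a dispatch on file extension. This is a hypothetical illustration only; the real selection also considers document content and length, not just the URL:

```python
from pathlib import Path

# Illustrative extension-to-mode table (assumption, not the actual config)
MODE_BY_EXTENSION = {
    ".pdf": "standard_rag",
    ".docx": "standard_rag",
    ".pptx": "standard_rag",
    ".xlsx": "tabular",
    ".csv": "tabular",
    ".txt": "oneshot",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
}


def select_mode(document_url):
    """Pick a processing mode from the URL's file extension.

    The query string is stripped first so "report.pdf?token=x" still
    resolves to ".pdf"; unknown extensions fall back to the full pipeline.
    """
    ext = Path(document_url.split("?")[0]).suffix.lower()
    return MODE_BY_EXTENSION.get(ext, "standard_rag")
```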
## Performance Metrics

**Processing Speed**
- Simple Queries: 0.5-1.5 seconds
- Complex Multi-aspect: 1.5-3.0 seconds
- Document Preprocessing: 2-5 pages/second (PDF)
- Embedding Generation: 100-500 chunks/second
**Cost Optimization**
- Free Tier Usage: 200+ questions at $0 cost
- Provider Rotation: Automatic cost-effective routing
- Rate Limit Avoidance: Prevents unnecessary paid calls
- Intelligent Caching: Reduces redundant processing
**Resource Usage**
- Memory: 500MB-1GB (model dependent)
- Storage: Vector databases (~100MB per 1000 documents)
- CPU: Moderate during processing, minimal idle
## Troubleshooting

### Common Issues

**1. No LLM Providers Available**

```python
# Check provider status
from LLM.llm_handler import llm_handler

status = llm_handler.get_provider_status()
print(f"Available providers: {len(status)}")

# Reset cooldowns if needed
llm_handler.reset_cooldowns()
```
**2. Document Processing Failures**

```bash
# Check document accessibility
curl -I "https://your-document-url.pdf"

# Force reprocessing
curl -X POST "http://localhost:8000/preprocess" \
  -H "Authorization: Bearer your_admin_token" \
  -d "document_url=your_url&force=true"
```
**3. OCR Space API Issues**

```bash
# Verify the OCR API key is set
export OCR_SPACE_API_KEY="your_key"

# Test the OCR endpoint directly
curl -X POST "https://api.ocr.space/parse/image" \
  -F "apikey=your_key" \
  -F "url=https://example.com/image.jpg"
```
**4. Memory Issues**

```python
# Reduce batch sizes in config.py
BATCH_SIZE = 16
CHUNK_SIZE = 1200
```
### Debug Mode

Enable verbose logging:

```python
import logging

logging.basicConfig(level=logging.DEBUG)

# The /health endpoint exposes detailed component status
from api.api import app
```
### Health Checks

```bash
# System health check
curl http://localhost:8000/health

# Detailed logs export
curl -H "Authorization: Bearer your_admin_token" \
  "http://localhost:8000/logs?minutes=60" > debug_logs.json
```
## Production Deployment

### Docker Production Setup

```dockerfile
# Multi-stage build for a smaller final image
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt

FROM python:3.10-slim
COPY --from=builder /root/.local /root/.local
COPY . /app
WORKDIR /app

# Environment setup
ENV PATH=/root/.local/bin:$PATH
ENV HF_HOME=/app/.cache/huggingface

EXPOSE 7860
CMD ["bash", "startup.sh"]
```
### Environment-Specific Configuration

**Development**

```bash
API_RELOAD=true
API_HOST=127.0.0.1
LOG_LEVEL=DEBUG
```

**Staging**

```bash
API_RELOAD=false
API_HOST=0.0.0.0
LOG_LEVEL=INFO
```

**Production**

```bash
API_RELOAD=false
API_HOST=0.0.0.0
LOG_LEVEL=WARNING

# Multiple API keys for redundancy
GROQ_API_KEY_1=prod_key_1
GROQ_API_KEY_2=prod_key_2
```
### Monitoring Setup

```bash
# Health check endpoint for load balancers
curl -f http://localhost:7860/health || exit 1

# Prometheus metrics (custom implementation)
curl http://localhost:7860/metrics

# Log aggregation
curl -H "Authorization: Bearer your_admin_token" \
  "http://localhost:7860/logs" | jq '.metadata'
```
## Contributing

We welcome contributions! Please follow these guidelines:

### Development Setup

- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Follow the modular architecture patterns
- Maintain async/await compatibility
- Add comprehensive error handling
- Include type hints and documentation
### Code Standards
- Python: Follow PEP 8 style guidelines
- Documentation: Update README for new features
- Testing: Add tests for new components
- Error Handling: Implement graceful error recovery
### Pull Request Process
- Update documentation
- Add tests for new functionality
- Ensure all tests pass
- Update CHANGELOG.md
- Submit PR with detailed description
## Security Considerations

### Authentication
- Bearer Tokens: Secure API access with rotation support
- Admin Endpoints: Separate authentication for sensitive operations
- Input Validation: Comprehensive request sanitization
### Data Security
- No Persistent Storage: Documents processed in memory only
- Automatic Cleanup: Temporary files removed after processing
- Secure Headers: CORS and security headers configured
### Rate Limiting
- Request Throttling: Built-in concurrency limits
- Provider Management: Smart rate limit handling
- Graceful Degradation: Continues operation during issues
## License
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2025 Rahul Samedavar and Sambhaji Patil
## Acknowledgments
- HuggingFace: For model hosting and Spaces platform
- Qdrant: For vector database capabilities
- FastAPI: For modern API framework
- SentenceTransformers: For embedding models
- Community Contributors: For feedback and improvements
---

**ShastraDocs v2** - Enterprise-grade RAG system for intelligent document analysis