---
title: ShastraDocs
emoji: πŸ“š
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
tags:
  - rag
  - document-analysis
  - llm
  - enterprise
  - ai
---

πŸ“š ShastraDocs v2

Enterprise RAG System for Document Analysis


πŸš€ Production-ready API β€’ πŸ“„ 8+ Document Formats β€’ πŸ€– Multi-LLM Support β€’ ⚑ Advanced Retrieval

Try the API | Full Docs | GitHub


πŸš€ Overview

ShastraDocs v2 is a production-ready, modular RAG system designed for comprehensive document analysis and intelligent question answering. Built with enterprise requirements in mind, it supports 8+ document formats, features intelligent multi-provider LLM management, and provides advanced retrieval techniques with comprehensive monitoring capabilities.

✨ Key Highlights

  • 🎯 Multi-Format Support: PDF, DOCX, PPTX, XLSX, Images, Text, CSV, and URLs
  • ⚑ Intelligent Processing: Automatic format detection with specialized handlers
  • πŸ”„ Multi-Provider LLM: Smart rotation between Groq, Gemini, and OpenAI with rate limit handling
  • πŸ” Advanced Retrieval: Hybrid search with BM25 + semantic search and cross-encoder reranking
  • πŸ“Š Production Features: Comprehensive logging, monitoring, and health checks
  • 🐳 Docker Ready: Containerized deployment with HuggingFace Spaces optimization
  • πŸ’° Cost Effective: Process 200+ questions at $0 cost using free tier rotation

πŸ—οΈ System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        ShastraDocs v2                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  FastAPI REST API (Authentication, Endpoints, Health Checks)    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚     Multi-Provider LLM Handler (Groq, Gemini, OpenAI)           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚        Advanced RAG Processor (Query Expansion, Reranking)      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   Document Preprocessing (8+ Formats, OCR, Table Extraction)    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚    Vector Storage & Search (Qdrant, Hybrid Search, Caching)     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Comprehensive Logging & Monitoring (Request Tracking, Stats)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“¦ Project Structure

shastradocs-v2/
β”œβ”€β”€ πŸ“ api/                          # FastAPI REST API
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── api.py                       # Main API endpoints and authentication
β”œβ”€β”€ πŸ“ config/                       # Centralized configuration
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── config.py                    # Auto-detecting multi-provider configs
β”œβ”€β”€ πŸ“ LLM/                         # Multi-provider LLM management
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ llm_handler.py              # Unified multi-provider handler
β”‚   β”œβ”€β”€ one_shotter.py              # Enhanced QA with web scraping
β”‚   β”œβ”€β”€ image_answerer.py           # Specialized image analysis
β”‚   β”œβ”€β”€ tabular_answer.py           # Structured data handler
β”‚   └── lite_llm.py                 # Lightweight handler
β”œβ”€β”€ πŸ“ RAG/                         # Advanced retrieval system
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ advanced_rag_processor.py   # Main RAG orchestrator
β”‚   └── rag_modules/                # Modular RAG components
β”‚       β”œβ”€β”€ query_expansion.py      # Query decomposition
β”‚       β”œβ”€β”€ embedding_manager.py    # Semantic embeddings
β”‚       β”œβ”€β”€ search_manager.py       # Hybrid search engine
β”‚       β”œβ”€β”€ reranking_manager.py    # Cross-encoder reranking
β”‚       β”œβ”€β”€ context_manager.py      # Context assembly
β”‚       └── answer_generator.py     # LLM answer generation
β”œβ”€β”€ πŸ“ preprocessing/               # Document processing pipeline
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ preprocessing.py            # Main entry point and CLI
β”‚   └── preprocessing_modules/      # Specialized extractors
β”‚       β”œβ”€β”€ modular_preprocessor.py # Main orchestrator
β”‚       β”œβ”€β”€ file_downloader.py      # Universal file downloading
β”‚       β”œβ”€β”€ pdf_extractor.py        # Advanced PDF processing
β”‚       β”œβ”€β”€ docx_extractor.py       # Word document handling
β”‚       β”œβ”€β”€ pptx_extractor.py       # PowerPoint processing
β”‚       β”œβ”€β”€ xlsx_extractor.py       # Excel with OCR support
β”‚       β”œβ”€β”€ image_extractor.py      # Image and table extraction
β”‚       β”œβ”€β”€ text_chunker.py         # Smart text chunking
β”‚       β”œβ”€β”€ embedding_manager.py    # Batch embedding generation
β”‚       β”œβ”€β”€ vector_storage.py       # Qdrant integration
β”‚       └── metadata_manager.py     # Document metadata
β”œβ”€β”€ πŸ“ logger/                      # Advanced logging system
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── logger.py                   # In-memory logging with analytics
β”œβ”€β”€ πŸ“„ app.py                       # Application entry point
β”œβ”€β”€ πŸ“„ startup.sh                   # Production startup script
β”œβ”€β”€ πŸ“„ Dockerfile                   # Container configuration
β”œβ”€β”€ πŸ“„ requirements.txt             # Python dependencies
β”œβ”€β”€ πŸ“„ LICENSE                      # MIT License
└── πŸ“„ README.md                    # This file

🎯 Core Features

πŸ”§ Multi-Provider LLM Management

Smart Rate Limit Handling

  • Automatic rotation between Groq, Gemini, and OpenAI
  • 60-second cooldown management per provider
  • Intelligent fallback with zero downtime
  • Real-time provider health monitoring

Multi-Instance Support

  • Up to 10 API keys per provider
  • Custom model assignment per instance
  • Priority-based routing (Groq β†’ Gemini β†’ OpenAI)
  • Cost-effective free tier optimization

πŸ“‹ Document Processing Pipeline

Supported Formats

| Format     | Extensions             | Special Features                                          |
|------------|------------------------|-----------------------------------------------------------|
| PDF        | .pdf                   | CID font mapping, table extraction, parallel processing   |
| Word       | .docx                  | Text boxes, tables, gridSpan handling                     |
| PowerPoint | .pptx                  | OCR Space API for images, notes extraction                |
| Excel      | .xlsx                  | Cell processing, embedded image OCR                       |
| Images     | .png, .jpg, .jpeg      | Table detection, OCR text extraction                      |
| Text       | .txt, .csv             | Direct processing, structured data handling               |
| URLs       | http/https             | Google Docs conversion, web scraping                      |

Advanced Processing

  • Smart Chunking: Sentence-boundary aware with configurable overlap
  • OCR Integration: OCR Space API and Tesseract support
  • Table Extraction: Automatic detection and markdown formatting
  • Caching System: Document-level caching to avoid reprocessing
  • Parallel Processing: Multi-threaded operations for efficiency

πŸ” Advanced RAG System

Query Processing

  • Query Expansion: Automatic decomposition into focused sub-queries
  • Multi-aspect Analysis: Process/document/contact identification
  • Budget Management: Intelligent retrieval budget distribution

Hybrid Search Engine

  • Semantic Search: Dense vector similarity (SentenceTransformers)
  • Keyword Search: BM25 for exact term matching
  • Score Fusion: Reciprocal Rank Fusion with weighted combination
  • Reranking: Cross-encoder models for relevance refinement

Context-Aware Generation

  • Multi-perspective Context: Equal representation from sub-queries
  • Enhanced Prompting: Specialized prompts for policy documents
  • Error Handling: Graceful handling of edge cases

🌐 Production-Ready API

REST Endpoints

  • POST /hackrx/run - Document processing and Q&A
  • GET /health - System health monitoring
  • POST /preprocess - Batch document preprocessing (admin)
  • GET /logs - Request logs export with filtering (admin)
  • GET /collections - List processed documents (admin)

Security Features

  • Bearer token authentication for main endpoints
  • Admin token for administrative functions
  • Request validation using Pydantic models
  • CORS and security headers configuration

πŸ“Š Comprehensive Monitoring

Request Tracking

  • Unique request ID generation
  • Pipeline stage timing breakdown
  • Per-question performance metrics
  • Success/failure tracking

Performance Analytics

  • Real-time processing statistics
  • Provider usage distribution
  • Memory and resource monitoring
  • Export capabilities with filtering

Health Monitoring

  • System component status
  • Provider availability tracking
  • Database connection health
  • Resource usage monitoring

βš™οΈ Quick Setup

Prerequisites

  • Python 3.10+
  • Docker (optional)
  • At least one LLM provider API key (Groq/Gemini/OpenAI)
  • OCR Space API key (for PowerPoint images)

πŸš€ Local Development Setup

  1. Clone Repository

    git clone <repository-url>
    cd shastradocs-v2
    
  2. Install Dependencies

    pip install -r requirements.txt
    
  3. Configure Environment: create a .env file with your API keys:

    # === LLM PROVIDERS ===
    # Groq (Primary provider - fastest)
    GROQ_API_KEY_1=your_first_groq_key
    DEFAULT_GROQ_MODEL=qwen/qwen3-32b
    
    # Gemini (Secondary provider)
    GEMINI_API_KEY_1=your_gemini_key
    DEFAULT_GEMINI_MODEL=gemini-2.0-flash
    
    # OpenAI (Backup provider)
    OPENAI_API_KEY_1=your_openai_key
    DEFAULT_OPENAI_MODEL=gpt-4o-mini
    
    # You can add more API keys per provider; just increment the suffix number
    
    
    # === Specialized Pipelines ===
    GROQ_API_KEY_TABULAR=your_groq_key    # Optional but recommended, even if a Groq key is already configured above
    GEMINI_API_KEY_IMAGE=your_gemini_key  # Optional but recommended, even if a Gemini key is already configured above
    
    # === Query Expansion ===
    GROQ_API_KEY_LITE=your_groq_key       # Optional but recommended, even if a Groq key is already configured above
    
    # === SERVICES ===
    OCR_SPACE_API_KEY=your_ocr_space_key
    BEARER_TOKEN=your_secure_api_token
    
  4. Run Application

    python app.py
    

🐳 Docker Deployment

  1. Build Image

    docker build -t shastradocs-v2 .
    
  2. Run Container

    docker run -p 7860:7860 --env-file .env shastradocs-v2
    

☁️ HuggingFace Spaces Deployment

The application is optimized for HuggingFace Spaces:

  1. Upload project files to your Space
  2. Set environment variables in Space settings
  3. The startup.sh script handles database initialization
  4. Access via your Space URL

πŸ“– Usage Examples

Python Client

import httpx
import asyncio

async def analyze_document():
    url = "http://localhost:8000/hackrx/run"
    headers = {"Authorization": "Bearer your_token"}
    
    data = {
        "documents": "https://example.com/policy.pdf",
        "questions": [
            "What is the claim submission process?",
            "What documents are required?",
            "Who should I contact for help?"
        ]
    }
    
    async with httpx.AsyncClient(timeout=600) as client:
        response = await client.post(url, json=data, headers=headers)
        result = response.json()
        
        print("πŸ“„ Document Analysis Results:")
        for i, answer in enumerate(result["answers"]):
            print(f"\nQ{i+1}: {data['questions'][i]}")
            print(f"A{i+1}: {answer}")
        
        # Performance metrics
        if "pipeline_timings" in result:
            timings = result["pipeline_timings"]
            print(f"\n⏱️ Processing Time: {timings.get('total_pipeline', 0):.2f}s")

asyncio.run(analyze_document())

cURL Examples

# Process document with questions
curl -X POST "http://localhost:8000/hackrx/run" \
  -H "Authorization: Bearer your_token" \
  -H "Content-Type: application/json" \
  -d '{
    "documents": "https://example.com/policy.pdf",
    "questions": [
      "What are the key policy highlights?",
      "How do I submit a claim?"
    ]
  }'

# Check system health
curl -X GET "http://localhost:8000/health"

# Get request logs (admin)
curl -X GET "http://localhost:8000/logs?minutes=60&limit=50" \
  -H "Authorization: Bearer your_admin_token"

# Preprocess document (admin)
curl -X POST "http://localhost:8000/preprocess" \
  -H "Authorization: Bearer your_admin_token" \
  -d "document_url=https://example.com/document.pdf&force=false"

CLI Usage

# Process single document
python -m preprocessing --url "https://example.com/document.pdf"

# Process multiple documents
python -m preprocessing --urls-file urls.txt

# List processed documents
python -m preprocessing --list

# Show statistics
python -m preprocessing --stats

πŸŽ›οΈ Configuration Guide

Environment Variables

Required Variables

# At least one LLM provider
GROQ_API_KEY_1=your_key        # OR
GEMINI_API_KEY_1=your_key      # OR  
OPENAI_API_KEY_1=your_key

# Authentication
BEARER_TOKEN=your_secure_token

# OCR for PowerPoint processing
OCR_SPACE_API_KEY=your_ocr_key

Optional Variables

# Additional LLM keys (up to 10 per provider)
GROQ_API_KEY_2=backup_key
GEMINI_API_KEY_2=backup_key

# Custom models per provider
DEFAULT_GROQ_MODEL=qwen/qwen3-32b
GROQ_MODEL_1=llama3-70b-8192

# API configuration
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=false

# RAG configuration
TOP_K=9
CHUNK_SIZE=1600
ENABLE_RERANKING=true

Processing Modes

The system automatically selects optimal processing modes:

1. Standard RAG Processing

  • Complex documents requiring full pipeline
  • Vector database storage and hybrid search
  • Best for policy documents, manuals

2. OneShot Processing

  • Simple text documents
  • Direct LLM processing without vector search
  • Faster for short documents

3. Tabular Analysis

  • Excel, CSV files with structured data
  • Specialized data analysis prompts
  • Optimized for numerical data

4. Image Processing

  • Visual content with OCR
  • Table detection in images
  • Automatic cleanup after processing

πŸ“Š Performance Metrics

Processing Speed

  • Simple Queries: 0.5-1.5 seconds
  • Complex Multi-aspect: 1.5-3.0 seconds
  • Document Preprocessing: 2-5 pages/second (PDF)
  • Embedding Generation: 100-500 chunks/second

Cost Optimization

  • Free Tier Usage: 200+ questions at $0 cost
  • Provider Rotation: Automatic cost-effective routing
  • Rate Limit Avoidance: Prevents unnecessary paid calls
  • Intelligent Caching: Reduces redundant processing

Resource Usage

  • Memory: 500MB-1GB (model dependent)
  • Storage: Vector databases (~100MB per 1000 documents)
  • CPU: Moderate during processing, minimal idle

πŸ› οΈ Troubleshooting

Common Issues

1. No LLM Providers Available

# Check provider status
from LLM.llm_handler import llm_handler
status = llm_handler.get_provider_status()
print(f"Available providers: {len(status)}")

# Reset cooldowns if needed
llm_handler.reset_cooldowns()

2. Document Processing Failures

# Check document accessibility
curl -I "https://your-document-url.pdf"

# Force reprocessing
curl -X POST "http://localhost:8000/preprocess" \
  -H "Authorization: Bearer admin_token" \
  -d "document_url=your_url&force=true"

3. OCR Space API Issues

# Verify OCR API key
export OCR_SPACE_API_KEY="your_key"

# Test OCR endpoint
curl -X POST "https://api.ocr.space/parse/image" \
  -F "apikey=your_key" \
  -F "url=https://example.com/image.jpg"

4. Memory Issues

# Reduce batch sizes in config.py
BATCH_SIZE = 16
CHUNK_SIZE = 1200

Debug Mode

Enable verbose logging:

import logging
logging.basicConfig(level=logging.DEBUG)

# Check system health
from api.api import app
# Health check includes detailed component status

Health Monitoring

# System health check
curl http://localhost:8000/health

# Detailed logs export
curl -H "Authorization: Bearer admin_token" \
  "http://localhost:8000/logs?minutes=60" > debug_logs.json

πŸš€ Production Deployment

Docker Production Setup

# Multi-stage build for optimization
FROM python:3.10-slim as builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt

FROM python:3.10-slim
COPY --from=builder /root/.local /root/.local
COPY . /app
WORKDIR /app

# Environment setup
ENV PATH=/root/.local/bin:$PATH
ENV HF_HOME=/app/.cache/huggingface
EXPOSE 7860

CMD ["bash", "startup.sh"]

Environment-Specific Configuration

Development

API_RELOAD=true
API_HOST=127.0.0.1
LOG_LEVEL=DEBUG

Staging

API_RELOAD=false
API_HOST=0.0.0.0
LOG_LEVEL=INFO

Production

API_RELOAD=false
API_HOST=0.0.0.0
LOG_LEVEL=WARNING
# Multiple API keys for redundancy
GROQ_API_KEY_1=prod_key_1
GROQ_API_KEY_2=prod_key_2

Monitoring Setup

# Health check endpoint for load balancers
curl -f http://localhost:7860/health || exit 1

# Prometheus metrics (custom implementation)
curl http://localhost:7860/metrics

# Log aggregation
curl -H "Authorization: Bearer admin" \
  "http://localhost:7860/logs" | jq '.metadata'

🀝 Contributing

We welcome contributions! Please follow these guidelines:

Development Setup

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Follow modular architecture patterns
  4. Maintain async/await compatibility
  5. Add comprehensive error handling
  6. Include type hints and documentation

Code Standards

  • Python: Follow PEP 8 style guidelines
  • Documentation: Update README for new features
  • Testing: Add tests for new components
  • Error Handling: Implement graceful error recovery

Pull Request Process

  1. Update documentation
  2. Add tests for new functionality
  3. Ensure all tests pass
  4. Update CHANGELOG.md
  5. Submit PR with detailed description

πŸ”’ Security Considerations

Authentication

  • Bearer Tokens: Secure API access with rotation support
  • Admin Endpoints: Separate authentication for sensitive operations
  • Input Validation: Comprehensive request sanitization

Data Security

  • No Persistent Storage: Documents processed in memory only
  • Automatic Cleanup: Temporary files removed after processing
  • Secure Headers: CORS and security headers configured

Rate Limiting

  • Request Throttling: Built-in concurrency limits
  • Provider Management: Smart rate limit handling
  • Graceful Degradation: Continues operation during issues

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright (c) 2025 Rahul Samedavar and Sambhaji Patil

πŸ™ Acknowledgments

  • HuggingFace: For model hosting and Spaces platform
  • Qdrant: For vector database capabilities
  • FastAPI: For modern API framework
  • SentenceTransformers: For embedding models
  • Community Contributors: For feedback and improvements

ShastraDocs v2 - Enterprise-grade RAG system for intelligent document analysis

🌟 Star on GitHub