---
title: Wasserstoff AiInternTask
emoji: π
colorFrom: green
colorTo: indigo
sdk: docker
pinned: false
license: apache-2.0
---
# RAG Chat Application
A sophisticated Retrieval-Augmented Generation (RAG) chat application that enables intelligent conversations with your documents. Built with a modular architecture featuring a FastAPI backend, modern web frontend, and powerful document processing capabilities.
## Key Highlights

- **Production-Ready FastAPI Backend** with a comprehensive REST API
- **Modern Web Frontend** with responsive design and real-time chat
- **Core RAG Engine** in `enhanced_vectordb.py`, the heart of the application
- **Advanced Document Processing** with OCR, semantic search, and AI analysis
- **High Performance** with FAISS vector search and async operations
- **Persistent Storage** for vector stores and chat history
- **Comprehensive Analytics** with processing statistics and visualizations
## Features

- **Multi-format Document Processing**: PDFs, text files, images (OCR), code files, and more
- **Advanced Vector Search**: FAISS-powered semantic similarity search with citation tracking
- **AI-Powered Analysis**: Theme extraction and intelligent response generation using the GROQ API
- **Interactive Chat Interface**: Real-time conversations with document-based context
- **Processing Statistics**: Detailed metrics, file type analysis, and performance visualizations
- **Vector Store Management**: Save and load processed document collections with metadata
- **Dual Implementation**: Production FastAPI backend plus a legacy Streamlit MVP for comparison
## Supported File Types

- **Documents**: PDF, TXT, MD, PY, JS, HTML, CSV, JSON
- **Images**: PNG, JPG, JPEG, BMP, TIFF, WEBP (with OCR)
## Installation

### Prerequisites

- Python 3.8 or higher
- pip package manager
- GROQ API key (for OCR and chat capabilities)
### Quick Start

1. Clone the repository:

   ```bash
   git clone https://github.com/Jatin-Mehra119/wasserstoff-AiInternTask.git
   cd wasserstoff-AiInternTask
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables (optional) by creating a `.env` file in the root directory:

   ```
   GROQ_API_KEY=your_groq_api_key_here
   ```
## Running the Application

### FastAPI + Web Frontend (Production)

The main application uses a modern FastAPI backend with an HTML/CSS/JavaScript frontend.

#### Quick Start

```bash
# Navigate to the backend directory
cd backend

# Run the FastAPI server
python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Then open http://localhost:8000 in your browser.
#### Alternative Launch Methods

- **Linux/macOS**: `./run.sh` (if available)
- **Windows**: `run.bat` (if available)
- **Docker**: `docker build -t rag-app . && docker run -p 8000:8000 rag-app` (if the Dockerfile is configured)
### Streamlit MVP Version

The original Streamlit implementation (MVP, minimum viable product):

```bash
streamlit run streamlit_rag_app.py
```

**Note**: The FastAPI version is the primary, production-ready implementation. The Streamlit version serves as a reference MVP.
## Usage Guide

### 1. Configure the API Key

- Navigate to the configuration section in the web interface
- Enter your GROQ API key to enable:
  - OCR processing for image documents
  - AI-powered chat responses and theme analysis
  - Advanced document understanding capabilities
### 2. Upload Documents

Choose from two upload methods:

**File Upload**

- Click "Choose Files" or drag and drop files into the upload area
- Select multiple files simultaneously for batch processing
- Supported formats: PDF, TXT, MD, PY, JS, HTML, CSV, JSON, PNG, JPG, etc.
**Directory Processing**

- Enter an absolute directory path (e.g., `/path/to/documents/`)
- Enable recursive processing to include subdirectories
- All supported files are automatically detected and processed
### 3. Process Documents

Click "Process Documents" to initiate the RAG pipeline:

- **Text Extraction**: Extract content from all document types
- **OCR Processing**: Convert images to searchable text using the GROQ Vision API
- **Document Chunking**: Split content into optimal segments for search
- **Vector Generation**: Create semantic embeddings using sentence transformers
- **Index Building**: Construct a FAISS vector database for fast similarity search
- **Metadata Tracking**: Preserve source information, file types, and processing statistics
### 4. Chat with Your Documents

Start asking questions about your uploaded content:

- **Natural Language Queries**: Ask questions in plain English
- **Contextual Responses**: Get AI-generated answers based on document content
- **Source Citations**: View exact references with file paths and relevance scores
- **Theme Analysis**: Discover common topics across multiple documents
- **Chat History**: Access the complete conversation history with timestamps
### 5. Manage the Vector Store

Persist your processed documents:

- **Save**: Store the current vector database to disk for future sessions
- **Load**: Restore previously processed document collections instantly
- **Statistics**: View detailed processing metrics and file type distributions
## Features Breakdown

### Document Processing

- **Multi-format Support**: Handles text, PDFs, images, and code files seamlessly
- **OCR Capabilities**: Extracts text from images using the GROQ Vision API
- **Intelligent Chunking**: Text splitting optimized for search and retrieval
- **Metadata Preservation**: Tracks source files, types, and processing timestamps
- **Batch Processing**: Handles multiple files simultaneously with progress tracking
- **Directory Scanning**: Recursive processing of entire directory structures
- **File Type Detection**: Automatic identification and appropriate processing
- **Error Recovery**: Graceful handling of corrupted or unsupported files
### Search & Retrieval

- **Semantic Search**: Uses sentence transformers for meaning-based search
- **Citation Tracking**: Provides exact source references with file paths and types
- **Relevance Scoring**: Shows confidence scores for search results
- **Multi-document Analysis**: Finds connections across different sources
- **Top-K Results**: Returns the top 5 most relevant document chunks per query
- **Enhanced Metadata**: Preserves source file information, document types, and processing timestamps
### AI-Powered Analysis

- **Theme Extraction**: Identifies common topics across search results using an LLM
- **Response Generation**: Creates comprehensive markdown-formatted answers
- **Context Synthesis**: Combines information from multiple document sources
- **Intelligent Fallback**: Provides structured responses even without LLM access
- **Citation Integration**: Seamlessly weaves source references into responses
- **Markdown Formatting**: Rich text responses with proper formatting, lists, and emphasis
### User Interface

- **Responsive Design**: Works on desktop, tablet, and mobile devices
- **Real-time Updates**: Live processing status and statistics display
- **Visual Analytics**: Charts showing file type distribution and processing metrics
- **Interactive Chat**: Smooth conversation experience with instant responses
- **Citation Display**: Click-through citations with source file information
- **History Management**: Complete chat history with timestamps and persistence
- **Error Handling**: User-friendly error messages and status indicators
- **Progress Tracking**: Visual feedback during file processing and uploads
## Backend Architecture & Core Implementation

The application follows a modular, production-ready architecture with a clear separation of concerns.

### Core RAG Engine: `enhanced_vectordb.py`

The heart of the application: a comprehensive RAG implementation that handles the entire document-to-chat pipeline.
**Key Components:**

- **Document Processing Pipeline**: Multi-format document ingestion and intelligent text extraction
- **Vector Database Management**: FAISS-powered vector store with comprehensive metadata tracking
- **Embedding Generation**: Advanced sentence-transformer-based document embeddings
- **Semantic Search**: Intelligent similarity-based document retrieval with confidence scoring
- **AI Integration**: Seamless GROQ API integration for OCR and conversational AI
- **Theme Analysis**: LLM-powered pattern recognition and theme extraction across documents
**Core Class Structure:**

```python
class EnhancedDocumentProcessor:
    # Document processing
    def process_files(self, file_paths): ...          # Multi-format file processing
    def process_directory(self, directory_path): ...  # Recursive directory processing

    # Vector operations
    def create_enhanced_vector_store(self, documents): ...  # FAISS index creation with metadata
    def search_with_citations(self, query, k=5): ...        # Semantic search with source tracking

    # AI-powered features
    def analyze_themes(self, query, search_results): ...  # Theme extraction using an LLM
    def get_chat_response(self, query): ...               # End-to-end chat response generation

    # Persistence
    def save_vector_store(self, path): ...  # Save the vector store with metadata
    def load_vector_store(self, path): ...  # Load a saved vector store
```
### FastAPI Backend Structure

A modern, scalable backend architecture with a clean separation of concerns.

**Backend Organization:**

```
backend/
├── main.py              # FastAPI app initialization, CORS, and server config
├── models.py            # Pydantic data models and API schemas
├── utils.py             # Core utilities, state management, and helpers
└── routes/              # Modular API endpoint handlers
    ├── __init__.py      # Package initialization and router exports
    ├── main_routes.py   # Frontend serving and application health
    ├── upload_routes.py # Document upload and processing logic
    ├── chat_routes.py   # Chat interface and AI response handling
    └── store_routes.py  # Vector store persistence and management
```
**Key Integration Points:**

- **`enhanced_vectordb.py`**: Core RAG engine integration
- **GROQ API**: Vision OCR and conversational AI capabilities
- **FAISS**: High-performance vector similarity search
- **LangChain**: Advanced document processing and text manipulation
- **Sentence Transformers**: State-of-the-art text embedding generation
- **FastAPI**: Async/await patterns with automatic OpenAPI documentation
## API Documentation

The FastAPI backend provides comprehensive, interactive API documentation:

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI Schema**: Auto-generated from code with full type hints
### Core Endpoints

#### Authentication & Configuration

**`POST /set-api-key`**: Configure the GROQ API key

- Body: `{"api_key": "your_groq_api_key"}`
- Response: Status confirmation
#### Document Processing

**`POST /upload-files`**: Upload and process multiple files

- Input: Multipart form data with file uploads
- Response: Processing statistics and success status
- Features: Handles multiple file formats, creates the vector store, provides detailed statistics

**`POST /process-directory`**: Process documents from a directory path

- Body: Form data with a `directory_path` field
- Response: Processing statistics and success status
- Features: Recursive directory processing, automatic file type detection
#### Chat & Interaction

**`POST /chat`**: Send a chat message and get an AI-powered response

- Body: `{"message": "your question"}`
- Response: Comprehensive chat response with citations and theme analysis
- Features:
  - Semantic search across processed documents
  - AI-generated responses with markdown formatting
  - Source citations with relevance scores
  - Theme analysis across multiple documents
  - Automatic chat history tracking
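As an illustration, these endpoints can be driven from a minimal Python client. This is a sketch using only the standard library; the request bodies follow the shapes documented in this README, and the base URL assumes the default development server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default dev server address from this README


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON body to the backend and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def configure_api_key(api_key: str) -> dict:
    # Matches the documented body: {"api_key": "..."}
    return post_json("/set-api-key", {"api_key": api_key})


def chat(message: str) -> dict:
    # Matches the documented body: {"message": "..."}
    return post_json("/chat", {"message": message})
```

Calling `chat("What are the main themes across my documents?")` should return the chat response format documented under Data Models & Response Formats.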
#### Data Management & Analytics

**`GET /stats`**: Get current processing statistics

- Response: File counts, document counts, chunk counts, file type distribution

**`GET /chat-history`**: Retrieve the complete chat conversation history

- Response: Array of chat exchanges with timestamps

**`DELETE /clear-chat`**: Clear chat history and reset the session

- Response: Confirmation of history deletion
#### Vector Store Management

**`POST /save-vector-store`**: Persist the current vector store to disk

- Response: Success confirmation
- Storage: Saves to the `./vector_store/` directory with enhanced metadata

**`POST /load-vector-store`**: Load a previously saved vector store

- Response: Success confirmation with restored statistics
- Features: Automatically restores processing statistics and metadata
#### Frontend Serving

**`GET /`**: Serve the main HTML application interface

- Response: Complete web application frontend
### Data Models & Response Formats

#### Chat Response Format

```json
{
  "response": "AI-generated markdown response",
  "citations": [
    {
      "content": "relevant document excerpt",
      "citation": "source file path",
      "type": "file type",
      "score": "relevance score"
    }
  ],
  "themes": {
    "key_themes": ["theme1", "theme2"],
    "analysis": "theme analysis text"
  },
  "timestamp": "2025-06-07T10:30:00.123456"
}
```
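For a typed client, the response shape above maps naturally onto small data classes. The field names follow the JSON example; these are illustrative sketches, not the backend's actual Pydantic models in `models.py`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Citation:
    content: str   # relevant document excerpt
    citation: str  # source file path
    type: str      # file type
    score: str     # relevance score (a string in the example above)


@dataclass
class ChatResponse:
    response: str
    citations: List[Citation] = field(default_factory=list)
    themes: Dict[str, object] = field(default_factory=dict)
    timestamp: str = ""

    @classmethod
    def from_json(cls, data: dict) -> "ChatResponse":
        return cls(
            response=data["response"],
            citations=[Citation(**c) for c in data.get("citations", [])],
            themes=data.get("themes", {}),
            timestamp=data.get("timestamp", ""),
        )


# Example: parse the documented response shape
example = ChatResponse.from_json({
    "response": "AI-generated markdown response",
    "citations": [{"content": "excerpt", "citation": "docs/a.pdf",
                   "type": "pdf", "score": "0.82"}],
    "themes": {"key_themes": ["theme1"], "analysis": "..."},
    "timestamp": "2025-06-07T10:30:00.123456",
})
```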
#### Processing Statistics Format

```json
{
  "total_files": 10,
  "total_documents": 25,
  "total_chunks": 150,
  "file_types": ["pdf", "txt", "py", "md"],
  "type_counts": {"pdf": 5, "txt": 3, "py": 1, "md": 1},
  "processed_at": "2025-06-07 10:30:00"
}
```
#### Error Response Format

```json
{
  "detail": "Error description message"
}
```
## Project Architecture

```
wasserstoff-AiInternTask/
├── rag_elements/                  # Core RAG implementation
│   ├── enhanced_vectordb.py       # MAIN RAG ENGINE - complete implementation
│   └── config.py                  # Configuration and settings management
│
├── backend/                       # Production FastAPI server
│   ├── main.py                    # Application entry point and server config
│   ├── models.py                  # Pydantic schemas and data models
│   ├── utils.py                   # Utilities, state management, and helpers
│   ├── routes/                    # Modular API endpoint handlers
│   │   ├── __init__.py            # Router exports and package init
│   │   ├── main_routes.py         # Frontend serving and health endpoints
│   │   ├── upload_routes.py       # Document upload and processing APIs
│   │   ├── chat_routes.py         # Chat interface and AI response APIs
│   │   └── store_routes.py        # Vector store management APIs
│   └── vector_store/              # Runtime vector database storage
│       ├── index.faiss            # FAISS vector similarity index
│       ├── index.pkl              # Index metadata and document mappings
│       └── enhanced_metadata.json # Processing stats and file information
│
├── frontend/                      # Modern web interface
│   ├── index.html                 # Main application UI and layout
│   ├── style.css                  # Responsive design and modern styling
│   └── script.js                  # Frontend logic and API integration
│
├── streamlit_rag_app.py           # Legacy MVP implementation (Streamlit)
├── requirements.txt               # Python dependencies and versions
├── LICENSE                        # Apache 2.0 License
├── Dockerfile                     # Container deployment configuration
└── README.md                      # Comprehensive project documentation
```
## Core Implementation Focus

### `enhanced_vectordb.py`: The RAG Engine

This is where the magic happens: the complete RAG implementation, including:

- **Document Ingestion**: Multi-format processing (PDF, images, text, code)
- **Text Processing**: Intelligent chunking and metadata extraction
- **Vector Operations**: FAISS indexing and semantic similarity search
- **AI Integration**: GROQ API for OCR and conversational capabilities
- **State Management**: Save/load functionality for vector stores
- **Analytics**: Processing statistics and performance metrics
### FastAPI Backend (Production)

An enterprise-ready API server featuring:

- **Async/Await Patterns**: High-performance async operations
- **Modular Architecture**: Clean separation of concerns with organized routes
- **Error Handling**: Comprehensive exception handling and logging
- **Auto Documentation**: Interactive OpenAPI/Swagger documentation
- **CORS Support**: Cross-origin resource sharing for frontend integration
- **Hot Reload**: Development server with automatic code reloading
### Frontend (Modern Web UI)

A professional web interface with:

- **Responsive Design**: Mobile-first approach for all device types
- **Real-time Chat**: WebSocket-like experience with instant responses
- **File Upload**: Drag-and-drop interface with progress indicators
- **Analytics Dashboard**: Processing statistics and data visualizations
- **Citation System**: Interactive source references and document tracking
- **UX/UI Excellence**: Modern design patterns and intuitive workflows
### Runtime Generated Files

The application creates additional files during operation:

```
backend/vector_store/          # Generated during document processing
├── index.faiss                # FAISS vector similarity index (binary)
├── index.pkl                  # Index metadata and document mappings (pickle)
└── enhanced_metadata.json     # Processing statistics and file information (JSON)
```
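Because `enhanced_metadata.json` is plain JSON, a saved store's statistics can be inspected without loading the FAISS index. A small sketch (the key names follow the Processing Statistics Format documented above and should be treated as assumptions):

```python
import json
from pathlib import Path


def read_store_stats(store_dir: str = "backend/vector_store") -> dict:
    """Load the processing statistics saved alongside the FAISS index."""
    meta_path = Path(store_dir) / "enhanced_metadata.json"
    # e.g. {"total_files": 10, "total_chunks": 150, "type_counts": {...}, ...}
    return json.loads(meta_path.read_text(encoding="utf-8"))


def summarize(stats: dict) -> str:
    """One-line summary of a saved store's contents."""
    counts = stats.get("type_counts", {})
    types = ", ".join(f"{t}: {n}" for t, n in sorted(counts.items()))
    return (f'{stats.get("total_files", 0)} files / '
            f'{stats.get("total_chunks", 0)} chunks ({types})')
```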
## Technical Implementation Details

### Core Dependencies & Technologies

**AI & Machine Learning**

- **LangChain**: Advanced document processing and LLM integration framework
- **FAISS**: Facebook's high-performance vector similarity search library
- **Sentence Transformers**: State-of-the-art text embedding models (`all-MiniLM-L6-v2`)
- **GROQ**: Vision API for OCR and conversational AI capabilities

**Backend & API**

- **FastAPI**: Modern, fast async web framework with automatic OpenAPI documentation
- **Uvicorn**: Lightning-fast ASGI server for async Python applications
- **aiofiles**: Async file operations for improved I/O performance
- **python-dotenv**: Environment variable management and configuration

**Data Processing & Utilities**

- **pandas**: Data manipulation and analysis (if CSV processing is needed)
- **Pillow**: Image processing and manipulation library
- **PyPDF2/pdfplumber**: PDF text extraction and processing
- **Pydantic**: Data validation and settings management with type hints
### Processing Pipeline Deep Dive (`enhanced_vectordb.py`)

**1. Document Ingestion & Text Extraction**

- **Multi-format Support**: PDF, TXT, MD, PY, JS, HTML, CSV, JSON
- **Advanced OCR**: PNG, JPG, JPEG, BMP, TIFF, WEBP using the GROQ Vision API
- **Error Handling**: Graceful fallback for corrupted or unsupported files
- **Metadata Tracking**: Source file paths, types, processing timestamps
**2. Intelligent Text Chunking**

- **Optimal Chunk Size**: 800 characters with a 100-character overlap for context preservation
- **Smart Splitting**: Respects sentence boundaries and document structure
- **Metadata Preservation**: Maintains source information throughout the chunking process
- **Memory Efficiency**: Streaming processing for large documents
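The 800/100 chunking policy above can be approximated in a few lines of pure Python. The real implementation likely uses a LangChain text splitter; this sketch only illustrates the size and overlap arithmetic, not the sentence-boundary handling.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    """Split text into fixed-size chunks whose tails overlap, so content
    cut at a chunk boundary still appears with surrounding context."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # advance 700 characters per chunk by default
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already covers the end of the text
    return chunks
```

With the defaults, a 1,500-character document yields two chunks that share a 100-character overlap.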
**3. Embedding Generation & Vector Store Creation**

- **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings, a good balance of speed and accuracy)
- **Alternative**: `BAAI/bge-large-en-v1.5` (1024-dimensional, higher accuracy for production)
- **FAISS Index**: Efficient L2-distance similarity search with `IndexFlatL2`
- **Metadata Storage**: Parallel storage of document metadata for citation tracking
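For intuition, `IndexFlatL2` performs an exhaustive L2-distance scan over all stored vectors. A dependency-free sketch of the same search (real embeddings come from the sentence-transformer model; the 4-dimensional vectors here are toy values):

```python
from typing import List, Tuple


def l2_search(index: List[List[float]], query: List[float],
              k: int = 5) -> List[Tuple[int, float]]:
    """Exhaustive L2 scan: the behaviour of faiss.IndexFlatL2 in miniature.
    Returns (vector_id, squared_distance) pairs, nearest first."""
    dists = [
        (i, sum((a - b) ** 2 for a, b in zip(vec, query)))
        for i, vec in enumerate(index)
    ]
    return sorted(dists, key=lambda t: t[1])[:k]


# Toy 4-dimensional "embeddings" (the real model produces 384 dimensions)
store = [[0.0, 0.0, 0.0, 1.0],
         [1.0, 0.0, 0.0, 0.0],
         [0.9, 0.1, 0.0, 0.0]]
query = [1.0, 0.0, 0.0, 0.0]
print(l2_search(store, query, k=2))  # nearest ids first: 1, then 2
```

Like FAISS, this returns squared L2 distances, so smaller values mean closer matches.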
**4. Query Processing & Semantic Search**

- **Top-K Retrieval**: Returns the top 5 most relevant document chunks (configurable)
- **Similarity Scoring**: L2-distance normalization for relevance ranking
- **Citation Generation**: Automatic source attribution with file paths and confidence scores
- **Result Filtering**: Duplicate removal and relevance-threshold filtering
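How an L2 distance becomes the relevance score shown next to a citation is not spelled out here. One common normalization maps distance 0 to a score of 1.0 and decays toward 0 as distance grows; this is an assumption for illustration, not necessarily the formula used in `enhanced_vectordb.py`.

```python
def relevance_from_l2(distance: float) -> float:
    """Map a non-negative L2 distance to a (0, 1] relevance score:
    identical vectors score 1.0, far-apart vectors approach 0."""
    return 1.0 / (1.0 + distance)


# e.g. a chunk at squared distance 0.25 from the query embedding
score = round(relevance_from_l2(0.25), 2)  # 0.8
```

A score like this is monotone in the distance, so ranking by score reproduces the nearest-first ordering of the search.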
**5. AI-Powered Theme Analysis & Response Generation**

- **Theme Extraction**: The GROQ LLM analyzes common patterns across search results
- **Context Synthesis**: Intelligent combination of multiple document sources
- **Markdown Generation**: Rich text responses with proper formatting, lists, and emphasis
- **Citation Integration**: Seamless source-reference weaving within responses
- **Fallback Handling**: Structured responses even without LLM access
### Performance Optimizations & Considerations

**Speed & Efficiency**

- **Chunk Size Optimization**: 800 characters balances context and search precision
- **Efficient Embedding Model**: `all-MiniLM-L6-v2` provides fast inference with good quality
- **FAISS Performance**: In-memory vector search with sub-millisecond query times
- **Async Operations**: Non-blocking file uploads and processing using async/await patterns
- **Memory Management**: Automatic cleanup of temporary files and efficient garbage collection

**Storage & Caching**

- **Vector Store Persistence**: Complete save/load with metadata preservation
- **Metadata Compression**: JSON-based metadata storage with optional compression
- **Index Optimization**: FAISS index optimization for a reduced memory footprint
- **Session Management**: Efficient state management across multiple user sessions

**Error Handling & Robustness**

- **Comprehensive Exception Handling**: Detailed error messages with context
- **File Type Validation**: Automatic file type detection and format verification
- **Resource Cleanup**: Automatic temporary file and directory cleanup
- **Graceful Degradation**: Fallback responses when AI services are unavailable
- **Input Sanitization**: Protection against malformed files and malicious inputs

**Monitoring & Analytics**

- **Processing Statistics**: Detailed metrics on file counts, chunk counts, and processing times
- **Performance Tracking**: Query response times and embedding generation speeds
- **Error Logging**: Comprehensive logging for debugging and monitoring
- **Usage Analytics**: Chat history tracking and user interaction patterns
## Testing & Development

### Running Tests

```bash
# Navigate to the tests directory
cd tests

# Install test dependencies
pip install -r requirements-test.txt

# Run the comprehensive test suite
bash run_tests.sh

# Or run individual test files
python -m pytest test_endpoints_pytest.py -v
python test_api_endpoints.py
```
### Development Setup

```bash
# Clone and set up the development environment
git clone https://github.com/Jatin-Mehra119/wasserstoff-AiInternTask.git
cd wasserstoff-AiInternTask

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # Linux/macOS
# or
venv\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env       # Create the .env file
# Edit .env and add your GROQ_API_KEY

# Run in development mode
cd backend
python -m uvicorn main:app --reload --host 0.0.0.0 --port 8000
```
### Docker Deployment (if configured)

```bash
# Build the container
docker build -t rag-chat-app .

# Run the application
docker run -p 8000:8000 -e GROQ_API_KEY=your_key_here rag-chat-app

# Or use docker-compose (if docker-compose.yml exists)
docker-compose up -d
```
## Contributing

### Development Guidelines

- **Code Style**: Follow PEP 8 Python style guidelines
- **Type Hints**: Use comprehensive type annotations
- **Documentation**: Add docstrings for all functions and classes
- **Testing**: Write tests for new features and bug fixes
- **Error Handling**: Implement comprehensive exception handling
### Pull Request Process

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make your changes and add tests
4. Ensure all tests pass: `bash tests/run_tests.sh`
5. Commit your changes: `git commit -m 'Add amazing feature'`
6. Push to the branch: `git push origin feature/amazing-feature`
7. Open a Pull Request with a clear description
## Future Enhancements

### Planned Features

- **Authentication System**: User accounts and document access control
- **Cloud Storage Integration**: AWS S3, Google Drive, Dropbox support
- **Advanced Analytics**: Usage metrics, query performance analytics
- **Multi-language Support**: International document processing
- **Real-time Collaboration**: Shared document spaces and collaborative chat
- **Mobile App**: React Native or Flutter mobile application
- **Theming System**: Customizable UI themes and branding options
### Technical Improvements

- **Database Integration**: PostgreSQL/MongoDB for metadata storage
- **Load Balancing**: Horizontal scaling for high-traffic deployment
- **Monitoring**: Comprehensive logging, metrics, and alerting
- **Security**: Enhanced security features, rate limiting, input validation
- **WebSocket Support**: Real-time chat updates and live document processing
- **Model Upgrades**: Integration with the latest embedding and LLM models
## Documentation

Comprehensive documentation is available in the `docs/` directory:

- **Documentation Index**: Complete documentation overview
- **Architecture & Quick Start**: Project architecture with a Mermaid diagram
- **API Reference**: REST API endpoints and examples
- **Development Guide**: Contributing and development setup

### Interactive API Documentation

When the server is running, visit:

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
## License

This project is licensed under the Apache License 2.0; see the LICENSE file for complete details.

### License Summary

- **Commercial Use**: Free to use in commercial applications
- **Modification**: Modify and distribute modified versions
- **Distribution**: Distribute original or modified versions
- **Patent Grant**: Express patent grant from contributors
- **Attribution**: Must include original copyright and license notices
- **State Changes**: Document significant changes made to the code
## Acknowledgments

- **GROQ**: For providing powerful Vision OCR and conversational AI APIs
- **FAISS**: Facebook AI Research, for the exceptional vector similarity search library
- **FastAPI**: For the modern, fast, and developer-friendly web framework
- **LangChain**: For comprehensive document processing and LLM integration tools
- **Sentence Transformers**: For state-of-the-art text embedding models
- **Claude AI**: For assistance in developing the modern web frontend interface
If you find this project useful, please consider giving it a star!

Questions? Issues? Feel free to open an issue or reach out!

Happy RAG chatting!