---
title: StuddyBuddy Ingestion
emoji: ⚙️
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: 'backend for data ingestion'
---

# Ingestion Pipeline

A dedicated service for processing file uploads and storing them in MongoDB Atlas. This service mirrors the main system's file processing functionality while running as a separate service to share the processing load.

[API docs](CURL.md)

## 🏗️ Architecture

```
┌───────────────────────────────────────────────────────────────────────────┐
│                              USER INTERFACE                               │
│  ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐  │
│  │    Frontend UI    │    │   Load Balancer   │    │    Main System    │  │
│  │                   │◄──►│                   │◄──►│    (Port 7860)    │  │
│  │ - File Upload     │    │ - Route Requests  │    │ - Chat & Reports  │  │
│  │ - Chat Interface  │    │ - Health Checks   │    │ - User Management │  │
│  │ - Project Mgmt    │    │ - Load Balancing  │    │ - Analytics       │  │
│  └───────────────────┘    └───────────────────┘    └───────────────────┘  │
└───────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                            INGESTION PIPELINE                             │
│  ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐  │
│  │  File Processing  │    │   Data Storage    │    │    Monitoring     │  │
│  │ - PDF/DOCX Parse  │    │ - MongoDB Atlas   │    │ - Job Status      │  │
│  │ - Image Caption   │    │ - Vector Search   │    │ - Health Checks   │  │
│  │ - Text Chunking   │    │ - Embeddings      │    │ - Error Handling  │  │
│  │ - Embedding Gen   │    │ - User/Project    │    │ - Logging         │  │
│  └───────────────────┘    └───────────────────┘    └───────────────────┘  │
└───────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                              SHARED DATABASE                              │
│  ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐  │
│  │   MongoDB Atlas   │    │    Collections    │    │      Indexes      │  │
│  │                   │    │ - chunks          │    │ - Vector Search   │  │
│  │ - Same Cluster    │    │ - files           │    │ - Text Search     │  │
│  │ - Same Database   │    │ - chat_sessions   │    │ - User/Project    │  │
│  │ - Same Schema     │    │ - chat_messages   │    │ - Performance     │  │
│  └───────────────────┘    └───────────────────┘    └───────────────────┘  │
└───────────────────────────────────────────────────────────────────────────┘
```

## 📁 Project Structure

```
ingestion_pipeline/
├── __init__.py
├── app.py                    # Main FastAPI application
├── requirements.txt          # Python dependencies
├── Dockerfile                # HuggingFace deployment
├── deploy.sh                 # Deployment script
├── test_pipeline.py          # Test script
├── README.md                 # This file
├── config/                   # Configuration
│   ├── __init__.py
│   └── settings.py
├── api/                      # API layer
│   ├── __init__.py
│   ├── models.py             # Pydantic models
│   └── routes.py             # API routes
└── services/                 # Business logic
    ├── __init__.py
    └── ingestion_service.py
```

## 🚀 Quick Start

### Prerequisites

- Docker
- MongoDB Atlas cluster
- Python 3.11+

## 🔧 API Endpoints

### Health Check

```http
GET /health
```

### Upload Files

```http
POST /upload
Content-Type: multipart/form-data

user_id: string
project_id: string
files: File[]
replace_filenames: string (optional)
rename_map: string (optional)
```

### Job Status

```http
GET /upload/status?job_id={job_id}
```

### List Files

```http
GET /files?user_id={user_id}&project_id={project_id}
```

### Get File Chunks

```http
GET /files/chunks?user_id={user_id}&project_id={project_id}&filename={filename}&limit={limit}
```

## 🔄 Data Flow

### File Processing Pipeline

1. **File Upload**: User uploads files via frontend
2. **Load Balancing**: Request routed to ingestion pipeline
3. **File Processing**:
   - PDF/DOCX parsing with image extraction
   - BLIP image captioning
   - Semantic chunking with overlap
   - Embedding generation (all-MiniLM-L6-v2)
4. **Data Storage**:
   - Chunks stored in `chunks` collection
   - File summaries in `files` collection
   - Both scoped by `user_id` and `project_id`
5. **Response**: Job ID returned for progress tracking

### Data Consistency

- **Same Database**: Uses identical MongoDB Atlas cluster
- **Same Collections**: Stores in `chunks` and `files` collections
- **Same Schema**: Identical data structure and metadata
- **Same Scoping**: All data scoped by `user_id` and `project_id`
- **Same Indexes**: Uses identical database indexes

## 🐳 Docker Deployment

### HuggingFace Spaces

The service is designed for HuggingFace Spaces deployment with:

- Port 7860 (HuggingFace default)
- Non-root user for security
- HuggingFace cache directories
- Model preloading and warmup

### Logging

- Comprehensive logging for all operations
- Error tracking and debugging
- Performance monitoring

### Job Tracking

- Upload progress monitoring
- Error handling and reporting
- Status updates

## 🔧 Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MONGO_URI` | Required | MongoDB connection string |
| `MONGO_DB` | `studybuddy` | Database name |
| `EMBED_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
| `ATLAS_VECTOR` | `0` | Enable Atlas Vector Search |
| `MAX_FILES_PER_UPLOAD` | `15` | Maximum files per upload |
| `MAX_FILE_MB` | `50` | Maximum file size in MB |
| `INGESTION_PORT` | `7860` | Service port |

### Processing Configuration

- **Vector Dimension**: 384 (all-MiniLM-L6-v2)
- **Chunk Max Words**: 500
- **Chunk Min Words**: 150
- **Chunk Overlap**: 50 words

## 🔒 Security

### Security Features

- Non-root user in Docker container
- Input validation and sanitization
- Error handling and logging
- Rate limiting (configurable)

### Best Practices

- Use environment variables for secrets
- Regular security updates
- Monitor logs for anomalies
- Implement proper access controls

## 🚀 Performance

### Optimization Features

- Lazy loading of ML models
- Efficient file processing
- Background task processing
- Memory management

### Scaling

- Horizontal scaling support
- Load balancing ready
- Resource optimization
- Performance monitoring

## 📚 Integration

### Main System Integration

The ingestion pipeline is designed to work seamlessly with the main system:

- Same API endpoints
- Same data structures
- Same processing pipeline
- Same storage format

### Load Balancer Integration

- Automatic request routing
- Health check integration
- Failover support
- Performance monitoring

## 🐛 Troubleshooting

### Common Issues

1. **MongoDB Connection**: Verify `MONGO_URI` is correct
2. **Port Conflicts**: Ensure port 7860 is available
3. **Model Loading**: Check HuggingFace cache permissions
4. **File Processing**: Verify file format support

## 📈 Future Enhancements

### Planned Features

- Multiple file format support
- Advanced chunking strategies
- Performance optimizations
- Enhanced monitoring

### Scalability

- Kubernetes deployment
- Auto-scaling support
- Load balancing improvements
- Resource optimization
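
## 🧪 Worked Example: Chunk Sizing

The Processing Configuration values above (500 max words, 150 min words, 50-word overlap) are easier to reason about with a small sketch. The code below is an illustration only: it uses plain fixed-size word windows rather than the semantic chunking the pipeline performs, and `chunk_words` is a hypothetical helper, not a function taken from `ingestion_service.py`. Folding a short trailing window into the previous chunk is likewise an assumption about how the minimum might be enforced.

```python
from typing import List

# Values copied from the Processing Configuration section.
CHUNK_MAX_WORDS = 500
CHUNK_MIN_WORDS = 150
CHUNK_OVERLAP_WORDS = 50


def chunk_words(
    text: str,
    max_words: int = CHUNK_MAX_WORDS,
    min_words: int = CHUNK_MIN_WORDS,
    overlap: int = CHUNK_OVERLAP_WORDS,
) -> List[str]:
    """Split text into overlapping word windows (hypothetical helper).

    Each window holds up to `max_words` words and shares `overlap` words
    with the window before it. A trailing window shorter than `min_words`
    is merged into the previous one, so the last chunk may exceed
    `max_words` by fewer than `min_words - overlap` words.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")

    words = text.split()
    if not words:
        return []

    step = max_words - overlap  # distance between consecutive window starts
    windows: List[List[str]] = []
    start = 0
    while True:
        windows.append(words[start:start + max_words])
        if start + max_words >= len(words):
            break  # this window reached the end of the text
        start += step

    # Assumed policy: fold a too-short final window into its predecessor,
    # skipping the words the two windows already share.
    if len(windows) > 1 and len(windows[-1]) < min_words:
        tail = windows.pop()
        windows[-1].extend(tail[overlap:])

    return [" ".join(window) for window in windows]
```

With these defaults, a 1,000-word document yields two chunks of 500 and 550 words: the final 100-word window falls below the minimum and is merged into the second chunk. A 1,200-word document yields three chunks of 500, 500, and 300 words.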