---
title: StuddyBuddy Ingestion
emoji: ⚙️
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: backend for data ingestion
---
# Ingestion Pipeline

A dedicated service for processing file uploads and storing them in MongoDB Atlas. This service mirrors the main system's file processing functionality while running as a separate service to share the processing load.
## 🏗️ Architecture
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                USER INTERFACE                                   │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐             │
│  │ Frontend UI     │    │ Load Balancer    │    │ Main System     │             │
│  │                 │◄──►│                  │◄──►│ (Port 7860)     │             │
│  │ - File Upload   │    │ - Route Requests │    │ - Chat & Reports│             │
│  │ - Chat Interface│    │ - Health Checks  │    │ - User Management│            │
│  │ - Project Mgmt  │    │ - Load Balancing │    │ - Analytics     │             │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘             │
└─────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                             INGESTION PIPELINE                                  │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐             │
│  │ File Processing │    │ Data Storage     │    │ Monitoring      │             │
│  │ - PDF/DOCX Parse│    │ - MongoDB Atlas  │    │ - Job Status    │             │
│  │ - Image Caption │    │ - Vector Search  │    │ - Health Checks │             │
│  │ - Text Chunking │    │ - Embeddings     │    │ - Error Handling│             │
│  │ - Embedding Gen │    │ - User/Project   │    │ - Logging       │             │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘             │
└─────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              SHARED DATABASE                                    │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐             │
│  │ MongoDB Atlas   │    │ Collections      │    │ Indexes         │             │
│  │                 │    │ - chunks         │    │ - Vector Search │             │
│  │ - Same Cluster  │    │ - files          │    │ - Text Search   │             │
│  │ - Same Database │    │ - chat_sessions  │    │ - User/Project  │             │
│  │ - Same Schema   │    │ - chat_messages  │    │ - Performance   │             │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘             │
└─────────────────────────────────────────────────────────────────────────────────┘
```
## 📁 Project Structure

```
ingestion_pipeline/
├── __init__.py
├── app.py                 # Main FastAPI application
├── requirements.txt       # Python dependencies
├── Dockerfile             # HuggingFace deployment
├── deploy.sh              # Deployment script
├── test_pipeline.py       # Test script
├── README.md              # This file
├── config/                # Configuration
│   ├── __init__.py
│   └── settings.py
├── api/                   # API layer
│   ├── __init__.py
│   ├── models.py          # Pydantic models
│   └── routes.py          # API routes
└── services/              # Business logic
    ├── __init__.py
    └── ingestion_service.py
```
## 🚀 Quick Start

### Prerequisites

- Docker
- MongoDB Atlas cluster
- Python 3.11+
## 🔧 API Endpoints

### Health Check

```
GET /health
```

### Upload Files

```
POST /upload
Content-Type: multipart/form-data

user_id: string
project_id: string
files: File[]
replace_filenames: string (optional)
rename_map: string (optional)
```

### Job Status

```
GET /upload/status?job_id={job_id}
```

### List Files

```
GET /files?user_id={user_id}&project_id={project_id}
```

### Get File Chunks

```
GET /files/chunks?user_id={user_id}&project_id={project_id}&filename={filename}&limit={limit}
```
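The job-status endpoint above can be polled from a small client. A stdlib-only sketch follows; the `BASE_URL` and the shape of the JSON response are assumptions, not part of the documented API:

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumed local deployment; adjust for your Space URL


def status_url(base_url: str, job_id: str) -> str:
    """Build the job-status URL documented above."""
    return f"{base_url}/upload/status?job_id={job_id}"


def get_job_status(job_id: str, base_url: str = BASE_URL) -> dict:
    """Fetch the current status of an upload job as a JSON document."""
    with urllib.request.urlopen(status_url(base_url, job_id)) as resp:
        return json.load(resp)
```

Poll `get_job_status(job_id)` after a `POST /upload` until the returned status indicates completion.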
## 🔄 Data Flow

### File Processing Pipeline

1. File Upload: User uploads files via the frontend
2. Load Balancing: Request is routed to the ingestion pipeline
3. File Processing:
   - PDF/DOCX parsing with image extraction
   - BLIP image captioning
   - Semantic chunking with overlap
   - Embedding generation (all-MiniLM-L6-v2)
4. Data Storage:
   - Chunks stored in the `chunks` collection
   - File summaries stored in the `files` collection
   - Both scoped by `user_id` and `project_id`
5. Response: Job ID returned for progress tracking
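Since stored chunks carry their own scoping fields, a chunk document can be pictured as a plain dictionary. The sketch below is illustrative only; field names such as `chunk_index` are assumptions, not the service's actual schema:

```python
def build_chunk_doc(user_id: str, project_id: str, filename: str,
                    chunk_index: int, text: str, embedding: list) -> dict:
    """Assemble one document for the `chunks` collection.

    The scoping fields (user_id, project_id) live on the document itself,
    matching the per-user/per-project scoping described above. Field names
    are illustrative; the real schema may differ.
    """
    return {
        "user_id": user_id,
        "project_id": project_id,
        "filename": filename,
        "chunk_index": chunk_index,
        "text": text,
        "embedding": embedding,  # 384-dim vector from all-MiniLM-L6-v2
    }
```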
### Data Consistency

- Same Database: Uses the identical MongoDB Atlas cluster
- Same Collections: Stores in the `chunks` and `files` collections
- Same Schema: Identical data structure and metadata
- Same Scoping: All data scoped by `user_id` and `project_id`
- Same Indexes: Uses identical database indexes
## 🐳 Docker Deployment

### HuggingFace Spaces

The service is designed for HuggingFace Spaces deployment with:

- Port 7860 (HuggingFace default)
- Non-root user for security
- HuggingFace cache directories
- Model preloading and warmup
### Logging

- Comprehensive logging for all operations
- Error tracking and debugging
- Performance monitoring

### Job Tracking

- Upload progress monitoring
- Error handling and reporting
- Status updates
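The job tracking described above can be as simple as an in-memory registry keyed by job ID. A minimal sketch follows; the `JOBS` registry and its status fields are assumptions, not the service's actual implementation:

```python
import uuid

# In-memory job registry; a production service might persist this in MongoDB
JOBS: dict = {}


def create_job(total_files: int) -> str:
    """Register a new upload job and return its ID for progress polling."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "queued", "processed": 0,
                    "total": total_files, "errors": []}
    return job_id


def update_job(job_id: str, **fields) -> None:
    """Record progress, completion, or errors for a running job."""
    JOBS[job_id].update(fields)
```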
## 🔧 Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MONGO_URI` | Required | MongoDB connection string |
| `MONGO_DB` | `studybuddy` | Database name |
| `EMBED_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
| `ATLAS_VECTOR` | `0` | Enable Atlas Vector Search |
| `MAX_FILES_PER_UPLOAD` | `15` | Maximum files per upload |
| `MAX_FILE_MB` | `50` | Maximum file size in MB |
| `INGESTION_PORT` | `7860` | Service port |
### Processing Configuration

- Vector Dimension: 384 (all-MiniLM-L6-v2)
- Chunk Max Words: 500
- Chunk Min Words: 150
- Chunk Overlap: 50 words
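With these parameters, an overlap-aware chunker can be sketched as a sliding word window. This is a simplified stand-in for the semantic chunking mentioned above, not the service's actual implementation:

```python
def chunk_words(words, max_words=500, min_words=150, overlap=50):
    """Split a word list into overlapping windows of at most max_words.

    Each window after the first starts `overlap` words before the previous
    window ended. A trailing window shorter than min_words is merged back
    into its neighbour so no chunk falls below the minimum.
    """
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(words[start:end])
        if end == len(words):
            break
        start = end - overlap  # step back to create the overlap
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        tail = chunks.pop()
        # Skip the words the tail already shares with the previous chunk
        chunks[-1] = chunks[-1] + tail[overlap:]
    return chunks
```

For a 1200-word document this yields windows of 500, 500, and 300 words, with each consecutive pair sharing 50 words.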
## 🔒 Security

### Security Features

- Non-root user in Docker container
- Input validation and sanitization
- Error handling and logging
- Rate limiting (configurable)

### Best Practices

- Use environment variables for secrets
- Regular security updates
- Monitor logs for anomalies
- Implement proper access controls
## 🚀 Performance

### Optimization Features

- Lazy loading of ML models
- Efficient file processing
- Background task processing
- Memory management

### Scaling

- Horizontal scaling support
- Load balancing ready
- Resource optimization
- Performance monitoring
## 📚 Integration

### Main System Integration

The ingestion pipeline is designed to work seamlessly with the main system:

- Same API endpoints
- Same data structures
- Same processing pipeline
- Same storage format

### Load Balancer Integration

- Automatic request routing
- Health check integration
- Failover support
- Performance monitoring
## 🐛 Troubleshooting

### Common Issues

- MongoDB Connection: Verify `MONGO_URI` is correct
- Port Conflicts: Ensure port 7860 is available
- Model Loading: Check HuggingFace cache permissions
- File Processing: Verify file format support
## 📈 Future Enhancements

### Planned Features

- Multiple file format support
- Advanced chunking strategies
- Performance optimizations
- Enhanced monitoring

### Scalability

- Kubernetes deployment
- Auto-scaling support
- Load balancing improvements
- Resource optimization