---
title: StuddyBuddy Ingestion
emoji: ⚙️
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: 'backend for data ingestion'
---
# Ingestion Pipeline
A dedicated service for processing file uploads and storing them in MongoDB Atlas. This service mirrors the main system's file processing functionality while running as a separate service to share the processing load.
[API docs](CURL.md)
## 🏗️ Architecture
```
┌───────────────────────────────────────────────────────────────────────┐
│                            USER INTERFACE                             │
│  ┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐  │
│  │ Frontend UI     │    │ Load Balancer    │    │ Main System      │  │
│  │                 │◄──►│                  │◄──►│ (Port 7860)      │  │
│  │ - File Upload   │    │ - Route Requests │    │ - Chat & Reports │  │
│  │ - Chat Interface│    │ - Health Checks  │    │ - User Management│  │
│  │ - Project Mgmt  │    │ - Load Balancing │    │ - Analytics      │  │
│  └─────────────────┘    └──────────────────┘    └──────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────┐
│                          INGESTION PIPELINE                           │
│  ┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐  │
│  │ File Processing │    │ Data Storage     │    │ Monitoring       │  │
│  │ - PDF/DOCX Parse│    │ - MongoDB Atlas  │    │ - Job Status     │  │
│  │ - Image Caption │    │ - Vector Search  │    │ - Health Checks  │  │
│  │ - Text Chunking │    │ - Embeddings     │    │ - Error Handling │  │
│  │ - Embedding Gen │    │ - User/Project   │    │ - Logging        │  │
│  └─────────────────┘    └──────────────────┘    └──────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────┐
│                            SHARED DATABASE                            │
│  ┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐  │
│  │ MongoDB Atlas   │    │ Collections      │    │ Indexes          │  │
│  │                 │    │ - chunks         │    │ - Vector Search  │  │
│  │ - Same Cluster  │    │ - files          │    │ - Text Search    │  │
│  │ - Same Database │    │ - chat_sessions  │    │ - User/Project   │  │
│  │ - Same Schema   │    │ - chat_messages  │    │ - Performance    │  │
│  └─────────────────┘    └──────────────────┘    └──────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
```
## 📁 Project Structure
```
ingestion_pipeline/
├── __init__.py
├── app.py                    # Main FastAPI application
├── requirements.txt          # Python dependencies
├── Dockerfile                # HuggingFace deployment
├── deploy.sh                 # Deployment script
├── test_pipeline.py          # Test script
├── README.md                 # This file
├── config/                   # Configuration
│   ├── __init__.py
│   └── settings.py
├── api/                      # API layer
│   ├── __init__.py
│   ├── models.py             # Pydantic models
│   └── routes.py             # API routes
└── services/                 # Business logic
    ├── __init__.py
    └── ingestion_service.py
## 🚀 Quick Start
### Prerequisites
- Docker
- MongoDB Atlas cluster
- Python 3.11+
## 🔧 API Endpoints
### Health Check
```http
GET /health
```
### Upload Files
```http
POST /upload
Content-Type: multipart/form-data

user_id: string
project_id: string
files: File[]
replace_filenames: string (optional)
rename_map: string (optional)
```
### Job Status
```http
GET /upload/status?job_id={job_id}
```
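Because `/upload` returns a job ID rather than a finished result, a client typically polls the status endpoint until the job ends. The sketch below shows one way to structure that loop. The response schema is not documented here, so the HTTP call and the completion check are passed in as callables; `poll_job`, `fetch_status`, and `is_finished` are hypothetical names, not part of this service.

```python
import time
from typing import Any, Callable, Dict


def poll_job(
    fetch_status: Callable[[], Dict[str, Any]],
    is_finished: Callable[[Dict[str, Any]], bool],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until is_finished(payload) is true.

    Raises TimeoutError if the job has not finished once timeout_s has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        payload = fetch_status()  # e.g. GET /upload/status?job_id=...
        if is_finished(payload):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)  # wait before asking again
```

A caller would supply `fetch_status` as a function that requests `/upload/status` and decodes the body, and `is_finished` as a check against whatever field the service actually reports.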
### List Files
```http
GET /files?user_id={user_id}&project_id={project_id}
```
### Get File Chunks
```http
GET /files/chunks?user_id={user_id}&project_id={project_id}&filename={filename}&limit={limit}
```
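The query-string endpoints above take user-supplied values such as filenames, which need URL encoding before they go on the wire. As a minimal sketch using only the standard library, the helper below builds those URLs; `endpoint_url` and the `BASE` address are illustrative assumptions (a local deployment on the default port), not part of the service.

```python
from urllib.parse import urlencode


def endpoint_url(base_url: str, path: str, **params: object) -> str:
    """Join base URL and path, appending URL-encoded query parameters."""
    # Drop parameters left as None so optional ones (such as limit) can be omitted.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = base_url.rstrip("/") + path
    return f"{url}?{query}" if query else url


BASE = "http://localhost:7860"  # assumed local deployment

health_url = endpoint_url(BASE, "/health")
status_url = endpoint_url(BASE, "/upload/status", job_id="abc123")
chunks_url = endpoint_url(
    BASE,
    "/files/chunks",
    user_id="u1",
    project_id="p1",
    filename="lecture notes.pdf",  # the space is encoded as "+"
    limit=20,
)
```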
## 🔄 Data Flow
### File Processing Pipeline
1. **File Upload**: User uploads files via frontend
2. **Load Balancing**: Request routed to ingestion pipeline
3. **File Processing**:
   - PDF/DOCX parsing with image extraction
   - BLIP image captioning
   - Semantic chunking with overlap
   - Embedding generation (all-MiniLM-L6-v2)
4. **Data Storage**:
   - Chunks stored in `chunks` collection
   - File summaries in `files` collection
   - Both scoped by `user_id` and `project_id`
5. **Response**: Job ID returned for progress tracking
### Data Consistency
- **Same Database**: Uses the same MongoDB Atlas cluster and database as the main system
- **Same Collections**: Writes to the same `chunks` and `files` collections
- **Same Schema**: Documents carry the same structure and metadata as those written by the main system
- **Same Scoping**: All data scoped by `user_id` and `project_id`
- **Same Indexes**: Uses the same database indexes
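Scoping every read and write by `user_id` and `project_id` is easiest to enforce when the query filter is built in one place. The following is a sketch of such a helper; `scoped_filter` is a hypothetical name, and treating `filename` as a stored field is an assumption based on the query parameter of the same name.

```python
from typing import Dict, Optional


def scoped_filter(
    user_id: str, project_id: str, filename: Optional[str] = None
) -> Dict[str, str]:
    """Build a MongoDB-style filter that always includes both scope keys."""
    if not user_id or not project_id:
        # Refuse to build an unscoped query rather than silently matching everything.
        raise ValueError("user_id and project_id are both required")
    query = {"user_id": user_id, "project_id": project_id}
    if filename is not None:
        query["filename"] = filename  # assumed field name
    return query
```

With PyMongo, the result could be passed directly to a call such as `db["chunks"].find(scoped_filter("u1", "p1"))`.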
## 🐳 Docker Deployment
### HuggingFace Spaces
The service is designed for HuggingFace Spaces deployment with:
- Port 7860 (HuggingFace default)
- Non-root user for security
- HuggingFace cache directories
- Model preloading and warmup
### Logging
- Comprehensive logging for all operations
- Error tracking and debugging
- Performance monitoring
### Job Tracking
- Upload progress monitoring
- Error handling and reporting
- Status updates
## 🔧 Configuration
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MONGO_URI` | Required | MongoDB connection string |
| `MONGO_DB` | `studybuddy` | Database name |
| `EMBED_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
| `ATLAS_VECTOR` | `0` | Enable Atlas Vector Search (disabled by default) |
| `MAX_FILES_PER_UPLOAD` | `15` | Maximum files per upload |
| `MAX_FILE_MB` | `50` | Maximum file size in MB |
| `INGESTION_PORT` | `7860` | Service port |
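For illustration, the table above could be read into a typed settings object as follows. This is a sketch under stated assumptions, not the contents of `config/settings.py`: the `Settings` class and `load_settings` function are hypothetical, and treating `"1"` as the value that enables `ATLAS_VECTOR` is an assumption, since only the default `0` is documented.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    mongo_db: str
    embed_model: str
    atlas_vector: bool
    max_files_per_upload: int
    max_file_mb: int
    ingestion_port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    mongo_uri = env.get("MONGO_URI")
    if not mongo_uri:
        # MONGO_URI has no default, so fail early with a clear message.
        raise RuntimeError("MONGO_URI is required")
    return Settings(
        mongo_uri=mongo_uri,
        mongo_db=env.get("MONGO_DB", "studybuddy"),
        embed_model=env.get("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        # Assumption: "1" turns the flag on; only the default "0" is documented.
        atlas_vector=env.get("ATLAS_VECTOR", "0") == "1",
        max_files_per_upload=int(env.get("MAX_FILES_PER_UPLOAD", "15")),
        max_file_mb=int(env.get("MAX_FILE_MB", "50")),
        ingestion_port=int(env.get("INGESTION_PORT", "7860")),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to exercise in tests.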
### Processing Configuration
- **Vector Dimension**: 384 (all-MiniLM-L6-v2)
- **Chunk Max Words**: 500
- **Chunk Min Words**: 150
- **Chunk Overlap**: 50 words
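To make the interaction of these three numbers concrete, here is a simplified sketch of overlapping chunking. It uses fixed word windows rather than the semantic chunking the service performs, and the handling of a short final chunk (folding it into the previous one) is an assumption chosen for the example; `chunk_words` is a hypothetical name.

```python
from typing import List


def chunk_words(
    text: str, max_words: int = 500, min_words: int = 150, overlap: int = 50
) -> List[str]:
    """Split text into word windows of up to max_words that share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    windows: List[List[str]] = []
    start = 0
    while start < len(words):
        windows.append(words[start:start + max_words])
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
        start += step
    # Fold a too-short tail into the previous chunk, skipping the words the
    # two windows already share. The merged chunk may exceed max_words.
    if len(windows) > 1 and len(windows[-1]) < min_words:
        tail = windows.pop()
        windows[-1].extend(tail[overlap:])
    return [" ".join(window) for window in windows]
```

With the defaults, a 1,000-word input yields a 500-word chunk followed by a 550-word chunk, and the last 50 words of the first chunk reappear at the start of the second.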
## 🔒 Security
### Security Features
- Non-root user in Docker container
- Input validation and sanitization
- Error handling and logging
- Rate limiting (configurable)
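As one example of input validation, an upload batch can be checked against the `MAX_FILES_PER_UPLOAD` and `MAX_FILE_MB` limits before any processing starts. The sketch below is an assumption about how such a check might look, not the service's actual code; `validate_upload` is a hypothetical name, and 1 MB is taken to mean 1024 × 1024 bytes.

```python
from typing import List, Tuple

MAX_FILES_PER_UPLOAD = 15  # documented default
MAX_FILE_MB = 50           # documented default


def validate_upload(
    files: List[Tuple[str, int]],
    max_files: int = MAX_FILES_PER_UPLOAD,
    max_file_mb: int = MAX_FILE_MB,
) -> List[str]:
    """Check (filename, size_in_bytes) pairs; return a list of problems, empty if valid."""
    problems: List[str] = []
    if not files:
        problems.append("no files provided")
    if len(files) > max_files:
        problems.append(f"too many files: {len(files)} > {max_files}")
    limit_bytes = max_file_mb * 1024 * 1024  # assumption: binary megabytes
    for name, size in files:
        if size > limit_bytes:
            problems.append(f"{name}: {size} bytes exceeds the {max_file_mb} MB limit")
    return problems
```

Returning every problem at once, rather than raising on the first, lets the API report all rejected files in a single response.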
### Best Practices
- Use environment variables for secrets
- Regular security updates
- Monitor logs for anomalies
- Implement proper access controls
## 🚀 Performance
### Optimization Features
- Lazy loading of ML models
- Efficient file processing
- Background task processing
- Memory management
### Scaling
- Horizontal scaling support
- Load balancing ready
- Resource optimization
- Performance monitoring
## 📚 Integration
### Main System Integration
The ingestion pipeline mirrors the main system, so uploads handled by either service produce interchangeable results:
- Same API endpoints
- Same data structures
- Same processing pipeline
- Same storage format
### Load Balancer Integration
- Automatic request routing
- Health check integration
- Failover support
- Performance monitoring
## 🐛 Troubleshooting
### Common Issues
1. **MongoDB Connection**: Verify `MONGO_URI` is correct
2. **Port Conflicts**: Ensure port 7860 is available
3. **Model Loading**: Check HuggingFace cache permissions
4. **File Processing**: Verify file format support
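The first three checks can be scripted. The helpers below are a rough diagnostic sketch using only the standard library; the function names are hypothetical, and the URI check only inspects the scheme, so it does not prove the cluster is reachable.

```python
import os
import socket


def mongo_uri_looks_valid(uri: str) -> bool:
    """Check that the URI uses one of the MongoDB connection-string schemes."""
    return uri.startswith(("mongodb://", "mongodb+srv://"))


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def cache_dir_writable(path: str) -> bool:
    """Return True if path is an existing directory the current user can write to."""
    return os.path.isdir(path) and os.access(path, os.W_OK)
```

For example, `port_is_free(7860)` covers the port conflict case, and `cache_dir_writable` can be pointed at the directory named by the `HF_HOME` environment variable if that is how the cache location is configured in your container.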
## 📈 Future Enhancements
### Planned Features
- Support for additional file formats
- Advanced chunking strategies
- Performance optimizations
- Enhanced monitoring
### Scalability
- Kubernetes deployment
- Auto-scaling support
- Load balancing improvements
- Resource optimization