Spaces:

BinKhoaLe1812
/

EdSummariser

Sleeping

File size: 10,046 Bytes

---
title: StuddyBuddy Ingestion
emoji: ⚙️
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: 'backend for data ingestion'
---

# Ingestion Pipeline

A dedicated service for processing file uploads and storing them in MongoDB Atlas. This service mirrors the main system's file processing functionality while running as a separate service to share the processing load.

[API docs](CURL.md)

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                USER INTERFACE                                   │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐              │
│  │   Frontend UI   │    │  Load Balancer   │    │  Main System    │              │
│  │                 │◄──►│                  │◄──►│   (Port 7860)   │              │
│  │ - File Upload   │    │ - Route Requests │    │ - Chat & Reports│              │
│  │ - Chat Interface│    │ - Health Checks │    │ - User Management│             │
│  │ - Project Mgmt  │    │ - Load Balancing │    │ - Analytics     │              │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              INGESTION PIPELINE                                │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐              │
│  │ File Processing │    │   Data Storage   │    │   Monitoring    │              │
│  │ - PDF/DOCX Parse│    │ - MongoDB Atlas │    │ - Job Status    │              │
│  │ - Image Caption │    │ - Vector Search │    │ - Health Checks │              │
│  │ - Text Chunking │    │ - Embeddings    │    │ - Error Handling│              │
│  │ - Embedding Gen │    │ - User/Project  │    │ - Logging       │              │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              SHARED DATABASE                                   │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐              │
│  │   MongoDB Atlas  │    │   Collections    │    │   Indexes       │              │
│  │                 │    │ - chunks         │    │ - Vector Search │              │
│  │ - Same Cluster  │    │ - files          │    │ - Text Search   │              │
│  │ - Same Database │    │ - chat_sessions  │    │ - User/Project  │              │
│  │ - Same Schema   │    │ - chat_messages  │    │ - Performance   │              │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────────┘
```

## 📁 Project Structure

```
ingestion_pipeline/
├── __init__.py
├── app.py                    # Main FastAPI application
├── requirements.txt          # Python dependencies
├── Dockerfile               # HuggingFace deployment
├── deploy.sh               # Deployment script
├── test_pipeline.py        # Test script
├── README.md               # This file
├── config/               # Configuration
│   ├── __init__.py
│   └── settings.py
├── api/                   # API layer
│   ├── __init__.py
│   ├── models.py         # Pydantic models
│   └── routes.py         # API routes
└── services/             # Business logic
    ├── __init__.py
    └── ingestion_service.py
```

## 🚀 Quick Start

### Prerequisites
- Docker
- MongoDB Atlas cluster
- Python 3.11+


## 🔧 API Endpoints

### Health Check
```http
GET /health
```

### Upload Files
```http
POST /upload
Content-Type: multipart/form-data

user_id: string
project_id: string
files: File[]
replace_filenames: string (optional)
rename_map: string (optional)
```

### Job Status
```http
GET /upload/status?job_id={job_id}
```

### List Files
```http
GET /files?user_id={user_id}&project_id={project_id}
```

### Get File Chunks
```http
GET /files/chunks?user_id={user_id}&project_id={project_id}&filename={filename}&limit={limit}
```

## 🔄 Data Flow

### File Processing Pipeline
1. **File Upload**: User uploads files via frontend
2. **Load Balancing**: Request routed to ingestion pipeline
3. **File Processing**: 
   - PDF/DOCX parsing with image extraction
   - BLIP image captioning
   - Semantic chunking with overlap
   - Embedding generation (all-MiniLM-L6-v2)
4. **Data Storage**: 
   - Chunks stored in `chunks` collection
   - File summaries in `files` collection
   - Both scoped by `user_id` and `project_id`
5. **Response**: Job ID returned for progress tracking

### Data Consistency
- **Same Database**: Uses identical MongoDB Atlas cluster
- **Same Collections**: Stores in `chunks` and `files` collections
- **Same Schema**: Identical data structure and metadata
- **Same Scoping**: All data scoped by `user_id` and `project_id`
- **Same Indexes**: Uses identical database indexes

## 🐳 Docker Deployment

### HuggingFace Spaces
The service is designed for HuggingFace Spaces deployment with:
- Port 7860 (HuggingFace default)
- Non-root user for security
- HuggingFace cache directories
- Model preloading and warmup

### Logging
- Comprehensive logging for all operations
- Error tracking and debugging
- Performance monitoring

### Job Tracking
- Upload progress monitoring
- Error handling and reporting
- Status updates

## 🔧 Configuration

### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MONGO_URI` | Required | MongoDB connection string |
| `MONGO_DB` | `studybuddy` | Database name |
| `EMBED_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
| `ATLAS_VECTOR` | `0` | Enable Atlas Vector Search |
| `MAX_FILES_PER_UPLOAD` | `15` | Maximum files per upload |
| `MAX_FILE_MB` | `50` | Maximum file size in MB |
| `INGESTION_PORT` | `7860` | Service port |

### Processing Configuration
- **Vector Dimension**: 384 (all-MiniLM-L6-v2)
- **Chunk Max Words**: 500
- **Chunk Min Words**: 150
- **Chunk Overlap**: 50 words

## 🔒 Security

### Security Features
- Non-root user in Docker container
- Input validation and sanitization
- Error handling and logging
- Rate limiting (configurable)

### Best Practices
- Use environment variables for secrets
- Regular security updates
- Monitor logs for anomalies
- Implement proper access controls

## 🚀 Performance

### Optimization Features
- Lazy loading of ML models
- Efficient file processing
- Background task processing
- Memory management

### Scaling
- Horizontal scaling support
- Load balancing ready
- Resource optimization
- Performance monitoring

## 📚 Integration

### Main System Integration
The ingestion pipeline is designed to work seamlessly with the main system:
- Same API endpoints
- Same data structures
- Same processing pipeline
- Same storage format

### Load Balancer Integration
- Automatic request routing
- Health check integration
- Failover support
- Performance monitoring

## 🐛 Troubleshooting

### Common Issues
1. **MongoDB Connection**: Verify `MONGO_URI` is correct
2. **Port Conflicts**: Ensure port 7860 is available
3. **Model Loading**: Check HuggingFace cache permissions
4. **File Processing**: Verify file format support

## 📈 Future Enhancements

### Planned Features
- Multiple file format support
- Advanced chunking strategies
- Performance optimizations
- Enhanced monitoring

### Scalability
- Kubernetes deployment
- Auto-scaling support
- Load balancing improvements
- Resource optimization