---
title: StudyBuddy Ingestion
emoji: ⚙️
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: 'backend for data ingestion'
---

# Ingestion Pipeline

A dedicated service that processes file uploads and stores the results in MongoDB Atlas. It mirrors the main system's file-processing functionality but runs as a separate service to share the processing load.

[API docs](CURL.md)

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                USER INTERFACE                                   │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐              │
│  │   Frontend UI   │    │  Load Balancer   │    │  Main System    │              │
│  │                 │◄──►│                  │◄──►│   (Port 7860)   │              │
│  │ - File Upload   │    │ - Route Requests │    │ - Chat & Reports│              │
│  │ - Chat Interface│    │ - Health Checks  │    │ - User Mgmt     │              │
│  │ - Project Mgmt  │    │ - Load Balancing │    │ - Analytics     │              │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────────────┐
│                              INGESTION PIPELINE                                │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐              │
│  │ File Processing │    │   Data Storage   │    │   Monitoring    │              │
│  │ - PDF/DOCX Parse│    │ - MongoDB Atlas  │    │ - Job Status    │              │
│  │ - Image Caption │    │ - Vector Search  │    │ - Health Checks │              │
│  │ - Text Chunking │    │ - Embeddings     │    │ - Error Handling│              │
│  │ - Embedding Gen │    │ - User/Project   │    │ - Logging       │              │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────────────┐
│                              SHARED DATABASE                                   │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐              │
│  │  MongoDB Atlas  │    │   Collections    │    │   Indexes       │              │
│  │                 │    │ - chunks         │    │ - Vector Search │              │
│  │ - Same Cluster  │    │ - files          │    │ - Text Search   │              │
│  │ - Same Database │    │ - chat_sessions  │    │ - User/Project  │              │
│  │ - Same Schema   │    │ - chat_messages  │    │ - Performance   │              │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────────┘
```

## 📁 Project Structure

```
ingestion_pipeline/
├── __init__.py
├── app.py                    # Main FastAPI application
├── requirements.txt          # Python dependencies
├── Dockerfile               # HuggingFace deployment
├── deploy.sh               # Deployment script
├── test_pipeline.py        # Test script
├── README.md               # This file
├── config/               # Configuration
│   ├── __init__.py
│   └── settings.py
├── api/                   # API layer
│   ├── __init__.py
│   ├── models.py         # Pydantic models
│   └── routes.py         # API routes
└── services/             # Business logic
    ├── __init__.py
    └── ingestion_service.py
```

## 🚀 Quick Start

### Prerequisites
- Docker
- MongoDB Atlas cluster
- Python 3.11+


## 🔧 API Endpoints

### Health Check
```http
GET /health
```

### Upload Files
```http
POST /upload
Content-Type: multipart/form-data

user_id: string
project_id: string
files: File[]
replace_filenames: string (optional)
rename_map: string (optional)
```

### Job Status
```http
GET /upload/status?job_id={job_id}
```

### List Files
```http
GET /files?user_id={user_id}&project_id={project_id}
```

### Get File Chunks
```http
GET /files/chunks?user_id={user_id}&project_id={project_id}&filename={filename}&limit={limit}
```
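
The read endpoints above take their parameters in the query string. As a minimal sketch, the URLs can be built with the standard library. The base URL and the helper names (`status_url`, `files_url`, `chunks_url`) are assumptions for illustration and are not part of this repository:

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with the address of your own deployment.
BASE_URL = "http://localhost:7860"


def status_url(job_id: str, base: str = BASE_URL) -> str:
    """Build the URL for GET /upload/status."""
    return f"{base}/upload/status?{urlencode({'job_id': job_id})}"


def files_url(user_id: str, project_id: str, base: str = BASE_URL) -> str:
    """Build the URL for GET /files, scoped to one user and project."""
    query = urlencode({"user_id": user_id, "project_id": project_id})
    return f"{base}/files?{query}"


def chunks_url(user_id: str, project_id: str, filename: str, limit: int,
               base: str = BASE_URL) -> str:
    """Build the URL for GET /files/chunks; urlencode escapes the filename."""
    query = urlencode({
        "user_id": user_id,
        "project_id": project_id,
        "filename": filename,
        "limit": limit,
    })
    return f"{base}/files/chunks?{query}"
```

Using `urlencode` rather than string concatenation keeps filenames with spaces or special characters from breaking the request.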

## 🔄 Data Flow

### File Processing Pipeline
1. **File Upload**: User uploads files via frontend
2. **Load Balancing**: Request routed to ingestion pipeline
3. **File Processing**: 
   - PDF/DOCX parsing with image extraction
   - BLIP image captioning
   - Semantic chunking with overlap
   - Embedding generation (all-MiniLM-L6-v2)
4. **Data Storage**: 
   - Chunks stored in `chunks` collection
   - File summaries in `files` collection
   - Both scoped by `user_id` and `project_id`
5. **Response**: Job ID returned for progress tracking

### Data Consistency
- **Same Database**: Uses the same MongoDB Atlas cluster and database as the main system
- **Same Collections**: Stores data in the shared `chunks` and `files` collections
- **Same Schema**: Writes documents with the same structure and metadata as the main system
- **Same Scoping**: All data is scoped by `user_id` and `project_id`
- **Same Indexes**: Relies on the same database indexes as the main system
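
To illustrate the scoping rule, here is a minimal sketch of a helper that builds a MongoDB-style filter restricted to one user and project. The helper name `scoped_filter` is hypothetical, and the `filename` field is assumed from the query parameters of the API rather than confirmed from the schema:

```python
from typing import Any, Dict, Optional


def scoped_filter(user_id: str, project_id: str,
                  filename: Optional[str] = None) -> Dict[str, Any]:
    """Return a query filter that never crosses user or project boundaries."""
    if not user_id or not project_id:
        raise ValueError("user_id and project_id are both required")
    query: Dict[str, Any] = {"user_id": user_id, "project_id": project_id}
    if filename is not None:
        # Assumption: documents carry the original filename in a `filename` field.
        query["filename"] = filename
    return query
```

With PyMongo, a filter like this could be passed directly to `collection.find(...)` on either the `chunks` or the `files` collection.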

## 🐳 Docker Deployment

### HuggingFace Spaces
The service is designed for HuggingFace Spaces deployment with:
- Port 7860 (HuggingFace default)
- Non-root user for security
- HuggingFace cache directories
- Model preloading and warmup

### Logging
- Comprehensive logging for all operations
- Error tracking and debugging
- Performance monitoring

### Job Tracking
- Upload progress monitoring
- Error handling and reporting
- Status updates
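
A minimal in-memory sketch of how upload jobs might be tracked. The class, the field names, and the status strings (`processing`, `completed`, `failed`) are assumptions for illustration; the actual service may store and report job state differently:

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Job:
    total: int                      # number of files in the upload
    completed: int = 0              # files processed so far (success or error)
    status: str = "processing"
    errors: List[str] = field(default_factory=list)


class JobTracker:
    """Keeps job state in process memory (lost on restart)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def create(self, total_files: int) -> str:
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = Job(total=total_files)
        return job_id

    def file_done(self, job_id: str, error: Optional[str] = None) -> None:
        job = self._jobs[job_id]
        job.completed += 1
        if error:
            job.errors.append(error)
        if job.completed >= job.total:
            # The job fails only if every file failed; partial errors are reported.
            job.status = "failed" if len(job.errors) == job.total else "completed"

    def status(self, job_id: str) -> dict:
        job = self._jobs[job_id]
        return {
            "job_id": job_id,
            "status": job.status,
            "completed": job.completed,
            "total": job.total,
            "errors": list(job.errors),
        }
```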

## 🔧 Configuration

### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MONGO_URI` | Required | MongoDB connection string |
| `MONGO_DB` | `studybuddy` | Database name |
| `EMBED_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
| `ATLAS_VECTOR` | `0` | Enable Atlas Vector Search |
| `MAX_FILES_PER_UPLOAD` | `15` | Maximum files per upload |
| `MAX_FILE_MB` | `50` | Maximum file size in MB |
| `INGESTION_PORT` | `7860` | Service port |
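
As a sketch of how these variables could be read with the defaults from the table, the loader below uses only the standard library. The `Settings` class and `load_settings` function are hypothetical and do not necessarily match `config/settings.py`; treating `ATLAS_VECTOR=1` as "enabled" is also an assumption:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    mongo_db: str
    embed_model: str
    atlas_vector: bool
    max_files_per_upload: int
    max_file_mb: int
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment, applying the documented defaults."""
    mongo_uri = env.get("MONGO_URI")
    if not mongo_uri:
        raise RuntimeError("MONGO_URI is required")
    return Settings(
        mongo_uri=mongo_uri,
        mongo_db=env.get("MONGO_DB", "studybuddy"),
        embed_model=env.get("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        # Assumption: "1" enables the feature, any other value disables it.
        atlas_vector=env.get("ATLAS_VECTOR", "0") == "1",
        max_files_per_upload=int(env.get("MAX_FILES_PER_UPLOAD", "15")),
        max_file_mb=int(env.get("MAX_FILE_MB", "50")),
        port=int(env.get("INGESTION_PORT", "7860")),
    )
```

Failing fast on a missing `MONGO_URI` surfaces misconfiguration at startup instead of on the first upload.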

### Processing Configuration
- **Vector Dimension**: 384 (all-MiniLM-L6-v2)
- **Chunk Max Words**: 500
- **Chunk Min Words**: 150
- **Chunk Overlap**: 50 words
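
To make the three chunking parameters concrete, here is a simplified sketch that splits text into overlapping word windows. It is a plain sliding-window chunker, not the semantic chunking the service uses, and the function name `chunk_words` is hypothetical. One design choice to note: a trailing window shorter than the minimum is folded into the previous chunk, so the final chunk may exceed the maximum.

```python
from typing import List


def chunk_words(text: str, max_words: int = 500, min_words: int = 150,
                overlap: int = 50) -> List[str]:
    """Split text into word windows of up to max_words that share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    windows: List[List[str]] = []
    start = 0
    while start < len(words):
        windows.append(words[start:start + max_words])
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
        start += step
    # Fold a too-short trailing window into its predecessor,
    # skipping the words the two windows already share.
    if len(windows) > 1 and len(windows[-1]) < min_words:
        tail = windows.pop()
        windows[-1].extend(tail[overlap:])
    return [" ".join(window) for window in windows]
```

For a 1,200-word text with the defaults, the windows start at words 0, 450, and 900, so each chunk repeats the last 50 words of the one before it.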

## 🔒 Security

### Security Features
- Non-root user in Docker container
- Input validation and sanitization
- Error handling and logging
- Rate limiting (configurable)
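
As one example of input validation, the sketch below checks an upload against the documented `MAX_FILES_PER_UPLOAD` and `MAX_FILE_MB` limits. The function name, the `(filename, size_in_bytes)` input shape, and the choice of 1 MB = 1024 × 1024 bytes are assumptions for illustration:

```python
from typing import List, Tuple

MAX_FILES_PER_UPLOAD = 15  # default from the configuration table
MAX_FILE_MB = 50           # default from the configuration table


def validate_upload(files: List[Tuple[str, int]],
                    max_files: int = MAX_FILES_PER_UPLOAD,
                    max_file_mb: int = MAX_FILE_MB) -> List[str]:
    """Return a list of problems; an empty list means the upload is acceptable.

    `files` holds (filename, size_in_bytes) pairs.
    """
    problems: List[str] = []
    if not files:
        problems.append("no files provided")
    if len(files) > max_files:
        problems.append(f"too many files: {len(files)} > {max_files}")
    limit_bytes = max_file_mb * 1024 * 1024  # assumption: binary megabytes
    for name, size in files:
        if size > limit_bytes:
            problems.append(f"{name}: exceeds {max_file_mb} MB")
    return problems
```

Returning every problem at once, rather than raising on the first, lets the API report all rejected files in a single response.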

### Best Practices
- Use environment variables for secrets
- Regular security updates
- Monitor logs for anomalies
- Implement proper access controls

## 🚀 Performance

### Optimization Features
- Lazy loading of ML models
- Efficient file processing
- Background task processing
- Memory management
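
Lazy loading can be sketched with `functools.lru_cache`: the model is loaded on the first call and reused afterwards. The `_load_model` function here is a stand-in that returns a plain dictionary so the example runs without ML dependencies; in a real service its body would construct the actual model object.

```python
from functools import lru_cache
from typing import List

# Records each real load so the caching behaviour can be observed.
LOAD_CALLS: List[str] = []


def _load_model(name: str) -> dict:
    """Stand-in for an expensive model load (hypothetical placeholder)."""
    LOAD_CALLS.append(name)
    return {"name": name}


@lru_cache(maxsize=None)
def get_model(name: str = "sentence-transformers/all-MiniLM-L6-v2") -> dict:
    """Load the model on first use and return the cached instance afterwards."""
    return _load_model(name)
```

Calling `get_model()` repeatedly returns the same object, so startup stays fast and the load cost is paid only by the first request that needs the model.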

### Scaling
- Horizontal scaling support
- Load balancing ready
- Resource optimization
- Performance monitoring

## 📚 Integration

### Main System Integration
The ingestion pipeline is designed to work seamlessly with the main system:
- Same API endpoints
- Same data structures
- Same processing pipeline
- Same storage format

### Load Balancer Integration
- Automatic request routing
- Health check integration
- Failover support
- Performance monitoring

## 🐛 Troubleshooting

### Common Issues
1. **MongoDB Connection**: Verify `MONGO_URI` is correct
2. **Port Conflicts**: Ensure port 7860 is available
3. **Model Loading**: Check HuggingFace cache permissions
4. **File Processing**: Verify file format support

## 📈 Future Enhancements

### Planned Features
- Support for additional file formats
- Advanced chunking strategies
- Performance optimizations
- Enhanced monitoring

### Scalability
- Kubernetes deployment
- Auto-scaling support
- Load balancing improvements
- Resource optimization