Spaces:

BinKhoaLe1812
/

EdSummariser

Sleeping

App Files Files Community

EdSummariser / ingestion_python /README.md

LiamKhoaLe

Initial setups refine for ingestion_js #1

1f9183d 3 months ago

preview code

raw

history blame contribute delete

10 kB

	---
	title: StuddyBuddy Ingestion
	emoji: ⚙️
	colorFrom: blue
	colorTo: pink
	sdk: docker
	pinned: false
	license: mit
	short_description: 'backend for data ingestion'
	---

	# Ingestion Pipeline

	A dedicated service for processing file uploads and storing them in MongoDB Atlas. This service mirrors the main system's file processing functionality while running as a separate service to share the processing load.

	[API docs](CURL.md)

	## 🏗️ Architecture

	```
	┌─────────────────────────────────────────────────────────────────────────────────┐
	│ USER INTERFACE │
	│ ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ │
	│ │ Frontend UI │ │ Load Balancer │ │ Main System │ │
	│ │ │◄──►│ │◄──►│ (Port 7860) │ │
	│ │ - File Upload │ │ - Route Requests │ │ - Chat & Reports│ │
	│ │ - Chat Interface│ │ - Health Checks │ │ - User Management│ │
	│ │ - Project Mgmt │ │ - Load Balancing │ │ - Analytics │ │
	│ └─────────────────┘ └──────────────────┘ └─────────────────┘ │
	└─────────────────────────────────────────────────────────────────────────────────┘
	│
	▼
	┌─────────────────────────────────────────────────────────────────────────────────┐
	│ INGESTION PIPELINE │
	│ ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ │
	│ │ File Processing │ │ Data Storage │ │ Monitoring │ │
	│ │ - PDF/DOCX Parse│ │ - MongoDB Atlas │ │ - Job Status │ │
	│ │ - Image Caption │ │ - Vector Search │ │ - Health Checks │ │
	│ │ - Text Chunking │ │ - Embeddings │ │ - Error Handling│ │
	│ │ - Embedding Gen │ │ - User/Project │ │ - Logging │ │
	│ └─────────────────┘ └──────────────────┘ └─────────────────┘ │
	└─────────────────────────────────────────────────────────────────────────────────┘
	│
	▼
	┌─────────────────────────────────────────────────────────────────────────────────┐
	│ SHARED DATABASE │
	│ ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ │
	│ │ MongoDB Atlas │ │ Collections │ │ Indexes │ │
	│ │ │ │ - chunks │ │ - Vector Search │ │
	│ │ - Same Cluster │ │ - files │ │ - Text Search │ │
	│ │ - Same Database │ │ - chat_sessions │ │ - User/Project │ │
	│ │ - Same Schema │ │ - chat_messages │ │ - Performance │ │
	│ └─────────────────┘ └──────────────────┘ └─────────────────┘ │
	└─────────────────────────────────────────────────────────────────────────────────┘
	```

	## 📁 Project Structure

	```
	ingestion_pipeline/
	├── __init__.py
	├── app.py # Main FastAPI application
	├── requirements.txt # Python dependencies
	├── Dockerfile # HuggingFace deployment
	├── deploy.sh # Deployment script
	├── test_pipeline.py # Test script
	├── README.md # This file
	├── config/ # Configuration
	│ ├── __init__.py
	│ └── settings.py
	├── api/ # API layer
	│ ├── __init__.py
	│ ├── models.py # Pydantic models
	│ └── routes.py # API routes
	└── services/ # Business logic
	├── __init__.py
	└── ingestion_service.py
	```

	## 🚀 Quick Start

	### Prerequisites
	- Docker
	- MongoDB Atlas cluster
	- Python 3.11+


	## 🔧 API Endpoints

	### Health Check
	```http
	GET /health
	```

	### Upload Files
	```http
	POST /upload
	Content-Type: multipart/form-data

	user_id: string
	project_id: string
	files: File[]
	replace_filenames: string (optional)
	rename_map: string (optional)
	```

	### Job Status
	```http
	GET /upload/status?job_id={job_id}
	```

	### List Files
	```http
	GET /files?user_id={user_id}&project_id={project_id}
	```

	### Get File Chunks
	```http
	GET /files/chunks?user_id={user_id}&project_id={project_id}&filename={filename}&limit={limit}
	```

	## 🔄 Data Flow

	### File Processing Pipeline
	1. File Upload: User uploads files via frontend
	2. Load Balancing: Request routed to ingestion pipeline
	3. File Processing:
	- PDF/DOCX parsing with image extraction
	- BLIP image captioning
	- Semantic chunking with overlap
	- Embedding generation (all-MiniLM-L6-v2)
	4. Data Storage:
	- Chunks stored in `chunks` collection
	- File summaries in `files` collection
	- Both scoped by `user_id` and `project_id`
	5. Response: Job ID returned for progress tracking

	### Data Consistency
	- Same Database: Uses identical MongoDB Atlas cluster
	- Same Collections: Stores in `chunks` and `files` collections
	- Same Schema: Identical data structure and metadata
	- Same Scoping: All data scoped by `user_id` and `project_id`
	- Same Indexes: Uses identical database indexes

	## 🐳 Docker Deployment

	### HuggingFace Spaces
	The service is designed for HuggingFace Spaces deployment with:
	- Port 7860 (HuggingFace default)
	- Non-root user for security
	- HuggingFace cache directories
	- Model preloading and warmup

	### Logging
	- Comprehensive logging for all operations
	- Error tracking and debugging
	- Performance monitoring

	### Job Tracking
	- Upload progress monitoring
	- Error handling and reporting
	- Status updates

	## 🔧 Configuration

	### Environment Variables
	\| Variable \| Default \| Description \|
	\|----------\|---------\|-------------\|
	\| `MONGO_URI` \| Required \| MongoDB connection string \|
	\| `MONGO_DB` \| `studybuddy` \| Database name \|
	\| `EMBED_MODEL` \| `sentence-transformers/all-MiniLM-L6-v2` \| Embedding model \|
	\| `ATLAS_VECTOR` \| `0` \| Enable Atlas Vector Search \|
	\| `MAX_FILES_PER_UPLOAD` \| `15` \| Maximum files per upload \|
	\| `MAX_FILE_MB` \| `50` \| Maximum file size in MB \|
	\| `INGESTION_PORT` \| `7860` \| Service port \|

	### Processing Configuration
	- Vector Dimension: 384 (all-MiniLM-L6-v2)
	- Chunk Max Words: 500
	- Chunk Min Words: 150
	- Chunk Overlap: 50 words

	## 🔒 Security

	### Security Features
	- Non-root user in Docker container
	- Input validation and sanitization
	- Error handling and logging
	- Rate limiting (configurable)

	### Best Practices
	- Use environment variables for secrets
	- Regular security updates
	- Monitor logs for anomalies
	- Implement proper access controls

	## 🚀 Performance

	### Optimization Features
	- Lazy loading of ML models
	- Efficient file processing
	- Background task processing
	- Memory management

	### Scaling
	- Horizontal scaling support
	- Load balancing ready
	- Resource optimization
	- Performance monitoring

	## 📚 Integration

	### Main System Integration
	The ingestion pipeline is designed to work seamlessly with the main system:
	- Same API endpoints
	- Same data structures
	- Same processing pipeline
	- Same storage format

	### Load Balancer Integration
	- Automatic request routing
	- Health check integration
	- Failover support
	- Performance monitoring

	## 🐛 Troubleshooting

	### Common Issues
	1. MongoDB Connection: Verify `MONGO_URI` is correct
	2. Port Conflicts: Ensure port 7860 is available
	3. Model Loading: Check HuggingFace cache permissions
	4. File Processing: Verify file format support

	## 📈 Future Enhancements

	### Planned Features
	- Multiple file format support
	- Advanced chunking strategies
	- Performance optimizations
	- Enhanced monitoring

	### Scalability
	- Kubernetes deployment
	- Auto-scaling support
	- Load balancing improvements
	- Resource optimization