metadata

title: StuddyBuddy Ingestion
emoji: ⚙️
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: backend for data ingestion

Ingestion Pipeline

A dedicated service that processes file uploads and stores the results in MongoDB Atlas. It mirrors the main system's file processing functionality but runs separately, so the two can share the processing load.

API docs

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                USER INTERFACE                                   │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐              │
│  │   Frontend UI   │    │  Load Balancer   │    │  Main System    │              │
│  │                 │◄──►│                  │◄──►│   (Port 7860)   │              │
│  │ - File Upload   │    │ - Route Requests │    │ - Chat & Reports│              │
│  │ - Chat Interface│    │ - Health Checks │    │ - User Management│             │
│  │ - Project Mgmt  │    │ - Load Balancing │    │ - Analytics     │              │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              INGESTION PIPELINE                                │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐              │
│  │ File Processing │    │   Data Storage   │    │   Monitoring    │              │
│  │ - PDF/DOCX Parse│    │ - MongoDB Atlas │    │ - Job Status    │              │
│  │ - Image Caption │    │ - Vector Search │    │ - Health Checks │              │
│  │ - Text Chunking │    │ - Embeddings    │    │ - Error Handling│              │
│  │ - Embedding Gen │    │ - User/Project  │    │ - Logging       │              │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              SHARED DATABASE                                   │
│  ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐              │
│  │   MongoDB Atlas  │    │   Collections    │    │   Indexes       │              │
│  │                 │    │ - chunks         │    │ - Vector Search │              │
│  │ - Same Cluster  │    │ - files          │    │ - Text Search   │              │
│  │ - Same Database │    │ - chat_sessions  │    │ - User/Project  │              │
│  │ - Same Schema   │    │ - chat_messages  │    │ - Performance   │              │
│  └─────────────────┘    └──────────────────┘    └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────────┘

📁 Project Structure

ingestion_pipeline/
├── __init__.py
├── app.py                    # Main FastAPI application
├── requirements.txt          # Python dependencies
├── Dockerfile               # HuggingFace deployment
├── deploy.sh               # Deployment script
├── test_pipeline.py        # Test script
├── README.md               # This file
├── config/               # Configuration
│   ├── __init__.py
│   └── settings.py
├── api/                   # API layer
│   ├── __init__.py
│   ├── models.py         # Pydantic models
│   └── routes.py         # API routes
└── services/             # Business logic
    ├── __init__.py
    └── ingestion_service.py

🚀 Quick Start

Prerequisites

Docker
MongoDB Atlas cluster
Python 3.11+

🔧 API Endpoints

Health Check

GET /health

Upload Files

POST /upload
Content-Type: multipart/form-data

user_id: string
project_id: string
files: File[]
replace_filenames: string (optional)
rename_map: string (optional)
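
To illustrate the request shape above, here is a minimal sketch that encodes the form fields as `multipart/form-data` using only the standard library. The field names come from the list above. The helper name `build_upload_body`, the `application/octet-stream` part type, and the choice to pass `replace_filenames` and `rename_map` through as opaque strings are assumptions; the service may expect a specific encoding for those two fields.

```python
import uuid


def build_upload_body(user_id, project_id, files,
                      replace_filenames=None, rename_map=None):
    """Encode the /upload form fields as multipart/form-data.

    files is a list of (filename, content_bytes) pairs.
    Returns (body_bytes, content_type_header_value).
    Hypothetical helper, not part of the service itself.
    """
    boundary = uuid.uuid4().hex
    parts = []

    def add_field(name, value):
        part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(part.encode("utf-8"))

    add_field("user_id", user_id)
    add_field("project_id", project_id)
    # Optional fields are only sent when the caller provides them.
    if replace_filenames is not None:
        add_field("replace_filenames", replace_filenames)
    if rename_map is not None:
        add_field("rename_map", rename_map)

    # Each file is sent under the repeated "files" field.
    for filename, content in files:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        )
        parts.append(header.encode("utf-8") + content + b"\r\n")

    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

The returned body and header value can be handed to any HTTP client, for example `urllib.request.Request(url, data=body, headers={"Content-Type": content_type}, method="POST")`, where `url` points at the `/upload` route of your deployment. The sketch does not escape quotes in filenames.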

Job Status

GET /upload/status?job_id={job_id}

List Files

GET /files?user_id={user_id}&project_id={project_id}

Get File Chunks

GET /files/chunks?user_id={user_id}&project_id={project_id}&filename={filename}&limit={limit}
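
The three GET routes above take all of their inputs as query parameters, so values such as filenames need URL encoding. The sketch below builds the URLs with `urllib.parse.urlencode`. The base URL `http://localhost:7860` and the function names are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed local deployment; replace with your Space or host URL.
BASE_URL = "http://localhost:7860"


def job_status_url(job_id, base_url=BASE_URL):
    """URL for GET /upload/status."""
    return f"{base_url}/upload/status?{urlencode({'job_id': job_id})}"


def list_files_url(user_id, project_id, base_url=BASE_URL):
    """URL for GET /files."""
    query = urlencode({"user_id": user_id, "project_id": project_id})
    return f"{base_url}/files?{query}"


def file_chunks_url(user_id, project_id, filename, limit, base_url=BASE_URL):
    """URL for GET /files/chunks; filename is percent/plus-encoded."""
    query = urlencode({
        "user_id": user_id,
        "project_id": project_id,
        "filename": filename,
        "limit": limit,
    })
    return f"{base_url}/files/chunks?{query}"
```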

🔄 Data Flow

File Processing Pipeline

1. File Upload: User uploads files via frontend
2. Load Balancing: Request routed to ingestion pipeline
3. File Processing:
   - PDF/DOCX parsing with image extraction
   - BLIP image captioning
   - Semantic chunking with overlap
   - Embedding generation (all-MiniLM-L6-v2)
4. Data Storage:
   - Chunks stored in chunks collection
   - File summaries in files collection
   - Both scoped by user_id and project_id
5. Response: Job ID returned for progress tracking

Data Consistency

Same Database: Uses the same MongoDB Atlas cluster and database as the main system
Same Collections: Stores in the chunks and files collections
Same Schema: Same document structure and metadata as the main system
Same Scoping: All data scoped by user_id and project_id
Same Indexes: Relies on the same database indexes as the main system
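
Because every document is scoped by `user_id` and `project_id`, reads against the shared collections would normally carry both fields in the filter. The following sketch builds such a filter as a plain dictionary. The helper name and the use of `filename` as a stored field name are assumptions; only `user_id` and `project_id` are confirmed by this README.

```python
def scoped_filter(user_id, project_id, filename=None):
    """Build a MongoDB-style filter limited to one user and project.

    Hypothetical helper: refuses to build an unscoped query so that
    one user's chunks are never returned for another user.
    """
    if not user_id or not project_id:
        raise ValueError("user_id and project_id are both required")
    query = {"user_id": user_id, "project_id": project_id}
    if filename is not None:
        # Assumed field name; check the actual schema before relying on it.
        query["filename"] = filename
    return query


# With PyMongo this could be passed straight to a collection, e.g.:
#   db["chunks"].find(scoped_filter("u1", "p1", "notes.pdf"))
```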

🐳 Docker Deployment

HuggingFace Spaces

The service is designed for HuggingFace Spaces deployment with:

Port 7860 (HuggingFace default)
Non-root user for security
HuggingFace cache directories
Model preloading and warmup

Logging

Comprehensive logging for all operations
Error tracking and debugging
Performance monitoring

Job Tracking

Upload progress monitoring
Error handling and reporting
Status updates
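
A client can follow upload progress by polling the job status route until the job finishes. This README does not document the status response schema, so the sketch below takes the fetch call and the completion check as parameters instead of assuming field names. The function name, interval, and timeout values are illustrative.

```python
import time


def poll_job(fetch_status, is_finished, interval_s=2.0, timeout_s=300.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until is_finished(status) is true.

    fetch_status: zero-argument callable returning the latest job status,
                  for example a wrapper around GET /upload/status.
    is_finished:  callable deciding whether a status is terminal.
    Raises TimeoutError if the job does not finish within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if is_finished(status):
            return status
        if clock() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test without real delays.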

🔧 Configuration

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MONGO_URI` | Required | MongoDB connection string |
| `MONGO_DB` | `studybuddy` | Database name |
| `EMBED_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
| `ATLAS_VECTOR` | `0` | Enable Atlas Vector Search |
| `MAX_FILES_PER_UPLOAD` | `15` | Maximum files per upload |
| `MAX_FILE_MB` | `50` | Maximum file size in MB |
| `INGESTION_PORT` | `7860` | Service port |
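
As an illustration of how these variables and defaults fit together, here is a minimal loader sketch. It is not the contents of `config/settings.py`. The `Settings` class, the `load_settings` name, and the rule that `ATLAS_VECTOR=1` means enabled are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    mongo_db: str
    embed_model: str
    atlas_vector: bool
    max_files_per_upload: int
    max_file_mb: int
    ingestion_port: int


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    mongo_uri = env.get("MONGO_URI")
    if not mongo_uri:
        # MONGO_URI has no default, so fail early with a clear message.
        raise RuntimeError("MONGO_URI is required")
    return Settings(
        mongo_uri=mongo_uri,
        mongo_db=env.get("MONGO_DB", "studybuddy"),
        embed_model=env.get("EMBED_MODEL",
                            "sentence-transformers/all-MiniLM-L6-v2"),
        # Assumption: "1" enables Atlas Vector Search, anything else disables it.
        atlas_vector=env.get("ATLAS_VECTOR", "0") == "1",
        max_files_per_upload=int(env.get("MAX_FILES_PER_UPLOAD", "15")),
        max_file_mb=int(env.get("MAX_FILE_MB", "50")),
        ingestion_port=int(env.get("INGESTION_PORT", "7860")),
    )
```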

Processing Configuration

Vector Dimension: 384 (all-MiniLM-L6-v2)
Chunk Max Words: 500
Chunk Min Words: 150
Chunk Overlap: 50 words
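
To show how the three chunk parameters interact, here is a simplified fixed-window chunker. It is not the service's semantic chunking: it splits on whitespace only, whereas the real pipeline presumably respects sentence or section boundaries. The function name and the rule for merging a short final chunk are assumptions.

```python
def chunk_words(text, max_words=500, min_words=150, overlap=50):
    """Split text into word windows that share `overlap` words.

    A final chunk shorter than min_words is merged into the previous
    chunk, so the merged chunk can exceed max_words.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + max_words])
        if start + max_words >= len(words):
            break  # this window reached the end of the text
        start += step
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        tail = chunks.pop()
        # Skip the overlapping words already present in the previous chunk.
        chunks[-1].extend(tail[overlap:])
    return [" ".join(chunk) for chunk in chunks]
```

With the defaults, a 1,200-word document yields three chunks starting at words 0, 450, and 900, and each chunk begins with the last 50 words of the one before it.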

🔒 Security

Security Features

Non-root user in Docker container
Input validation and sanitization
Error handling and logging
Rate limiting (configurable)
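
As one example of input validation, an upload batch can be checked against the `MAX_FILES_PER_UPLOAD` and `MAX_FILE_MB` limits before any processing starts. This is a sketch under assumptions: the function name is hypothetical, and 1 MB is taken to mean 1024 × 1024 bytes, which may differ from the service's own check.

```python
def validate_upload(files, max_files=15, max_file_mb=50):
    """Return a list of problems for a batch of (filename, size_bytes) pairs.

    An empty list means the batch passes these two checks.
    """
    errors = []
    if not files:
        errors.append("no files provided")
    if len(files) > max_files:
        errors.append(f"too many files: {len(files)} > {max_files}")
    # Assumption: MB is interpreted as mebibytes.
    limit_bytes = max_file_mb * 1024 * 1024
    for filename, size_bytes in files:
        if size_bytes > limit_bytes:
            errors.append(f"{filename}: larger than {max_file_mb} MB")
    return errors
```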

Best Practices

Use environment variables for secrets
Regular security updates
Monitor logs for anomalies
Implement proper access controls

🚀 Performance

Optimization Features

Lazy loading of ML models
Efficient file processing
Background task processing
Memory management
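
Lazy loading means a model is constructed on first use and then reused. One common way to get that behavior in Python is `functools.lru_cache`, sketched below with a placeholder loader so the example runs without ML dependencies. The function names are hypothetical, and this is not necessarily how the service implements it.

```python
from functools import lru_cache


def _load_model(name):
    """Placeholder for an expensive model load.

    In a real deployment this would be swapped for something like
    sentence_transformers.SentenceTransformer(name).
    """
    return {"name": name}


@lru_cache(maxsize=None)
def get_embedder(name="sentence-transformers/all-MiniLM-L6-v2"):
    """Load the embedding model on first call, then return the cached one."""
    return _load_model(name)
```

Nothing is loaded at import time, so the application can start and answer health checks before the first upload triggers the model load.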

Scaling

Horizontal scaling support
Load balancing ready
Resource optimization
Performance monitoring

📚 Integration

Main System Integration

The ingestion pipeline is designed to work seamlessly with the main system:

Same API endpoints
Same data structures
Same processing pipeline
Same storage format

Load Balancer Integration

Automatic request routing
Health check integration
Failover support
Performance monitoring

🐛 Troubleshooting

Common Issues

MongoDB Connection: Verify MONGO_URI is correct
Port Conflicts: Ensure port 7860 is available
Model Loading: Check HuggingFace cache permissions
File Processing: Verify file format support
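
The first two checks above can be automated before starting the service. The sketch below only confirms that `MONGO_URI` is set and that the port can be bound locally. It does not verify that the URI is valid or that the cluster is reachable. Function names are hypothetical.

```python
import os
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if the port can be bound on this host right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def preflight(env=None, port=7860):
    """Return a list of startup problems; an empty list means none found."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("MONGO_URI"):
        problems.append("MONGO_URI is not set")
    if not port_is_free(port):
        problems.append(f"port {port} is already in use")
    return problems
```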

📈 Future Enhancements

Planned Features

Support for additional file formats
Advanced chunking strategies
Performance optimizations
Enhanced monitoring

Scalability

Kubernetes deployment
Auto-scaling support
Load balancing improvements
Resource optimization