# File Discovery Agent
**Agent 1** in the Agentic Business Digitization Pipeline
## Overview
The File Discovery Agent securely extracts ZIP files and classifies every contained file by type. It runs security checks that guard against path traversal attacks and zip bombs, and it handles corrupted files gracefully.
## Features
- ✅ Secure ZIP extraction with safety checks
- ✅ Multi-strategy file type classification
- ✅ Path traversal prevention
- ✅ ZIP bomb detection (compression ratio check)
- ✅ File size and count limits
- ✅ Directory structure preservation
- ✅ Comprehensive error handling
- ✅ Detailed metadata generation
## Security Features
| Check | Description | Limit |
|-------|-------------|-------|
| **File Size** | Maximum ZIP file size | 500MB (configurable) |
| **File Count** | Maximum files per ZIP | 100 (configurable) |
| **Compression Ratio** | Zip bomb detection | 1000:1 max |
| **Path Traversal** | Block `..` patterns | Always blocked |
| **Magic Numbers** | Validate file content | Auto-detected |
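
The checks in the table can be sketched with the standard-library `zipfile` module. This is a minimal illustration, not the agent's actual code: the constant names and the helpers `is_unsafe_member` and `check_zip` are hypothetical, and the limits simply mirror the defaults listed above.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import List

# Hypothetical constants mirroring the documented defaults.
MAX_ZIP_BYTES = 500 * 1024 * 1024  # 500MB
MAX_FILES = 100
MAX_RATIO = 1000  # uncompressed:compressed


def is_unsafe_member(name: str) -> bool:
    """Return True for absolute paths or paths with a '..' component."""
    path = PurePosixPath(name.replace("\\", "/"))
    return path.is_absolute() or ".." in path.parts


def check_zip(zf: zipfile.ZipFile, zip_size: int) -> List[str]:
    """Collect violations of the size, count, path, and ratio checks."""
    problems = []
    if zip_size > MAX_ZIP_BYTES:
        problems.append("zip too large")
    members = [info for info in zf.infolist() if not info.is_dir()]
    if len(members) > MAX_FILES:
        problems.append("too many files")
    for info in members:
        if is_unsafe_member(info.filename):
            problems.append(f"path traversal: {info.filename}")
        # compress_size is 0 for empty members, so guard the division.
        if info.compress_size > 0 and info.file_size / info.compress_size > MAX_RATIO:
            problems.append(f"suspicious compression ratio: {info.filename}")
    return problems


# Demo: an in-memory archive with one safe and one hostile entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("docs/ok.txt", "hello")
    zf.writestr("../evil.txt", "x")
with zipfile.ZipFile(buffer) as zf:
    problems = check_zip(zf, zip_size=len(buffer.getvalue()))
```

Checking path components rather than searching for the substring `..` avoids rejecting harmless names such as `report..final.pdf`.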
## Usage
### Basic Usage
```python
from backend.agents.file_discovery import FileDiscoveryAgent, FileDiscoveryInput
from backend.utils.storage_manager import StorageManager

# Initialize
storage = StorageManager(storage_base="./storage")
agent = FileDiscoveryAgent(storage_manager=storage)

# Create input
input_data = FileDiscoveryInput(
    zip_file_path="/path/to/upload.zip",
    job_id="job_20240315_abc123",
)

# Run discovery
output = agent.discover(input_data)

# Check results
if output.success:
    print(f"Discovered {output.total_files} files")
    print(f"Documents: {len(output.documents)}")
    print(f"Images: {len(output.images)}")
    print(f"Extraction dir: {output.extraction_dir}")
else:
    print(f"Errors: {output.errors}")
```
### Input Schema
```python
FileDiscoveryInput(
    zip_file_path: str,              # Absolute path to ZIP file
    job_id: str,                     # Unique job identifier
    max_file_size: int = 524288000,  # Optional: 500MB default
    max_files: int = 100             # Optional: 100 files default
)
```
### Output Schema
```python
FileDiscoveryOutput(
    job_id: str,
    success: bool,

    # Classified files
    documents: List[DocumentFile],        # PDF, DOCX, DOC
    spreadsheets: List[SpreadsheetFile],  # XLSX, XLS, CSV
    images: List[ImageFile],              # JPG, PNG, GIF, WEBP
    videos: List[VideoFile],              # MP4, AVI, MOV, MKV
    unknown: List[UnknownFile],           # Unsupported types

    # Structure
    directory_tree: DirectoryNode,        # Folder hierarchy

    # Metadata
    total_files: int,
    extraction_dir: str,
    processing_time: float,
    errors: List[str],

    # Summary
    summary: dict
)
```
## File Type Classification
The agent uses a **3-strategy approach**:
1. **MIME Type Detection** (python-magic if available)
2. **Extension-based** classification
3. **Magic Number** validation
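
One way the three strategies might combine is sketched below. It is an illustration under stated assumptions, not the contents of `file_classifier.py`: the function `classify`, the lookup tables, and the rule that a signature mismatch demotes a file to `unknown` are all hypothetical. The MIME type is passed in as an optional argument so the sketch runs without python-magic. A caller with python-magic installed could obtain it from `magic.from_buffer(head, mime=True)`.

```python
from pathlib import PurePath
from typing import Optional

# Extension table taken from the supported types in this README.
EXTENSION_CATEGORIES = {
    ".pdf": "documents", ".doc": "documents", ".docx": "documents",
    ".xls": "spreadsheets", ".xlsx": "spreadsheets", ".csv": "spreadsheets",
    ".jpg": "images", ".jpeg": "images", ".png": "images",
    ".gif": "images", ".webp": "images",
    ".mp4": "videos", ".avi": "videos", ".mov": "videos", ".mkv": "videos",
}

# Leading bytes ("magic numbers") for a few formats; deliberately incomplete.
SIGNATURES = {
    ".pdf": (b"%PDF-",),
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
    ".gif": (b"GIF87a", b"GIF89a"),
    ".docx": (b"PK\x03\x04",),  # DOCX and XLSX are ZIP containers
    ".xlsx": (b"PK\x03\x04",),
}


def category_from_mime(mime: Optional[str]) -> Optional[str]:
    """Strategy 1: map a MIME type to a category, if one was supplied."""
    if not mime:
        return None
    if mime.startswith("image/"):
        return "images"
    if mime.startswith("video/"):
        return "videos"
    if mime == "application/pdf":
        return "documents"
    if mime == "text/csv":
        return "spreadsheets"
    return None


def classify(filename: str, head: bytes, mime: Optional[str] = None) -> str:
    """Return a category name for a file given its name and first bytes."""
    ext = PurePath(filename).suffix.lower()
    # Strategy 2: fall back to the extension when the MIME type is missing.
    category = category_from_mime(mime) or EXTENSION_CATEGORIES.get(ext, "unknown")
    # Strategy 3: if a signature is known for this extension, require it.
    expected = SIGNATURES.get(ext)
    if category != "unknown" and expected and not head.startswith(expected):
        return "unknown"  # assumption: a mismatch is treated as unsupported
    return category
```

For example, `classify("report.pdf", b"%PDF-1.7\n")` yields `"documents"`, while a file named `photo.png` that starts with PDF bytes falls through to `"unknown"`.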
### Supported Types
| Category | Extensions |
|----------|-----------|
| **Documents** | .pdf, .doc, .docx |
| **Spreadsheets** | .xls, .xlsx, .csv |
| **Images** | .jpg, .jpeg, .png, .gif, .webp |
| **Videos** | .mp4, .avi, .mov, .mkv |
## Directory Structure
After extraction, files are organized as:
```
storage/
└── extracted/
    └── {job_id}/
        ├── documents/
        ├── spreadsheets/
        ├── images/
        ├── videos/
        ├── unknown/
        └── discovery_metadata.json
```
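
As a rough sketch, a layout like this could be created with `pathlib`. The helper `prepare_job_dirs` and the `CATEGORIES` tuple are hypothetical names for illustration and are not taken from `storage_manager.py`.

```python
import tempfile
from pathlib import Path

# Category folders as listed in the layout above.
CATEGORIES = ("documents", "spreadsheets", "images", "videos", "unknown")


def prepare_job_dirs(storage_base: Path, job_id: str) -> Path:
    """Create <storage_base>/extracted/<job_id>/<category>/ for each category."""
    job_dir = Path(storage_base) / "extracted" / job_id
    for category in CATEGORIES:
        # exist_ok makes the call safe to repeat for the same job.
        (job_dir / category).mkdir(parents=True, exist_ok=True)
    return job_dir


# Demo in a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    job_dir = prepare_job_dirs(Path(tmp), "job_20240315_abc123")
    created = sorted(p.name for p in job_dir.iterdir())
```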
## Error Handling
| Error Type | Behavior |
|-----------|----------|
| Invalid ZIP | `success=False`, error in list |
| Path traversal | File skipped, warning logged |
| Corrupted file | File skipped, error logged |
| Unsupported type | Added to `unknown` list |
| Size exceeded | `success=False`, processing stopped |
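
The split between fatal and per-file errors can be illustrated as follows. This is a simplified sketch, not the agent's implementation: `DiscoveryResult` is a hypothetical stand-in for `FileDiscoveryOutput`, and reading each member fully into memory is done only to trigger the CRC check.

```python
import io
import logging
import zipfile
import zlib
from dataclasses import dataclass, field
from typing import List

logger = logging.getLogger("file_discovery_sketch")


@dataclass
class DiscoveryResult:
    """Hypothetical, reduced stand-in for FileDiscoveryOutput."""
    success: bool = True
    files: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)


def discover(source) -> DiscoveryResult:
    """Open a ZIP (path or file object) and list the members that read cleanly."""
    try:
        archive = zipfile.ZipFile(source)
    except (zipfile.BadZipFile, FileNotFoundError) as exc:
        # Invalid ZIP: fail the whole job and report the error.
        return DiscoveryResult(success=False, errors=[f"Invalid ZIP: {exc}"])

    result = DiscoveryResult()
    with archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            try:
                archive.read(info)  # raises BadZipFile on a CRC mismatch
            except (zipfile.BadZipFile, zlib.error) as exc:
                # Corrupted member: skip it, log it, keep going.
                logger.error("Skipping corrupted file %s: %s", info.filename, exc)
                continue
            result.files.append(info.filename)
    return result


# Demo: one valid in-memory archive and one blob that is not a ZIP.
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("a.txt", "hello")
ok_result = discover(good)
bad_result = discover(io.BytesIO(b"this is not a zip archive"))
```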
## Testing
Run tests with pytest:
```bash
# Run all tests
pytest tests/agents/test_file_discovery.py -v
# Run with coverage
pytest tests/agents/test_file_discovery.py --cov=backend.agents.file_discovery
# Run specific test
pytest tests/agents/test_file_discovery.py::TestFileDiscoveryAgent::test_discover_valid_zip -v
```
### Test Coverage
- ✅ Valid ZIP with mixed files
- ✅ Nested folder structures
- ✅ Non-existent files
- ✅ File size exceeded
- ✅ File count exceeded
- ✅ Path traversal attempts
- ✅ Corrupted ZIP files
- ✅ File type classification
- ✅ Directory tree building
- ✅ Metadata persistence
## Configuration
Environment variables (see `.env.example`):
```bash
# File limits
MAX_FILE_SIZE=524288000 # 500MB
MAX_FILES_PER_ZIP=100
# Storage paths
STORAGE_BASE=./storage
EXTRACTED_DIR=extracted
```
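
A minimal sketch of reading these variables with their documented defaults is shown below. `DiscoverySettings` and `load_settings` are hypothetical names. How the project actually loads its configuration is not specified here.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class DiscoverySettings:
    """Hypothetical container for the variables listed above."""
    max_file_size: int
    max_files: int
    extraction_root: Path


def load_settings(env: Optional[Mapping[str, str]] = None) -> DiscoverySettings:
    """Read limits and paths from the environment, using the README defaults."""
    env = os.environ if env is None else env
    return DiscoverySettings(
        max_file_size=int(env.get("MAX_FILE_SIZE", 524288000)),  # 500MB
        max_files=int(env.get("MAX_FILES_PER_ZIP", 100)),
        extraction_root=Path(env.get("STORAGE_BASE", "./storage"))
        / env.get("EXTRACTED_DIR", "extracted"),
    )


# Demo: defaults only, then one override.
defaults = load_settings({})
custom = load_settings({"MAX_FILES_PER_ZIP": "25"})
```

Accepting the mapping as a parameter keeps the function easy to test without touching the real process environment.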
## Performance
Typical performance for business document folders:
| Files | Total Size | Processing Time |
|-------|-----------|-----------------|
| 10 files | 5MB | ~0.5s |
| 50 files | 25MB | ~2s |
| 100 files | 50MB | ~4s |
## Next Steps
After file discovery completes successfully:
1. **Document Parsing Agent** processes PDFs and DOCX files
2. **Table Extraction Agent** finds and structures tables
3. **Media Extraction Agent** extracts embedded images
4. **Vision Agent** analyzes images with Qwen3.5:0.8B
## Files
- `backend/agents/file_discovery.py` - Main agent implementation
- `backend/utils/file_classifier.py` - File type classification
- `backend/utils/storage_manager.py` - Storage organization
- `tests/agents/test_file_discovery.py` - Unit tests