# File Discovery Agent

**Agent 1** in the Agentic Business Digitization Pipeline

## Overview

The File Discovery Agent securely extracts ZIP files and classifies every contained file by type. It runs security checks that guard against path traversal attacks and zip bombs, and it handles corrupted files gracefully.

## Features

- ✅ Secure ZIP extraction with safety checks
- ✅ Multi-strategy file type classification
- ✅ Path traversal prevention
- ✅ ZIP bomb detection (compression ratio check)
- ✅ File size and count limits
- ✅ Directory structure preservation
- ✅ Comprehensive error handling
- ✅ Detailed metadata generation

## Security Features

| Check | Description | Limit |
|-------|-------------|-------|
| **File Size** | Maximum ZIP file size | 500MB (configurable) |
| **File Count** | Maximum files per ZIP | 100 (configurable) |
| **Compression Ratio** | Zip bomb detection | 1000:1 max |
| **Path Traversal** | Block `..` patterns | Always blocked |
| **Magic Numbers** | Validate file content | Auto-detected |
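
The file count, compression ratio, and path traversal checks in the table can be sketched with the standard-library `zipfile` module. This is a minimal illustration, not the agent's actual code: the names `is_unsafe_member` and `check_zip` are hypothetical, and the ratio check trusts the sizes declared in the archive headers.

```python
import zipfile
from pathlib import PurePosixPath
from typing import List

MAX_FILES = 100    # mirrors the table's default file-count limit
MAX_RATIO = 1000   # mirrors the table's 1000:1 compression ratio limit


def is_unsafe_member(name: str) -> bool:
    """Return True if a member name is absolute or contains a '..' segment."""
    # Treat backslashes as separators so Windows-style names are caught too.
    path = PurePosixPath(name.replace("\\", "/"))
    return path.is_absolute() or ".." in path.parts


def check_zip(zip_file, max_files: int = MAX_FILES, max_ratio: int = MAX_RATIO) -> List[str]:
    """Inspect an archive (path or file-like object) without extracting it.

    Returns a list of problem descriptions; an empty list means that none
    of the three checks fired.
    """
    problems = []
    with zipfile.ZipFile(zip_file) as zf:
        members = [info for info in zf.infolist() if not info.is_dir()]
        if len(members) > max_files:
            problems.append(f"too many files: {len(members)} > {max_files}")
        for info in members:
            if is_unsafe_member(info.filename):
                problems.append(f"path traversal: {info.filename}")
            # Integer comparison avoids dividing by a zero compressed size.
            if info.file_size > max_ratio * max(info.compress_size, 1):
                problems.append(f"suspicious compression ratio: {info.filename}")
    return problems
```

Running checks like these against the archive's central directory before extraction means a rejected upload never writes anything to disk.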

## Usage

### Basic Usage

```python
from backend.agents.file_discovery import FileDiscoveryAgent, FileDiscoveryInput
from backend.utils.storage_manager import StorageManager

# Initialize
storage = StorageManager(storage_base="./storage")
agent = FileDiscoveryAgent(storage_manager=storage)

# Create input
input_data = FileDiscoveryInput(
    zip_file_path="/path/to/upload.zip",
    job_id="job_20240315_abc123"
)

# Run discovery
output = agent.discover(input_data)

# Check results
if output.success:
    print(f"Discovered {output.total_files} files")
    print(f"Documents: {len(output.documents)}")
    print(f"Images: {len(output.images)}")
    print(f"Extraction dir: {output.extraction_dir}")
else:
    print(f"Errors: {output.errors}")
```

### Input Schema

```python
FileDiscoveryInput(
    zip_file_path: str,        # Absolute path to ZIP file
    job_id: str,                # Unique job identifier
    max_file_size: int = 524288000,  # Optional: 500MB default
    max_files: int = 100         # Optional: 100 files default
)
```

### Output Schema

```python
FileDiscoveryOutput(
    job_id: str,
    success: bool,
    
    # Classified files
    documents: List[DocumentFile],      # PDFs, DOCX, DOC
    spreadsheets: List[SpreadsheetFile], # XLSX, XLS, CSV
    images: List[ImageFile],            # JPG, PNG, GIF, WEBP
    videos: List[VideoFile],            # MP4, AVI, MOV, MKV
    unknown: List[UnknownFile],         # Unsupported types
    
    # Structure
    directory_tree: DirectoryNode,      # Folder hierarchy
    
    # Metadata
    total_files: int,
    extraction_dir: str,
    processing_time: float,
    errors: List[str],
    
    # Summary
    summary: dict
)
```

## File Type Classification

The agent uses a **3-strategy approach**:

1. **MIME Type Detection** (python-magic if available)
2. **Extension-based** classification
3. **Magic Number** validation

### Supported Types

| Category | Extensions |
|----------|-----------|
| **Documents** | .pdf, .doc, .docx |
| **Spreadsheets** | .xls, .xlsx, .csv |
| **Images** | .jpg, .jpeg, .png, .gif, .webp |
| **Videos** | .mp4, .avi, .mov, .mkv |
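
As a rough sketch of how the extension-based and magic-number strategies could work together, the example below maps extensions to the categories in the table and then checks the file's leading bytes for a few well-known formats. The `classify` function and both lookup tables are hypothetical, the MIME detection step is omitted, and the choice to downgrade a mismatched file to `unknown` is an assumption rather than documented agent behavior.

```python
from pathlib import Path

# Extension-to-category mapping taken from the supported types table.
EXTENSION_CATEGORIES = {
    ".pdf": "documents", ".doc": "documents", ".docx": "documents",
    ".xls": "spreadsheets", ".xlsx": "spreadsheets", ".csv": "spreadsheets",
    ".jpg": "images", ".jpeg": "images", ".png": "images",
    ".gif": "images", ".webp": "images",
    ".mp4": "videos", ".avi": "videos", ".mov": "videos", ".mkv": "videos",
}

# Leading bytes ("magic numbers") for a handful of formats. Extensions
# without an entry here are classified by extension alone.
MAGIC_PREFIXES = {
    ".pdf": (b"%PDF-",),
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
    ".gif": (b"GIF87a", b"GIF89a"),
    ".docx": (b"PK\x03\x04",),  # DOCX and XLSX are ZIP containers
    ".xlsx": (b"PK\x03\x04",),
}


def classify(filename: str, header: bytes) -> str:
    """Classify a file by extension, then validate its first bytes."""
    ext = Path(filename).suffix.lower()
    category = EXTENSION_CATEGORIES.get(ext, "unknown")
    prefixes = MAGIC_PREFIXES.get(ext)
    # Assumption: content that contradicts the extension is not trusted.
    if category != "unknown" and prefixes and not header.startswith(prefixes):
        return "unknown"
    return category
```

A caller would read the first few bytes of each extracted file, for example `classify(path.name, path.open("rb").read(16))`, and route the file to the matching output list.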

## Directory Structure

After extraction, files are organized as:

```
storage/
└── extracted/
    └── {job_id}/
        ├── documents/
        ├── spreadsheets/
        ├── images/
        ├── videos/
        ├── unknown/
        └── discovery_metadata.json
```

## Error Handling

| Error Type | Behavior |
|-----------|----------|
| Invalid ZIP | `success=False`, error in list |
| Path traversal | File skipped, warning logged |
| Corrupted file | File skipped, error logged |
| Unsupported type | Added to `unknown` list |
| Size exceeded | `success=False`, processing stopped |

## Testing

Run tests with pytest:

```bash
# Run all tests
pytest tests/agents/test_file_discovery.py -v

# Run with coverage
pytest tests/agents/test_file_discovery.py --cov=backend.agents.file_discovery

# Run specific test
pytest tests/agents/test_file_discovery.py::TestFileDiscoveryAgent::test_discover_valid_zip -v
```

### Test Coverage

- ✅ Valid ZIP with mixed files
- ✅ Nested folder structures
- ✅ Non-existent files
- ✅ File size exceeded
- ✅ File count exceeded
- ✅ Path traversal attempts
- ✅ Corrupted ZIP files
- ✅ File type classification
- ✅ Directory tree building
- ✅ Metadata persistence

## Configuration

Environment variables (see `.env.example`):

```bash
# File limits
MAX_FILE_SIZE=524288000    # 500MB
MAX_FILES_PER_ZIP=100

# Storage paths
STORAGE_BASE=./storage
EXTRACTED_DIR=extracted
```
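
One way these variables might be read at startup is sketched below. The `DiscoverySettings` dataclass and `load_settings` function are hypothetical names used for illustration; the defaults simply repeat the values shown above, and the project's real configuration loader may differ.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DiscoverySettings:
    max_file_size: int    # bytes
    max_files: int
    extracted_root: Path  # STORAGE_BASE joined with EXTRACTED_DIR


def load_settings(env=None) -> DiscoverySettings:
    """Build settings from a mapping of variables (os.environ by default)."""
    env = os.environ if env is None else env
    storage_base = Path(env.get("STORAGE_BASE", "./storage"))
    return DiscoverySettings(
        max_file_size=int(env.get("MAX_FILE_SIZE", 524288000)),
        max_files=int(env.get("MAX_FILES_PER_ZIP", 100)),
        extracted_root=storage_base / env.get("EXTRACTED_DIR", "extracted"),
    )
```

Accepting the environment as an optional mapping keeps the loader easy to test, since a plain dictionary can stand in for `os.environ`.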

## Performance

Typical performance for business document folders:

| Files | Total Size | Processing Time |
|-------|-----------|-----------------|
| 10 files | 5MB | ~0.5s |
| 50 files | 25MB | ~2s |
| 100 files | 50MB | ~4s |

## Next Steps

After file discovery completes successfully:

1. **Document Parsing Agent** processes PDFs and DOCX files
2. **Table Extraction Agent** finds and structures tables
3. **Media Extraction Agent** extracts embedded images
4. **Vision Agent** analyzes images with Qwen3.5:0.8B

## Files

- `backend/agents/file_discovery.py` - Main agent implementation
- `backend/utils/file_classifier.py` - File type classification
- `backend/utils/storage_manager.py` - Storage organization
- `tests/agents/test_file_discovery.py` - Unit tests