# Project Structure

```
pdf-redaction-api/
│
├── main.py                      # FastAPI application entry point
├── Dockerfile                   # Docker configuration for deployment
├── requirements.txt             # Python dependencies
├── README.md                    # Project documentation (for HuggingFace)
├── DEPLOYMENT.md                # Deployment guide
├── .gitignore                   # Git ignore rules
├── .dockerignore                # Docker ignore rules
│
├── app/                         # Application modules
│   ├── __init__.py              # Package initialization
│   └── redaction.py             # Core redaction logic (PDFRedactor class)
│
├── uploads/                     # Temporary upload directory
│   └── .gitkeep                 # Keep directory in git
│
├── outputs/                     # Redacted PDF output directory
│   └── .gitkeep                 # Keep directory in git
│
├── tests/                       # Test suite
│   ├── __init__.py
│   └── test_api.py              # API endpoint tests
│
└── client_example.py            # Example client for API usage
```

## File Descriptions

### Core Files

#### `main.py`
FastAPI application with endpoints:
- `POST /redact` - Upload and redact PDF
- `GET /download/{job_id}` - Download redacted PDF
- `GET /health` - Health check
- `GET /stats` - API statistics
- `DELETE /cleanup/{job_id}` - Manual cleanup

#### `app/redaction.py`
Core redaction logic:
- `PDFRedactor` class
- OCR processing with pytesseract
- NER using HuggingFace transformers
- Entity-to-box mapping
- PDF redaction with coordinate scaling
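
The coordinate scaling can be sketched as follows. This is a minimal illustration, not the code in `app/redaction.py`: the function name `image_box_to_pdf_rect` is hypothetical, and the `(left, top, width, height)` box layout is an assumption that mirrors the fields pytesseract's `image_to_data` reports. It also assumes an unrotated page with the origin at the bottom-left (default PDF user space), whereas image coordinates start at the top-left.

```python
def image_box_to_pdf_rect(box, image_size, page_size):
    """Convert an OCR box in image pixels to a PDF rectangle in points.

    box:        (left, top, width, height) in pixels, origin at top-left
    image_size: (width_px, height_px) of the rendered page image
    page_size:  (width_pt, height_pt) of the PDF page
    Returns (x0, y0, x1, y1) in points, origin at bottom-left.
    """
    left, top, width, height = box
    img_w, img_h = image_size
    page_w, page_h = page_size

    scale_x = page_w / img_w
    scale_y = page_h / img_h

    x0 = left * scale_x
    x1 = (left + width) * scale_x
    # Flip the vertical axis: image y grows downward, PDF y grows upward.
    y1 = page_h - top * scale_y
    y0 = page_h - (top + height) * scale_y
    return (x0, y0, x1, y1)


# A US Letter page (612 x 792 pt) rendered at 300 DPI is 2550 x 3300 px.
rect = image_box_to_pdf_rect((300, 150, 600, 75), (2550, 3300), (612, 792))
print(rect)
```

Scaling by the ratio of page size to image size, rather than by a fixed DPI constant, keeps the result correct if the rendering DPI changes.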

### Configuration Files

#### `requirements.txt`
Python dependencies:
- FastAPI & Uvicorn (API framework)
- Transformers & Torch (NER model)
- PyPDF (PDF manipulation)
- pdf2image (PDF to image conversion)
- pytesseract (OCR)
- Pillow (Image processing)

#### `Dockerfile`
Build steps:
1. Install system dependencies (tesseract, poppler)
2. Install Python dependencies
3. Copy application code
4. Configure for port 7860 (HuggingFace default)

### Documentation

#### `README.md`
HuggingFace Space documentation:
- Features overview
- API endpoint documentation
- Usage examples (cURL, Python)
- Response format
- Local development setup

#### `DEPLOYMENT.md`
Step-by-step deployment guide:
- HuggingFace Spaces setup
- Git workflow
- Configuration options
- Security considerations
- Troubleshooting
- Cost estimation

### Testing & Examples

#### `tests/test_api.py`
Unit tests for API endpoints:
- Health check tests
- Upload validation tests
- Error handling tests

#### `client_example.py`
Example client implementation:
- Upload PDF
- Download redacted file
- Health check
- Statistics

## Data Flow

```
┌─────────────────────────────────────────────────────────┐
│ 1. Client uploads PDF                                   │
│    POST /redact with file                               │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 2. FastAPI (main.py)                                    │
│    - Validates file                                     │
│    - Generates job_id                                   │
│    - Saves to uploads/                                  │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 3. PDFRedactor (app/redaction.py)                       │
│    - perform_ocr() → Extract text + boxes               │
│    - run_ner() → Identify entities                      │
│    - map_entities_to_boxes() → Link entities to coords  │
│    - create_redacted_pdf() → Generate output            │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 4. Response                                             │
│    - Return job_id and entity list                      │
│    - Save redacted PDF to outputs/                      │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 5. Client downloads                                     │
│    GET /download/{job_id}                               │
└─────────────────────────────────────────────────────────┘
```
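
The validation and job bookkeeping in steps 2 and 4 can be sketched with the standard library alone. This is a sketch under stated assumptions, not the code in `main.py`: `looks_like_pdf`, `new_job_paths`, and the `{job_id}.pdf` / `{job_id}_redacted.pdf` naming scheme are hypothetical, and the real validation may check more than the file header.

```python
import uuid
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # header marker at the start of a PDF file


def looks_like_pdf(data: bytes) -> bool:
    """Cheap validation: check the file header rather than trusting the filename."""
    return data.startswith(PDF_MAGIC)


def new_job_paths(upload_dir: Path, output_dir: Path) -> tuple[str, Path, Path]:
    """Create a job_id and derive the upload and output paths from it."""
    job_id = uuid.uuid4().hex
    upload_path = upload_dir / f"{job_id}.pdf"
    output_path = output_dir / f"{job_id}_redacted.pdf"  # hypothetical naming
    return job_id, upload_path, output_path


job_id, upload_path, output_path = new_job_paths(Path("uploads"), Path("outputs"))
print(job_id, upload_path, output_path)
```

Deriving both paths from a server-generated `job_id`, instead of the client-supplied filename, keeps user input out of filesystem paths and lets `GET /download/{job_id}` locate the output without extra state.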

## Key Components

### 1. FastAPI Application (`main.py`)

**Design:**
- RESTful API design
- File upload handling
- Background task cleanup
- CORS middleware for web access

**Features:**
- Automatic OpenAPI documentation at `/docs`
- JSON response models with Pydantic
- Error handling with HTTP exceptions
- Request validation

### 2. Redaction Engine (`app/redaction.py`)

**Pipeline Steps:**

1. **OCR Processing**
   - Convert PDF pages to images (pdf2image)
   - Extract text and bounding boxes (pytesseract)
   - Store image dimensions for coordinate scaling

2. **NER Processing**
   - Load HuggingFace model
   - Identify entities in text
   - Return entity types and character positions

3. **Mapping**
   - Create character span index for OCR words
   - Match NER entities to OCR bounding boxes
   - Handle partial word matches

4. **Redaction**
   - Scale OCR image coordinates to PDF points
   - Create black rectangle annotations
   - Write redacted PDF with pypdf
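
The mapping step (step 3) can be illustrated with plain Python. This is a minimal sketch, assuming the OCR words are joined with single spaces to form the text passed to NER; the names `build_span_index` and `boxes_for_entity` and the data layout are hypothetical, not taken from `app/redaction.py`. Testing for overlapping character ranges, rather than exact equality, is what lets an entity match a word it only partially covers.

```python
def build_span_index(words):
    """Join OCR words with spaces and record each word's character span.

    words: list of (text, box) pairs in reading order.
    Returns (full_text, spans) where each span is (start, end, box), end exclusive.
    """
    spans = []
    parts = []
    cursor = 0
    for text, box in words:
        spans.append((cursor, cursor + len(text), box))
        parts.append(text)
        cursor += len(text) + 1  # +1 for the joining space
    return " ".join(parts), spans


def boxes_for_entity(entity_start, entity_end, spans):
    """Return the box of every word whose span overlaps [entity_start, entity_end)."""
    return [
        box
        for start, end, box in spans
        if start < entity_end and entity_start < end
    ]


words = [
    ("Contact", (10, 10, 70, 12)),
    ("John", (90, 10, 40, 12)),
    ("Smith,", (140, 10, 55, 12)),
    ("CEO", (205, 10, 35, 12)),
]
text, spans = build_span_index(words)

# Suppose NER reports a PERSON entity covering "John Smith" (without the comma).
start = text.index("John Smith")
end = start + len("John Smith")
print(boxes_for_entity(start, end, spans))  # boxes for "John" and "Smith,"
```

The entity ends before the comma, yet the whole `Smith,` box is returned. Redacting the full OCR word is the conservative choice when an entity boundary falls inside a word.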

### 3. Docker Container

**Layers:**
- Base: Python 3.10 slim
- System packages: tesseract-ocr, poppler-utils
- Python packages: From requirements.txt
- Application code: Copied last for better caching

**Optimizations:**
- Multi-stage build (not currently used; a possible further optimization)
- Minimal base image
- Cached dependency layers
- .dockerignore to reduce context size

## Environment Variables

Default configuration (can be overridden):

```bash
PYTHONUNBUFFERED=1        # Immediate log output
HF_HOME=/app/cache        # HuggingFace cache directory
```
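
Overriding works the usual way: a value set in the environment takes precedence over the default. The sketch below shows that lookup for `HF_HOME`. The helper name `get_cache_dir` is hypothetical. The HuggingFace libraries read `HF_HOME` themselves, so application code only needs something like this if it wants to know the path.

```python
import os
from pathlib import Path

DEFAULT_HF_HOME = "/app/cache"  # default from the configuration above


def get_cache_dir(env=None) -> Path:
    """Return the model cache directory, honoring an HF_HOME override."""
    env = os.environ if env is None else env
    return Path(env.get("HF_HOME", DEFAULT_HF_HOME))


print(get_cache_dir({}))                             # falls back to the default
print(get_cache_dir({"HF_HOME": "/tmp/hf-cache"}))   # override wins
```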

## Port Configuration

- **Development**: 7860 (configurable in main.py)
- **Production (HF Spaces)**: 7860 (required)

## Directory Permissions

Ensure write permissions for:
- `uploads/` - Temporary PDF storage
- `outputs/` - Redacted PDF storage
- `cache/` - Model cache (created automatically)
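
A startup check along these lines can surface permission problems early instead of on the first upload. This is a sketch, not part of the project: `ensure_writable` is a hypothetical helper, and the example runs against a temporary directory standing in for the application root.

```python
import tempfile
from pathlib import Path


def ensure_writable(directory: Path) -> None:
    """Create the directory if needed and verify the process can write to it."""
    directory.mkdir(parents=True, exist_ok=True)
    # Creating (and automatically deleting) a temp file raises OSError
    # if the process lacks write permission on the directory.
    with tempfile.TemporaryFile(dir=directory):
        pass


base = Path(tempfile.mkdtemp())  # stand-in for the application root
for name in ("uploads", "outputs", "cache"):
    ensure_writable(base / name)
    print(f"{name}/ is writable")
```

Probing with a real write, rather than inspecting permission bits, reflects what the running process can actually do, which matters in containers that run as a non-root user.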

## Adding New Features

### Add New Endpoint

1. Define in `main.py`:
```python
@app.get("/new-endpoint")
async def new_endpoint():
    return {"message": "Hello"}
```

2. Add response model if needed
3. Update README.md documentation
4. Add tests in `tests/test_api.py`

### Add New Redaction Option

1. Modify `PDFRedactor` class in `app/redaction.py`
2. Add parameter to `redact_document()` method
3. Update API endpoint in `main.py`
4. Document in README.md

### Add Authentication

1. Install: `pip install python-jose passlib`
2. Create `app/auth.py` with JWT logic
3. Add middleware to `main.py`
4. Protect endpoints with dependencies

## Best Practices

1. **Logging**: Use `logger` for all important events
2. **Error Handling**: Catch exceptions and return meaningful errors
3. **Validation**: Use Pydantic models for request/response validation
4. **Cleanup**: Always clean up temporary files
5. **Documentation**: Keep README.md and code comments updated
6. **Testing**: Add tests for new features
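
Practices 1, 2 and 4 combine naturally in a `try`/`finally` block. The sketch below is illustrative only: `process_job` and the `redact` callable are hypothetical stand-ins, not the project's actual functions.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def process_job(upload_path: Path, output_path: Path, redact) -> Path:
    """Run redact(upload_path, output_path) and always remove the upload."""
    try:
        logger.info("Redacting %s", upload_path.name)
        redact(upload_path, output_path)
        return output_path
    except Exception:
        # Log with traceback, then let the caller turn this into an HTTP error.
        logger.exception("Redaction failed for %s", upload_path.name)
        raise
    finally:
        # Runs on success and on failure, so uploads never accumulate.
        upload_path.unlink(missing_ok=True)
```

Re-raising after logging leaves the decision about the HTTP status code to the endpoint, while the `finally` clause guarantees the temporary upload is removed either way.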

## Performance Considerations

### Bottlenecks
1. OCR processing (most time-consuming)
2. Model inference (NER)
3. File I/O

### Optimizations
- Lower DPI for faster OCR (trade-off with accuracy)
- Cache loaded models in memory
- Use async file operations
- Implement request queuing for high load
- Consider GPU for NER model

### Scaling
- Horizontal: Multiple container instances
- Vertical: Larger CPU/RAM allocation
- Caching: Redis for temporary results
- Queue: Celery for background processing