| # Project Structure |
|
|
| ``` |
| pdf-redaction-api/ |
| │ |
| ├── main.py              # FastAPI application entry point |
| ├── Dockerfile           # Docker configuration for deployment |
| ├── requirements.txt     # Python dependencies |
| ├── README.md            # Project documentation (for HuggingFace) |
| ├── DEPLOYMENT.md        # Deployment guide |
| ├── .gitignore           # Git ignore rules |
| ├── .dockerignore        # Docker ignore rules |
| │ |
| ├── app/                 # Application modules |
| │   ├── __init__.py      # Package initialization |
| │   └── redaction.py     # Core redaction logic (PDFRedactor class) |
| │ |
| ├── uploads/             # Temporary upload directory |
| │   └── .gitkeep         # Keep directory in git |
| │ |
| ├── outputs/             # Redacted PDF output directory |
| │   └── .gitkeep         # Keep directory in git |
| │ |
| ├── tests/               # Test suite |
| │   ├── __init__.py |
| │   └── test_api.py      # API endpoint tests |
| │ |
| └── client_example.py    # Example client for API usage |
| ``` |
|
|
| ## File Descriptions |
|
|
| ### Core Files |
|
|
| #### `main.py` |
| FastAPI application with endpoints: |
| - `POST /redact` - Upload and redact PDF |
| - `GET /download/{job_id}` - Download redacted PDF |
| - `GET /health` - Health check |
| - `GET /stats` - API statistics |
| - `DELETE /cleanup/{job_id}` - Manual cleanup |
|
|
| #### `app/redaction.py` |
| Core redaction logic: |
| - `PDFRedactor` class |
| - OCR processing with pytesseract |
| - NER using HuggingFace transformers |
| - Entity-to-box mapping |
| - PDF redaction with coordinate scaling |
|
|
| ### Configuration Files |
|
|
| #### `requirements.txt` |
| Python dependencies: |
| - FastAPI & Uvicorn (API framework) |
| - Transformers & Torch (NER model) |
| - pypdf (PDF manipulation) |
| - pdf2image (PDF to image conversion) |
| - pytesseract (OCR) |
| - Pillow (Image processing) |
|
|
| #### `Dockerfile` |
| Build steps (single-stage): |
| 1. Install system dependencies (tesseract, poppler) |
| 2. Install Python dependencies |
| 3. Copy application code |
| 4. Configure for port 7860 (HuggingFace default) |
|
|
| ### Documentation |
|
|
| #### `README.md` |
| HuggingFace Space documentation: |
| - Features overview |
| - API endpoint documentation |
| - Usage examples (cURL, Python) |
| - Response format |
| - Local development setup |
|
|
| #### `DEPLOYMENT.md` |
| Step-by-step deployment guide: |
| - HuggingFace Spaces setup |
| - Git workflow |
| - Configuration options |
| - Security considerations |
| - Troubleshooting |
| - Cost estimation |
|
|
| ### Testing & Examples |
|
|
| #### `tests/test_api.py` |
| Unit tests for API endpoints: |
| - Health check tests |
| - Upload validation tests |
| - Error handling tests |
| |
| #### `client_example.py` |
| Example client implementation: |
| - Upload PDF |
| - Download redacted file |
| - Health check |
| - Statistics |
|
|
| ## Data Flow |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────────┐ |
| │  1. Client uploads PDF                                  │ |
| │     POST /redact with file                              │ |
| └─────────────────────────────────────────────────────────┘ |
|                              ↓ |
| ┌─────────────────────────────────────────────────────────┐ |
| │  2. FastAPI (main.py)                                   │ |
| │     - Validates file                                    │ |
| │     - Generates job_id                                  │ |
| │     - Saves to uploads/                                 │ |
| └─────────────────────────────────────────────────────────┘ |
|                              ↓ |
| ┌─────────────────────────────────────────────────────────┐ |
| │  3. PDFRedactor (app/redaction.py)                      │ |
| │     - perform_ocr()           → Extract text + boxes    │ |
| │     - run_ner()               → Identify entities       │ |
| │     - map_entities_to_boxes() → Link entities to coords │ |
| │     - create_redacted_pdf()   → Generate output         │ |
| └─────────────────────────────────────────────────────────┘ |
|                              ↓ |
| ┌─────────────────────────────────────────────────────────┐ |
| │  4. Response                                            │ |
| │     - Return job_id and entity list                     │ |
| │     - Save redacted PDF to outputs/                     │ |
| └─────────────────────────────────────────────────────────┘ |
|                              ↓ |
| ┌─────────────────────────────────────────────────────────┐ |
| │  5. Client downloads                                    │ |
| │     GET /download/{job_id}                              │ |
| └─────────────────────────────────────────────────────────┘ |
| ``` |
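
Steps 2 and 4 above imply a per-job naming scheme that links an uploaded file to its redacted output. The following is a minimal sketch of one way to do that. The helper name `new_job_paths`, the use of `uuid4`, and the `<job_id>.pdf` / `<job_id>_redacted.pdf` file names are assumptions for illustration and are not taken from `main.py`.

```python
import uuid
from pathlib import Path


def new_job_paths(base_dir="."):
    """Return (job_id, upload_path, output_path) for a new redaction job.

    Hypothetical helper: the naming scheme is an assumption, not the
    project's confirmed layout.
    """
    job_id = uuid.uuid4().hex  # random 32-character hex string
    base = Path(base_dir)
    upload_path = base / "uploads" / f"{job_id}.pdf"
    output_path = base / "outputs" / f"{job_id}_redacted.pdf"
    return job_id, upload_path, output_path


job_id, upload_path, output_path = new_job_paths()
print(job_id, upload_path, output_path)
```

Because both paths are derived from a server-generated ID rather than the client's file name, a scheme like this sidesteps file name collisions between concurrent uploads.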
|
|
| ## Key Components |
|
|
| ### 1. FastAPI Application (`main.py`) |
|
|
| **Request handling:** |
| - RESTful API design |
| - File upload handling |
| - Background task cleanup |
| - CORS middleware for web access |
|
|
| **Features:** |
| - Automatic OpenAPI documentation at `/docs` |
| - JSON response models with Pydantic |
| - Error handling with HTTP exceptions |
| - Request validation |
|
|
| ### 2. Redaction Engine (`app/redaction.py`) |
|
|
| **Pipeline Steps:** |
|
|
| 1. **OCR Processing** |
| - Convert PDF pages to images (pdf2image) |
| - Extract text and bounding boxes (pytesseract) |
| - Store image dimensions for coordinate scaling |
|
|
| 2. **NER Processing** |
| - Load HuggingFace model |
| - Identify entities in text |
| - Return entity types and character positions |
|
|
| 3. **Mapping** |
| - Create character span index for OCR words |
| - Match NER entities to OCR bounding boxes |
| - Handle partial word matches |
|
|
| 4. **Redaction** |
| - Scale OCR image coordinates to PDF points |
| - Create black rectangle annotations |
| - Write redacted PDF with pypdf |
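
The mapping step (step 3 above) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `app/redaction.py`: the function names `build_span_index` and `boxes_for_entity`, the word dictionaries with `"text"` and `"box"` keys, and the single-space joining of OCR words are all hypothetical.

```python
def build_span_index(words):
    """Assign each OCR word a (start, end) character span in the joined text.

    `words` is a list of dicts with "text" and "box" keys (assumed shape).
    Words are joined with single spaces, so spans account for those spaces.
    """
    spans = []
    cursor = 0
    for word in words:
        start = cursor
        end = start + len(word["text"])
        spans.append((start, end, word["box"]))
        cursor = end + 1  # skip the joining space
    text = " ".join(w["text"] for w in words)
    return text, spans


def boxes_for_entity(spans, ent_start, ent_end):
    """Return the boxes of every word overlapping [ent_start, ent_end).

    Testing for overlap rather than containment covers partial word
    matches: if NER tags only part of a word, the whole word box is used.
    """
    return [box for start, end, box in spans
            if start < ent_end and end > ent_start]


words = [
    {"text": "Contact", "box": (10, 10, 80, 30)},
    {"text": "John", "box": (90, 10, 130, 30)},
    {"text": "Smith,", "box": (140, 10, 200, 30)},
]
text, spans = build_span_index(words)
# Suppose NER reports "John Smith" at characters 8..18 (comma excluded).
entity_boxes = boxes_for_entity(spans, 8, 18)
print(text[8:18], entity_boxes)
```

Here the entity ends before the trailing comma, yet the box for `Smith,` is still selected, which is the partial-match behaviour described in the list.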
|
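
The coordinate scaling in step 4 can be illustrated with a small helper. OCR boxes are measured in image pixels with the origin at the top-left, while PDF coordinates are in points (1/72 inch), conventionally with the origin at the bottom-left. The sketch below assumes that convention and a `(left, top, right, bottom)` box layout; the function name `image_box_to_pdf_rect` is hypothetical and the project's actual implementation may differ.

```python
def image_box_to_pdf_rect(box, image_size, page_size):
    """Convert a pixel box (left, top, right, bottom) to a PDF rectangle.

    Returns (x0, y0, x1, y1) in PDF points, assuming a bottom-left origin.
    """
    left, top, right, bottom = box
    img_w, img_h = image_size
    page_w, page_h = page_size
    sx = page_w / img_w  # points per pixel, horizontal
    sy = page_h / img_h  # points per pixel, vertical
    x0 = left * sx
    x1 = right * sx
    # Flip the y axis: image y grows downward, PDF y grows upward.
    y0 = page_h - bottom * sy
    y1 = page_h - top * sy
    return (x0, y0, x1, y1)


# A US Letter page (612 x 792 pt) rendered at 300 DPI is 2550 x 3300 px.
rect = image_box_to_pdf_rect((300, 150, 600, 200), (2550, 3300), (612, 792))
print(rect)
```

With these numbers the scale factor is 72/300 = 0.24 points per pixel, so the box lands at roughly (72, 744, 144, 756) in PDF points.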
|
| ### 3. Docker Container |
|
|
| **Layers:** |
| - Base: Python 3.10 slim |
| - System packages: tesseract-ocr, poppler-utils |
| - Python packages: From requirements.txt |
| - Application code: Copied last for better caching |
|
|
| **Optimizations:** |
| - Multi-stage build (not currently used; a possible further optimization) |
| - Minimal base image |
| - Cached dependency layers |
| - .dockerignore to reduce context size |
|
|
| ## Environment Variables |
|
|
| Default configuration (can be overridden): |
|
|
| ```bash |
| PYTHONUNBUFFERED=1 # Immediate log output |
| HF_HOME=/app/cache # HuggingFace cache directory |
| ``` |
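
As a sketch of how application code might resolve these values, the helper below returns each variable's effective value, falling back to the defaults listed above when it is unset. The name `effective_config` is hypothetical and nothing here is taken from `main.py`.

```python
import os

# Defaults as documented above.
DEFAULTS = {
    "PYTHONUNBUFFERED": "1",
    "HF_HOME": "/app/cache",
}


def effective_config(environ=None):
    """Return each setting's value, using the documented default if unset."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


config = effective_config({"HF_HOME": "/tmp/hf-cache"})
print(config)
```

Variables like these are generally read when the process starts, so overrides are best supplied at launch time, for example with `docker run -e HF_HOME=/data/cache ...`.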
|
|
| ## Port Configuration |
|
|
| - **Development**: 7860 (configurable in main.py) |
| - **Production (HF Spaces)**: 7860 (required) |
|
|
| ## Directory Permissions |
|
|
| Ensure write permissions for: |
| - `uploads/` - Temporary PDF storage |
| - `outputs/` - Redacted PDF storage |
| - `cache/` - Model cache (created automatically) |
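
A startup check along the following lines could catch permission problems before the first request arrives. It is a sketch only: the helper name `ensure_writable_dirs` is hypothetical, and the demo runs in a temporary directory rather than the real project root.

```python
import os
import tempfile
from pathlib import Path

REQUIRED_DIRS = ("uploads", "outputs", "cache")


def ensure_writable_dirs(base_dir=".", names=REQUIRED_DIRS):
    """Create each directory if missing; return those that are not writable."""
    not_writable = []
    for name in names:
        path = Path(base_dir) / name
        path.mkdir(parents=True, exist_ok=True)
        if not os.access(path, os.W_OK):
            not_writable.append(str(path))
    return not_writable


# Demo in a throwaway directory so the example has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    problems = ensure_writable_dirs(tmp)
    created = sorted(p.name for p in Path(tmp).iterdir())
print(problems, created)
```

An empty list means all three directories exist and are writable by the current user.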
|
|
| ## Adding New Features |
|
|
| ### Add New Endpoint |
|
|
| 1. Define in `main.py`: |
| ```python |
| # `app` is the FastAPI instance already created in main.py |
| @app.get("/new-endpoint") |
| async def new_endpoint(): |
|     return {"message": "Hello"} |
| ``` |
|
|
| 2. Add response model if needed |
| 3. Update README.md documentation |
| 4. Add tests in `tests/test_api.py` |
|
|
| ### Add New Redaction Option |
|
|
| 1. Modify `PDFRedactor` class in `app/redaction.py` |
| 2. Add parameter to `redact_document()` method |
| 3. Update API endpoint in `main.py` |
| 4. Document in README.md |
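
As an example of what such an option might look like, the sketch below filters detected entities by type before redaction. Everything in it is illustrative: the `entity_types` parameter, the `filter_entities` helper, and the `"entity_type"` key on each entity dictionary are assumptions, not part of the existing `PDFRedactor` interface.

```python
def filter_entities(entities, entity_types=None):
    """Keep only entities whose type is listed in `entity_types`.

    Passing None keeps every entity, which preserves the default behaviour
    for callers that do not use the new option.
    """
    if entity_types is None:
        return list(entities)
    wanted = {t.upper() for t in entity_types}
    return [e for e in entities if e["entity_type"].upper() in wanted]


detected = [
    {"entity_type": "PER", "text": "John Smith"},
    {"entity_type": "ORG", "text": "Acme Corp"},
    {"entity_type": "LOC", "text": "Berlin"},
]
only_people = filter_entities(detected, entity_types=["per"])
print(only_people)
```

Making the new parameter optional with a `None` default keeps existing calls to `redact_document()` working unchanged.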
|
|
| ### Add Authentication |
|
|
| 1. Install: `pip install python-jose passlib` |
| 2. Create `app/auth.py` with JWT logic |
| 3. Add middleware to `main.py` |
| 4. Protect endpoints with dependencies |
|
|
| ## Best Practices |
|
|
| 1. **Logging**: Use `logger` for all important events |
| 2. **Error Handling**: Catch exceptions and return meaningful errors |
| 3. **Validation**: Use Pydantic models for request/response validation |
| 4. **Cleanup**: Always clean up temporary files |
| 5. **Documentation**: Keep README.md and code comments updated |
| 6. **Testing**: Add tests for new features |
|
|
| ## Performance Considerations |
|
|
| ### Bottlenecks |
| 1. OCR processing (typically the most time-consuming step) |
| 2. Model inference (NER) |
| 3. File I/O |
|
|
| ### Optimizations |
| - Lower DPI for faster OCR (trade-off with accuracy) |
| - Cache loaded models in memory |
| - Use async file operations |
| - Implement request queuing for high load |
| - Consider GPU for NER model |
|
|
| ### Scaling |
| - Horizontal: Multiple container instances |
| - Vertical: Larger CPU/RAM allocation |
| - Caching: Redis for temporary results |
| - Queue: Celery for background processing |
|
|