# Project Structure

```
pdf-redaction-api/
│
├── main.py                 # FastAPI application entry point
├── Dockerfile              # Docker configuration for deployment
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation (for HuggingFace)
├── DEPLOYMENT.md           # Deployment guide
├── .gitignore              # Git ignore rules
├── .dockerignore           # Docker ignore rules
│
├── app/                    # Application modules
│   ├── __init__.py         # Package initialization
│   └── redaction.py        # Core redaction logic (PDFRedactor class)
│
├── uploads/                # Temporary upload directory
│   └── .gitkeep            # Keep directory in git
│
├── outputs/                # Redacted PDF output directory
│   └── .gitkeep            # Keep directory in git
│
├── tests/                  # Test suite
│   ├── __init__.py
│   └── test_api.py         # API endpoint tests
│
└── client_example.py       # Example client for API usage
```

## File Descriptions

### Core Files

#### `main.py`
FastAPI application with endpoints:
- `POST /redact` - Upload and redact PDF
- `GET /download/{job_id}` - Download redacted PDF
- `GET /health` - Health check
- `GET /stats` - API statistics
- `DELETE /cleanup/{job_id}` - Manual cleanup

#### `app/redaction.py`
Core redaction logic:
- `PDFRedactor` class
- OCR processing with pytesseract
- NER using HuggingFace transformers
- Entity-to-box mapping
- PDF redaction with coordinate scaling

### Configuration Files

#### `requirements.txt`
Python dependencies:
- FastAPI & Uvicorn (API framework)
- Transformers & Torch (NER model)
- PyPDF (PDF manipulation)
- pdf2image (PDF to image conversion)
- pytesseract (OCR)
- Pillow (Image processing)

#### `Dockerfile`
Single-stage build:
1. Install system dependencies (tesseract, poppler)
2. Install Python dependencies
3. Copy application code
4. Configure for port 7860 (HuggingFace default)

### Documentation

#### `README.md`
HuggingFace Space documentation:
- Features overview
- API endpoint documentation
- Usage examples (cURL, Python)
- Response format
- Local development setup

#### `DEPLOYMENT.md`
Step-by-step deployment guide:
- HuggingFace Spaces setup
- Git workflow
- Configuration options
- Security considerations
- Troubleshooting
- Cost estimation

### Testing & Examples

#### `tests/test_api.py`
Unit tests for API endpoints:
- Health check tests
- Upload validation tests
- Error handling tests

#### `client_example.py`
Example client implementation:
- Upload PDF
- Download redacted file
- Health check
- Statistics

## Data Flow

```
┌─────────────────────────────────────────────────────────┐
│  1. Client uploads PDF                                  │
│     POST /redact with file                              │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│  2. FastAPI (main.py)                                   │
│     - Validates file                                    │
│     - Generates job_id                                  │
│     - Saves to uploads/                                 │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│  3. PDFRedactor (app/redaction.py)                      │
│     - perform_ocr() → Extract text + boxes              │
│     - run_ner() → Identify entities                     │
│     - map_entities_to_boxes() → Link entities to coords │
│     - create_redacted_pdf() → Generate output           │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│  4. Response                                            │
│     - Return job_id and entity list                     │
│     - Save redacted PDF to outputs/                     │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│  5. Client downloads                                    │
│     GET /download/{job_id}                              │
└─────────────────────────────────────────────────────────┘
```

## Key Components

### 1. FastAPI Application (`main.py`)

**Endpoints:**
- RESTful API design
- File upload handling
- Background task cleanup
- CORS middleware for web access

**Features:**
- Automatic OpenAPI documentation at `/docs`
- JSON response models with Pydantic
- Error handling with HTTP exceptions
- Request validation

### 2. Redaction Engine (`app/redaction.py`)

**Pipeline Steps:**

1. **OCR Processing**
   - Convert PDF pages to images (pdf2image)
   - Extract text and bounding boxes (pytesseract)
   - Store image dimensions for coordinate scaling

2. **NER Processing**
   - Load HuggingFace model
   - Identify entities in text
   - Return entity types and character positions

3. **Mapping**
   - Create character span index for OCR words
   - Match NER entities to OCR bounding boxes
   - Handle partial word matches

4. **Redaction**
   - Scale OCR image coordinates to PDF points
   - Create black rectangle annotations
   - Write redacted PDF with pypdf

### 3. Docker Container

**Layers:**
- Base: Python 3.10 slim
- System packages: tesseract-ocr, poppler-utils
- Python packages: From requirements.txt
- Application code: Copied last for better caching

**Optimizations:**
- Multi-stage build (not used here, but possible)
- Minimal base image
- Cached dependency layers
- .dockerignore to reduce context size

## Environment Variables

Default configuration (can be overridden):

```bash
PYTHONUNBUFFERED=1      # Immediate log output
HF_HOME=/app/cache      # HuggingFace cache directory
```

## Port Configuration

- **Development**: 7860 (configurable in main.py)
- **Production (HF Spaces)**: 7860 (required)

## Directory Permissions

Ensure write permissions for:
- `uploads/` - Temporary PDF storage
- `outputs/` - Redacted PDF storage
- `cache/` - Model cache (created automatically)

## Adding New Features

### Add New Endpoint

1. Define in `main.py`:
   ```python
   # `app` is the existing FastAPI instance in main.py
   @app.get("/new-endpoint")
   async def new_endpoint():
       return {"message": "Hello"}
   ```
2. Add response model if needed
3. Update README.md documentation
4. Add tests in `tests/test_api.py`

### Add New Redaction Option

1. Modify `PDFRedactor` class in `app/redaction.py`
2. Add parameter to `redact_document()` method
3. Update API endpoint in `main.py`
4. Document in README.md

### Add Authentication

1. Install: `pip install python-jose passlib`
2. Create `app/auth.py` with JWT logic
3. Add middleware to `main.py`
4. Protect endpoints with dependencies

## Best Practices

1. **Logging**: Use `logger` for all important events
2. **Error Handling**: Catch exceptions and return meaningful errors
3. **Validation**: Use Pydantic models for request/response validation
4. **Cleanup**: Always clean up temporary files
5. **Documentation**: Keep README.md and code comments updated
6. **Testing**: Add tests for new features

## Performance Considerations

### Bottlenecks

1. OCR processing (most time-consuming)
2. Model inference (NER)
3. File I/O

### Optimizations

- Lower DPI for faster OCR (trade-off with accuracy)
- Cache loaded models in memory
- Use async file operations
- Implement request queuing for high load
- Consider GPU for NER model

### Scaling

- Horizontal: Multiple container instances
- Vertical: Larger CPU/RAM allocation
- Caching: Redis for temporary results
- Queue: Celery for background processing
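
Lowering the OCR DPI changes the pixel dimensions of each page image, so the redaction step has to derive its scale factors from the stored image size rather than from a fixed DPI. The sketch below shows one way to convert an OCR box (pixels, origin at the top-left) into PDF points (origin at the bottom-left by default). The function name `scale_box_to_pdf` and its argument layout are hypothetical and are not taken from `app/redaction.py`.

```python
from typing import Tuple


def scale_box_to_pdf(
    left: float,
    top: float,
    width: float,
    height: float,
    image_size: Tuple[float, float],
    page_size: Tuple[float, float],
) -> Tuple[float, float, float, float]:
    """Convert an OCR box in image pixels to a PDF rectangle in points.

    Hypothetical helper: assumes the box uses a top-left origin (as OCR
    output typically does) and the PDF page uses a bottom-left origin.
    Returns (x0, y0, x1, y1) with (x0, y0) as the lower-left corner.
    """
    img_w, img_h = image_size
    page_w, page_h = page_size

    # Scale factors come from the actual image size, so they stay
    # correct whatever DPI was used to render the page.
    sx = page_w / img_w
    sy = page_h / img_h

    x0 = left * sx
    x1 = (left + width) * sx
    # Flip the vertical axis: image y grows downward, PDF y grows upward.
    y1 = page_h - top * sy
    y0 = page_h - (top + height) * sy
    return (x0, y0, x1, y1)


# Example: a US Letter page (612 x 792 points) rendered at 200 DPI
# gives a 1700 x 2200 pixel image.
rect = scale_box_to_pdf(170, 220, 340, 110, (1700, 2200), (612, 792))
```

Because the factors are computed per page, the same function works unchanged if the DPI is lowered for speed.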
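
As an illustration of the "cache loaded models in memory" optimization listed above, a loader can be wrapped in `functools.lru_cache` so that the expensive load runs once per process rather than once per request. This is a minimal sketch: `_load_model`, `get_model`, and the model name are hypothetical placeholders, and the real project may load its NER model differently.

```python
from functools import lru_cache
from typing import Any, Dict, List

# Records each real load so the caching effect is visible (illustration only).
load_calls: List[str] = []


def _load_model(model_name: str) -> Dict[str, Any]:
    """Hypothetical stand-in for an expensive model load.

    In a real service this might wrap something like
    transformers.pipeline("ner", model=model_name); that is an
    assumption, not the project's confirmed code.
    """
    load_calls.append(model_name)
    return {"name": model_name}  # placeholder for the loaded model


@lru_cache(maxsize=2)
def get_model(model_name: str) -> Dict[str, Any]:
    """Return a cached model, loading it only on first use."""
    return _load_model(model_name)


first = get_model("example-ner-model")
second = get_model("example-ner-model")  # served from the cache
```

The cache lives in process memory, so each container instance still loads the model once on its first request.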