Project Structure
pdf-redaction-api/
│
├── main.py                 # FastAPI application entry point
├── Dockerfile              # Docker configuration for deployment
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation (for HuggingFace)
├── DEPLOYMENT.md           # Deployment guide
├── .gitignore              # Git ignore rules
├── .dockerignore           # Docker ignore rules
│
├── app/                    # Application modules
│   ├── __init__.py         # Package initialization
│   └── redaction.py        # Core redaction logic (PDFRedactor class)
│
├── uploads/                # Temporary upload directory
│   └── .gitkeep            # Keep directory in git
│
├── outputs/                # Redacted PDF output directory
│   └── .gitkeep            # Keep directory in git
│
├── tests/                  # Test suite
│   ├── __init__.py
│   └── test_api.py         # API endpoint tests
│
└── client_example.py       # Example client for API usage
File Descriptions
Core Files
main.py
FastAPI application with endpoints:
- POST /redact - Upload and redact PDF
- GET /download/{job_id} - Download redacted PDF
- GET /health - Health check
- GET /stats - API statistics
- DELETE /cleanup/{job_id} - Manual cleanup
app/redaction.py
Core redaction logic:
- PDFRedactor class
- OCR processing with pytesseract
- NER using HuggingFace transformers
- Entity-to-box mapping
- PDF redaction with coordinate scaling
Configuration Files
requirements.txt
Python dependencies:
- FastAPI & Uvicorn (API framework)
- Transformers & Torch (NER model)
- pypdf (PDF manipulation)
- pdf2image (PDF to image conversion)
- pytesseract (OCR)
- Pillow (Image processing)
Dockerfile
Build steps (single-stage build):
- Install system dependencies (tesseract, poppler)
- Install Python dependencies
- Copy application code
- Configure for port 7860 (HuggingFace default)
Documentation
README.md
HuggingFace Space documentation:
- Features overview
- API endpoint documentation
- Usage examples (cURL, Python)
- Response format
- Local development setup
DEPLOYMENT.md
Step-by-step deployment guide:
- HuggingFace Spaces setup
- Git workflow
- Configuration options
- Security considerations
- Troubleshooting
- Cost estimation
Testing & Examples
tests/test_api.py
Unit tests for API endpoints:
- Health check tests
- Upload validation tests
- Error handling tests
client_example.py
Example client implementation:
- Upload PDF
- Download redacted file
- Health check
- Statistics
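The upload step of such a client can be sketched with the standard library alone. This is a minimal sketch, not the contents of client_example.py: the base URL, the form field name "file", and the helper names `build_upload_request` and `download_url` are assumptions made for illustration.

```python
import uuid
from urllib import request

BASE_URL = "http://localhost:7860"  # assumption: local server on the documented port


def build_upload_request(pdf_bytes: bytes, filename: str,
                         base_url: str = BASE_URL) -> request.Request:
    """Build a multipart/form-data POST for /redact without sending it."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumption: the API expects the upload in a form field named "file".
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return request.Request(f"{base_url}/redact", data=body,
                           headers=headers, method="POST")


def download_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Return the URL from which the redacted PDF for a job can be fetched."""
    return f"{base_url}/download/{job_id}"
```

Passing the result of `build_upload_request(...)` to `urllib.request.urlopen` sends the upload. The JSON fields of the response are not specified here, so parsing is left to the caller.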
Data Flow
┌───────────────────────────────────────────────────────┐
│ 1. Client uploads PDF                                 │
│    POST /redact with file                             │
└───────────────────────────────────────────────────────┘
                            ↓
┌───────────────────────────────────────────────────────┐
│ 2. FastAPI (main.py)                                  │
│    - Validates file                                   │
│    - Generates job_id                                 │
│    - Saves to uploads/                                │
└───────────────────────────────────────────────────────┘
                            ↓
┌───────────────────────────────────────────────────────┐
│ 3. PDFRedactor (app/redaction.py)                     │
│    - perform_ocr() → Extract text + boxes             │
│    - run_ner() → Identify entities                    │
│    - map_entities_to_boxes() → Link entities to coords│
│    - create_redacted_pdf() → Generate output          │
└───────────────────────────────────────────────────────┘
                            ↓
┌───────────────────────────────────────────────────────┐
│ 4. Response                                           │
│    - Return job_id and entity list                    │
│    - Save redacted PDF to outputs/                    │
└───────────────────────────────────────────────────────┘
                            ↓
┌───────────────────────────────────────────────────────┐
│ 5. Client downloads                                   │
│    GET /download/{job_id}                             │
└───────────────────────────────────────────────────────┘
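Step 2 of the flow above (validate, assign a job ID, save) can be sketched as follows. This is a minimal sketch under stated assumptions: the helper name `save_upload`, the use of a random UUID as job ID, the `{job_id}.pdf` file naming, and the specific validation checks are illustrative and not taken from main.py.

```python
import uuid
from pathlib import Path

UPLOAD_DIR = Path("uploads")  # directory name from the project tree


def save_upload(filename: str, data: bytes,
                upload_dir: Path = UPLOAD_DIR) -> tuple[str, Path]:
    """Validate an uploaded PDF, assign a job ID, and store it (hypothetical helper)."""
    # Assumption: validation checks the extension and the PDF header bytes.
    if not filename.lower().endswith(".pdf"):
        raise ValueError("only .pdf files are accepted")
    if not data.startswith(b"%PDF-"):
        raise ValueError("file does not start with a PDF header")

    job_id = uuid.uuid4().hex  # assumption: a random UUID serves as job ID
    upload_dir.mkdir(parents=True, exist_ok=True)
    path = upload_dir / f"{job_id}.pdf"  # assumption: file is named after the job
    path.write_bytes(data)
    return job_id, path
```

Naming the stored file after the job ID rather than the client-supplied filename avoids path traversal through crafted filenames and makes the later GET /download/{job_id} lookup a simple path join.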
Key Components
1. FastAPI Application (main.py)
Endpoints:
- RESTful API design
- File upload handling
- Background task cleanup
- CORS middleware for web access
Features:
- Automatic OpenAPI documentation at /docs
- JSON response models with Pydantic
- Error handling with HTTP exceptions
- Request validation
2. Redaction Engine (app/redaction.py)
Pipeline Steps:
OCR Processing
- Convert PDF pages to images (pdf2image)
- Extract text and bounding boxes (pytesseract)
- Store image dimensions for coordinate scaling
NER Processing
- Load HuggingFace model
- Identify entities in text
- Return entity types and character positions
Mapping
- Create character span index for OCR words
- Match NER entities to OCR bounding boxes
- Handle partial word matches
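The mapping step can be illustrated with a small character-span index. This is a sketch, not the PDFRedactor implementation: the `OcrWord` type and both function names are hypothetical, and it assumes the text passed to NER is the OCR words joined by single spaces.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    box: tuple[int, int, int, int]  # (left, top, width, height) in image pixels
    start: int = 0  # character offset of the word in the joined page text
    end: int = 0    # exclusive end offset


def build_span_index(words: list[OcrWord]) -> str:
    """Join words with single spaces and record each word's character span."""
    pos = 0
    for w in words:
        w.start, w.end = pos, pos + len(w.text)
        pos = w.end + 1  # +1 for the separating space
    return " ".join(w.text for w in words)


def boxes_for_entity(words: list[OcrWord], ent_start: int, ent_end: int):
    """Return the box of every word whose span overlaps [ent_start, ent_end)."""
    return [w.box for w in words if w.start < ent_end and w.end > ent_start]


words = [
    OcrWord("Dear", (10, 10, 40, 12)),
    OcrWord("John", (55, 10, 38, 12)),
    OcrWord("Smith,", (98, 10, 52, 12)),
]
text = build_span_index(words)  # "Dear John Smith,"
# An entity "John Smith" covers characters 5..15 of the joined text.
print(boxes_for_entity(words, 5, 15))
# → [(55, 10, 38, 12), (98, 10, 52, 12)]
```

The overlap test handles partial matches: the entity ends inside the OCR token "Smith," and the whole token's box is still selected, which errs on the side of redacting slightly too much.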
Redaction
- Scale OCR image coordinates to PDF points
- Create black rectangle annotations
- Write redacted PDF with pypdf
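The coordinate scaling in the redaction step amounts to a linear rescale plus a vertical flip. The sketch below assumes an unrotated page whose lower-left corner is at (0, 0) and OCR boxes given as (left, top, width, height) in pixels; the function name is hypothetical.

```python
def image_box_to_pdf_rect(box, image_size, page_size):
    """Convert an OCR box in image pixels to a PDF rectangle in points.

    box:        (left, top, width, height), origin at the image's top-left
    image_size: (width, height) of the rendered page image in pixels
    page_size:  (width, height) of the PDF page in points
    Returns (x0, y0, x1, y1) with the origin at the page's bottom-left.
    """
    left, top, width, height = box
    img_w, img_h = image_size
    page_w, page_h = page_size

    sx = page_w / img_w  # points per pixel, horizontal
    sy = page_h / img_h  # points per pixel, vertical

    x0 = left * sx
    x1 = (left + width) * sx
    # Image y grows downward, default PDF user space y grows upward: flip.
    y1 = page_h - top * sy
    y0 = page_h - (top + height) * sy
    return (x0, y0, x1, y1)
```

For a US Letter page (612 × 792 points) rendered at 200 DPI (1700 × 2200 pixels), the box (170, 220, 340, 110) maps to approximately (61.2, 673.2, 183.6, 712.8).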
3. Docker Container
Layers:
- Base: Python 3.10 slim
- System packages: tesseract-ocr, poppler-utils
- Python packages: From requirements.txt
- Application code: Copied last for better caching
Optimizations:
- Multi-stage build (not used here, but possible)
- Minimal base image
- Cached dependency layers
- .dockerignore to reduce context size
Environment Variables
Default configuration (can be overridden):
PYTHONUNBUFFERED=1 # Immediate log output
HF_HOME=/app/cache # HuggingFace cache directory
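A small helper can make the "default unless overridden" behaviour explicit. This is a sketch; `DEFAULTS` and `load_settings` are hypothetical names, not part of the project.

```python
import os

# Defaults documented above; values set in the real environment take precedence.
DEFAULTS = {"PYTHONUNBUFFERED": "1", "HF_HOME": "/app/cache"}


def load_settings(env=os.environ) -> dict[str, str]:
    """Return the effective value of each documented variable."""
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}
```

Note that PYTHONUNBUFFERED is read by the interpreter at startup, so it has to be set in the container or shell environment; reading it here is only useful for reporting the effective configuration.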
Port Configuration
- Development: 7860 (configurable in main.py)
- Production (HF Spaces): 7860 (required)
Directory Permissions
Ensure write permissions for:
- uploads/ - Temporary PDF storage
- outputs/ - Redacted PDF storage
- cache/ - Model cache (created automatically)
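One way to catch permission problems early is a startup check. The sketch below is an assumption about how such a check could look; `ensure_writable` is a hypothetical helper, and the directory names are taken relative to a base path.

```python
import os
from pathlib import Path


def ensure_writable(dirs=("uploads", "outputs", "cache"),
                    base: Path = Path(".")) -> list[Path]:
    """Create each directory if missing and fail early if it is not writable."""
    checked = []
    for name in dirs:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)
        # Creating files in a directory needs both write and search permission.
        if not os.access(path, os.W_OK | os.X_OK):
            raise PermissionError(f"{path} is not writable")
        checked.append(path)
    return checked
```

Calling this once at application startup turns a late failure during the first upload into an immediate, clearly worded error.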
Adding New Features
Add New Endpoint
- Define in main.py:
  @app.get("/new-endpoint")
  async def new_endpoint():
      return {"message": "Hello"}
- Add response model if needed
- Update README.md documentation
- Add tests in tests/test_api.py
Add New Redaction Option
- Modify PDFRedactor class in app/redaction.py
- Add parameter to redact_document() method
- Update API endpoint in main.py
- Document in README.md
Add Authentication
- Install: pip install python-jose passlib
- Create app/auth.py with JWT logic
- Add middleware to main.py
- Protect endpoints with dependencies
Best Practices
- Logging: Use logger for all important events
- Error Handling: Catch exceptions and return meaningful errors
- Validation: Use Pydantic models for request/response validation
- Cleanup: Always clean up temporary files
- Documentation: Keep README.md and code comments updated
- Testing: Add tests for new features
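The logging, error handling, and cleanup practices combine naturally in a try/except/finally wrapper. This is a minimal sketch; the logger name and the `process_with_cleanup` helper are hypothetical, and `process` stands in for whatever function performs the redaction.

```python
import logging
from pathlib import Path

logger = logging.getLogger("pdf_redaction")  # hypothetical logger name


def process_with_cleanup(upload_path: Path, process):
    """Run process(upload_path) and always delete the temporary upload."""
    logger.info("processing %s", upload_path.name)
    try:
        return process(upload_path)
    except Exception:
        # Log the full traceback, then let the caller map it to an HTTP error.
        logger.exception("processing failed for %s", upload_path.name)
        raise
    finally:
        # Runs on success and on failure, so no upload is left behind.
        upload_path.unlink(missing_ok=True)
        logger.info("removed %s", upload_path.name)
```

Re-raising after logging keeps the decision about the client-facing error message in the endpoint code, while the `finally` block guarantees the temporary file is removed either way.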
Performance Considerations
Bottlenecks
- OCR processing (most time-consuming)
- Model inference (NER)
- File I/O
Optimizations
- Lower DPI for faster OCR (trade-off with accuracy)
- Cache loaded models in memory
- Use async file operations
- Implement request queuing for high load
- Consider GPU for NER model
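Caching the loaded model in memory can be sketched with `functools.lru_cache`. The loader body below is a stand-in so the example runs without downloading anything; in the real service the expensive call would be the HuggingFace model load, and the model name shown is a placeholder.

```python
from functools import lru_cache

load_count = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=1)
def get_ner_model(model_name: str):
    """Load the NER model once; later calls return the cached object."""
    global load_count
    load_count += 1
    # Stand-in for the expensive load (assumption: the real code would build
    # a transformers NER pipeline here).
    return {"name": model_name}


first = get_ner_model("example/ner-model")   # placeholder model name
second = get_ner_model("example/ner-model")  # served from the cache
print(first is second, load_count)
# → True 1
```

With `maxsize=1` only the most recently requested model stays in memory, which suits a service that uses a single model; a larger value trades RAM for the ability to switch between models without reloading.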
Scaling
- Horizontal: Multiple container instances
- Vertical: Larger CPU/RAM allocation
- Caching: Redis for temporary results
- Queue: Celery for background processing