
Project Structure

pdf-redaction-api/
│
├── main.py                     # FastAPI application entry point
├── Dockerfile                  # Docker configuration for deployment
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation (for HuggingFace)
├── DEPLOYMENT.md               # Deployment guide
├── .gitignore                  # Git ignore rules
├── .dockerignore               # Docker ignore rules
│
├── app/                        # Application modules
│   ├── __init__.py             # Package initialization
│   └── redaction.py            # Core redaction logic (PDFRedactor class)
│
├── uploads/                    # Temporary upload directory
│   └── .gitkeep                # Keep directory in git
│
├── outputs/                    # Redacted PDF output directory
│   └── .gitkeep                # Keep directory in git
│
├── tests/                      # Test suite
│   ├── __init__.py
│   └── test_api.py             # API endpoint tests
│
└── client_example.py           # Example client for API usage

File Descriptions

Core Files

main.py

FastAPI application with endpoints:

  • POST /redact - Upload and redact PDF
  • GET /download/{job_id} - Download redacted PDF
  • GET /health - Health check
  • GET /stats - API statistics
  • DELETE /cleanup/{job_id} - Manual cleanup

app/redaction.py

Core redaction logic:

  • PDFRedactor class
  • OCR processing with pytesseract
  • NER using HuggingFace transformers
  • Entity-to-box mapping
  • PDF redaction with coordinate scaling
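
The entity-to-box mapping can be illustrated with a small, dependency-free sketch. It assumes OCR words are joined with single spaces before NER runs, so that NER character offsets can be compared against per-word spans. The `OcrWord` type and both helper functions are hypothetical names for illustration and are not taken from `app/redaction.py`.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    box: tuple  # (left, top, width, height) in image pixels


def build_span_index(words):
    """Join words with single spaces and record each word's (start, end) offsets."""
    spans, pos = [], 0
    for word in words:
        spans.append((pos, pos + len(word.text)))
        pos += len(word.text) + 1  # +1 for the joining space
    full_text = " ".join(w.text for w in words)
    return full_text, spans


def boxes_for_entity(ent_start, ent_end, words, spans):
    """Return the box of every word whose span overlaps [ent_start, ent_end)."""
    return [
        word.box
        for word, (start, end) in zip(words, spans)
        if start < ent_end and ent_start < end  # any overlap, so partial words count
    ]


words = [
    OcrWord("Contact", (10, 10, 70, 12)),
    OcrWord("John", (85, 10, 40, 12)),
    OcrWord("Smith,", (130, 10, 55, 12)),
    OcrWord("today", (190, 10, 45, 12)),
]
text, spans = build_span_index(words)

# Pretend NER reported a person covering the characters of "John Smith".
start = text.index("John Smith")
hits = boxes_for_entity(start, start + len("John Smith"), words, spans)
```

Because the test is an overlap check rather than an exact match, the OCR token `Smith,` is selected even though the entity ends before the trailing comma. That is one simple way to handle partial word matches.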

Configuration Files

requirements.txt

Python dependencies:

  • FastAPI & Uvicorn (API framework)
  • Transformers & Torch (NER model)
  • PyPDF (PDF manipulation)
  • pdf2image (PDF to image conversion)
  • pytesseract (OCR)
  • Pillow (Image processing)

Dockerfile

Build steps (single-stage):

  1. Install system dependencies (tesseract, poppler)
  2. Install Python dependencies
  3. Copy application code
  4. Configure for port 7860 (HuggingFace default)

Documentation

README.md

HuggingFace Space documentation:

  • Features overview
  • API endpoint documentation
  • Usage examples (cURL, Python)
  • Response format
  • Local development setup

DEPLOYMENT.md

Step-by-step deployment guide:

  • HuggingFace Spaces setup
  • Git workflow
  • Configuration options
  • Security considerations
  • Troubleshooting
  • Cost estimation

Testing & Examples

tests/test_api.py

Unit tests for API endpoints:

  • Health check tests
  • Upload validation tests
  • Error handling tests

client_example.py

Example client implementation:

  • Upload PDF
  • Download redacted file
  • Health check
  • Statistics
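
As a rough illustration of how a client might address the non-upload endpoints, the sketch below builds requests with the standard library only. The base URL is an assumption (a local server on port 7860), `build_request` and `fetch_json` are hypothetical helpers, and the multipart upload for `POST /redact` is left out because it needs more code than fits here. The actual `client_example.py` may be organized differently.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server on the documented port


def build_request(method, path):
    """Create a Request for one of the documented endpoints without sending it."""
    return urllib.request.Request(BASE_URL + path, method=method)


def fetch_json(request, timeout=30):
    """Send the request and decode a JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# "example-job-id" is a placeholder, not a real job identifier.
health = build_request("GET", "/health")
stats = build_request("GET", "/stats")
download = build_request("GET", "/download/example-job-id")
cleanup = build_request("DELETE", "/cleanup/example-job-id")
```

With the server running, `fetch_json(health)` would return the decoded health response. The download endpoint returns a PDF, so its body should be read as bytes instead of JSON.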

Data Flow

┌─────────────────────────────────────────────────────────┐
│ 1. Client uploads PDF                                   │
│    POST /redact with file                               │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 2. FastAPI (main.py)                                    │
│    - Validates file                                     │
│    - Generates job_id                                   │
│    - Saves to uploads/                                  │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 3. PDFRedactor (app/redaction.py)                       │
│    - perform_ocr() → Extract text + boxes               │
│    - run_ner() → Identify entities                      │
│    - map_entities_to_boxes() → Link entities to coords  │
│    - create_redacted_pdf() → Generate output            │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 4. Response                                             │
│    - Return job_id and entity list                      │
│    - Save redacted PDF to outputs/                      │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 5. Client downloads                                     │
│    GET /download/{job_id}                               │
└─────────────────────────────────────────────────────────┘
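
Step 2 of the flow (validate, generate a job_id, choose a path under uploads/) might look roughly like the sketch below. The size limit, the specific checks, and the function names are assumptions for illustration. The validation actually performed in `main.py` is not described in this document.

```python
import uuid
from pathlib import Path

UPLOAD_DIR = Path("uploads")
MAX_BYTES = 10 * 1024 * 1024  # assumption: 10 MB cap, not stated in the source


def validate_pdf(filename, data):
    """Return an error message, or None when the upload looks like a PDF."""
    if not filename.lower().endswith(".pdf"):
        return "file must have a .pdf extension"
    if len(data) > MAX_BYTES:
        return "file is too large"
    if not data.startswith(b"%PDF-"):
        return "file does not start with a PDF header"
    return None


def new_job_paths(upload_dir=UPLOAD_DIR):
    """Generate a job_id and the upload path derived from it."""
    job_id = uuid.uuid4().hex
    # Naming the file after the job_id keeps client-supplied names out of paths.
    return job_id, upload_dir / f"{job_id}.pdf"


error = validate_pdf("report.pdf", b"%PDF-1.7\n")
job_id, upload_path = new_job_paths()
```

Deriving the stored filename from the generated job_id, rather than from the uploaded filename, is a design choice that sidesteps path traversal and name collisions.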

Key Components

1. FastAPI Application (main.py)

Design:

  • RESTful API design
  • File upload handling
  • Background task cleanup
  • CORS middleware for web access

Features:

  • Automatic OpenAPI documentation at /docs
  • JSON response models with Pydantic
  • Error handling with HTTP exceptions
  • Request validation

2. Redaction Engine (app/redaction.py)

Pipeline Steps:

  1. OCR Processing

    • Convert PDF pages to images (pdf2image)
    • Extract text and bounding boxes (pytesseract)
    • Store image dimensions for coordinate scaling
  2. NER Processing

    • Load HuggingFace model
    • Identify entities in text
    • Return entity types and character positions
  3. Mapping

    • Create character span index for OCR words
    • Match NER entities to OCR bounding boxes
    • Handle partial word matches
  4. Redaction

    • Scale OCR image coordinates to PDF points
    • Create black rectangle annotations
    • Write redacted PDF with pypdf
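
The coordinate scaling in the redaction step can be sketched as plain arithmetic. The example assumes an unrotated page whose origin is the bottom-left corner (the PDF default), measured in points at 72 per inch, and OCR boxes given as (left, top, width, height) in pixels from the image's top-left corner. `image_box_to_pdf_rect` is a hypothetical helper, not a function from the project.

```python
def image_box_to_pdf_rect(box, image_size, page_size):
    """Convert (left, top, width, height) in pixels to (x0, y0, x1, y1) in points."""
    left, top, width, height = box
    img_w, img_h = image_size
    page_w, page_h = page_size
    sx, sy = page_w / img_w, page_h / img_h  # points per pixel on each axis
    x0 = left * sx
    x1 = (left + width) * sx
    # Image y grows downward from the top; PDF y grows upward from the bottom.
    y1 = page_h - top * sy
    y0 = page_h - (top + height) * sy
    return (x0, y0, x1, y1)


# US Letter page (612 x 792 pt) rendered at 300 DPI -> 2550 x 3300 px
dpi = 300
page_size = (612.0, 792.0)
image_size = (round(612 / 72 * dpi), round(792 / 72 * dpi))

# A 150 x 50 px word box located 300 px from the left and 600 px from the top
rect = image_box_to_pdf_rect((300, 600, 150, 50), image_size, page_size)
```

Here the result is approximately (72, 636, 108, 648) in points. This is why the image dimensions are stored during OCR: without them the scale factors cannot be recovered when the render DPI changes.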

3. Docker Container

Layers:

  • Base: Python 3.10 slim
  • System packages: tesseract-ocr, poppler-utils
  • Python packages: From requirements.txt
  • Application code: Copied last for better caching

Optimizations:

  • Multi-stage build (not used here, but possible)
  • Minimal base image
  • Cached dependency layers
  • .dockerignore to reduce context size

Environment Variables

Default configuration (can be overridden):

PYTHONUNBUFFERED=1        # Immediate log output
HF_HOME=/app/cache        # HuggingFace cache directory
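
A minimal sketch of how the defaults above combine with overrides from the environment. `effective_config` is a hypothetical helper. In practice `PYTHONUNBUFFERED` is read by the Python interpreter itself, and `HF_HOME` is read by the Hugging Face libraries, so application code may never need to read either one directly.

```python
import os

# Defaults as documented above.
DEFAULTS = {"PYTHONUNBUFFERED": "1", "HF_HOME": "/app/cache"}


def effective_config(env=None):
    """Overlay the environment on the documented defaults."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


defaults_only = effective_config({})
overridden = effective_config({"HF_HOME": "/tmp/hf-cache"})
```

Passing a plain dict instead of `os.environ` makes the override behavior easy to check without touching the real environment.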

Port Configuration

  • Development: 7860 (configurable in main.py)
  • Production (HF Spaces): 7860 (required)

Directory Permissions

Ensure write permissions for:

  • uploads/ - Temporary PDF storage
  • outputs/ - Redacted PDF storage
  • cache/ - Model cache (created automatically)
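
One way to verify these permissions at startup is sketched below. The helper names are hypothetical, and the example runs against a throwaway temporary directory rather than the real project root.

```python
import os
import tempfile
from pathlib import Path


def ensure_writable(directory):
    """Create the directory if needed and report whether files can be created in it."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    # Creating files in a directory requires both write and search (execute) access.
    return os.access(path, os.W_OK | os.X_OK)


def check_dirs(base, names=("uploads", "outputs", "cache")):
    """Map each required directory name to its writability status."""
    return {name: ensure_writable(Path(base) / name) for name in names}


with tempfile.TemporaryDirectory() as base:
    status = check_dirs(base)
```

Failing fast when any value in `status` is `False` tends to give a clearer error than a write failure in the middle of a redaction job.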

Adding New Features

Add New Endpoint

  1. Define in main.py:
@app.get("/new-endpoint")
async def new_endpoint():
    return {"message": "Hello"}
  2. Add response model if needed
  3. Update README.md documentation
  4. Add tests in tests/test_api.py

Add New Redaction Option

  1. Modify PDFRedactor class in app/redaction.py
  2. Add parameter to redact_document() method
  3. Update API endpoint in main.py
  4. Document in README.md
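
As an illustration of the kind of option these steps describe, the sketch below filters NER results by entity type and confidence before any boxes are drawn. The `entity_types` and `min_score` parameters are hypothetical, as is `select_entities`. The dictionary keys are assumed to follow the aggregated output of a HuggingFace token-classification pipeline, which the project may or may not use in this form.

```python
def select_entities(entities, entity_types=None, min_score=0.0):
    """Keep entities of the requested types; entity_types=None keeps every type."""
    allowed = None if entity_types is None else {t.upper() for t in entity_types}
    return [
        ent
        for ent in entities
        if ent["score"] >= min_score
        and (allowed is None or ent["entity_group"].upper() in allowed)
    ]


# Made-up sample data for illustration only.
entities = [
    {"entity_group": "PER", "word": "John Smith", "start": 8, "end": 18, "score": 0.99},
    {"entity_group": "ORG", "word": "Acme", "start": 30, "end": 34, "score": 0.91},
    {"entity_group": "LOC", "word": "Paris", "start": 40, "end": 45, "score": 0.42},
]

people_only = select_entities(entities, entity_types=["per"])
confident = select_entities(entities, min_score=0.5)
```

Keeping the filter as a pure function means it can be unit-tested without loading the OCR or NER stack, and the API endpoint only needs to pass the new parameters through.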

Add Authentication

  1. Install: pip install python-jose passlib
  2. Create app/auth.py with JWT logic
  3. Add middleware to main.py
  4. Protect endpoints with dependencies

Best Practices

  1. Logging: Use logger for all important events
  2. Error Handling: Catch exceptions and return meaningful errors
  3. Validation: Use Pydantic models for request/response validation
  4. Cleanup: Always clean up temporary files
  5. Documentation: Keep README.md and code comments updated
  6. Testing: Add tests for new features
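
The logging and cleanup practices can be combined in a small context manager, sketched here with the standard library only. The logger name and `temporary_upload` are assumptions for illustration. The project's own cleanup is described as running in background tasks and may be structured differently.

```python
import logging
import tempfile
from contextlib import contextmanager
from pathlib import Path

logger = logging.getLogger("pdf_redaction")  # hypothetical logger name


@contextmanager
def temporary_upload(data, directory, name):
    """Write data to a file and delete it when the block exits, even on error."""
    path = Path(directory) / name
    path.write_bytes(data)
    logger.info("saved upload %s", path.name)
    try:
        yield path
    finally:
        path.unlink(missing_ok=True)
        logger.info("removed upload %s", path.name)


with tempfile.TemporaryDirectory() as workdir:
    try:
        with temporary_upload(b"%PDF-1.7", workdir, "job.pdf") as pdf_path:
            existed_during = pdf_path.exists()
            raise RuntimeError("simulated processing failure")
    except RuntimeError:
        logger.exception("processing failed")
    exists_after = pdf_path.exists()
```

The `finally` clause runs whether processing succeeds or raises, so the upload is removed in both cases.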

Performance Considerations

Bottlenecks

  1. OCR processing (most time-consuming)
  2. Model inference (NER)
  3. File I/O

Optimizations

  • Lower DPI for faster OCR (trade-off with accuracy)
  • Cache loaded models in memory
  • Use async file operations
  • Implement request queuing for high load
  • Consider GPU for NER model
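
Caching the loaded model in memory can be as simple as memoizing the loader, as in the sketch below. The dictionary returned here is a stand-in so the example runs without downloading anything. A real loader would return the NER pipeline object instead. The model name and `get_ner_model` are hypothetical.

```python
from functools import lru_cache

LOAD_COUNT = 0  # counts how many times the expensive load actually runs


@lru_cache(maxsize=None)
def get_ner_model(model_name):
    """Load the model once per process; later calls return the cached object."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # Stand-in for an expensive load, e.g. building a transformers NER pipeline.
    return {"name": model_name}


first = get_ner_model("example/ner-model")
second = get_ner_model("example/ner-model")
```

Both calls return the same object and the loader body runs only once, so per-request latency no longer includes model loading. Each worker process still holds its own copy.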

Scaling

  • Horizontal: Multiple container instances
  • Vertical: Larger CPU/RAM allocation
  • Caching: Redis for temporary results
  • Queue: Celery for background processing