Project Structure

pdf-redaction-api/
│
├── main.py                      # FastAPI application entry point
├── Dockerfile                   # Docker configuration for deployment
├── requirements.txt             # Python dependencies
├── README.md                    # Project documentation (for HuggingFace)
├── DEPLOYMENT.md               # Deployment guide
├── .gitignore                  # Git ignore rules
├── .dockerignore               # Docker ignore rules
│
├── app/                        # Application modules
│   ├── __init__.py            # Package initialization
│   └── redaction.py           # Core redaction logic (PDFRedactor class)
│
├── uploads/                    # Temporary upload directory
│   └── .gitkeep               # Keep directory in git
│
├── outputs/                    # Redacted PDF output directory
│   └── .gitkeep               # Keep directory in git
│
├── tests/                      # Test suite
│   ├── __init__.py
│   └── test_api.py            # API endpoint tests
│
└── client_example.py           # Example client for API usage

File Descriptions

Core Files

`main.py`

FastAPI application with endpoints:

POST /redact - Upload and redact PDF
GET /download/{job_id} - Download redacted PDF
GET /health - Health check
GET /stats - API statistics
DELETE /cleanup/{job_id} - Manual cleanup

`app/redaction.py`

Core redaction logic:

PDFRedactor class
OCR processing with pytesseract
NER using HuggingFace transformers
Entity-to-box mapping
PDF redaction with coordinate scaling
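
The coordinate-scaling step can be illustrated with a minimal sketch. It assumes OCR boxes arrive in pixels as `left`, `top`, `width`, `height` with the origin at the top-left of the page image, and that the PDF page is measured in points with the origin at the bottom-left and no rotation. The name `scale_box_to_pdf` and the `PdfRect` type are hypothetical; the actual `PDFRedactor` code may handle this differently.

```python
from typing import NamedTuple, Tuple


class PdfRect(NamedTuple):
    x0: float  # left edge, in PDF points
    y0: float  # bottom edge, in PDF points
    x1: float  # right edge, in PDF points
    y1: float  # top edge, in PDF points


def scale_box_to_pdf(
    left: float,
    top: float,
    width: float,
    height: float,
    image_size: Tuple[float, float],
    page_size: Tuple[float, float],
) -> PdfRect:
    """Convert a pixel box (top-left origin) to a PDF rectangle (bottom-left origin)."""
    img_w, img_h = image_size
    page_w, page_h = page_size
    sx = page_w / img_w  # points per pixel, horizontally
    sy = page_h / img_h  # points per pixel, vertically
    x0 = left * sx
    x1 = (left + width) * sx
    # The y axis is flipped: the top of the image maps to the top of the page.
    y1 = page_h - top * sy
    y0 = page_h - (top + height) * sy
    return PdfRect(x0, y0, x1, y1)


# A US Letter page (612 x 792 pt) rendered at 200 DPI is 1700 x 2200 px.
rect = scale_box_to_pdf(170, 220, 340, 110, (1700, 2200), (612.0, 792.0))
```

Storing the image dimensions during OCR is what makes `sx` and `sy` available at redaction time.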

Configuration Files

`requirements.txt`

Python dependencies:

FastAPI & Uvicorn (API framework)
Transformers & Torch (NER model)
pypdf (PDF manipulation)
pdf2image (PDF to image conversion)
pytesseract (OCR)
Pillow (Image processing)

`Dockerfile`

Build steps (single-stage):

Install system dependencies (tesseract, poppler)
Install Python dependencies
Copy application code
Configure for port 7860 (HuggingFace Spaces default)

Documentation

`README.md`

HuggingFace Space documentation:

Features overview
API endpoint documentation
Usage examples (cURL, Python)
Response format
Local development setup

`DEPLOYMENT.md`

Step-by-step deployment guide:

HuggingFace Spaces setup
Git workflow
Configuration options
Security considerations
Troubleshooting
Cost estimation

Testing & Examples

`tests/test_api.py`

Unit tests for API endpoints:

Health check tests
Upload validation tests
Error handling tests

`client_example.py`

Example client implementation:

Upload PDF
Download redacted file
Health check
Statistics

Data Flow

┌─────────────────────────────────────────────────────────┐
│ 1. Client uploads PDF                                   │
│    POST /redact with file                               │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 2. FastAPI (main.py)                                    │
│    - Validates file                                     │
│    - Generates job_id                                   │
│    - Saves to uploads/                                  │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 3. PDFRedactor (app/redaction.py)                       │
│    - perform_ocr() → Extract text + boxes               │
│    - run_ner() → Identify entities                      │
│    - map_entities_to_boxes() → Link entities to coords  │
│    - create_redacted_pdf() → Generate output            │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 4. Response                                             │
│    - Save redacted PDF to outputs/                      │
│    - Return job_id and entity list                      │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 5. Client downloads                                     │
│    GET /download/{job_id}                               │
└─────────────────────────────────────────────────────────┘
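
Step 2 of the flow (validate, generate `job_id`, save to `uploads/`) might look roughly like the sketch below. This is an illustration, not the code in `main.py`: the function name `save_upload`, the specific checks (a `.pdf` extension and a `%PDF-` header), and the use of a UUID for `job_id` are all assumptions.

```python
import uuid
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # PDF files normally begin with this marker


def save_upload(data: bytes, filename: str, upload_dir: Path) -> str:
    """Validate an uploaded file, store it under a fresh job id, and return the id."""
    if not filename.lower().endswith(".pdf"):
        raise ValueError("only .pdf files are accepted")
    if not data.startswith(PDF_MAGIC):
        raise ValueError("file does not look like a PDF")

    job_id = uuid.uuid4().hex  # assumed id scheme: random 32-character hex string
    upload_dir.mkdir(parents=True, exist_ok=True)
    # Name the stored file after the job id, not the client-supplied filename,
    # so a crafted filename cannot escape the upload directory.
    (upload_dir / f"{job_id}.pdf").write_bytes(data)
    return job_id
```

With this naming scheme, `GET /download/{job_id}` and `DELETE /cleanup/{job_id}` could locate files from the id alone.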

Key Components

1. FastAPI Application (`main.py`)

Design:

RESTful API design
File upload handling
Background task cleanup
CORS middleware for web access

Features:

Automatic OpenAPI documentation at /docs
JSON response models with Pydantic
Error handling with HTTP exceptions
Request validation

2. Redaction Engine (`app/redaction.py`)

Pipeline Steps:

1. OCR Processing
   - Convert PDF pages to images (pdf2image)
   - Extract text and bounding boxes (pytesseract)
   - Store image dimensions for coordinate scaling
2. NER Processing
   - Load HuggingFace model
   - Identify entities in text
   - Return entity types and character positions
3. Mapping
   - Create character span index for OCR words
   - Match NER entities to OCR bounding boxes
   - Handle partial word matches
4. Redaction
   - Scale OCR image coordinates to PDF points
   - Create black rectangle annotations
   - Write redacted PDF with pypdf
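
The mapping step is the least obvious part of the pipeline, so here is a minimal sketch of one way it could work. It assumes the text passed to the NER model is the OCR words joined by single spaces, so each word's character span can be computed while joining. `OcrWord`, `build_span_index`, and `boxes_for_entity` are hypothetical names, not the actual `PDFRedactor` methods.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in image pixels


@dataclass
class OcrWord:
    text: str
    box: Box


def build_span_index(words: List[OcrWord]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join words with single spaces and record each word's [start, end) span."""
    spans = []
    pos = 0
    for word in words:
        spans.append((pos, pos + len(word.text)))
        pos += len(word.text) + 1  # +1 for the joining space
    return " ".join(w.text for w in words), spans


def boxes_for_entity(
    start: int, end: int, words: List[OcrWord], spans: List[Tuple[int, int]]
) -> List[Box]:
    """Return the box of every word whose span overlaps the entity's [start, end)."""
    # Using overlap instead of exact containment covers partial word matches:
    # an entity that covers only part of a word still redacts the whole word.
    return [w.box for w, (s, e) in zip(words, spans) if s < end and start < e]


words = [
    OcrWord("John", (10, 10, 40, 12)),
    OcrWord("Smith", (55, 10, 50, 12)),
    OcrWord("lives", (110, 10, 45, 12)),
    OcrWord("in", (160, 10, 15, 12)),
    OcrWord("Paris", (180, 10, 48, 12)),
]
text, spans = build_span_index(words)
person_boxes = boxes_for_entity(0, 10, words, spans)  # entity "John Smith"
```

Redacting the whole word on a partial match errs on the side of hiding slightly too much, which is usually the safer failure mode for redaction.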

3. Docker Container

Layers:

Base: Python 3.10 slim
System packages: tesseract-ocr, poppler-utils
Python packages: From requirements.txt
Application code: Copied last for better caching

Optimizations:

Multi-stage build (not currently used, but a possible future optimization)
Minimal base image
Cached dependency layers
.dockerignore to reduce context size

Environment Variables

Default configuration (can be overridden):

PYTHONUNBUFFERED=1        # Immediate log output
HF_HOME=/app/cache        # HuggingFace cache directory

Port Configuration

Development: 7860 (configurable in main.py)
Production (HF Spaces): 7860 (required)

Directory Permissions

Ensure write permissions for:

uploads/ - Temporary PDF storage
outputs/ - Redacted PDF storage
cache/ - Model cache (created automatically)
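
A startup check along these lines could surface permission problems early instead of at the first upload. It is a sketch only; the helper name `ensure_writable` is hypothetical and the real application may not perform such a check.

```python
import os
from pathlib import Path
from typing import Iterable, List


def ensure_writable(directories: Iterable[str]) -> List[str]:
    """Create each directory if missing and return those that are not writable."""
    problems = []
    for directory in map(Path, directories):
        try:
            directory.mkdir(parents=True, exist_ok=True)
        except OSError:
            # Covers permission errors and a regular file occupying the path.
            problems.append(str(directory))
            continue
        if not os.access(directory, os.W_OK):
            problems.append(str(directory))
    return problems
```

Calling `ensure_writable(["uploads", "outputs", "cache"])` at startup and logging any returned paths would be one way to use it.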

Adding New Features

Add New Endpoint

Define in main.py:

@app.get("/new-endpoint")
async def new_endpoint():
    return {"message": "Hello"}

Add response model if needed
Update README.md documentation
Add tests in tests/test_api.py

Add New Redaction Option

Modify PDFRedactor class in app/redaction.py
Add parameter to redact_document() method
Update API endpoint in main.py
Document in README.md
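
As an illustration of these steps, the sketch below adds an option that restricts redaction to selected entity types. Everything here is hypothetical: the `entity_types` parameter, the `filter_entities` helper, and the `entity_type` dictionary key are assumptions, since the actual signature of `redact_document()` and the shape of its entity records are not shown in this document.

```python
from typing import Dict, Iterable, List, Optional


def filter_entities(
    entities: Iterable[Dict], entity_types: Optional[Iterable[str]] = None
) -> List[Dict]:
    """Keep only entities whose type is in entity_types (None keeps everything)."""
    if entity_types is None:
        return list(entities)
    allowed = {t.upper() for t in entity_types}
    return [e for e in entities if e["entity_type"].upper() in allowed]


# Hypothetical NER output: entity type plus character positions.
found = [
    {"entity_type": "PER", "start": 0, "end": 10},
    {"entity_type": "LOC", "start": 20, "end": 25},
]
people_only = filter_entities(found, entity_types=["per"])
```

Defaulting the new parameter to `None` keeps existing callers working unchanged, which is worth preserving when the option is threaded through the endpoint in `main.py`.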

Add Authentication

Install: pip install python-jose passlib
Create app/auth.py with JWT logic
Add middleware to main.py
Protect endpoints with dependencies

Best Practices

Logging: Use logger for all important events
Error Handling: Catch exceptions and return meaningful errors
Validation: Use Pydantic models for request/response validation
Cleanup: Always clean up temporary files
Documentation: Keep README.md and code comments updated
Testing: Add tests for new features
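
The logging and cleanup practices can be combined in a small context manager, sketched below under the assumption that temporary files are plain paths on disk. The name `temporary_files` and the logger name are hypothetical.

```python
import logging
from contextlib import contextmanager
from pathlib import Path

logger = logging.getLogger("pdf_redaction")  # assumed logger name


@contextmanager
def temporary_files(*paths):
    """Yield the given paths and delete them afterwards, even if an error occurs."""
    resolved = [Path(p) for p in paths]
    try:
        yield resolved
    finally:
        for path in resolved:
            # missing_ok avoids a second error if the file was never created.
            path.unlink(missing_ok=True)
            logger.info("Removed temporary file %s", path)
```

Wrapping the processing of an upload in `with temporary_files(upload_path):` means the file is removed whether redaction succeeds or raises.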

Performance Considerations

Bottlenecks

OCR processing (most time-consuming)
Model inference (NER)
File I/O

Optimizations

Lower DPI for faster OCR (trade-off with accuracy)
Cache loaded models in memory
Use async file operations
Implement request queuing for high load
Consider GPU for NER model

Scaling

Horizontal: Multiple container instances
Vertical: Larger CPU/RAM allocation
Caching: Redis for temporary results
Queue: Celery for background processing