# Project Structure

```
pdf-redaction-api/
│
├── main.py                     # FastAPI application entry point
├── Dockerfile                  # Docker configuration for deployment
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation (for HuggingFace)
├── DEPLOYMENT.md               # Deployment guide
├── .gitignore                  # Git ignore rules
├── .dockerignore               # Docker ignore rules
│
├── app/                        # Application modules
│   ├── __init__.py             # Package initialization
│   └── redaction.py            # Core redaction logic (PDFRedactor class)
│
├── uploads/                    # Temporary upload directory
│   └── .gitkeep                # Keep directory in git
│
├── outputs/                    # Redacted PDF output directory
│   └── .gitkeep                # Keep directory in git
│
├── tests/                      # Test suite
│   ├── __init__.py
│   └── test_api.py             # API endpoint tests
│
└── client_example.py           # Example client for API usage
```

## File Descriptions

### Core Files

#### `main.py`
FastAPI application with endpoints:
- `POST /redact` - Upload and redact PDF
- `GET /download/{job_id}` - Download redacted PDF
- `GET /health` - Health check
- `GET /stats` - API statistics
- `DELETE /cleanup/{job_id}` - Manual cleanup

#### `app/redaction.py`
Core redaction logic:
- `PDFRedactor` class
- OCR processing with pytesseract
- NER using HuggingFace transformers
- Entity-to-box mapping
- PDF redaction with coordinate scaling

### Configuration Files

#### `requirements.txt`
Python dependencies:
- FastAPI & Uvicorn (API framework)
- Transformers & Torch (NER model)
- pypdf (PDF manipulation)
- pdf2image (PDF to image conversion)
- pytesseract (OCR)
- Pillow (Image processing)

#### `Dockerfile`
Build steps (single stage):
1. Install system dependencies (tesseract, poppler)
2. Install Python dependencies
3. Copy application code
4. Configure for port 7860 (HuggingFace default)

### Documentation

#### `README.md`
HuggingFace Space documentation:
- Features overview
- API endpoint documentation
- Usage examples (cURL, Python)
- Response format
- Local development setup

#### `DEPLOYMENT.md`
Step-by-step deployment guide:
- HuggingFace Spaces setup
- Git workflow
- Configuration options
- Security considerations
- Troubleshooting
- Cost estimation

### Testing & Examples

#### `tests/test_api.py`
Unit tests for API endpoints:
- Health check tests
- Upload validation tests
- Error handling tests

#### `client_example.py`
Example client implementation:
- Upload PDF
- Download redacted file
- Health check
- Statistics

## Data Flow

```
┌─────────────────────────────────────────────────────────┐
│ 1. Client uploads PDF                                   │
│    POST /redact with file                               │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 2. FastAPI (main.py)                                    │
│    - Validates file                                     │
│    - Generates job_id                                   │
│    - Saves to uploads/                                  │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 3. PDFRedactor (app/redaction.py)                       │
│    - perform_ocr() → Extract text + boxes               │
│    - run_ner() → Identify entities                      │
│    - map_entities_to_boxes() → Link entities to coords  │
│    - create_redacted_pdf() → Generate output            │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 4. Response                                             │
│    - Save redacted PDF to outputs/                      │
│    - Return job_id and entity list                      │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│ 5. Client downloads                                     │
│    GET /download/{job_id}                               │
└─────────────────────────────────────────────────────────┘
```
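
Step 2 of the flow (validate, generate a `job_id`, choose where to save) could be sketched roughly as follows. This is an illustration only: the helper names `is_probably_pdf` and `job_paths`, and the `{job_id}.pdf` / `{job_id}_redacted.pdf` naming scheme, are assumptions and are not taken from `main.py`.

```python
import uuid
from pathlib import Path

# A well-formed PDF begins with this header, e.g. b"%PDF-1.7".
PDF_MAGIC = b"%PDF-"


def is_probably_pdf(filename: str, head: bytes) -> bool:
    """Cheap validation: check the extension and the leading bytes of the upload."""
    return filename.lower().endswith(".pdf") and head.startswith(PDF_MAGIC)


def job_paths(job_id: str,
              upload_dir: Path = Path("uploads"),
              output_dir: Path = Path("outputs")) -> tuple[Path, Path]:
    """Hypothetical naming scheme: one input file and one output file per job."""
    return upload_dir / f"{job_id}.pdf", output_dir / f"{job_id}_redacted.pdf"


# uuid4().hex is 32 hex characters, safe to embed in a URL path or filename.
job_id = uuid.uuid4().hex
src, dst = job_paths(job_id)
```

Deriving both paths from the same `job_id` keeps `GET /download/{job_id}` and `DELETE /cleanup/{job_id}` simple, since each can rebuild the file locations from the ID alone.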

## Key Components

### 1. FastAPI Application (`main.py`)

**Capabilities:**
- RESTful API design
- File upload handling
- Background task cleanup
- CORS middleware for web access

**Features:**
- Automatic OpenAPI documentation at `/docs`
- JSON response models with Pydantic
- Error handling with HTTP exceptions
- Request validation

### 2. Redaction Engine (`app/redaction.py`)

**Pipeline Steps:**

1. **OCR Processing**
   - Convert PDF pages to images (pdf2image)
   - Extract text and bounding boxes (pytesseract)
   - Store image dimensions for coordinate scaling

2. **NER Processing**
   - Load HuggingFace model
   - Identify entities in text
   - Return entity types and character positions

3. **Mapping**
   - Create character span index for OCR words
   - Match NER entities to OCR bounding boxes
   - Handle partial word matches

4. **Redaction**
   - Scale OCR image coordinates to PDF points
   - Create black rectangle annotations
   - Write redacted PDF with pypdf
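
The mapping and scaling steps are the least obvious part of the pipeline, so here is a minimal sketch of the idea. It is not the actual `PDFRedactor` code: the function names are hypothetical, and it assumes OCR words arrive as `(left, top, width, height)` pixel boxes (the layout `pytesseract.image_to_data` reports) and that the NER text is the OCR words joined by single spaces.

```python
def build_span_index(words):
    """Return the joined text and a list of (start, end, box) character spans."""
    spans = []
    pos = 0
    for word in words:
        start, end = pos, pos + len(word["text"])
        spans.append((start, end, word["box"]))
        pos = end + 1  # account for the single space between words
    text = " ".join(word["text"] for word in words)
    return text, spans


def boxes_for_entity(spans, ent_start, ent_end):
    """Collect boxes of all words overlapping [ent_start, ent_end).

    Overlap (rather than exact equality) is what handles partial word matches.
    """
    return [box for start, end, box in spans if start < ent_end and end > ent_start]


def to_pdf_rect(box, image_size, page_size):
    """Scale a pixel box to PDF points and flip the y-axis.

    Images have their origin at the top-left; PDF pages at the bottom-left.
    Returns (x0, y0, x1, y1) in points.
    """
    left, top, width, height = box
    img_w, img_h = image_size
    page_w, page_h = page_size
    sx, sy = page_w / img_w, page_h / img_h
    return (left * sx,
            page_h - (top + height) * sy,
            (left + width) * sx,
            page_h - top * sy)


# Example: a US Letter page (612 x 792 pt) rendered at 200 DPI (1700 x 2200 px).
words = [
    {"text": "Contact", "box": (100, 200, 140, 40)},
    {"text": "Jane", "box": (250, 200, 80, 40)},
    {"text": "Doe", "box": (340, 200, 70, 40)},
]
text, spans = build_span_index(words)
start = text.index("Jane Doe")
hits = boxes_for_entity(spans, start, start + len("Jane Doe"))
rects = [to_pdf_rect(box, (1700, 2200), (612.0, 792.0)) for box in hits]
```

Here `hits` contains the boxes for "Jane" and "Doe" but not "Contact", and the first rectangle comes out at approximately `(90.0, 705.6, 118.8, 720.0)` in PDF points.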

### 3. Docker Container

**Layers:**
- Base: Python 3.10 slim
- System packages: tesseract-ocr, poppler-utils
- Python packages: From requirements.txt
- Application code: Copied last for better caching

**Optimizations:**
- Multi-stage build (not currently used; a possible future optimization)
- Minimal base image
- Cached dependency layers
- .dockerignore to reduce context size

## Environment Variables

Default configuration (can be overridden):

```bash
PYTHONUNBUFFERED=1        # Immediate log output
HF_HOME=/app/cache        # HuggingFace cache directory
```
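
If the application needs these values at runtime, one way to read them with fallbacks is sketched below. `load_settings` and the `PORT` variable are assumptions made for illustration; only `HF_HOME` and `PYTHONUNBUFFERED` appear in the configuration above, and `PYTHONUNBUFFERED` is consumed by the Python interpreter itself, so application code does not need to read it.

```python
import os


def load_settings(env=os.environ):
    """Read configuration from the environment, falling back to the defaults."""
    return {
        # Where HuggingFace libraries cache downloaded models.
        "hf_home": env.get("HF_HOME", "/app/cache"),
        # Hypothetical override for the listening port; 7860 is the documented default.
        "port": int(env.get("PORT", "7860")),
    }


defaults = load_settings({})
overridden = load_settings({"HF_HOME": "/tmp/hf", "PORT": "8000"})
```

Passing the mapping in as a parameter keeps the function easy to test without mutating the real process environment.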

## Port Configuration

- **Development**: 7860 (configurable in main.py)
- **Production (HF Spaces)**: 7860 (required)

## Directory Permissions

Ensure write permissions for:
- `uploads/` - Temporary PDF storage
- `outputs/` - Redacted PDF storage
- `cache/` - Model cache (created automatically)
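
A startup check along these lines can surface permission problems early instead of on the first upload. This is a sketch, not code from the project: `ensure_writable` is a hypothetical helper, and the demonstration runs against a temporary directory so that it does not touch the real `uploads/` or `outputs/`.

```python
import os
import tempfile
from pathlib import Path


def ensure_writable(*dirs):
    """Create each directory if missing and return the ones that are not writable."""
    not_writable = []
    for d in dirs:
        path = Path(d)
        path.mkdir(parents=True, exist_ok=True)
        if not os.access(path, os.W_OK):
            not_writable.append(str(path))
    return not_writable


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    problems = ensure_writable(Path(tmp) / "uploads", Path(tmp) / "outputs")
    created = sorted(p.name for p in Path(tmp).iterdir())
```

In the application itself, the call would more likely be `ensure_writable("uploads", "outputs", "cache")` at startup, logging or raising if the returned list is non-empty.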

## Adding New Features

### Add New Endpoint

1. Define in `main.py`:
```python
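# `app` is the FastAPI instance already created in main.py
@app.get("/new-endpoint")
async def new_endpoint():
    """Return a static greeting (placeholder response)."""
    return {"message": "Hello"}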
```

2. Add response model if needed
3. Update README.md documentation
4. Add tests in `tests/test_api.py`

### Add New Redaction Option

1. Modify `PDFRedactor` class in `app/redaction.py`
2. Add parameter to `redact_document()` method
3. Update API endpoint in `main.py`
4. Document in README.md

### Add Authentication

1. Install: `pip install python-jose passlib` (and add both packages to `requirements.txt` so the Docker build picks them up)
2. Create `app/auth.py` with JWT logic
3. Add middleware to `main.py`
4. Protect endpoints with dependencies

## Best Practices

1. **Logging**: Use `logger` for all important events
2. **Error Handling**: Catch exceptions and return meaningful errors
3. **Validation**: Use Pydantic models for request/response validation
4. **Cleanup**: Always clean up temporary files
5. **Documentation**: Keep README.md and code comments updated
6. **Testing**: Add tests for new features

## Performance Considerations

### Bottlenecks
1. OCR processing (most time-consuming)
2. Model inference (NER)
3. File I/O

### Optimizations
- Lower DPI for faster OCR (trade-off with accuracy)
- Cache loaded models in memory
- Use async file operations
- Implement request queuing for high load
- Consider GPU for NER model

### Scaling
- Horizontal: Multiple container instances
- Vertical: Larger CPU/RAM allocation
- Caching: Redis for temporary results
- Queue: Celery for background processing