---
title: PDF Redaction API
emoji: 🔒
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
---

# PDF Redaction API 🔒

Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).

## Features

- 🤖 **Powered by NER**: Uses state-of-the-art Named Entity Recognition
- 📄 **PDF Support**: Upload and process PDF documents
- 🎯 **Accurate Redaction**: Correctly positioned black rectangles over sensitive text
- 🚀 **Fast Processing**: Optimized OCR and NER pipeline
- 🔧 **Configurable**: Adjust DPI and filter entity types

## API Endpoints

### `POST /redact`

Upload a PDF file for redaction. The response lists the detected entities and the URL of the redacted file.

**Parameters:**
- `file`: PDF file (required)
- `dpi`: Resolution used for OCR, in dots per inch (default: 300)
- `entity_types`: Comma-separated entity types to redact (optional)

**Example using cURL:**

```bash
curl -X POST "https://your-space.hf.space/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300"
```

**Example using Python:**

```python
import requests

base_url = "https://your-space.hf.space"
params = {"dpi": 300}

# Upload the PDF; the context manager closes the file handle afterwards
with open("document.pdf", "rb") as pdf:
    response = requests.post(f"{base_url}/redact", files={"file": pdf}, params=params)
response.raise_for_status()  # Stop here on an HTTP error instead of parsing an error body
result = response.json()

# Download the redacted file using the returned job ID
job_id = result["job_id"]
redacted_pdf = requests.get(f"{base_url}/download/{job_id}")
redacted_pdf.raise_for_status()

with open("redacted.pdf", "wb") as f:
    f.write(redacted_pdf.content)
```

### `GET /download/{job_id}`

Download the redacted PDF file.

### `GET /health`

Check API health and model status.
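
A hosted deployment may need some time before it responds, for example while the model loads. The sketch below polls `/health` until it answers with HTTP 200. The helper names `is_healthy` and `wait_until_healthy` are hypothetical, and the check looks only at the status code because the response body of `/health` is not documented here.

```python
import time
import urllib.request


def is_healthy(base_url, timeout=10):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


def wait_until_healthy(check, attempts=5, delay=2.0):
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Placeholder URL; replace with your own deployment.
    ready = wait_until_healthy(lambda: is_healthy("https://your-space.hf.space"))
    print("API ready" if ready else "API not reachable")
```

Passing the check as a callable keeps the retry loop independent of the HTTP client, so it can be reused with `requests` or tested without network access.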

### `GET /stats`

Get API statistics.

## Response Format

```json
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {
      "entity_type": "PER",
      "entity_text": "John Doe",
      "page": 1,
      "word_count": 2
    }
  ],
  "redacted_file_url": "/download/uuid-here"
}
```
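
As a minimal sketch of how a client might consume this response, the snippet below counts the redacted entities per type and per page. The function name `summarize_entities` is hypothetical; the field names are taken from the sample response above.

```python
import json
from collections import Counter

# Sample payload, copied from the response format shown above.
raw = """
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {
      "entity_type": "PER",
      "entity_text": "John Doe",
      "page": 1,
      "word_count": 2
    }
  ],
  "redacted_file_url": "/download/uuid-here"
}
"""


def summarize_entities(result):
    """Return (count per entity type, count per page) for a /redact response."""
    entities = result.get("entities", [])
    by_type = Counter(e["entity_type"] for e in entities)
    by_page = Counter(e["page"] for e in entities)
    return by_type, by_page


result = json.loads(raw)
by_type, by_page = summarize_entities(result)
print(dict(by_type))  # {'PER': 1}
print(dict(by_page))  # {1: 1}
```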

## Entity Types

Common entity types detected:
- `PER`: Person names
- `ORG`: Organizations
- `LOC`: Locations
- `DATE`: Dates
- `EMAIL`: Email addresses
- `PHONE`: Phone numbers
- And more...
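
To redact only some of these types, pass them through the `entity_types` parameter of `POST /redact` as a comma-separated string. The helper below is a hypothetical sketch that builds such a parameter set; it assumes `entity_types` is sent the same way as `dpi` in the Python example above.

```python
def build_redact_params(dpi=300, entity_types=None):
    """Build request parameters for POST /redact (hypothetical helper)."""
    params = {"dpi": dpi}
    if entity_types:
        # The API takes a single comma-separated string, e.g. "PER,EMAIL".
        params["entity_types"] = ",".join(entity_types)
    return params


params = build_redact_params(entity_types=["PER", "EMAIL", "PHONE"])
print(params)  # {'dpi': 300, 'entity_types': 'PER,EMAIL,PHONE'}
```

Omitting `entity_types` leaves the parameter out of the request entirely, which matches its optional status.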

## Local Development

### Prerequisites

- Python 3.10+
- Tesseract OCR
- Poppler utils

### Installation

```bash
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils

# Install Python dependencies
pip install -r requirements.txt

# Run the server
python main.py
```

The API will be available at `http://localhost:7860`.

### Using Docker

```bash
# Build the image
docker build -t pdf-redaction-api .

# Run the container
docker run -p 7860:7860 pdf-redaction-api
```

## Configuration

Adjust the DPI parameter based on your needs:
- `150`: Fast processing, lower quality
- `300`: Recommended balance (default)
- `600`: High quality, slower processing
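
To see why higher DPI costs more time, consider the size of the image that OCR has to process. The sketch below assumes each page is rasterized at the requested DPI before OCR (an assumption about the pipeline, not something confirmed here) and uses a US Letter page as the example. The function name `rendered_size` is hypothetical.

```python
def rendered_size(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size (in inches) rasterized at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter is 8.5 x 11 inches.
for dpi in (150, 300, 600):
    w, h = rendered_size(8.5, 11, dpi)
    print(f"{dpi} DPI: {w} x {h} px, {w * h / 1e6:.1f} megapixels")

# 150 DPI: 1275 x 1650 px, 2.1 megapixels
# 300 DPI: 2550 x 3300 px, 8.4 megapixels
# 600 DPI: 5100 x 6600 px, 33.7 megapixels
```

Doubling the DPI quadruples the pixel count per page, so the step from `300` to `600` is considerably more expensive than the numbers alone suggest.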

## Limitations

- Maximum file size depends on the resources available to the Space
- Processing time increases with page count and DPI
- Files are automatically cleaned up after processing

## Privacy

- Uploaded files are processed in memory and deleted after redaction
- No data is stored permanently
- Use your own deployment for sensitive documents

## Credits

Built with:
- [FastAPI](https://fastapi.tiangolo.com/)
- [Transformers](https://huggingface.co/transformers/)
- [PyPDF](https://github.com/py-pdf/pypdf)
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)

## License

MIT License - See LICENSE file for details