title: PDF Redaction API
emoji: 🔒
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit

PDF Redaction API 🔒

Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).

Features

  • 🤖 Powered by NER: Uses state-of-the-art Named Entity Recognition
  • 📄 PDF Support: Upload and process PDF documents
  • 🎯 Accurate Redaction: Correctly positioned black rectangles over sensitive text
  • 🚀 Fast Processing: Optimized OCR and NER pipeline
  • 🔧 Configurable: Adjust DPI and filter entity types

API Endpoints

POST /redact

Upload a PDF file for redaction. The response contains a job ID and a download link for the redacted file.

Parameters:

  • file: PDF file to redact (required)
  • dpi: Resolution, in dots per inch, used for OCR (default: 300)
  • entity_types: Comma-separated list of entity types to redact, e.g. PER,ORG (optional)

Example using cURL:

curl -X POST "https://your-space.hf.space/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300"

Example using Python:

import requests

base_url = "https://your-space.hf.space"
params = {"dpi": 300}

# Upload the PDF; the with-block closes the file handle afterwards
with open("document.pdf", "rb") as f:
    response = requests.post(f"{base_url}/redact", files={"file": f}, params=params)
response.raise_for_status()  # stop here on an HTTP error
result = response.json()

# Download the redacted file
job_id = result["job_id"]
redacted_pdf = requests.get(f"{base_url}/download/{job_id}")
redacted_pdf.raise_for_status()

with open("redacted.pdf", "wb") as f:
    f.write(redacted_pdf.content)

GET /download/{job_id}

Download the redacted PDF file.
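
The redacted_file_url field shown under Response Format is a relative path, so a client has to resolve it against the base URL of the deployment. A minimal sketch, assuming the https://your-space.hf.space placeholder used in the examples above; download_url_for is a hypothetical helper, not part of the API:

```python
from urllib.parse import urljoin

# Placeholder base URL (assumption); replace with your own deployment.
BASE_URL = "https://your-space.hf.space"


def download_url_for(result: dict, base_url: str = BASE_URL) -> str:
    """Resolve the relative redacted_file_url from a /redact response."""
    return urljoin(base_url, result["redacted_file_url"])


example = {"job_id": "uuid-here", "redacted_file_url": "/download/uuid-here"}
print(download_url_for(example))
```

Using the URL returned by the server, rather than rebuilding it from job_id, keeps the client working if the download path ever changes.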

GET /health

Check API health and model status.

GET /stats

Get API statistics.

Response Format

{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {
      "entity_type": "PER",
      "entity_text": "John Doe",
      "page": 1,
      "word_count": 2
    }
  ],
  "redacted_file_url": "/download/uuid-here"
}
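
As an illustration of how a client might consume this response, the sketch below counts the redacted entities per type. It uses only the fields shown in the sample above; count_by_type is a hypothetical helper name:

```python
import json
from collections import Counter

# The sample response from this section, used as test data.
sample = json.loads("""
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {"entity_type": "PER", "entity_text": "John Doe", "page": 1, "word_count": 2}
  ],
  "redacted_file_url": "/download/uuid-here"
}
""")


def count_by_type(result: dict) -> Counter:
    """Count redacted entities per entity_type; a missing list counts as empty."""
    return Counter(e["entity_type"] for e in result.get("entities", []))


for entity_type, n in count_by_type(sample).items():
    print(f"{entity_type}: {n}")
```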

Entity Types

Common entity types detected:

  • PER: Person names
  • ORG: Organizations
  • LOC: Locations
  • DATE: Dates
  • EMAIL: Email addresses
  • PHONE: Phone numbers
  • And more...
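
To restrict redaction to some of these types, the entity_types parameter takes a comma-separated string. The sketch below builds the request options from a Python list. build_options is a hypothetical helper. Note that the cURL example sends options as form fields (-F) while the Python example passes them with params=; which of the two the server reads is not stated here, so the helper only builds the dictionary.

```python
from typing import Iterable, Optional


def build_options(dpi: int = 300, entity_types: Optional[Iterable[str]] = None) -> dict:
    """Build the option fields for POST /redact.

    entity_types is joined into the comma-separated form described above;
    it is left out entirely when no filter is requested.
    """
    options = {"dpi": str(dpi)}
    if entity_types:
        options["entity_types"] = ",".join(entity_types)
    return options


print(build_options(entity_types=["PER", "EMAIL"]))
```

The resulting dictionary can be passed to requests.post as data= (form fields) or params= (query string), depending on what your deployment expects.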

Local Development

Prerequisites

  • Python 3.10+
  • Tesseract OCR
  • Poppler utils

Installation

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils

# Install Python dependencies
pip install -r requirements.txt

# Run the server
python main.py

The API will be available at http://localhost:7860

Using Docker

# Build the image
docker build -t pdf-redaction-api .

# Run the container
docker run -p 7860:7860 pdf-redaction-api

Configuration

Adjust the DPI parameter based on your needs:

  • 150: Fast processing, lower quality
  • 300: Recommended balance (default)
  • 600: High quality, slower processing
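
The speed difference follows from simple arithmetic. Assuming each page is rasterized at the chosen DPI before OCR (an assumption about the pipeline, not something this README confirms), the pixel count grows with the square of the DPI. A rough sketch for a US Letter page (8.5 x 11 inches):

```python
def rendered_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given size in inches at a given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


LETTER = (8.5, 11.0)  # US Letter, in inches

for dpi in (150, 300, 600):
    w, h = rendered_size(*LETTER, dpi)
    print(f"{dpi} DPI: {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Doubling the DPI quadruples the number of pixels per page, which is why 600 is noticeably slower than 300.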

Limitations

  • Maximum file size: Dependent on Space resources
  • Processing time increases with page count and DPI
  • Files are automatically cleaned up after processing

Privacy

  • Uploaded files are processed in-memory and deleted after redaction
  • No data is stored permanently
  • Use your own deployment for sensitive documents

Credits

Built with:

License

MIT License - See LICENSE file for details