Sammi1211

adding url support

af107f1 about 2 months ago

metadata

title: PDF Redaction API
emoji: 🔒
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit

PDF Redaction API 🔒

Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).

Features

🤖 Powered by NER: Detects sensitive entities with a Named Entity Recognition model
📄 PDF Support: Upload and process PDF documents
🎯 Accurate Redaction: Correctly positioned black rectangles over sensitive text
🚀 Fast Processing: Optimized OCR and NER pipeline
🔧 Configurable: Adjust DPI and filter entity types
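Positioning a redaction rectangle usually means converting OCR results, which are measured in image pixels at the chosen DPI, back into PDF page units. The sketch below shows one way to do that conversion. It is not taken from this project's code: `PixelBox`, `PdfRect`, and `pixel_box_to_pdf` are hypothetical names, and it assumes an unrotated page with its origin at the bottom-left corner and the default PDF unit of 1/72 inch.

```python
from typing import NamedTuple

POINTS_PER_INCH = 72.0  # default PDF user-space unit: 1/72 inch


class PixelBox(NamedTuple):
    """Box in image pixels, origin at the top-left, y growing downward."""
    left: float
    top: float
    right: float
    bottom: float


class PdfRect(NamedTuple):
    """Rectangle in PDF points, origin at the bottom-left, y growing upward."""
    x0: float
    y0: float
    x1: float
    y1: float


def pixel_box_to_pdf(box: PixelBox, dpi: int, page_height_pt: float) -> PdfRect:
    """Map an OCR pixel box to PDF points (hypothetical helper).

    Assumes an unrotated page whose origin is at (0, 0).
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    x0 = box.left * POINTS_PER_INCH / dpi
    x1 = box.right * POINTS_PER_INCH / dpi
    # Flip the y axis: image rows count down from the top of the page.
    y1 = page_height_pt - box.top * POINTS_PER_INCH / dpi
    y0 = page_height_pt - box.bottom * POINTS_PER_INCH / dpi
    return PdfRect(x0, y0, x1, y1)


# A word found at 300 DPI on a US Letter page (612 x 792 points).
rect = pixel_box_to_pdf(PixelBox(300, 600, 900, 660), dpi=300, page_height_pt=792.0)
print(rect)
```

Some PDF libraries expose a top-left origin instead, in which case the y flip is unnecessary and only the `72 / dpi` scaling applies.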

API Endpoints

`POST /redact`

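Upload a PDF file for redaction. The response is a JSON object describing the job (see Response Format); the redacted PDF itself is fetched from the download endpoint.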

Parameters:

file: PDF file (required)
dpi: OCR resolution in dots per inch; higher values improve quality but slow processing (default: 300)
entity_types: Comma-separated entity types to redact (optional)

Example using cURL:

curl -X POST "https://your-space.hf.space/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300"

Example using Python:

import requests

base_url = "https://your-space.hf.space"
params = {"dpi": 300}

# Open the PDF in a context manager so the file handle is closed after upload
with open("document.pdf", "rb") as pdf:
    response = requests.post(f"{base_url}/redact", files={"file": pdf}, params=params)
response.raise_for_status()  # Stop early on HTTP errors
result = response.json()

# Download the redacted file using the job ID from the response
job_id = result["job_id"]
redacted_pdf = requests.get(f"{base_url}/download/{job_id}")
redacted_pdf.raise_for_status()

with open("redacted.pdf", "wb") as f:
    f.write(redacted_pdf.content)
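Neither example above uses the optional `entity_types` parameter. As a minimal sketch, the helper below builds the request parameters and joins the requested types into the comma-separated string the parameter list describes. `build_redact_params` is a hypothetical name, not part of the API, and the type labels are taken from the Entity Types section.

```python
from typing import Dict, Iterable, Optional


def build_redact_params(dpi: int = 300,
                        entity_types: Optional[Iterable[str]] = None) -> Dict[str, str]:
    """Build the dpi / entity_types parameters for POST /redact (hypothetical helper)."""
    if dpi <= 0:
        raise ValueError("dpi must be a positive integer")
    params = {"dpi": str(dpi)}
    if entity_types:
        # The parameter list describes entity_types as comma-separated.
        params["entity_types"] = ",".join(t.strip() for t in entity_types)
    return params


# Redact only person names and email addresses, at a faster OCR setting.
params = build_redact_params(dpi=150, entity_types=["PER", "EMAIL"])
print(params)
```

The cURL example sends `dpi` as a form field while the Python example sends it as a query parameter, and this README does not say which one the server reads. Depending on the deployment, the resulting dict may need to be passed to `requests.post` as `params=` or as `data=`.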

`GET /download/{job_id}`

Download the redacted PDF file.

`GET /health`

Check API health and model status.
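Before uploading a document, it can be useful to confirm that the API is reachable. The sketch below polls `GET /health` a few times using only the standard library. `fetch_json` and `wait_until_healthy` are hypothetical helpers, and because the shape of the health response is not documented here, the parsed JSON is returned as-is without inspecting its fields.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(base_url: str,
                       attempts: int = 5,
                       delay: float = 3.0,
                       fetch: Callable[[str], dict] = fetch_json) -> dict:
    """Poll GET /health until it answers, or raise after `attempts` tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(f"{base_url.rstrip('/')}/health")
        except (OSError, ValueError) as exc:
            # OSError covers connection and HTTP errors raised by urllib;
            # ValueError covers a body that is not valid JSON.
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise RuntimeError(f"API not reachable after {attempts} attempts") from last_error
```

A call such as `wait_until_healthy("http://localhost:7860")` would then precede the upload. The `fetch` argument exists so the retry logic can be exercised without a running server.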

`GET /stats`

Get API statistics.

Response Format

{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {
      "entity_type": "PER",
      "entity_text": "John Doe",
      "page": 1,
      "word_count": 2
    }
  ],
  "redacted_file_url": "/download/uuid-here"
}
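As an illustration of working with this response, the sketch below counts the redacted entities by type and by page. It uses the sample payload above verbatim; `summarize_entities` is a hypothetical helper, and it assumes every entry in `entities` carries the `entity_type` and `page` fields shown in the sample.

```python
import json
from collections import Counter

# The sample response from this README.
sample = json.loads("""
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {"entity_type": "PER", "entity_text": "John Doe", "page": 1, "word_count": 2}
  ],
  "redacted_file_url": "/download/uuid-here"
}
""")


def summarize_entities(response: dict) -> dict:
    """Count redacted entities by type and by page (hypothetical helper)."""
    entities = response.get("entities", [])
    return {
        "by_type": dict(Counter(e["entity_type"] for e in entities)),
        "by_page": dict(Counter(e["page"] for e in entities)),
    }


summary = summarize_entities(sample)
print(summary)
```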

Entity Types

Common entity types detected:

PER: Person names
ORG: Organizations
LOC: Locations
DATE: Dates
EMAIL: Email addresses
PHONE: Phone numbers
And more...

Local Development

Prerequisites

Python 3.10+
Tesseract OCR
Poppler utils

Installation

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils

# Install Python dependencies
pip install -r requirements.txt

# Run the server
python main.py

The API will be available at http://localhost:7860

Using Docker

# Build the image
docker build -t pdf-redaction-api .

# Run the container
docker run -p 7860:7860 pdf-redaction-api

Configuration

Adjust the DPI parameter based on your needs:

150: Fast processing, lower quality
300: Recommended balance (default)
600: High quality, slower processing
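The trade-off follows from simple arithmetic: a page rendered at a given DPI has `inches * dpi` pixels along each side, so the pixel count grows with the square of the DPI. The sketch below works this out for a US Letter page (8.5 x 11 inches). The page size and the `page_pixels` helper are illustrative assumptions, not values read from the API.

```python
from typing import Tuple


def page_pixels(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


LETTER = (8.5, 11.0)  # US Letter, in inches (illustrative page size)

for dpi in (150, 300, 600):
    w, h = page_pixels(*LETTER, dpi)
    print(f"{dpi} DPI: {w} x {h} px = {w * h / 1e6:.1f} megapixels")
```

Doubling the DPI quadruples the number of pixels the OCR step has to examine per page, which is why 600 is noticeably slower than 300.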

Limitations

Maximum file size: Dependent on Space resources
Processing time increases with page count and DPI
Files are automatically cleaned up after processing

Privacy

Uploaded files are processed in-memory and deleted after redaction
No data is stored permanently
Use your own deployment for sensitive documents

Credits

Built with:

License

MIT License - See LICENSE file for details