metadata
title: PDF Redaction API
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
PDF Redaction API
Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).
Features
- Powered by NER: Uses state-of-the-art Named Entity Recognition
- PDF Support: Upload and process PDF documents
- Accurate Redaction: Correctly positioned black rectangles over sensitive text
- Fast Processing: Optimized OCR and NER pipeline
- Configurable: Adjust DPI and filter entity types
API Endpoints
POST /redact
Upload a PDF file and get it redacted.
Parameters:
- file: PDF file (required)
- dpi: OCR quality (default: 300)
- entity_types: Comma-separated entity types to redact (optional)
Example using cURL:
curl -X POST "https://your-space.hf.space/redact" \
-F "file=@document.pdf" \
-F "dpi=300"
Example using Python:
import requests

url = "https://your-space.hf.space/redact"
params = {"dpi": 300}

# Upload the PDF; the context manager closes the file after the request
with open("document.pdf", "rb") as pdf:
    response = requests.post(url, files={"file": pdf}, params=params)
response.raise_for_status()
result = response.json()

# Download the redacted file
job_id = result["job_id"]
download_url = f"https://your-space.hf.space/download/{job_id}"
redacted_pdf = requests.get(download_url)
redacted_pdf.raise_for_status()
with open("redacted.pdf", "wb") as f:
    f.write(redacted_pdf.content)
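The optional entity_types parameter is not used in the examples above. The following is a minimal sketch of how it could be passed, assuming it travels alongside dpi in the same way as in the Python example. The helper names build_redact_params and redact_pdf are hypothetical and not part of the API; the http argument is expected to be the requests module or a requests.Session.

```python
from typing import Any, Iterable, Optional


def build_redact_params(dpi: int = 300,
                        entity_types: Optional[Iterable[str]] = None) -> dict:
    """Build the request parameters for POST /redact (hypothetical helper)."""
    params: dict[str, Any] = {"dpi": dpi}
    if entity_types:
        # The endpoint takes one comma-separated string, e.g. "PER,ORG".
        params["entity_types"] = ",".join(t.strip() for t in entity_types)
    return params


def redact_pdf(http, base_url: str, pdf_path: str, **kwargs) -> dict:
    """POST a PDF to /redact. `http` is `requests` or a `requests.Session`."""
    with open(pdf_path, "rb") as pdf:
        response = http.post(
            f"{base_url}/redact",
            files={"file": pdf},
            params=build_redact_params(**kwargs),
        )
    response.raise_for_status()
    return response.json()
```

For example, redact_pdf(requests, "https://your-space.hf.space", "document.pdf", entity_types=["PER", "EMAIL"]) would ask the service to redact only person names and email addresses.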
GET /download/{job_id}
Download the redacted PDF file.
GET /health
Check API health and model status.
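A client may want to wait for the service to come up before uploading, for example while a Space is still starting. The sketch below polls /health with the standard library only. It assumes the endpoint returns a JSON body, which this README does not specify; the function names and the retry settings are illustrative.

```python
import json
import time
import urllib.request


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET /health and decode the body (assumed here to be JSON)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(base_url: str, attempts: int = 5, delay: float = 2.0,
                       fetch=fetch_health):
    """Return the first successful health payload, or None if all attempts fail."""
    for attempt in range(attempts):
        try:
            return fetch(base_url)
        except (OSError, ValueError):
            # OSError covers connection and HTTP errors raised by urllib;
            # ValueError covers a body that is not valid JSON.
            if attempt < attempts - 1:
                time.sleep(delay)
    return None
```

Calling wait_until_healthy("http://localhost:7860") before the first upload avoids sending a large file to a server that is not ready yet.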
GET /stats
Get API statistics.
Response Format
{
"job_id": "uuid-here",
"status": "completed",
"message": "Successfully redacted 5 entities",
"entities": [
{
"entity_type": "PER",
"entity_text": "John Doe",
"page": 1,
"word_count": 2
}
],
"redacted_file_url": "/download/uuid-here"
}
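As an illustration of consuming this response, the sketch below counts redacted entities per type and lists the ones found on a given page. It relies only on the fields shown in the sample above; the helper names are hypothetical.

```python
import json
from collections import Counter

# The sample response from above, used as test input.
SAMPLE_RESPONSE = """
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {"entity_type": "PER", "entity_text": "John Doe", "page": 1, "word_count": 2}
  ],
  "redacted_file_url": "/download/uuid-here"
}
"""


def count_by_type(result: dict) -> Counter:
    """Count redacted entities per entity_type."""
    return Counter(e["entity_type"] for e in result.get("entities", []))


def entities_on_page(result: dict, page: int) -> list:
    """Return the entity texts reported for one page."""
    return [e["entity_text"] for e in result.get("entities", [])
            if e["page"] == page]


result = json.loads(SAMPLE_RESPONSE)
print(count_by_type(result))
print(entities_on_page(result, 1))
```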
Entity Types
Common entity types detected:
- PER: Person names
- ORG: Organizations
- LOC: Locations
- DATE: Dates
- EMAIL: Email addresses
- PHONE: Phone numbers
- And more...
Local Development
Prerequisites
- Python 3.10+
- Tesseract OCR
- Poppler utils
Installation
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
# Install Python dependencies
pip install -r requirements.txt
# Run the server
python main.py
The API will be available at http://localhost:7860
Using Docker
# Build the image
docker build -t pdf-redaction-api .
# Run the container
docker run -p 7860:7860 pdf-redaction-api
Configuration
Adjust the DPI parameter based on your needs:
- 150: Fast processing, lower quality
- 300: Recommended balance (default)
- 600: High quality, slower processing
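To see why DPI affects speed, it helps to work out the image size the OCR step has to handle. The sketch below assumes each page is rasterized at the requested DPI before OCR, which the Poppler dependency suggests but this README does not state. It uses the fact that PDF coordinates are measured in points, with 72 points per inch; the function names are illustrative.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size_px(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Pixel dimensions of a page rasterized at the given DPI."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))


def px_box_to_points(box_px: tuple, dpi: int) -> tuple:
    """Convert an (x0, y0, x1, y1) pixel box back to PDF points."""
    return tuple(v * POINTS_PER_INCH / dpi for v in box_px)


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (150, 300, 600):
    w, h = rendered_size_px(612, 792, dpi)
    print(f"{dpi} DPI: {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Doubling the DPI quadruples the pixel count per page, so 600 DPI means roughly four times as much image data to process as the default 300. The second helper shows the inverse scaling that a box found in the rasterized image would need before a rectangle could be drawn at the matching position in the PDF.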
Limitations
- Maximum file size: Dependent on Space resources
- Processing time increases with page count and DPI
- Files are automatically cleaned up after processing
Privacy
- Uploaded files are processed in memory and deleted after redaction
- No data is stored permanently
- Use your own deployment for sensitive documents
Credits
Built with:
License
MIT License - See LICENSE file for details