---
title: PDF Redaction API
emoji: 🔒
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
---

# PDF Redaction API 🔒

Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).

## Features

- 🤖 **Powered by NER**: Uses state-of-the-art Named Entity Recognition
- 📄 **PDF Support**: Upload and process PDF documents
- 🎯 **Accurate Redaction**: Correctly positioned black rectangles over sensitive text
- 🚀 **Fast Processing**: Optimized OCR and NER pipeline
- 🔧 **Configurable**: Adjust DPI and filter entity types

## API Endpoints

### `POST /redact`

Upload a PDF file and get it redacted.

**Parameters:**

- `file`: PDF file (required)
- `dpi`: OCR quality (default: 300)
- `entity_types`: Comma-separated entity types to redact (optional)

**Example using cURL:**

```bash
curl -X POST "https://your-space.hf.space/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300"
```

**Example using Python:**

```python
import requests

url = "https://your-space.hf.space/redact"
params = {"dpi": 300}

# Open the PDF in a context manager so the handle is closed after upload
with open("document.pdf", "rb") as pdf:
    response = requests.post(url, files={"file": pdf}, params=params)
response.raise_for_status()  # Stop early on HTTP errors
result = response.json()

# Download redacted file
job_id = result["job_id"]
download_url = f"https://your-space.hf.space/download/{job_id}"
redacted_pdf = requests.get(download_url)
redacted_pdf.raise_for_status()

with open("redacted.pdf", "wb") as f:
    f.write(redacted_pdf.content)
```

### `GET /download/{job_id}`

Download the redacted PDF file.

### `GET /health`

Check API health and model status.

### `GET /stats`

Get API statistics.

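
**Example: building request fields with `entity_types`**

The `entity_types` parameter of `POST /redact` takes a comma-separated list, and the download URL is derived from the returned `job_id`. The sketch below shows one way to assemble both. The helper names `build_redact_fields` and `download_url` are hypothetical and not part of this API. This README does not state whether the server reads `dpi` and `entity_types` from form fields (as in the cURL example) or from the query string (as in the Python example), so the helper only builds the dictionary and leaves that choice to the caller.

```python
from typing import Dict, Iterable, Optional


def build_redact_fields(dpi: int = 300,
                        entity_types: Optional[Iterable[str]] = None) -> Dict[str, str]:
    """Build the non-file fields for POST /redact (hypothetical helper)."""
    if dpi <= 0:
        raise ValueError("dpi must be a positive integer")
    fields = {"dpi": str(dpi)}
    if entity_types:
        # The API expects a comma-separated string; drop blanks and duplicates.
        cleaned = []
        for name in entity_types:
            name = name.strip()
            if name and name not in cleaned:
                cleaned.append(name)
        if cleaned:
            fields["entity_types"] = ",".join(cleaned)
    return fields


def download_url(base_url: str, job_id: str) -> str:
    """Join the Space base URL and a job ID into the GET /download/{job_id} URL."""
    return f"{base_url.rstrip('/')}/download/{job_id}"
```

The resulting dictionary can be passed to `requests.post` as `data=fields` or `params=fields`, depending on how your deployment reads these parameters.
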
## Response Format

```json
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {
      "entity_type": "PER",
      "entity_text": "John Doe",
      "page": 1,
      "word_count": 2
    }
  ],
  "redacted_file_url": "/download/uuid-here"
}
```

## Entity Types

Common entity types detected:

- `PER`: Person names
- `ORG`: Organizations
- `LOC`: Locations
- `DATE`: Dates
- `EMAIL`: Email addresses
- `PHONE`: Phone numbers
- And more...

## Local Development

### Prerequisites

- Python 3.10+
- Tesseract OCR
- Poppler utils

### Installation

```bash
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils

# Install Python dependencies
pip install -r requirements.txt

# Run the server
python main.py
```

The API will be available at `http://localhost:7860`

### Using Docker

```bash
# Build the image
docker build -t pdf-redaction-api .

# Run the container
docker run -p 7860:7860 pdf-redaction-api
```

## Configuration

Adjust the DPI parameter based on your needs:

- `150`: Fast processing, lower quality
- `300`: Recommended balance (default)
- `600`: High quality, slower processing

## Limitations

- Maximum file size: Dependent on Space resources
- Processing time increases with page count and DPI
- Files are automatically cleaned up after processing

## Privacy

- Uploaded files are processed in-memory and deleted after redaction
- No data is stored permanently
- Use your own deployment for sensitive documents

## Credits

Built with:

- [FastAPI](https://fastapi.tiangolo.com/)
- [Transformers](https://huggingface.co/transformers/)
- [PyPDF](https://github.com/py-pdf/pypdf)
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)

## License

MIT License - See LICENSE file for details