---
title: PDF Redaction API
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
---

# PDF Redaction API

Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).

## Features

- **Powered by NER**: Detects sensitive text with a Named Entity Recognition model
- **PDF Support**: Upload and process PDF documents
- **Accurate Redaction**: Draws correctly positioned black rectangles over sensitive text
- **Fast Processing**: Optimized OCR and NER pipeline
- **Configurable**: Adjust the OCR DPI and filter which entity types are redacted

## API Endpoints

### `POST /redact`

Upload a PDF file and receive a JSON summary of the redaction, including a `job_id` for downloading the redacted file.

**Parameters:**
- `file`: PDF file (required)
- `dpi`: OCR resolution in dots per inch (optional, default: 300)
- `entity_types`: Comma-separated entity types to redact (optional)

**Example using cURL:**

```bash
curl -X POST "https://your-space.hf.space/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300"
```

**Example using Python:**

```python
import requests

BASE_URL = "https://your-space.hf.space"

# Upload the PDF; the context manager closes the file handle afterwards
with open("document.pdf", "rb") as pdf:
    response = requests.post(
        f"{BASE_URL}/redact",
        files={"file": pdf},
        params={"dpi": 300},
    )
response.raise_for_status()
result = response.json()

# Download the redacted file using the job ID from the response
job_id = result["job_id"]
redacted_pdf = requests.get(f"{BASE_URL}/download/{job_id}")
redacted_pdf.raise_for_status()

with open("redacted.pdf", "wb") as f:
    f.write(redacted_pdf.content)
```

### `GET /download/{job_id}`

Download the redacted PDF for the `job_id` returned by `POST /redact`.

### `GET /health`

Check API health and model status.

### `GET /stats`

Get API statistics.

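
Because `/health` reports model status, a client may want to wait until the service is ready before uploading, for example right after a container starts. The sketch below shows one way to do that with the standard library. `http_status`, `wait_until_healthy`, the `fetch` hook, and the retry settings are hypothetical names, and the code assumes only that a healthy service answers `GET /health` with HTTP 200, since the response body is not documented here.

```python
import time
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.status


def wait_until_healthy(
    base_url: str,
    fetch: Callable[[str], int] = http_status,
    attempts: int = 5,
    delay: float = 2.0,
) -> bool:
    """Poll GET /health until it answers 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            # Assumption: HTTP 200 means the API is ready to accept uploads.
            if fetch(f"{base_url}/health") == 200:
                return True
        except OSError:
            # Connection refused, timeout, or an HTTP error status: not ready yet.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy("http://localhost:7860")` before the first `POST /redact` avoids sending a file to a server that is still starting up. The `fetch` argument exists so the polling logic can be exercised without a network connection.
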
## Response Format

```json
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {
      "entity_type": "PER",
      "entity_text": "John Doe",
      "page": 1,
      "word_count": 2
    }
  ],
  "redacted_file_url": "/download/uuid-here"
}
```

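
A client will often want a quick overview of what was redacted before downloading the file. The following is a minimal sketch that summarizes a response of the shape shown above. `summarize_entities` is a hypothetical helper, and the sample payload is the example from this section rather than real output.

```python
from collections import Counter


def summarize_entities(result: dict) -> dict:
    """Count redacted entities per type and list the pages they appear on."""
    entities = result.get("entities", [])
    by_type = Counter(entity["entity_type"] for entity in entities)
    pages = sorted({entity["page"] for entity in entities})
    return {
        "job_id": result["job_id"],
        "by_type": dict(by_type),
        "pages": pages,
    }


# Sample payload copied from the response format above
sample = {
    "job_id": "uuid-here",
    "status": "completed",
    "message": "Successfully redacted 5 entities",
    "entities": [
        {"entity_type": "PER", "entity_text": "John Doe", "page": 1, "word_count": 2}
    ],
    "redacted_file_url": "/download/uuid-here",
}

print(summarize_entities(sample))
# {'job_id': 'uuid-here', 'by_type': {'PER': 1}, 'pages': [1]}
```
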
## Entity Types

Common entity types detected:
- `PER`: Person names
- `ORG`: Organizations
- `LOC`: Locations
- `DATE`: Dates
- `EMAIL`: Email addresses
- `PHONE`: Phone numbers
- And more...

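
The `entity_types` parameter of `POST /redact` takes these labels as a comma-separated string. The sketch below builds that string from a Python list. `build_entity_types` is a hypothetical helper, and upper-casing the labels is an assumption based on the spelling used in the list above; the API's actual case handling is not documented here.

```python
def build_entity_types(types: list[str]) -> str:
    """Join entity labels into the comma-separated form used by `entity_types`."""
    cleaned: list[str] = []
    for label in types:
        # Assumption: labels are upper-case, as in the list above.
        normalized = label.strip().upper()
        # Skip empty entries and duplicates while keeping the original order.
        if normalized and normalized not in cleaned:
            cleaned.append(normalized)
    return ",".join(cleaned)


print(build_entity_types(["per", " ORG ", "PER"]))
# PER,ORG
```

The result can be sent alongside `dpi` in the same request, for example `{"dpi": 300, "entity_types": build_entity_types(["PER", "EMAIL"])}`.
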
## Local Development

### Prerequisites

- Python 3.10+
- Tesseract OCR
- Poppler utils

### Installation

```bash
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils

# Install Python dependencies
pip install -r requirements.txt

# Run the server
python main.py
```

The API will be available at `http://localhost:7860`.

### Using Docker

```bash
# Build the image
docker build -t pdf-redaction-api .

# Run the container
docker run -p 7860:7860 pdf-redaction-api
```

## Configuration

Adjust the `dpi` parameter based on your needs:
- `150`: Fast processing, lower OCR quality
- `300`: Recommended balance (default)
- `600`: High OCR quality, slower processing

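
To see why higher DPI is slower, it helps to look at image size. The arithmetic below assumes each page is rasterized at the requested DPI before OCR and uses a US Letter page (8.5 x 11 inches) as an illustrative size. `page_pixels` is a hypothetical helper, not part of the API.

```python
def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> tuple[int, int, int]:
    """Return (width_px, height_px, total_pixels) for a page rendered at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px


for dpi in (150, 300, 600):
    width_px, height_px, total = page_pixels(dpi)
    print(f"{dpi} DPI: {width_px} x {height_px} px ({total / 1e6:.1f} megapixels)")
# 150 DPI: 1275 x 1650 px (2.1 megapixels)
# 300 DPI: 2550 x 3300 px (8.4 megapixels)
# 600 DPI: 5100 x 6600 px (33.7 megapixels)
```

Doubling the DPI quadruples the number of pixels per page, so under this assumption the step from 300 to 600 costs far more than the step from 150 to 300.
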
## Limitations

- Maximum file size depends on the resources available to the Space
- Processing time increases with page count and DPI
- Files are automatically cleaned up after processing

## Privacy

- Uploaded files are processed in memory and deleted after redaction
- No data is stored permanently
- Use your own deployment for sensitive documents

## Credits

Built with:
- [FastAPI](https://fastapi.tiangolo.com/)
- [Transformers](https://huggingface.co/transformers/)
- [PyPDF](https://github.com/py-pdf/pypdf)
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)

## License

MIT License. See the `LICENSE` file for details.