---
title: PDF Redaction API
emoji: 🔒
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
---
# PDF Redaction API
Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).
## Features
- 🤖 **Powered by NER**: Uses state-of-the-art Named Entity Recognition
- 📄 **PDF Support**: Upload and process PDF documents
- 🎯 **Accurate Redaction**: Correctly positioned black rectangles over sensitive text
- 🚀 **Fast Processing**: Optimized OCR and NER pipeline
- 🔧 **Configurable**: Adjust DPI and filter entity types
## API Endpoints
### `POST /redact`
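Upload a PDF file for redaction. The response is a JSON object describing the job (see Response Format); the redacted PDF itself is fetched from `GET /download/{job_id}`.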
**Parameters:**
- `file`: PDF file (required)
- `dpi`: Resolution used for OCR, in dots per inch (default: 300)
- `entity_types`: Comma-separated entity types to redact (optional)
**Example using cURL:**
```bash
curl -X POST "https://your-space.hf.space/redact" \
-F "file=@document.pdf" \
-F "dpi=300"
```
**Example using Python:**
```python
import requests

url = "https://your-space.hf.space/redact"
params = {"dpi": 300}

# Upload the PDF; the context manager closes the file handle afterwards
with open("document.pdf", "rb") as pdf:
    response = requests.post(url, files={"file": pdf}, params=params)
response.raise_for_status()
result = response.json()

# Download redacted file
job_id = result["job_id"]
download_url = f"https://your-space.hf.space/download/{job_id}"
redacted_pdf = requests.get(download_url)
redacted_pdf.raise_for_status()
with open("redacted.pdf", "wb") as f:
    f.write(redacted_pdf.content)
```
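To redact only certain categories, `entity_types` can be sent alongside `dpi`. The sketch below only builds the parameter dictionary. The helper name `build_redact_params` is hypothetical, and it assumes the server reads `entity_types` the same way it reads `dpi` in the Python example above. The cURL example sends `dpi` as a form field instead, so adjust this if your deployment expects form data.

```python
from typing import Iterable, Optional


def build_redact_params(dpi: int = 300,
                        entity_types: Optional[Iterable[str]] = None) -> dict:
    """Build request parameters for POST /redact (hypothetical helper)."""
    params = {"dpi": dpi}
    types = list(entity_types or [])
    if types:
        # The API takes one comma-separated string, e.g. "PER,ORG"
        params["entity_types"] = ",".join(types)
    return params


# Redact only person names and organizations
params = build_redact_params(entity_types=["PER", "ORG"])
print(params)  # {'dpi': 300, 'entity_types': 'PER,ORG'}
```

The resulting dictionary can be passed as `params=params` to `requests.post`.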
### `GET /download/{job_id}`
Download the redacted PDF file.
### `GET /health`
Check API health and model status.
### `GET /stats`
Get API statistics.
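The response bodies of `/health` and `/stats` are not documented here, so the following is only a sketch of how a client might wait for the service to come up before sending work. It assumes both endpoints return JSON. The helpers `fetch_json` and `wait_until_healthy` are hypothetical names, not part of the API.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_until_healthy(base_url: str,
                       fetch: Callable[[str], dict] = fetch_json,
                       attempts: int = 5,
                       delay: float = 2.0) -> dict:
    """Poll GET /health until it answers, then return the decoded body."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(f"{base_url.rstrip('/')}/health")
        except OSError as exc:  # urllib's URLError and HTTPError derive from OSError
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"API not reachable after {attempts} attempts") from last_error
```

For a local server, `wait_until_healthy("http://localhost:7860")` returns whatever `/health` reports, and `fetch_json("http://localhost:7860/stats")` reads the statistics the same way.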
## Response Format
```json
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {
      "entity_type": "PER",
      "entity_text": "John Doe",
      "page": 1,
      "word_count": 2
    }
  ],
  "redacted_file_url": "/download/uuid-here"
}
```
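A client will typically want a per-type summary of what was redacted and the full download link. The sketch below works on the sample response above. The helper names are hypothetical, and it assumes `redacted_file_url` is relative to the API base URL, as the sample suggests.

```python
from collections import Counter

# Sample response copied from the format above
response = {
    "job_id": "uuid-here",
    "status": "completed",
    "message": "Successfully redacted 5 entities",
    "entities": [
        {"entity_type": "PER", "entity_text": "John Doe", "page": 1, "word_count": 2}
    ],
    "redacted_file_url": "/download/uuid-here",
}


def summarize_entities(result: dict) -> Counter:
    """Count redacted entities per entity type."""
    return Counter(e["entity_type"] for e in result.get("entities", []))


def download_url(base_url: str, result: dict) -> str:
    """Join the API base URL with the relative download path."""
    return base_url.rstrip("/") + result["redacted_file_url"]


print(summarize_entities(response))  # Counter({'PER': 1})
print(download_url("https://your-space.hf.space", response))
# https://your-space.hf.space/download/uuid-here
```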
## Entity Types
Common entity types detected:
- `PER`: Person names
- `ORG`: Organizations
- `LOC`: Locations
- `DATE`: Dates
- `EMAIL`: Email addresses
- `PHONE`: Phone numbers
- And more...
## Local Development
### Prerequisites
- Python 3.10+
- Tesseract OCR
- Poppler utils
### Installation
```bash
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
# Install Python dependencies
pip install -r requirements.txt
# Run the server
python main.py
```
The API will be available at `http://localhost:7860`
### Using Docker
```bash
# Build the image
docker build -t pdf-redaction-api .
# Run the container
docker run -p 7860:7860 pdf-redaction-api
```
## Configuration
Adjust the DPI parameter based on your needs:
- `150`: Fast processing, lower quality
- `300`: Recommended balance (default)
- `600`: High quality, slower processing
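The trade-off follows from simple arithmetic: the pixel count of a rasterized page grows with the square of the DPI. The sketch below assumes each page is rendered to an image at the requested DPI before OCR, which the Poppler and Tesseract prerequisites suggest but this README does not spell out. `page_pixels` is a hypothetical helper and US Letter is just an example page size.

```python
def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> tuple[int, int]:
    """Pixel dimensions of a page rasterized at the given DPI (US Letter by default)."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (150, 300, 600):
    w, h = page_pixels(dpi)
    print(f"{dpi} DPI: {w} x {h} px = {w * h / 1e6:.1f} megapixels")
# 150 DPI: 1275 x 1650 px = 2.1 megapixels
# 300 DPI: 2550 x 3300 px = 8.4 megapixels
# 600 DPI: 5100 x 6600 px = 33.7 megapixels
```

Doubling the DPI quadruples the image data per page, which is why `600` is noticeably slower than `300`.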
## Limitations
- Maximum file size: Dependent on Space resources
- Processing time increases with page count and DPI
- Files are automatically cleaned up after processing
## Privacy
- Uploaded files are processed in-memory and deleted after redaction
- No data is stored permanently
- Use your own deployment for sensitive documents
## Credits
Built with:
- [FastAPI](https://fastapi.tiangolo.com/)
- [Transformers](https://huggingface.co/transformers/)
- [PyPDF](https://github.com/py-pdf/pypdf)
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
## License
MIT License - See LICENSE file for details