---
title: PDF Redaction API
emoji: 🔒
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
---
# PDF Redaction API 🔒
Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).
## Features
- 🤖 **Powered by NER**: Uses state-of-the-art Named Entity Recognition
- 📄 **PDF Support**: Upload and process PDF documents
- 🎯 **Accurate Redaction**: Correctly positioned black rectangles over sensitive text
- 🚀 **Fast Processing**: Optimized OCR and NER pipeline
- 🔧 **Configurable**: Adjust DPI and filter entity types
## API Endpoints
### `POST /redact`
Upload a PDF file and get it redacted.
**Parameters:**
- `file`: PDF file (required)
- `dpi`: OCR quality (default: 300)
- `entity_types`: Comma-separated entity types to redact (optional)
**Example using cURL:**
```bash
curl -X POST "https://your-space.hf.space/redact" \
-F "file=@document.pdf" \
-F "dpi=300"
```
**Example using Python:**
```python
import requests

base_url = "https://your-space.hf.space"

# Upload the PDF; the context manager closes the file after the request
with open("document.pdf", "rb") as pdf:
    response = requests.post(
        f"{base_url}/redact",
        files={"file": pdf},
        params={"dpi": 300},
    )
response.raise_for_status()
result = response.json()

# Download the redacted file
job_id = result["job_id"]
redacted_pdf = requests.get(f"{base_url}/download/{job_id}")
redacted_pdf.raise_for_status()
with open("redacted.pdf", "wb") as f:
    f.write(redacted_pdf.content)
```
### `GET /download/{job_id}`
Download the redacted PDF file.
### `GET /health`
Check API health and model status.
### `GET /stats`
Get API statistics.
## Response Format
```json
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {
      "entity_type": "PER",
      "entity_text": "John Doe",
      "page": 1,
      "word_count": 2
    }
  ],
  "redacted_file_url": "/download/uuid-here"
}
```
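A response in this shape can be summarized with the standard library alone. The sketch below is illustrative only: `summarize` is a hypothetical helper, not part of the API, and the sample payload is a shortened copy of the response shown above.

```python
import json
from collections import Counter

# Sample payload based on the response format shown above.
sample = """
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {"entity_type": "PER", "entity_text": "John Doe", "page": 1, "word_count": 2}
  ],
  "redacted_file_url": "/download/uuid-here"
}
"""


def summarize(result: dict) -> dict:
    """Count listed entities per type and per page (hypothetical helper)."""
    entities = result.get("entities", [])
    return {
        "by_type": dict(Counter(e["entity_type"] for e in entities)),
        "by_page": dict(Counter(e["page"] for e in entities)),
    }


result = json.loads(sample)
summary = summarize(result)
print(summary)
```

Counting by page is a quick way to check that redactions landed where you expected before downloading the file.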
## Entity Types
Common entity types detected:
- `PER`: Person names
- `ORG`: Organizations
- `LOC`: Locations
- `DATE`: Dates
- `EMAIL`: Email addresses
- `PHONE`: Phone numbers
- And more...
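Because `entity_types` is a comma-separated string, it can help to build it from a list. This is a minimal sketch: `build_entity_types` is a hypothetical helper, and whether the API itself tolerates extra whitespace or duplicate labels is not documented here, so the helper removes both as a precaution.

```python
def build_entity_types(types):
    """Join entity labels into a comma-separated string.

    Strips surrounding whitespace and drops blanks and duplicates,
    keeping the first occurrence of each label.
    """
    seen = []
    for t in types:
        label = t.strip()
        if label and label not in seen:
            seen.append(label)
    return ",".join(seen)


entity_types = build_entity_types(["PER", " EMAIL", "PER", "PHONE"])
print(entity_types)
```

The resulting string can then be sent alongside `dpi` in the `/redact` request.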
## Local Development
### Prerequisites
- Python 3.10+
- Tesseract OCR
- Poppler utils
### Installation
```bash
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
# Install Python dependencies
pip install -r requirements.txt
# Run the server
python main.py
```
The API will be available at `http://localhost:7860`.
### Using Docker
```bash
# Build the image
docker build -t pdf-redaction-api .
# Run the container
docker run -p 7860:7860 pdf-redaction-api
```
## Configuration
Adjust the DPI parameter based on your needs:
- `150`: Fast processing, lower quality
- `300`: Recommended balance (default)
- `600`: High quality, slower processing
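To see why higher DPI costs more time, consider the size of the image that OCR has to process. The sketch below assumes each page is rasterized at the requested DPI before OCR and uses a US Letter page (8.5 x 11 inches) as an example; `raster_size` is a hypothetical helper, and actual page sizes depend on the PDF.

```python
def raster_size(width_in, height_in, dpi):
    """Return (width_px, height_px) of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter page: 8.5 x 11 inches (an assumption for illustration).
for dpi in (150, 300, 600):
    w, h = raster_size(8.5, 11, dpi)
    print(f"{dpi} DPI: {w} x {h} px, {w * h / 1e6:.1f} megapixels")
```

Doubling the DPI quadruples the pixel count per page, so moving from 300 to 600 means roughly four times as much image data for the OCR step.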
## Limitations
- Maximum file size: Dependent on Space resources
- Processing time increases with page count and DPI
- Files are automatically cleaned up after processing
## Privacy
- Uploaded files are processed in-memory and deleted after redaction
- No data is stored permanently
- Use your own deployment for sensitive documents
## Credits
Built with:
- [FastAPI](https://fastapi.tiangolo.com/)
- [Transformers](https://huggingface.co/transformers/)
- [PyPDF](https://github.com/py-pdf/pypdf)
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
## License
MIT License - See LICENSE file for details