Sammi1211

adding url support

af107f1 about 2 months ago

	---
	title: PDF Redaction API
	emoji: 🔒
	colorFrom: blue
	colorTo: green
	sdk: docker
	pinned: false
	license: mit
	---

	# PDF Redaction API 🔒

	Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).

	## Features

	- 🤖 Powered by NER: Uses state-of-the-art Named Entity Recognition
	- 📄 PDF Support: Upload and process PDF documents
	- 🎯 Accurate Redaction: Correctly positioned black rectangles over sensitive text
	- 🚀 Fast Processing: Optimized OCR and NER pipeline
	- 🔧 Configurable: Adjust DPI and filter entity types

	## API Endpoints

	### `POST /redact`

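	Upload a PDF file for redaction. The response is JSON containing a `job_id` and the detected entities; the redacted file itself is fetched separately from `/download/{job_id}`.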

	Parameters:
	- `file`: PDF file (required)
	- `dpi`: Rendering resolution used for OCR, in dots per inch (default: 300)
	- `entity_types`: Comma-separated entity types to redact (optional)

	Example using cURL:

	```bash
	curl -X POST "https://your-space.hf.space/redact" \
	-F "file=@document.pdf" \
	-F "dpi=300"
	```

	Example using Python:

	```python
	import requests

	url = "https://your-space.hf.space/redact"
	params = {"dpi": 300}

	# Upload the PDF; the context manager closes the file handle afterwards
	with open("document.pdf", "rb") as pdf:
	    response = requests.post(url, files={"file": pdf}, params=params)
	response.raise_for_status()  # stop here if the upload or redaction failed
	result = response.json()

	# Download the redacted file
	job_id = result["job_id"]
	download_url = f"https://your-space.hf.space/download/{job_id}"
	redacted_pdf = requests.get(download_url)
	redacted_pdf.raise_for_status()

	with open("redacted.pdf", "wb") as f:
	    f.write(redacted_pdf.content)
	```

	### `GET /download/{job_id}`

	Download the redacted PDF file.
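
For large documents it can help to stream the download to disk rather than hold the whole PDF in memory. The following is a minimal sketch using only the Python standard library. The helper names `download_url` and `download_redacted` are hypothetical, and the host is the placeholder used in the examples above.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import quote

BASE_URL = "https://your-space.hf.space"  # placeholder host, replace with your Space


def download_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build the /download/{job_id} URL, percent-encoding the job ID."""
    return f"{base_url.rstrip('/')}/download/{quote(job_id, safe='')}"


def download_redacted(job_id: str, dest: str, base_url: str = BASE_URL) -> Path:
    """Stream the redacted PDF to `dest` in chunks.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    so a missing or expired job surfaces as an exception.
    """
    out = Path(dest)
    with urllib.request.urlopen(download_url(job_id, base_url), timeout=60) as resp:
        with out.open("wb") as f:
            shutil.copyfileobj(resp, f)
    return out
```

A call such as `download_redacted(result["job_id"], "redacted.pdf")` would then replace the last few lines of the Python example above.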

	### `GET /health`

	Check API health and model status.
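
A client may want to wait for the service to respond before uploading. The sketch below polls `/health` and treats HTTP 200 as healthy. That is an assumption, since the response body of this endpoint is not documented here, and the function names and retry parameters are illustrative only.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str) -> int:
    """Return the HTTP status code for a GET request, or 0 if it cannot connect."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(
    base_url: str,
    attempts: int = 5,
    delay: float = 2.0,
    fetch: Callable[[str], int] = fetch_status,
) -> bool:
    """Poll GET /health up to `attempts` times; True once it returns 200."""
    url = f"{base_url.rstrip('/')}/health"
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before the next try
    return False
```

The `fetch` parameter exists so the polling logic can be exercised without a network connection.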

	### `GET /stats`

	Get API statistics.

	## Response Format

	```json
	{
	  "job_id": "uuid-here",
	  "status": "completed",
	  "message": "Successfully redacted 5 entities",
	  "entities": [
	    {
	      "entity_type": "PER",
	      "entity_text": "John Doe",
	      "page": 1,
	      "word_count": 2
	    }
	  ],
	  "redacted_file_url": "/download/uuid-here"
	}
	```
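
To work with the response in client code, the `entities` array can be mapped onto a small typed structure. This is a sketch that assumes only the fields shown in the sample above; the `RedactedEntity` class and helper functions are hypothetical names, not part of the API.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RedactedEntity:
    entity_type: str
    entity_text: str
    page: int
    word_count: int


def parse_entities(payload: dict) -> list[RedactedEntity]:
    """Convert the `entities` array of a /redact response into dataclasses."""
    return [
        RedactedEntity(e["entity_type"], e["entity_text"], e["page"], e["word_count"])
        for e in payload.get("entities", [])
    ]


def count_by_type(entities: list[RedactedEntity]) -> dict[str, int]:
    """Count how many entities of each type were redacted."""
    return dict(Counter(e.entity_type for e in entities))


# The sample response from this section
sample = json.loads("""
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {"entity_type": "PER", "entity_text": "John Doe", "page": 1, "word_count": 2}
  ],
  "redacted_file_url": "/download/uuid-here"
}
""")

entities = parse_entities(sample)
print(count_by_type(entities))  # {'PER': 1}
```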

	## Entity Types

	Common entity types detected:
	- `PER`: Person names
	- `ORG`: Organizations
	- `LOC`: Locations
	- `DATE`: Dates
	- `EMAIL`: Email addresses
	- `PHONE`: Phone numbers
	- And more...
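
Since `entity_types` is a comma-separated string, a small helper can build it from a list. The sketch below upper-cases and de-duplicates labels on the assumption that the server expects them as written in the list above. The helper name is hypothetical, and whether the server wants the value as a query parameter or a form field is not specified here.

```python
from typing import Iterable


def build_entity_types(types: Iterable[str]) -> str:
    """Join entity labels into a comma-separated string.

    Labels are stripped, upper-cased, and de-duplicated while
    preserving their first-seen order; empty labels are dropped.
    """
    seen: dict[str, None] = {}
    for raw in types:
        label = raw.strip().upper()
        if label:
            seen.setdefault(label, None)
    return ",".join(seen)


# Redact only person names and email addresses
params = {"dpi": 300, "entity_types": build_entity_types(["per", "EMAIL", "per"])}
print(params["entity_types"])  # PER,EMAIL
```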

	## Local Development

	### Prerequisites

	- Python 3.10+
	- Tesseract OCR
	- Poppler utils

	### Installation

	```bash
	# Install system dependencies (Ubuntu/Debian)
	sudo apt-get install tesseract-ocr poppler-utils

	# Install Python dependencies
	pip install -r requirements.txt

	# Run the server
	python main.py
	```

	The API will be available at `http://localhost:7860`.

	### Using Docker

	```bash
	# Build the image
	docker build -t pdf-redaction-api .

	# Run the container
	docker run -p 7860:7860 pdf-redaction-api
	```

	## Configuration

	Adjust the DPI parameter based on your needs:
	- `150`: Fast processing, lower quality
	- `300`: Recommended balance (default)
	- `600`: High quality, slower processing
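
The cost of a higher DPI follows from simple arithmetic: PDF page sizes are measured in points (72 per inch), so doubling the DPI doubles each pixel dimension and quadruples the number of pixels the OCR step has to read. The sketch below illustrates this for a US Letter page (612 × 792 points) and shows how a pixel-space bounding box maps back to points. The function names are illustrative, and this is not necessarily how this API computes rectangle positions.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def page_pixels(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


def pixel_box_to_points(box: tuple[float, float, float, float], dpi: int) -> tuple[float, ...]:
    """Scale an (x0, y0, x1, y1) box from image pixels to PDF points."""
    return tuple(v * POINTS_PER_INCH / dpi for v in box)


for dpi in (150, 300, 600):
    w, h = page_pixels(612, 792, dpi)
    print(dpi, w, h, w * h)
# 150 1275 1650 2103750
# 300 2550 3300 8415000
# 600 5100 6600 33660000
```

Note that images usually place the origin at the top-left while PDF user space defaults to the bottom-left, so a y-axis flip may also be needed depending on the PDF library.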

	## Limitations

	- Maximum file size: depends on the resources available to the Space
	- Processing time increases with page count and DPI
	- Files are automatically cleaned up after processing

	## Privacy

	- Uploaded files are processed in-memory and deleted after redaction
	- No data is stored permanently
	- Use your own deployment for sensitive documents

	## Credits

	Built with:
	- [FastAPI](https://fastapi.tiangolo.com/)
	- [Transformers](https://huggingface.co/transformers/)
	- [PyPDF](https://github.com/py-pdf/pypdf)
	- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)

	## License

	MIT License. See the LICENSE file for details.