	# Quick Start Guide 🚀

	## Local Development (5 minutes)

	### 1. Install System Dependencies

	Ubuntu/Debian:
	```bash
	sudo apt-get update
	sudo apt-get install -y tesseract-ocr poppler-utils
	```

	macOS:
	```bash
	brew install tesseract poppler
	```

	Windows:
	- Download Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
	- Download Poppler: https://github.com/oschwartz10612/poppler-windows/releases
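
Before moving on, it can help to confirm that both tools are actually reachable from your shell. The following is a minimal sketch using only the Python standard library. It assumes the API locates the tools through `PATH` and that `pdftoppm` (one of the Poppler command-line utilities) is a reasonable stand-in for "Poppler is installed"; the names `REQUIRED_TOOLS` and `missing_tools` are illustrative and not part of this project.

```python
import shutil

# Executable name -> what it belongs to (labels are for the message only).
REQUIRED_TOOLS = {
    "tesseract": "Tesseract OCR engine",
    "pdftoppm": "Poppler PDF utilities",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the executable names from `tools` that are not found on PATH."""
    return [name for name in tools if which(name) is None]


def report(missing, tools=REQUIRED_TOOLS):
    """Build a short human-readable status message."""
    if not missing:
        return "All system dependencies found."
    details = ", ".join(f"{name} ({tools[name]})" for name in missing)
    return f"Missing: {details}"
```

Calling `print(report(missing_tools()))` prints either the success message or the list of tools that still need to be installed.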

	### 2. Install Python Dependencies

	```bash
	pip install -r requirements.txt
	```

	### 3. Run the Server

	```bash
	python main.py
	```

	The API will be available at: `http://localhost:7860`

	### 4. Test with cURL

	```bash
	# Health check
	curl http://localhost:7860/health

	# Redact a PDF
	curl -X POST "http://localhost:7860/redact" \
	  -F "file=@your_document.pdf" \
	  -F "dpi=300"
	```

	### 5. Access API Documentation

	Open in browser: `http://localhost:7860/docs`

	## Using Docker (3 minutes)

	### 1. Build Image

	```bash
	docker build -t pdf-redaction-api .
	```

	### 2. Run Container

	```bash
	docker run -p 7860:7860 pdf-redaction-api
	```

	### 3. Test

	```bash
	curl http://localhost:7860/health
	```
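
A freshly started container may need a moment before it answers, so scripts that run right after `docker run` can poll the health endpoint instead of failing on the first attempt. This is a minimal sketch using only the standard library. It assumes `/health` returns HTTP 200 once the service is ready; the function names and retry defaults are illustrative, not part of the API.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(url, attempts=30, delay=1.0, probe=is_healthy):
    """Call `probe(url)` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

For example, `wait_until_healthy("http://localhost:7860/health")` returns `True` as soon as the container responds, or `False` after roughly 30 seconds of failed attempts.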

	## Deploy to Hugging Face Spaces (10 minutes)

	### 1. Create Space

	1. Go to https://huggingface.co/spaces
	2. Click "Create new Space"
	3. Name: `pdf-redaction-api`
	4. SDK: Docker
	5. Click "Create Space"

	### 2. Push Code

	```bash
	# Clone your space
	git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
	cd pdf-redaction-api

	# Copy all project files
	cp -r /path/to/project/* .

	# Commit and push
	git add .
	git commit -m "Initial deployment"
	git push
	```

	### 3. Wait for Build

	Monitor at: `https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api`

	### 4. Test Your Deployed API

	```bash
	curl https://YOUR_USERNAME-pdf-redaction-api.hf.space/health
	```
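
The deployed URL follows the `USERNAME-SPACE_NAME.hf.space` pattern shown above. If you script against several Spaces, a small helper keeps that pattern in one place. This is a sketch under stated assumptions: it lowercases the subdomain (harmless, since hostnames are case-insensitive) and does not handle usernames or Space names containing characters that are not valid in a hostname. The helper names are illustrative.

```python
def space_base_url(username, space_name):
    """Build the base URL of a Space from the pattern used in this guide."""
    subdomain = f"{username}-{space_name}".lower()
    return f"https://{subdomain}.hf.space"


def endpoint(base_url, path):
    """Join a base URL and a path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")
```

For instance, `endpoint(space_base_url("alice", "pdf-redaction-api"), "/health")` yields `https://alice-pdf-redaction-api.hf.space/health`, where `alice` is a placeholder username.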

	## Example Usage

	### Python Client

	```python
	import requests

	# Upload and redact; the context manager closes the file afterwards
	with open("document.pdf", "rb") as pdf:
	    response = requests.post(
	        "http://localhost:7860/redact",
	        files={"file": pdf},
	        params={"dpi": 300},
	    )
	response.raise_for_status()  # Stop early on an HTTP error

	result = response.json()
	job_id = result["job_id"]

	# Download redacted PDF
	redacted = requests.get(f"http://localhost:7860/download/{job_id}")
	redacted.raise_for_status()
	with open("redacted.pdf", "wb") as f:
	    f.write(redacted.content)

	print(f"Redacted {len(result['entities'])} entities")
	```

	### JavaScript/Node.js

	```javascript
	const FormData = require('form-data');
	const fs = require('fs');
	const axios = require('axios');

	async function redactPDF() {
	  const form = new FormData();
	  form.append('file', fs.createReadStream('document.pdf'));

	  // Upload and redact
	  const response = await axios.post(
	    'http://localhost:7860/redact',
	    form,
	    {
	      headers: form.getHeaders(),
	      params: { dpi: 300 }
	    }
	  );

	  const { job_id } = response.data;

	  // Download redacted PDF
	  const redacted = await axios.get(
	    `http://localhost:7860/download/${job_id}`,
	    { responseType: 'arraybuffer' }
	  );

	  fs.writeFileSync('redacted.pdf', redacted.data);
	  console.log('Redaction complete!');
	}

	// Report failures instead of leaving an unhandled promise rejection
	redactPDF().catch((err) => {
	  console.error('Redaction failed:', err.message);
	  process.exit(1);
	});
	```

	### cURL Advanced

	```bash
	# Redact only specific entity types
	curl -X POST "http://localhost:7860/redact" \
	  -F "file=@document.pdf" \
	  -F "dpi=300" \
	  -F "entity_types=PER,ORG"

	# Get statistics
	curl http://localhost:7860/stats

	# Download specific file
	curl -O -J http://localhost:7860/download/JOB_ID_HERE
	```

	## Common Use Cases

	### 1. Redact All Personal Information

	```python
	import requests

	with open("resume.pdf", "rb") as pdf:
	    response = requests.post(
	        "http://localhost:7860/redact",
	        files={"file": pdf},
	        params={"dpi": 300},
	    )
	```

	### 2. Redact Only Names and Organizations

	```python
	import requests

	with open("contract.pdf", "rb") as pdf:
	    response = requests.post(
	        "http://localhost:7860/redact",
	        files={"file": pdf},
	        params={
	            "dpi": 300,
	            "entity_types": "PER,ORG",
	        },
	    )
	```
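
When the entity types come from a list in your own code, the comma-separated `entity_types` value can be built programmatically. The helper below is a sketch, not part of the API. It assumes labels are expected in upper case (as in `PER,ORG` above) and that omitting `entity_types` means "redact everything", as in use case 1. The name `build_redact_params` is hypothetical.

```python
def build_redact_params(dpi=300, entity_types=None):
    """Return a query-parameter dict for the /redact endpoint."""
    if dpi <= 0:
        raise ValueError("dpi must be a positive integer")
    params = {"dpi": dpi}
    # Normalise labels, drop blanks, and remove duplicates while keeping order.
    cleaned = (label.strip().upper() for label in (entity_types or []))
    labels = list(dict.fromkeys(label for label in cleaned if label))
    if labels:
        params["entity_types"] = ",".join(labels)
    return params
```

The result can be passed directly as `params=build_redact_params(300, ["PER", "ORG"])` in the `requests.post` call.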

	### 3. Fast Processing (Lower Quality)

	```python
	import requests

	with open("large_doc.pdf", "rb") as pdf:
	    response = requests.post(
	        "http://localhost:7860/redact",
	        files={"file": pdf},
	        params={"dpi": 150},  # Faster but less accurate
	    )
	```

	### 4. High Quality (Slower)

	```python
	import requests

	with open("important.pdf", "rb") as pdf:
	    response = requests.post(
	        "http://localhost:7860/redact",
	        files={"file": pdf},
	        params={"dpi": 600},  # Best quality, slowest
	    )
	```
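
To see why DPI has such a strong effect on speed and memory, it helps to work out how large a single rendered page becomes. The arithmetic below assumes each page is rasterized to an uncompressed 8-bit RGB image (3 bytes per pixel) and uses a US Letter page as the example. The API's actual internal image format is not documented in this guide, so treat the numbers as an order-of-magnitude estimate.

```python
def raster_size(width_in, height_in, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, megabytes) for one rendered page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    megabytes = width_px * height_px * bytes_per_pixel / 1_000_000
    return width_px, height_px, megabytes


# US Letter (8.5 x 11 in) at the three DPI values used in this guide.
for dpi in (150, 300, 600):
    w, h, mb = raster_size(8.5, 11, dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {mb:.1f} MB uncompressed")
# 150 DPI: 1275 x 1650 px, about 6.3 MB uncompressed
# 300 DPI: 2550 x 3300 px, about 25.2 MB uncompressed
# 600 DPI: 5100 x 6600 px, about 101.0 MB uncompressed
```

Doubling the DPI quadruples the pixel count, so going from 300 to 600 DPI means roughly four times the data per page for OCR to process.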

	## Troubleshooting

	### "Model not loaded"
	- Problem: The NER model failed to load.
	- Solution: Check your internet connection, then wait for the model download to finish before retrying.

	### "Tesseract not found"
	- Problem: The OCR engine is not installed.
	- Solution: Install Tesseract (`tesseract-ocr` on Ubuntu/Debian, `tesseract` via Homebrew on macOS).

	### "Poppler not found"
	- Problem: The PDF converter is not installed.
	- Solution: Install Poppler (`poppler-utils` on Ubuntu/Debian, `poppler` via Homebrew on macOS).

	### Slow processing
	- Problem: Redaction takes too long.
	- Solution: Lower the DPI to 150-200.

	### Out of memory
	- Problem: A large PDF crashes the API.
	- Solution:
	  - Process one page at a time
	  - Increase container memory
	  - Lower DPI
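
"Process one page at a time" can be sketched as splitting the document into small page ranges and handling each range separately, so only a few rendered pages are in memory at once. The helper below only computes the ranges and is illustrative, not part of this project. How each range is then rendered depends on your setup; for example, Poppler's `pdftoppm` accepts `-f` and `-l` for the first and last page to convert.

```python
def page_batches(total_pages, batch_size=1):
    """Return inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]
```

With a five-page document, `page_batches(5, 2)` gives `[(1, 2), (3, 4), (5, 5)]`, and the default `batch_size=1` yields one range per page.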

	## Next Steps

	- ✅ Read the full [README.md](README.md) for API details
	- ✅ Check [DEPLOYMENT.md](DEPLOYMENT.md) for production setup
	- ✅ Review [STRUCTURE.md](STRUCTURE.md) for code organization
	- ✅ Run tests: `pytest tests/`
	- ✅ Add authentication for production use
	- ✅ Set up monitoring and logging

	## Support

	- 📖 API Docs: `http://localhost:7860/docs`
	- 🐛 Issues: Open one in your repository
	- 💬 Hugging Face: Community forums

	Happy redacting! 🔒