| # Quick Start Guide |
|
|
| ## Local Development (5 minutes) |
|
|
| ### 1. Install System Dependencies |
|
|
| **Ubuntu/Debian:** |
| ```bash |
| sudo apt-get update |
| sudo apt-get install -y tesseract-ocr poppler-utils |
| ``` |
|
|
| **macOS:** |
| ```bash |
| brew install tesseract poppler |
| ``` |
|
|
| **Windows:** |
| - Download Tesseract: https://github.com/UB-Mannheim/tesseract/wiki |
| - Download Poppler: https://github.com/oschwartz10612/poppler-windows/releases |
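To confirm the system dependencies are visible before starting the server, a small check like the following may help. It is a minimal sketch, not part of the project: the helper name `find_missing_tools` is hypothetical, and treating `pdftoppm` as the marker for a working Poppler install is an assumption (it is one of the command-line tools Poppler ships).

```python
import shutil

# Executables this guide installs. "pdftoppm" ships with Poppler;
# using it as the Poppler marker is an assumption of this sketch.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
print("Missing tools:", ", ".join(missing) if missing else "none")
```

On Windows, a tool reported as missing usually means its install directory has not been added to `PATH` yet.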
|
|
| ### 2. Install Python Dependencies |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| ### 3. Run the Server |
|
|
| ```bash |
| python main.py |
| ``` |
|
|
| The API will be available at: `http://localhost:7860` |
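The server may need some time after launch before it answers requests, for example while a model downloads. The sketch below polls the health endpoint until it responds. It assumes `/health` returns HTTP 200 once the API is ready, which this guide does not state explicitly, and the function name `wait_for_health` is hypothetical.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_health(url, timeout=60.0, interval=2.0):
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            pass  # server is not accepting requests yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_health("http://localhost:7860/health")` returns `True` as soon as the server answers, or `False` once the timeout passes.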
|
|
| ### 4. Test with cURL |
|
|
| ```bash |
| # Health check |
| curl http://localhost:7860/health |
| |
| # Redact a PDF |
| curl -X POST "http://localhost:7860/redact" \ |
| -F "file=@your_document.pdf" \ |
| -F "dpi=300" |
| ``` |
|
|
| ### 5. Access API Documentation |
|
|
| Open in browser: `http://localhost:7860/docs` |
|
|
| ## Using Docker (3 minutes) |
|
|
| ### 1. Build Image |
|
|
| ```bash |
| docker build -t pdf-redaction-api . |
| ``` |
|
|
| ### 2. Run Container |
|
|
| ```bash |
| docker run -p 7860:7860 pdf-redaction-api |
| ``` |
|
|
| ### 3. Test |
|
|
| ```bash |
| curl http://localhost:7860/health |
| ``` |
|
|
| ## Deploy to HuggingFace Spaces (10 minutes) |
|
|
| ### 1. Create Space |
|
|
| 1. Go to https://huggingface.co/spaces |
| 2. Click "Create new Space" |
| 3. Name: `pdf-redaction-api` |
| 4. SDK: **Docker** |
| 5. Click "Create Space" |
|
|
| ### 2. Push Code |
|
|
| ```bash |
| # Clone your space |
| git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api |
| cd pdf-redaction-api |
| |
| # Copy all project files (the * glob skips hidden files whose names start |
| # with a dot; copy any you need separately) |
| cp -r /path/to/project/* . |
| |
| # Commit and push |
| git add . |
| git commit -m "Initial deployment" |
| git push |
| ``` |
|
|
| ### 3. Wait for Build |
|
|
| Monitor at: `https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api` |
|
|
| ### 4. Test Your Deployed API |
|
|
| ```bash |
| curl https://YOUR_USERNAME-pdf-redaction-api.hf.space/health |
| ``` |
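The deployed URL follows the `USERNAME-SPACENAME.hf.space` pattern shown above. A small helper can build it for scripts; this is a sketch with hypothetical function names, and it only reproduces the pattern from this guide. Names containing characters other than letters, digits, and hyphens may map to a different hostname.

```python
def space_base_url(username, space_name):
    """Build a Space's direct URL using the pattern shown in this guide."""
    return f"https://{username}-{space_name}.hf.space"


def space_health_url(username, space_name):
    """Return the health-check URL for a deployed Space."""
    return f"{space_base_url(username, space_name)}/health"


print(space_health_url("YOUR_USERNAME", "pdf-redaction-api"))
```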
|
|
| ## Example Usage |
|
|
| ### Python Client |
|
|
| ```python |
| import requests |
| |
| # Upload and redact (the with block closes the file after the upload) |
| with open("document.pdf", "rb") as pdf: |
|     response = requests.post( |
|         "http://localhost:7860/redact", |
|         files={"file": pdf}, |
|         params={"dpi": 300}, |
|     ) |
| response.raise_for_status()  # stop here on an HTTP error |
| |
| result = response.json() |
| job_id = result["job_id"] |
| |
| # Download redacted PDF |
| redacted = requests.get(f"http://localhost:7860/download/{job_id}") |
| redacted.raise_for_status() |
| with open("redacted.pdf", "wb") as f: |
|     f.write(redacted.content) |
| |
| print(f"Redacted {len(result['entities'])} entities") |
| ``` |
|
|
| ### JavaScript/Node.js |
|
|
| ```javascript |
| const FormData = require('form-data'); |
| const fs = require('fs'); |
| const axios = require('axios'); |
| |
| async function redactPDF() { |
|   const form = new FormData(); |
|   form.append('file', fs.createReadStream('document.pdf')); |
| |
|   // Upload and redact |
|   const response = await axios.post( |
|     'http://localhost:7860/redact', |
|     form, |
|     { |
|       headers: form.getHeaders(), |
|       params: { dpi: 300 } |
|     } |
|   ); |
| |
|   const { job_id } = response.data; |
| |
|   // Download redacted PDF |
|   const redacted = await axios.get( |
|     `http://localhost:7860/download/${job_id}`, |
|     { responseType: 'arraybuffer' } |
|   ); |
| |
|   fs.writeFileSync('redacted.pdf', redacted.data); |
|   console.log('Redaction complete!'); |
| } |
| |
| // Report upload or download failures instead of leaving the rejection unhandled |
| redactPDF().catch(console.error); |
| ``` |
|
|
| ### cURL Advanced |
|
|
| ```bash |
| # Redact only specific entity types |
| curl -X POST "http://localhost:7860/redact" \ |
| -F "file=@document.pdf" \ |
| -F "dpi=300" \ |
| -F "entity_types=PER,ORG" |
| |
| # Get statistics |
| curl http://localhost:7860/stats |
| |
| # Download specific file |
| curl -O -J http://localhost:7860/download/JOB_ID_HERE |
| ``` |
|
|
| ## Common Use Cases |
|
|
| ### 1. Redact All Personal Information |
|
|
| ```python |
| import requests |
| |
| with open("resume.pdf", "rb") as pdf: |
|     response = requests.post( |
|         "http://localhost:7860/redact", |
|         files={"file": pdf}, |
|         params={"dpi": 300}, |
|     ) |
| ``` |
|
|
| ### 2. Redact Only Names and Organizations |
|
|
| ```python |
| import requests |
| |
| with open("contract.pdf", "rb") as pdf: |
|     response = requests.post( |
|         "http://localhost:7860/redact", |
|         files={"file": pdf}, |
|         params={ |
|             "dpi": 300, |
|             "entity_types": "PER,ORG",  # comma-separated entity labels |
|         }, |
|     ) |
| ``` |
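When the entity types come from a list rather than a hand-written string, a helper can assemble the request parameters. This is a minimal sketch: `build_redact_params` is a hypothetical name, and upper-casing the labels is an assumption based on the `PER,ORG` form used in this guide. Only `dpi` and `entity_types` appear in the source, so no other parameters are produced.

```python
def build_redact_params(dpi=300, entity_types=None):
    """Return the parameter dict for a /redact request."""
    if not isinstance(dpi, int) or dpi <= 0:
        raise ValueError("dpi must be a positive integer")
    params = {"dpi": dpi}
    if entity_types:
        # Normalize labels, drop duplicates while keeping order,
        # then join them into the comma-separated form, e.g. "PER,ORG".
        unique = list(dict.fromkeys(t.strip().upper() for t in entity_types))
        params["entity_types"] = ",".join(unique)
    return params
```

The result can be passed directly as `params=build_redact_params(300, ["PER", "ORG"])` in the `requests.post` call.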
|
|
| ### 3. Fast Processing (Lower Quality) |
|
|
| ```python |
| import requests |
| |
| with open("large_doc.pdf", "rb") as pdf: |
|     response = requests.post( |
|         "http://localhost:7860/redact", |
|         files={"file": pdf}, |
|         params={"dpi": 150},  # Faster but less accurate |
|     ) |
| ``` |
|
|
| ### 4. High Quality (Slower) |
|
|
| ```python |
| import requests |
| |
| with open("important.pdf", "rb") as pdf: |
|     response = requests.post( |
|         "http://localhost:7860/redact", |
|         files={"file": pdf}, |
|         params={"dpi": 600},  # Best quality, slowest |
|     ) |
| ``` |
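The speed and memory cost of a DPI setting can be estimated from the size of the rendered page image. The sketch below assumes a US Letter page (8.5 x 11 inches) rendered as uncompressed 8-bit RGB. The server's actual image format is not documented here, so treat the figures as rough guidance only; `raster_size` is a hypothetical helper.

```python
def raster_size(dpi, width_in=8.5, height_in=11.0, bytes_per_pixel=3):
    """Return (width_px, height_px, bytes) for one page rendered at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


for dpi in (150, 300, 600):
    w, h, size = raster_size(dpi)
    print(f"{dpi} dpi: {w} x {h} px, about {size / 2**20:.1f} MiB per page")
# 150 dpi: 1275 x 1650 px, about 6.0 MiB per page
# 300 dpi: 2550 x 3300 px, about 24.1 MiB per page
# 600 dpi: 5100 x 6600 px, about 96.3 MiB per page
```

Doubling the DPI quadruples the pixel count, which is why 600 DPI is much slower and heavier than 300 DPI.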
|
|
| ## Troubleshooting |
|
|
| ### "Model not loaded" |
| **Problem**: The NER model failed to load. |
| **Solution**: Check your internet connection and wait for the model download to finish, then retry. |
|
|
| ### "Tesseract not found" |
| **Problem**: The OCR engine is not installed. |
| **Solution**: Install Tesseract (`tesseract-ocr` on Ubuntu/Debian, `tesseract` via Homebrew on macOS). |
|
|
| ### "Poppler not found" |
| **Problem**: The PDF converter is not installed. |
| **Solution**: Install Poppler (`poppler-utils` on Ubuntu/Debian, `poppler` via Homebrew on macOS). |
|
|
| ### Slow processing |
| **Problem**: Redaction takes too long. |
| **Solution**: Lower `dpi` to 150-200. Processing is faster, but accuracy may drop. |
|
|
| ### Out of memory |
| **Problem**: Large PDF crashes the API |
| **Solution**: |
| - Process one page at a time |
| - Increase container memory |
| - Lower DPI |
|
|
| ## Next Steps |
|
|
| - ✅ Read full [README.md](README.md) for API details |
| - ✅ Check [DEPLOYMENT.md](DEPLOYMENT.md) for production setup |
| - ✅ Review [STRUCTURE.md](STRUCTURE.md) for code organization |
| - ✅ Run tests: `pytest tests/` |
| - ✅ Add authentication for production use |
| - ✅ Set up monitoring and logging |
|
|
| ## Support |
|
|
| - API Docs: `http://localhost:7860/docs` |
| - Issues: Create on your repository |
| - HuggingFace: Community forums |
|
|
| Happy redacting! |
|
|