# Quick Start Guide 🚀

## Local Development (5 minutes)

### 1. Install System Dependencies

**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils
```

**macOS:**
```bash
brew install tesseract poppler
```

**Windows:**
- Download Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
- Download Poppler: https://github.com/oschwartz10612/poppler-windows/releases

### 2. Install Python Dependencies

```bash
pip install -r requirements.txt
```

### 3. Run the Server

```bash
python main.py
```

The API will be available at: `http://localhost:7860`

### 4. Test with cURL

```bash
# Health check
curl http://localhost:7860/health

# Redact a PDF
curl -X POST "http://localhost:7860/redact" \
  -F "file=@your_document.pdf" \
  -F "dpi=300"
```

### 5. Access API Documentation

Open in browser: `http://localhost:7860/docs`

## Using Docker (3 minutes)

### 1. Build Image

```bash
docker build -t pdf-redaction-api .
```

### 2. Run Container

```bash
docker run -p 7860:7860 pdf-redaction-api
```

### 3. Test

```bash
curl http://localhost:7860/health
```

## Deploy to HuggingFace Spaces (10 minutes)

### 1. Create Space

1. Go to https://huggingface.co/spaces
2. Click "Create new Space"
3. Name: `pdf-redaction-api`
4. SDK: **Docker**
5. Click "Create Space"

### 2. Push Code

```bash
# Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
cd pdf-redaction-api

# Copy all project files (note: the * glob skips dotfiles such as .dockerignore)
cp -r /path/to/project/* .

# Commit and push
git add .
git commit -m "Initial deployment"
git push
```

### 3. Wait for Build

Monitor at: `https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api`

### 4. Test Your Deployed API

```bash
curl https://YOUR_USERNAME-pdf-redaction-api.hf.space/health
```

## Example Usage

### Python Client

```python
import requests

# Upload and redact
with open("document.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": pdf},
        params={"dpi": 300},
    )
response.raise_for_status()  # Stop early on an HTTP error

result = response.json()
job_id = result["job_id"]

# Download redacted PDF
redacted = requests.get(f"http://localhost:7860/download/{job_id}")
redacted.raise_for_status()
with open("redacted.pdf", "wb") as f:
    f.write(redacted.content)

print(f"Redacted {len(result['entities'])} entities")
```

### JavaScript/Node.js

```javascript
const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');

async function redactPDF() {
  const form = new FormData();
  form.append('file', fs.createReadStream('document.pdf'));

  // Upload and redact
  const response = await axios.post(
    'http://localhost:7860/redact',
    form,
    {
      headers: form.getHeaders(),
      params: { dpi: 300 }
    }
  );

  const { job_id } = response.data;

  // Download redacted PDF
  const redacted = await axios.get(
    `http://localhost:7860/download/${job_id}`,
    { responseType: 'arraybuffer' }
  );

  fs.writeFileSync('redacted.pdf', redacted.data);
  console.log('Redaction complete!');
}

// Log failures instead of leaving an unhandled promise rejection
redactPDF().catch(console.error);
```

### cURL Advanced

```bash
# Redact only specific entity types
curl -X POST "http://localhost:7860/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300" \
  -F "entity_types=PER,ORG"

# Get statistics
curl http://localhost:7860/stats

# Download specific file
curl -O -J http://localhost:7860/download/JOB_ID_HERE
```

## Common Use Cases

### 1. Redact All Personal Information

```python
import requests

with open("resume.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": pdf},
        params={"dpi": 300},
    )
```

### 2. Redact Only Names and Organizations

```python
import requests

with open("contract.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": pdf},
        params={
            "dpi": 300,
            "entity_types": "PER,ORG",
        },
    )
```

### 3. Fast Processing (Lower Quality)

```python
import requests

with open("large_doc.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": pdf},
        params={"dpi": 150},  # Faster but less accurate
    )
```

### 4. High Quality (Slower)

```python
import requests

with open("important.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": pdf},
        params={"dpi": 600},  # Best quality, slowest
    )
```

## Troubleshooting

### "Model not loaded"

**Problem**: The NER model failed to load.

**Solution**: Check your internet connection and wait for the model download to finish.

### "Tesseract not found"

**Problem**: The OCR engine is not installed.

**Solution**: Install the `tesseract-ocr` system package (see step 1 of Local Development).

### "Poppler not found"

**Problem**: The PDF converter is not installed.

**Solution**: Install the `poppler-utils` system package (see step 1 of Local Development).

### Slow processing

**Problem**: Redaction takes too long.

**Solution**: Lower the DPI to 150-200.

### Out of memory

**Problem**: A large PDF crashes the API.

**Solution**:
- Process one page at a time
- Increase container memory
- Lower DPI

## Next Steps

- ✅ Read full [README.md](README.md) for API details
- ✅ Check [DEPLOYMENT.md](DEPLOYMENT.md) for production setup
- ✅ Review [STRUCTURE.md](STRUCTURE.md) for code organization
- ✅ Run tests: `pytest tests/`
- ✅ Add authentication for production use
- ✅ Set up monitoring and logging

## Support

- 📖 API Docs: `http://localhost:7860/docs`
- 🐛 Issues: Create on your repository
- 💬 HuggingFace: Community forums

Happy redacting! 🔒
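
A follow-up to the "Model not loaded" entry under Troubleshooting: because the fix there is to wait, a client script may want to poll `/health` before sending its first job. The sketch below is one way to do that with only the standard library. The helper names `health_ok` and `wait_until_ready` are hypothetical, and it assumes that `/health` answers with HTTP 200 once the service is usable, which this guide does not spell out.

```python
import time
import urllib.request


def health_ok(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200.

    Assumption: the health endpoint returns 200 when the API is usable.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False


def wait_until_ready(url, attempts=30, delay=2.0, probe=health_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `attempts` run out.

    `probe` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return True
        if attempt < attempts:
            sleep(delay)  # No pause after the final failed attempt
    return False
```

Calling `wait_until_ready("http://localhost:7860/health")` before the first `/redact` request returns `True` as soon as the check passes and `False` after roughly a minute with the default settings, so the caller can decide whether to proceed or report the failure.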