# Complete FastAPI Deployment Package

## What You've Got

A production-ready FastAPI application for PDF redaction with Named Entity Recognition, ready to deploy on HuggingFace Spaces or any cloud platform.

---

## Directory Structure

```
pdf-redaction-api/
│
├── main.py                      # FastAPI application
├── Dockerfile                   # Production container
├── docker-compose.yml           # Local development
├── requirements.txt             # Python dependencies
│
├── app/
│   ├── __init__.py
│   └── redaction.py             # Core redaction engine
│
├── uploads/                     # Temporary uploads
│   └── .gitkeep
│
├── outputs/                     # Redacted PDFs
│   └── .gitkeep
│
├── tests/
│   └── test_api.py              # API tests
│
├── Documentation/
│   ├── README.md                # Main docs (for HF Spaces)
│   ├── DEPLOYMENT.md            # Deployment guide
│   ├── QUICKSTART.md            # Quick start guide
│   └── STRUCTURE.md             # Project structure
│
├── Configuration/
│   ├── .env.example             # Environment variables
│   ├── .gitignore               # Git ignore
│   └── .dockerignore            # Docker ignore
│
├── .github/
│   └── workflows/
│       └── ci-cd.yml            # GitHub Actions CI/CD
│
├── client_example.py            # Example API client
└── LICENSE                      # MIT License
```

---

## Features

### Core Functionality
- ✅ PDF upload and processing
- ✅ OCR with pytesseract (configurable DPI)
- ✅ Named Entity Recognition (NER)
- ✅ Accurate coordinate-based redaction
- ✅ Multiple entity type support
- ✅ Downloadable redacted PDFs

### API Features
- ✅ RESTful API with FastAPI
- ✅ Automatic OpenAPI documentation
- ✅ File upload handling
- ✅ Background task cleanup
- ✅ Health checks
- ✅ Statistics endpoint
- ✅ CORS support

### DevOps
- ✅ Docker containerization
- ✅ Docker Compose for local dev
- ✅ GitHub Actions CI/CD
- ✅ HuggingFace Spaces ready
- ✅ Comprehensive testing
- ✅ Logging and monitoring

---

## Quick Deployment Paths

### Option 1: HuggingFace Spaces (Recommended for Demo)

**Time: 10 minutes**

```bash
# 1. Create Space on HuggingFace (select Docker SDK)
# 2. Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
cd pdf-redaction-api

# 3. Copy all files (the * glob skips dotfiles such as .dockerignore; copy those separately)
cp -r /path/to/pdf-redaction-api/* .

# 4. Deploy
git add .
git commit -m "Initial deployment"
git push
```

**Your API will be at:** `https://YOUR_USERNAME-pdf-redaction-api.hf.space`

**Cost:** FREE (with CPU Basic tier)

---

### Option 2: Docker Locally

**Time: 5 minutes**

```bash
# Build
docker build -t pdf-redaction-api .

# Run
docker run -p 7860:7860 pdf-redaction-api

# Test
curl http://localhost:7860/health
```

---

### Option 3: Direct Python

**Time: 3 minutes**

```bash
# Install system dependencies (Debian/Ubuntu), then Python dependencies
sudo apt-get install -y tesseract-ocr poppler-utils
pip install -r requirements.txt

# Run
python main.py

# Access at http://localhost:7860
```

---

## API Endpoints

### Core Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/redact` | Upload and redact PDF |
| GET | `/download/{job_id}` | Download redacted PDF |
| GET | `/health` | Health check |
| GET | `/stats` | API statistics |
| DELETE | `/cleanup/{job_id}` | Manual cleanup |
| GET | `/docs` | Interactive API docs |

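
Manual cleanup via `/cleanup/{job_id}` and the background cleanup task both come down to deleting files under `uploads/` and `outputs/`. The sketch below shows one way that could look. It assumes files are named with the job ID as a prefix and that job IDs are UUIDs; neither is confirmed by this package, and `remove_job_files` and `remove_stale_files` are hypothetical helper names.

```python
import time
import uuid
from pathlib import Path


def remove_job_files(job_id: str, dirs: list[Path]) -> int:
    """Delete files whose name starts with job_id; return how many were removed."""
    # Reject anything that is not a UUID so glob characters such as "*"
    # can never reach the pattern below (raises ValueError on bad input).
    uuid.UUID(job_id)
    removed = 0
    for directory in dirs:
        for path in directory.glob(f"{job_id}*"):
            if path.is_file():
                path.unlink()
                removed += 1
    return removed


def remove_stale_files(dirs: list[Path], max_age_seconds: float) -> int:
    """Delete files older than max_age_seconds, keeping .gitkeep placeholders."""
    now = time.time()
    removed = 0
    for directory in dirs:
        for path in directory.iterdir():
            if path.name == ".gitkeep" or not path.is_file():
                continue
            if now - path.stat().st_mtime > max_age_seconds:
                path.unlink()
                removed += 1
    return removed
```

Validating the ID before building the glob pattern is the important design choice here: the job ID arrives from the URL, so it should be treated as untrusted input.
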
### Example Usage

**cURL:**
```bash
curl -X POST "http://localhost:7860/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300"
```

**Python:**
```python
import requests

BASE_URL = "http://localhost:7860"

# Upload the PDF; the file handle is closed when the block exits.
# Note: dpi is sent as a query parameter here, while the cURL example
# sends it as a form field. Use whichever form main.py declares.
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/redact",
        files={"file": f},
        params={"dpi": 300},
    )
response.raise_for_status()

job_id = response.json()["job_id"]
redacted = requests.get(f"{BASE_URL}/download/{job_id}")
```

---

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                     CLIENT REQUEST                      │
│              (Upload PDF via POST /redact)              │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                    FASTAPI (main.py)                    │
│  • Validate file                                        │
│  • Generate job_id                                      │
│  • Save to uploads/                                     │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│             PDFRedactor (app/redaction.py)              │
│                                                         │
│       ┌─────────────────────────────────────────┐       │
│       │ 1. OCR (pytesseract)                    │       │
│       │    • Convert PDF → Images (pdf2image)   │       │
│       │    • Extract text + bounding boxes      │       │
│       │    • Store image dimensions             │       │
│       └─────────────────────────────────────────┘       │
│                            ↓                            │
│       ┌─────────────────────────────────────────┐       │
│       │ 2. NER (HuggingFace Transformers)       │       │
│       │    • Load model                         │       │
│       │    • Identify entities in text          │       │
│       │    • Return entity types + positions    │       │
│       └─────────────────────────────────────────┘       │
│                            ↓                            │
│       ┌─────────────────────────────────────────┐       │
│       │ 3. Mapping                              │       │
│       │    • Create character span index        │       │
│       │    • Match NER entities to OCR boxes    │       │
│       └─────────────────────────────────────────┘       │
│                            ↓                            │
│       ┌─────────────────────────────────────────┐       │
│       │ 4. Redaction (pypdf)                    │       │
│       │    • Scale image coords → PDF coords    │       │
│       │    • Create black rectangle annotations │       │
│       │    • Write redacted PDF                 │       │
│       └─────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                        RESPONSE                         │
│  • job_id                                               │
│  • List of entities                                     │
│  • Download URL                                         │
└─────────────────────────────────────────────────────────┘
```

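
The mapping step (step 3 in the diagram) is the least obvious part of the pipeline, so here is a minimal sketch of the idea. It is not the code in `app/redaction.py`. It assumes the OCR words are joined with single spaces into the text passed to the NER model, and that each entity comes back with `start`/`end` character offsets into that text. `OcrWord`, `build_span_index`, and `boxes_for_span` are hypothetical names.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    box: tuple[int, int, int, int]  # (left, top, width, height) in image pixels


def build_span_index(words: list[OcrWord]) -> tuple[str, list[int]]:
    """Join words with single spaces; return the text and each word's start offset."""
    starts, parts, pos = [], [], 0
    for word in words:
        starts.append(pos)
        parts.append(word.text)
        pos += len(word.text) + 1  # +1 for the joining space
    return " ".join(parts), starts


def boxes_for_span(words, starts, start, end):
    """Return the boxes of every word overlapping the character range [start, end)."""
    if start >= end or not words:
        return []
    # Binary search for the last word that begins at or before `start`.
    i = max(bisect_right(starts, start) - 1, 0)
    boxes = []
    while i < len(words) and starts[i] < end:
        if starts[i] + len(words[i].text) > start:
            boxes.append(words[i].box)
        i += 1
    return boxes


words = [
    OcrWord("Contact", (10, 10, 70, 12)),
    OcrWord("John", (90, 10, 40, 12)),
    OcrWord("Smith", (140, 10, 50, 12)),
    OcrWord("today", (200, 10, 50, 12)),
]
text, starts = build_span_index(words)
# An entity covering "John Smith" occupies characters 8..18 of the joined text.
person_boxes = boxes_for_span(words, starts, 8, 18)
```
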

---

## Security Considerations

### Current Implementation
- ✅ File validation (PDF only)
- ✅ Temporary file cleanup
- ✅ CORS middleware
- ✅ Error handling

### For Production (TODO)
- ⚠️ Add API key authentication
- ⚠️ Implement rate limiting
- ⚠️ Add file size limits
- ⚠️ Use HTTPS only
- ⚠️ Implement user quotas
- ⚠️ Add input sanitization

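
For the file size limit and input sanitization items, a framework-independent check could look like the sketch below. The 20 MiB cap, the `UploadRejected` exception, and the function names are assumptions for illustration, not values or APIs taken from `main.py`. The header check is deliberately strict: it requires `%PDF-` at byte 0.

```python
import uuid
from pathlib import Path

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # hypothetical 20 MiB cap
PDF_MAGIC = b"%PDF-"


class UploadRejected(ValueError):
    """Raised when an upload fails validation."""


def validate_pdf_upload(filename: str, data: bytes,
                        max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Check extension, size, and PDF header; raise UploadRejected on failure."""
    if not filename.lower().endswith(".pdf"):
        raise UploadRejected("Only .pdf files are accepted")
    if len(data) > max_bytes:
        raise UploadRejected(f"File exceeds the {max_bytes}-byte limit")
    if not data.startswith(PDF_MAGIC):
        raise UploadRejected("File does not start with a %PDF- header")


def new_job_path(upload_dir: Path) -> tuple[str, Path]:
    """Generate a job ID and a storage path that ignores the client's filename."""
    job_id = uuid.uuid4().hex
    return job_id, upload_dir / f"{job_id}.pdf"
```
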
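**Example API Key Auth:**
```python
# Add to main.py
import os
import secrets

from fastapi import HTTPException, Security
from fastapi.security import APIKeyHeader

# Read the key from the environment instead of hard-coding it
# (the variable name API_KEY is an assumption; match your .env file)
API_KEY = os.environ["API_KEY"]
api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(key: str = Security(api_key_header)):
    # Constant-time comparison avoids leaking the key through timing
    if not secrets.compare_digest(key.encode(), API_KEY.encode()):
        raise HTTPException(status_code=401, detail="Invalid API Key")
    return key
```
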
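
Rate limiting is also on the TODO list. The sketch below shows a sliding-window limiter that keeps its state in memory. It assumes a single-process deployment; several workers or replicas would need a shared store instead. `SlidingWindowLimiter` is a hypothetical class, not part of this package or of FastAPI.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of accepted requests

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Example policy: 10 redaction requests per client per minute.
limiter = SlidingWindowLimiter(max_requests=10, window_seconds=60)
```
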

---

## Performance Tuning

### DPI Settings

| DPI | Quality | Speed | Use Case |
|-----|---------|-------|----------|
| 150 | Low | Fast | Quick previews |
| 200 | Medium | Medium | General use |
| 300 | High | Slow | **Recommended** |
| 600 | Very High | Very Slow | Critical documents |

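
The speed and memory differences in the table follow from simple arithmetic: a PDF page is measured in points (72 per inch), so the rendered image grows with the square of the DPI. The same factor converts OCR pixel boxes back into PDF coordinates. The sketch below illustrates both under stated assumptions: an unrotated page whose origin is at the bottom-left corner, and pixel boxes given as `(left, top, width, height)` from the top-left. The function names are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(page_w_pt: float, page_h_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return (round(page_w_pt * dpi / POINTS_PER_INCH),
            round(page_h_pt * dpi / POINTS_PER_INCH))


def rgb_megabytes(width_px: int, height_px: int) -> float:
    """Approximate size of an uncompressed 8-bit RGB image, in MB."""
    return width_px * height_px * 3 / 1_000_000


def image_box_to_pdf_rect(left, top, width, height, dpi, page_h_pt):
    """Convert a pixel box (top-left origin) to a PDF rect (x0, y0, x1, y1)."""
    x0 = left * POINTS_PER_INCH / dpi
    x1 = (left + width) * POINTS_PER_INCH / dpi
    # PDF y grows upward, image y grows downward, so flip against the page height.
    y1 = page_h_pt - top * POINTS_PER_INCH / dpi
    y0 = page_h_pt - (top + height) * POINTS_PER_INCH / dpi
    return (x0, y0, x1, y1)


# A US Letter page is 612 x 792 points.
for dpi in (150, 200, 300, 600):
    w, h = rendered_size(612, 792, dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {rgb_megabytes(w, h):.1f} MB per page")
```

Doubling the DPI quadruples the pixel count, which is why 600 DPI is so much slower and heavier than 300.
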
### Hardware Requirements

**Minimum (Free Tier):**
- CPU: 2 cores
- RAM: 2 GB
- Storage: 1 GB

**Recommended (Production):**
- CPU: 4+ cores
- RAM: 8 GB
- Storage: 10 GB
- GPU: Optional (speeds up NER)

---

## Testing

```bash
# Install test dependencies
pip install pytest pytest-cov httpx

# Run tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=app --cov-report=html

# View coverage report (macOS; use xdg-open on Linux)
open htmlcov/index.html
```

---

## Monitoring

### Built-in Endpoints

**Health Check:**
```bash
curl http://localhost:7860/health
```

**Statistics:**
```bash
curl http://localhost:7860/stats
```

### Logs

**Development:**
```bash
python main.py
# Logs appear in console
```

**Docker:**
```bash
# Find the container name or ID with `docker ps`
docker logs -f container_name
```

**HuggingFace Spaces:**
- View in Space dashboard → Logs tab

---

## Cost Estimation

### HuggingFace Spaces

| Tier | CPU | RAM | Price | Use Case |
|------|-----|-----|-------|----------|
| Basic | 2 | 16 GB | **FREE** | Demo, testing |
| CPU Upgrade | 4 | 32 GB | $0.50/hr | Production |
| GPU T4 | - | - | $0.60/hr | Heavy load |
| GPU A10G | - | - | $1.50/hr | Enterprise |

**Monthly Costs (if always on, about 720 hours):**
- Free: $0
- CPU Upgrade: ~$360/month
- GPU T4: ~$432/month

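
The monthly figures are the hourly rate multiplied by 24 hours and 30 days. A short sketch of that arithmetic, using the rates from the table above as inputs, makes it easy to estimate other schedules. It assumes cost is proportional to running hours; `monthly_cost` is a hypothetical helper.

```python
# Hourly rates copied from the table above (USD per hour).
HOURLY_RATES_USD = {
    "Basic": 0.00,
    "CPU Upgrade": 0.50,
    "GPU T4": 0.60,
    "GPU A10G": 1.50,
}


def monthly_cost(hourly_rate: float, hours_per_day: float = 24.0, days: int = 30) -> float:
    """Estimated monthly cost for hardware running hours_per_day, every day."""
    return round(hourly_rate * hours_per_day * days, 2)


for tier, rate in HOURLY_RATES_USD.items():
    always_on = monthly_cost(rate)
    office_hours = monthly_cost(rate, hours_per_day=8)
    print(f"{tier}: ${always_on:.2f} always on, ${office_hours:.2f} at 8 h/day")
```
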
**Recommendation:** Start free, upgrade based on usage

### Alternatives

- **AWS ECS Fargate:** ~$30-100/month
- **Google Cloud Run:** Pay per request (~$10-50/month)
- **DigitalOcean App:** $12-24/month
- **Self-hosted VPS:** $5-20/month

---

## CI/CD Pipeline

### Automated with GitHub Actions

```
Push to GitHub
      ↓
[Run Tests]
      ↓
[Build Docker]
      ↓
[Test Container]
      ↓
[Deploy to HuggingFace]
```

**Setup:**
1. Add secrets in GitHub repo settings:
   - `HF_TOKEN`: HuggingFace access token
   - `HF_SPACE`: Your space name (`username/space-name`)
2. Push to the main branch → auto-deploy!

---

## Documentation Access

| Document | Purpose |
|----------|---------|
| `README.md` | Overview, API docs, usage examples |
| `QUICKSTART.md` | 5-minute setup guide |
| `DEPLOYMENT.md` | Production deployment |
| `STRUCTURE.md` | Code organization |
| `/docs` endpoint | Interactive API documentation |

---

## Learning Resources

### FastAPI
- Docs: https://fastapi.tiangolo.com
- Tutorial: https://fastapi.tiangolo.com/tutorial

### HuggingFace
- Spaces: https://huggingface.co/docs/hub/spaces
- Transformers: https://huggingface.co/docs/transformers

### Docker
- Getting Started: https://docs.docker.com/get-started

---

## Troubleshooting

### Common Issues

**Problem:** "Tesseract not found"
**Solution:** `sudo apt-get install tesseract-ocr` (Debian/Ubuntu)

**Problem:** "Poppler not found"
**Solution:** `sudo apt-get install poppler-utils` (Debian/Ubuntu)

**Problem:** Slow processing
**Solution:** Lower DPI to 150-200

**Problem:** Out of memory
**Solution:** Upgrade hardware or reduce DPI

**Problem:** Model not loading
**Solution:** Check the internet connection and wait for the model download to finish on first run

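
The first two issues can be caught before the server starts. The sketch below checks that the command-line tools behind pytesseract and pdf2image are on `PATH`. It assumes the binaries are named `tesseract` and `pdftoppm`, as they are in the `tesseract-ocr` and `poppler-utils` packages; `missing_dependencies` is a hypothetical helper.

```python
import shutil

# Binary name -> apt package that provides it
REQUIRED_BINARIES = {
    "tesseract": "tesseract-ocr",  # called by pytesseract
    "pdftoppm": "poppler-utils",   # called by pdf2image
}


def missing_dependencies(which=shutil.which) -> dict[str, str]:
    """Return {binary: package} for every required binary not found on PATH."""
    return {
        binary: package
        for binary, package in REQUIRED_BINARIES.items()
        if which(binary) is None
    }


if __name__ == "__main__":
    for binary, package in missing_dependencies().items():
        print(f"Missing '{binary}': install it with `apt-get install {package}`")
```

Running this as part of container startup turns a confusing mid-request failure into a clear message at boot.
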
### Debug Mode

```python
# In main.py, add debug mode
import uvicorn

if __name__ == "__main__":
    # reload=True restarts the server on code changes; use it for development only
    uvicorn.run("main:app", host="0.0.0.0", port=7860, reload=True, log_level="debug")
```

---

## Checklist for Production

- [ ] Test all endpoints thoroughly
- [ ] Add API key authentication
- [ ] Implement rate limiting
- [ ] Set up monitoring (Sentry, DataDog, etc.)
- [ ] Configure auto-scaling
- [ ] Set up backups
- [ ] Add usage analytics
- [ ] Create user documentation
- [ ] Set up SSL/TLS (HF provides by default)
- [ ] Test with large files
- [ ] Load testing
- [ ] Security audit
- [ ] Legal compliance (GDPR, etc.)

---

## You're Ready!

Your FastAPI PDF Redaction application is complete and ready to deploy!

### Next Steps:
1. Deploy to HuggingFace Spaces (easiest)
2. Test with real PDFs
3. Monitor usage
4. Add security for production
5. Scale as needed

### Support:
- Read the documentation
- Check troubleshooting guide
- HuggingFace community forums
- Create issues on your repo

**Happy Deploying!**