# 🚀 Complete FastAPI Deployment Package

## 📦 What You've Got

A production-ready FastAPI application for PDF redaction with Named Entity Recognition, ready to deploy on HuggingFace Spaces or any cloud platform.

---

## 📁 Directory Structure

```
pdf-redaction-api/
│
├── 📄 main.py                    # FastAPI application
├── 🐳 Dockerfile                 # Production container
├── 🐳 docker-compose.yml         # Local development
├── 📋 requirements.txt           # Python dependencies
│
├── 📱 app/
│   ├── __init__.py
│   └── redaction.py              # Core redaction engine
│
├── 📂 uploads/                   # Temporary uploads
│   └── .gitkeep
│
├── 📂 outputs/                   # Redacted PDFs
│   └── .gitkeep
│
├── 🧪 tests/
│   └── test_api.py               # API tests
│
├── 📚 Documentation/
│   ├── README.md                 # Main docs (for HF Spaces)
│   ├── DEPLOYMENT.md             # Deployment guide
│   ├── QUICKSTART.md             # Quick start guide
│   └── STRUCTURE.md              # Project structure
│
├── 🔧 Configuration/
│   ├── .env.example              # Environment variables
│   ├── .gitignore                # Git ignore
│   └── .dockerignore             # Docker ignore
│
├── 🤖 .github/
│   └── workflows/
│       └── ci-cd.yml             # GitHub Actions CI/CD
│
├── 📝 client_example.py          # Example API client
└── 📜 LICENSE                    # MIT License
```

---

## ✨ Features

### Core Functionality
- ✅ PDF upload and processing
- ✅ OCR with pytesseract (configurable DPI)
- ✅ Named Entity Recognition (NER)
- ✅ Accurate coordinate-based redaction
- ✅ Multiple entity type support
- ✅ Downloadable redacted PDFs

### API Features
- ✅ RESTful API with FastAPI
- ✅ Automatic OpenAPI documentation
- ✅ File upload handling
- ✅ Background task cleanup
- ✅ Health checks
- ✅ Statistics endpoint
- ✅ CORS support

### DevOps
- ✅ Docker containerization
- ✅ Docker Compose for local dev
- ✅ GitHub Actions CI/CD
- ✅ HuggingFace Spaces ready
- ✅ Comprehensive testing
- ✅ Logging and
monitoring

---

## 🎯 Quick Deployment Paths

### Option 1: HuggingFace Spaces (Recommended for Demo)
**Time: 10 minutes**

```bash
# 1. Create Space on HuggingFace (select Docker SDK)

# 2. Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
cd pdf-redaction-api

# 3. Copy all files
#    Note: the * glob skips dotfiles (.gitignore, .dockerignore, .github/);
#    copy those separately if you need them in the Space.
cp -r /path/to/pdf-redaction-api/* .

# 4. Deploy
git add .
git commit -m "Initial deployment"
git push
```

**Your API will be at:** `https://YOUR_USERNAME-pdf-redaction-api.hf.space`

**Cost:** FREE (with CPU Basic tier)

---

### Option 2: Docker Locally
**Time: 5 minutes**

```bash
# Build
docker build -t pdf-redaction-api .

# Run
docker run -p 7860:7860 pdf-redaction-api

# Test
curl http://localhost:7860/health
```

---

### Option 3: Direct Python
**Time: 3 minutes**

```bash
# Install dependencies (Debian/Ubuntu)
sudo apt-get install tesseract-ocr poppler-utils
pip install -r requirements.txt

# Run
python main.py

# Access at http://localhost:7860
```

---

## 🔌 API Endpoints

### Core Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/redact` | Upload and redact PDF |
| GET | `/download/{job_id}` | Download redacted PDF |
| GET | `/health` | Health check |
| GET | `/stats` | API statistics |
| DELETE | `/cleanup/{job_id}` | Manual cleanup |
| GET | `/docs` | Interactive API docs |

### Example Usage

**cURL:**
```bash
curl -X POST "http://localhost:7860/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300"
```

**Python:**
```python
import requests

# Upload the PDF; the context manager closes the file handle afterwards
with open("document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": f},
        params={"dpi": 300},
    )
response.raise_for_status()

job_id = response.json()["job_id"]

# redacted.content holds the bytes of the redacted PDF
redacted = requests.get(f"http://localhost:7860/download/{job_id}")
redacted.raise_for_status()
```

---

## 🎨 Architecture

```
┌─────────────────────────────────────────────────────────┐
│                     CLIENT REQUEST                      │
│              (Upload PDF via POST /redact)              │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│                    FASTAPI (main.py)                    │
│  • Validate file                                        │
│  • Generate job_id                                      │
│  • Save to uploads/                                     │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│             PDFRedactor (app/redaction.py)              │
│                                                         │
│  ┌─────────────────────────────────────────┐            │
│  │ 1. OCR (pytesseract)                    │            │
│  │    • Convert PDF → Images (pdf2image)   │            │
│  │    • Extract text + bounding boxes      │            │
│  │    • Store image dimensions             │            │
│  └─────────────────────────────────────────┘            │
│                       ↓                                 │
│  ┌─────────────────────────────────────────┐            │
│  │ 2. NER (HuggingFace Transformers)       │            │
│  │    • Load model                         │            │
│  │    • Identify entities in text          │            │
│  │    • Return entity types + positions    │            │
│  └─────────────────────────────────────────┘            │
│                       ↓                                 │
│  ┌─────────────────────────────────────────┐            │
│  │ 3. Mapping                              │            │
│  │    • Create character span index        │            │
│  │    • Match NER entities to OCR boxes    │            │
│  └─────────────────────────────────────────┘            │
│                       ↓                                 │
│  ┌─────────────────────────────────────────┐            │
│  │ 4. Redaction (pypdf)                    │            │
│  │    • Scale image coords → PDF coords    │            │
│  │    • Create black rectangle annotations │            │
│  │    • Write redacted PDF                 │            │
│  └─────────────────────────────────────────┘            │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│                        RESPONSE                         │
│  • job_id                                               │
│  • List of entities                                     │
│  • Download URL                                         │
└─────────────────────────────────────────────────────────┘
```

---

## 🔐 Security Considerations

### Current Implementation
- ✅ File validation (PDF only)
- ✅ Temporary file cleanup
- ✅ CORS middleware
- ✅ Error handling

### For Production (TODO)
- ⚠️ Add API key authentication
- ⚠️ Implement rate limiting
- ⚠️ Add file size limits
- ⚠️ Use HTTPS only
- ⚠️ Implement user quotas
- ⚠️ Add input sanitization

**Example API Key Auth:**
```python
# Add to main.py
import os
import secrets

from fastapi import HTTPException, Security
from fastapi.security import APIKeyHeader

# Read the key from the environment rather than hard-coding it
# (the variable name API_KEY is only an example)
API_KEY = os.environ["API_KEY"]
api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(key: str = Security(api_key_header)):
    # Constant-time comparison avoids leaking the key through timing
    if not secrets.compare_digest(key.encode(), API_KEY.encode()):
        raise HTTPException(status_code=401, detail="Invalid API Key")
```

---

## 📊 Performance Tuning

### DPI Settings

| DPI | Quality | Speed | Use Case |
|-----|---------|-------|----------|
| 150 | Low | Fast | Quick previews |
| 200 | Medium | Medium | General use |
| 300 | High | Slow | **Recommended** |
| 600 | Very High | Very Slow | Critical documents |

### Hardware Requirements

**Minimum (Free Tier):**
- CPU: 2 cores
- RAM: 2GB
- Storage: 1GB

**Recommended (Production):**
- CPU: 4+ cores
- RAM: 8GB
- Storage: 10GB
- GPU: Optional (speeds up NER)

---

## 🧪 Testing

```bash
# Install test dependencies
pip install pytest pytest-cov httpx

# Run tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=app --cov-report=html

# View coverage report (macOS; use xdg-open on Linux)
open htmlcov/index.html
```

---

## 📈 Monitoring

### Built-in Endpoints

**Health Check:**
```bash
curl http://localhost:7860/health
```

**Statistics:**
```bash
curl http://localhost:7860/stats
```

### Logs

**Development:**
```bash
python main.py
# Logs appear in console
```

**Docker:**
```bash
docker logs -f container_name
```

**HuggingFace Spaces:**
- View in Space dashboard → Logs tab

---

## 💰 Cost Estimation

### HuggingFace Spaces

| Tier | CPU | RAM | Price | Use Case |
|------|-----|-----|-------|----------|
| Basic | 2 | 16GB | **FREE** | Demo, testing |
| CPU Upgrade | 4 | 32GB | $0.50/hr | Production |
| GPU T4 | - | - | $0.60/hr | Heavy load |
| GPU A10G | - | - | $1.50/hr | Enterprise |

**Monthly Costs (if always on, about 720 hours per month):**
- Free: $0
- CPU Upgrade: ~$360/month
- GPU T4: ~$432/month

**Recommendation:** Start free, upgrade based on usage

### Alternatives

- **AWS ECS Fargate:** ~$30-100/month
- **Google Cloud Run:** Pay per request (~$10-50/month)
- **DigitalOcean App:** $12-24/month
- **Self-hosted VPS:** $5-20/month

---

## 🔄 CI/CD Pipeline

### Automated with GitHub Actions

```
Push to GitHub
    ↓
[Run Tests]
    ↓
[Build Docker]
    ↓
[Test Container]
    ↓
[Deploy to HuggingFace]
```
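
The `[Test Container]` stage in the pipeline above is not spelled out in this document. One way it could work is a small smoke check that polls the `/health` endpoint until the freshly started container answers. The sketch below is an assumption, not the contents of `ci-cd.yml`: the function names `fetch_status` and `wait_for_health` are hypothetical, and it assumes `/health` returns HTTP 200 once the app is ready.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503).
        return exc.code


def wait_for_health(
    url: str = "http://localhost:7860/health",
    attempts: int = 10,
    delay: float = 3.0,
    fetch: Callable[[str], int] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 or `attempts` are used up."""
    for attempt in range(attempts):
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            # Connection refused or timed out: the container is still starting.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

In a CI step this could be run after `docker run -d -p 7860:7860 pdf-redaction-api`, failing the job when `wait_for_health()` returns `False`. The `fetch` and `sleep` parameters exist only so the retry logic can be tested without a running container.
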
**Setup:**
1. Add secrets in GitHub repo settings:
   - `HF_TOKEN`: HuggingFace access token
   - `HF_SPACE`: Your space name (username/space-name)

2. Push to main branch → Auto-deploy! ✨

---

## 📚 Documentation Access

| Document | Purpose |
|----------|---------|
| `README.md` | Overview, API docs, usage examples |
| `QUICKSTART.md` | 5-minute setup guide |
| `DEPLOYMENT.md` | Production deployment |
| `STRUCTURE.md` | Code organization |
| `/docs` endpoint | Interactive API documentation |

---

## 🎓 Learning Resources

### FastAPI
- Docs: https://fastapi.tiangolo.com
- Tutorial: https://fastapi.tiangolo.com/tutorial

### HuggingFace
- Spaces: https://huggingface.co/docs/hub/spaces
- Transformers: https://huggingface.co/docs/transformers

### Docker
- Getting Started: https://docs.docker.com/get-started

---

## 🐛 Troubleshooting

### Common Issues

**Problem:** "Tesseract not found"
**Solution:** `apt-get install tesseract-ocr`

**Problem:** "Poppler not found"
**Solution:** `apt-get install poppler-utils`

**Problem:** Slow processing
**Solution:** Lower DPI to 150-200

**Problem:** Out of memory
**Solution:** Upgrade hardware or reduce DPI

**Problem:** Model not loading
**Solution:** Check your internet connection and wait for the model download to finish

### Debug Mode

```python
# In main.py, add debug mode
import uvicorn

if __name__ == "__main__":
    # reload=True restarts the server on code changes (development only)
    uvicorn.run("main:app", host="0.0.0.0", port=7860, reload=True, log_level="debug")
```

---

## ✅ Checklist for Production

- [ ] Test all endpoints thoroughly
- [ ] Add API key authentication
- [ ] Implement rate limiting
- [ ] Set up monitoring (Sentry, DataDog, etc.)
- [ ] Configure auto-scaling
- [ ] Set up backups
- [ ] Add usage analytics
- [ ] Create user documentation
- [ ] Set up SSL/TLS (HF provides by default)
- [ ] Test with large files
- [ ] Load testing
- [ ] Security audit
- [ ] Legal compliance (GDPR, etc.)

---

## 🎉 You're Ready!

Your FastAPI PDF Redaction application is complete and ready to deploy!

### Next Steps:
1. ✨ Deploy to HuggingFace Spaces (easiest)
2. 🧪 Test with real PDFs
3. 📊 Monitor usage
4. 🔒 Add security for production
5. 🚀 Scale as needed

### Support:
- 📖 Read the documentation
- 🐛 Check troubleshooting guide
- 💬 HuggingFace community forums
- 📧 Create issues on your repo

**Happy Deploying! 🚀**
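
For readers who want to see what step 4 of the architecture ("Scale image coords → PDF coords") involves, here is a minimal sketch. It is not the code in `app/redaction.py`; the function name `image_box_to_pdf_rect` is hypothetical. It assumes OCR boxes arrive as `(left, top, width, height)` in pixels with the origin at the top-left of the rendered image (the fields pytesseract's `image_to_data` reports), and that the PDF page is unrotated with its origin at the bottom-left and sizes in points.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def image_box_to_pdf_rect(
    box: Box,
    image_size: Tuple[float, float],
    page_size: Tuple[float, float],
) -> Box:
    """Map an OCR box in image pixels to a PDF rectangle in points.

    box        -- (left, top, width, height), origin at the image's top-left
    image_size -- (width, height) of the rendered page image in pixels
    page_size  -- (width, height) of the PDF page in points
    Returns (x0, y0, x1, y1) with the origin at the page's bottom-left.
    """
    left, top, width, height = box
    img_w, img_h = image_size
    page_w, page_h = page_size

    # Points per pixel along each axis.
    scale_x = page_w / img_w
    scale_y = page_h / img_h

    x0 = left * scale_x
    x1 = (left + width) * scale_x
    # Image y grows downward while PDF y grows upward, so flip around the page height.
    y1 = page_h - top * scale_y
    y0 = page_h - (top + height) * scale_y
    return (x0, y0, x1, y1)
```

For a US Letter page (612 × 792 points) rendered at 300 DPI (2550 × 3300 pixels), `image_box_to_pdf_rect((300, 150, 600, 75), (2550, 3300), (612.0, 792.0))` gives approximately `(72, 738, 216, 756)`. Storing the image dimensions during OCR, as the diagram notes, is what makes this scaling independent of the chosen DPI.
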