Complete FastAPI Deployment Package
What You've Got
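A FastAPI application for PDF redaction using OCR and Named Entity Recognition (NER), packaged for deployment on HuggingFace Spaces or any cloud platform. It runs as-is for demos; the hardening items listed under Security Considerations are still open and should be completed before production use.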
Directory Structure
pdf-redaction-api/
│
├── main.py                  # FastAPI application
├── Dockerfile               # Production container
├── docker-compose.yml       # Local development
├── requirements.txt         # Python dependencies
│
├── app/
│   ├── __init__.py
│   └── redaction.py         # Core redaction engine
│
├── uploads/                 # Temporary uploads
│   └── .gitkeep
│
├── outputs/                 # Redacted PDFs
│   └── .gitkeep
│
├── tests/
│   └── test_api.py          # API tests
│
├── Documentation/
│   ├── README.md            # Main docs (for HF Spaces)
│   ├── DEPLOYMENT.md        # Deployment guide
│   ├── QUICKSTART.md        # Quick start guide
│   └── STRUCTURE.md         # Project structure
│
├── Configuration/
│   ├── .env.example         # Environment variables
│   ├── .gitignore           # Git ignore
│   └── .dockerignore        # Docker ignore
│
├── .github/
│   └── workflows/
│       └── ci-cd.yml        # GitHub Actions CI/CD
│
├── client_example.py        # Example API client
└── LICENSE                  # MIT License
Features
Core Functionality
- PDF upload and processing
- OCR with pytesseract (configurable DPI)
- Named Entity Recognition (NER)
- Accurate coordinate-based redaction
- Multiple entity type support
- Downloadable redacted PDFs
API Features
- RESTful API with FastAPI
- Automatic OpenAPI documentation
- File upload handling
- Background task cleanup
- Health checks
- Statistics endpoint
- CORS support
DevOps
- Docker containerization
- Docker Compose for local dev
- GitHub Actions CI/CD
- HuggingFace Spaces ready
- Comprehensive testing
- Logging and monitoring
Quick Deployment Paths
Option 1: HuggingFace Spaces (Recommended for Demo)
Time: 10 minutes
# 1. Create Space on HuggingFace (select Docker SDK)
# 2. Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
cd pdf-redaction-api
# 3. Copy all files (the * glob skips dotfiles such as .dockerignore and .github/; copy those separately)
cp -r /path/to/pdf-redaction-api/* .
# 4. Deploy
git add .
git commit -m "Initial deployment"
git push
Your API will be at: https://YOUR_USERNAME-pdf-redaction-api.hf.space
Cost: FREE (with CPU Basic tier)
Option 2: Docker Locally
Time: 5 minutes
# Build
docker build -t pdf-redaction-api .
# Run
docker run -p 7860:7860 pdf-redaction-api
# Test
curl http://localhost:7860/health
Option 3: Direct Python
Time: 3 minutes
# Install dependencies
sudo apt-get install tesseract-ocr poppler-utils
pip install -r requirements.txt
# Run
python main.py
# Access at http://localhost:7860
API Endpoints
Core Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /redact | Upload and redact PDF |
| GET | /download/{job_id} | Download redacted PDF |
| GET | /health | Health check |
| GET | /stats | API statistics |
| DELETE | /cleanup/{job_id} | Manual cleanup |
| GET | /docs | Interactive API docs |
Example Usage
cURL:
curl -X POST "http://localhost:7860/redact" \
-F "file=@document.pdf" \
-F "dpi=300"
Python:
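import requests

# Upload the PDF; the context manager closes the file handle afterwards
with open("document.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": pdf},
        params={"dpi": 300},
    )
response.raise_for_status()
job_id = response.json()["job_id"]

# Fetch the redacted PDF and save it locally
redacted = requests.get(f"http://localhost:7860/download/{job_id}")
redacted.raise_for_status()
with open("redacted.pdf", "wb") as out:
    out.write(redacted.content)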
Architecture
┌───────────────────────────────────────────────────┐
│  CLIENT REQUEST                                   │
│  (Upload PDF via POST /redact)                    │
└───────────────────────────────────────────────────┘
                          ↓
┌───────────────────────────────────────────────────┐
│  FASTAPI (main.py)                                │
│  • Validate file                                  │
│  • Generate job_id                                │
│  • Save to uploads/                               │
└───────────────────────────────────────────────────┘
                          ↓
┌───────────────────────────────────────────────────┐
│  PDFRedactor (app/redaction.py)                   │
│                                                   │
│  ┌─────────────────────────────────────────────┐  │
│  │ 1. OCR (pytesseract)                        │  │
│  │    • Convert PDF → Images (pdf2image)       │  │
│  │    • Extract text + bounding boxes          │  │
│  │    • Store image dimensions                 │  │
│  └─────────────────────────────────────────────┘  │
│                         ↓                         │
│  ┌─────────────────────────────────────────────┐  │
│  │ 2. NER (HuggingFace Transformers)           │  │
│  │    • Load model                             │  │
│  │    • Identify entities in text              │  │
│  │    • Return entity types + positions        │  │
│  └─────────────────────────────────────────────┘  │
│                         ↓                         │
│  ┌─────────────────────────────────────────────┐  │
│  │ 3. Mapping                                  │  │
│  │    • Create character span index            │  │
│  │    • Match NER entities to OCR boxes        │  │
│  └─────────────────────────────────────────────┘  │
│                         ↓                         │
│  ┌─────────────────────────────────────────────┐  │
│  │ 4. Redaction (pypdf)                        │  │
│  │    • Scale image coords → PDF coords        │  │
│  │    • Create black rectangle annotations     │  │
│  │    • Write redacted PDF                     │  │
│  └─────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────┘
                          ↓
┌───────────────────────────────────────────────────┐
│  RESPONSE                                         │
│  • job_id                                         │
│  • List of entities                               │
│  • Download URL                                   │
└───────────────────────────────────────────────────┘
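Steps 3 and 4 of the pipeline above can be sketched in plain Python. This is a minimal illustration, not the code in app/redaction.py: the names `WordBox`, `build_span_index`, `boxes_for_entity`, and `to_pdf_rect` are hypothetical, and the box fields (left, top, width, height) are modeled on what `pytesseract.image_to_data` reports. It assumes the OCR words are joined with single spaces before NER, that the page is unrotated, and that the PDF page origin is at the bottom-left corner.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordBox:
    """One OCR word with its bounding box in image pixels (top-left origin)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def build_span_index(words: List[WordBox]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join words with single spaces and record each word's (start, end) span."""
    spans, parts, pos = [], [], 0
    for word in words:
        spans.append((pos, pos + len(word.text)))
        parts.append(word.text)
        pos += len(word.text) + 1  # +1 for the joining space
    return " ".join(parts), spans


def boxes_for_entity(words, spans, start, end) -> List[WordBox]:
    """Return the OCR boxes whose character span overlaps [start, end)."""
    return [w for w, (s, e) in zip(words, spans) if s < end and e > start]


def to_pdf_rect(box, img_w, img_h, page_w, page_h):
    """Scale a pixel box to PDF points and flip the y axis.

    Images put the origin at the top-left; PDF pages put it at the bottom-left.
    Returns (x0, y0, x1, y1) with (x0, y0) as the lower-left corner.
    """
    sx, sy = page_w / img_w, page_h / img_h
    x0 = box.left * sx
    x1 = (box.left + box.width) * sx
    y1 = page_h - box.top * sy
    y0 = page_h - (box.top + box.height) * sy
    return (x0, y0, x1, y1)


# Example: a US Letter page (612 x 792 pt) rendered at 300 DPI (2550 x 3300 px)
words = [
    WordBox("Contact", 100, 200, 300, 50),
    WordBox("John", 420, 200, 160, 50),
    WordBox("Smith", 600, 200, 200, 50),
]
text, spans = build_span_index(words)
# Suppose the NER model reports a PERSON entity at characters 8..18 ("John Smith")
hits = boxes_for_entity(words, spans, 8, 18)
rects = [to_pdf_rect(b, 2550, 3300, 612, 792) for b in hits]
```

An entity that spans several words yields one rectangle per word, which keeps the mapping simple at the cost of small gaps between rectangles; merging boxes on the same line is a possible refinement.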
Security Considerations
Current Implementation
- File validation (PDF only)
- Temporary file cleanup
- CORS middleware
- Error handling
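The package does not show here how temporary file cleanup is implemented. One possible approach, sketched below under that caveat, removes files older than a cutoff from the uploads/ and outputs/ directories while leaving the .gitkeep placeholders in place. The function name `remove_stale_files` and the one-hour cutoff in the usage note are assumptions for illustration.

```python
import time
from pathlib import Path
from typing import Iterable, List, Optional


def remove_stale_files(
    directories: Iterable[Path],
    max_age_seconds: float,
    now: Optional[float] = None,
) -> List[Path]:
    """Delete regular files older than max_age_seconds; return the removed paths.

    Missing directories are skipped, and .gitkeep placeholders are never deleted.
    """
    now = time.time() if now is None else now
    removed = []
    for directory in directories:
        if not directory.is_dir():
            continue
        for path in directory.iterdir():
            if path.name == ".gitkeep" or not path.is_file():
                continue
            # Age is measured from the file's last modification time
            if now - path.stat().st_mtime > max_age_seconds:
                path.unlink()
                removed.append(path)
    return removed
```

A call such as `remove_stale_files([Path("uploads"), Path("outputs")], max_age_seconds=3600)` could run from a scheduled job or a FastAPI background task.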
For Production (TODO)
- Add API key authentication
- Implement rate limiting
- Add file size limits
- Use HTTPS only
- Implement user quotas
- Add input sanitization
Example API Key Auth:
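# Add to main.py
import os
import secrets
from fastapi import Security, HTTPException
from fastapi.security import APIKeyHeader

# Read the key from the environment instead of hard-coding it
# (the variable name API_KEY is an assumption)
API_KEY = os.environ["API_KEY"]
api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(key: str = Security(api_key_header)):
    # compare_digest avoids leaking key contents through timing differences
    if not secrets.compare_digest(key.encode(), API_KEY.encode()):
        raise HTTPException(status_code=401, detail="Invalid API Key")
    return key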
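Two other items from the TODO list above, file size limits and rate limiting, can be sketched with the standard library alone. This is an illustrative sketch rather than part of the package: `validate_pdf_bytes`, `SlidingWindowLimiter`, and the 20 MB limit are hypothetical, and the limiter keeps its state in memory, so it only works for a single process.

```python
import time
from collections import defaultdict, deque
from typing import Optional

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # assumed limit; tune for your hardware


def validate_pdf_bytes(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Raise ValueError if the upload is too large or lacks the PDF header."""
    if len(data) > max_bytes:
        raise ValueError(f"File exceeds {max_bytes} bytes")
    if not data.startswith(b"%PDF-"):
        raise ValueError("File does not start with a PDF header")


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In an endpoint, a rejected `allow()` call would typically map to HTTP 429 and a failed validation to HTTP 400 or 413. A deployment with several workers would need shared state (for example Redis) instead of the in-memory dictionary.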
Performance Tuning
DPI Settings
| DPI | Quality | Speed | Use Case |
|---|---|---|---|
| 150 | Low | Fast | Quick previews |
| 200 | Medium | Medium | General use |
| 300 | High | Slow | Recommended |
| 600 | Very High | Very Slow | Critical documents |
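The speed and memory cost behind the table comes from pixel count, which grows with the square of the DPI. The sketch below works through the arithmetic for a US Letter page (8.5 × 11 inches), assuming each page is held in memory as an uncompressed 8-bit RGB image (3 bytes per pixel). The helper names are hypothetical.

```python
def page_pixels(width_in: float, height_in: float, dpi: int):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def rgb_megabytes(width_in: float, height_in: float, dpi: int) -> float:
    """Approximate size in MB (10^6 bytes) of the page as uncompressed RGB."""
    w, h = page_pixels(width_in, height_in, dpi)
    return w * h * 3 / 1_000_000


for dpi in (150, 200, 300, 600):
    w, h = page_pixels(8.5, 11, dpi)
    print(f"{dpi} DPI: {w} x {h} px, ~{rgb_megabytes(8.5, 11, dpi):.1f} MB per page")
```

At 300 DPI a Letter page is 2550 × 3300 pixels, about 25 MB per page under these assumptions. Going from 300 to 600 DPI quadruples that, which is why lowering the DPI is the first remedy for slow processing or out-of-memory errors.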
Hardware Requirements
Minimum (Free Tier):
- CPU: 2 cores
- RAM: 2GB
- Storage: 1GB
Recommended (Production):
- CPU: 4+ cores
- RAM: 8GB
- Storage: 10GB
- GPU: Optional (speeds up NER)
Testing
# Install test dependencies
pip install pytest pytest-cov httpx
# Run tests
pytest tests/ -v
# With coverage
pytest tests/ --cov=app --cov-report=html
# View coverage report (macOS; use xdg-open on Linux)
open htmlcov/index.html
Monitoring
Built-in Endpoints
Health Check:
curl http://localhost:7860/health
Statistics:
curl http://localhost:7860/stats
Logs
Development:
python main.py
# Logs appear in console
Docker:
docker logs -f container_name
HuggingFace Spaces:
- View in Space dashboard → Logs tab
Cost Estimation
HuggingFace Spaces
| Tier | CPU | RAM | Price | Use Case |
|---|---|---|---|---|
| Basic | 2 | 16GB | FREE | Demo, testing |
| CPU Upgrade | 4 | 32GB | $0.50/hr | Production |
| GPU T4 | - | - | $0.60/hr | Heavy load |
| GPU A10G | - | - | $1.50/hr | Enterprise |
Monthly Costs (if always on):
- Free: $0
- CPU Upgrade: ~$360/month
- GPU T4: ~$432/month
Recommendation: Start free, upgrade based on usage
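The always-on figures above are the hourly price multiplied by 24 hours and 30 days. The sketch below reproduces that arithmetic with the hourly prices from the table and compares it with a part-time schedule. The 8 hours on 22 days pattern is a hypothetical example, and it assumes billing applies only while the hardware is running.

```python
HOURLY_USD = {"CPU Upgrade": 0.50, "GPU T4": 0.60, "GPU A10G": 1.50}


def monthly_cost(hourly_usd: float, hours_per_day: float = 24, days_per_month: int = 30) -> float:
    """Monthly cost in USD for the given hourly price and running schedule."""
    return round(hourly_usd * hours_per_day * days_per_month, 2)


for tier, price in HOURLY_USD.items():
    always_on = monthly_cost(price)
    part_time = monthly_cost(price, hours_per_day=8, days_per_month=22)
    print(f"{tier}: always on ${always_on:.2f}, 8 h x 22 days ${part_time:.2f}")
```

Under these assumptions, pausing the Space outside working hours cuts the CPU Upgrade bill from $360 to $88 per month.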
Alternatives
- AWS ECS Fargate: $30-100/month
- Google Cloud Run: Pay per request ($10-50/month)
- DigitalOcean App: $12-24/month
- Self-hosted VPS: $5-20/month
CI/CD Pipeline
Automated with GitHub Actions
Push to GitHub
↓
[Run Tests]
↓
[Build Docker]
↓
[Test Container]
↓
[Deploy to HuggingFace]
Setup:
Add secrets in GitHub repo settings:
- HF_TOKEN: HuggingFace access token
- HF_SPACE: Your space name (username/space-name)
Push to main branch → Auto-deploy!
Documentation Access
| Document | Purpose |
|---|---|
| README.md | Overview, API docs, usage examples |
| QUICKSTART.md | 5-minute setup guide |
| DEPLOYMENT.md | Production deployment |
| STRUCTURE.md | Code organization |
| /docs endpoint | Interactive API documentation |
Learning Resources
FastAPI
- Docs: https://fastapi.tiangolo.com
- Tutorial: https://fastapi.tiangolo.com/tutorial
HuggingFace
- Spaces: https://huggingface.co/docs/hub/spaces
- Transformers: https://huggingface.co/docs/transformers
Docker
- Getting Started: https://docs.docker.com/get-started
Troubleshooting
Common Issues
Problem: "Tesseract not found"
Solution: sudo apt-get install tesseract-ocr
Problem: "Poppler not found"
Solution: sudo apt-get install poppler-utils
Problem: Slow processing
Solution: Lower DPI to 150-200
Problem: Out of memory
Solution: Upgrade hardware or reduce DPI
Problem: Model not loading
Solution: Check the internet connection and wait for the model download to finish
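The first two issues can be caught before the server starts. The sketch below checks that the `tesseract` binary (used by pytesseract) and `pdftoppm` (part of poppler-utils, used by pdf2image) are on the PATH. The function name `missing_binaries` and the idea of running it at startup are assumptions, not part of the package.

```python
import shutil

# Binary name -> install hint for Debian/Ubuntu
REQUIRED_BINARIES = {
    "tesseract": "sudo apt-get install tesseract-ocr",
    "pdftoppm": "sudo apt-get install poppler-utils",
}


def missing_binaries(required=REQUIRED_BINARIES, which=shutil.which):
    """Return {binary: install hint} for every required binary not found on PATH."""
    return {name: hint for name, hint in required.items() if which(name) is None}


if __name__ == "__main__":
    for name, hint in missing_binaries().items():
        print(f"Missing '{name}'. Try: {hint}")
```

The `which` parameter exists only so the lookup can be replaced in tests; normal callers can leave it at its default.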
Debug Mode
# In main.py, enable debug mode
import uvicorn

if __name__ == "__main__":
    # reload=True restarts the server on code changes; use it in development only
    uvicorn.run("main:app", host="0.0.0.0", port=7860, reload=True, log_level="debug")
Checklist for Production
- Test all endpoints thoroughly
- Add API key authentication
- Implement rate limiting
- Set up monitoring (Sentry, DataDog, etc.)
- Configure auto-scaling
- Set up backups
- Add usage analytics
- Create user documentation
- Set up SSL/TLS (HF provides by default)
- Test with large files
- Load testing
- Security audit
- Legal compliance (GDPR, etc.)
You're Ready!
Your FastAPI PDF Redaction application is complete and ready to deploy!
Next Steps:
- Deploy to HuggingFace Spaces (easiest)
- Test with real PDFs
- Monitor usage
- Add security for production
- Scale as needed
Support:
- Read the documentation
- Check troubleshooting guide
- HuggingFace community forums
- Create issues on your repo
Happy Deploying!