# 🚀 Complete FastAPI Deployment Package
## 📦 What You've Got
A production-ready FastAPI application for PDF redaction with Named Entity Recognition, ready to deploy on HuggingFace Spaces or any cloud platform.
---
## 📁 Directory Structure
```
pdf-redaction-api/
│
├── 📄 main.py                    # FastAPI application
├── 🐳 Dockerfile                 # Production container
├── 🐳 docker-compose.yml         # Local development
├── 📋 requirements.txt           # Python dependencies
│
├── 📱 app/
│   ├── __init__.py
│   └── redaction.py              # Core redaction engine
│
├── 📂 uploads/                   # Temporary uploads
│   └── .gitkeep
│
├── 📂 outputs/                   # Redacted PDFs
│   └── .gitkeep
│
├── 🧪 tests/
│   └── test_api.py               # API tests
│
├── 📚 Documentation/
│   ├── README.md                 # Main docs (for HF Spaces)
│   ├── DEPLOYMENT.md             # Deployment guide
│   ├── QUICKSTART.md             # Quick start guide
│   └── STRUCTURE.md              # Project structure
│
├── 🔧 Configuration/
│   ├── .env.example              # Environment variables
│   ├── .gitignore                # Git ignore
│   └── .dockerignore             # Docker ignore
│
├── 🤖 .github/
│   └── workflows/
│       └── ci-cd.yml             # GitHub Actions CI/CD
│
├── 📝 client_example.py          # Example API client
└── 📜 LICENSE                    # MIT License
```
---
## ✨ Features
### Core Functionality
- ✅ PDF upload and processing
- ✅ OCR with pytesseract (configurable DPI)
- ✅ Named Entity Recognition (NER)
- ✅ Accurate coordinate-based redaction
- ✅ Multiple entity type support
- ✅ Downloadable redacted PDFs
### API Features
- ✅ RESTful API with FastAPI
- ✅ Automatic OpenAPI documentation
- ✅ File upload handling
- ✅ Background task cleanup
- ✅ Health checks
- ✅ Statistics endpoint
- ✅ CORS support
### DevOps
- ✅ Docker containerization
- ✅ Docker Compose for local dev
- ✅ GitHub Actions CI/CD
- ✅ HuggingFace Spaces ready
- ✅ Comprehensive testing
- ✅ Logging and monitoring
---
## 🎯 Quick Deployment Paths
### Option 1: HuggingFace Spaces (Recommended for Demo)
**Time: 10 minutes**
```bash
# 1. Create Space on HuggingFace (select Docker SDK)
# 2. Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
cd pdf-redaction-api
# 3. Copy all files, including dotfiles such as .dockerignore (a bare * glob skips them)
rsync -a --exclude '.git' /path/to/pdf-redaction-api/ .
# 4. Deploy
git add .
git commit -m "Initial deployment"
git push
```
**Your API will be at:** `https://YOUR_USERNAME-pdf-redaction-api.hf.space`
**Cost:** FREE (with CPU Basic tier)
---
### Option 2: Docker Locally
**Time: 5 minutes**
```bash
# Build
docker build -t pdf-redaction-api .
# Run
docker run -p 7860:7860 pdf-redaction-api
# Test
curl http://localhost:7860/health
```
---
### Option 3: Direct Python
**Time: 3 minutes**
```bash
# Install dependencies
sudo apt-get install tesseract-ocr poppler-utils
pip install -r requirements.txt
# Run
python main.py
# Access at http://localhost:7860
```
---
## 🔌 API Endpoints
### Core Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/redact` | Upload and redact PDF |
| GET | `/download/{job_id}` | Download redacted PDF |
| GET | `/health` | Health check |
| GET | `/stats` | API statistics |
| DELETE | `/cleanup/{job_id}` | Manual cleanup |
| GET | `/docs` | Interactive API docs |
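
The endpoints above imply a job lifecycle: `/redact` creates files under a `job_id`, `/download/{job_id}` serves the result, and `/cleanup/{job_id}` removes it. The following is a minimal sketch of what the cleanup step might look like, not the actual code in `main.py`. The file naming scheme (`{job_id}.pdf` and `{job_id}_redacted.pdf`), the function names, and the assumption that job IDs are UUIDs are all hypothetical.

```python
import uuid
from pathlib import Path


def job_files(job_id: str, upload_dir: Path, output_dir: Path) -> list:
    """Return the files assumed to belong to one job (hypothetical naming scheme)."""
    # Parsing the ID as a UUID rejects path-like values such as "../../etc/passwd".
    canonical = str(uuid.UUID(job_id))
    return [
        upload_dir / f"{canonical}.pdf",
        output_dir / f"{canonical}_redacted.pdf",
    ]


def cleanup_job(job_id: str, upload_dir: Path, output_dir: Path) -> int:
    """Delete a job's files and return how many were actually removed."""
    removed = 0
    for path in job_files(job_id, upload_dir, output_dir):
        if path.exists():
            path.unlink()
            removed += 1
    return removed
```

Validating the ID before building a path keeps a caller-supplied `job_id` from reaching files outside `uploads/` and `outputs/`. In a FastAPI handler, the `ValueError` raised for a malformed ID would typically be mapped to a 400 or 404 response.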
### Example Usage
**cURL:**
```bash
curl -X POST "http://localhost:7860/redact" \
-F "file=@document.pdf" \
-F "dpi=300"
```
**Python:**
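```python
import requests

with open("document.pdf", "rb") as f:  # close the file handle after upload
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": f},
        params={"dpi": 300},
    )
response.raise_for_status()
job_id = response.json()["job_id"]

redacted = requests.get(f"http://localhost:7860/download/{job_id}")
redacted.raise_for_status()
with open("redacted.pdf", "wb") as out:  # save the redacted PDF
    out.write(redacted.content)
```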
---
## 🎨 Architecture
```
┌─────────────────────────────────────────────────────────┐
│                     CLIENT REQUEST                      │
│              (Upload PDF via POST /redact)              │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                    FASTAPI (main.py)                    │
│  • Validate file                                        │
│  • Generate job_id                                      │
│  • Save to uploads/                                     │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│             PDFRedactor (app/redaction.py)              │
│                                                         │
│      ┌───────────────────────────────────────────┐      │
│      │ 1. OCR (pytesseract)                      │      │
│      │    • Convert PDF → Images (pdf2image)     │      │
│      │    • Extract text + bounding boxes        │      │
│      │    • Store image dimensions               │      │
│      └───────────────────────────────────────────┘      │
│                            ↓                            │
│      ┌───────────────────────────────────────────┐      │
│      │ 2. NER (HuggingFace Transformers)         │      │
│      │    • Load model                           │      │
│      │    • Identify entities in text            │      │
│      │    • Return entity types + positions      │      │
│      └───────────────────────────────────────────┘      │
│                            ↓                            │
│      ┌───────────────────────────────────────────┐      │
│      │ 3. Mapping                                │      │
│      │    • Create character span index          │      │
│      │    • Match NER entities to OCR boxes      │      │
│      └───────────────────────────────────────────┘      │
│                            ↓                            │
│      ┌───────────────────────────────────────────┐      │
│      │ 4. Redaction (pypdf)                      │      │
│      │    • Scale image coords → PDF coords      │      │
│      │    • Create black rectangle annotations   │      │
│      │    • Write redacted PDF                   │      │
│      └───────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                        RESPONSE                         │
│  • job_id                                               │
│  • List of entities                                     │
│  • Download URL                                         │
└─────────────────────────────────────────────────────────┘
```
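
Steps 3 and 4 of the diagram carry most of the geometry, so a small sketch may help. It is an illustration under stated assumptions, not the code in `app/redaction.py`: the `OcrWord` class and the three function names are hypothetical, the field names mirror the `left`/`top`/`width`/`height` keys that pytesseract's `image_to_data` returns, and the page is assumed to be unrotated with its origin at the bottom-left corner.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    left: int    # pixels from the left edge of the page image
    top: int     # pixels from the top edge of the page image
    width: int
    height: int


def index_words(words):
    """Join words with single spaces and record each word's (start, end) character span."""
    spans, parts, pos = [], [], 0
    for w in words:
        spans.append((pos, pos + len(w.text)))
        parts.append(w.text)
        pos += len(w.text) + 1  # +1 for the joining space
    return " ".join(parts), spans


def words_for_entity(spans, start, end):
    """Indices of the words whose span overlaps the entity span [start, end)."""
    return [i for i, (s, e) in enumerate(spans) if s < end and e > start]


def to_pdf_rect(word, image_size, page_size):
    """Convert a pixel box (top-left origin) to a PDF rect (bottom-left origin, points)."""
    img_w, img_h = image_size
    page_w, page_h = page_size
    sx, sy = page_w / img_w, page_h / img_h
    x0 = word.left * sx
    x1 = (word.left + word.width) * sx
    y1 = page_h - word.top * sy                  # top edge after flipping the y axis
    y0 = page_h - (word.top + word.height) * sy  # bottom edge after flipping the y axis
    return (x0, y0, x1, y1)


# Example: a US Letter page (612 x 792 pt) rendered at 300 DPI (2550 x 3300 px)
words = [
    OcrWord("Contact", 300, 300, 400, 60),
    OcrWord("Jane", 750, 300, 200, 60),
    OcrWord("Doe", 1000, 300, 180, 60),
]
text, spans = index_words(words)
start = text.index("Jane Doe")  # stands in for the offsets an NER model would return
hits = words_for_entity(spans, start, start + len("Jane Doe"))
rects = [to_pdf_rect(words[i], (2550, 3300), (612, 792)) for i in hits]
```

The y axis is flipped because OCR boxes are measured from the top of the image, while PDF coordinates start at the bottom of the page. Overlap matching (rather than exact span equality) lets an entity that covers part of a word still select that word's box.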
---
## 🔐 Security Considerations
### Current Implementation
- ✅ File validation (PDF only)
- ✅ Temporary file cleanup
- ✅ CORS middleware
- ✅ Error handling
### For Production (TODO)
- ⚠️ Add API key authentication
- ⚠️ Implement rate limiting
- ⚠️ Add file size limits
- ⚠️ Use HTTPS only
- ⚠️ Implement user quotas
- ⚠️ Add input sanitization
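**Example API Key Auth:**
```python
# Add to main.py
import os
import secrets

from fastapi import HTTPException, Security
from fastapi.security import APIKeyHeader

API_KEY = os.environ["API_KEY"]  # example variable name; never hard-code the secret
api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(key: str = Security(api_key_header)):
    # compare_digest avoids leaking key contents through timing differences
    if not secrets.compare_digest(key.encode(), API_KEY.encode()):
        raise HTTPException(status_code=401, detail="Invalid API Key")
```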
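
Two other items from the lists above, PDF-only validation and file size limits, can be sketched in the same spirit. This is an illustrative helper, not the validation that `main.py` currently performs: the function name, the 20 MB default, and the strict "header at byte 0" check are assumptions (some PDF readers tolerate leading bytes before the header).

```python
MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # hypothetical limit; tune for your deployment


def validate_pdf_upload(filename: str, data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Raise ValueError if the upload is too large or does not look like a PDF."""
    if len(data) > max_bytes:
        raise ValueError(f"File exceeds {max_bytes} bytes")
    if not filename.lower().endswith(".pdf"):
        raise ValueError("Only .pdf files are accepted")
    # PDF files begin with the "%PDF-" marker; checking content, not just the
    # extension, catches renamed files.
    if not data.startswith(b"%PDF-"):
        raise ValueError("File content is not a PDF")
```

In an endpoint, the `ValueError` would typically be converted to an `HTTPException` with status 400 (or 413 for the size case).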
---
## 📊 Performance Tuning
### DPI Settings
| DPI | Quality | Speed | Use Case |
|-----|---------|-------|----------|
| 150 | Low | Fast | Quick previews |
| 200 | Medium | Medium | General use |
| 300 | High | Slow | **Recommended** |
| 600 | Very High | Very Slow | Critical documents |
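
The speed column follows from simple arithmetic: the pixel count of a rendered page grows with the square of the DPI. The sketch below estimates the size of one uncompressed RGB page image; the function name and the US Letter default (8.5 x 11 inches) are assumptions for illustration, and actual memory use during OCR will be higher.

```python
def page_raster_stats(dpi: int, width_in: float = 8.5, height_in: float = 11.0, channels: int = 3):
    """Pixel dimensions and uncompressed size (MiB) of one page rendered at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    size_mib = width_px * height_px * channels / (1024 ** 2)
    return width_px, height_px, size_mib


for dpi in (150, 200, 300, 600):
    w, h, mib = page_raster_stats(dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {mib:.1f} MiB per page")
```

Doubling the DPI quadruples the pixel count, which is why 600 DPI is so much slower than 300 DPI and why lowering DPI is the first remedy for out-of-memory errors.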
### Hardware Requirements
**Minimum (Free Tier):**
- CPU: 2 cores
- RAM: 2GB
- Storage: 1GB
**Recommended (Production):**
- CPU: 4+ cores
- RAM: 8GB
- Storage: 10GB
- GPU: Optional (speeds up NER)
---
## 🧪 Testing
```bash
# Install test dependencies
pip install pytest pytest-cov httpx
# Run tests
pytest tests/ -v
# With coverage
pytest tests/ --cov=app --cov-report=html
# View coverage report (macOS; use xdg-open on Linux)
open htmlcov/index.html
```
---
## 📈 Monitoring
### Built-in Endpoints
**Health Check:**
```bash
curl http://localhost:7860/health
```
**Statistics:**
```bash
curl http://localhost:7860/stats
```
### Logs
**Development:**
```bash
python main.py
# Logs appear in console
```
**Docker:**
```bash
docker logs -f container_name
```
**HuggingFace Spaces:**
- View in Space dashboard → Logs tab
---
## 💰 Cost Estimation
### HuggingFace Spaces
| Tier | CPU | RAM | Price | Use Case |
|------|-----|-----|-------|----------|
| Basic | 2 | 16GB | **FREE** | Demo, testing |
| CPU Upgrade | 4 | 32GB | $0.50/hr | Production |
| GPU T4 | - | - | $0.60/hr | Heavy load |
| GPU A10G | - | - | $1.50/hr | Enterprise |
**Monthly Costs (always on, 720 hours per 30-day month):**
- Free: $0
- CPU Upgrade: ~$360/month
- GPU T4: ~$432/month
**Recommendation:** Start free, upgrade based on usage
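
The monthly figures above are the hourly rate multiplied by the hours the Space is running. A small helper makes it easy to redo the estimate for a different schedule; the function name is hypothetical, and the rates are the ones quoted in the table, which may not match current pricing.

```python
def monthly_cost(hourly_rate: float, hours_per_day: float = 24.0, days: int = 30) -> float:
    """Estimated monthly cost in dollars for a Space billed by the hour."""
    return round(hourly_rate * hours_per_day * days, 2)


always_on_cpu = monthly_cost(0.50)                      # CPU Upgrade, running 24/7
always_on_t4 = monthly_cost(0.60)                       # GPU T4, running 24/7
office_hours_cpu = monthly_cost(0.50, hours_per_day=8)  # CPU Upgrade, 8 hours a day
```

Running only during working hours cuts the bill to a third of the always-on figure, so pausing or sleeping the Space when idle is worth considering before changing tiers.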
### Alternatives
- **AWS ECS Fargate:** ~$30-100/month
- **Google Cloud Run:** Pay per request (~$10-50/month)
- **DigitalOcean App:** $12-24/month
- **Self-hosted VPS:** $5-20/month
---
## 🔄 CI/CD Pipeline
### Automated with GitHub Actions
```
Push to GitHub
↓
[Run Tests]
↓
[Build Docker]
↓
[Test Container]
↓
[Deploy to HuggingFace]
```
**Setup:**
1. Add secrets in GitHub repo settings:
   - `HF_TOKEN`: HuggingFace access token
   - `HF_SPACE`: Your space name (username/space-name)
2. Push to main branch → Auto-deploy! ✨
---
## 📚 Documentation Access
| Document | Purpose |
|----------|---------|
| `README.md` | Overview, API docs, usage examples |
| `QUICKSTART.md` | 5-minute setup guide |
| `DEPLOYMENT.md` | Production deployment |
| `STRUCTURE.md` | Code organization |
| `/docs` endpoint | Interactive API documentation |
---
## 🎓 Learning Resources
### FastAPI
- Docs: https://fastapi.tiangolo.com
- Tutorial: https://fastapi.tiangolo.com/tutorial
### HuggingFace
- Spaces: https://huggingface.co/docs/hub/spaces
- Transformers: https://huggingface.co/docs/transformers
### Docker
- Getting Started: https://docs.docker.com/get-started
---
## 🐛 Troubleshooting
### Common Issues
**Problem:** "Tesseract not found"
**Solution:** `apt-get install tesseract-ocr`
**Problem:** "Poppler not found"
**Solution:** `apt-get install poppler-utils`
**Problem:** Slow processing
**Solution:** Lower DPI to 150-200
**Problem:** Out of memory
**Solution:** Upgrade hardware or reduce DPI
**Problem:** Model not loading
**Solution:** Check internet, wait for download
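
The first two problems can be detected before the server starts. The sketch below checks whether the required executables are on `PATH`; the function name is hypothetical. `tesseract` is the binary that pytesseract calls, and `pdftoppm` is one of the Poppler tools that pdf2image relies on (installed by `poppler-utils`).

```python
import shutil


def check_system_dependencies() -> dict:
    """Map each required executable to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in ("tesseract", "pdftoppm")}


for name, path in check_system_dependencies().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

Running this inside the container or virtual environment that serves the API shows quickly whether a "not found" error comes from a missing package or from a `PATH` problem.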
### Debug Mode
```python
# In main.py, add debug mode
import uvicorn

if __name__ == "__main__":
    # reload=True restarts the server on code changes; use in development only
    uvicorn.run("main:app", host="0.0.0.0", port=7860, reload=True, log_level="debug")
```
---
## ✅ Checklist for Production
- [ ] Test all endpoints thoroughly
- [ ] Add API key authentication
- [ ] Implement rate limiting
- [ ] Set up monitoring (Sentry, DataDog, etc.)
- [ ] Configure auto-scaling
- [ ] Set up backups
- [ ] Add usage analytics
- [ ] Create user documentation
- [ ] Set up SSL/TLS (HF provides by default)
- [ ] Test with large files
- [ ] Load testing
- [ ] Security audit
- [ ] Legal compliance (GDPR, etc.)
---
## 🎉 You're Ready!
Your FastAPI PDF Redaction application is complete and ready to deploy!
### Next Steps:
1. ✨ Deploy to HuggingFace Spaces (easiest)
2. 🧪 Test with real PDFs
3. 📊 Monitor usage
4. 🔒 Add security for production
5. 🚀 Scale as needed
### Support:
- 📖 Read the documentation
- 🐛 Check troubleshooting guide
- 💬 HuggingFace community forums
- 📧 Create issues on your repo

**Happy Deploying! 🚀**