
🚀 Complete FastAPI Deployment Package

📦 What You've Got

A production-ready FastAPI application for PDF redaction with Named Entity Recognition, ready to deploy on HuggingFace Spaces or any cloud platform.


📁 Directory Structure

pdf-redaction-api/
│
├── 📄 main.py                     # FastAPI application
├── 🐳 Dockerfile                  # Production container
├── 🐳 docker-compose.yml          # Local development
├── 📋 requirements.txt            # Python dependencies
│
├── 📱 app/
│   ├── __init__.py
│   └── redaction.py              # Core redaction engine
│
├── 📂 uploads/                    # Temporary uploads
│   └── .gitkeep
│
├── 📂 outputs/                    # Redacted PDFs
│   └── .gitkeep
│
├── 🧪 tests/
│   └── test_api.py               # API tests
│
├── 📚 Documentation/
│   ├── README.md                 # Main docs (for HF Spaces)
│   ├── DEPLOYMENT.md             # Deployment guide
│   ├── QUICKSTART.md             # Quick start guide
│   └── STRUCTURE.md              # Project structure
│
├── 🔧 Configuration/
│   ├── .env.example              # Environment variables
│   ├── .gitignore                # Git ignore
│   └── .dockerignore             # Docker ignore
│
├── 🤖 .github/
│   └── workflows/
│       └── ci-cd.yml             # GitHub Actions CI/CD
│
├── 📝 client_example.py           # Example API client
└── 📜 LICENSE                     # MIT License

✨ Features

Core Functionality

  • ✅ PDF upload and processing
  • ✅ OCR with pytesseract (configurable DPI)
  • ✅ Named Entity Recognition (NER)
  • ✅ Accurate coordinate-based redaction
  • ✅ Multiple entity type support
  • ✅ Downloadable redacted PDFs

API Features

  • ✅ RESTful API with FastAPI
  • ✅ Automatic OpenAPI documentation
  • ✅ File upload handling
  • ✅ Background task cleanup
  • ✅ Health checks
  • ✅ Statistics endpoint
  • ✅ CORS support

DevOps

  • ✅ Docker containerization
  • ✅ Docker Compose for local dev
  • ✅ GitHub Actions CI/CD
  • ✅ HuggingFace Spaces ready
  • ✅ Comprehensive testing
  • ✅ Logging and monitoring


🎯 Quick Deployment Paths

Option 1: HuggingFace Spaces (Recommended for Demo)

Time: 10 minutes

# 1. Create Space on HuggingFace (select Docker SDK)
# 2. Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
cd pdf-redaction-api

# 3. Copy all files (the * glob skips dotfiles such as .dockerignore; copy those separately)
cp -r /path/to/pdf-redaction-api/* .

# 4. Deploy
git add .
git commit -m "Initial deployment"
git push

Your API will be at: https://YOUR_USERNAME-pdf-redaction-api.hf.space

Cost: FREE (with CPU Basic tier)


Option 2: Docker Locally

Time: 5 minutes

# Build
docker build -t pdf-redaction-api .

# Run
docker run -p 7860:7860 pdf-redaction-api

# Test
curl http://localhost:7860/health

Option 3: Direct Python

Time: 3 minutes

# Install dependencies
sudo apt-get install tesseract-ocr poppler-utils
pip install -r requirements.txt

# Run
python main.py

# Access at http://localhost:7860

🔌 API Endpoints

Core Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /redact | Upload and redact PDF |
| GET | /download/{job_id} | Download redacted PDF |
| GET | /health | Health check |
| GET | /stats | API statistics |
| DELETE | /cleanup/{job_id} | Manual cleanup |
| GET | /docs | Interactive API docs |

Example Usage

cURL:

curl -X POST "http://localhost:7860/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300"

Python:

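import requests

BASE_URL = "http://localhost:7860"

# Upload the PDF; the context manager closes the file handle afterwards
with open("document.pdf", "rb") as pdf:
    response = requests.post(
        f"{BASE_URL}/redact",
        files={"file": pdf},
        params={"dpi": 300},
    )
response.raise_for_status()

# Fetch the redacted PDF with the returned job ID and save it locally
job_id = response.json()["job_id"]
redacted = requests.get(f"{BASE_URL}/download/{job_id}")
redacted.raise_for_status()
with open("redacted.pdf", "wb") as out:
    out.write(redacted.content)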

🎨 Architecture

┌─────────────────────────────────────────────────────────┐
│                    CLIENT REQUEST                       │
│              (Upload PDF via POST /redact)              │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│                   FASTAPI (main.py)                     │
│  • Validate file                                        │
│  • Generate job_id                                      │
│  • Save to uploads/                                     │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│              PDFRedactor (app/redaction.py)             │
│                                                         │
│  ┌─────────────────────────────────────────┐            │
│  │ 1. OCR (pytesseract)                    │            │
│  │    • Convert PDF → Images (pdf2image)   │            │
│  │    • Extract text + bounding boxes      │            │
│  │    • Store image dimensions             │            │
│  └─────────────────────────────────────────┘            │
│                     ↓                                   │
│  ┌─────────────────────────────────────────┐            │
│  │ 2. NER (HuggingFace Transformers)       │            │
│  │    • Load model                         │            │
│  │    • Identify entities in text          │            │
│  │    • Return entity types + positions    │            │
│  └─────────────────────────────────────────┘            │
│                     ↓                                   │
│  ┌─────────────────────────────────────────┐            │
│  │ 3. Mapping                              │            │
│  │    • Create character span index        │            │
│  │    • Match NER entities to OCR boxes    │            │
│  └─────────────────────────────────────────┘            │
│                     ↓                                   │
│  ┌─────────────────────────────────────────┐            │
│  │ 4. Redaction (pypdf)                    │            │
│  │    • Scale image coords → PDF coords    │            │
│  │    • Create black rectangle annotations │            │
│  │    • Write redacted PDF                 │            │
│  └─────────────────────────────────────────┘            │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│                   RESPONSE                              │
│  • job_id                                               │
│  • List of entities                                     │
│  • Download URL                                         │
└─────────────────────────────────────────────────────────┘
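
Steps 3 and 4 are the least obvious part of the pipeline, so here is a minimal sketch of one way they could work. It is not the code in app/redaction.py. The names `OcrWord`, `build_span_index`, `boxes_for_entity`, and `image_box_to_pdf` are hypothetical, the sample words are made up, and the sketch assumes that OCR words are joined with single spaces before being passed to the NER model, that OCR boxes are in pixels with a top-left origin, and that PDF coordinates are in points with a bottom-left origin.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    left: int    # pixels from the left edge of the page image
    top: int     # pixels from the top edge of the page image
    width: int
    height: int


def build_span_index(words):
    """Join words with single spaces and record each word's character span."""
    spans, parts, pos = [], [], 0
    for word in words:
        spans.append((pos, pos + len(word.text), word))
        parts.append(word.text)
        pos += len(word.text) + 1  # +1 for the joining space
    return " ".join(parts), spans


def boxes_for_entity(spans, start, end):
    """Return every OCR word whose character span overlaps [start, end)."""
    return [word for s, e, word in spans if s < end and e > start]


def image_box_to_pdf(word, image_w, image_h, page_w, page_h):
    """Scale a pixel box (top-left origin) to PDF points (bottom-left origin)."""
    sx, sy = page_w / image_w, page_h / image_h
    x0 = word.left * sx
    x1 = (word.left + word.width) * sx
    y1 = page_h - word.top * sy                  # top edge, flipped
    y0 = page_h - (word.top + word.height) * sy  # bottom edge, flipped
    return (x0, y0, x1, y1)


words = [
    OcrWord("Contact", 100, 200, 180, 40),
    OcrWord("Jane", 300, 200, 100, 40),
    OcrWord("Doe", 420, 200, 90, 40),
]
text, spans = build_span_index(words)
# Suppose the NER model reports a person entity at characters 8..16 ("Jane Doe")
hits = boxes_for_entity(spans, 8, 16)
# A US Letter page (612 x 792 pt) rendered at 300 DPI is 2550 x 3300 px
rects = [image_box_to_pdf(w, 2550, 3300, 612, 792) for w in hits]
```

Each resulting tuple is (x0, y0, x1, y1) in points with y0 below y1, which is the lower-left, upper-right order a PDF rectangle uses.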

🔐 Security Considerations

Current Implementation

  • ✅ File validation (PDF only)
  • ✅ Temporary file cleanup
  • ✅ CORS middleware
  • ✅ Error handling

For Production (TODO)

  • ⚠️ Add API key authentication
  • ⚠️ Implement rate limiting
  • ⚠️ Add file size limits
  • ⚠️ Use HTTPS only
  • ⚠️ Implement user quotas
  • ⚠️ Add input sanitization

Example API Key Auth:

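# Add to main.py
import os
import secrets

from fastapi import HTTPException, Security
from fastapi.security import APIKeyHeader

# Read the key from the environment instead of hard-coding it
# (the variable name API_KEY is an assumption; match it to your .env)
API_KEY = os.environ["API_KEY"]
api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(key: str = Security(api_key_header)):
    # Constant-time comparison avoids leaking the key through timing
    if not secrets.compare_digest(key.encode(), API_KEY.encode()):
        raise HTTPException(status_code=401, detail="Invalid API Key")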
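
Of the TODO items above, file size limits are simple to prototype. The sketch below checks an upload's size and its `%PDF-` magic bytes before any OCR work starts. The function name `validate_pdf_upload` and the 10 MB cap are illustrative assumptions, not values taken from main.py.

```python
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed cap; tune for your deployment


def validate_pdf_upload(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Raise ValueError if the upload is empty, too large, or not a PDF."""
    if not data:
        raise ValueError("empty upload")
    if len(data) > max_bytes:
        raise ValueError(f"file exceeds {max_bytes} bytes")
    # A well-formed PDF starts with "%PDF-"; a .pdf extension alone proves nothing
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF file")
```

Inside a FastAPI handler, the `ValueError` would typically be translated into an `HTTPException` with status 400 (or 413 for oversized files).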

📊 Performance Tuning

DPI Settings

| DPI | Quality | Speed | Use Case |
|-----|---------|-------|----------|
| 150 | Low | Fast | Quick previews |
| 200 | Medium | Medium | General use |
| 300 | High | Slow | Recommended |
| 600 | Very High | Very Slow | Critical documents |
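
The speed column follows from simple arithmetic: the rendered image's pixel count grows with the square of the DPI. A rough sketch, assuming US Letter pages (8.5 × 11 in) and uncompressed 8-bit RGB images (3 bytes per pixel); actual memory use inside pdf2image and Tesseract will differ:

```python
def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0):
    """Return (width_px, height_px) for one page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def page_megabytes(dpi: int, bytes_per_pixel: int = 3) -> float:
    """Approximate size of one uncompressed RGB page image in MB (10**6 bytes)."""
    w, h = page_pixels(dpi)
    return w * h * bytes_per_pixel / 1_000_000


for dpi in (150, 200, 300, 600):
    w, h = page_pixels(dpi)
    print(f"{dpi} DPI: {w} x {h} px, ~{page_megabytes(dpi):.1f} MB per page")
```

Doubling the DPI quadruples the pixel count, so a page at 600 DPI carries 16 times as much image data as the same page at 150 DPI.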

Hardware Requirements

Minimum (Free Tier):

  • CPU: 2 cores
  • RAM: 2GB
  • Storage: 1GB

Recommended (Production):

  • CPU: 4+ cores
  • RAM: 8GB
  • Storage: 10GB
  • GPU: Optional (speeds up NER)

🧪 Testing

# Install test dependencies
pip install pytest pytest-cov httpx

# Run tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=app --cov-report=html

# View coverage report (macOS; use xdg-open on Linux)
open htmlcov/index.html

📈 Monitoring

Built-in Endpoints

Health Check:

curl http://localhost:7860/health

Statistics:

curl http://localhost:7860/stats
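
For unattended monitoring, the health endpoint can also be polled from a script. A minimal sketch using only the standard library; the helper name `is_healthy` is hypothetical, and it assumes only that `/health` answers with HTTP 200 when the service is up (the response body is not inspected, since its format is not described here):

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a 4xx/5xx status
        return False
```

Calling `is_healthy("http://localhost:7860")` from a cron job or a container health probe gives a simple up/down signal.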

Logs

Development:

python main.py
# Logs appear in console

Docker:

docker logs -f container_name

HuggingFace Spaces:

  • View in Space dashboard → Logs tab

💰 Cost Estimation

HuggingFace Spaces

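| Tier | CPU | RAM | Price | Use Case |
|------|-----|-----|-------|----------|
| Basic | 2 | 16GB | FREE | Demo, testing |
| CPU Upgrade | 4 | 32GB | $0.50/hr | Production |
| GPU T4 | - | - | $0.60/hr | Heavy load |
| GPU A10G | - | - | $1.50/hr | Enterprise |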

Monthly Costs (if always on, i.e. 24 hours × 30 days = 720 hours):

  • Free: $0
  • CPU Upgrade: ~$360/month
  • GPU T4: ~$432/month

Recommendation: Start free, upgrade based on usage
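
The monthly figures are the hourly rate multiplied by 720 hours. A small sketch that reproduces them from the hourly rates in the tier table, which are this guide's figures and may not match current HuggingFace pricing:

```python
HOURS_PER_MONTH = 24 * 30  # always on, 30-day month

# Hourly rates in USD, copied from the tier table in this guide
hourly_rates = {"Basic": 0.00, "CPU Upgrade": 0.50, "GPU T4": 0.60, "GPU A10G": 1.50}


def monthly_cost(rate_per_hour: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost in USD of running a Space for the given number of hours."""
    return round(rate_per_hour * hours, 2)


for tier, rate in hourly_rates.items():
    print(f"{tier}: ${monthly_cost(rate):.0f}/month")
```

The `hours` argument makes it easy to price a Space that is paused when idle: at 8 hours a day for 22 working days (176 hours), the CPU Upgrade tier comes to `monthly_cost(0.50, 176)`, or $88.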

Alternatives

AWS ECS Fargate: $30-100/month
Google Cloud Run: Pay per request ($10-50/month)
DigitalOcean App: $12-24/month
Self-hosted VPS: $5-20/month


🔄 CI/CD Pipeline

Automated with GitHub Actions

Push to GitHub
      ↓
   [Run Tests]
      ↓
  [Build Docker]
      ↓
   [Test Container]
      ↓
[Deploy to HuggingFace]

Setup:

  1. Add secrets in GitHub repo settings:

    • HF_TOKEN: HuggingFace access token
    • HF_SPACE: Your space name (username/space-name)
  2. Push to main branch → Auto-deploy! ✨


📚 Documentation Access

| Document | Purpose |
|----------|---------|
| README.md | Overview, API docs, usage examples |
| QUICKSTART.md | 5-minute setup guide |
| DEPLOYMENT.md | Production deployment |
| STRUCTURE.md | Code organization |
| /docs endpoint | Interactive API documentation |

🎓 Learning Resources

FastAPI

HuggingFace

Docker


🐛 Troubleshooting

Common Issues

Problem: "Tesseract not found"
Solution: apt-get install tesseract-ocr

Problem: "Poppler not found"
Solution: apt-get install poppler-utils

Problem: Slow processing
Solution: Lower DPI to 150-200

Problem: Out of memory
Solution: Upgrade hardware or reduce DPI

Problem: Model not loading
Solution: Check internet, wait for download
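
The first two problems can be caught before the server starts. A minimal preflight sketch using only the standard library; it assumes the Tesseract binary is named `tesseract` and that Poppler supplies `pdftoppm` (the tool pdf2image calls by default), both on PATH. The function name `missing_system_tools` is hypothetical:

```python
import shutil

REQUIRED_TOOLS = {
    "tesseract": "tesseract-ocr",  # OCR engine used by pytesseract
    "pdftoppm": "poppler-utils",   # PDF rasterizer used by pdf2image
}


def missing_system_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return {binary: apt_package} for each required binary not found on PATH."""
    return {binary: pkg for binary, pkg in tools.items() if which(binary) is None}


for binary, pkg in missing_system_tools().items():
    print(f"Missing '{binary}': try apt-get install {pkg}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.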

Debug Mode

# In main.py, enable auto-reload and debug logging (development only)
import uvicorn

if __name__ == "__main__":
    uvicorn.run("main:app", host="0.0.0.0", port=7860, reload=True, log_level="debug")

✅ Checklist for Production

  • Test all endpoints thoroughly
  • Add API key authentication
  • Implement rate limiting
  • Set up monitoring (Sentry, DataDog, etc.)
  • Configure auto-scaling
  • Set up backups
  • Add usage analytics
  • Create user documentation
  • Set up SSL/TLS (HF provides by default)
  • Test with large files
  • Load testing
  • Security audit
  • Legal compliance (GDPR, etc.)

🎉 You're Ready!

Your FastAPI PDF Redaction application is complete and ready to deploy!

Next Steps:

  1. ✨ Deploy to HuggingFace Spaces (easiest)
  2. 🧪 Test with real PDFs
  3. 📊 Monitor usage
  4. 🔒 Add security for production
  5. 🚀 Scale as needed

Support:

  • 📖 Read the documentation
  • 🐛 Check troubleshooting guide
  • 💬 HuggingFace community forums
  • 📧 Create issues on your repo

Happy Deploying! 🚀