Spaces:

Text-to-Document-Generation
/

Redact_with_openai

Sammi1211 committed on May 2

Commit

955cf91

1 Parent(s): d5d3492

Resolving conflicts

Files changed (19)

.dockerignore +28 -0
.github/workflows/ci-cd.yml +82 -0
.gitignore +59 -0
COMPLETE_GUIDE.md +488 -0
DEPLOYMENT.md +298 -0
Dockerfile +36 -0
LICENSE +21 -0
QUICKSTART.md +271 -0
README.md +162 -5
STRUCTURE.md +269 -0
app/__init__.py +6 -0
app/redaction.py +327 -0
client_example.py +142 -0
client_supabase.py +9 -0
docker-compose.yml +48 -0
main.py +344 -0
outputs/.gitkeep +0 -0
requirements.txt +14 -0
uploads/.gitkeep +0 -0

.dockerignore ADDED Viewed

	@@ -0,0 +1,28 @@

+__pycache__
+*.pyc
+*.pyo
+*.pyd
+.Python
+*.so
+*.egg
+*.egg-info
+dist
+build
+.git
+.gitignore
+.env
+.venv
+venv/
+env/
+*.log
+.DS_Store
+.pytest_cache
+.coverage
+htmlcov/
+uploads/*
+outputs/*
+!uploads/.gitkeep
+!outputs/.gitkeep
+*.pdf
+README.md
+.github/

.github/workflows/ci-cd.yml ADDED Viewed

	@@ -0,0 +1,82 @@

+name: CI/CD Pipeline
+on:
+  push:
+    branches: [ main, develop ]
+  pull_request:
+    branches: [ main ]
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Python
+      uses: actions/setup-python@v4
+      with:
+        python-version: '3.10'
+    - name: Install system dependencies
+      run: |
+        sudo apt-get update
+        sudo apt-get install -y tesseract-ocr poppler-utils
+    - name: Install Python dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install -r requirements.txt
+        pip install pytest pytest-cov httpx
+    - name: Run tests
+      run: |
+        pytest tests/ -v --cov=app --cov-report=xml
+    - name: Upload coverage
+      uses: codecov/codecov-action@v3
+      with:
+        file: ./coverage.xml
+        fail_ci_if_error: false
+  docker-build:
+    runs-on: ubuntu-latest
+    needs: test
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Docker Buildx
+      uses: docker/setup-buildx-action@v2
+    - name: Build Docker image
+      run: |
+        docker build -t pdf-redaction-api:test .
+    - name: Test Docker image
+      run: |
+        docker run -d -p 7860:7860 --name test-api pdf-redaction-api:test
+        sleep 10
+        curl -f http://localhost:7860/health || exit 1
+        docker stop test-api
+  deploy-huggingface:
+    runs-on: ubuntu-latest
+    needs: [test, docker-build]
+    if: github.ref == 'refs/heads/main'
+    steps:
+    - uses: actions/checkout@v3
+      with:
+        fetch-depth: 0  # full history; pushing a shallow clone to the Space can be rejected
+    - name: Deploy to HuggingFace Spaces
+      env:
+        HF_TOKEN: ${{ secrets.HF_TOKEN }}
+      run: |
+        git config --global user.email "github-actions@github.com"
+        git config --global user.name "GitHub Actions"
+        # Add HuggingFace remote if it doesn't exist
+        git remote add hf https://user:$HF_TOKEN@huggingface.co/spaces/${{ secrets.HF_SPACE }} || true
+        # Push to HuggingFace
+        git push hf main:main
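
Note on the `Test Docker image` step above: it waits a fixed `sleep 10` before calling `/health`, which may be too short if the container is still loading the NER model. A minimal sketch of a polling alternative follows, assuming the `/health` endpoint returns HTTP 200 once the service is ready. The names `default_probe` and `wait_for_health` are hypothetical helpers, not part of this commit.

```python
import time
import urllib.error
import urllib.request


def default_probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_health(url, timeout=120.0, interval=2.0,
                    probe=default_probe, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `timeout` seconds have been slept.

    Only time spent sleeping is counted, so the real wall-clock wait can be
    somewhat longer than `timeout` when individual probes are slow.
    """
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Example (not executed here):
#   ok = wait_for_health("http://localhost:7860/health")
#   raise SystemExit(0 if ok else 1)
```

In the workflow, a script like this could replace the `sleep 10` line so the job fails only after the full timeout rather than on a slow start.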

.gitignore ADDED Viewed

	@@ -0,0 +1,59 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+# Virtual environments
+redact/
+venv/
+env/
+ENV/
+.venv
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# OS
+.DS_Store
+Thumbs.db
+# Project specific
+uploads/*.pdf
+outputs/*.pdf
+*.log
+# Environment
+.env
+.env.local
+# Testing
+.pytest_cache/
+.coverage
+htmlcov/
+# Model cache
+cache/
+models/
+tests

COMPLETE_GUIDE.md ADDED Viewed

	@@ -0,0 +1,488 @@

+# 🚀 Complete FastAPI Deployment Package
+## 📦 What You've Got
+A production-ready FastAPI application for PDF redaction with Named Entity Recognition, ready to deploy on HuggingFace Spaces or any cloud platform.
+---
+## 📁 Directory Structure
+```
+pdf-redaction-api/
+│
+├── 📄 main.py                     # FastAPI application
+├── 🐳 Dockerfile                  # Production container
+├── 🐳 docker-compose.yml          # Local development
+├── 📋 requirements.txt            # Python dependencies
+│
+├── 📱 app/
+│   ├── __init__.py
+│   └── redaction.py              # Core redaction engine
+│
+├── 📂 uploads/                    # Temporary uploads
+│   └── .gitkeep
+│
+├── 📂 outputs/                    # Redacted PDFs
+│   └── .gitkeep
+│
+├── 🧪 tests/
+│   └── test_api.py               # API tests
+│
+├── 📚 Documentation/
+│   ├── README.md                 # Main docs (for HF Spaces)
+│   ├── DEPLOYMENT.md             # Deployment guide
+│   ├── QUICKSTART.md             # Quick start guide
+│   └── STRUCTURE.md              # Project structure
+│
+├── 🔧 Configuration/
+│   ├── .env.example              # Environment variables
+│   ├── .gitignore                # Git ignore
+│   └── .dockerignore             # Docker ignore
+│
+├── 🤖 .github/
+│   └── workflows/
+│       └── ci-cd.yml             # GitHub Actions CI/CD
+│
+├── 📝 client_example.py           # Example API client
+└── 📜 LICENSE                     # MIT License
+```
+---
+## ✨ Features
+### Core Functionality
+✅ PDF upload and processing
+✅ OCR with pytesseract (configurable DPI)
+✅ Named Entity Recognition (NER)
+✅ Accurate coordinate-based redaction
+✅ Multiple entity type support
+✅ Downloadable redacted PDFs
+### API Features
+✅ RESTful API with FastAPI
+✅ Automatic OpenAPI documentation
+✅ File upload handling
+✅ Background task cleanup
+✅ Health checks
+✅ Statistics endpoint
+✅ CORS support
+### DevOps
+✅ Docker containerization
+✅ Docker Compose for local dev
+✅ GitHub Actions CI/CD
+✅ HuggingFace Spaces ready
+✅ Comprehensive testing
+✅ Logging and monitoring
+---
+## 🎯 Quick Deployment Paths
+### Option 1: HuggingFace Spaces (Recommended for Demo)
+**Time: 10 minutes**
+```bash
+# 1. Create Space on HuggingFace (select Docker SDK)
+# 2. Clone your space
+git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
+cd pdf-redaction-api
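+# 3. Copy all files, including dotfiles such as .gitignore and .dockerignore
+#    (a plain `cp -r src/* .` skips them); leave out the source repo's .git
+rsync -a --exclude='.git' /path/to/pdf-redaction-api/ .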
+# 4. Deploy
+git add .
+git commit -m "Initial deployment"
+git push
+```
+**Your API will be at:** `https://YOUR_USERNAME-pdf-redaction-api.hf.space`
+**Cost:** FREE (with CPU Basic tier)
+---
+### Option 2: Docker Locally
+**Time: 5 minutes**
+```bash
+# Build
+docker build -t pdf-redaction-api .
+# Run
+docker run -p 7860:7860 pdf-redaction-api
+# Test
+curl http://localhost:7860/health
+```
+---
+### Option 3: Direct Python
+**Time: 3 minutes**
+```bash
+# Install dependencies
+sudo apt-get install tesseract-ocr poppler-utils
+pip install -r requirements.txt
+# Run
+python main.py
+# Access at http://localhost:7860
+```
+---
+## 🔌 API Endpoints
+### Core Endpoints
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| POST | `/redact` | Upload and redact PDF |
+| GET | `/download/{job_id}` | Download redacted PDF |
+| GET | `/health` | Health check |
+| GET | `/stats` | API statistics |
+| DELETE | `/cleanup/{job_id}` | Manual cleanup |
+| GET | `/docs` | Interactive API docs |
+### Example Usage
+**cURL:**
+```bash
+curl -X POST "http://localhost:7860/redact" \
+  -F "file=@document.pdf" \
+  -F "dpi=300"
+```
+**Python:**
+```python
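+import requests
+# Open the PDF in a context manager so the handle is closed after the upload
+with open("document.pdf", "rb") as pdf:
+    response = requests.post(
+        "http://localhost:7860/redact",
+        files={"file": pdf},
+        params={"dpi": 300}
+    )
+response.raise_for_status()  # stop early on a 4xx/5xx response
+job_id = response.json()["job_id"]
+redacted = requests.get(f"http://localhost:7860/download/{job_id}")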
+---
+## 🎨 Architecture
+```
+┌─────────────────────────────────────────────────────────┐
+│                    CLIENT REQUEST                       │
+│              (Upload PDF via POST /redact)              │
+└─────────────────────────────────────────────────────────┘
+                          ↓
+┌─────────────────────────────────────────────────────────┐
+│                   FASTAPI (main.py)                     │
+│  • Validate file                                        │
+│  • Generate job_id                                      │
+│  • Save to uploads/                                     │
+└─────────────────────────────────────────────────────────┘
+                          ↓
+┌─────────────────────────────────────────────────────────┐
+│              PDFRedactor (app/redaction.py)             │
+│                                                         │
+│  ┌─────────────────────────────────────────┐           │
+│  │ 1. OCR (pytesseract)                    │           │
+│  │    • Convert PDF → Images (pdf2image)   │           │
+│  │    • Extract text + bounding boxes      │           │
+│  │    • Store image dimensions             │           │
+│  └─────────────────────────────────────────┘           │
+│                     ↓                                   │
+│  ┌─────────────────────────────────────────┐           │
+│  │ 2. NER (HuggingFace Transformers)       │           │
+│  │    • Load model                         │           │
+│  │    • Identify entities in text          │           │
+│  │    • Return entity types + positions    │           │
+│  └─────────────────────────────────────────┘           │
+│                     ↓                                   │
+│  ┌─────────────────────────────────────────┐           │
+│  │ 3. Mapping                              │           │
+│  │    • Create character span index        │           │
+│  │    • Match NER entities to OCR boxes    │           │
+│  └─────────────────────────────────────────┘           │
+│                     ↓                                   │
+│  ┌─────────────────────────────────────────┐           │
+│  │ 4. Redaction (pypdf)                    │           │
+│  │    • Scale image coords → PDF coords    │           │
+│  │    • Create black rectangle annotations │           │
+│  │    • Write redacted PDF                 │           │
+│  └─────────────────────────────────────────┘           │
+└─────────────────────────────────────────────────────────┘
+                          ↓
+┌─────────────────────────────────────────────────────────┐
+│                   RESPONSE                              │
+│  • job_id                                               │
+│  • List of entities                                     │
+│  • Download URL                                         │
+└─────────────────────────────────────────────────────────┘
+```
+---
+## 🔐 Security Considerations
+### Current Implementation
+- ✅ File validation (PDF only)
+- ✅ Temporary file cleanup
+- ✅ CORS middleware
+- ✅ Error handling
+### For Production (TODO)
+- ⚠️ Add API key authentication
+- ⚠️ Implement rate limiting
+- ⚠️ Add file size limits
+- ⚠️ Use HTTPS only
+- ⚠️ Implement user quotas
+- ⚠️ Add input sanitization
+**Example API Key Auth:**
+```python
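+# Add to main.py
+import os
+import secrets
+from fastapi import Security, HTTPException
+from fastapi.security import APIKeyHeader
+# Read the key from the environment instead of hard-coding it
+# (the variable name API_KEY is an assumption)
+API_KEY = os.environ["API_KEY"]
+api_key_header = APIKeyHeader(name="X-API-Key")
+def verify_api_key(key: str = Security(api_key_header)):
+    # Constant-time comparison avoids leaking the key through timing
+    if not secrets.compare_digest(key.encode(), API_KEY.encode()):
+        raise HTTPException(401, "Invalid API Key")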
+---
+## 📊 Performance Tuning
+### DPI Settings
+| DPI | Quality | Speed | Use Case |
+|-----|---------|-------|----------|
+| 150 | Low | Fast | Quick previews |
+| 200 | Medium | Medium | General use |
+| 300 | High | Slow | **Recommended** |
+| 600 | Very High | Very Slow | Critical documents |
+### Hardware Requirements
+**Minimum (Free Tier):**
+- CPU: 2 cores
+- RAM: 2GB
+- Storage: 1GB
+**Recommended (Production):**
+- CPU: 4+ cores
+- RAM: 8GB
+- Storage: 10GB
+- GPU: Optional (speeds up NER)
+---
+## 🧪 Testing
+```bash
+# Install test dependencies
+pip install pytest pytest-cov httpx
+# Run tests
+pytest tests/ -v
+# With coverage
+pytest tests/ --cov=app --cov-report=html
+# View coverage report
+open htmlcov/index.html  # macOS; on Linux use xdg-open
+```
+---
+## 📈 Monitoring
+### Built-in Endpoints
+**Health Check:**
+```bash
+curl http://localhost:7860/health
+```
+**Statistics:**
+```bash
+curl http://localhost:7860/stats
+```
+### Logs
+**Development:**
+```bash
+python main.py
+# Logs appear in console
+```
+**Docker:**
+```bash
+docker logs -f container_name
+```
+**HuggingFace Spaces:**
+- View in Space dashboard → Logs tab
+---
+## 💰 Cost Estimation
+### HuggingFace Spaces
+| Tier | CPU | RAM | Price | Use Case |
+|------|-----|-----|-------|----------|
+| Basic | 2 | 16GB | **FREE** | Demo, testing |
+| CPU Upgrade | 4 | 32GB | $0.50/hr | Production |
+| GPU T4 | - | - | $0.60/hr | Heavy load |
+| GPU A10G | - | - | $1.50/hr | Enterprise |
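+**Monthly Costs (if always on, 24 h × 30 days = 720 h):**
+- Free: $0
+- CPU Upgrade: ~$360/month
+- GPU T4: ~$432/month
+- GPU A10G: ~$1,080/month
+**Recommendation:** Start free, upgrade based on usage. The hourly rates above are approximate; confirm current pricing on the HuggingFace Spaces pricing page before budgeting.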
+### Alternatives
+**AWS ECS Fargate:** ~$30-100/month
+**Google Cloud Run:** Pay per request (~$10-50/month)
+**DigitalOcean App:** $12-24/month
+**Self-hosted VPS:** $5-20/month
+---
+## 🔄 CI/CD Pipeline
+### Automated with GitHub Actions
+```
+Push to GitHub
+      ↓
+   [Run Tests]
+      ↓
+  [Build Docker]
+      ↓
+   [Test Container]
+      ↓
+[Deploy to HuggingFace]
+```
+**Setup:**
+1. Add secrets in GitHub repo settings:
+   - `HF_TOKEN`: HuggingFace access token
+   - `HF_SPACE`: Your space name (username/space-name)
+2. Push to main branch → Auto-deploy! ✨
+---
+## 📚 Documentation Access
+| Document | Purpose |
+|----------|---------|
+| `README.md` | Overview, API docs, usage examples |
+| `QUICKSTART.md` | 5-minute setup guide |
+| `DEPLOYMENT.md` | Production deployment |
+| `STRUCTURE.md` | Code organization |
+| `/docs` endpoint | Interactive API documentation |
+---
+## 🎓 Learning Resources
+### FastAPI
+- Docs: https://fastapi.tiangolo.com
+- Tutorial: https://fastapi.tiangolo.com/tutorial
+### HuggingFace
+- Spaces: https://huggingface.co/docs/hub/spaces
+- Transformers: https://huggingface.co/docs/transformers
+### Docker
+- Getting Started: https://docs.docker.com/get-started
+---
+## 🐛 Troubleshooting
+### Common Issues
+**Problem:** "Tesseract not found"
+**Solution:** `sudo apt-get install -y tesseract-ocr`
+**Problem:** "Poppler not found"
+**Solution:** `sudo apt-get install -y poppler-utils`
+**Problem:** Slow processing
+**Solution:** Lower DPI to 150-200
+**Problem:** Out of memory
+**Solution:** Upgrade hardware or reduce DPI
+**Problem:** Model not loading
+**Solution:** Check internet, wait for download
+### Debug Mode
+```python
+# In main.py, add debug mode
+if __name__ == "__main__":
+    uvicorn.run("main:app", host="0.0.0.0", port=7860, reload=True, log_level="debug")
+```
+---
+## ✅ Checklist for Production
+- [ ] Test all endpoints thoroughly
+- [ ] Add API key authentication
+- [ ] Implement rate limiting
+- [ ] Set up monitoring (Sentry, DataDog, etc.)
+- [ ] Configure auto-scaling
+- [ ] Set up backups
+- [ ] Add usage analytics
+- [ ] Create user documentation
+- [ ] Set up SSL/TLS (HF provides by default)
+- [ ] Test with large files
+- [ ] Load testing
+- [ ] Security audit
+- [ ] Legal compliance (GDPR, etc.)
+---
+## 🎉 You're Ready!
+Your FastAPI PDF Redaction application is complete and ready to deploy!
+### Next Steps:
+1. ✨ Deploy to HuggingFace Spaces (easiest)
+2. 🧪 Test with real PDFs
+3. 📊 Monitor usage
+4. 🔒 Add security for production
+5. 🚀 Scale as needed
+### Support:
+- 📖 Read the documentation
+- 🐛 Check troubleshooting guide
+- 💬 HuggingFace community forums
+- 📧 Create issues on your repo
+**Happy Deploying! 🚀**
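
Note on the "Mapping" stage in the architecture diagram above ("Create character span index", "Match NER entities to OCR boxes"): `app/redaction.py` is not shown in this part of the diff, so the following is only a sketch of one way such an index could work. It assumes the OCR words are joined with single spaces before being passed to the NER model, and that the model returns character offsets into that joined string. `OcrWord`, `build_span_index` and `words_for_entity` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    box: tuple  # (left, top, width, height) in image pixels
    start: int = 0  # character offset of the word in the joined text
    end: int = 0    # exclusive end offset


def build_span_index(words):
    """Join words with single spaces and record each word's [start, end) span."""
    pos = 0
    parts = []
    for w in words:
        w.start = pos
        w.end = pos + len(w.text)
        parts.append(w.text)
        pos = w.end + 1  # skip the joining space
    return " ".join(parts)


def words_for_entity(words, ent_start, ent_end):
    """Return the words whose span overlaps the entity span [ent_start, ent_end)."""
    return [w for w in words if w.start < ent_end and w.end > ent_start]


# Illustrative data only; the boxes are made up.
words = [
    OcrWord("Contact", (10, 10, 70, 12)),
    OcrWord("John", (90, 10, 40, 12)),
    OcrWord("Doe", (135, 10, 35, 12)),
    OcrWord("today", (180, 10, 50, 12)),
]
text = build_span_index(words)
ent_start = text.find("John Doe")
hits = words_for_entity(words, ent_start, ent_start + len("John Doe"))
print([w.text for w in hits])
```

The overlap test (rather than exact equality of offsets) is a deliberate choice in this sketch: it still finds the right boxes when the model's entity boundaries cut through a word, for example when trailing punctuation is attached to a name.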

DEPLOYMENT.md ADDED Viewed

	@@ -0,0 +1,298 @@

+# Deployment Guide for HuggingFace Spaces
+## Prerequisites
+1. **HuggingFace Account**: Sign up at https://huggingface.co/
+2. **Git**: Installed on your local machine
+3. **Git LFS**: For large file storage (optional)
+## Step-by-Step Deployment
+### 1. Create a New Space
+1. Go to https://huggingface.co/spaces
+2. Click "Create new Space"
+3. Fill in the details:
+   - **Space name**: `pdf-redaction-api` (or your preferred name)
+   - **License**: MIT
+   - **SDK**: Docker
+   - **Hardware**: CPU Basic (free tier) or upgrade if needed
+4. Click "Create Space"
+### 2. Clone Your Space Repository
+```bash
+git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
+cd pdf-redaction-api
+```
+### 3. Copy All Files to the Repository
+Copy all files from this project to your cloned space:
+```bash
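+# Copy all files, including dotfiles (.gitignore, .dockerignore), which a
+# plain `cp -r src/* .` would skip; leave out the source repo's .git
+rsync -a --exclude='.git' /path/to/pdf-redaction-api/ .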
+# Check the files
+ls -la
+```
+You should see:
+- `main.py`
+- `app/`
+- `Dockerfile`
+- `requirements.txt`
+- `README.md`
+- `.gitignore`
+- `.dockerignore`
+- `uploads/` (with .gitkeep)
+- `outputs/` (with .gitkeep)
+### 4. Commit and Push
+```bash
+# Add all files
+git add .
+# Commit
+git commit -m "Initial deployment of PDF Redaction API"
+# Push to HuggingFace
+git push
+```
+### 5. Monitor Deployment
+1. Go to your Space URL: `https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api`
+2. You'll see the build logs
+3. Wait for the build to complete (usually 5-10 minutes)
+4. Once complete, your API will be live!
+### 6. Test Your Deployment
+```bash
+# Check health
+curl https://YOUR_USERNAME-pdf-redaction-api.hf.space/health
+# Test with a PDF
+curl -X POST "https://YOUR_USERNAME-pdf-redaction-api.hf.space/redact" \
+  -F "file=@test.pdf" \
+  -F "dpi=300"
+```
+## Configuration Options
+### Hardware Upgrades
+For better performance, consider upgrading your Space hardware:
+1. Go to Space Settings
+2. Click on "Hardware"
+3. Choose:
+   - **CPU Basic** (Free): Good for testing, slower processing
+   - **CPU Upgrade** (~$0.50/hour): Faster processing
+   - **GPU** (~$0.60-3/hour): Best for large documents
+### Environment Variables
+Add environment variables in Space Settings if needed:
+```bash
+HF_HOME=/app/cache
+PYTHONUNBUFFERED=1
+```
+### Persistent Storage
+For persistent file storage:
+1. Go to Space Settings
+2. Enable "Persistent Storage"
+3. This keeps uploaded/processed files between restarts
+## Custom Domain (Optional)
+To use a custom domain:
+1. Go to Space Settings
+2. Click "Domains"
+3. Add your custom domain
+4. Follow DNS configuration instructions
+## Monitoring and Logs
+### View Logs
+1. Go to your Space page
+2. Click on "Logs" tab
+3. Monitor real-time logs
+### Check Resource Usage
+1. Click on "Insights" tab
+2. View CPU/Memory usage
+3. Monitor request patterns
+## Security Considerations
+### For Production Use
+1. **Add Authentication**:
+   - Implement API key authentication
+   - Use OAuth2 for user management
+2. **Rate Limiting**:
+   - Add rate limiting to prevent abuse
+   - Use slowapi or similar libraries
+3. **File Size Limits**:
+   - Restrict upload file sizes
+   - Implement timeout for long-running requests
+4. **HTTPS Only**:
+   - HuggingFace Spaces provides HTTPS by default
+   - Ensure all requests use HTTPS
+Example with API key authentication:
+```python
+from fastapi import File, HTTPException, Security, UploadFile, status
+from fastapi.security import APIKeyHeader
+API_KEY = "your-secret-key"
+api_key_header = APIKeyHeader(name="X-API-Key")
+def verify_api_key(api_key: str = Security(api_key_header)):
+    if api_key != API_KEY:
+        raise HTTPException(
+            status_code=status.HTTP_401_UNAUTHORIZED,
+            detail="Invalid API Key"
+        )
+    return api_key
+# Add to endpoints
+@app.post("/redact")
+async def redact_pdf(
+    file: UploadFile = File(...),
+    api_key: str = Security(verify_api_key)
+):
+    # Your code here
+    ...
+```
+## Troubleshooting
+### Build Fails
+**Problem**: Docker build fails
+**Solution**:
+- Check Dockerfile syntax
+- Ensure all dependencies are in requirements.txt
+- Review build logs for specific errors
+### Out of Memory
+**Problem**: API crashes with OOM errors
+**Solution**:
+- Reduce default DPI to 200
+- Upgrade to larger hardware
+- Implement request queuing
+### Slow Processing
+**Problem**: Redaction takes too long
+**Solution**:
+- Lower DPI (150-200 for faster processing)
+- Upgrade to GPU hardware
+- Optimize batch processing
+### Model Download Issues
+**Problem**: Model fails to download
+**Solution**:
+- Check HuggingFace model availability
+- Verify internet access in Space
+- Pre-download model and include in Docker image
+## Updating Your Space
+To update your deployed API:
+```bash
+# Make changes locally
+# Test changes
+# Commit and push
+git add .
+git commit -m "Update: description of changes"
+git push
+# HuggingFace will automatically rebuild
+```
+## Cost Estimation
+### Free Tier
+- CPU Basic
+- Limited to 2 CPU cores
+- 16GB RAM
+- Good for: Testing, low-traffic demos
+### Paid Tiers
+- CPU Upgrade: ~$0.50/hour (~$360/month if always on)
+- GPU T4: ~$0.60/hour (~$432/month)
+- GPU A10G: ~$1.50/hour (~$1,080/month)
+**Recommendation**: Start with free tier, upgrade based on usage
+## Alternative Deployment Options
+### 1. Deploy on Your Own Server
+```bash
+# Build Docker image
+docker build -t pdf-redaction-api .
+# Run container
+docker run -p 7860:7860 pdf-redaction-api
+```
+### 2. Deploy on Cloud Platforms
+- **AWS ECS/Fargate**: For scalable production
+- **Google Cloud Run**: Serverless container deployment
+- **Azure Container Instances**: Easy container deployment
+- **DigitalOcean App Platform**: Simple PaaS deployment
+### 3. Deploy on Render.com
+1. Connect your GitHub repo
+2. Select "Docker" as environment
+3. Deploy automatically
+## Support
+For issues:
+1. Check HuggingFace Spaces documentation
+2. Review logs in Space dashboard
+3. Test locally with Docker first
+4. Open issue on your repository
+## Next Steps
+After successful deployment:
+1. ✅ Test all API endpoints
+2. ✅ Set up monitoring
+3. ✅ Configure custom domain (optional)
+4. ✅ Add authentication for production
+5. ✅ Implement rate limiting
+6. ✅ Set up error tracking (e.g., Sentry)
+7. ✅ Create API documentation with examples
+8. ✅ Add usage analytics
+Your API is now live and ready to use! 🚀
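
Note on the "File Size Limits" item in the security list of DEPLOYMENT.md: the guide recommends restricting upload sizes but gives no example. The sketch below shows one framework-independent way to enforce a cap while reading a stream in chunks. The 20 MiB limit, `UploadTooLarge` and `read_limited` are assumptions for illustration; in a FastAPI handler the file-like object would presumably be `UploadFile.file`, and the exception would be translated into an HTTP 413 response.

```python
import io

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # hypothetical 20 MiB cap


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured size limit."""


def read_limited(stream, max_bytes=MAX_UPLOAD_BYTES, chunk_size=64 * 1024):
    """Read a binary stream in chunks, aborting as soon as max_bytes is exceeded."""
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")


# Usage with an in-memory stream standing in for an uploaded file.
data = read_limited(io.BytesIO(b"%PDF-1.7 example"), max_bytes=1024)
print(len(data))
```

Reading in chunks means an oversized upload is rejected after at most `max_bytes + chunk_size` bytes are buffered, rather than after the whole file has been loaded.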

Dockerfile ADDED Viewed

	@@ -0,0 +1,36 @@

+FROM python:3.12-slim
+# Set working directory
+WORKDIR /app
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    tesseract-ocr \
+    tesseract-ocr-eng \
+    poppler-utils \
+    libgl1 \
+    libglib2.0-0 \
+    && rm -rf /var/lib/apt/lists/*
+# Copy requirements first for better caching
+COPY requirements.txt .
+# Install Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy application code
+COPY . .
+# Create necessary directories
+RUN mkdir -p uploads outputs
+# Expose port (HuggingFace Spaces uses 7860)
+EXPOSE 7860
+# Set environment variables
+ENV PYTHONUNBUFFERED=1
+ENV HF_HOME=/app/cache
+# Run the application
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]

LICENSE ADDED Viewed

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2025 PDF Redaction API
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

QUICKSTART.md ADDED Viewed

	@@ -0,0 +1,271 @@

+# Quick Start Guide 🚀
+## Local Development (5 minutes)
+### 1. Install System Dependencies
+**Ubuntu/Debian:**
+```bash
+sudo apt-get update
+sudo apt-get install -y tesseract-ocr poppler-utils
+```
+**macOS:**
+```bash
+brew install tesseract poppler
+```
+**Windows:**
+- Download Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
+- Download Poppler: https://github.com/oschwartz10612/poppler-windows/releases
+### 2. Install Python Dependencies
+```bash
+pip install -r requirements.txt
+```
+### 3. Run the Server
+```bash
+python main.py
+```
+The API will be available at: `http://localhost:7860`
+### 4. Test with cURL
+```bash
+# Health check
+curl http://localhost:7860/health
+# Redact a PDF
+curl -X POST "http://localhost:7860/redact" \
+  -F "file=@your_document.pdf" \
+  -F "dpi=300"
+```
+### 5. Access API Documentation
+Open in browser: `http://localhost:7860/docs`
+## Using Docker (3 minutes)
+### 1. Build Image
+```bash
+docker build -t pdf-redaction-api .
+```
+### 2. Run Container
+```bash
+docker run -p 7860:7860 pdf-redaction-api
+```
+### 3. Test
+```bash
+curl http://localhost:7860/health
+```
+## Deploy to HuggingFace Spaces (10 minutes)
+### 1. Create Space
+1. Go to https://huggingface.co/spaces
+2. Click "Create new Space"
+3. Name: `pdf-redaction-api`
+4. SDK: **Docker**
+5. Click "Create Space"
+### 2. Push Code
+```bash
+# Clone your space
+git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
+cd pdf-redaction-api
+# Copy all project files, including dotfiles (a plain `cp -r src/* .` skips them)
+rsync -a --exclude='.git' /path/to/project/ .
+# Commit and push
+git add .
+git commit -m "Initial deployment"
+git push
+```
+### 3. Wait for Build
+Monitor at: `https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api`
+### 4. Test Your Deployed API
+```bash
+curl https://YOUR_USERNAME-pdf-redaction-api.hf.space/health
+```
+## Example Usage
+### Python Client
+```python
+import requests
+# Upload and redact (the context manager closes the file after the request)
+with open("document.pdf", "rb") as pdf:
+    response = requests.post(
+        "http://localhost:7860/redact",
+        files={"file": pdf},
+        params={"dpi": 300}
+    )
+response.raise_for_status()
+result = response.json()
+job_id = result["job_id"]
+# Download redacted PDF
+redacted = requests.get(f"http://localhost:7860/download/{job_id}")
+redacted.raise_for_status()
+with open("redacted.pdf", "wb") as f:
+    f.write(redacted.content)
+print(f"Redacted {len(result['entities'])} entities")
+```
+### JavaScript/Node.js
+```javascript
+const FormData = require('form-data');
+const fs = require('fs');
+const axios = require('axios');
+async function redactPDF() {
+  const form = new FormData();
+  form.append('file', fs.createReadStream('document.pdf'));
+  // Upload and redact
+  const response = await axios.post(
+    'http://localhost:7860/redact',
+    form,
+    {
+      headers: form.getHeaders(),
+      params: { dpi: 300 }
+    }
+  );
+  const { job_id } = response.data;
+  // Download redacted PDF
+  const redacted = await axios.get(
+    `http://localhost:7860/download/${job_id}`,
+    { responseType: 'arraybuffer' }
+  );
+  fs.writeFileSync('redacted.pdf', redacted.data);
+  console.log('Redaction complete!');
+}
+redactPDF();
+```
+### cURL Advanced
+```bash
+# Redact only specific entity types
+curl -X POST "http://localhost:7860/redact" \
+  -F "file=@document.pdf" \
+  -F "dpi=300" \
+  -F "entity_types=PER,ORG"
+# Get statistics
+curl http://localhost:7860/stats
+# Download specific file
+curl -O -J http://localhost:7860/download/JOB_ID_HERE
+```
+## Common Use Cases
+### 1. Redact All Personal Information
+```python
+response = requests.post(
+    "http://localhost:7860/redact",
+    files={"file": open("resume.pdf", "rb")},
+    params={"dpi": 300}
+)
+```
+### 2. Redact Only Names and Organizations
+```python
+response = requests.post(
+    "http://localhost:7860/redact",
+    files={"file": open("contract.pdf", "rb")},
+    params={
+        "dpi": 300,
+        "entity_types": "PER,ORG"
+    }
+)
+```
+### 3. Fast Processing (Lower Quality)
+```python
+response = requests.post(
+    "http://localhost:7860/redact",
+    files={"file": open("large_doc.pdf", "rb")},
+    params={"dpi": 150}  # Faster but less accurate
+)
+```
+### 4. High Quality (Slower)
+```python
+response = requests.post(
+    "http://localhost:7860/redact",
+    files={"file": open("important.pdf", "rb")},
+    params={"dpi": 600}  # Best quality, slowest
+)
+```
+## Troubleshooting
+### "Model not loaded"
+**Problem**: NER model failed to load
+**Solution**: Check internet connection, wait for model download
+### "Tesseract not found"
+**Problem**: OCR engine not installed
+**Solution**: Install tesseract-ocr system package
+### "Poppler not found"
+**Problem**: PDF converter not installed
+**Solution**: Install poppler-utils system package
+### Slow processing
+**Problem**: Redaction takes too long
+**Solution**: Lower DPI to 150-200
+### Out of memory
+**Problem**: Large PDF crashes the API
+**Solution**:
+- Process one page at a time
+- Increase container memory
+- Lower DPI
+## Next Steps
+- ✅ Read full [README.md](README.md) for API details
+- ✅ Check [DEPLOYMENT.md](DEPLOYMENT.md) for production setup
+- ✅ Review [STRUCTURE.md](STRUCTURE.md) for code organization
+- ✅ Run tests: `pytest tests/`
+- ✅ Add authentication for production use
+- ✅ Set up monitoring and logging
+## Support
+- 📖 API Docs: `http://localhost:7860/docs`
+- 🐛 Issues: Create on your repository
+- 💬 HuggingFace: Community forums
+Happy redacting! 🔒
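
Note on the "Out of memory" and "Slow processing" advice in QUICKSTART.md: lowering DPI helps because the rasterized page size grows with the square of the DPI. The following back-of-the-envelope sketch assumes a US Letter page (8.5 × 11 in) held as an uncompressed 8-bit RGB image. Actual memory use in the pipeline will differ, since OCR and the NER model add their own overhead. `raster_bytes` is a hypothetical helper.

```python
def raster_bytes(width_in, height_in, dpi, channels=3):
    """Bytes needed for one uncompressed page image with 8 bits per channel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels


for dpi in (150, 200, 300, 600):
    mib = raster_bytes(8.5, 11, dpi) / (1024 * 1024)
    print(f"{dpi} DPI: {mib:.1f} MiB per page")
```

Under these assumptions a page takes roughly 24 MiB at 300 DPI and four times that at 600 DPI, so a long document rasterized all at once can exhaust a small container well before the model is involved.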

README.md CHANGED Viewed

@@ -1,10 +1,167 @@
 ---
-title: Redact With Openai
-emoji: 📉
-colorFrom: green
-colorTo: gray
 sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: PDF Redaction API
+emoji: 🔒
+colorFrom: blue
+colorTo: green
 sdk: docker
 pinned: false
+license: mit
 ---
+# PDF Redaction API 🔒
+Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).
+## Features
+- 🤖 **Powered by NER**: Uses state-of-the-art Named Entity Recognition
+- 📄 **PDF Support**: Upload and process PDF documents
+- 🎯 **Accurate Redaction**: Correctly positioned black rectangles over sensitive text
+- 🚀 **Fast Processing**: Optimized OCR and NER pipeline
+- 🔧 **Configurable**: Adjust DPI and filter entity types
+## API Endpoints
+### `POST /redact`
+Upload a PDF file and get it redacted.
+**Parameters:**
+- `file`: PDF file (required)
+- `dpi`: OCR quality (default: 300)
+- `entity_types`: Comma-separated entity types to redact (optional)
+**Example using cURL:**
+```bash
+curl -X POST "https://your-space.hf.space/redact" \
+  -F "file=@document.pdf" \
+  -F "dpi=300"
+```
+**Example using Python:**
+```python
+import requests
+url = "https://your-space.hf.space/redact"
+params = {"dpi": 300}
+# Open the PDF in a context manager so the handle is closed after the upload
+with open("document.pdf", "rb") as pdf:
+    response = requests.post(url, files={"file": pdf}, params=params)
+response.raise_for_status()
+result = response.json()
+# Download redacted file
+job_id = result["job_id"]
+download_url = f"https://your-space.hf.space/download/{job_id}"
+redacted_pdf = requests.get(download_url)
+redacted_pdf.raise_for_status()
+with open("redacted.pdf", "wb") as f:
+    f.write(redacted_pdf.content)
+```
+### `GET /download/{job_id}`
+Download the redacted PDF file.
+### `GET /health`
+Check API health and model status.
+### `GET /stats`
+Get API statistics.
+## Response Format
+```json
+{
+  "job_id": "uuid-here",
+  "status": "completed",
+  "message": "Successfully redacted 5 entities",
+  "entities": [
+    {
+      "entity_type": "PER",
+      "entity_text": "John Doe",
+      "page": 1,
+      "word_count": 2
+    }
+  ],
+  "redacted_file_url": "/download/uuid-here"
+}
+```
+## Entity Types
+Common entity types detected:
+- `PER`: Person names
+- `ORG`: Organizations
+- `LOC`: Locations
+- `DATE`: Dates
+- `EMAIL`: Email addresses
+- `PHONE`: Phone numbers
+- And more...
+## Local Development
+### Prerequisites
+- Python 3.10+
+- Tesseract OCR
+- Poppler utils
+### Installation
+```bash
+# Install system dependencies (Ubuntu/Debian)
+sudo apt-get install tesseract-ocr poppler-utils
+# Install Python dependencies
+pip install -r requirements.txt
+# Run the server
+python main.py
+```
+The API will be available at `http://localhost:7860`
+### Using Docker
+```bash
+# Build the image
+docker build -t pdf-redaction-api .
+# Run the container
+docker run -p 7860:7860 pdf-redaction-api
+```
+## Configuration
+Adjust the DPI parameter based on your needs:
+- `150`: Fast processing, lower quality
+- `300`: Recommended balance (default)
+- `600`: High quality, slower processing
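To see why DPI drives processing time, note that a rasterised page measures (width in inches × DPI) by (height in inches × DPI) pixels. The sketch below assumes a US Letter page (8.5 × 11 in); `page_pixels` is a hypothetical helper, not part of the API.

```python
from typing import Tuple


def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> Tuple[int, int]:
    """Return the (width, height) in pixels of a page rasterised at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (150, 300, 600):
    w, h = page_pixels(dpi)
    print(f"{dpi} DPI -> {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Doubling the DPI quadruples the number of pixels the OCR engine has to scan, which is why `600` is noticeably slower than `300`.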
+## Limitations
+- Maximum file size: Dependent on Space resources
+- Processing time increases with page count and DPI
+- Files are automatically cleaned up after processing
+## Privacy
+- Uploaded files are written to temporary local storage (`uploads/`) and deleted after redaction
+- No data is stored permanently
+- Use your own deployment for sensitive documents
+## Credits
+Built with:
+- [FastAPI](https://fastapi.tiangolo.com/)
+- [Transformers](https://huggingface.co/transformers/)
+- [PyPDF](https://github.com/py-pdf/pypdf)
+- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
+## License
+MIT License - See LICENSE file for details

STRUCTURE.md ADDED Viewed

	@@ -0,0 +1,269 @@

+# Project Structure
+```
+pdf-redaction-api/
+│
+├── main.py                      # FastAPI application entry point
+├── Dockerfile                   # Docker configuration for deployment
+├── requirements.txt             # Python dependencies
+├── README.md                    # Project documentation (for HuggingFace)
+├── DEPLOYMENT.md               # Deployment guide
+├── .gitignore                  # Git ignore rules
+├── .dockerignore               # Docker ignore rules
+│
+├── app/                        # Application modules
+│   ├── __init__.py            # Package initialization
+│   └── redaction.py           # Core redaction logic (PDFRedactor class)
+│
+├── uploads/                    # Temporary upload directory
+│   └── .gitkeep               # Keep directory in git
+│
+├── outputs/                    # Redacted PDF output directory
+│   └── .gitkeep               # Keep directory in git
+│
+├── tests/                      # Test suite
+│   ├── __init__.py
+│   └── test_api.py            # API endpoint tests
+│
+└── client_example.py           # Example client for API usage
+```
+## File Descriptions
+### Core Files
+#### `main.py`
+FastAPI application with endpoints:
+- `POST /redact` - Upload and redact PDF
+- `GET /download/{job_id}` - Download redacted PDF
+- `GET /health` - Health check
+- `GET /stats` - API statistics
+- `DELETE /cleanup/{job_id}` - Manual cleanup
+#### `app/redaction.py`
+Core redaction logic:
+- `PDFRedactor` class
+- OCR processing with pytesseract
+- NER using HuggingFace transformers
+- Entity-to-box mapping
+- PDF redaction with coordinate scaling
+### Configuration Files
+#### `requirements.txt`
+Python dependencies:
+- FastAPI & Uvicorn (API framework)
+- Transformers & Torch (NER model)
+- PyPDF (PDF manipulation)
+- pdf2image (PDF to image conversion)
+- pytesseract (OCR)
+- Pillow (Image processing)
+#### `Dockerfile`
+Build steps:
+1. Install system dependencies (tesseract, poppler)
+2. Install Python dependencies
+3. Copy application code
+4. Configure for port 7860 (HuggingFace default)
+### Documentation
+#### `README.md`
+HuggingFace Space documentation:
+- Features overview
+- API endpoint documentation
+- Usage examples (cURL, Python)
+- Response format
+- Local development setup
+#### `DEPLOYMENT.md`
+Step-by-step deployment guide:
+- HuggingFace Spaces setup
+- Git workflow
+- Configuration options
+- Security considerations
+- Troubleshooting
+- Cost estimation
+### Testing & Examples
+#### `tests/test_api.py`
+Unit tests for API endpoints:
+- Health check tests
+- Upload validation tests
+- Error handling tests
+#### `client_example.py`
+Example client implementation:
+- Upload PDF
+- Download redacted file
+- Health check
+- Statistics
+## Data Flow
+```
+┌─────────────────────────────────────────────────────────┐
+│ 1. Client uploads PDF                                   │
+│    POST /redact with file                               │
+└─────────────────────────────────────────────────────────┘
+                          ↓
+┌─────────────────────────────────────────────────────────┐
+│ 2. FastAPI (main.py)                                    │
+│    - Validates file                                     │
+│    - Generates job_id                                   │
+│    - Saves to uploads/                                  │
+└─────────────────────────────────────────────────────────┘
+                          ↓
+┌─────────────────────────────────────────────────────────┐
+│ 3. PDFRedactor (app/redaction.py)                       │
+│    - perform_ocr() → Extract text + boxes               │
+│    - run_ner() → Identify entities                      │
+│    - map_entities_to_boxes() → Link entities to coords  │
+│    - create_redacted_pdf() → Generate output            │
+└─────────────────────────────────────────────────────────┘
+                          ↓
+┌─────────────────────────────────────────────────────────┐
+│ 4. Response                                             │
+│    - Return job_id and entity list                      │
+│    - Save redacted PDF to outputs/                      │
+└─────────────────────────────────────────────────────────┘
+                          ↓
+┌─────────────────────────────────────────────────────────┐
+│ 5. Client downloads                                     │
+│    GET /download/{job_id}                               │
+└─────────────────────────────────────────────────────────┘
+```
+## Key Components
+### 1. FastAPI Application (`main.py`)
+**Endpoints:**
+- RESTful API design
+- File upload handling
+- Background task cleanup
+- CORS middleware for web access
+**Features:**
+- Automatic OpenAPI documentation at `/docs`
+- JSON response models with Pydantic
+- Error handling with HTTP exceptions
+- Request validation
+### 2. Redaction Engine (`app/redaction.py`)
+**Pipeline Steps:**
+1. **OCR Processing**
+   - Convert PDF pages to images (pdf2image)
+   - Extract text and bounding boxes (pytesseract)
+   - Store image dimensions for coordinate scaling
+2. **NER Processing**
+   - Load HuggingFace model
+   - Identify entities in text
+   - Return entity types and character positions
+3. **Mapping**
+   - Create character span index for OCR words
+   - Match NER entities to OCR bounding boxes
+   - Handle partial word matches
+4. **Redaction**
+   - Scale OCR image coordinates to PDF points
+   - Create black rectangle annotations
+   - Write redacted PDF with pypdf
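The coordinate scaling in step 4 can be sketched in isolation. OCR boxes are `(left, top, width, height)` in image pixels with the origin at the top-left, while PDF rectangles are in points with the origin at the bottom-left, so the y-axis has to be flipped. `ocr_box_to_pdf_rect` is a hypothetical standalone function written for illustration; it follows the same arithmetic as `create_redacted_pdf()` but is not taken from the module.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def ocr_box_to_pdf_rect(box: Box,
                        image_size: Tuple[int, int],
                        page_size: Tuple[float, float]) -> Box:
    """Convert an OCR pixel box to a PDF rectangle (llx, lly, urx, ury) in points."""
    x, y, w, h = box
    image_width, image_height = image_size
    page_width, page_height = page_size
    # Scale factors from image pixels to PDF points
    scale_x = page_width / image_width
    scale_y = page_height / image_height
    # Flip the y-axis: image y grows downward, PDF y grows upward
    llx = x * scale_x
    lly = page_height - (y + h) * scale_y
    urx = (x + w) * scale_x
    ury = page_height - y * scale_y
    return llx, lly, urx, ury


# A US Letter page (612 x 792 pt) rendered at 300 DPI is 2550 x 3300 px.
rect = ocr_box_to_pdf_rect((300, 150, 600, 75), (2550, 3300), (612.0, 792.0))
print([round(v, 2) for v in rect])
```

Because the image dimensions are stored per word during OCR, the same function works regardless of the DPI used for rendering.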
+### 3. Docker Container
+**Layers:**
+- Base: Python 3.10 slim
+- System packages: tesseract-ocr, poppler-utils
+- Python packages: From requirements.txt
+- Application code: Copied last for better caching
+**Optimizations:**
+- Multi-stage build (not used here, but possible)
+- Minimal base image
+- Cached dependency layers
+- .dockerignore to reduce context size
+## Environment Variables
+Default configuration (can be overridden):
+```bash
+PYTHONUNBUFFERED=1        # Immediate log output
+HF_HOME=/app/cache        # HuggingFace cache directory
+```
+## Port Configuration
+- **Development**: 7860 (configurable in main.py)
+- **Production (HF Spaces)**: 7860 (required)
+## Directory Permissions
+Ensure write permissions for:
+- `uploads/` - Temporary PDF storage
+- `outputs/` - Redacted PDF storage
+- `cache/` - Model cache (created automatically)
+## Adding New Features
+### Add New Endpoint
+1. Define in `main.py`:
+```python
+@app.get("/new-endpoint")
+async def new_endpoint():
+    return {"message": "Hello"}
+```
+2. Add response model if needed
+3. Update README.md documentation
+4. Add tests in `tests/test_api.py`
+### Add New Redaction Option
+1. Modify `PDFRedactor` class in `app/redaction.py`
+2. Add parameter to `redact_document()` method
+3. Update API endpoint in `main.py`
+4. Document in README.md
+### Add Authentication
+1. Install: `pip install python-jose passlib`
+2. Create `app/auth.py` with JWT logic
+3. Add middleware to `main.py`
+4. Protect endpoints with dependencies
+## Best Practices
+1. **Logging**: Use `logger` for all important events
+2. **Error Handling**: Catch exceptions and return meaningful errors
+3. **Validation**: Use Pydantic models for request/response validation
+4. **Cleanup**: Always clean up temporary files
+5. **Documentation**: Keep README.md and code comments updated
+6. **Testing**: Add tests for new features
+## Performance Considerations
+### Bottlenecks
+1. OCR processing (most time-consuming)
+2. Model inference (NER)
+3. File I/O
+### Optimizations
+- Lower DPI for faster OCR (trade-off with accuracy)
+- Cache loaded models in memory
+- Use async file operations
+- Implement request queuing for high load
+- Consider GPU for NER model
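The "cache loaded models in memory" point can be sketched with `functools.lru_cache`. This is a minimal illustration under stated assumptions: `load_model` is a stand-in for an expensive model load and `get_model` is a hypothetical accessor; neither exists in the project.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive load actually runs


def load_model(model_name: str) -> dict:
    """Stand-in for an expensive model load (hypothetical)."""
    global load_calls
    load_calls += 1
    return {"name": model_name}


@lru_cache(maxsize=1)
def get_model(model_name: str) -> dict:
    """Return a cached model, loading it only on first use."""
    return load_model(model_name)


first = get_model("openai/privacy-filter")
second = get_model("openai/privacy-filter")
print(first is second, load_calls)
```

`main.py` achieves a similar effect by creating a single module-level `PDFRedactor` instance at startup, so the model is loaded once per process.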
+### Scaling
+- Horizontal: Multiple container instances
+- Vertical: Larger CPU/RAM allocation
+- Caching: Redis for temporary results
+- Queue: Celery for background processing

app/__init__.py ADDED Viewed

	@@ -0,0 +1,6 @@

+"""
+App module for PDF redaction API
+"""
+from .redaction import PDFRedactor
+__all__ = ['PDFRedactor']

app/redaction.py ADDED Viewed

	@@ -0,0 +1,327 @@

+"""
+PDF Redaction module using NER
+"""
+from pdf2image import convert_from_path
+import pytesseract
+from pypdf import PdfReader, PdfWriter
+from pypdf.generic import DictionaryObject, ArrayObject, NameObject, NumberObject
+from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
+from typing import List, Dict, Optional
+import logging
+logger = logging.getLogger(__name__)
+class PDFRedactor:
+    """PDF Redaction using Named Entity Recognition"""
+    def __init__(self, model_name: str = "openai/privacy-filter"):
+        """
+        Initialize the PDF Redactor
+        Args:
+            model_name: HuggingFace model ID for NER
+        """
+        self.model_name = model_name
+        self.ner_pipeline = None
+        self._load_model()
+    def _load_model(self):
+        """Load the NER model"""
+        try:
+            logger.info(f"Loading NER model: {self.model_name}")
+            tokenizer = AutoTokenizer.from_pretrained(
+                self.model_name, trust_remote_code=True
+            )
+            model = AutoModelForTokenClassification.from_pretrained(
+                self.model_name, trust_remote_code=True, device_map="auto"
+            )
+            self.ner_pipeline = pipeline(
+                "token-classification",
+                model=model,
+                tokenizer=tokenizer,
+                aggregation_strategy="simple",
+            )
+            logger.info("NER model loaded successfully")
+        except Exception as e:
+            logger.error(f"Error loading NER model: {str(e)}")
+            raise
+    def is_model_loaded(self) -> bool:
+        """Check if the model is loaded"""
+        return self.ner_pipeline is not None
+    def perform_ocr(self, pdf_path: str, dpi: int = 300) -> List[Dict]:
+        """
+        Perform OCR on PDF and extract word bounding boxes
+        Args:
+            pdf_path: Path to the PDF file
+            dpi: DPI for PDF to image conversion
+        Returns:
+            List of word data with bounding boxes and image dimensions
+        """
+        logger.info(f"Starting OCR on {pdf_path} at {dpi} DPI")
+        all_words_data = []
+        try:
+            images = convert_from_path(pdf_path, dpi=dpi)
+            logger.info(f"Converted PDF to {len(images)} images")
+            for page_num, image in enumerate(images):
+                # Get image dimensions
+                image_width, image_height = image.size
+                # Perform OCR
+                data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
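+                # Log only a count: the OCR text itself may contain the sensitive data being redacted
+                logger.debug(f"OCR returned {len(data['text'])} raw tokens on page {page_num + 1}")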
+                num_words = len(data['text'])
+                for i in range(num_words):
+                    word_text = data['text'][i].strip()
+                    # Confidence may arrive as an int, a float, or a numeric string
+                    confidence = int(float(data['conf'][i]))
+                    # Filter out empty or low-confidence words
+                    if word_text and confidence > 0:
+                        all_words_data.append({
+                            'text': word_text,
+                            'box': (data['left'][i], data['top'][i],
+                                   data['width'][i], data['height'][i]),
+                            'page': page_num + 1,
+                            'confidence': confidence,
+                            'image_width': image_width,
+                            'image_height': image_height
+                        })
+                logger.info(f"Processed page {page_num + 1}: {len([w for w in all_words_data if w['page'] == page_num + 1])} words")
+            logger.info(f"OCR complete: {len(all_words_data)} total words extracted")
+            return all_words_data
+        except Exception as e:
+            logger.error(f"Error during OCR: {str(e)}")
+            raise
+    def run_ner(self, text: str) -> List[Dict]:
+        """
+        Run NER on text
+        Args:
+            text: Input text
+        Returns:
+            List of identified entities
+        """
+        if not self.ner_pipeline:
+            raise RuntimeError("NER model not loaded")
+        logger.info(f"Running NER on text of length {len(text)}")
+        try:
+            results = self.ner_pipeline(text)
+            logger.info(f"NER identified {len(results)} entities")
+            return results
+        except Exception as e:
+            logger.error(f"Error during NER: {str(e)}")
+            raise
+    def map_entities_to_boxes(self, ner_results: List[Dict],
+                             ocr_data: List[Dict]) -> List[Dict]:
+        """
+        Map NER entities to OCR bounding boxes
+        Args:
+            ner_results: List of NER entities
+            ocr_data: List of OCR word data
+        Returns:
+            List of mapped entities with bounding boxes
+        """
+        logger.info("Mapping NER entities to OCR bounding boxes")
+        mapped_entities = []
+        # Create character span mapping
+        ocr_word_char_spans = []
+        current_char_index = 0
+        for ocr_data_idx, word_info in enumerate(ocr_data):
+            word_text = word_info['text']
+            length = len(word_text)
+            ocr_word_char_spans.append({
+                'ocr_data_idx': ocr_data_idx,
+                'start_char': current_char_index,
+                'end_char': current_char_index + length
+            })
+            current_char_index += length + 1
+        # Map each NER entity to OCR words
+        for ner_entity in ner_results:
+            ner_entity_type = ner_entity['entity_group']
+            ner_start = ner_entity['start']
+            ner_end = ner_entity['end']
+            ner_word = ner_entity['word']
+            matching_ocr_words = []
+            for ocr_word_span in ocr_word_char_spans:
+                ocr_start = ocr_word_span['start_char']
+                ocr_end = ocr_word_span['end_char']
+                # Check for overlap
+                if max(ocr_start, ner_start) < min(ocr_end, ner_end):
+                    matching_ocr_words.append(ocr_data[ocr_word_span['ocr_data_idx']])
+            if matching_ocr_words:
+                mapped_entities.append({
+                    'entity_type': ner_entity_type,
+                    'entity_text': ner_word,
+                    'words': matching_ocr_words
+                })
+        logger.info(f"Mapped {len(mapped_entities)} entities to bounding boxes")
+        return mapped_entities
+    def create_redacted_pdf(self, original_pdf_path: str,
+                           mapped_entities: List[Dict],
+                           output_path: str) -> str:
+        """
+        Create redacted PDF with black rectangles over entities
+        Args:
+            original_pdf_path: Path to original PDF
+            mapped_entities: List of entities with bounding boxes
+            output_path: Path for output PDF
+        Returns:
+            Path to redacted PDF
+        """
+        logger.info(f"Creating redacted PDF: {output_path}")
+        try:
+            reader = PdfReader(original_pdf_path)
+            writer = PdfWriter()
+            for page_num in range(len(reader.pages)):
+                page = reader.pages[page_num]
+                media_box = page.mediabox
+                page_width = float(media_box.width)
+                page_height = float(media_box.height)
+                writer.add_page(page)
+                page_entities = 0
+                for entity_info in mapped_entities:
+                    for word_info in entity_info['words']:
+                        if word_info['page'] == page_num + 1:
+                            x, y, w, h = word_info['box']
+                            # Get image dimensions
+                            image_width = word_info['image_width']
+                            image_height = word_info['image_height']
+                            # Scale coordinates
+                            scale_x = page_width / image_width
+                            scale_y = page_height / image_height
+                            x_scaled = x * scale_x
+                            y_scaled = y * scale_y
+                            w_scaled = w * scale_x
+                            h_scaled = h * scale_y
+                            # Convert to PDF coordinates
+                            llx = x_scaled
+                            lly = page_height - (y_scaled + h_scaled)
+                            urx = x_scaled + w_scaled
+                            ury = page_height - y_scaled
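+                            # Create a filled black square annotation over the word.
+                            # Note: this only covers the content visually; the underlying
+                            # page content is not removed and may remain extractable.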
+                            redaction_annotation = DictionaryObject()
+                            redaction_annotation.update({
+                                NameObject("/Type"): NameObject("/Annot"),
+                                NameObject("/Subtype"): NameObject("/Square"),
+                                NameObject("/Rect"): ArrayObject([
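+                                    # NumberObject stores integers, so round outward
+                                    # to make sure the box fully covers the word
+                                    NumberObject(int(llx)),
+                                    NumberObject(int(lly)),
+                                    NumberObject(int(urx) + 1),
+                                    NumberObject(int(ury) + 1),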
+                                ]),
+                                NameObject("/C"): ArrayObject([
+                                    NumberObject(0), NumberObject(0), NumberObject(0)
+                                ]),
+                                NameObject("/IC"): ArrayObject([
+                                    NumberObject(0), NumberObject(0), NumberObject(0)
+                                ]),
+                                NameObject("/BS"): DictionaryObject({
+                                    NameObject("/W"): NumberObject(0)
+                                })
+                            })
+                            writer.add_annotation(page_number=page_num,
+                                                annotation=redaction_annotation)
+                            page_entities += 1
+                logger.info(f"Page {page_num + 1}: Added {page_entities} redactions")
+            # Write output
+            with open(output_path, "wb") as output_file:
+                writer.write(output_file)
+            logger.info(f"Redacted PDF created successfully: {output_path}")
+            return output_path
+        except Exception as e:
+            logger.error(f"Error creating redacted PDF: {str(e)}")
+            raise
+    def redact_document(self, pdf_path: str, output_path: str,
+                       dpi: int = 300,
+                       entity_filter: Optional[List[str]] = None) -> Dict:
+        """
+        Complete redaction pipeline
+        Args:
+            pdf_path: Path to input PDF
+            output_path: Path for output PDF
+            dpi: DPI for OCR
+            entity_filter: List of entity types to redact (None = all). Valid
+                values: account_number, private_address, private_email,
+                private_person, private_phone, private_url, private_date, secret
+        Returns:
+            Dictionary with redaction results
+        """
+        logger.info(f"Starting redaction pipeline for {pdf_path}")
+        # Step 1: OCR
+        ocr_data = self.perform_ocr(pdf_path, dpi)
+        # Step 2: Extract text
+        full_text = " ".join([word['text'] for word in ocr_data])
+        # Step 3: NER
+        ner_results = self.run_ner(full_text)
+        # Step 4: Map entities to boxes
+        mapped_entities = self.map_entities_to_boxes(ner_results, ocr_data)
+        # Step 5: Filter entities if requested
+        if entity_filter:
+            mapped_entities = [
+                e for e in mapped_entities
+                if e['entity_type'] in entity_filter
+            ]
+            logger.info(f"Filtered to {len(mapped_entities)} entities of types: {entity_filter}")
+        # Step 6: Create redacted PDF
+        self.create_redacted_pdf(pdf_path, mapped_entities, output_path)
+        return {
+            'output_path': output_path,
+            'total_words': len(ocr_data),
+            'total_entities': len(ner_results),
+            'redacted_entities': len(mapped_entities),
+            'entities': mapped_entities
+        }
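The character-span mapping in `map_entities_to_boxes()` can be illustrated with a small standalone example. The sketch below assumes, as `redact_document()` does, that the NER input is the OCR words joined by single spaces; `word_spans` and `overlapping_words` are hypothetical helpers written for this illustration only.

```python
from typing import List, Tuple


def word_spans(words: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character span of each word in ' '.join(words)."""
    spans = []
    cursor = 0
    for word in words:
        spans.append((cursor, cursor + len(word)))
        cursor += len(word) + 1  # +1 for the joining space
    return spans


def overlapping_words(words: List[str], start: int, end: int) -> List[str]:
    """Return the words whose span overlaps the entity span [start, end)."""
    return [
        word
        for word, (s, e) in zip(words, word_spans(words))
        if max(s, start) < min(e, end)  # same overlap test as the module
    ]


words = ["Contact", "John", "Doe", "today"]
text = " ".join(words)
start = text.index("John Doe")
end = start + len("John Doe")
print(overlapping_words(words, start, end))
```

The strict `<` in the overlap test means that spans which merely touch (for example an entity ending exactly where the next word starts) are not treated as a match.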

client_example.py ADDED Viewed

	@@ -0,0 +1,142 @@

+"""
+Example client for PDF Redaction API
+"""
+import requests
+from pathlib import Path
+import sys
+def redact_pdf(api_url: str, pdf_path: str, output_path: str = "redacted.pdf",
+               dpi: int = 300, entity_types: str = None):
+    """
+    Redact a PDF file using the API
+    Args:
+        api_url: Base URL of the API
+        pdf_path: Path to the PDF file to redact
+        output_path: Path to save the redacted PDF
+        dpi: DPI for OCR processing
+        entity_types: Comma-separated list of entity types to redact
+    """
+    # Check if file exists
+    if not Path(pdf_path).exists():
+        print(f"Error: File {pdf_path} not found")
+        return False
+    print(f"Uploading {pdf_path}...")
+    # Prepare request
+    files = {"file": open(pdf_path, "rb")}
+    params = {"dpi": dpi}
+    if entity_types:
+        params["entity_types"] = entity_types
+    try:
+        # Upload and redact
+        response = requests.post(f"{api_url}/redact", files=files, params=params)
+        response.raise_for_status()
+        result = response.json()
+        print(f"\nStatus: {result['status']}")
+        print(f"Message: {result['message']}")
+        # Display found entities
+        if result.get('entities'):
+            print("\nEntities redacted:")
+            for i, entity in enumerate(result['entities'], 1):
+                print(f"  {i}. {entity['entity_type']}: {entity['entity_text']} "
+                      f"(Page {entity['page']}, {entity['word_count']} words)")
+        # Download redacted file
+        job_id = result['job_id']
+        print("\nDownloading redacted PDF...")
+        download_response = requests.get(f"{api_url}/download/{job_id}")
+        download_response.raise_for_status()
+        # Save file
+        with open(output_path, "wb") as f:
+            f.write(download_response.content)
+        print(f"✓ Redacted PDF saved to: {output_path}")
+        # Cleanup (optional)
+        # requests.delete(f"{api_url}/cleanup/{job_id}")
+        return True
+    except requests.exceptions.RequestException as e:
+        print(f"Error: {e}")
+        return False
+    finally:
+        files["file"].close()
+def check_health(api_url: str):
+    """Check API health"""
+    try:
+        response = requests.get(f"{api_url}/health")
+        response.raise_for_status()
+        data = response.json()
+        print(f"API Status: {data['status']}")
+        print(f"Version: {data['version']}")
+        print(f"Model Loaded: {data['model_loaded']}")
+        return True
+    except requests.exceptions.RequestException as e:
+        print(f"Error checking health: {e}")
+        return False
+def get_stats(api_url: str):
+    """Get API statistics"""
+    try:
+        response = requests.get(f"{api_url}/stats")
+        response.raise_for_status()
+        data = response.json()
+        print("API Statistics:")
+        print(f"  Pending uploads: {data['pending_uploads']}")
+        print(f"  Processed files: {data['processed_files']}")
+        print(f"  Model loaded: {data['model_loaded']}")
+        return True
+    except requests.exceptions.RequestException as e:
+        print(f"Error getting stats: {e}")
+        return False
+if __name__ == "__main__":
+    # Example usage
+    # For local development
+    API_URL = "http://localhost:7860"
+    # For HuggingFace Spaces (replace with your space URL)
+    # API_URL = "https://your-username-pdf-redaction-api.hf.space"
+    if len(sys.argv) < 2:
+        print("Usage:")
+        print("  python client_example.py <pdf_file> [output_file] [dpi]")
+        print("\nOr check health:")
+        print("  python client_example.py --health")
+        print("\nOr get stats:")
+        print("  python client_example.py --stats")
+        sys.exit(1)
+    if sys.argv[1] == "--health":
+        check_health(API_URL)
+    elif sys.argv[1] == "--stats":
+        get_stats(API_URL)
+    else:
+        pdf_path = sys.argv[1]
+        output_path = sys.argv[2] if len(sys.argv) > 2 else "redacted.pdf"
+        dpi = int(sys.argv[3]) if len(sys.argv) > 3 else 300
+        # Optional: Filter specific entity types
+        # entity_types = "private_person,private_email"  # Only redact person names and email addresses
+        entity_types = None  # Redact all entity types
+        redact_pdf(API_URL, pdf_path, output_path, dpi, entity_types)
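The positional `sys.argv` handling above could also be expressed with `argparse`, which gives usage text and type checking for free. This is only a sketch of one possible design: `build_parser` is a hypothetical function, and the argument names are chosen to mirror the usage message printed by the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser mirroring the script's usage message (hypothetical)."""
    parser = argparse.ArgumentParser(description="PDF Redaction API client")
    parser.add_argument("pdf_file", nargs="?", help="PDF file to redact")
    parser.add_argument("output_file", nargs="?", default="redacted.pdf",
                        help="Where to save the redacted PDF")
    parser.add_argument("dpi", nargs="?", type=int, default=300,
                        help="DPI for OCR processing")
    parser.add_argument("--health", action="store_true", help="Check API health")
    parser.add_argument("--stats", action="store_true", help="Show API statistics")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["document.pdf", "out.pdf", "150"])
print(args.pdf_file, args.output_file, args.dpi)
```

With this parser, `--health` and `--stats` become boolean flags on `args`, and a non-numeric `dpi` is rejected with a usage error instead of an unhandled `ValueError`.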

client_supabase.py ADDED Viewed

	@@ -0,0 +1,9 @@

+from supabase import create_client, Client
+import os
+from dotenv import load_dotenv
+load_dotenv()
+SUPABASE_URL = os.getenv("SUPABASE_URL")
+SUPABASE_KEY = os.getenv("SERVICE_ROLE_KEY")  # server-side key
+supabase: Client = create_client(SUPABASE_URL, SUPABASE_KEY)

docker-compose.yml ADDED Viewed

	@@ -0,0 +1,48 @@

+version: '3.8'
+services:
+  api:
+    build:
+      context: .
+      dockerfile: Dockerfile
+    ports:
+      - "7860:7860"
+    volumes:
+      # Mount code for development (hot reload)
+      - .:/app
+      # Persistent storage for uploads/outputs
+      - ./uploads:/app/uploads
+      - ./outputs:/app/outputs
+    environment:
+      - PYTHONUNBUFFERED=1
+      - HF_HOME=/app/cache
+      - LOG_LEVEL=DEBUG
+    command: uvicorn main:app --host 0.0.0.0 --port 7860 --reload
+    restart: unless-stopped
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:7860/health"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+      start_period: 40s
+  # Optional: Add nginx for production
+  # nginx:
+  #   image: nginx:alpine
+  #   ports:
+  #     - "80:80"
+  #   volumes:
+  #     - ./nginx.conf:/etc/nginx/nginx.conf
+  #   depends_on:
+  #     - api
+  # Optional: Add Redis for caching
+  # redis:
+  #   image: redis:alpine
+  #   ports:
+  #     - "6379:6379"
+  #   volumes:
+  #     - redis-data:/data
+# volumes:
+#   redis-data:

main.py ADDED Viewed

	@@ -0,0 +1,344 @@

+"""
+FastAPI application for PDF redaction using NER
+"""
+from fastapi import FastAPI, File, UploadFile, HTTPException, BackgroundTasks
+from fastapi.responses import FileResponse
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+from typing import List, Optional, Dict
+import uvicorn
+import os
+import uuid
+import shutil
+from pathlib import Path
+import logging
+import sys
+from app.redaction import PDFRedactor
+from client_supabase import supabase  # Supabase client in separate file
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    stream=sys.stdout,
+    force=True,
+    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+)
+logger = logging.getLogger(__name__)
+# Initialize FastAPI app
+app = FastAPI(
+    title="PDF Redaction API",
+    description="Redact sensitive information from PDFs using Named Entity Recognition",
+    version="1.0.0"
+)
+# CORS middleware
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# Create directories
+UPLOAD_DIR = Path("uploads")
+OUTPUT_DIR = Path("outputs")
+UPLOAD_DIR.mkdir(exist_ok=True)
+OUTPUT_DIR.mkdir(exist_ok=True)
+# Initialize redactor
+redactor = PDFRedactor()
+# ---------------- Response Models ----------------
+class RedactionEntity(BaseModel):
+    entity_type: str
+    entity_text: str
+    page: int
+    word_count: int
+class RedactionResponse(BaseModel):
+    job_id: str
+    status: str
+    message: str
+    entities: Optional[List[RedactionEntity]] = None
+    redacted_file_url: Optional[str] = None
+class RedactionStatusResponse(BaseModel):
+    request_id: str
+    status: str
+    files: List[str]
+    message: str
+class HealthResponse(BaseModel):
+    status: str
+    version: str
+    model_loaded: bool
+# ---------------- DB Status Helpers ----------------
+def set_request_status(request_id: str, status: str):
+    """Update the status column in document_requests for the given request_id."""
+    supabase.from_("document_requests").update({"status": status}).eq("id", request_id).execute()
+    logger.info(f"Request {request_id} status -> {status}")
+def get_request_status(request_id: str) -> str:
+    """Fetch current status from document_requests."""
+    response = (
+        supabase
+        .from_("document_requests")
+        .select("status")
+        .eq("id", request_id)
+        .maybe_single()
+        .execute()
+    )
+    if response.data:
+        return response.data["status"]
+    return "not_found"
+# ---------------- Helper Functions ----------------
+def get_public_url(bucket: str, storage_path: str) -> str:
+    return f"{os.getenv('SUPABASE_URL')}/storage/v1/object/public/{bucket}/{storage_path}"
+def cleanup_files(job_id: str):
+    """Delete the uploaded PDF for a job once it is no longer needed."""
+    try:
+        upload_path = UPLOAD_DIR / f"{job_id}.pdf"
+        if upload_path.exists():
+            upload_path.unlink()
+        logger.info(f"Cleaned up files for job {job_id}")
+    except Exception as e:
+        logger.error(f"Error cleaning up files for job {job_id}: {str(e)}")
+def cleanup_temp_files(paths: List[Path]):
+    for path in paths:
+        if path.exists():
+            path.unlink()
+def download_file_from_supabase(bucket: str, storage_path: str, local_path: Path):
+    logger.info(f"Downloading {storage_path} to {local_path}")
+    data = supabase.storage.from_(bucket).download(storage_path)
+    if not data:
+        raise Exception(f"Failed to download {storage_path}")
+    with local_path.open("wb") as f:
+        f.write(data)
+def upload_file_to_supabase(bucket: str, storage_path: str, local_path: Path):
+    logger.info(f"Uploading {local_path} to {storage_path}")
+    with local_path.open("rb") as f:
+        content = f.read()
+    supabase.storage.from_(bucket).upload(
+        path=storage_path,
+        file=content,
+        file_options={
+            "upsert": "true",
+            "content-type": "application/pdf"
+        }
+    )
+def redact_request(request_id: str, bucket: str = "doc_storage"):
+    """
+    Background task: redact all files for a given request_id.
+    DB writes: 2 total — one at start (redacting), one at end (redacted | failed).
+    The 'pending' write is done by the endpoint before this task is dispatched.
+    """
+    try:
+        logger.info(f"Starting redaction for request {request_id}")
+        # Write 1: mark as redacting
+        set_request_status(request_id, "redacting")
+        response = (
+            supabase
+            .from_("request_files")
+            .select("id, storage_path")
+            .eq("request_id", request_id)
+            .eq("file_role","seed")
+            .execute()
+        )
+        files = response.data
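+        if not files:
+            # NOTE: the raise below is caught by the outer except block, which
+            # overwrites this "approved" status with "failed".
+            set_request_status(request_id, "approved")
+            raise Exception(f"No files found for request {request_id}")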
+        for file in files:
+            storage_path = file["storage_path"]
+            local_upload = UPLOAD_DIR / f"{uuid.uuid4()}.pdf"
+            local_output = OUTPUT_DIR / f"{uuid.uuid4()}_redacted.pdf"
+            try:
+                download_file_from_supabase(bucket, storage_path, local_upload)
+                redactor.redact_document(pdf_path=str(local_upload), output_path=str(local_output))
+                # Overwrites the original object at storage_path with the redacted PDF
+                upload_file_to_supabase(bucket, storage_path, local_output)
+            finally:
+                # Remove local copies even if a step above raised
+                cleanup_temp_files([local_upload, local_output])
+        # Write 2: mark as redacted
+        set_request_status(request_id, "redacted")
+    except Exception as e:
+        logger.error(f"Redaction failed for {request_id}: {str(e)}")
+        # Write 2 (error path): mark as failed
+        set_request_status(request_id, "failed")
+# ----------------- Existing Endpoints -----------------
+@app.get("/", response_model=HealthResponse)
+async def root():
+    return HealthResponse(
+        status="healthy",
+        version="1.0.0",
+        model_loaded=redactor.is_model_loaded()
+    )
+@app.get("/health", response_model=HealthResponse)
+async def health_check():
+    return HealthResponse(
+        status="healthy",
+        version="1.0.0",
+        model_loaded=redactor.is_model_loaded()
+    )
+@app.post("/redact", response_model=RedactionResponse)
+async def redact_pdf(
+    background_tasks: BackgroundTasks,
+    file: UploadFile = File(...),
+    dpi: int = 300,
+    entity_types: Optional[str] = None
+):
+    if not file.filename or not file.filename.lower().endswith('.pdf'):
+        raise HTTPException(status_code=400, detail="Only PDF files are supported")
+    job_id = str(uuid.uuid4())
+    upload_path = UPLOAD_DIR / f"{job_id}.pdf"
+    output_path = OUTPUT_DIR / f"{job_id}_redacted.pdf"
+    try:
+        with upload_path.open("wb") as buffer:
+            shutil.copyfileobj(file.file, buffer)
+        entity_filter = None
+        if entity_types:
+            entity_filter = [et.strip() for et in entity_types.split(',')]
+        result = redactor.redact_document(
+            pdf_path=str(upload_path),
+            output_path=str(output_path),
+            dpi=dpi,
+            entity_filter=entity_filter
+        )
+        response_entities = [
+            RedactionEntity(
+                entity_type=e['entity_type'],
+                entity_text=e['entity_text'],
+                page=e['words'][0]['page'] if e['words'] else 0,
+                word_count=len(e['words'])
+            ) for e in result['entities']
+        ]
+        background_tasks.add_task(cleanup_files, job_id)
+        return RedactionResponse(
+            job_id=job_id,
+            status="completed",
+            message=f"Successfully redacted {len(result['entities'])} entities",
+            entities=response_entities,
+            redacted_file_url=f"/download/{job_id}"
+        )
+    except Exception as e:
+        logger.error(f"Error processing job {job_id}: {str(e)}")
+        if upload_path.exists():
+            upload_path.unlink()
+        if output_path.exists():
+            output_path.unlink()
+        raise HTTPException(status_code=500, detail=f"Error processing PDF: {str(e)}")
+@app.get("/download/{job_id}")
+async def download_redacted_pdf(job_id: str):
+    output_path = OUTPUT_DIR / f"{job_id}_redacted.pdf"
+    if not output_path.exists():
+        raise HTTPException(status_code=404, detail="Redacted file not found")
+    return FileResponse(
+        path=output_path,
+        media_type="application/pdf",
+        filename=f"redacted_{job_id}.pdf"
+    )
+@app.delete("/cleanup/{job_id}")
+async def cleanup_job(job_id: str):
+    try:
+        cleanup_files(job_id)
+        output_path = OUTPUT_DIR / f"{job_id}_redacted.pdf"
+        if output_path.exists():
+            output_path.unlink()
+        return {"message": f"Successfully cleaned up files for job {job_id}"}
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Error cleaning up: {str(e)}")
+@app.get("/stats")
+async def get_stats():
+    upload_count = len(list(UPLOAD_DIR.glob("*.pdf")))
+    output_count = len(list(OUTPUT_DIR.glob("*.pdf")))
+    return {
+        "pending_uploads": upload_count,
+        "processed_files": output_count,
+        "model_loaded": redactor.is_model_loaded()
+    }
+# ----------------- NEW Endpoints -----------------
+@app.post("/redact_by_request/{request_id}", response_model=RedactionStatusResponse)
+async def redact_by_request(request_id: str, background_tasks: BackgroundTasks):
+    # Check current DB status to avoid re-triggering an in-progress job
+    current_status = get_request_status(request_id)
+    if current_status == "redacting":
+        return RedactionStatusResponse(
+            request_id=request_id,
+            status="redacting",
+            files=[],
+            message="Redaction already in progress"
+        )
+    # Mark the request as pending before dispatching the background task;
+    # the task itself then writes "redacting" and finally "redacted" or "failed".
+    set_request_status(request_id, "pending")
+    background_tasks.add_task(redact_request, request_id)
+    return RedactionStatusResponse(
+        request_id=request_id,
+        status="pending",
+        files=[],
+        message="Redaction started in background"
+    )
+@app.get("/redaction_status/{request_id}", response_model=RedactionStatusResponse)
+async def get_redaction_status(request_id: str):
+    status = get_request_status(request_id)
+    files: List[str] = []
+    if status == "redacted":
+        response = (
+            supabase
+            .from_("request_files")
+            .select("storage_path")
+            .eq("file_role","seed")
+            .eq("request_id", request_id)
+            .execute()
+        )
+        if response.data:
+            files = [
+                get_public_url("doc_storage", row["storage_path"])
+                for row in response.data
+            ]
+    message = {
+        "redacted": "Redaction completed",
+        "pending": "Redaction pending",
+        "redacting": "Redaction in progress",
+        "failed": "Redaction failed",
+        "not_found": "Request not found",
+    }.get(status, status)
+    return RedactionStatusResponse(
+        request_id=request_id,
+        status=status,
+        files=files,
+        message=message
+    )
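
The status lifecycle behind `redact_request` and the two new endpoints (pending, then redacting, then redacted or failed) can be sketched without Supabase. This is a minimal illustration under stated assumptions, not the code from this commit: `InMemoryStatusStore`, `run_redaction_job`, and `fake_redact` are hypothetical stand-ins for the database helpers and the redactor. The sketch also wraps the per-file work in try/finally so that local copies are removed even when a step fails.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class InMemoryStatusStore:
    """Hypothetical stand-in for the request-status table (not the Supabase API)."""

    def __init__(self) -> None:
        self.current: Dict[str, str] = {}
        self.history: Dict[str, List[str]] = {}

    def set(self, request_id: str, status: str) -> None:
        self.current[request_id] = status
        self.history.setdefault(request_id, []).append(status)

    def get(self, request_id: str) -> str:
        return self.current.get(request_id, "not_found")


def run_redaction_job(
    request_id: str,
    sources: List[bytes],
    redact: Callable[[Path, Path], None],
    store: InMemoryStatusStore,
    work_dir: Path,
) -> None:
    """Mirror the task's flow: 'redacting' first, then 'redacted' or 'failed'."""
    store.set(request_id, "redacting")
    try:
        for index, payload in enumerate(sources):
            local_in = work_dir / f"{request_id}_{index}.pdf"
            local_out = work_dir / f"{request_id}_{index}_redacted.pdf"
            try:
                local_in.write_bytes(payload)
                redact(local_in, local_out)
            finally:
                # Remove local copies whether or not redaction succeeded.
                for path in (local_in, local_out):
                    if path.exists():
                        path.unlink()
        store.set(request_id, "redacted")
    except Exception:
        store.set(request_id, "failed")


def fake_redact(src: Path, dst: Path) -> None:
    """Hypothetical redactor: rejects empty input, otherwise copies the bytes."""
    data = src.read_bytes()
    if not data:
        raise ValueError("empty document")
    dst.write_bytes(data)


if __name__ == "__main__":
    store = InMemoryStatusStore()
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        store.set("req-ok", "pending")  # the endpoint writes this before dispatch
        run_redaction_job("req-ok", [b"%PDF-1.4 demo"], fake_redact, store, work_dir)
        run_redaction_job("req-bad", [b""], fake_redact, store, work_dir)
        print(store.history["req-ok"])
        print(store.get("req-bad"))
        print(list(work_dir.iterdir()))
```

Keeping the status writes at the outer level and the file cleanup in an inner `finally` means a failure on any single file still ends with one terminal status and leaves no unredacted copy in the working directory.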

outputs/.gitkeep ADDED

File without changes

requirements.txt ADDED

	@@ -0,0 +1,14 @@

+fastapi==0.109.0
+uvicorn[standard]==0.27.0
+python-multipart==0.0.6
+transformers>=4.45,<5.0
+accelerate>=0.30
+torch==2.2.2
+pypdf==4.0.1
+pdf2image==1.17.0
+pytesseract==0.3.10
+Pillow==10.2.0
+pydantic==2.5.3
+python-dotenv==1.0.0
+supabase
+numpy==1.26.4

uploads/.gitkeep ADDED

File without changes