Quick Start Guide 🚀

Local Development (5 minutes)

1. Install System Dependencies

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils

macOS:

brew install tesseract poppler

Windows:

Download Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
Download Poppler: https://github.com/oschwartz10612/poppler-windows/releases
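
Before moving on, it can help to confirm that the executables are actually reachable from your shell. The sketch below is not part of the project; it assumes the API calls the `tesseract` binary (from tesseract-ocr) and `pdftoppm` (shipped with Poppler), which is typical but not confirmed by this guide. The function name `missing_tools` is hypothetical.

```python
import shutil

# Assumption: these are the binaries the API needs at runtime.
# tesseract-ocr provides `tesseract`; poppler-utils / poppler provides `pdftoppm`.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

On Windows, a tool reported as missing usually means its install folder has not been added to PATH yet.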

2. Install Python Dependencies

pip install -r requirements.txt

3. Run the Server

python main.py

The API will be available at: http://localhost:7860

4. Test with cURL

# Health check
curl http://localhost:7860/health

# Redact a PDF
curl -X POST "http://localhost:7860/redact" \
  -F "file=@your_document.pdf" \
  -F "dpi=300"

5. Access API Documentation

Open in browser: http://localhost:7860/docs

Using Docker (3 minutes)

1. Build Image

docker build -t pdf-redaction-api .

2. Run Container

docker run -p 7860:7860 pdf-redaction-api

3. Test

curl http://localhost:7860/health
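
A freshly started container may need some time before /health answers, for example while the model is still downloading. The sketch below polls the endpoint until it responds with HTTP 200. It only assumes that /health returns 200 once the server is up; the helper names, attempt count, and delay are illustrative choices, not project settings.

```python
import time
import urllib.request


def http_ok(url, timeout=5):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections, and timeouts are all OSError subclasses.
        return False


def wait_until_healthy(url, attempts=30, delay=2.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return False if all attempts fail."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to sleep after the final attempt
    return False


# Example call (commented out so that importing this file makes no network requests):
# wait_until_healthy("http://localhost:7860/health")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.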

Deploy to HuggingFace Spaces (10 minutes)

1. Create Space

Go to https://huggingface.co/spaces
Click "Create new Space"
Name: pdf-redaction-api
SDK: Docker
Click "Create Space"

2. Push Code

# Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
cd pdf-redaction-api

# Copy all project files (the * glob skips hidden files such as .dockerignore; copy those separately if needed)
cp -r /path/to/project/* .

# Commit and push
git add .
git commit -m "Initial deployment"
git push

3. Wait for Build

Monitor at: https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api

4. Test Your Deployed API

curl https://YOUR_USERNAME-pdf-redaction-api.hf.space/health

Example Usage

Python Client

import requests

# Upload and redact (the with-block closes the file once the upload finishes)
with open("document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": f},
        params={"dpi": 300},
    )
response.raise_for_status()  # stop here if the upload failed

result = response.json()
job_id = result["job_id"]

# Download redacted PDF
redacted = requests.get(f"http://localhost:7860/download/{job_id}")
redacted.raise_for_status()
with open("redacted.pdf", "wb") as f:
    f.write(redacted.content)

print(f"Redacted {len(result['entities'])} entities")

JavaScript/Node.js

const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');

async function redactPDF() {
  const form = new FormData();
  form.append('file', fs.createReadStream('document.pdf'));
  
  // Upload and redact
  const response = await axios.post(
    'http://localhost:7860/redact',
    form,
    {
      headers: form.getHeaders(),
      params: { dpi: 300 }
    }
  );
  
  const { job_id } = response.data;
  
  // Download redacted PDF
  const redacted = await axios.get(
    `http://localhost:7860/download/${job_id}`,
    { responseType: 'arraybuffer' }
  );
  
  fs.writeFileSync('redacted.pdf', redacted.data);
  console.log('Redaction complete!');
}

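// Report failures instead of leaving an unhandled promise rejection
redactPDF().catch((err) => console.error('Redaction failed:', err.message));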

cURL Advanced

# Redact only specific entity types
curl -X POST "http://localhost:7860/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300" \
  -F "entity_types=PER,ORG"

# Get statistics
curl http://localhost:7860/stats

# Download specific file
curl -O -J http://localhost:7860/download/JOB_ID_HERE

Common Use Cases

1. Redact All Personal Information

import requests

with open("resume.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": f},
        params={"dpi": 300},
    )

2. Redact Only Names and Organizations

import requests

with open("contract.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": f},
        params={
            "dpi": 300,
            "entity_types": "PER,ORG",
        },
    )
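
When the entity types come from a list or from user input, a small helper can build the comma-separated string safely. This is a sketch under assumptions: only PER and ORG appear in this guide, so the upper-casing and de-duplication below are guesses about what the API expects, and `entity_types_param` is a hypothetical name.

```python
def entity_types_param(types):
    """Join entity labels into the comma-separated form used by `entity_types`."""
    cleaned = []
    for label in types:
        label = label.strip().upper()  # assumption: labels are upper-case, like PER and ORG
        if not label:
            continue  # ignore empty entries
        if "," in label:
            raise ValueError(f"label must not contain a comma: {label!r}")
        if label not in cleaned:
            cleaned.append(label)  # keep first occurrence, preserve order
    if not cleaned:
        raise ValueError("at least one entity type is required")
    return ",".join(cleaned)


print(entity_types_param(["per", " ORG ", "PER"]))  # prints PER,ORG
```

The result can be passed directly as the `entity_types` value in the request parameters.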

3. Fast Processing (Lower Quality)

import requests

with open("large_doc.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": f},
        params={"dpi": 150},  # Faster but less accurate
    )

4. High Quality (Slower)

import requests

with open("important.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": f},
        params={"dpi": 600},  # Best quality, slowest
    )
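
To see why DPI affects speed and memory so strongly, a quick calculation helps. The sketch below assumes each page is rendered to an uncompressed 8-bit RGB image and uses US Letter (8.5 x 11 in) as the page size; both are illustrative assumptions, not details confirmed by this guide. Pixel count grows with the square of the DPI, so doubling the DPI quadruples the work per page.

```python
def raster_size(dpi, width_in=8.5, height_in=11.0, channels=3):
    """Return (width_px, height_px, size_bytes) for one page rendered at `dpi`.

    Assumes an uncompressed 8-bit image with `channels` channels (3 = RGB).
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * channels


for dpi in (150, 300, 600):
    w, h, size = raster_size(dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {size / 1e6:.1f} MB per page")

# Output:
# 150 DPI: 1275 x 1650 px, about 6.3 MB per page
# 300 DPI: 2550 x 3300 px, about 25.2 MB per page
# 600 DPI: 5100 x 6600 px, about 101.0 MB per page
```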

Troubleshooting

"Model not loaded"

Problem: NER model failed to load
Solution: Check your internet connection and wait for the model download to finish, then retry the request

"Tesseract not found"

Problem: OCR engine not installed
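Solution: Install the tesseract-ocr system package (tesseract on macOS) as shown in step 1 of Local Development, and make sure the tesseract executable is on your PATH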

"Poppler not found"

Problem: PDF converter not installed
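Solution: Install the poppler-utils system package (poppler on macOS) as shown in step 1 of Local Development, and make sure its executables are on your PATH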

Slow processing

Problem: Redaction takes too long
Solution: Lower the dpi parameter to 150-200, accepting somewhat lower accuracy

Out of memory

Problem: Large PDF crashes the API
Solution:

Process one page at a time
Increase container memory
Lower DPI
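
Processing one page (or a few pages) at a time comes down to walking the document in small page ranges. The helper below is a minimal sketch that produces 1-indexed, inclusive ranges. If the project renders pages with pdf2image (an assumption, not confirmed by this guide), each pair maps onto the `first_page` and `last_page` arguments of `convert_from_path`. The name `page_batches` is hypothetical.

```python
def page_batches(total_pages, batch_size=1):
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    for first in range(1, total_pages + 1, batch_size):
        # The last batch may be shorter than batch_size.
        yield first, min(first + batch_size - 1, total_pages)


print(list(page_batches(5, batch_size=2)))  # prints [(1, 2), (3, 4), (5, 5)]
```

Smaller batches keep fewer rendered pages in memory at once, at the cost of more rendering calls.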

Next Steps

✅ Read full README.md for API details
✅ Check DEPLOYMENT.md for production setup
✅ Review STRUCTURE.md for code organization
✅ Run tests: pytest tests/
✅ Add authentication for production use
✅ Set up monitoring and logging

Support

📖 API Docs: http://localhost:7860/docs
🐛 Issues: Open an issue on your repository
💬 HuggingFace: Community forums

Happy redacting! 🔒