Quick Start Guide
Local Development (5 minutes)
1. Install System Dependencies
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils
macOS:
brew install tesseract poppler
Windows:
- Download Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
- Download Poppler: https://github.com/oschwartz10612/poppler-windows/releases
2. Install Python Dependencies
pip install -r requirements.txt
3. Run the Server
python main.py
The API will be available at: http://localhost:7860
4. Test with cURL
# Health check
curl http://localhost:7860/health
# Redact a PDF
curl -X POST "http://localhost:7860/redact" \
  -F "file=@your_document.pdf" \
  -F "dpi=300"
5. Access API Documentation
Open in browser: http://localhost:7860/docs
Using Docker (3 minutes)
1. Build Image
docker build -t pdf-redaction-api .
2. Run Container
docker run -p 7860:7860 pdf-redaction-api
3. Test
curl http://localhost:7860/health
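A freshly started container may refuse connections for a short while, so a single health check can fail even when nothing is wrong. The sketch below polls the /health endpoint from this guide until it answers with HTTP 200. The function names, the retry count, and the delay are illustrative assumptions, not part of the project.

```python
import time
import urllib.error
import urllib.request


def probe_health(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def wait_until_healthy(url, attempts=30, delay=2.0,
                       probe=probe_health, sleep=time.sleep):
    """Poll `url` up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_healthy("http://localhost:7860/health") before the first /redact request returns True once the API responds, or False after roughly a minute with the defaults shown.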
Deploy to HuggingFace Spaces (10 minutes)
1. Create Space
- Go to https://huggingface.co/spaces
- Click "Create new Space"
- Name: pdf-redaction-api
- SDK: Docker
- Click "Create Space"
2. Push Code
# Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
cd pdf-redaction-api
# Copy all project files
cp -r /path/to/project/* .
# Commit and push
git add .
git commit -m "Initial deployment"
git push
3. Wait for Build
Monitor at: https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
4. Test Your Deployed API
curl https://YOUR_USERNAME-pdf-redaction-api.hf.space/health
Example Usage
Python Client
import requests

# Upload and redact (the context manager closes the file handle afterwards)
with open("document.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": pdf},
        params={"dpi": 300},
    )
response.raise_for_status()  # Stop early on HTTP errors
result = response.json()
job_id = result["job_id"]

# Download redacted PDF
redacted = requests.get(f"http://localhost:7860/download/{job_id}")
redacted.raise_for_status()
with open("redacted.pdf", "wb") as f:
    f.write(redacted.content)

print(f"Redacted {len(result['entities'])} entities")
JavaScript/Node.js
const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');
async function redactPDF() {
  const form = new FormData();
  form.append('file', fs.createReadStream('document.pdf'));

  // Upload and redact
  const response = await axios.post(
    'http://localhost:7860/redact',
    form,
    {
      headers: form.getHeaders(),
      params: { dpi: 300 }
    }
  );
  const { job_id } = response.data;

  // Download redacted PDF
  const redacted = await axios.get(
    `http://localhost:7860/download/${job_id}`,
    { responseType: 'arraybuffer' }
  );
  fs.writeFileSync('redacted.pdf', redacted.data);
  console.log('Redaction complete!');
}

redactPDF();
cURL Advanced
# Redact only specific entity types
curl -X POST "http://localhost:7860/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300" \
  -F "entity_types=PER,ORG"
# Get statistics
curl http://localhost:7860/stats
# Download specific file
curl -O -J http://localhost:7860/download/JOB_ID_HERE
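With curl, -O -J saves the download under the filename the server sends in its Content-Disposition header. A Python client can do the same by parsing that header. This is a minimal sketch using only the standard library. It assumes the API actually sends a Content-Disposition header, which this guide does not confirm, and the function name and the default filename are illustrative.

```python
import os
from email.message import Message


def filename_from_disposition(header, default="redacted.pdf"):
    """Extract a safe local filename from a Content-Disposition header value."""
    if not header:
        return default
    msg = Message()
    msg["Content-Disposition"] = header
    name = msg.get_filename()
    if not name:
        return default
    # Keep only the final path component so a hostile header cannot
    # make the client write outside the current directory.
    name = os.path.basename(name.replace("\\", "/"))
    return name or default
```

With requests, the header value would come from response.headers.get("Content-Disposition"). If the header is missing, the function falls back to the default name.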
Common Use Cases
1. Redact All Personal Information
response = requests.post(
    "http://localhost:7860/redact",
    files={"file": open("resume.pdf", "rb")},
    params={"dpi": 300}
)
2. Redact Only Names and Organizations
response = requests.post(
    "http://localhost:7860/redact",
    files={"file": open("contract.pdf", "rb")},
    params={
        "dpi": 300,
        "entity_types": "PER,ORG"
    }
)
3. Fast Processing (Lower Quality)
response = requests.post(
    "http://localhost:7860/redact",
    files={"file": open("large_doc.pdf", "rb")},
    params={"dpi": 150}  # Faster but less accurate
)
4. High Quality (Slower)
response = requests.post(
    "http://localhost:7860/redact",
    files={"file": open("important.pdf", "rb")},
    params={"dpi": 600}  # Best quality, slowest
)
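The speed and quality trade-off between 150, 300, and 600 DPI follows from how many pixels each page becomes when it is rasterized. The worked example below assumes US Letter pages (8.5 x 11 inches) and uncompressed 8-bit RGB images. Both are illustrative assumptions, and the helper names are not part of the API.

```python
def raster_size(dpi, width_in=8.5, height_in=11.0):
    """Pixel dimensions of one page rendered at `dpi` (US Letter by default)."""
    return round(width_in * dpi), round(height_in * dpi)


def raster_bytes(dpi, channels=3, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one page image at 8 bits per channel."""
    width, height = raster_size(dpi, width_in, height_in)
    return width * height * channels


for dpi in (150, 300, 600):
    w, h = raster_size(dpi)
    print(f"{dpi} dpi: {w} x {h} px, {raster_bytes(dpi) / 1e6:.1f} MB per page")
```

Doubling the DPI quadruples the pixel count. Under these assumptions a page grows from about 6 MB at 150 DPI to about 25 MB at 300 DPI and about 101 MB at 600 DPI, so OCR time and memory use rise accordingly.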
Troubleshooting
"Model not loaded"
Problem: NER model failed to load
Solution: Check your internet connection and wait for the model download to finish, then retry
"Tesseract not found"
Problem: OCR engine not installed
Solution: Install tesseract-ocr system package
"Poppler not found"
Problem: PDF converter not installed
Solution: Install poppler-utils system package
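Both errors can be caught before starting the server by checking that the command-line tools are on PATH. The sketch below looks for the tesseract executable and for pdftoppm, one of the executables shipped with Poppler. That the API calls pdftoppm specifically is an assumption, and the function name is illustrative.

```python
import shutil

# Executable name -> Debian/Ubuntu package that provides it.
REQUIRED_TOOLS = {
    "tesseract": "tesseract-ocr",
    "pdftoppm": "poppler-utils",
}


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return {executable: package} for every tool not found on PATH."""
    return {exe: pkg for exe, pkg in required.items() if which(exe) is None}


missing = missing_tools()
for exe, pkg in missing.items():
    print(f"{exe} not found on PATH; install the {pkg} package")
if not missing:
    print("All system dependencies found")
```

On macOS the Homebrew package names are tesseract and poppler, as listed in the install step.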
Slow processing
Problem: Redaction takes too long
Solution: Lower DPI to 150-200
Out of memory
Problem: Large PDF crashes the API
Solution:
- Process one page at a time
- Increase container memory
- Lower DPI
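Processing one page (or a few pages) at a time keeps only a small batch of page images in memory. The sketch below shows just the batching arithmetic: it yields 1-based, inclusive page ranges. If the project rasterizes with pdf2image (an assumption), each range would map to the first_page and last_page arguments of convert_from_path. The function name is illustrative.

```python
def page_batches(total_pages, batch_size=1):
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


# Example: a 5-page document processed two pages at a time.
for first, last in page_batches(5, batch_size=2):
    print(f"process pages {first}-{last}")
```

A smaller batch size lowers peak memory at the cost of more rasterization calls.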
Next Steps
- Read the full README.md for API details
- Check DEPLOYMENT.md for production setup
- Review STRUCTURE.md for code organization
- Run tests: pytest tests/
- Add authentication for production use
- Set up monitoring and logging
Support
- API Docs: http://localhost:7860/docs
- Issues: Create on your repository
- HuggingFace: Community forums
Happy redacting!