# Quick Start Guide 🚀
## Local Development (5 minutes)
### 1. Install System Dependencies
**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils
```
**macOS:**
```bash
brew install tesseract poppler
```
**Windows:**
- Download Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
- Download Poppler: https://github.com/oschwartz10612/poppler-windows/releases
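Before moving on, it can help to confirm that both tools are actually on your `PATH`. The following is a minimal sketch using only the standard library. It assumes the API calls the `tesseract` binary and Poppler's `pdftoppm`, which is a guess about the implementation rather than something this guide states; `REQUIRED_TOOLS` and `missing_tools` are hypothetical names.

```python
import shutil

# Executable name -> package that provides it (Debian/Ubuntu package names).
# Assumption: the API needs these two binaries; adjust if it uses others.
REQUIRED_TOOLS = {
    "tesseract": "tesseract-ocr (OCR engine)",
    "pdftoppm": "poppler-utils (PDF converter)",
}


def missing_tools(which=shutil.which):
    """Return one message per required executable that is not on PATH."""
    return [
        f"{exe} not found: install {package}"
        for exe, package in REQUIRED_TOOLS.items()
        if which(exe) is None
    ]


# Prints nothing when both tools are installed.
for problem in missing_tools():
    print(problem)
```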
### 2. Install Python Dependencies
```bash
pip install -r requirements.txt
```
### 3. Run the Server
```bash
python main.py
```
The API will be available at: `http://localhost:7860`
### 4. Test with cURL
```bash
# Health check
curl http://localhost:7860/health
# Redact a PDF
curl -X POST "http://localhost:7860/redact" \
  -F "file=@your_document.pdf" \
  -F "dpi=300"
```
### 5. Access API Documentation
Open in browser: `http://localhost:7860/docs`
## Using Docker (3 minutes)
### 1. Build Image
```bash
docker build -t pdf-redaction-api .
```
### 2. Run Container
```bash
docker run -p 7860:7860 pdf-redaction-api
```
### 3. Test
```bash
curl http://localhost:7860/health
```
## Deploy to HuggingFace Spaces (10 minutes)
### 1. Create Space
1. Go to https://huggingface.co/spaces
2. Click "Create new Space"
3. Name: `pdf-redaction-api`
4. SDK: **Docker**
5. Click "Create Space"
### 2. Push Code
```bash
# Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api
cd pdf-redaction-api
# Copy all project files
cp -r /path/to/project/* .
# Commit and push
git add .
git commit -m "Initial deployment"
git push
```
### 3. Wait for Build
Monitor at: `https://huggingface.co/spaces/YOUR_USERNAME/pdf-redaction-api`
### 4. Test Your Deployed API
```bash
curl https://YOUR_USERNAME-pdf-redaction-api.hf.space/health
```
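A freshly pushed Space can take a while to build, so a single `curl` may fail simply because the container is not up yet. The sketch below polls `/health` until it answers with HTTP 200 or a deadline passes. It uses only the standard library; `space_base_url`, `http_ok`, and `wait_for_health` are hypothetical helper names, and the URL pattern is copied from the example above (Hugging Face may normalize some characters, so copy the exact URL from your Space page if it differs).

```python
import time
import urllib.error
import urllib.request


def space_base_url(username, space_name):
    """Build the Space URL using the pattern shown in this guide."""
    return f"https://{username}-{space_name}.hf.space"


def http_ok(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a 4xx/5xx response.
        return False


def wait_for_health(base_url, timeout=600.0, interval=10.0, probe=http_ok):
    """Poll `<base_url>/health` until it succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(f"{base_url}/health"):
            return True
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_health(space_base_url("YOUR_USERNAME", "pdf-redaction-api"))` returns `True` once the deployed API responds. The same function works for local or Docker runs with `wait_for_health("http://localhost:7860", timeout=60, interval=2)`.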
## Example Usage
### Python Client
```python
import requests

# Upload and redact (the context manager closes the file afterwards)
with open("document.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:7860/redact",
        files={"file": pdf},
        params={"dpi": 300},
    )
response.raise_for_status()  # fail early on HTTP errors
result = response.json()
job_id = result["job_id"]

# Download redacted PDF
redacted = requests.get(f"http://localhost:7860/download/{job_id}")
redacted.raise_for_status()
with open("redacted.pdf", "wb") as f:
    f.write(redacted.content)

print(f"Redacted {len(result['entities'])} entities")
```
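The client above only reports the total number of entities. If you want a per-type breakdown, something like the following sketch could work. Note the assumption: this guide shows only the `job_id` and `entities` keys of the response, so the `entity_type` field used here is hypothetical; check `/docs` for the real schema and pass the actual key name as `type_key`.

```python
from collections import Counter


def summarize_entities(result, type_key="entity_type"):
    """Count entities per type in a /redact response body.

    `type_key` is an assumed field name, not confirmed by this guide.
    """
    entities = result.get("entities", [])
    return Counter(entity.get(type_key, "UNKNOWN") for entity in entities)


# Stand-in for response.json(); the real payload may look different.
sample = {
    "job_id": "example-job",
    "entities": [
        {"entity_type": "PER"},
        {"entity_type": "ORG"},
        {"entity_type": "PER"},
    ],
}
for entity_type, count in summarize_entities(sample).most_common():
    print(f"{entity_type}: {count}")
```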
### JavaScript/Node.js
```javascript
const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');

async function redactPDF() {
  const form = new FormData();
  form.append('file', fs.createReadStream('document.pdf'));

  // Upload and redact
  const response = await axios.post(
    'http://localhost:7860/redact',
    form,
    {
      headers: form.getHeaders(),
      params: { dpi: 300 }
    }
  );
  const { job_id } = response.data;

  // Download redacted PDF
  const redacted = await axios.get(
    `http://localhost:7860/download/${job_id}`,
    { responseType: 'arraybuffer' }
  );
  fs.writeFileSync('redacted.pdf', redacted.data);
  console.log('Redaction complete!');
}

redactPDF();
```
### cURL Advanced
```bash
# Redact only specific entity types
curl -X POST "http://localhost:7860/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300" \
  -F "entity_types=PER,ORG"
# Get statistics
curl http://localhost:7860/stats
# Download specific file
curl -O -J http://localhost:7860/download/JOB_ID_HERE
```
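The cURL call above sends `dpi` and `entity_types` as form fields, with the entity types joined into one comma-separated string. A small helper can build the same fields in Python. This is only a sketch: `build_redact_fields` is a hypothetical name, and the validation it performs is an assumption rather than a rule taken from the API.

```python
def build_redact_fields(dpi=300, entity_types=None):
    """Build the non-file form fields for POST /redact.

    Mirrors `-F "dpi=300" -F "entity_types=PER,ORG"` from the cURL example.
    """
    if dpi <= 0:
        raise ValueError("dpi must be a positive integer")
    fields = {"dpi": str(dpi)}
    # Drop empty entries and surrounding whitespace before joining.
    cleaned = [t.strip() for t in (entity_types or []) if t.strip()]
    if cleaned:
        fields["entity_types"] = ",".join(cleaned)
    return fields


print(build_redact_fields(300, ["PER", "ORG"]))
# {'dpi': '300', 'entity_types': 'PER,ORG'}
```

With `requests`, passing the result as `data=` alongside `files=` sends form fields the same way `curl -F` does, while `params=` puts the values in the query string instead. Which of the two the server reads depends on how the endpoint is declared, so check `/docs` if a parameter seems to be ignored.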
## Common Use Cases
### 1. Redact All Personal Information
```python
response = requests.post(
    "http://localhost:7860/redact",
    files={"file": open("resume.pdf", "rb")},
    params={"dpi": 300}
)
```
### 2. Redact Only Names and Organizations
```python
response = requests.post(
    "http://localhost:7860/redact",
    files={"file": open("contract.pdf", "rb")},
    params={
        "dpi": 300,
        "entity_types": "PER,ORG"
    }
)
```
### 3. Fast Processing (Lower Quality)
```python
response = requests.post(
    "http://localhost:7860/redact",
    files={"file": open("large_doc.pdf", "rb")},
    params={"dpi": 150}  # Faster but less accurate
)
```
### 4. High Quality (Slower)
```python
response = requests.post(
    "http://localhost:7860/redact",
    files={"file": open("important.pdf", "rb")},
    params={"dpi": 600}  # Best quality, slowest
)
```
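To get a feel for the speed and quality trade-off between 150, 300, and 600 DPI, it helps to work out how large one rendered page becomes. The arithmetic below is a rough estimate that assumes each page is rasterized to an uncompressed 8-bit RGB image at the requested DPI; the API's real memory use depends on its implementation. `page_pixels` and `raster_bytes` are hypothetical helper names.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size (inches) at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def raster_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size of one rendered page (3 bytes per pixel = RGB)."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * bytes_per_pixel


LETTER = (8.5, 11.0)  # US Letter, in inches
for dpi in (150, 300, 600):
    width_px, height_px = page_pixels(*LETTER, dpi)
    megabytes = raster_bytes(*LETTER, dpi) / 1e6
    print(f"{dpi} dpi: {width_px}x{height_px} px, about {megabytes:.1f} MB per page")
# 150 dpi: 1275x1650 px, about 6.3 MB per page
# 300 dpi: 2550x3300 px, about 25.2 MB per page
# 600 dpi: 5100x6600 px, about 101.0 MB per page
```

Doubling the DPI quadruples the pixel count, which is why 600 DPI is noticeably slower and more memory-hungry than 300.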
## Troubleshooting
### "Model not loaded"
**Problem**: NER model failed to load
**Solution**: Check your internet connection and wait for the model download to finish, then retry the request
### "Tesseract not found"
**Problem**: OCR engine not installed
**Solution**: Install the `tesseract-ocr` system package (`tesseract` on Homebrew) and make sure it is on your `PATH`
### "Poppler not found"
**Problem**: PDF converter not installed
**Solution**: Install the `poppler-utils` system package (`poppler` on Homebrew) and make sure it is on your `PATH`
### Slow processing
**Problem**: Redaction takes too long
**Solution**: Lower DPI to 150-200
### Out of memory
**Problem**: Large PDF crashes the API
**Solution**:
- Process one page at a time
- Increase container memory
- Lower DPI
## Next Steps
- ✅ Read full [README.md](README.md) for API details
- ✅ Check [DEPLOYMENT.md](DEPLOYMENT.md) for production setup
- ✅ Review [STRUCTURE.md](STRUCTURE.md) for code organization
- ✅ Run tests: `pytest tests/`
- ✅ Add authentication for production use
- ✅ Set up monitoring and logging
## Support
- 📖 API Docs: `http://localhost:7860/docs`
- 🐛 Issues: Create on your repository
- 💬 HuggingFace: Community forums
Happy redacting! 🔒