# Project Structure
```
pdf-redaction-api/
│
├── main.py                 # FastAPI application entry point
├── Dockerfile              # Docker configuration for deployment
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation (for HuggingFace)
├── DEPLOYMENT.md           # Deployment guide
├── .gitignore              # Git ignore rules
├── .dockerignore           # Docker ignore rules
│
├── app/                    # Application modules
│   ├── __init__.py         # Package initialization
│   └── redaction.py        # Core redaction logic (PDFRedactor class)
│
├── uploads/                # Temporary upload directory
│   └── .gitkeep            # Keep directory in git
│
├── outputs/                # Redacted PDF output directory
│   └── .gitkeep            # Keep directory in git
│
├── tests/                  # Test suite
│   ├── __init__.py
│   └── test_api.py         # API endpoint tests
│
└── client_example.py       # Example client for API usage
```
## File Descriptions
### Core Files
#### `main.py`
FastAPI application with endpoints:
- `POST /redact` - Upload and redact PDF
- `GET /download/{job_id}` - Download redacted PDF
- `GET /health` - Health check
- `GET /stats` - API statistics
- `DELETE /cleanup/{job_id}` - Manual cleanup
#### `app/redaction.py`
Core redaction logic:
- `PDFRedactor` class
- OCR processing with pytesseract
- NER using HuggingFace transformers
- Entity-to-box mapping
- PDF redaction with coordinate scaling
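
The coordinate scaling step can be sketched as follows. This is a minimal illustration, not the code in `app/redaction.py`: the function name `image_box_to_pdf_rect` and its parameters are hypothetical. It assumes OCR boxes are `(left, top, width, height)` in image pixels with the origin at the top-left, and that PDF coordinates are in points with the origin at the bottom-left.

```python
def image_box_to_pdf_rect(box, image_size, page_size):
    """Convert an OCR box in image pixels to a PDF rectangle in points.

    box:        (left, top, width, height) in pixels, origin top-left
    image_size: (width, height) of the rendered page image in pixels
    page_size:  (width, height) of the PDF page in points
    Returns (x0, y0, x1, y1) with the origin at the bottom-left.
    """
    left, top, width, height = box
    img_w, img_h = image_size
    page_w, page_h = page_size

    # One scale factor per axis, in case the aspect ratios differ slightly.
    sx = page_w / img_w
    sy = page_h / img_h

    x0 = left * sx
    x1 = (left + width) * sx
    # Flip the vertical axis: image y grows downward, PDF y grows upward.
    y1 = page_h - top * sy
    y0 = page_h - (top + height) * sy
    return (x0, y0, x1, y1)


# A US Letter page (612 x 792 pt) rendered at 200 DPI is 1700 x 2200 px.
rect = image_box_to_pdf_rect((100, 200, 300, 50), (1700, 2200), (612, 792))
```

Here `rect` comes out at approximately `(36, 702, 144, 720)`: the box keeps its horizontal position, while its vertical position is measured from the bottom of the page.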
### Configuration Files
#### `requirements.txt`
Python dependencies:
- FastAPI & Uvicorn (API framework)
- Transformers & Torch (NER model)
- pypdf (PDF manipulation)
- pdf2image (PDF to image conversion)
- pytesseract (OCR)
- Pillow (Image processing)
#### `Dockerfile`
Build steps:
1. Install system dependencies (tesseract, poppler)
2. Install Python dependencies
3. Copy application code
4. Configure for port 7860 (HuggingFace default)
### Documentation
#### `README.md`
HuggingFace Space documentation:
- Features overview
- API endpoint documentation
- Usage examples (cURL, Python)
- Response format
- Local development setup
#### `DEPLOYMENT.md`
Step-by-step deployment guide:
- HuggingFace Spaces setup
- Git workflow
- Configuration options
- Security considerations
- Troubleshooting
- Cost estimation
### Testing & Examples
#### `tests/test_api.py`
Unit tests for API endpoints:
- Health check tests
- Upload validation tests
- Error handling tests
#### `client_example.py`
Example client implementation:
- Upload PDF
- Download redacted file
- Health check
- Statistics
## Data Flow
```
┌─────────────────────────────────────────────────────────┐
│ 1. Client uploads PDF                                   │
│    POST /redact with file                               │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│ 2. FastAPI (main.py)                                    │
│    - Validates file                                     │
│    - Generates job_id                                   │
│    - Saves to uploads/                                  │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│ 3. PDFRedactor (app/redaction.py)                       │
│    - perform_ocr()           → Extract text + boxes     │
│    - run_ner()               → Identify entities        │
│    - map_entities_to_boxes() → Link entities to coords  │
│    - create_redacted_pdf()   → Generate output          │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│ 4. Response                                             │
│    - Return job_id and entity list                      │
│    - Save redacted PDF to outputs/                      │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│ 5. Client downloads                                     │
│    GET /download/{job_id}                               │
└─────────────────────────────────────────────────────────┘
```
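
Step 2 of the flow (validate, assign a job ID, save) might look roughly like the sketch below. The helper `save_upload` is hypothetical, and the specific checks (a `.pdf` extension plus the `%PDF-` file header) and the use of UUIDs for job IDs are assumptions; the actual logic in `main.py` may differ.

```python
import uuid
from pathlib import Path


def save_upload(filename: str, data: bytes, upload_dir: Path) -> tuple[str, Path]:
    """Validate an uploaded PDF, assign a job ID, and store it on disk."""
    # Assumed checks: file extension plus the "%PDF-" header PDFs start with.
    if not filename.lower().endswith(".pdf"):
        raise ValueError("Only .pdf files are accepted")
    if not data.startswith(b"%PDF-"):
        raise ValueError("File content is not a PDF")

    # Assumed ID scheme: a random UUID, so the client-supplied file name
    # is never used as a path on the server.
    job_id = str(uuid.uuid4())
    upload_dir.mkdir(parents=True, exist_ok=True)
    path = upload_dir / f"{job_id}.pdf"
    path.write_bytes(data)
    return job_id, path
```

Naming the stored file after the job ID also makes the later `GET /download/{job_id}` and `DELETE /cleanup/{job_id}` lookups a simple path computation.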
## Key Components
### 1. FastAPI Application (`main.py`)
**Design:**
- RESTful API design
- File upload handling
- Background task cleanup
- CORS middleware for web access
**Features:**
- Automatic OpenAPI documentation at `/docs`
- JSON response models with Pydantic
- Error handling with HTTP exceptions
- Request validation
### 2. Redaction Engine (`app/redaction.py`)
**Pipeline Steps:**
1. **OCR Processing**
- Convert PDF pages to images (pdf2image)
- Extract text and bounding boxes (pytesseract)
- Store image dimensions for coordinate scaling
2. **NER Processing**
- Load HuggingFace model
- Identify entities in text
- Return entity types and character positions
3. **Mapping**
- Create character span index for OCR words
- Match NER entities to OCR bounding boxes
- Handle partial word matches
4. **Redaction**
- Scale OCR image coordinates to PDF points
- Create black rectangle annotations
- Write redacted PDF with pypdf
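
The mapping step (step 3) can be illustrated with a small sketch. The names `build_span_index` and `boxes_for_entity` and the word dictionary shape are hypothetical. The sketch assumes the OCR words are joined with single spaces to form the text passed to the NER model, and that each entity comes back with start and end character offsets into that text.

```python
def build_span_index(words):
    """Join OCR words with single spaces and record each word's character span.

    words: list of dicts with "text" and "box" keys (hypothetical shape).
    Returns (full_text, spans), where spans[i] = (start, end) of words[i].
    """
    spans, parts, pos = [], [], 0
    for word in words:
        start = pos
        end = start + len(word["text"])
        spans.append((start, end))
        parts.append(word["text"])
        pos = end + 1  # account for the joining space
    return " ".join(parts), spans


def boxes_for_entity(entity_start, entity_end, words, spans):
    """Return the box of every word whose span overlaps [entity_start, entity_end).

    Testing for overlap rather than equality also catches partial word
    matches, e.g. an entity that ends before trailing punctuation.
    """
    return [
        word["box"]
        for word, (start, end) in zip(words, spans)
        if start < entity_end and end > entity_start
    ]


words = [
    {"text": "Contact", "box": (10, 10, 80, 20)},
    {"text": "Jane", "box": (95, 10, 45, 20)},
    {"text": "Doe,", "box": (145, 10, 45, 20)},
    {"text": "please.", "box": (195, 10, 70, 20)},
]
text, spans = build_span_index(words)
# Suppose NER reports "Jane Doe" at characters 8-16 (comma excluded).
hits = boxes_for_entity(8, 16, words, spans)
```

In this example `hits` contains the boxes for `Jane` and `Doe,`: the second word is only partially covered by the entity, but its whole box is still redacted.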
### 3. Docker Container
**Layers:**
- Base: Python 3.10 slim
- System packages: tesseract-ocr, poppler-utils
- Python packages: From requirements.txt
- Application code: Copied last for better caching
**Optimizations:**
- Multi-stage build (not currently used; a possible future improvement)
- Minimal base image
- Cached dependency layers
- .dockerignore to reduce context size
## Environment Variables
Default configuration (can be overridden):
```bash
PYTHONUNBUFFERED=1 # Immediate log output
HF_HOME=/app/cache # HuggingFace cache directory
```
## Port Configuration
- **Development**: 7860 (configurable in main.py)
- **Production (HF Spaces)**: 7860 (required)
## Directory Permissions
Ensure write permissions for:
- `uploads/` - Temporary PDF storage
- `outputs/` - Redacted PDF storage
- `cache/` - Model cache (created automatically)
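
A startup check along these lines could verify the directories above. This is a sketch only: `ensure_writable` is a hypothetical helper, not something known to exist in the project.

```python
import os
from pathlib import Path


def ensure_writable(directories):
    """Create each directory if missing and return those that are not writable."""
    not_writable = []
    for directory in directories:
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)
        # Check whether the current process user may write to the directory.
        if not os.access(path, os.W_OK):
            not_writable.append(path)
    return not_writable
```

Calling `ensure_writable(["uploads", "outputs", "cache"])` when the application starts and logging any returned paths would surface permission problems before the first request arrives.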
## Adding New Features
### Add New Endpoint
1. Define in `main.py`:
```python
@app.get("/new-endpoint")
async def new_endpoint():
    return {"message": "Hello"}
```
2. Add response model if needed
3. Update README.md documentation
4. Add tests in `tests/test_api.py`
### Add New Redaction Option
1. Modify `PDFRedactor` class in `app/redaction.py`
2. Add parameter to `redact_document()` method
3. Update API endpoint in `main.py`
4. Document in README.md
### Add Authentication
1. Install: `pip install python-jose passlib`
2. Create `app/auth.py` with JWT logic
3. Add middleware to `main.py`
4. Protect endpoints with dependencies
## Best Practices
1. **Logging**: Use `logger` for all important events
2. **Error Handling**: Catch exceptions and return meaningful errors
3. **Validation**: Use Pydantic models for request/response validation
4. **Cleanup**: Always clean up temporary files
5. **Documentation**: Keep README.md and code comments updated
6. **Testing**: Add tests for new features
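
As one way to combine the logging and cleanup practices, temporary files can be wrapped in a context manager so they are removed even when processing fails. The helper `temporary_files` below is a hypothetical sketch, not part of the existing code.

```python
import logging
from contextlib import contextmanager
from pathlib import Path

logger = logging.getLogger(__name__)


@contextmanager
def temporary_files(*paths):
    """Yield the given paths and delete them afterwards, even if an error occurs."""
    try:
        yield paths
    finally:
        for path in map(Path, paths):
            try:
                path.unlink()
                logger.info("Removed temporary file %s", path)
            except FileNotFoundError:
                # Nothing to clean up: the file was never created or is already gone.
                pass
```

A request handler could then run the redaction inside `with temporary_files(upload_path): ...`, and the upload is deleted whether the pipeline succeeds or raises.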
## Performance Considerations
### Bottlenecks
1. OCR processing (most time-consuming)
2. Model inference (NER)
3. File I/O
### Optimizations
- Lower DPI for faster OCR (trade-off with accuracy)
- Cache loaded models in memory
- Use async file operations
- Implement request queuing for high load
- Consider GPU for NER model
### Scaling
- Horizontal: Multiple container instances
- Vertical: Larger CPU/RAM allocation
- Caching: Redis for temporary results
- Queue: Celery for background processing