PDF-Redaction-API / STRUCTURE.md

Sammi1211

adding url support

af107f1 about 2 months ago


	# Project Structure

	```
	pdf-redaction-api/
	│
	├── main.py                 # FastAPI application entry point
	├── Dockerfile              # Docker configuration for deployment
	├── requirements.txt        # Python dependencies
	├── README.md               # Project documentation (for HuggingFace)
	├── DEPLOYMENT.md           # Deployment guide
	├── .gitignore              # Git ignore rules
	├── .dockerignore           # Docker ignore rules
	│
	├── app/                    # Application modules
	│   ├── __init__.py         # Package initialization
	│   └── redaction.py        # Core redaction logic (PDFRedactor class)
	│
	├── uploads/                # Temporary upload directory
	│   └── .gitkeep            # Keep directory in git
	│
	├── outputs/                # Redacted PDF output directory
	│   └── .gitkeep            # Keep directory in git
	│
	├── tests/                  # Test suite
	│   ├── __init__.py
	│   └── test_api.py         # API endpoint tests
	│
	└── client_example.py       # Example client for API usage
	```

	## File Descriptions

	### Core Files

	#### `main.py`
	FastAPI application with endpoints:
	- `POST /redact` - Upload and redact PDF
	- `GET /download/{job_id}` - Download redacted PDF
	- `GET /health` - Health check
	- `GET /stats` - API statistics
	- `DELETE /cleanup/{job_id}` - Manual cleanup

	#### `app/redaction.py`
	Core redaction logic:
	- `PDFRedactor` class
	- OCR processing with pytesseract
	- NER using HuggingFace transformers
	- Entity-to-box mapping
	- PDF redaction with coordinate scaling
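
The coordinate scaling step can be illustrated with a small sketch. It assumes OCR boxes arrive as `(left, top, width, height)` in image pixels with a top-left origin, and that the PDF page is measured in points with a bottom-left origin. The names `PixelBox` and `scale_box_to_pdf` are hypothetical and are not taken from `app/redaction.py`.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    """An OCR bounding box in image pixels (origin at the top-left corner)."""
    left: int
    top: int
    width: int
    height: int


def scale_box_to_pdf(box, image_size, page_size):
    """Convert a pixel box to a PDF rectangle (x0, y0, x1, y1) in points.

    image_size is (width_px, height_px); page_size is (width_pt, height_pt).
    The y axis is flipped because the PDF origin is assumed to be bottom-left.
    """
    img_w, img_h = image_size
    page_w, page_h = page_size
    sx = page_w / img_w  # points per pixel, horizontal
    sy = page_h / img_h  # points per pixel, vertical
    x0 = box.left * sx
    x1 = (box.left + box.width) * sx
    y1 = page_h - box.top * sy                 # top edge of the box
    y0 = page_h - (box.top + box.height) * sy  # bottom edge of the box
    return (x0, y0, x1, y1)


# A US Letter page (612 x 792 pt) rendered at 200 DPI is 1700 x 2200 px.
rect = scale_box_to_pdf(PixelBox(170, 220, 340, 44), (1700, 2200), (612.0, 792.0))
print(rect)
```

Scaling each axis separately keeps the sketch valid even if the rendered image and the page differ slightly in aspect ratio.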

	### Configuration Files

	#### `requirements.txt`
	Python dependencies:
	- FastAPI & Uvicorn (API framework)
	- Transformers & Torch (NER model)
	- PyPDF (PDF manipulation)
	- pdf2image (PDF to image conversion)
	- pytesseract (OCR)
	- Pillow (Image processing)

	#### `Dockerfile`
	Single-stage build, in order:
	1. Install system dependencies (tesseract, poppler)
	2. Install Python dependencies
	3. Copy application code
	4. Configure for port 7860 (HuggingFace default)

	### Documentation

	#### `README.md`
	HuggingFace Space documentation:
	- Features overview
	- API endpoint documentation
	- Usage examples (cURL, Python)
	- Response format
	- Local development setup

	#### `DEPLOYMENT.md`
	Step-by-step deployment guide:
	- HuggingFace Spaces setup
	- Git workflow
	- Configuration options
	- Security considerations
	- Troubleshooting
	- Cost estimation

	### Testing & Examples

	#### `tests/test_api.py`
	Unit tests for API endpoints:
	- Health check tests
	- Upload validation tests
	- Error handling tests

	#### `client_example.py`
	Example client implementation:
	- Upload PDF
	- Download redacted file
	- Health check
	- Statistics

	## Data Flow

	```
	┌─────────────────────────────────────────────────────────┐
	│ 1. Client uploads PDF                                   │
	│    POST /redact with file                               │
	└─────────────────────────────────────────────────────────┘
	                             ↓
	┌─────────────────────────────────────────────────────────┐
	│ 2. FastAPI (main.py)                                    │
	│    - Validates file                                     │
	│    - Generates job_id                                   │
	│    - Saves to uploads/                                  │
	└─────────────────────────────────────────────────────────┘
	                             ↓
	┌─────────────────────────────────────────────────────────┐
	│ 3. PDFRedactor (app/redaction.py)                       │
	│    - perform_ocr() → Extract text + boxes               │
	│    - run_ner() → Identify entities                      │
	│    - map_entities_to_boxes() → Link entities to coords  │
	│    - create_redacted_pdf() → Generate output            │
	└─────────────────────────────────────────────────────────┘
	                             ↓
	┌─────────────────────────────────────────────────────────┐
	│ 4. Response                                             │
	│    - Return job_id and entity list                      │
	│    - Save redacted PDF to outputs/                      │
	└─────────────────────────────────────────────────────────┘
	                             ↓
	┌─────────────────────────────────────────────────────────┐
	│ 5. Client downloads                                     │
	│    GET /download/{job_id}                               │
	└─────────────────────────────────────────────────────────┘
	```
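
Step 2 of the flow (validate, generate `job_id`, save to `uploads/`) might look roughly like the sketch below. The file naming scheme, the use of `uuid4`, and the helper names `new_job_paths` and `looks_like_pdf` are assumptions made for illustration; the actual logic in `main.py` may differ.

```python
import uuid
from pathlib import Path

UPLOAD_DIR = Path("uploads")
OUTPUT_DIR = Path("outputs")


def new_job_paths(upload_dir=UPLOAD_DIR, output_dir=OUTPUT_DIR):
    """Create a fresh job_id and derive the upload and output paths from it."""
    job_id = uuid.uuid4().hex  # 32 hex characters, safe to use in a URL path
    upload_path = upload_dir / f"{job_id}.pdf"
    output_path = output_dir / f"{job_id}_redacted.pdf"  # hypothetical naming
    return job_id, upload_path, output_path


def looks_like_pdf(filename: str, head: bytes) -> bool:
    """Cheap validation: a .pdf extension plus the %PDF- header bytes."""
    return filename.lower().endswith(".pdf") and head.startswith(b"%PDF-")


job_id, upload_path, output_path = new_job_paths()
print(job_id, upload_path, output_path)
```

Deriving both paths from the same `job_id` means `GET /download/{job_id}` and `DELETE /cleanup/{job_id}` can locate files without any extra lookup table.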

	## Key Components

	### 1. FastAPI Application (`main.py`)

	Design:
	- RESTful API design
	- File upload handling
	- Background task cleanup
	- CORS middleware for web access

	Features:
	- Automatic OpenAPI documentation at `/docs`
	- JSON response models with Pydantic
	- Error handling with HTTP exceptions
	- Request validation

	### 2. Redaction Engine (`app/redaction.py`)

	Pipeline Steps:

	1. OCR Processing
	   - Convert PDF pages to images (pdf2image)
	   - Extract text and bounding boxes (pytesseract)
	   - Store image dimensions for coordinate scaling

	2. NER Processing
	   - Load HuggingFace model
	   - Identify entities in text
	   - Return entity types and character positions

	3. Mapping
	   - Create character span index for OCR words
	   - Match NER entities to OCR bounding boxes
	   - Handle partial word matches

	4. Redaction
	   - Scale OCR image coordinates to PDF points
	   - Create black rectangle annotations
	   - Write redacted PDF with pypdf
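
The mapping step (step 3) is the least obvious part of the pipeline, so here is a minimal sketch of one way to do it. It assumes the NER text was built by joining OCR words with single spaces, and that entities come back as character offsets into that text. The function names are hypothetical and do not come from `app/redaction.py`.

```python
def build_word_spans(words):
    """Join OCR words with single spaces and record each word's character span.

    Returns the joined text and a list of (start, end, word_index) tuples,
    where end is exclusive.
    """
    spans = []
    cursor = 0
    for index, word in enumerate(words):
        spans.append((cursor, cursor + len(word), index))
        cursor += len(word) + 1  # +1 accounts for the joining space
    return " ".join(words), spans


def words_for_entity(entity_start, entity_end, spans):
    """Return indices of words whose span overlaps [entity_start, entity_end).

    An overlap test (rather than exact equality) also catches entities that
    cover only part of a word.
    """
    return [i for start, end, i in spans if start < entity_end and entity_start < end]


words = ["Contact", "Jane", "Doe", "at", "jane@example.com"]
text, spans = build_word_spans(words)
start = text.index("Jane Doe")
hits = words_for_entity(start, start + len("Jane Doe"), spans)
print(hits)  # → [1, 2]
```

Each returned index points back at the OCR word whose bounding box should be covered, so a partial match redacts the whole word rather than a fraction of it.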

	### 3. Docker Container

	Layers:
	- Base: Python 3.10 slim
	- System packages: tesseract-ocr, poppler-utils
	- Python packages: From requirements.txt
	- Application code: Copied last for better caching

	Optimizations:
	- Multi-stage build (not used here, but a possible further optimization)
	- Minimal base image
	- Cached dependency layers
	- .dockerignore to reduce context size

	## Environment Variables

	Default configuration (can be overridden):

	```bash
	PYTHONUNBUFFERED=1 # Immediate log output
	HF_HOME=/app/cache # HuggingFace cache directory
	```
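
On the Python side, these defaults could be read with a fallback, as in the sketch below. The `load_settings` helper and the `DEFAULTS` mapping are illustrative assumptions, not part of the project code.

```python
import os

# Defaults mirror the values listed above.
DEFAULTS = {
    "PYTHONUNBUFFERED": "1",
    "HF_HOME": "/app/cache",
}


def load_settings(environ=None):
    """Return each setting from the environment, falling back to its default."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


print(load_settings({"HF_HOME": "/tmp/hf-cache"}))
```

Passing the environment in as a parameter keeps the function easy to test without touching the real process environment.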

	## Port Configuration

	- Development: 7860 (configurable in main.py)
	- Production (HF Spaces): 7860 (required)

	## Directory Permissions

	Ensure write permissions for:
	- `uploads/` - Temporary PDF storage
	- `outputs/` - Redacted PDF storage
	- `cache/` - Model cache (created automatically)
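
A startup check along these lines can surface permission problems early instead of at the first upload. This is a sketch only; `ensure_writable` is a hypothetical helper and the directory names are the ones listed above.

```python
import os
from pathlib import Path


def ensure_writable(directories):
    """Create each directory if missing and return those that are not writable.

    mkdir itself raises PermissionError if the parent directory is read-only;
    that case is left to propagate so the failure is visible at startup.
    """
    problems = []
    for directory in directories:
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)
        if not os.access(path, os.W_OK):
            problems.append(str(path))
    return problems
```

Calling `ensure_writable(["uploads", "outputs", "cache"])` when the application starts and logging any returned paths would be one way to use it.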

	## Adding New Features

	### Add New Endpoint

	1. Define in `main.py`:
	```python
	# `app` is the FastAPI instance already created in main.py
	@app.get("/new-endpoint")
	async def new_endpoint():
	    return {"message": "Hello"}
	```

	2. Add response model if needed
	3. Update README.md documentation
	4. Add tests in `tests/test_api.py`

	### Add New Redaction Option

	1. Modify `PDFRedactor` class in `app/redaction.py`
	2. Add parameter to `redact_document()` method
	3. Update API endpoint in `main.py`
	4. Document in README.md
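
As one example of such an option, the sketch below filters detected entities by type before any boxes are drawn. The `entity_types` parameter, the `filter_entities` helper, and the entity dictionary shape (`entity_type`, `text`) are all assumptions for illustration; the real `redact_document()` signature may differ.

```python
def filter_entities(entities, entity_types=None):
    """Keep only entities whose type is in entity_types.

    entity_types=None means "redact everything", which preserves the
    behaviour callers had before the option existed.
    """
    if entity_types is None:
        return list(entities)
    wanted = {entity_type.upper() for entity_type in entity_types}
    return [e for e in entities if e["entity_type"].upper() in wanted]


detected = [
    {"entity_type": "PER", "text": "Jane Doe"},
    {"entity_type": "LOC", "text": "Berlin"},
    {"entity_type": "ORG", "text": "Example Corp"},
]
print(filter_entities(detected, ["per", "org"]))
```

Defaulting the new parameter to `None` keeps existing API clients working unchanged.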

	### Add Authentication

	1. Install: `pip install python-jose passlib`
	2. Create `app/auth.py` with JWT logic
	3. Add middleware to `main.py`
	4. Protect endpoints with dependencies
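
To show what the "JWT logic" in step 2 involves, here is a standard-library sketch of HS256 signing and verification. It is a stand-in for illustration only: the steps above suggest `python-jose`, and a maintained library is the better choice in practice. The function names `sign_token` and `verify_token` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    """Build a compact JWT signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes, now=None):
    """Return the claims if the token is valid and unexpired, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        # Constant-time comparison avoids leaking how many bytes matched.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError:  # malformed token, bad base64, or bad JSON
        return None
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return None
    if not isinstance(claims, dict):
        return None
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        return None
    return claims
```

In a FastAPI app, a dependency would typically read the bearer token from the `Authorization` header, call something like `verify_token`, and raise a 401 error when it returns `None`.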

	## Best Practices

	1. Logging: Use `logger` for all important events
	2. Error Handling: Catch exceptions and return meaningful errors
	3. Validation: Use Pydantic models for request/response validation
	4. Cleanup: Always clean up temporary files
	5. Documentation: Keep README.md and code comments updated
	6. Testing: Add tests for new features
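
Practices 1 and 4 (logging and cleanup) combine naturally in a context manager. The sketch below is an assumption about how this could be done, using a hypothetical `temporary_upload` helper and logger name; it is not taken from the project code.

```python
import logging
import os
import tempfile
from contextlib import contextmanager

logger = logging.getLogger("pdf_redaction")  # hypothetical logger name


@contextmanager
def temporary_upload(data: bytes, suffix: str = ".pdf"):
    """Write data to a temporary file and always delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        logger.info("Saved upload to %s (%d bytes)", path, len(data))
        yield path
    finally:
        # Runs on success and on error, so no temporary file is left behind.
        try:
            os.remove(path)
            logger.info("Removed temporary file %s", path)
        except FileNotFoundError:
            logger.warning("Temporary file %s was already gone", path)
```

Usage is `with temporary_upload(pdf_bytes) as path: ...`; the file is removed even if processing inside the block raises.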

	## Performance Considerations

	### Bottlenecks
	1. OCR processing (most time-consuming)
	2. Model inference (NER)
	3. File I/O

	### Optimizations
	- Lower DPI for faster OCR (trade-off with accuracy)
	- Cache loaded models in memory
	- Use async file operations
	- Implement request queuing for high load
	- Consider GPU for NER model
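
Two of these points are easy to quantify or sketch. The pixel count that OCR must process grows with the square of the DPI, and an in-memory cache ensures the model is loaded only once per process. In the sketch below, `get_model` returns a placeholder dictionary instead of a real model, and both function names are hypothetical.

```python
from functools import lru_cache

LOAD_CALLS = []  # demonstration only: records how often a "load" happens


def page_pixels(dpi, width_in=8.5, height_in=11.0):
    """Number of pixels in one rendered page (US Letter by default)."""
    return round(dpi * width_in) * round(dpi * height_in)


@lru_cache(maxsize=None)
def get_model(model_name: str):
    """Load a model once per name; later calls reuse the cached object.

    The body is a placeholder for an expensive load such as building a
    NER pipeline.
    """
    LOAD_CALLS.append(model_name)
    return {"name": model_name}


# Halving the DPI cuts the pixels OCR has to scan to a quarter.
print(page_pixels(300) / page_pixels(150))  # → 4.0

first = get_model("example-ner-model")
second = get_model("example-ner-model")
print(first is second, len(LOAD_CALLS))  # → True 1
```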

	### Scaling
	- Horizontal: Multiple container instances
	- Vertical: Larger CPU/RAM allocation
	- Caching: Redis for temporary results
	- Queue: Celery for background processing
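
The queueing idea can be illustrated in-process with the standard library before reaching for Celery. The sketch below is a simplified stand-in, not a Celery example: a bounded `queue.Queue` provides back-pressure and a fixed pool of threads limits concurrent redaction jobs. `run_jobs` and its parameters are hypothetical.

```python
import queue
import threading


def run_jobs(job_ids, handler, workers=2, max_pending=8):
    """Process job_ids with a fixed worker pool; returns {job_id: result}.

    If handler raises, the exception object is stored as that job's result
    so one failing job does not stop the worker.
    """
    pending = queue.Queue(maxsize=max_pending)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            job_id = pending.get()
            if job_id is None:  # sentinel: no more work for this worker
                return
            try:
                outcome = handler(job_id)
            except Exception as exc:
                outcome = exc
            with lock:
                results[job_id] = outcome

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for job_id in job_ids:
        pending.put(job_id)  # blocks while the queue is full (back-pressure)
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


print(run_jobs(["a", "b", "c"], str.upper))
```

A real deployment with several container instances would need a shared broker instead, which is where Celery and Redis come in.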