setu / utility /README.md
khagu's picture
chore: finally untrack large database files
3998131
# Utility Module
This module contains shared utility functions and classes for the Nepal Justice Weaver project.
## Components
### 1. PDF Processor (`pdf_processor.py`)
A comprehensive PDF processing module for extracting and refining Nepali text.
**Features:**
- PDF text extraction using PyMuPDF
- Intelligent Nepali sentence segmentation
- LLM-based refinement using Mistral
- Integration with bias detection pipeline
**Key Classes:**
- `PDFProcessor`: Main class for PDF processing
**Usage:**
```python
from utility.pdf_processor import PDFProcessor
processor = PDFProcessor()
result = processor.process_pdf(
pdf_path="document.pdf",
refine_with_llm=True
)
```
**API Endpoints:**
- `POST /api/v1/process-pdf` - Extract sentences from PDF
- `POST /api/v1/process-pdf-to-bias` - Extract and analyze bias
- `GET /api/v1/pdf-health` - Service health check
## Dependencies
```
fitz (pymupdf) - PDF text extraction
mistralai - LLM for sentence refinement
fastapi - API framework (for routes)
```
## Documentation
See [docs/pdf_processing.md](../docs/pdf_processing.md) for:
- Complete API documentation
- Usage examples
- Configuration guide
- Troubleshooting
See [pdf_processor_examples.py](pdf_processor_examples.py) for code examples.
## Testing
Run tests:
```bash
pytest utility/test_pdf_processor.py -v
```
Manual tests:
```bash
python utility/test_pdf_processor.py
```
## Architecture
```
PDF Upload
↓
PDFProcessor
β”œβ”€ extract_text_from_pdf() [PyMuPDF]
β”œβ”€ clean_text() [Regex]
β”œβ”€ split_into_sentences() [Regex + Unicode]
└─ refine_sentences_with_llm() [Mistral API]
↓
List of Sentences
↓
Bias Detection API
```
## File Structure
```
utility/
β”œβ”€β”€ __init__.py # Module initialization
β”œβ”€β”€ pdf_processor.py # Main PDF processor class
β”œβ”€β”€ pdf_processor_examples.py # Usage examples
β”œβ”€β”€ test_pdf_processor.py # Test suite
└── README.md # This file
```
## Future Enhancements
- [ ] OCR support for scanned PDFs
- [ ] Language auto-detection
- [ ] Additional document format support
- [ ] Caching optimization
- [ ] Batch processing improvements
## Contributing
When adding new utilities:
1. Add classes/functions to appropriate module
2. Update `__init__.py` with exports
3. Add comprehensive docstrings
4. Include examples in `*_examples.py`
5. Add tests to `test_*.py`
---
For more information about PDF processing, see [docs/pdf_processing.md](../docs/pdf_processing.md).