# Utility Module
This module contains shared utility functions and classes for the Nepal Justice Weaver project.
## Components
### 1. PDF Processor (`pdf_processor.py`)
A comprehensive PDF processing module for extracting and refining Nepali text.
**Features:**
- PDF text extraction using PyMuPDF
- Intelligent Nepali sentence segmentation
- LLM-based refinement using Mistral
- Integration with bias detection pipeline
**Key Classes:**
- `PDFProcessor`: Main class for PDF processing
**Usage:**
```python
from utility.pdf_processor import PDFProcessor

processor = PDFProcessor()

# Extract sentences from the PDF and refine them with the LLM
result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=True,
)
```
**API Endpoints:**
- `POST /api/v1/process-pdf` - Extract sentences from PDF
- `POST /api/v1/process-pdf-to-bias` - Extract and analyze bias
- `GET /api/v1/pdf-health` - Service health check
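
As a rough illustration of calling the extraction endpoint, the sketch below builds (but does not send) a `POST /api/v1/process-pdf` request using only the standard library. The request shape is an assumption, not taken from the route definitions: it supposes the PDF is uploaded as `multipart/form-data` under a form field named `file`, and that the service runs at `http://localhost:8000`. The helper name `build_process_pdf_request` is hypothetical. Check [docs/pdf_processing.md](../docs/pdf_processing.md) for the actual contract.

```python
import urllib.request
import uuid


def build_process_pdf_request(
    base_url: str,
    pdf_bytes: bytes,
    filename: str = "document.pdf",
) -> urllib.request.Request:
    """Build a multipart upload request for POST /api/v1/process-pdf.

    ASSUMPTION: the endpoint accepts multipart/form-data with the PDF
    in a field named "file". Adjust to match the real route signature.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (
            'Content-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\n'
        ).encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/process-pdf",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Hypothetical local deployment; nothing is sent until urlopen() is called.
request = build_process_pdf_request("http://localhost:8000", b"%PDF-1.4 ...")
```

Passing `request` to `urllib.request.urlopen()` would send it; the response format is not assumed here.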
## Dependencies
```
pymupdf (imported as fitz) - PDF text extraction
mistralai - LLM for sentence refinement
fastapi - API framework (for routes)
```
## Documentation
See [docs/pdf_processing.md](../docs/pdf_processing.md) for:
- Complete API documentation
- Usage examples
- Configuration guide
- Troubleshooting
See [pdf_processor_examples.py](pdf_processor_examples.py) for code examples.
## Testing
Run tests:
```bash
pytest utility/test_pdf_processor.py -v
```
Manual tests:
```bash
python utility/test_pdf_processor.py
```
## Architecture
```
PDF Upload
    ↓
PDFProcessor
    ├─ extract_text_from_pdf()      [PyMuPDF]
    ├─ clean_text()                 [Regex]
    ├─ split_into_sentences()       [Regex + Unicode]
    └─ refine_sentences_with_llm()  [Mistral API]
    ↓
List of Sentences
    ↓
Bias Detection API
```
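
To make the two regex stages of the pipeline concrete, here is a minimal sketch of cleaning and sentence splitting for Nepali text. It is not the code in `pdf_processor.py`: the function bodies are hypothetical stand-ins that only borrow the stage names from the diagram, and they assume sentences end with a Devanagari danda (`।`, U+0964), `?`, or `!` followed by whitespace.

```python
import re
import unicodedata
from typing import List

# ASSUMPTION: these are illustrative stand-ins, not the PDFProcessor methods.
# Sentence terminators: Devanagari danda (U+0964), '?' and '!'.
SENTENCE_BOUNDARY = re.compile(r"(?<=[\u0964?!])\s+")


def clean_text(raw: str) -> str:
    """Normalize Unicode to NFC and collapse whitespace runs to one space."""
    text = unicodedata.normalize("NFC", raw)
    return re.sub(r"\s+", " ", text).strip()


def split_into_sentences(text: str) -> List[str]:
    """Split on whitespace that follows a sentence terminator."""
    parts = SENTENCE_BOUNDARY.split(text)
    return [part.strip() for part in parts if part.strip()]


def extract_sentences(raw: str) -> List[str]:
    """Run the cleaning stage, then the splitting stage."""
    return split_into_sentences(clean_text(raw))


# Line breaks inside a sentence (common in PDF output) are removed by
# clean_text() before splitting, so they do not create false boundaries.
sample = "यो पहिलो वाक्य हो।  यो दोस्रो\nवाक्य हो। के यो तेस्रो हो?"
sentences = extract_sentences(sample)
```

The lookbehind keeps each terminator attached to its sentence, which is convenient if the sentences are later passed to an LLM refinement step or a bias detection call.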
## File Structure
```
utility/
├── __init__.py                 # Module initialization
├── pdf_processor.py            # Main PDF processor class
├── pdf_processor_examples.py   # Usage examples
├── test_pdf_processor.py       # Test suite
└── README.md                   # This file
```
## Future Enhancements
- [ ] OCR support for scanned PDFs
- [ ] Language auto-detection
- [ ] Additional document format support
- [ ] Caching optimization
- [ ] Batch processing improvements
## Contributing
When adding new utilities:
1. Add classes/functions to the appropriate module
2. Update `__init__.py` with exports
3. Add comprehensive docstrings
4. Include examples in `*_examples.py`
5. Add tests to `test_*.py`
---
For more information about PDF processing, see [docs/pdf_processing.md](../docs/pdf_processing.md).