Utility Module
This module contains shared utility functions and classes for the Nepal Justice Weaver project.
Components
1. PDF Processor (pdf_processor.py)
A comprehensive PDF processing module for extracting and refining Nepali text.
Features:
- PDF text extraction using PyMuPDF
- Intelligent Nepali sentence segmentation
- LLM-based refinement using Mistral
- Integration with bias detection pipeline
Key Classes:
PDFProcessor: Main class for PDF processing
Usage:
from utility.pdf_processor import PDFProcessor

processor = PDFProcessor()
result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=True
)
API Endpoints:
- POST /api/v1/process-pdf - Extract sentences from PDF
- POST /api/v1/process-pdf-to-bias - Extract and analyze bias
- GET /api/v1/pdf-health - Service health check
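As a rough illustration of calling the extraction endpoint, the sketch below builds a multipart upload request with only the standard library. The base URL (http://localhost:8000), the form field name ("file"), and the helper name build_process_pdf_request are assumptions made for this example; check docs/pdf_processing.md for the actual request format.

```python
import urllib.request
import uuid


def build_process_pdf_request(
    pdf_bytes: bytes,
    filename: str = "document.pdf",
    base_url: str = "http://localhost:8000",  # assumed host and port
    field_name: str = "file",                 # assumed form field name
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to /api/v1/process-pdf."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (
            f'Content-Disposition: form-data; name="{field_name}"; '
            f'filename="{filename}"\r\n'
        ).encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url}/api/v1/process-pdf",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

To send the request against a running service, pass the returned object to urllib.request.urlopen() and decode the response body.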
Dependencies
fitz (pymupdf) - PDF text extraction
mistralai - LLM for sentence refinement
fastapi - API framework (for routes)
Documentation
See docs/pdf_processing.md for:
- Complete API documentation
- Usage examples
- Configuration guide
- Troubleshooting
See pdf_processor_examples.py for code examples.
Testing
Run tests:
pytest utility/test_pdf_processor.py -v
Manual tests:
python utility/test_pdf_processor.py
Architecture
PDF Upload
    ↓
PDFProcessor
    ├─ extract_text_from_pdf()      [PyMuPDF]
    ├─ clean_text()                 [Regex]
    ├─ split_into_sentences()       [Regex + Unicode]
    └─ refine_sentences_with_llm()  [Mistral API]
    ↓
List of Sentences
    ↓
Bias Detection API
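The cleaning and segmentation stages in the pipeline above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: it assumes sentences end at the Devanagari danda (U+0964), the double danda (U+0965), "?" or "!", followed by whitespace. The function names mirror those in the diagram, but their real signatures and rules may differ.

```python
import re
import unicodedata

# Assumed sentence terminators: danda, double danda, "?" and "!".
# Split on the whitespace that follows a terminator, keeping the terminator.
_SENTENCE_END = re.compile(r"(?<=[\u0964\u0965?!])\s+")


def clean_text(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_into_sentences(text: str) -> list[str]:
    """Split cleaned text into sentences at the assumed terminators."""
    parts = _SENTENCE_END.split(clean_text(text))
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    sample = "यो पहिलो वाक्य हो।  यो दोस्रो\nवाक्य हो। के यो तेस्रो हो?"
    for sentence in split_into_sentences(sample):
        print(sentence)
```

A rule this simple does not split when a terminator is followed directly by the next word with no space, and it knows nothing about abbreviations or numbered clauses. Edge cases like these are presumably what the LLM refinement step is meant to handle.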
File Structure
utility/
├── __init__.py                  # Module initialization
├── pdf_processor.py             # Main PDF processor class
├── pdf_processor_examples.py    # Usage examples
├── test_pdf_processor.py        # Test suite
└── README.md                    # This file
Future Enhancements
- OCR support for scanned PDFs
- Language auto-detection
- Additional document format support
- Caching optimization
- Batch processing improvements
Contributing
When adding new utilities:
- Add classes/functions to the appropriate module
- Update __init__.py with exports
- Add comprehensive docstrings
- Include examples in *_examples.py
- Add tests to test_*.py
For more information about PDF processing, see docs/pdf_processing.md.