| # Utility Module | |
| This module contains shared utility functions and classes for the Nepal Justice Weaver project. | |
| ## Components | |
| ### 1. PDF Processor (`pdf_processor.py`) | |
| A comprehensive PDF processing module for extracting and refining Nepali text. | |
| **Features:** | |
| - PDF text extraction using PyMuPDF | |
| - Intelligent Nepali sentence segmentation | |
| - LLM-based refinement using Mistral | |
| - Integration with bias detection pipeline | |
| **Key Classes:** | |
| - `PDFProcessor`: Main class for PDF processing | |
| **Usage:** | |
| ```python | |
| from utility.pdf_processor import PDFProcessor | |
| processor = PDFProcessor() | |
| result = processor.process_pdf( | |
| pdf_path="document.pdf", | |
| refine_with_llm=True | |
| ) | |
| ``` | |
| **API Endpoints:** | |
| - `POST /api/v1/process-pdf` - Extract sentences from PDF | |
| - `POST /api/v1/process-pdf-to-bias` - Extract and analyze bias | |
| - `GET /api/v1/pdf-health` - Service health check | |
| ## Dependencies | |
| ``` | |
| fitz (pymupdf) - PDF text extraction | |
| mistralai - LLM for sentence refinement | |
| fastapi - API framework (for routes) | |
| ``` | |
| ## Documentation | |
| See [docs/pdf_processing.md](../docs/pdf_processing.md) for: | |
| - Complete API documentation | |
| - Usage examples | |
| - Configuration guide | |
| - Troubleshooting | |
| See [pdf_processor_examples.py](pdf_processor_examples.py) for code examples. | |
| ## Testing | |
| Run tests: | |
| ```bash | |
| pytest utility/test_pdf_processor.py -v | |
| ``` | |
| Manual tests: | |
| ```bash | |
| python utility/test_pdf_processor.py | |
| ``` | |
| ## Architecture | |
| ``` | |
| PDF Upload | |
| β | |
| PDFProcessor | |
| ββ extract_text_from_pdf() [PyMuPDF] | |
| ββ clean_text() [Regex] | |
| ββ split_into_sentences() [Regex + Unicode] | |
| ββ refine_sentences_with_llm() [Mistral API] | |
| β | |
| List of Sentences | |
| β | |
| Bias Detection API | |
| ``` | |
| ## File Structure | |
| ``` | |
| utility/ | |
| βββ __init__.py # Module initialization | |
| βββ pdf_processor.py # Main PDF processor class | |
| βββ pdf_processor_examples.py # Usage examples | |
| βββ test_pdf_processor.py # Test suite | |
| βββ README.md # This file | |
| ``` | |
| ## Future Enhancements | |
| - [ ] OCR support for scanned PDFs | |
| - [ ] Language auto-detection | |
| - [ ] Additional document format support | |
| - [ ] Caching optimization | |
| - [ ] Batch processing improvements | |
| ## Contributing | |
| When adding new utilities: | |
| 1. Add classes/functions to appropriate module | |
| 2. Update `__init__.py` with exports | |
| 3. Add comprehensive docstrings | |
| 4. Include examples in `*_examples.py` | |
| 5. Add tests to `test_*.py` | |
| --- | |
| For more information about PDF processing, see [docs/pdf_processing.md](../docs/pdf_processing.md). | |