Utility Module

This module contains shared utility functions and classes for the Nepal Justice Weaver project.

Components

1. PDF Processor (pdf_processor.py)

A PDF processing module that extracts text from PDFs, segments it into Nepali sentences, and optionally refines those sentences with an LLM.

Features:

  • PDF text extraction using PyMuPDF
  • Intelligent Nepali sentence segmentation
  • LLM-based refinement using Mistral
  • Integration with bias detection pipeline

Key Classes:

  • PDFProcessor: Main class for PDF processing

Usage:

from utility.pdf_processor import PDFProcessor

processor = PDFProcessor()
result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=True
)

API Endpoints:

  • POST /api/v1/process-pdf - Extract sentences from PDF
  • POST /api/v1/process-pdf-to-bias - Extract and analyze bias
  • GET /api/v1/pdf-health - Service health check
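A minimal sketch of calling the health endpoint from Python, using only the standard library. The base URL `http://localhost:8000` and the JSON response body are assumptions, and the helper names `endpoint` and `check_health` are hypothetical. The request format of the two POST endpoints is not described in this README, so they are not sketched here; see docs/pdf_processing.md for those.

```python
import json
import urllib.request

# Assumption: the API is served locally on port 8000. Adjust to your deployment.
BASE_URL = "http://localhost:8000"


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Build a full URL for a route under the /api/v1 prefix."""
    return f"{base_url.rstrip('/')}/api/v1/{path.lstrip('/')}"


def check_health(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """Call GET /api/v1/pdf-health and decode the body.

    Assumption: the service answers with a JSON object.
    """
    url = endpoint("pdf-health", base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service running, `check_health()` returns whatever the endpoint reports; the exact fields depend on the implementation.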

Dependencies

pymupdf         - PDF text extraction (imported as fitz)
mistralai       - LLM client for sentence refinement
fastapi         - API framework (for routes)

Documentation

See docs/pdf_processing.md for:

  • Complete API documentation
  • Usage examples
  • Configuration guide
  • Troubleshooting

See pdf_processor_examples.py for code examples.

Testing

Run tests:

pytest utility/test_pdf_processor.py -v

Manual tests:

python utility/test_pdf_processor.py

Architecture

PDF Upload
   ↓
PDFProcessor
   ├─ extract_text_from_pdf()      [PyMuPDF]
   ├─ clean_text()                  [Regex]
   ├─ split_into_sentences()        [Regex + Unicode]
   └─ refine_sentences_with_llm()   [Mistral API]
   ↓
List of Sentences
   ↓
Bias Detection API
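To make the cleaning and splitting stages concrete, here is a minimal sketch of how regex-based Nepali sentence segmentation can work. It is not the actual `PDFProcessor` code: the function names mirror the diagram, but the bodies are illustrative assumptions. It splits after a danda (।), double danda (॥), question mark, or exclamation mark when whitespace follows.

```python
import re
import unicodedata

# Split on whitespace that follows a sentence-ending mark:
# danda (U+0964), double danda (U+0965), '?' or '!'.
_SENTENCE_END = re.compile(r"(?<=[\u0964\u0965?!])\s+")


def clean_text(raw: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace.

    Illustrative stand-in for the real cleaning step.
    """
    text = unicodedata.normalize("NFC", raw)
    return re.sub(r"\s+", " ", text).strip()


def split_into_sentences(text: str) -> list[str]:
    """Split cleaned text into sentences, keeping the end mark attached."""
    parts = _SENTENCE_END.split(text)
    return [p.strip() for p in parts if p.strip()]


sample = "यो पहिलो वाक्य हो।  यो दोस्रो\nवाक्य हो। के यो तेस्रो हो?"
sentences = split_into_sentences(clean_text(sample))
```

This sketch does not split where a danda is followed directly by the next word with no space, and it treats every `?` or `!` as a sentence end. Edge cases like these are presumably what the LLM refinement step is meant to smooth over.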

File Structure

utility/
├── __init__.py                  # Module initialization
├── pdf_processor.py             # Main PDF processor class
├── pdf_processor_examples.py    # Usage examples
├── test_pdf_processor.py        # Test suite
└── README.md                    # This file

Future Enhancements

  • OCR support for scanned PDFs
  • Language auto-detection
  • Additional document format support
  • Caching optimization
  • Batch processing improvements

Contributing

When adding new utilities:

  1. Add classes/functions to appropriate module
  2. Update __init__.py with exports
  3. Add comprehensive docstrings
  4. Include examples in *_examples.py
  5. Add tests to test_*.py
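As a rough illustration of steps 1, 3, and 5, a new utility and its test might look like the following. The function `count_devanagari_chars` is hypothetical and is not part of this module; it only shows the expected shape of a docstring and a pytest-style test.

```python
def count_devanagari_chars(text: str) -> int:
    """Count characters in the Devanagari Unicode block (U+0900 to U+097F).

    Hypothetical example utility, shown only to illustrate docstring
    and test conventions.

    Args:
        text: Any string, possibly mixing scripts.

    Returns:
        The number of code points that fall in the Devanagari block.
    """
    return sum(1 for ch in text if "\u0900" <= ch <= "\u097f")


def test_count_devanagari_chars():
    # Latin-only input contains no Devanagari code points.
    assert count_devanagari_chars("abc") == 0
    # "नेपाल" is five code points: three consonants and two vowel signs.
    assert count_devanagari_chars("नेपाल") == 5
    # Mixed-script input counts only the Devanagari part.
    assert count_devanagari_chars("Nepal नेपाल") == 5
```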

For more information about PDF processing, see docs/pdf_processing.md.