# PDF Processing Module for Bias Detection

## Overview

The PDF Processing module (`utility/pdf_processor.py`) provides a complete pipeline for extracting text from Nepali PDFs and preparing sentences for bias detection analysis.

**Key Features:**

- ✓ PDF text extraction using PyMuPDF (fitz)
- ✓ Intelligent Nepali sentence segmentation
- ✓ LLM-based sentence refinement using Mistral
- ✓ Integration with bias detection API
- ✓ File upload support via API endpoints
- ✓ Error handling and logging

## Architecture

```
User Upload (PDF)
    ↓
PDFProcessor.process_pdf_from_bytes()
    ↓
[Extract Text] → [Clean Text] → [Split Sentences] → [Refine with LLM]
    ↓
List of Refined Sentences
    ↓
Bias Detection Model
    ↓
Bias Analysis Results
```

## Installation

### Required Dependencies

```bash
# PyMuPDF for PDF text extraction
pip install pymupdf

# Already included in module_a
# mistralai - Mistral LLM client
```

### Setup

1. Ensure `mistralai` is installed in your environment
2. Set the `MISTRAL_API_KEY` environment variable
3. The module uses the existing `MistralClient` from `module_a/llm_client.py`

## Usage

### 1. Basic Python Usage

```python
from utility.pdf_processor import PDFProcessor

# Initialize processor
processor = PDFProcessor()

# Process PDF from file path
result = processor.process_pdf(
    pdf_path="path/to/document.pdf",
    refine_with_llm=True
)

if result["success"]:
    sentences = result["sentences"]
    print(f"Extracted {result['total_sentences']} sentences")
    for sentence in sentences:
        print(f"- {sentence}")
```

### 2. Process from Bytes (File Uploads)

```python
# For API file uploads (this fragment runs inside an async request handler)
processor = PDFProcessor()

pdf_bytes = await request.file.read()
result = processor.process_pdf_from_bytes(
    pdf_bytes=pdf_bytes,
    refine_with_llm=True
)
```

### 3. API Endpoints

#### A. Extract Sentences Only

**Endpoint:** `POST /api/v1/process-pdf`

**Request:**

```bash
curl -X POST "http://localhost:8000/api/v1/process-pdf" \
  -F "file=@nepali_document.pdf" \
  -F "refine_with_llm=true"
```

**Response:**

```json
{
  "success": true,
  "sentences": [
    "पहिलो वाक्य यहाँ छ।",
    "दोस्रो वाक्य यहाँ छ।",
    "तेस्रो वाक्य यहाँ छ।"
  ],
  "total_sentences": 3,
  "filename": "nepali_document.pdf",
  "raw_text": "पहिलो वाक्य यहाँ छ। दोस्रो वाक्य यहाँ छ। तेस्रो वाक्य यहाँ छ।"
}
```

#### B. Extract Sentences + Bias Detection

**Endpoint:** `POST /api/v1/process-pdf-to-bias`

**Request:**

```bash
curl -X POST "http://localhost:8000/api/v1/process-pdf-to-bias" \
  -F "file=@nepali_document.pdf" \
  -F "refine_with_llm=true" \
  -F "confidence_threshold=0.7"
```

**Response:**

```json
{
  "success": true,
  "total_sentences": 3,
  "biased_count": 1,
  "neutral_count": 2,
  "results": [
    {
      "sentence": "पहिलो वाक्य यहाँ छ।",
      "category": "neutral",
      "confidence": 0.95,
      "is_biased": false
    },
    {
      "sentence": "दोस्रो वाक्य यहाँ छ।",
      "category": "gender",
      "confidence": 0.82,
      "is_biased": true
    },
    {
      "sentence": "तेस्रो वाक्य यहाँ छ।",
      "category": "neutral",
      "confidence": 0.91,
      "is_biased": false
    }
  ],
  "filename": "nepali_document.pdf"
}
```

#### C. Service Health Check

**Endpoint:** `GET /api/v1/pdf-health`

**Response:**

```json
{
  "status": "healthy",
  "pdf_processor": "ready",
  "mistral_client": "connected",
  "features": {
    "pdf_extraction": true,
    "sentence_segmentation": true,
    "llm_refinement": true
  }
}
```

## API Schemas

### PDFProcessingResponse

```python
{
    "success": bool,
    "sentences": List[str],
    "total_sentences": int,
    "raw_text": Optional[str],
    "error": Optional[str],
    "filename": Optional[str]
}
```

### PDFToBiasDetectionResponse

```python
{
    "success": bool,
    "total_sentences": int,
    "biased_count": int,
    "neutral_count": int,
    "results": List[BiasResult],
    "error": Optional[str],
    "filename": Optional[str]
}
```

Where `BiasResult`:

```python
{
    "sentence": str,
    "category": str,
    "confidence": float,
    "is_biased": bool
}
```

## Processing Pipeline

### Step 1: Text Extraction

- Uses PyMuPDF (fitz) to extract text from PDF
- Handles multi-page documents
- Detects image-based PDFs (requires OCR)

### Step 2: Text Cleaning

- Removes extra whitespace
- Normalizes newlines
- Fixes formatting issues

### Step 3: Sentence Segmentation

- Uses regex patterns for Nepali sentence boundaries
- Recognizes: `।` (danda), `.`, `!`, `?`
- Filters out short fragments (< 5 characters)

### Step 4: LLM Refinement (Optional)

- Sends sentences to Mistral LLM
- Corrects mis-segmented sentences
- Removes duplicates
- Returns properly formatted JSON array

## Configuration

### Environment Variables

```bash
# Required for LLM refinement
export MISTRAL_API_KEY="your-api-key"

# Optional
export MISTRAL_MODEL="mistral-small"  # Default: mistral-small
export LOG_LEVEL="INFO"
```

### Processing Options

```python
# With LLM refinement (more accurate, slower)
result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=True   # Uses Mistral LLM
)

# Without LLM refinement (faster, regex-based)
result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=False  # Regex-based segmentation only
)
```

## Error Handling

The module handles various error scenarios:

```python
result = processor.process_pdf(pdf_path="file.pdf")

if not result["success"]:
    error = result["error"]
    # Possible errors:
    # - "No text could be extracted from the PDF"
    # - "Could not segment sentences from extracted text"
    # - "PDF might be image-based (requires OCR)"
    # - "File not found: path/to/file.pdf"
```

## Performance Considerations

### Execution Time Estimates

| Operation | Time | Notes |
|-----------|------|-------|
| PDF Text Extraction | ~100-500ms | Depends on PDF size |
| Sentence Segmentation | ~50-200ms | Regex-based |
| LLM Refinement | ~2-5s | API call to Mistral |
| Total (with LLM) | ~3-6s | Per document |
| Total (without LLM) | ~150-700ms | Regex only |

### Optimization Tips

1. **Disable LLM refinement** for faster processing when accuracy is less critical
2. **Batch multiple PDFs** to amortize API overhead
3. **Cache results** if processing the same PDFs repeatedly

## Integration with Bias Detection

### Workflow

```
1. User uploads PDF
   ↓
2. Extract sentences using PDFProcessor
   ↓
3. Send sentences to Bias Detection model
   ↓
4. Classify each sentence (neutral/gender/caste/religion/etc.)
   ↓
5. Return analysis results to user
```

### Code Example

```python
from utility.pdf_processor import PDFProcessor
from api.routes.bias_detection import run_bias_detection

processor = PDFProcessor()

# Process PDF
pdf_result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=True
)

if pdf_result["success"]:
    sentences = pdf_result["sentences"]
    combined_text = " ".join(sentences)

    # Run bias detection
    bias_result = run_bias_detection(
        text=combined_text,
        confidence_threshold=0.7
    )

    print(f"Biased sentences: {bias_result.biased_count}")
    print(f"Neutral sentences: {bias_result.neutral_count}")
```

## Nepali Language Support

### Character Range Supported

The module recognizes these Devanagari character ranges:

- Independent vowels and consonants: अ-ह
- Dependent vowel signs (matras): ा-ौ
- Other signs and symbols: ँ-ॿ

### Sentence Boundaries

Recognized punctuation:

- `।` Danda (primary Nepali punctuation)
- `.` Period
- `!` Exclamation mark
- `?` Question mark

## Logging

Enable debug logging to track processing:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("utility.pdf_processor")

# Detailed logs are now emitted during processing
processor = PDFProcessor()
result = processor.process_pdf("document.pdf")
```

## File Structure

```
utility/
├── __init__.py                  # Module initialization
├── pdf_processor.py             # Main PDFProcessor class
├── pdf_processor_examples.py    # Usage examples
└── docs/
    └── pdf_processing.md        # This file

api/
├── routes/
│   └── pdf_processing.py        # API endpoints
└── schemas.py                   # Pydantic models
```

## Troubleshooting

### Issue: "No text could be extracted from the PDF"

**Cause:** PDF is image-based (scanned document)

**Solution:** Requires OCR support (future enhancement)

### Issue: "LLM refinement failed"

**Cause:** Mistral API key missing or network error

**Solution:** Check the `MISTRAL_API_KEY` environment variable

### Issue: Sentences are too short or fragmented

**Solution:** Sentences shorter than 5 characters are filtered. Adjust the threshold in code if needed.
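
To illustrate how the length threshold interacts with segmentation, here is a minimal sketch of regex-based splitting on the punctuation marks listed in this document. The function name `split_sentences` and the `min_length` parameter are hypothetical and chosen for illustration; the actual `PDFProcessor` internals may differ. The sketch also assumes sentences are separated by whitespace after the punctuation mark.

```python
import re
from typing import List

# Split on whitespace that follows a danda, period, exclamation or question mark.
# Assumption: sentence-ending punctuation is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[।.!?])\s+")


def split_sentences(text: str, min_length: int = 5) -> List[str]:
    """Split text into sentences and drop fragments shorter than min_length.

    Hypothetical helper, not the module's real API. Length is measured in
    Unicode code points, so Devanagari vowel signs count individually.
    """
    parts = SENTENCE_END.split(text.strip())
    return [p.strip() for p in parts if len(p.strip()) >= min_length]


if __name__ == "__main__":
    sample = "पहिलो वाक्य यहाँ छ। हो। दोस्रो वाक्य यहाँ छ।"
    # Default threshold drops the short fragment "हो।"
    print(split_sentences(sample))
    # A lower threshold keeps it
    print(split_sentences(sample, min_length=2))
```

Lowering `min_length` keeps short but valid sentences at the cost of letting more extraction noise through, so the right value depends on how clean the source PDFs are.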
### Issue: Slow processing with LLM

**Solution:**

- Disable LLM refinement (`refine_with_llm=False`) for speed
- Use smaller batch sizes
- Check network latency to Mistral API

## Future Enhancements

- [ ] OCR support for scanned PDFs
- [ ] Language detection and auto-switching
- [ ] Caching layer for repeated PDFs
- [ ] Batch processing optimization
- [ ] Support for other document formats (DOCX, TXT)
- [ ] Custom Nepali dictionary for better segmentation

## License

Part of the Nepal Justice Weaver project.