# PDF Processing Module for Bias Detection
## Overview
The PDF Processing module (`utility/pdf_processor.py`) provides a complete pipeline for extracting text from Nepali PDFs and preparing sentences for bias detection analysis.
**Key Features:**
- ✓ PDF text extraction using PyMuPDF (fitz)
- ✓ Intelligent Nepali sentence segmentation
- ✓ LLM-based sentence refinement using Mistral
- ✓ Integration with bias detection API
- ✓ File upload support via API endpoints
- ✓ Error handling and logging
## Architecture
```
User Upload (PDF)
        ↓
PDFProcessor.process_pdf_from_bytes()
        ↓
[Extract Text] → [Clean Text] → [Split Sentences] → [Refine with LLM]
        ↓
List of Refined Sentences
        ↓
Bias Detection Model
        ↓
Bias Analysis Results
```
## Installation
### Required Dependencies
```bash
# PyMuPDF for PDF text extraction
pip install pymupdf
# Already included in module_a
# mistralai - Mistral LLM client
```
### Setup
1. Ensure `mistralai` is installed in your environment.
2. Set the `MISTRAL_API_KEY` environment variable.
3. The module uses the existing `MistralClient` from `module_a/llm_client.py`.
## Usage
### 1. Basic Python Usage
```python
from utility.pdf_processor import PDFProcessor
# Initialize processor
processor = PDFProcessor()
# Process PDF from file path
result = processor.process_pdf(
    pdf_path="path/to/document.pdf",
    refine_with_llm=True
)
if result["success"]:
    sentences = result["sentences"]
    print(f"Extracted {result['total_sentences']} sentences")
    for sentence in sentences:
        print(f"- {sentence}")
```
### 2. Process from Bytes (File Uploads)
```python
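# For API file uploads
from utility.pdf_processor import PDFProcessor

processor = PDFProcessor()

async def handle_upload(request):
    # `await` is only valid inside an async function; `request.file`
    # stands for the uploaded file object provided by your web framework.
    pdf_bytes = await request.file.read()
    return processor.process_pdf_from_bytes(
        pdf_bytes=pdf_bytes,
        refine_with_llm=True
    )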
```
### 3. API Endpoints
#### A. Extract Sentences Only
**Endpoint:** `POST /api/v1/process-pdf`
**Request:**
```bash
curl -X POST "http://localhost:8000/api/v1/process-pdf" \
  -F "file=@nepali_document.pdf" \
  -F "refine_with_llm=true"
```
**Response:**
```json
{
  "success": true,
  "sentences": [
    "पहिलो वाक्य यहाँ छ।",
    "दोस्रो वाक्य यहाँ छ।",
    "तेस्रो वाक्य यहाँ छ।"
  ],
  "total_sentences": 3,
  "filename": "nepali_document.pdf",
  "raw_text": "पहिलो वाक्य यहाँ छ। दोस्रो वाक्य यहाँ छ। तेस्रो वाक्य यहाँ छ।"
}
```
#### B. Extract Sentences + Bias Detection
**Endpoint:** `POST /api/v1/process-pdf-to-bias`
**Request:**
```bash
curl -X POST "http://localhost:8000/api/v1/process-pdf-to-bias" \
  -F "file=@nepali_document.pdf" \
  -F "refine_with_llm=true" \
  -F "confidence_threshold=0.7"
```
**Response:**
```json
{
  "success": true,
  "total_sentences": 3,
  "biased_count": 1,
  "neutral_count": 2,
  "results": [
    {
      "sentence": "पहिलो वाक्य यहाँ छ।",
      "category": "neutral",
      "confidence": 0.95,
      "is_biased": false
    },
    {
      "sentence": "दोस्रो वाक्य यहाँ छ।",
      "category": "gender",
      "confidence": 0.82,
      "is_biased": true
    },
    {
      "sentence": "तेस्रो वाक्य यहाँ छ।",
      "category": "neutral",
      "confidence": 0.91,
      "is_biased": false
    }
  ],
  "filename": "nepali_document.pdf"
}
```
#### C. Service Health Check
**Endpoint:** `GET /api/v1/pdf-health`
**Response:**
```json
{
  "status": "healthy",
  "pdf_processor": "ready",
  "mistral_client": "connected",
  "features": {
    "pdf_extraction": true,
    "sentence_segmentation": true,
    "llm_refinement": true
  }
}
```
## API Schemas
### PDFProcessingResponse
```python
{
    "success": bool,
    "sentences": List[str],
    "total_sentences": int,
    "raw_text": Optional[str],
    "error": Optional[str],
    "filename": Optional[str]
}
```
### PDFToBiasDetectionResponse
```python
{
    "success": bool,
    "total_sentences": int,
    "biased_count": int,
    "neutral_count": int,
    "results": List[BiasResult],
    "error": Optional[str],
    "filename": Optional[str]
}
```
Where `BiasResult`:
```python
{
    "sentence": str,
    "category": str,
    "confidence": float,
    "is_biased": bool
}
```
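To illustrate how the summary counts relate to the per-sentence results, the sketch below derives `total_sentences`, `biased_count`, and `neutral_count` from a list of `BiasResult`-shaped dictionaries. The `summarize_results` helper is hypothetical, and it assumes that every result not flagged `is_biased` counts as neutral. That matches the example response above, but it is not confirmed by the module's source.

```python
from typing import Dict, List, TypedDict


class BiasResult(TypedDict):
    sentence: str
    category: str
    confidence: float
    is_biased: bool


def summarize_results(results: List[BiasResult]) -> Dict[str, int]:
    """Derive summary counts from per-sentence results (hypothetical helper)."""
    biased = sum(1 for r in results if r["is_biased"])
    return {
        "total_sentences": len(results),
        "biased_count": biased,
        # Assumption: anything not flagged as biased is reported as neutral.
        "neutral_count": len(results) - biased,
    }


example: List[BiasResult] = [
    {"sentence": "पहिलो वाक्य यहाँ छ।", "category": "neutral", "confidence": 0.95, "is_biased": False},
    {"sentence": "दोस्रो वाक्य यहाँ छ।", "category": "gender", "confidence": 0.82, "is_biased": True},
    {"sentence": "तेस्रो वाक्य यहाँ छ।", "category": "neutral", "confidence": 0.91, "is_biased": False},
]
summary = summarize_results(example)
print(summary)
```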
## Processing Pipeline
### Step 1: Text Extraction
- Uses PyMuPDF (fitz) to extract text from PDF
- Handles multi-page documents
- Detects image-based (scanned) PDFs, which yield no extractable text and would require OCR (not yet supported)
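A minimal sketch of this step, assuming PyMuPDF is installed. The function names `extract_page_texts` and `looks_image_based`, and the 20-character threshold, are illustrative assumptions and are not taken from `pdf_processor.py`.

```python
from typing import List


def extract_page_texts(pdf_bytes: bytes) -> List[str]:
    """Return the extracted text of each page, using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the heuristic below works without it

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return [page.get_text() for page in doc]


def looks_image_based(page_texts: List[str], min_chars: int = 20) -> bool:
    """Heuristic (assumption): almost no extractable text suggests a scanned PDF."""
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars
```

A caller could run `looks_image_based(extract_page_texts(pdf_bytes))` before segmentation and return an OCR-related error early instead of passing empty text down the pipeline.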
### Step 2: Text Cleaning
- Removes extra whitespace
- Normalizes newlines
- Fixes formatting issues
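The cleaning step could look roughly like the following. This is a sketch under the assumption that "cleaning" means normalizing line endings and collapsing whitespace. The `clean_text` name and the exact rules are hypothetical.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize newlines and collapse extra whitespace (illustrative only)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Re-join lines that the PDF layout broke mid-sentence.
    text = re.sub(r"\s*\n\s*", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Joining every line break into a space is a deliberate simplification: it discards paragraph boundaries, which is acceptable here only because the next step splits on sentence punctuation rather than on layout.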
### Step 3: Sentence Segmentation
- Uses regex patterns for Nepali sentence boundaries
- Recognizes: `।` (danda), `.`, `!`, `?`
- Filters out short fragments (< 5 characters)
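The segmentation rules above can be sketched with a single regular expression. The pattern and the `split_sentences` name are assumptions for illustration, and the module's actual regex may differ. Here a danda, `!`, or `?` ends a sentence even without a following space, while a period must be followed by whitespace so that decimals such as `3.14` are not split.

```python
import re
from typing import List

# Split after "।", "!" or "?" (optional whitespace) or after "." plus whitespace.
_BOUNDARY = re.compile(r"(?<=[।!?])\s*|(?<=\.)\s+")


def split_sentences(text: str, min_chars: int = 5) -> List[str]:
    """Split text on sentence punctuation and drop very short fragments."""
    parts = (part.strip() for part in _BOUNDARY.split(text))
    # len() counts Unicode code points, so a Devanagari syllable with a
    # vowel sign counts as more than one "character" here.
    return [part for part in parts if len(part) >= min_chars]


sample = "पहिलो वाक्य यहाँ छ।दोस्रो वाक्य यहाँ छ। हो। के यो ठीक छ?"
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

In this sample the fragment `हो।` is three code points long, so the length filter drops it and three sentences remain.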
### Step 4: LLM Refinement (Optional)
- Sends sentences to Mistral LLM
- Corrects mis-segmented sentences
- Removes duplicates
- Returns properly formatted JSON array
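Because LLM replies are not guaranteed to be clean JSON, the refinement step benefits from defensive parsing. The sketch below shows one way to extract the JSON array, remove duplicates, and fall back to the regex-based sentences on failure. It does not call Mistral, and `parse_refined_sentences` is a hypothetical helper rather than the module's actual code.

```python
import json
from typing import List


def parse_refined_sentences(reply: str, fallback: List[str]) -> List[str]:
    """Parse a JSON array of sentences from an LLM reply; fall back on failure."""
    # Tolerate extra prose around the array by slicing from the first "["
    # to the last "]".
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        return fallback
    try:
        items = json.loads(reply[start : end + 1])
    except json.JSONDecodeError:
        return fallback
    if not isinstance(items, list):
        return fallback
    seen, refined = set(), []
    for item in items:
        if not isinstance(item, str):
            continue
        sentence = item.strip()
        if sentence and sentence not in seen:  # drop blanks and duplicates
            seen.add(sentence)
            refined.append(sentence)
    return refined or fallback
```

Falling back to the unrefined list keeps the pipeline usable when the API is unavailable or the reply is malformed.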
## Configuration
### Environment Variables
```bash
# Required for LLM refinement
export MISTRAL_API_KEY="your-api-key"
# Optional
export MISTRAL_MODEL="mistral-small" # Default: mistral-small
export LOG_LEVEL="INFO"
```
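These variables could be read at startup along the following lines. The `Settings` tuple and `load_settings` function are hypothetical. The `mistral-small` default comes from the comment above, while treating `INFO` as the default log level is an assumption.

```python
import os
from typing import Mapping, NamedTuple, Optional


class Settings(NamedTuple):
    api_key: Optional[str]
    model: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read PDF-processing settings from environment variables."""
    return Settings(
        api_key=env.get("MISTRAL_API_KEY") or None,  # empty string counts as unset
        model=env.get("MISTRAL_MODEL", "mistral-small"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),  # assumed default
    )


settings = load_settings()
# Without an API key, only regex-based segmentation is possible.
refine_with_llm = settings.api_key is not None
```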
### Processing Options
```python
# With LLM refinement (more accurate, slower)
result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=True  # Uses Mistral LLM
)
# Without LLM refinement (faster, regex-based)
result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=False  # Regex-based segmentation only
)
```
## Error Handling
The module handles various error scenarios:
```python
result = processor.process_pdf(pdf_path="file.pdf")
if not result["success"]:
    error = result["error"]
    # Possible errors:
    # - "No text could be extracted from the PDF"
    # - "Could not segment sentences from extracted text"
    # - "PDF might be image-based (requires OCR)"
    # - "File not found: path/to/file.pdf"
```
## Performance Considerations
### Execution Time Estimates
| Operation | Time | Notes |
|-----------|------|-------|
| PDF Text Extraction | ~100-500ms | Depends on PDF size |
| Sentence Segmentation | ~50-200ms | Regex-based |
| LLM Refinement | ~2-5s | API call to Mistral |
| Total (with LLM) | ~2-6s | Per document; sum of the three steps above |
| Total (without LLM) | ~150-700ms | Regex only |
### Optimization Tips
1. **Disable LLM refinement** for faster processing when accuracy is less critical
2. **Batch multiple PDFs** to amortize API overhead
3. **Cache results** if processing same PDFs repeatedly
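The caching tip can be sketched as a thin wrapper that keys results by a SHA-256 digest of the PDF bytes. `CachedPDFProcessing` is a hypothetical class, not part of the module. It assumes the wrapped callable accepts `pdf_bytes` and `refine_with_llm` keyword arguments and returns a dictionary with a `success` key, as `process_pdf_from_bytes` does in the examples above.

```python
import hashlib
from typing import Any, Callable, Dict


class CachedPDFProcessing:
    """In-memory cache keyed by a digest of the PDF bytes and the options."""

    def __init__(self, process: Callable[..., Dict[str, Any]]) -> None:
        self._process = process  # e.g. PDFProcessor().process_pdf_from_bytes
        self._cache: Dict[str, Dict[str, Any]] = {}

    def __call__(self, pdf_bytes: bytes, refine_with_llm: bool = True) -> Dict[str, Any]:
        key = f"{hashlib.sha256(pdf_bytes).hexdigest()}:{refine_with_llm}"
        if key not in self._cache:
            result = self._process(pdf_bytes=pdf_bytes, refine_with_llm=refine_with_llm)
            if not result.get("success"):
                return result  # do not cache failures, so retries can succeed
            self._cache[key] = result
        return self._cache[key]
```

The cache is unbounded and lives only for the lifetime of the process. A long-running service would likely need an eviction policy or an external store.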
## Integration with Bias Detection
### Workflow
```
1. User uploads PDF
2. Extract sentences using PDFProcessor
3. Send sentences to Bias Detection model
4. Classify each sentence (neutral/gender/caste/religion/etc.)
5. Return analysis results to user
```
### Code Example
```python
from utility.pdf_processor import PDFProcessor
from api.routes.bias_detection import run_bias_detection
processor = PDFProcessor()
# Process PDF
pdf_result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=True
)
if pdf_result["success"]:
    sentences = pdf_result["sentences"]
    combined_text = " ".join(sentences)
    # Run bias detection
    bias_result = run_bias_detection(
        text=combined_text,
        confidence_threshold=0.7
    )
    print(f"Biased sentences: {bias_result.biased_count}")
    print(f"Neutral sentences: {bias_result.neutral_count}")
```
## Nepali Language Support
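### Supported Character Ranges
The module recognizes the following Devanagari character ranges:
- Independent vowels and consonants: अ-ह (U+0905 to U+0939)
- Dependent vowel signs (matras): ा-ौ (U+093E to U+094C)
- Other signs and characters: ँ-ॿ (U+0901 to U+097F), a span that covers most of the Devanagari block, including digits and the danda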
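As an illustration of how such ranges can be used, the sketch below measures the share of Devanagari characters in a string, for example to check that extracted text is plausibly Nepali before segmentation. The `devanagari_ratio` helper is hypothetical. It tests against the full Devanagari Unicode block (U+0900 to U+097F) rather than the module's exact ranges.

```python
def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters inside the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return hits / len(chars)


print(devanagari_ratio("पहिलो वाक्य यहाँ छ।"))
print(devanagari_ratio("plain English text"))
```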
### Sentence Boundaries
Recognized punctuation:
- `।` Danda (primary Nepali punctuation)
- `.` Period
- `!` Exclamation mark
- `?` Question mark
## Logging
Enable debug logging to track processing:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("utility.pdf_processor")
# Detailed logs are now emitted during processing
processor = PDFProcessor()
result = processor.process_pdf("document.pdf")
```
## File Structure
```
utility/
├── __init__.py # Module initialization
├── pdf_processor.py # Main PDFProcessor class
├── pdf_processor_examples.py # Usage examples
└── docs/
└── pdf_processing.md # This file
api/
├── routes/
│ └── pdf_processing.py # API endpoints
└── schemas.py # Pydantic models
```
## Troubleshooting
### Issue: "No text could be extracted from the PDF"
**Cause:** PDF is image-based (scanned document)
**Solution:** OCR is not yet supported (planned as a future enhancement). Use a PDF that contains a text layer.
### Issue: "LLM refinement failed"
**Cause:** Mistral API key missing or network error
**Solution:** Check `MISTRAL_API_KEY` environment variable
### Issue: Sentences are too short or fragmented
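**Solution:** Fragments shorter than 5 characters are already filtered out. If fragmented sentences remain, enable LLM refinement (`refine_with_llm=True`), which corrects mis-segmented sentences, or adjust the length threshold in code.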
### Issue: Slow processing with LLM
**Solution:**
- Disable LLM refinement (`refine_with_llm=False`) for speed
- Use smaller batch sizes
- Check network latency to Mistral API
## Future Enhancements
- [ ] OCR support for scanned PDFs
- [ ] Language detection and auto-switching
- [ ] Caching layer for repeated PDFs
- [ ] Batch processing optimization
- [ ] Support for other document formats (DOCX, TXT)
- [ ] Custom Nepali dictionary for better segmentation
## License
Part of the Nepal Justice Weaver project.