Spaces:

khagu
/

setu

Running

App Files Files Community

setu / docs /pdf_processing.md

khagu

chore: finally untrack large database files

3998131 22 days ago

preview code

raw

history blame contribute delete

9.59 kB

	# PDF Processing Module for Bias Detection

	## Overview

	The PDF Processing module (`utility/pdf_processor.py`) provides a complete pipeline for extracting text from Nepali PDFs and preparing sentences for bias detection analysis.

	Key Features:
	- ✓ PDF text extraction using PyMuPDF (fitz)
	- ✓ Intelligent Nepali sentence segmentation
	- ✓ LLM-based sentence refinement using Mistral
	- ✓ Integration with bias detection API
	- ✓ File upload support via API endpoints
	- ✓ Error handling and logging

	## Architecture

	```
	User Upload (PDF)
	↓
	PDFProcessor.process_pdf_from_bytes()
	↓
	[Extract Text] → [Clean Text] → [Split Sentences] → [Refine with LLM]
	↓
	List of Refined Sentences
	↓
	Bias Detection Model
	↓
	Bias Analysis Results
	```

	## Installation

	### Required Dependencies

	```bash
	# PyMuPDF for PDF text extraction
	pip install pymupdf

	# Already included in module_a
	# mistralai - Mistral LLM client
	```

	### Setup

	1. Ensure `mistralai` is installed in your environment
	2. Set `MISTRAL_API_KEY` environment variable
	3. Module uses existing `MistralClient` from `module_a/llm_client.py`

	## Usage

	### 1. Basic Python Usage

	```python
	from utility.pdf_processor import PDFProcessor

	# Initialize processor
	processor = PDFProcessor()

	# Process PDF from file path
	result = processor.process_pdf(
	pdf_path="path/to/document.pdf",
	refine_with_llm=True
	)

	if result["success"]:
	sentences = result["sentences"]
	print(f"Extracted {result['total_sentences']} sentences")
	for sentence in sentences:
	print(f"- {sentence}")
	```

	### 2. Process from Bytes (File Uploads)

	```python
	# For API file uploads
	processor = PDFProcessor()

	pdf_bytes = await request.file.read()
	result = processor.process_pdf_from_bytes(
	pdf_bytes=pdf_bytes,
	refine_with_llm=True
	)
	```

	### 3. API Endpoints

	#### A. Extract Sentences Only

	Endpoint: `POST /api/v1/process-pdf`

	Request:
	```bash
	curl -X POST "http://localhost:8000/api/v1/process-pdf" \
	-F "file=@nepali_document.pdf" \
	-F "refine_with_llm=true"
	```

	Response:
	```json
	{
	"success": true,
	"sentences": [
	"पहिलो वाक्य यहाँ छ।",
	"दोस्रो वाक्य यहाँ छ।",
	"तेस्रो वाक्य यहाँ छ।"
	],
	"total_sentences": 3,
	"filename": "nepali_document.pdf",
	"raw_text": "पहिलो वाक्य यहाँ छ। दोस्रो वाक्य यहाँ छ। तेस्रो वाक्य यहाँ छ।"
	}
	```

	#### B. Extract Sentences + Bias Detection

	Endpoint: `POST /api/v1/process-pdf-to-bias`

	Request:
	```bash
	curl -X POST "http://localhost:8000/api/v1/process-pdf-to-bias" \
	-F "file=@nepali_document.pdf" \
	-F "refine_with_llm=true" \
	-F "confidence_threshold=0.7"
	```

	Response:
	```json
	{
	"success": true,
	"total_sentences": 3,
	"biased_count": 1,
	"neutral_count": 2,
	"results": [
	{
	"sentence": "पहिलो वाक्य यहाँ छ।",
	"category": "neutral",
	"confidence": 0.95,
	"is_biased": false
	},
	{
	"sentence": "दोस्रो वाक्य यहाँ छ।",
	"category": "gender",
	"confidence": 0.82,
	"is_biased": true
	},
	{
	"sentence": "तेस्रो वाक्य यहाँ छ।",
	"category": "neutral",
	"confidence": 0.91,
	"is_biased": false
	}
	],
	"filename": "nepali_document.pdf"
	}
	```

	#### C. Service Health Check

	Endpoint: `GET /api/v1/pdf-health`

	Response:
	```json
	{
	"status": "healthy",
	"pdf_processor": "ready",
	"mistral_client": "connected",
	"features": {
	"pdf_extraction": true,
	"sentence_segmentation": true,
	"llm_refinement": true
	}
	}
	```

	## API Schemas

	### PDFProcessingResponse

	```python
	{
	"success": bool,
	"sentences": List[str],
	"total_sentences": int,
	"raw_text": Optional[str],
	"error": Optional[str],
	"filename": Optional[str]
	}
	```

	### PDFToBiasDetectionResponse

	```python
	{
	"success": bool,
	"total_sentences": int,
	"biased_count": int,
	"neutral_count": int,
	"results": List[BiasResult],
	"error": Optional[str],
	"filename": Optional[str]
	}
	```

	Where `BiasResult`:
	```python
	{
	"sentence": str,
	"category": str,
	"confidence": float,
	"is_biased": bool
	}
	```

	## Processing Pipeline

	### Step 1: Text Extraction
	- Uses PyMuPDF (fitz) to extract text from PDF
	- Handles multi-page documents
	- Detects image-based PDFs (requires OCR)

	### Step 2: Text Cleaning
	- Removes extra whitespace
	- Normalizes newlines
	- Fixes formatting issues

	### Step 3: Sentence Segmentation
	- Uses regex patterns for Nepali sentence boundaries
	- Recognizes: । (danda), . , ! , ?
	- Filters out short fragments (< 5 characters)

	### Step 4: LLM Refinement (Optional)
	- Sends sentences to Mistral LLM
	- Corrects mis-segmented sentences
	- Removes duplicates
	- Returns properly formatted JSON array

	## Configuration

	### Environment Variables

	```bash
	# Required for LLM refinement
	export MISTRAL_API_KEY="your-api-key"

	# Optional
	export MISTRAL_MODEL="mistral-small" # Default: mistral-small
	export LOG_LEVEL="INFO"
	```

	### Processing Options

	```python
	# With LLM refinement (more accurate, slower)
	result = processor.process_pdf(
	pdf_path="document.pdf",
	refine_with_llm=True # Uses Mistral LLM
	)

	# Without LLM refinement (faster, regex-based)
	result = processor.process_pdf(
	pdf_path="document.pdf",
	refine_with_llm=False # Regex-based segmentation only
	)
	```

	## Error Handling

	The module handles various error scenarios:

	```python
	result = processor.process_pdf(pdf_path="file.pdf")

	if not result["success"]:
	error = result["error"]
	# Possible errors:
	# - "No text could be extracted from the PDF"
	# - "Could not segment sentences from extracted text"
	# - "PDF might be image-based (requires OCR)"
	# - "File not found: path/to/file.pdf"
	```

	## Performance Considerations

	### Execution Time Estimates

	\| Operation \| Time \| Notes \|
	\|-----------\|------\|-------\|
	\| PDF Text Extraction \| ~100-500ms \| Depends on PDF size \|
	\| Sentence Segmentation \| ~50-200ms \| Regex-based \|
	\| LLM Refinement \| ~2-5s \| API call to Mistral \|
	\| Total (with LLM) \| ~3-6s \| Per document \|
	\| Total (without LLM) \| ~150-700ms \| Regex only \|

	### Optimization Tips

	1. Disable LLM refinement for faster processing when accuracy is less critical
	2. Batch multiple PDFs to amortize API overhead
	3. Cache results if processing same PDFs repeatedly

	## Integration with Bias Detection

	### Workflow

	```
	1. User uploads PDF
	↓
	2. Extract sentences using PDFProcessor
	↓
	3. Send sentences to Bias Detection model
	↓
	4. Classify each sentence (neutral/gender/caste/religion/etc.)
	↓
	5. Return analysis results to user
	```

	### Code Example

	```python
	from utility.pdf_processor import PDFProcessor
	from api.routes.bias_detection import run_bias_detection

	processor = PDFProcessor()

	# Process PDF
	pdf_result = processor.process_pdf(
	pdf_path="document.pdf",
	refine_with_llm=True
	)

	if pdf_result["success"]:
	sentences = pdf_result["sentences"]
	combined_text = " ".join(sentences)

	# Run bias detection
	bias_result = run_bias_detection(
	text=combined_text,
	confidence_threshold=0.7
	)

	print(f"Biased sentences: {bias_result.biased_count}")
	print(f"Neutral sentences: {bias_result.neutral_count}")
	```

	## Nepali Language Support

	### Character Range Supported

	The module recognizes Nepali character ranges:
	- Consonants: अ-ह
	- Vowels: ा-ौ
	- Special characters: ँ-ॿ

	### Sentence Boundaries

	Recognized punctuation:
	- `।` Danda (primary Nepali punctuation)
	- `.` Period
	- `!` Exclamation mark
	- `?` Question mark

	## Logging

	Enable debug logging to track processing:

	```python
	import logging

	logging.basicConfig(level=logging.DEBUG)
	logger = logging.getLogger("utility.pdf_processor")

	# Now see detailed logs
	processor = PDFProcessor()
	result = processor.process_pdf("document.pdf")
	```

	## Files Structure

	```
	utility/
	├── __init__.py # Module initialization
	├── pdf_processor.py # Main PDFProcessor class
	├── pdf_processor_examples.py # Usage examples
	└── docs/
	└── pdf_processing.md # This file

	api/
	├── routes/
	│ └── pdf_processing.py # API endpoints
	└── schemas.py # Pydantic models
	```

	## Troubleshooting

	### Issue: "No text could be extracted from the PDF"

	Cause: PDF is image-based (scanned document)
	Solution: Requires OCR support (future enhancement)

	### Issue: "LLM refinement failed"

	Cause: Mistral API key missing or network error
	Solution: Check `MISTRAL_API_KEY` environment variable

	### Issue: Sentences are too short or fragmented

	Solution: Sentences shorter than 5 characters are filtered. Adjust threshold in code if needed.

	### Issue: Slow processing with LLM

	Solution:
	- Disable LLM refinement (`refine_with_llm=False`) for speed
	- Use smaller batch sizes
	- Check network latency to Mistral API

	## Future Enhancements

	- [ ] OCR support for scanned PDFs
	- [ ] Language detection and auto-switching
	- [ ] Caching layer for repeated PDFs
	- [ ] Batch processing optimization
	- [ ] Support for other document formats (DOCX, TXT)
	- [ ] Custom Nepali dictionary for better segmentation

	## License

	Part of Nepal Justice Weaver project