# Utility Module

This module contains shared utility functions and classes for the Nepal Justice Weaver project.

## Components

### 1. PDF Processor (`pdf_processor.py`)

A comprehensive PDF processing module for extracting and refining Nepali text.

**Features:**
- PDF text extraction using PyMuPDF
- Intelligent Nepali sentence segmentation
- LLM-based refinement using Mistral
- Integration with bias detection pipeline

**Key Classes:**
- `PDFProcessor`: Main class for PDF processing

**Usage:**
```python
from utility.pdf_processor import PDFProcessor

processor = PDFProcessor()
result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=True
)
```

**API Endpoints:**
- `POST /api/v1/process-pdf` - Extract sentences from PDF
- `POST /api/v1/process-pdf-to-bias` - Extract and analyze bias
- `GET /api/v1/pdf-health` - Service health check
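
As a quick way to confirm the service is reachable, the health endpoint can be polled from Python using only the standard library. This is a minimal sketch: the base URL `http://localhost:8000` and the helper names `endpoint` and `check_health` are assumptions for illustration, and the response is assumed to be JSON (FastAPI's default), since its exact shape is not documented here.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Assumption: the API is served locally on port 8000; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def endpoint(path: str) -> str:
    """Build an absolute URL for one of the API paths listed above."""
    return urljoin(BASE_URL, path)


def check_health(timeout: float = 5.0) -> dict:
    """Call GET /api/v1/pdf-health and decode the body as JSON.

    Raises urllib.error.URLError if the service is not reachable.
    """
    with urllib.request.urlopen(endpoint("/api/v1/pdf-health"), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `check_health()` with the service running should return whatever payload the health route emits; the two `POST` routes additionally need a PDF upload, whose request format is described in the full API documentation.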

## Dependencies

```
pymupdf (fitz)  - PDF text extraction
mistralai       - LLM for sentence refinement
fastapi         - API framework (for routes)
```
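
Before running the processor, it can help to check that these packages are importable. The sketch below uses only the standard library; the helper name `missing_modules` and the `REQUIRED_MODULES` tuple are hypothetical and not part of this module. Note that PyMuPDF is installed as `pymupdf` but traditionally imported as `fitz`, so the import name is what gets checked.

```python
import importlib.util
from typing import Iterable, List

# Import names (not pip package names) of the dependencies listed above.
REQUIRED_MODULES = ("fitz", "mistralai", "fastapi")


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

`missing_modules(REQUIRED_MODULES)` returns an empty list when everything is installed, and otherwise names the modules that still need to be installed.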

## Documentation

See [docs/pdf_processing.md](../docs/pdf_processing.md) for:
- Complete API documentation
- Usage examples
- Configuration guide
- Troubleshooting

See [pdf_processor_examples.py](pdf_processor_examples.py) for code examples.

## Testing

Run tests:
```bash
pytest utility/test_pdf_processor.py -v
```

Run the manual tests by executing the test file directly:
```bash
python utility/test_pdf_processor.py
```

## Architecture

```
PDF Upload
   ↓
PDFProcessor
   ├─ extract_text_from_pdf()       [PyMuPDF]
   ├─ clean_text()                  [Regex]
   ├─ split_into_sentences()        [Regex + Unicode]
   └─ refine_sentences_with_llm()   [Mistral API]
   ↓
List of Sentences
   ↓
Bias Detection API
```
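
To make the regex stages of this pipeline concrete, here is a minimal sketch of what whitespace cleanup and Nepali sentence splitting could look like. It is not the actual `PDFProcessor` implementation: the function names mirror the diagram, but the patterns are assumptions. Sentences are split after the Devanagari danda (`।`, U+0964), the double danda (`॥`, U+0965), `?`, or `!`; the Latin full stop is deliberately left out to avoid splitting on abbreviations and decimals.

```python
import re
from typing import List

# Collapse any run of whitespace, including line breaks left by the PDF layout.
_WHITESPACE = re.compile(r"\s+")
# Split on whitespace that follows a danda, double danda, "?" or "!".
_SENTENCE_END = re.compile(r"(?<=[\u0964\u0965?!])\s+")


def clean_text(raw: str) -> str:
    """Normalize whitespace so each sentence ends up on a single line."""
    return _WHITESPACE.sub(" ", raw).strip()


def split_into_sentences(text: str) -> List[str]:
    """Split cleaned text into sentences, keeping the terminating punctuation."""
    parts = _SENTENCE_END.split(clean_text(text))
    return [part.strip() for part in parts if part.strip()]
```

The terminator stays attached to each sentence because the pattern matches only the whitespace after it. A splitter like this cannot repair sentences broken by PDF extraction errors, which is the gap the LLM refinement step is meant to cover.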

## File Structure

```
utility/
├── __init__.py                  # Module initialization
├── pdf_processor.py             # Main PDF processor class
├── pdf_processor_examples.py    # Usage examples
├── test_pdf_processor.py        # Test suite
└── README.md                    # This file
```

## Future Enhancements

- [ ] OCR support for scanned PDFs
- [ ] Language auto-detection
- [ ] Additional document format support
- [ ] Caching optimization
- [ ] Batch processing improvements

## Contributing

When adding new utilities:
1. Add classes/functions to the appropriate module
2. Update `__init__.py` with exports
3. Add comprehensive docstrings
4. Include examples in `*_examples.py`
5. Add tests to `test_*.py`

---

For more information about PDF processing, see [docs/pdf_processing.md](../docs/pdf_processing.md).