Spaces:

khagu
/

setu

Running

App Files Files Community

setu / utility /README.md

khagu

chore: finally untrack large database files

3998131 about 2 months ago

preview code

raw

history blame contribute delete

2.58 kB

	# Utility Module

	This module contains shared utility functions and classes for the Nepal Justice Weaver project.

	## Components

	### 1. PDF Processor (`pdf_processor.py`)

	A comprehensive PDF processing module for extracting and refining Nepali text.

	Features:
	- PDF text extraction using PyMuPDF
	- Intelligent Nepali sentence segmentation
	- LLM-based refinement using Mistral
	- Integration with bias detection pipeline

	Key Classes:
	- `PDFProcessor`: Main class for PDF processing

	Usage:
	```python
	from utility.pdf_processor import PDFProcessor

	processor = PDFProcessor()
	result = processor.process_pdf(
	pdf_path="document.pdf",
	refine_with_llm=True
	)
	```

	API Endpoints:
	- `POST /api/v1/process-pdf` - Extract sentences from PDF
	- `POST /api/v1/process-pdf-to-bias` - Extract and analyze bias
	- `GET /api/v1/pdf-health` - Service health check

	## Dependencies

	```
	fitz (pymupdf) - PDF text extraction
	mistralai - LLM for sentence refinement
	fastapi - API framework (for routes)
	```

	## Documentation

	See [docs/pdf_processing.md](../docs/pdf_processing.md) for:
	- Complete API documentation
	- Usage examples
	- Configuration guide
	- Troubleshooting

	See [pdf_processor_examples.py](pdf_processor_examples.py) for code examples.

	## Testing

	Run tests:
	```bash
	pytest utility/test_pdf_processor.py -v
	```

	Manual tests:
	```bash
	python utility/test_pdf_processor.py
	```

	## Architecture

	```
	PDF Upload
	↓
	PDFProcessor
	├─ extract_text_from_pdf() [PyMuPDF]
	├─ clean_text() [Regex]
	├─ split_into_sentences() [Regex + Unicode]
	└─ refine_sentences_with_llm() [Mistral API]
	↓
	List of Sentences
	↓
	Bias Detection API
	```

	## File Structure

	```
	utility/
	├── __init__.py # Module initialization
	├── pdf_processor.py # Main PDF processor class
	├── pdf_processor_examples.py # Usage examples
	├── test_pdf_processor.py # Test suite
	└── README.md # This file
	```

	## Future Enhancements

	- [ ] OCR support for scanned PDFs
	- [ ] Language auto-detection
	- [ ] Additional document format support
	- [ ] Caching optimization
	- [ ] Batch processing improvements

	## Contributing

	When adding new utilities:
	1. Add classes/functions to appropriate module
	2. Update `__init__.py` with exports
	3. Add comprehensive docstrings
	4. Include examples in `*_examples.py`
	5. Add tests to `test_*.py`

	---

	For more information about PDF processing, see [docs/pdf_processing.md](../docs/pdf_processing.md).