
📦 INSTALLING DOCUMENT PROCESSING LIBRARIES

A quick guide to installing the libraries needed to handle multiple document formats.


🚀 QUICK INSTALL

cd /home/developer/projects/open-navigator
source venv/bin/activate

# Install all document processing libraries
pip install PyPDF2 pdfplumber python-pptx python-docx openpyxl

# Optional: OCR for scanned documents (requires tesseract)
pip install pytesseract Pillow

📋 WHAT GETS INSTALLED

| Library | Purpose | Size |
|---|---|---|
| PyPDF2 | Extract text from PDFs | ~500 KB |
| pdfplumber | Advanced PDF extraction (tables) | ~2 MB |
| python-pptx | Extract text from PowerPoint | ~500 KB |
| python-docx | Extract text from Word documents | ~300 KB |
| openpyxl | Extract text from Excel | ~2 MB |
| pytesseract | OCR for scanned documents (optional) | ~100 KB |
| Pillow | Image processing for OCR | ~3 MB |

Total: ~8 MB (very lightweight!)


🔧 OPTIONAL: OCR SUPPORT

For scanned PDFs and images, install the Tesseract OCR engine:

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr

macOS:

brew install tesseract

Windows:

Download the installer from: https://github.com/UB-Mannheim/tesseract/wiki


✅ VERIFY INSTALLATION

# Test all libraries
python -c "
import PyPDF2
import pdfplumber
from pptx import Presentation
from docx import Document
import openpyxl
print('✅ All document libraries installed!')
"

# Test OCR (optional)
python -c "
import pytesseract
from PIL import Image
print('✅ OCR libraries installed!')
print(f'Tesseract version: {pytesseract.get_tesseract_version()}')
"

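The import test above stops at the first missing library. As a complement, here is a minimal sketch that reports every missing package at once. The `REQUIRED` and `OPTIONAL` dictionaries and the `missing_packages` helper are illustrative names, not part of the project; they map each pip package name from this guide to the module name it is imported under.

```python
from importlib.util import find_spec

# pip package name -> importable module name, for the libraries in this guide.
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
    "python-pptx": "pptx",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
}
OPTIONAL = {"pytesseract": "pytesseract", "Pillow": "PIL"}


def missing_packages(packages):
    """Return the pip names whose module cannot be located by the import system."""
    return [name for name, module in packages.items() if find_spec(module) is None]


for label, group in (("required", REQUIRED), ("optional", OPTIONAL)):
    missing = missing_packages(group)
    if missing:
        print(f"Missing {label} packages: pip install {' '.join(missing)}")
    else:
        print(f"All {label} packages found.")
```

Because `find_spec` only locates a module without importing it, this check is quick, but it does not prove that the package imports cleanly.
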
🎯 TEST WITH REAL DOCUMENT

# Test PDF extraction
python extraction/universal_extractor.py https://example.com/document.pdf

# Test PowerPoint extraction
python extraction/universal_extractor.py https://example.com/presentation.pptx

# Test Word extraction
python extraction/universal_extractor.py https://example.com/document.docx
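
The commands above pass URLs with different file extensions to the same script. How `universal_extractor.py` chooses a library is not shown in this guide, so the following is only a sketch of one plausible approach: look up the extension in a table. The mapping and the `library_for` function are hypothetical and are not taken from the project code.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a library installed in this guide.
LIBRARY_FOR_EXTENSION = {
    ".pdf": "pdfplumber",
    ".pptx": "python-pptx",
    ".docx": "python-docx",
    ".xlsx": "openpyxl",
}


def library_for(source: str) -> Optional[str]:
    """Return the library name for a URL or POSIX-style path, or None if unsupported."""
    path = urlparse(source).path  # drops any query string or fragment
    suffix = PurePosixPath(path).suffix.lower()
    return LIBRARY_FOR_EXTENSION.get(suffix)


print(library_for("https://example.com/document.pdf"))
print(library_for("https://example.com/presentation.pptx"))
```

Lower-casing the suffix means `REPORT.DOCX` and `report.docx` resolve to the same library, and unsupported extensions return `None` instead of raising.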

🆘 TROUBLESHOOTING

"No module named 'PyPDF2'"

pip install PyPDF2

"pytesseract is not installed"

# Install Python package
pip install pytesseract

# Install system package (Ubuntu)
sudo apt-get install tesseract-ocr

"TesseractNotFoundError"

# On Ubuntu/Debian
sudo apt-get install tesseract-ocr

# On macOS
brew install tesseract

# On Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH after installation
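
If the error persists after installation, it can help to check whether the `tesseract` binary is actually visible on `PATH`. The sketch below does that with the standard library and, if pytesseract is importable, points it at the binary through its `pytesseract.pytesseract.tesseract_cmd` setting. The `find_tesseract` helper is an illustrative name, not part of the project.

```python
import shutil


def find_tesseract(candidates=("tesseract",)):
    """Return the full path of the first candidate executable on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found on PATH; install it or add its folder to PATH.")
else:
    print(f"Tesseract found at: {tesseract_path}")
    try:
        import pytesseract

        # Tell pytesseract explicitly which binary to run.
        pytesseract.pytesseract.tesseract_cmd = tesseract_path
    except ImportError:
        print("pytesseract is not installed; run: pip install pytesseract")
```

On Windows, where the installer may not update `PATH`, the same setting can be given the full path to `tesseract.exe` in whichever folder you installed it to.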

"Permission denied"

# Make sure the virtual environment is active
source venv/bin/activate

# Then retry installation
pip install -r requirements.txt
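
To confirm that the virtual environment is really active before retrying, you can ask the interpreter itself. A minimal sketch, assuming the environment was created with the standard `venv` module (the `in_virtualenv` helper is an illustrative name):

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from base prefix)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if in_virtualenv():
    print(f"Virtual environment active: {sys.prefix}")
else:
    print("No virtual environment active; run: source venv/bin/activate")
```

If this reports no active environment, `pip` is probably writing to a system location, which is a common cause of "Permission denied".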

📊 STORAGE IMPACT

Even with all libraries installed:

  • Virtual environment size: ~500 MB (essentially unchanged)
  • Libraries add: ~8 MB
  • Total: Still under 1 GB ✅

Processing impact:

  • Extract text from 1000 PDFs: ~50 MB local storage (temporary)
  • Store in Parquet: ~5 MB (compressed)
  • Save 90% storage vs storing original files ✅
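
The 90% figure follows from the two numbers above, assuming the ~50 MB figure is the baseline that the ~5 MB Parquet output is compared against. A small worked sketch of that arithmetic (the `storage_saving` function is illustrative only):

```python
def storage_saving(original_mb: float, compressed_mb: float) -> float:
    """Return the fraction of storage saved relative to the original size."""
    if original_mb <= 0:
        raise ValueError("original_mb must be positive")
    return 1 - compressed_mb / original_mb


saving = storage_saving(50, 5)  # figures from the list above
print(f"Storage saved: {saving:.0%}")  # prints "Storage saved: 90%"
```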

✅ DONE!

You can now extract text from:

  • ✅ PDF documents
  • ✅ PowerPoint presentations
  • ✅ Word documents
  • ✅ Excel spreadsheets
  • ✅ HTML pages
  • ✅ Scanned documents (with OCR)

All will be stored efficiently in Parquet format for FREE on Hugging Face! 🎉