# 📦 INSTALLING DOCUMENT PROCESSING LIBRARIES

**Quick guide to installing the libraries needed to handle multiple document formats.**

---

## 🚀 QUICK INSTALL

```bash
cd /home/developer/projects/open-navigator
source venv/bin/activate

# Install all document processing libraries
pip install PyPDF2 pdfplumber python-pptx python-docx openpyxl

# Optional: OCR for scanned documents (requires tesseract)
pip install pytesseract Pillow
```

---

## 📋 WHAT GETS INSTALLED

| Library | Purpose | Size |
|---------|---------|------|
| **PyPDF2** | Extract text from PDFs | ~500 KB |
| **pdfplumber** | Advanced PDF extraction (tables) | ~2 MB |
| **python-pptx** | Extract text from PowerPoint | ~500 KB |
| **python-docx** | Extract text from Word documents | ~300 KB |
| **openpyxl** | Extract text from Excel | ~2 MB |
| **pytesseract** | OCR for scanned documents (optional) | ~100 KB |
| **Pillow** | Image processing for OCR (optional) | ~3 MB |

**Total: ~8 MB** (very lightweight!)

---

## 🔧 OPTIONAL: OCR SUPPORT

**For scanned PDFs and images, install the Tesseract OCR engine:**

### Ubuntu/Debian:

```bash
sudo apt-get update
sudo apt-get install tesseract-ocr
```

### macOS:

```bash
brew install tesseract
```

### Windows:

Download the installer from: https://github.com/UB-Mannheim/tesseract/wiki

---

## ✅ VERIFY INSTALLATION

```bash
# Test all libraries
python -c "
import PyPDF2
import pdfplumber
from pptx import Presentation
from docx import Document
import openpyxl
print('✅ All document libraries installed!')
"

# Test OCR (optional)
python -c "
import pytesseract
from PIL import Image
print('✅ OCR libraries installed!')
print(f'Tesseract version: {pytesseract.get_tesseract_version()}')
"
```

---

## 🎯 TEST WITH REAL DOCUMENT

```bash
# Test PDF extraction
python extraction/universal_extractor.py https://example.com/document.pdf

# Test PowerPoint extraction
python extraction/universal_extractor.py https://example.com/presentation.pptx

# Test Word extraction
python extraction/universal_extractor.py https://example.com/document.docx
```

---

## 🆘 TROUBLESHOOTING

### "No module named 'PyPDF2'"

```bash
pip install PyPDF2
```

### "pytesseract is not installed"

```bash
# Install Python package
pip install pytesseract

# Install system package (Ubuntu)
sudo apt-get install tesseract-ocr
```

### "TesseractNotFoundError"

```bash
# On Ubuntu/Debian
sudo apt-get install tesseract-ocr

# On macOS
brew install tesseract

# On Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH after installation
```

### "Permission denied"

```bash
# Make sure you're in the virtual environment
source venv/bin/activate

# Then retry installation
pip install -r requirements.txt
```

---

## 📊 STORAGE IMPACT

**Even with all libraries installed:**

- Virtual environment size: ~500 MB (essentially unchanged)
- Libraries add: ~8 MB
- **Total: Still under 1 GB** ✅

**Processing impact:**

- Extract text from 1,000 PDFs: ~50 MB local storage (temporary)
- Store in Parquet: ~5 MB (compressed)
- **Saves ~90% storage vs. storing the original files** ✅

---

## ✅ DONE!

**You can now extract text from:**

- ✅ PDF documents
- ✅ PowerPoint presentations
- ✅ Word documents
- ✅ Excel spreadsheets
- ✅ HTML pages
- ✅ Scanned documents (with OCR)

**All extracted text will be stored efficiently in Parquet format for FREE on Hugging Face!** 🎉
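
---

## 🧪 OPTIONAL: SCRIPTED CHECK

The import checks from the verification step can also be run as a small script that reports every missing package at once instead of stopping at the first failed import. This is only a sketch under a few assumptions: the script name `check_doc_libs.py` and the helper names `missing_packages` and `report` are hypothetical and not part of this project. It uses only the Python standard library and maps each pip package name from this guide to the module name it is imported under.

```python
import importlib.util
import shutil

# pip package name -> import module name (they differ for several libraries).
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
    "python-pptx": "pptx",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
}

# Only needed for OCR of scanned documents.
OPTIONAL = {
    "pytesseract": "pytesseract",
    "Pillow": "PIL",
}


def missing_packages(packages):
    """Return the pip names whose import module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in packages.items()
        if importlib.util.find_spec(module_name) is None
    ]


def report():
    """Build a status report as a list of printable lines (hypothetical helper)."""
    lines = []

    missing = missing_packages(REQUIRED)
    if missing:
        lines.append("Missing required packages: pip install " + " ".join(missing))
    else:
        lines.append("All document libraries found.")

    missing_ocr = missing_packages(OPTIONAL)
    if missing_ocr:
        lines.append("Missing optional OCR packages: pip install " + " ".join(missing_ocr))
    else:
        lines.append("OCR Python packages found.")

    # pytesseract is a wrapper; it also needs the tesseract binary on PATH.
    if shutil.which("tesseract") is None:
        lines.append("tesseract binary not found on PATH (needed only for OCR).")
    else:
        lines.append("tesseract binary found on PATH.")

    return lines


if __name__ == "__main__":
    for line in report():
        print(line)
```

Run it inside the activated virtual environment, for example with `python check_doc_libs.py`. Because it only looks modules up and never imports them, it is quick and avoids any import-time side effects, but it cannot detect a package that is present yet broken; the `python -c` checks remain the stronger test.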