# INSTALLING DOCUMENT PROCESSING LIBRARIES

**Quick guide to installing the libraries needed to handle multiple document formats.**

---

## QUICK INSTALL

```bash
cd /home/developer/projects/open-navigator
source venv/bin/activate

# Install all document processing libraries
pip install PyPDF2 pdfplumber python-pptx python-docx openpyxl

# Optional: OCR for scanned documents (requires tesseract)
pip install pytesseract Pillow
```

---

## WHAT GETS INSTALLED

| Library | Purpose | Size |
|---------|---------|------|
| **PyPDF2** | Extract text from PDFs | ~500 KB |
| **pdfplumber** | Advanced PDF extraction (tables) | ~2 MB |
| **python-pptx** | Extract text from PowerPoint | ~500 KB |
| **python-docx** | Extract text from Word documents | ~300 KB |
| **openpyxl** | Extract text from Excel | ~2 MB |
| **pytesseract** | OCR for scanned documents (optional) | ~100 KB |
| **Pillow** | Image processing for OCR (optional) | ~3 MB |

**Total: ~8 MB** (very lightweight!)

---

## OPTIONAL: OCR SUPPORT

**For scanned PDFs and images, install the Tesseract OCR engine:**

### Ubuntu/Debian:

```bash
sudo apt-get update
sudo apt-get install tesseract-ocr
```

### macOS:

```bash
brew install tesseract
```

### Windows:

Download the installer from https://github.com/UB-Mannheim/tesseract/wiki and add Tesseract to PATH after installation.

---

## VERIFY INSTALLATION

```bash
# Test all libraries
python -c "
import PyPDF2
import pdfplumber
from pptx import Presentation
from docx import Document
import openpyxl
print('All document libraries installed!')
"

# Test OCR (optional)
python -c "
import pytesseract
from PIL import Image
print('OCR libraries installed!')
print(f'Tesseract version: {pytesseract.get_tesseract_version()}')
"
```
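
The same check can also be scripted so that it lists every missing library instead of stopping at the first failed import. This is a minimal sketch using only the standard library; the helper name `missing_modules` and the two lists are illustrative and not part of the project.

```python
from importlib.util import find_spec

# Import names (not pip package names) for the libraries installed above.
REQUIRED = ["PyPDF2", "pdfplumber", "pptx", "docx", "openpyxl"]
OPTIONAL = ["pytesseract", "PIL"]


def missing_modules(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_modules(names)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label} libraries: {status}")
```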

---

## TEST WITH REAL DOCUMENT

```bash
# Test PDF extraction
python extraction/universal_extractor.py https://example.com/document.pdf

# Test PowerPoint extraction
python extraction/universal_extractor.py https://example.com/presentation.pptx

# Test Word extraction
python extraction/universal_extractor.py https://example.com/document.docx
```
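
How `extraction/universal_extractor.py` chooses a library for each URL is not shown in this guide. One plausible approach is to dispatch on the file extension, roughly as sketched below. The mapping and the function name `pick_library` are assumptions made for illustration, not the project's actual code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to the library that would handle it.
EXTENSION_TO_LIBRARY = {
    ".pdf": "PyPDF2",
    ".pptx": "python-pptx",
    ".docx": "python-docx",
    ".xlsx": "openpyxl",
}


def pick_library(url_or_path):
    """Return the library name for a URL or POSIX-style path, or None if unsupported."""
    # urlparse drops the query string and fragment, leaving only the path part.
    path = urlparse(url_or_path).path
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_TO_LIBRARY.get(suffix)


if __name__ == "__main__":
    for target in (
        "https://example.com/document.pdf",
        "https://example.com/presentation.pptx",
        "https://example.com/document.docx",
    ):
        print(target, "->", pick_library(target))
```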

---

## TROUBLESHOOTING

### "No module named 'PyPDF2'"

```bash
pip install PyPDF2
```

### "pytesseract is not installed"

```bash
# Install Python package
pip install pytesseract

# Install system package (Ubuntu)
sudo apt-get install tesseract-ocr
```

### "TesseractNotFoundError"

```bash
# On Ubuntu/Debian
sudo apt-get install tesseract-ocr

# On macOS
brew install tesseract

# On Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH after installation
```
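
`TesseractNotFoundError` means pytesseract could not locate the `tesseract` executable. The sketch below checks for it from Python before running OCR. `find_tesseract` and `configure_pytesseract` are hypothetical helper names; `pytesseract.pytesseract.tesseract_cmd` is the attribute pytesseract reads to find the executable.

```python
import shutil


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def configure_pytesseract(explicit_path=None):
    """Point pytesseract at a tesseract executable and return the path used."""
    import pytesseract  # imported lazily so find_tesseract() works without it

    path = explicit_path or find_tesseract()
    if path is None:
        raise RuntimeError("tesseract not found; install it or pass its full path")
    pytesseract.pytesseract.tesseract_cmd = path
    return path


if __name__ == "__main__":
    print("tesseract executable:", find_tesseract() or "not found on PATH")
```

On Windows, passing the full path to `tesseract.exe` as `explicit_path` is an alternative to editing PATH.
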
### "Permission denied"

```bash
# Make sure you're in the virtual environment
source venv/bin/activate

# Then retry the installation
pip install -r requirements.txt
```

---

## STORAGE IMPACT

**Even with all libraries installed:**

- Virtual environment size: ~500 MB (essentially unchanged)
- Libraries add: ~8 MB
- **Total: still under 1 GB**

**Processing impact:**

- Extracting text from 1,000 PDFs: ~50 MB of temporary local storage
- Storing the result in Parquet: ~5 MB (compressed)
- **About 90% less storage than keeping the original files**

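
The figures above can be checked with a few lines of arithmetic. This sketch copies the per-library sizes from the table in this guide and assumes the ~50 MB figure is the baseline for the 90% claim. The function name `savings_percent` is illustrative.

```python
# Approximate sizes in MB, copied from the "WHAT GETS INSTALLED" table.
LIBRARY_SIZES_MB = {
    "PyPDF2": 0.5,
    "pdfplumber": 2.0,
    "python-pptx": 0.5,
    "python-docx": 0.3,
    "openpyxl": 2.0,
    "pytesseract": 0.1,
    "Pillow": 3.0,
}


def savings_percent(before_mb, after_mb):
    """Percentage of storage saved when going from before_mb to after_mb."""
    return round(100 * (1 - after_mb / before_mb), 1)


total_mb = round(sum(LIBRARY_SIZES_MB.values()), 1)
print(f"Libraries add about {total_mb} MB")
print(f"Parquet saves about {savings_percent(50, 5)}% relative to 50 MB")
```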

---

## DONE!

**You can now extract text from:**

- PDF documents
- PowerPoint presentations
- Word documents
- Excel spreadsheets
- HTML pages
- Scanned documents (with OCR)

**All will be stored efficiently in Parquet format for FREE on Hugging Face!**