INSTALLING DOCUMENT PROCESSING LIBRARIES
A quick guide to installing the libraries needed to handle multiple document formats.
QUICK INSTALL
cd /home/developer/projects/open-navigator
source venv/bin/activate
# Install all document processing libraries
pip install PyPDF2 pdfplumber python-pptx python-docx openpyxl
# Optional: OCR for scanned documents (requires tesseract)
pip install pytesseract Pillow
WHAT GETS INSTALLED
| Library | Purpose | Size |
|---|---|---|
| PyPDF2 | Extract text from PDFs | ~500 KB |
| pdfplumber | Advanced PDF extraction (tables) | ~2 MB |
| python-pptx | Extract text from PowerPoint | ~500 KB |
| python-docx | Extract text from Word documents | ~300 KB |
| openpyxl | Extract text from Excel | ~2 MB |
| pytesseract | OCR for scanned documents (optional) | ~100 KB |
| Pillow | Image processing for OCR | ~3 MB |
Total: ~8 MB (very lightweight!)
OPTIONAL: OCR SUPPORT
For scanned PDFs and images, install the Tesseract OCR engine:
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install tesseract-ocr
macOS:
brew install tesseract
Windows:
Download the installer from: https://github.com/UB-Mannheim/tesseract/wiki
VERIFY INSTALLATION
# Test all libraries
python -c "
import PyPDF2
import pdfplumber
from pptx import Presentation
from docx import Document
import openpyxl
print('All document libraries installed!')
"
# Test OCR (optional)
python -c "
import pytesseract
from PIL import Image
print('OCR libraries installed!')
print(f'Tesseract version: {pytesseract.get_tesseract_version()}')
"
TEST WITH A REAL DOCUMENT
# Test PDF extraction
python extraction/universal_extractor.py https://example.com/document.pdf
# Test PowerPoint extraction
python extraction/universal_extractor.py https://example.com/presentation.pptx
# Test Word extraction
python extraction/universal_extractor.py https://example.com/document.docx
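The commands above pass URLs with different file extensions to the same script. How `universal_extractor.py` picks a library is not shown here, so the following is only a hypothetical sketch of extension-based dispatch. The `LIBRARY_FOR_EXTENSION` table and the `library_for` helper are illustrative names, not part of the project.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to the pip package that handles it.
# pdfplumber is an alternative for PDFs when tables need to be extracted.
LIBRARY_FOR_EXTENSION = {
    ".pdf": "PyPDF2",
    ".pptx": "python-pptx",
    ".docx": "python-docx",
    ".xlsx": "openpyxl",
}


def library_for(url):
    """Return the package name for the URL's file extension, or None if unknown."""
    # Use only the path component so query strings do not affect the suffix.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return LIBRARY_FOR_EXTENSION.get(suffix)


print(library_for("https://example.com/document.pdf"))
print(library_for("https://example.com/presentation.pptx"))
```

An unknown extension returns `None`, which lets a caller decide whether to skip the file or fall back to another handler.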
TROUBLESHOOTING
"No module named 'PyPDF2'"
pip install PyPDF2
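If import errors persist, it can help to list every module that is still missing in one pass. This is a minimal sketch using only the standard library. The `missing_modules` helper is a hypothetical name, and the lists use import names (`pptx`, `docx`, `PIL`) rather than pip package names.

```python
import importlib.util

# Import names for the packages installed in this guide.
REQUIRED = ["PyPDF2", "pdfplumber", "pptx", "docx", "openpyxl"]
OPTIONAL = ["pytesseract", "PIL"]


def missing_modules(names):
    """Return the subset of names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing required modules:", missing_modules(REQUIRED))
print("Missing optional modules:", missing_modules(OPTIONAL))
```

Run it with the same interpreter (inside the activated virtual environment) that will run the extractor, since a different interpreter may see a different set of packages.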
"pytesseract is not installed"
# Install Python package
pip install pytesseract
# Install system package (Ubuntu)
sudo apt-get install tesseract-ocr
"TesseractNotFoundError"
# On Ubuntu/Debian
sudo apt-get install tesseract-ocr
# On macOS
brew install tesseract
# On Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH after installation
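`TesseractNotFoundError` usually means the `tesseract` executable is not on PATH, even if the Python package is installed. A small sketch to check this with the standard library; `find_tesseract` is a hypothetical helper name:

```python
import shutil


def find_tesseract():
    """Return the full path of the tesseract executable, or None if not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; install it or add its folder to PATH")
else:
    print(f"tesseract found at: {path}")
```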
"Permission denied"
# Make sure you're in the virtual environment
source venv/bin/activate
# Then retry installation
pip install -r requirements.txt
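To confirm that the virtual environment is actually active before retrying, one option is to compare the interpreter prefixes. This sketch assumes an environment created with `python -m venv` or a recent `virtualenv`; `in_virtualenv` is a hypothetical helper name.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
print("Interpreter prefix:", sys.prefix)
```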
STORAGE IMPACT
Even with all libraries installed:
- Virtual environment size: ~500 MB (effectively unchanged)
- Libraries add: ~8 MB
- Total: still under 1 GB
Processing impact:
- Extract text from 1000 PDFs: ~50 MB local storage (temporary)
- Store in Parquet: ~5 MB (compressed)
- Saves about 90% of storage compared with storing the original files
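The percentage can be checked with simple arithmetic. The sketch below takes the approximate figures from the list at face value (~50 MB before, ~5 MB in Parquet); real savings depend on the documents. `savings_percent` is a hypothetical helper name.

```python
def savings_percent(before_mb, after_mb):
    """Return the storage saved, as a percentage of the original size."""
    return 100.0 * (before_mb - after_mb) / before_mb


# Approximate figures quoted in this guide for 1000 PDFs.
before_mb = 50
after_mb = 5

print(f"Saved: {savings_percent(before_mb, after_mb):.0f}%")  # Saved: 90%
print(f"Average Parquet size per document: {after_mb / 1000 * 1024:.2f} KB")
```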
DONE!
You can now extract text from:
- PDF documents
- PowerPoint presentations
- Word documents
- Excel spreadsheets
- HTML pages
- Scanned documents (with OCR)
All extracted text is stored efficiently in Parquet format, for free on Hugging Face!