# 📦 INSTALLING DOCUMENT PROCESSING LIBRARIES
**Quick guide to installing the libraries needed to handle multiple document formats.**
---
## QUICK INSTALL
```bash
cd /home/developer/projects/open-navigator
source venv/bin/activate
# Install all document processing libraries
pip install PyPDF2 pdfplumber python-pptx python-docx openpyxl
# Optional: OCR for scanned documents (requires tesseract)
pip install pytesseract Pillow
```
---
## WHAT GETS INSTALLED
| Library | Purpose | Size |
|---------|---------|------|
| **PyPDF2** | Extract text from PDFs | ~500 KB |
| **pdfplumber** | Advanced PDF extraction (tables) | ~2 MB |
| **python-pptx** | Extract text from PowerPoint | ~500 KB |
| **python-docx** | Extract text from Word documents | ~300 KB |
| **openpyxl** | Extract text from Excel | ~2 MB |
| **pytesseract** | OCR for scanned documents (optional) | ~100 KB |
| **Pillow** | Image processing for OCR | ~3 MB |
**Total: ~8 MB** (very lightweight!)
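
The sizes in the table are approximate. A minimal sketch for measuring what a package actually occupies in the active environment, using only the standard library. The helper name `installed_size_bytes` is hypothetical, and the result covers a package's own files only, not its dependencies.

```python
import os
from importlib import metadata


def installed_size_bytes(dist_name):
    """Return the on-disk size of an installed distribution, or None if absent."""
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return None
    if files is None:  # the install has no recorded file list
        return None
    total = 0
    for f in files:
        path = f.locate()
        if os.path.isfile(path):
            total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    # Distribution names as used in the pip install command of this guide.
    for name in ["PyPDF2", "pdfplumber", "python-pptx", "python-docx", "openpyxl"]:
        size = installed_size_bytes(name)
        label = "not installed" if size is None else f"{size / 1_000_000:.1f} MB"
        print(f"{name}: {label}")
```

Run it inside the activated virtual environment so it inspects the right site-packages.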
---
## 🔧 OPTIONAL: OCR SUPPORT
**For scanned PDFs and images, install the Tesseract OCR engine:**
### Ubuntu/Debian:
```bash
sudo apt-get update
sudo apt-get install tesseract-ocr
```
### macOS:
```bash
brew install tesseract
```
### Windows:
Download the installer from: https://github.com/UB-Mannheim/tesseract/wiki
---
## ✅ VERIFY INSTALLATION
```bash
# Test all libraries
python -c "
import PyPDF2
import pdfplumber
from pptx import Presentation
from docx import Document
import openpyxl
print('✅ All document libraries installed!')
"
# Test OCR (optional)
python -c "
import pytesseract
from PIL import Image
print('✅ OCR libraries installed!')
print(f'Tesseract version: {pytesseract.get_tesseract_version()}')
"
```
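
The one-liners above stop at the first failed import. A minimal sketch that reports every missing package in one pass, assuming the import names used in the verification commands. The names `REQUIRED`, `OPTIONAL`, and `missing_packages` are illustrative, not part of the project.

```python
import importlib.util

# Import name -> pip package name, taken from the install commands in this guide.
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
    "pptx": "python-pptx",
    "docx": "python-docx",
    "openpyxl": "openpyxl",
}
OPTIONAL = {"pytesseract": "pytesseract", "PIL": "Pillow"}


def missing_packages(modules):
    """Return the pip names of all modules that cannot be located."""
    return [
        pip_name
        for module_name, pip_name in modules.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    for label, group in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_packages(group)
        if missing:
            print(f"Missing {label} packages: pip install {' '.join(missing)}")
        else:
            print(f"All {label} packages found")
```

`find_spec` only locates a module without importing it, so the check stays fast and does not fail on a broken optional dependency.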
---
## 🎯 TEST WITH A REAL DOCUMENT
```bash
# Test PDF extraction
python extraction/universal_extractor.py https://example.com/document.pdf
# Test PowerPoint extraction
python extraction/universal_extractor.py https://example.com/presentation.pptx
# Test Word extraction
python extraction/universal_extractor.py https://example.com/document.docx
```
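
The source of `extraction/universal_extractor.py` is not shown in this guide. As a rough illustration only, a script like it might pick a library from the file extension in the URL. The mapping and the function name `library_for` below are assumptions, not the project's actual code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to the library installed above.
LIBRARY_BY_SUFFIX = {
    ".pdf": "pdfplumber",  # PyPDF2 would also work for plain text
    ".pptx": "python-pptx",
    ".docx": "python-docx",
    ".xlsx": "openpyxl",
}


def library_for(url):
    """Return the library name for a document URL, or None if unsupported."""
    # Use only the path component so query strings do not hide the extension.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return LIBRARY_BY_SUFFIX.get(suffix)


if __name__ == "__main__":
    for url in (
        "https://example.com/document.pdf",
        "https://example.com/presentation.pptx",
        "https://example.com/document.docx",
    ):
        print(url, "->", library_for(url))
```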
---
## TROUBLESHOOTING
### "No module named 'PyPDF2'"
```bash
pip install PyPDF2
```
### "pytesseract is not installed"
```bash
# Install Python package
pip install pytesseract
# Install system package (Ubuntu)
sudo apt-get install tesseract-ocr
```
### "TesseractNotFoundError"
```bash
# On Ubuntu/Debian
sudo apt-get install tesseract-ocr
# On macOS
brew install tesseract
# On Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH after installation
```
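
`TesseractNotFoundError` usually means the `tesseract` executable is not on PATH, even when the Python package is installed. A small standard-library sketch to confirm that before reinstalling anything; the function name `find_tesseract` is illustrative.

```python
import shutil


def find_tesseract():
    """Return the full path of the tesseract executable, or None if not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; install it or add its folder to PATH")
    else:
        print(f"tesseract found at {path}")
```

If the binary is installed outside PATH (common on Windows), pytesseract can also be pointed at it directly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.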
### "Permission denied"
```bash
# Make sure you're in the virtual environment
source venv/bin/activate
# Then retry installation
pip install -r requirements.txt
```
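
To confirm that the virtual environment is actually active before retrying, a minimal check, assuming a standard `venv` setup like the one activated above. The function name `in_virtualenv` is illustrative.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run: source venv/bin/activate")
```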
---
## STORAGE IMPACT
**Even with all libraries installed:**
- Virtual environment size before these libraries: ~500 MB
- Libraries add: ~8 MB
- **Total: Still under 1 GB** ✅
**Processing impact:**
- Extract text from 1000 PDFs: ~50 MB local storage (temporary)
- Store in Parquet: ~5 MB (compressed)
- **Save 90% storage vs storing original files** ✅
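
The 90% figure follows from the two sizes in the list above (about 50 MB before, about 5 MB in Parquet). A quick sketch of the arithmetic, useful for plugging in your own measurements; the function name `percent_saved` is illustrative.

```python
def percent_saved(before_mb, after_mb):
    """Return the storage reduction as a percentage of the original size."""
    return (before_mb - after_mb) / before_mb * 100


if __name__ == "__main__":
    # Figures from this guide: ~50 MB before, ~5 MB stored as Parquet.
    print(f"{percent_saved(50, 5):.0f}% saved")
```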
---
## ✅ DONE!
**You can now extract text from:**
- ✅ PDF documents
- ✅ PowerPoint presentations
- ✅ Word documents
- ✅ Excel spreadsheets
- ✅ HTML pages
- ✅ Scanned documents (with OCR)
**All will be stored efficiently in Parquet format for FREE on Hugging Face!**