Yaz Hobooti
PDF Comparison Tool
A comprehensive web-based tool for comparing PDF documents with advanced features including OCR validation, color difference detection, spelling verification, and barcode/QR code detection.
Features
- PDF Validation: Ensures uploaded PDFs contain "50 Carroll" using OCR
- Color Difference Detection: Identifies visual differences between PDFs and highlights them with red boxes
- Spelling Verification: Checks text against both English and French dictionaries
- Barcode/QR Code Detection: Automatically detects and reads barcodes and QR codes
- Visual Comparison: Side-by-side comparison with annotated differences
- Modern Web Interface: Responsive design with Bootstrap and custom styling
Requirements
System Requirements
- Python 3.7 or higher
- macOS, Linux, or Windows
- Tesseract OCR engine (for text extraction)
Python Dependencies
All dependencies are listed in requirements.txt:
- Flask (web framework)
- PyPDF2 (PDF processing)
- pdf2image (PDF to image conversion)
- OpenCV (image processing)
- pytesseract (OCR)
- pyzbar (barcode detection)
- pyspellchecker (spelling verification)
- scikit-image (image comparison)
- Pillow (image manipulation)
Installation
1. Install Tesseract OCR
macOS:
brew install tesseract
Ubuntu/Debian:
sudo apt-get install tesseract-ocr
Windows: Download from Tesseract GitHub
2. Install Python Dependencies
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
3. Download Language Data (if needed)
The application will automatically download required NLTK data on first run.
Usage
1. Start the Application
python app.py
The application will start on http://localhost:5000
2. Upload PDFs
- Open your web browser and navigate to http://localhost:5000
- Select two PDF files for comparison
- Both PDFs must contain "50 Carroll" for validation
- Click "Compare PDFs" to start the analysis
3. View Results
The comparison results are displayed in three tabs:
- Visual Comparison: Side-by-side view with red boxes highlighting differences
- Spelling Issues: Table of spelling errors with suggestions from English and French dictionaries
- Barcodes & QR Codes: List of detected barcodes with their data and positions
File Structure
ProofCheck/
├── app.py                 # Main Flask application
├── pdf_comparator.py      # PDF comparison logic
├── requirements.txt       # Python dependencies
├── README.md              # This file
├── templates/
│   └── index.html         # Main web interface
├── static/
│   ├── css/
│   │   └── style.css      # Custom styles
│   ├── js/
│   │   └── script.js      # Frontend JavaScript
│   └── results/           # Generated comparison images
├── uploads/               # Temporary uploaded files
└── results/               # Comparison results JSON files
How It Works
1. PDF Validation
- Converts PDF pages to images using pdf2image
- Uses Tesseract OCR to extract text
- Validates presence of "50 Carroll" in extracted text
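The validation steps above can be sketched roughly as follows. This is a minimal illustration, not the code in pdf_comparator.py: the function names `contains_phrase` and `validate_pdf` are hypothetical, and the case-insensitive, whitespace-tolerant matching is an assumption made because OCR output often contains stray line breaks.

```python
import re


def contains_phrase(text: str, phrase: str = "50 Carroll") -> bool:
    """Case-insensitive match that tolerates OCR line breaks and repeated spaces."""
    def normalize(value: str) -> str:
        return re.sub(r"\s+", " ", value).strip().lower()

    return normalize(phrase) in normalize(text)


def validate_pdf(pdf_path: str, phrase: str = "50 Carroll", dpi: int = 200) -> bool:
    """Render each page, OCR it, and report whether any page contains the phrase."""
    # Imported lazily so contains_phrase() works without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    for page in convert_from_path(pdf_path, dpi=dpi):
        if contains_phrase(pytesseract.image_to_string(page), phrase):
            return True
    return False
```

Keeping the text check separate from the rendering step makes the matching rule easy to test without a PDF or a Tesseract installation.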
2. Color Difference Detection
- Converts PDF pages to images
- Resizes images to same dimensions
- Uses structural similarity index (SSIM) to detect differences
- Draws red rectangles around detected differences
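One way these steps might fit together is sketched below, assuming both pages are already loaded as BGR NumPy arrays (for example via OpenCV). The names `find_differences` and `keep_large_boxes`, the Otsu threshold, and the `min_area` filter are assumptions for illustration and may differ from what pdf_comparator.py does. The sketch also compares grayscale versions of the pages, so a hue change at equal brightness could be missed.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def keep_large_boxes(boxes: List[Box], min_area: int = 100) -> List[Box]:
    """Drop tiny regions, which tend to be rendering noise rather than real changes."""
    return [box for box in boxes if box[2] * box[3] >= min_area]


def find_differences(img_a, img_b, min_area: int = 100):
    """Compare two BGR images (NumPy arrays); return (annotated copy of img_b, boxes)."""
    # Imported lazily so keep_large_boxes() can be used without OpenCV installed.
    import cv2
    from skimage.metrics import structural_similarity

    # Resize the second image so both share the first image's dimensions.
    height, width = img_a.shape[:2]
    img_b = cv2.resize(img_b, (width, height))

    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)

    # full=True also returns the per-pixel similarity map (1.0 means identical).
    _, ssim_map = structural_similarity(gray_a, gray_b, full=True)
    dissimilarity = ((1.0 - ssim_map).clip(0.0, 1.0) * 255).astype("uint8")

    # Otsu picks the cut-off between "same" and "different" pixels automatically.
    _, mask = cv2.threshold(dissimilarity, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4 return shapes.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]

    boxes = keep_large_boxes([cv2.boundingRect(c) for c in contours], min_area)
    annotated = img_b.copy()
    for x, y, w, h in boxes:
        cv2.rectangle(annotated, (x, y), (x + w, y + h), (0, 0, 255), 2)  # red in BGR
    return annotated, boxes
```

Filtering by area is a simple way to suppress speckle from anti-aliasing; a larger `min_area` trades sensitivity for fewer false positives.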
3. Spelling Verification
- Extracts text using OCR
- Splits text into individual words
- Checks each word against English and French dictionaries
- Provides spelling suggestions for incorrect words
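A possible shape for the bilingual check is sketched below using pyspellchecker. The helper names `extract_words` and `find_misspellings` are hypothetical, and the rule that a word counts as misspelled only when both the English and the French dictionary reject it is an assumption about how the two dictionaries are combined.

```python
import re
from typing import Dict, List


def extract_words(text: str) -> List[str]:
    """Split text into lowercase alphabetic words, skipping digits and single letters."""
    # [^\W\d_] matches any Unicode letter, so accented French characters are kept.
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text) if len(w) > 1]


def find_misspellings(text: str) -> Dict[str, Dict[str, List[str]]]:
    """Map each word unknown to both dictionaries to suggestions per language."""
    # Imported lazily so extract_words() works without pyspellchecker installed.
    from spellchecker import SpellChecker

    english = SpellChecker(language="en")
    french = SpellChecker(language="fr")

    words = extract_words(text)
    # A word is flagged only if neither dictionary recognises it.
    unknown = english.unknown(words) & french.unknown(words)

    return {
        word: {
            # candidates() may return None when no suggestion exists.
            "en": sorted(english.candidates(word) or []),
            "fr": sorted(french.candidates(word) or []),
        }
        for word in sorted(unknown)
    }
```

OCR text tends to include fragments such as stray single letters, which is why the tokenizer drops one-character tokens before checking.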
4. Barcode/QR Code Detection
- Uses the pyzbar library to detect barcodes and QR codes
- Extracts data and position information
- Displays results in organized table format
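The detection step could look something like the sketch below. `pyzbar.pyzbar.decode` returns objects with `type`, `data` (bytes), and `rect` fields; the record layout and the names `barcode_to_record` and `detect_barcodes` are assumptions chosen for illustration, not the tool's actual output format.

```python
from typing import Any, Dict, List


def barcode_to_record(symbol: Any) -> Dict[str, Any]:
    """Convert one decoded symbol into a plain dict suitable for a results table."""
    rect = symbol.rect
    return {
        "type": symbol.type,
        # Payloads arrive as bytes; undecodable bytes are replaced, not raised.
        "data": symbol.data.decode("utf-8", errors="replace"),
        "position": {
            "x": rect.left,
            "y": rect.top,
            "width": rect.width,
            "height": rect.height,
        },
    }


def detect_barcodes(image: Any) -> List[Dict[str, Any]]:
    """Detect barcodes and QR codes in a PIL image or NumPy array."""
    # Imported lazily so barcode_to_record() works without the zbar library.
    from pyzbar.pyzbar import decode

    return [barcode_to_record(symbol) for symbol in decode(image)]
```

Returning plain dicts keeps the result easy to serialize to JSON for the results view.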
Configuration
Environment Variables
- FLASK_ENV: Set to development for debug mode
- MAX_CONTENT_LENGTH: Maximum file upload size (default: 16MB)
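As a rough illustration of how these two variables might be read at startup, the sketch below parses them into a settings dict. The function name `load_settings` is hypothetical, and it assumes MAX_CONTENT_LENGTH is given in bytes, which is the unit Flask's own `MAX_CONTENT_LENGTH` config key expects.

```python
import os
from typing import Any, Dict, Mapping, Optional

DEFAULT_MAX_UPLOAD_BYTES = 16 * 1024 * 1024  # 16 MB, the documented default


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Read the documented environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    return {
        "DEBUG": env.get("FLASK_ENV") == "development",
        "MAX_CONTENT_LENGTH": int(env.get("MAX_CONTENT_LENGTH", DEFAULT_MAX_UPLOAD_BYTES)),
    }
```

In a Flask app the result could be applied with `app.config["MAX_CONTENT_LENGTH"] = settings["MAX_CONTENT_LENGTH"]` and `app.run(debug=settings["DEBUG"])`.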
Customization
- Modify pdf_comparator.py to change comparison algorithms
- Update static/css/style.css for custom styling
- Edit templates/index.html for interface changes
Troubleshooting
Common Issues
Tesseract not found
- Ensure Tesseract is installed and in your system PATH
- On macOS, try:
brew install tesseract
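To confirm whether the Tesseract binary is visible on PATH from Python, a quick check along these lines may help. The helper name `tool_on_path` is hypothetical, and it assumes the executable is named `tesseract`, as it is in the standard packages.

```python
import shutil


def tool_on_path(name: str = "tesseract") -> bool:
    """Return True if an executable with this name can be found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    status = "found" if tool_on_path() else "NOT found"
    print(f"tesseract {status} on PATH")
```

If the binary is installed somewhere outside PATH, pytesseract also lets you point to it directly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.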
PDF processing errors
- Check that PDFs are not corrupted
- Ensure PDFs contain readable text (not just images)
Memory issues with large PDFs
- Reduce DPI in pdf_comparator.py (default: 200)
- Process PDFs page by page for very large documents
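Page-by-page processing can be sketched with pdf2image's `first_page` and `last_page` arguments, so that only a small batch of rendered pages is held in memory at once. The names `page_ranges` and `iter_pages` and the `batch_size` parameter are assumptions for illustration.

```python
from typing import Iterator, Tuple


def page_ranges(total_pages: int, batch_size: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield 1-based inclusive (first, last) page ranges covering the document."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def iter_pages(pdf_path: str, dpi: int = 200, batch_size: int = 1):
    """Yield rendered pages one batch at a time instead of loading the whole PDF."""
    # Imported lazily so page_ranges() works without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_ranges(total_pages, batch_size):
        yield from convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
```

Because `iter_pages` is a generator, each page image can be garbage-collected once the caller has finished with it.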
Spelling checker not working
- Ensure internet connection for first run (downloads dictionary data)
- Check that pyspellchecker is properly installed
Performance Tips
- Use smaller DPI values for faster processing
- Limit PDF page count for large documents
- Ensure sufficient RAM for image processing
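To see why DPI matters so much, the back-of-the-envelope estimate below computes the size of one rendered page in memory. It assumes a US Letter page (8.5 x 11 inches) stored as uncompressed 8-bit RGB; actual usage in the tool will be higher because intermediate copies (grayscale, resized, annotated) are also held.

```python
def page_memory_mb(
    dpi: int = 200,
    width_in: float = 8.5,
    height_in: float = 11.0,
    channels: int = 3,
) -> float:
    """Estimate the uncompressed size of one rendered page in megabytes (10^6 bytes)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * channels / 1_000_000


if __name__ == "__main__":
    for dpi in (200, 300, 600):
        print(f"{dpi} DPI: about {page_memory_mb(dpi):.1f} MB per page")
```

Memory grows with the square of the DPI, so tripling the resolution from 200 to 600 multiplies the per-page footprint by nine.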
Security Considerations
- Uploaded files are stored temporarily and cleaned up
- File size limits reduce the risk of denial-of-service from oversized uploads
- Input validation guards against malicious file uploads
- Session-based file handling ensures isolation
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
This project is open source and available under the MIT License.
Support
For issues and questions:
- Check the troubleshooting section
- Review the code comments
- Create an issue on the repository
Future Enhancements
- Support for more document formats
- Advanced text comparison algorithms
- Machine learning-based difference detection
- Batch processing capabilities
- Export functionality for comparison reports