Yaz Hobooti

Increase PDF resolution: DPI from 300 to 600, scaling factors improved for better OCR and barcode detection

PDF Comparison Tool

A web-based tool for comparing two PDF documents. It combines OCR-based validation, color difference detection, spelling verification, and barcode/QR code detection.

Features

PDF Validation: Uses OCR to confirm that each uploaded PDF contains the text "50 Carroll"
Color Difference Detection: Identifies visual differences between PDFs and highlights them with red boxes
Spelling Verification: Checks text against both English and French dictionaries
Barcode/QR Code Detection: Automatically detects and reads barcodes and QR codes
Visual Comparison: Side-by-side comparison with annotated differences
Modern Web Interface: Responsive design with Bootstrap and custom styling

Requirements

System Requirements

Python 3.7 or higher
macOS, Linux, or Windows
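Tesseract OCR engine (for text extraction)
Poppler (pdf2image calls its pdftoppm/pdftocairo tools to render PDF pages)
ZBar shared library (needed by pyzbar on macOS and Linux)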

Python Dependencies

All dependencies are listed in requirements.txt:

Flask (web framework)
PyPDF2 (PDF processing)
pdf2image (PDF to image conversion)
OpenCV (image processing)
pytesseract (OCR)
pyzbar (barcode detection)
pyspellchecker (spelling verification)
scikit-image (image comparison)
Pillow (image manipulation)

Installation

1. Install Tesseract OCR

macOS:

brew install tesseract

Ubuntu/Debian:

sudo apt-get install tesseract-ocr

Windows: Download from Tesseract GitHub

2. Install Python Dependencies

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

3. Download Language Data (if needed)

The application will automatically download required NLTK data on first run.
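
Before starting the application, it can help to confirm that the external command-line tools are reachable. The following is a minimal sketch, not part of ProofCheck: `missing_tools` is a hypothetical helper name, and the tool list assumes that pytesseract needs the `tesseract` binary and that pdf2image needs Poppler's `pdftoppm`.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # tesseract: used by pytesseract; pdftoppm: ships with Poppler, used by pdf2image.
    missing = missing_tools(["tesseract", "pdftoppm"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; normal use relies on `shutil.which`.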

Usage

1. Start the Application

python app.py

The application starts on http://localhost:5000.

2. Upload PDFs

Open your web browser and navigate to http://localhost:5000
Select two PDF files for comparison
Both PDFs must contain "50 Carroll" for validation
Click "Compare PDFs" to start the analysis

3. View Results

The comparison results are displayed in three tabs:

Visual Comparison: Side-by-side view with red boxes highlighting differences
Spelling Issues: Table of spelling errors with suggestions from English and French dictionaries
Barcodes & QR Codes: List of detected barcodes with their data and positions

File Structure

ProofCheck/
├── app.py                 # Main Flask application
├── pdf_comparator.py      # PDF comparison logic
├── requirements.txt       # Python dependencies
├── README.md             # This file
├── templates/
│   └── index.html        # Main web interface
├── static/
│   ├── css/
│   │   └── style.css     # Custom styles
│   ├── js/
│   │   └── script.js     # Frontend JavaScript
│   └── results/          # Generated comparison images
├── uploads/              # Temporary uploaded files
└── results/              # Comparison results JSON files

How It Works

1. PDF Validation

Converts PDF pages to images using pdf2image
Uses Tesseract OCR to extract text
Validates presence of "50 Carroll" in extracted text
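
The validation steps above could look roughly like the sketch below. It is an illustration under assumptions, not the code in pdf_comparator.py: the function names are hypothetical, and the case-insensitive, whitespace-tolerant match is a design choice made here because OCR output often breaks a phrase across lines.

```python
import re

REQUIRED_TEXT = "50 Carroll"  # phrase named in this README


def _normalize(text):
    """Collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_required_text(text, required=REQUIRED_TEXT):
    """Substring check that tolerates OCR line breaks and extra spaces."""
    return _normalize(required) in _normalize(text)


def ocr_pdf_text(pdf_path, dpi):
    """Render every page with pdf2image and OCR it with Tesseract."""
    # Imported lazily so the pure helpers above work without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def validate_pdf(pdf_path, dpi):
    """Hypothetical entry point: True if the OCR text contains the phrase."""
    return contains_required_text(ocr_pdf_text(pdf_path, dpi))
```

Because this is a plain substring match, text such as "150 Carroll" would also pass; a stricter check would need word boundaries.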

2. Color Difference Detection

Converts PDF pages to images
Resizes images to same dimensions
Uses structural similarity index (SSIM) to detect differences
Draws red rectangles around detected differences
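
A minimal sketch of these four steps, assuming OpenCV, NumPy, and scikit-image are installed. The function names, the `min_area` filter, and the Otsu threshold are choices made for this example rather than details confirmed by the project.

```python
def common_size(shape_a, shape_b):
    """Return the (height, width) both images are resized to: the smaller of each."""
    return (min(shape_a[0], shape_b[0]), min(shape_a[1], shape_b[1]))


def find_difference_boxes(img_a, img_b, min_area=100):
    """Return (annotated copy of img_a, list of (x, y, w, h) boxes).

    img_a and img_b are BGR uint8 arrays, as produced by cv2.imread.
    """
    import cv2
    import numpy as np
    from skimage.metrics import structural_similarity

    height, width = common_size(img_a.shape, img_b.shape)
    # cv2.resize expects (width, height), the reverse of ndarray.shape.
    img_a = cv2.resize(img_a, (width, height))
    img_b = cv2.resize(img_b, (width, height))

    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)

    # full=True also returns the per-pixel similarity map.
    _, ssim_map = structural_similarity(gray_a, gray_b, full=True)
    diff = (np.clip(ssim_map, 0, 1) * 255).astype("uint8")

    # Invert so that low-similarity pixels become white in the mask.
    _, mask = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]

    annotated = img_a.copy()
    boxes = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:
            continue  # skip specks that are likely rendering noise
        x, y, w, h = cv2.boundingRect(contour)
        boxes.append((x, y, w, h))
        cv2.rectangle(annotated, (x, y), (x + w, y + h), (0, 0, 255), 2)  # red in BGR
    return annotated, boxes
```

This sketch computes SSIM on grayscale images, so two regions that differ only in hue but share the same luminance may go undetected; comparing each color channel separately is one way to address that.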

3. Spelling Verification

Extracts text using OCR
Splits text into individual words
Checks each word against English and French dictionaries
Provides spelling suggestions for incorrect words
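
The word-splitting and dictionary steps can be sketched with pyspellchecker as follows. The function names are hypothetical, and the rule that a word is reported only when both dictionaries reject it is an assumption about how "English and French" checking is combined.

```python
import re


def extract_words(text, min_length=2):
    """Split text into lowercase words, keeping accented letters and dropping digits."""
    words = re.findall(r"[^\W\d_]+", text)
    return [word.lower() for word in words if len(word) >= min_length]


def find_misspellings(words):
    """Map each word unknown to both dictionaries to its suggestions per language."""
    # Imported lazily so extract_words works without pyspellchecker installed.
    from spellchecker import SpellChecker

    english = SpellChecker(language="en")
    french = SpellChecker(language="fr")

    # A word counts as misspelled only if neither dictionary recognizes it.
    unknown = english.unknown(words) & french.unknown(words)

    report = {}
    for word in sorted(unknown):
        report[word] = {
            # candidates() may return None when no suggestion exists.
            "en": sorted(english.candidates(word) or []),
            "fr": sorted(french.candidates(word) or []),
        }
    return report
```

A typical call would be `find_misspellings(extract_words(ocr_text))`, where `ocr_text` is the string returned by the OCR step.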

4. Barcode/QR Code Detection

Uses pyzbar library to detect barcodes and QR codes
Extracts data and position information
Displays results in organized table format
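
As a rough illustration of the detection step, pyzbar's `decode` returns one result per code, with `type`, `data` (bytes), and `rect` fields. The helper names and the output dictionary layout below are assumptions made for this sketch.

```python
def describe_barcode(kind, data, rect):
    """Convert one decoded result into a JSON-friendly dictionary."""
    left, top, width, height = rect
    return {
        "type": kind,
        # Payloads are bytes; replace undecodable bytes instead of raising.
        "data": data.decode("utf-8", errors="replace"),
        "position": {"x": left, "y": top, "width": width, "height": height},
    }


def detect_barcodes(image_path):
    """Decode every barcode and QR code found in the image at image_path."""
    # Imported lazily so describe_barcode works without pyzbar or Pillow.
    from PIL import Image
    from pyzbar.pyzbar import decode

    with Image.open(image_path) as image:
        results = decode(image)
    return [describe_barcode(r.type, r.data, r.rect) for r in results]
```

Positions are in pixels of the rendered page image, so they scale with the DPI used for rendering.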

Configuration

Environment Variables

FLASK_ENV: Set to development for debug mode (Flask 2.3 and later no longer read this variable; use FLASK_DEBUG=1 there)
MAX_CONTENT_LENGTH: Maximum file upload size (default: 16 MB)
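
One way the upload limit could be wired up is sketched below. It assumes the variable holds a size in bytes and that app.py copies it into Flask's `MAX_CONTENT_LENGTH` config key; neither detail is confirmed by this README, and `max_content_length` and `create_app` are hypothetical names.

```python
import os

DEFAULT_MAX_BYTES = 16 * 1024 * 1024  # 16 MB, the default stated above


def max_content_length(environ=None, default=DEFAULT_MAX_BYTES):
    """Read MAX_CONTENT_LENGTH (bytes) from the environment, falling back to default."""
    environ = os.environ if environ is None else environ
    raw = environ.get("MAX_CONTENT_LENGTH", "")
    try:
        value = int(raw)
    except ValueError:
        return default  # unset or not a number
    return value if value > 0 else default


def create_app():
    """Hypothetical application factory showing where the limit is applied."""
    from flask import Flask

    app = Flask(__name__)
    # Flask rejects larger request bodies with HTTP 413 (Request Entity Too Large).
    app.config["MAX_CONTENT_LENGTH"] = max_content_length()
    return app
```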

Customization

Modify pdf_comparator.py to change comparison algorithms
Update static/css/style.css for custom styling
Edit templates/index.html for interface changes

Troubleshooting

Common Issues

Tesseract not found
- Ensure Tesseract is installed and in your system PATH
- On macOS, try: brew install tesseract
PDF processing errors
- Check that PDFs are not corrupted
- Ensure PDFs contain readable text (not just images)
Memory issues with large PDFs
- Reduce the DPI in pdf_comparator.py (600 after the latest change, up from 300)
- Process PDFs page by page for very large documents
Spelling checker not working
- Ensure internet connection for first run (downloads dictionary data)
- Check that pyspellchecker is properly installed
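
For the memory issue above, processing page by page can be sketched with pdf2image's `first_page` and `last_page` arguments, so that only one batch of rendered pages is held in memory at a time. The helper names and the `batch_size` parameter are hypothetical.

```python
def page_ranges(page_count, batch_size=1):
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive."""
    for first in range(1, page_count + 1, batch_size):
        yield first, min(first + batch_size - 1, page_count)


def iter_pages(pdf_path, dpi, batch_size=1):
    """Yield rendered pages one at a time instead of loading the whole PDF."""
    # Imported lazily so page_ranges works without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    page_count = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_ranges(page_count, batch_size):
        batch = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for image in batch:
            yield image
```

Each image can then be passed through OCR, comparison, and barcode detection before the next batch is rendered.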

Performance Tips

Use smaller DPI values for faster processing
Limit PDF page count for large documents
Ensure sufficient RAM for image processing
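
To see why DPI dominates memory use, the short calculation below estimates the size of one uncompressed RGB page image. It assumes a US Letter page (8.5 x 11 inches) and 3 bytes per pixel; actual usage is higher once grayscale copies and difference maps are created.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def page_bytes(width_in, height_in, dpi, channels=3):
    """Bytes needed for one uncompressed 8-bit image of the page."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * channels


if __name__ == "__main__":
    for dpi in (200, 300, 600):
        width_px, height_px = page_pixels(8.5, 11, dpi)
        size_mib = page_bytes(8.5, 11, dpi) / (1024 * 1024)
        print(f"{dpi} DPI: {width_px} x {height_px} px, {size_mib:.1f} MiB")
```

Memory grows with the square of the DPI: a 600 DPI page (5100 x 6600 pixels, about 101 MB) needs nine times the memory of the same page at 200 DPI.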

Security Considerations

Uploaded files are stored temporarily and cleaned up
File size limits reduce the risk of denial-of-service through oversized uploads
Input validation rejects files that are not valid PDFs
Session-based file handling ensures isolation
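
The README does not describe how uploads are validated. As one possible approach, the sketch below checks the PDF signature and sanitizes the file name using only the standard library; `looks_like_pdf` and `safe_pdf_name` are hypothetical helpers, not functions from app.py.

```python
import os
import re

PDF_MAGIC = b"%PDF-"  # PDF files begin with this signature


def looks_like_pdf(header):
    """Return True if the first bytes of the upload carry the PDF signature."""
    return header.startswith(PDF_MAGIC)


def safe_pdf_name(filename):
    """Strip directories and unusual characters; return None unless it ends in .pdf."""
    # Treat backslashes as separators so Windows-style paths are stripped too.
    base = os.path.basename(filename.replace("\\", "/"))
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    if not base.lower().endswith(".pdf"):
        return None
    return base
```

A signature check is only a first filter: a file can start with `%PDF-` and still be malformed, so errors from the PDF renderer still need to be handled.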

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

License

This project is open source and available under the MIT License.

Support

For issues and questions:

Check the troubleshooting section
Review the code comments
Create an issue on the repository

Future Enhancements

Support for more document formats
Advanced text comparison algorithms
Machine learning-based difference detection
Batch processing capabilities
Export functionality for comparison reports