ProofCheck / README.md
Yaz Hobooti
Increase PDF resolution: DPI from 300 to 600, scaling factors improved for better OCR and barcode detection

PDF Comparison Tool

A comprehensive web-based tool for comparing PDF documents with advanced features including OCR validation, color difference detection, spelling verification, and barcode/QR code detection.

Features

  • PDF Validation: Ensures uploaded PDFs contain "50 Carroll" using OCR
  • Color Difference Detection: Identifies visual differences between PDFs and highlights them with red boxes
  • Spelling Verification: Checks text against both English and French dictionaries
  • Barcode/QR Code Detection: Automatically detects and reads barcodes and QR codes
  • Visual Comparison: Side-by-side comparison with annotated differences
  • Modern Web Interface: Responsive design with Bootstrap and custom styling

Requirements

System Requirements

  • Python 3.7 or higher
  • macOS, Linux, or Windows
  • Tesseract OCR engine (for text extraction)

Python Dependencies

All dependencies are listed in requirements.txt:

  • Flask (web framework)
  • PyPDF2 (PDF processing)
  • pdf2image (PDF to image conversion)
  • OpenCV (image processing)
  • pytesseract (OCR)
  • pyzbar (barcode detection)
  • pyspellchecker (spelling verification)
  • scikit-image (image comparison)
  • Pillow (image manipulation)

Installation

1. Install Tesseract OCR

macOS:

brew install tesseract

Ubuntu/Debian:

sudo apt-get install tesseract-ocr

Windows: Download from Tesseract GitHub
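
To confirm that the install worked, a quick check along these lines may help. It is only a sketch: it assumes the binary is named tesseract and is reachable through PATH, and find_tool is a hypothetical helper, not part of ProofCheck.

```python
import shutil
import subprocess


def find_tool(name="tesseract"):
    """Return the full path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("tesseract not found on PATH")
    else:
        # `tesseract --version` prints version information and exits.
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        lines = (result.stdout or result.stderr).splitlines()
        print(lines[0] if lines else path)
```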

2. Install Python Dependencies

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

3. Download Language Data (if needed)

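The application automatically downloads the NLTK data it needs on first run, so an internet connection is required at that point. NLTK does not appear in the dependency list above; make sure it is present in requirements.txt.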

Usage

1. Start the Application

python app.py

The application will start on http://localhost:5000

2. Upload PDFs

  1. Open your web browser and navigate to http://localhost:5000
  2. Select two PDF files for comparison
  3. Both PDFs must contain the text "50 Carroll" to pass validation
  4. Click "Compare PDFs" to start the analysis

3. View Results

The comparison results are displayed in three tabs:

  • Visual Comparison: Side-by-side view with red boxes highlighting differences
  • Spelling Issues: Table of spelling errors with suggestions from English and French dictionaries
  • Barcodes & QR Codes: List of detected barcodes with their data and positions

File Structure

ProofCheck/
├── app.py                 # Main Flask application
├── pdf_comparator.py      # PDF comparison logic
├── requirements.txt       # Python dependencies
├── README.md              # This file
├── templates/
│   └── index.html         # Main web interface
├── static/
│   ├── css/
│   │   └── style.css      # Custom styles
│   ├── js/
│   │   └── script.js      # Frontend JavaScript
│   └── results/           # Generated comparison images
├── uploads/               # Temporary uploaded files
└── results/               # Comparison results JSON files

How It Works

1. PDF Validation

  • Converts PDF pages to images using pdf2image
  • Uses Tesseract OCR to extract text
  • Validates presence of "50 Carroll" in extracted text
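
The three steps above can be sketched as follows. This is an illustration, not the code in pdf_comparator.py: the function names, the 300 DPI default, and the case-insensitive, whitespace-tolerant matching are all assumptions.

```python
import re

REQUIRED_MARKER = "50 Carroll"


def contains_marker(text, marker=REQUIRED_MARKER):
    """Case-insensitive search that tolerates OCR line breaks and repeated spaces."""
    def normalize(value):
        return re.sub(r"\s+", " ", value).strip().lower()

    return normalize(marker) in normalize(text)


def validate_pdf(pdf_path, dpi=300):
    """Return True as soon as any page's OCR text contains the marker."""
    # Imported lazily so contains_marker() works without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    for page in convert_from_path(pdf_path, dpi=dpi):
        if contains_marker(pytesseract.image_to_string(page)):
            return True
    return False
```

Normalizing whitespace matters because OCR output often breaks an address across lines, which an exact substring test would miss.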

2. Color Difference Detection

  • Converts PDF pages to images
  • Resizes images to same dimensions
  • Uses structural similarity index (SSIM) to detect differences
  • Draws red rectangles around detected differences
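
A minimal sketch of this pipeline with OpenCV and scikit-image is shown below. The names common_size and diff_boxes, the min_area filter, the Otsu threshold, and the conversion to grayscale are assumptions made for the example; the project's own implementation may differ.

```python
def common_size(shape_a, shape_b):
    """Return (width, height) that fits inside both images, given NumPy-style shapes."""
    height = min(shape_a[0], shape_b[0])
    width = min(shape_a[1], shape_b[1])
    return width, height


def diff_boxes(img_a, img_b, min_area=100):
    """Return (ssim_score, boxes, annotated copy of img_b) for two BGR images."""
    # Imported lazily so common_size() can be used without the imaging stack.
    import cv2
    import numpy as np
    from skimage.metrics import structural_similarity

    size = common_size(img_a.shape, img_b.shape)
    a = cv2.resize(img_a, size)
    b = cv2.resize(img_b, size)
    gray_a = cv2.cvtColor(a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(b, cv2.COLOR_BGR2GRAY)

    # full=True also returns the per-pixel similarity map (1.0 means identical).
    score, similarity = structural_similarity(gray_a, gray_b, full=True)
    dissimilarity = (np.clip(1.0 - similarity, 0.0, 1.0) * 255).astype("uint8")

    # Otsu's method picks the threshold that separates changed from unchanged pixels.
    _, mask = cv2.threshold(dissimilarity, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Indexing with [-2] works with both the OpenCV 3 and OpenCV 4 return shapes.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]

    boxes = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:
            continue  # skip specks that are most likely rendering noise
        x, y, w, h = cv2.boundingRect(contour)
        boxes.append((x, y, w, h))
        cv2.rectangle(b, (x, y), (x + w, y + h), (0, 0, 255), 2)  # red in BGR
    return score, boxes, b
```

Because this sketch compares grayscale images, it can miss a change that alters hue while leaving brightness roughly the same; comparing each color channel separately would be one way to address that.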

3. Spelling Verification

  • Extracts text using OCR
  • Splits text into individual words
  • Checks each word against English and French dictionaries
  • Provides spelling suggestions for incorrect words
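
One way to express these steps with pyspellchecker is sketched below. The helper names, the tokenizing rule (letters only, at least two characters), and the decision to flag a word only when both dictionaries reject it are assumptions for illustration.

```python
import re

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, including accented ones


def tokenize(text, min_length=2):
    """Split OCR text into lowercase words, dropping digits and very short tokens."""
    return [w.lower() for w in WORD_RE.findall(text) if len(w) >= min_length]


def unknown_in_all(words, dictionaries):
    """Keep words that none of the dictionaries recognize, in first-seen order."""
    seen, result = set(), []
    for word in words:
        if word in seen:
            continue
        seen.add(word)
        if not any(word in dictionary for dictionary in dictionaries):
            result.append(word)
    return result


def check_spelling(text):
    """Map each suspect word to a sorted list of English and French suggestions."""
    from spellchecker import SpellChecker  # provided by the pyspellchecker package

    checkers = [SpellChecker(language="en"), SpellChecker(language="fr")]
    report = {}
    for word in unknown_in_all(tokenize(text), checkers):
        suggestions = set()
        for checker in checkers:
            # candidates() may return None when nothing close enough is found.
            suggestions |= checker.candidates(word) or set()
        report[word] = sorted(suggestions)
    return report
```

Flagging a word only when it is missing from both dictionaries keeps bilingual documents from producing a false positive for every French word in an English sentence, and vice versa.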

4. Barcode/QR Code Detection

  • Uses pyzbar library to detect barcodes and QR codes
  • Extracts data and position information
  • Displays results in organized table format
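
As a rough illustration, the detection and the conversion into table rows could look like this. pyzbar's decode returns objects with data, type, and rect fields; the row layout and the function names are assumptions, not the project's actual output format.

```python
def barcode_rows(decoded):
    """Convert pyzbar-style results into plain dicts suitable for a results table."""
    rows = []
    for item in decoded:
        rect = item.rect
        rows.append({
            "type": item.type,
            # The payload arrives as bytes; replace undecodable bytes rather than fail.
            "data": item.data.decode("utf-8", errors="replace"),
            "x": rect.left,
            "y": rect.top,
            "width": rect.width,
            "height": rect.height,
        })
    return rows


def detect_barcodes(image):
    """Decode every barcode and QR code in a PIL image or NumPy array."""
    # Imported lazily so barcode_rows() can be used without zbar installed.
    from pyzbar.pyzbar import decode

    return barcode_rows(decode(image))
```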

Configuration

Environment Variables

  • FLASK_ENV: Set to development for debug mode (Flask 2.3 and later no longer read FLASK_ENV; use FLASK_DEBUG=1 there)
  • MAX_CONTENT_LENGTH: Maximum file upload size (default: 16MB)
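
How app.py reads these variables is not shown here. The sketch below is one plausible approach, assuming the value is given in bytes; max_upload_bytes and configure are hypothetical names.

```python
import os

DEFAULT_MAX_UPLOAD = 16 * 1024 * 1024  # 16 MB, the default stated above


def max_upload_bytes(environ=os.environ):
    """Read MAX_CONTENT_LENGTH (assumed to be in bytes), falling back to the default."""
    raw = environ.get("MAX_CONTENT_LENGTH", "").strip()
    if not raw:
        return DEFAULT_MAX_UPLOAD
    value = int(raw)
    if value <= 0:
        raise ValueError("MAX_CONTENT_LENGTH must be a positive number of bytes")
    return value


def configure(app, environ=os.environ):
    """Apply the limit to a Flask app; Flask answers larger requests with HTTP 413."""
    app.config["MAX_CONTENT_LENGTH"] = max_upload_bytes(environ)
    return app
```

Passing the environment as a parameter makes the parsing easy to test without touching the real process environment.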

Customization

  • Modify pdf_comparator.py to change comparison algorithms
  • Update static/css/style.css for custom styling
  • Edit templates/index.html for interface changes

Troubleshooting

Common Issues

  1. Tesseract not found

    • Ensure Tesseract is installed and in your system PATH
    • On macOS, try: brew install tesseract
  2. PDF processing errors

    • Check that PDFs are not corrupted
    • Ensure PDFs contain readable text (not just images)
  3. Memory issues with large PDFs

    • Reduce the rendering DPI in pdf_comparator.py (raised from 300 to 600 in the latest commit)
    • Process PDFs page by page for very large documents
  4. Spelling checker not working

    • Ensure internet connection for first run (downloads dictionary data)
    • Check that pyspellchecker is properly installed
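
For the memory issue in particular, page-by-page processing could be sketched with pdf2image's first_page and last_page arguments, as below. The helper names, the 150 DPI default, and the batch size are assumptions; this is not the code in pdf_comparator.py.

```python
def page_batches(total_pages, batch_size=1):
    """Yield (first, last) page numbers, 1-indexed and inclusive, covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def iter_pages(pdf_path, dpi=150, batch_size=1):
    """Yield (page_number, image) pairs while holding only one batch in memory."""
    # Imported lazily so page_batches() works without poppler installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total, batch_size):
        images = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for offset, image in enumerate(images):
            yield first + offset, image
```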

Performance Tips

  • Use smaller DPI values for faster processing
  • Limit PDF page count for large documents
  • Ensure sufficient RAM for image processing
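
To see why DPI has such a large effect, the size of one rendered page can be estimated from its physical dimensions. The sketch below assumes uncompressed 8-bit RGB images and a US Letter page; the function names are made up for the example.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def page_bytes(width_in, height_in, dpi, channels=3):
    """Approximate size of one uncompressed 8-bit image of the page, in bytes."""
    width, height = page_pixels(width_in, height_in, dpi)
    return width * height * channels


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        megabytes = page_bytes(8.5, 11, dpi) / 1e6
        print(f"{dpi} DPI: about {megabytes:.1f} MB per page")
```

Doubling the DPI quadruples the pixel count, so a Letter page that needs about 25 MB at 300 DPI needs about 101 MB at 600 DPI, before any working copies made during comparison.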

Security Considerations

  • Uploaded files are stored temporarily and cleaned up
  • File size limits reduce the risk of denial-of-service through oversized uploads
  • Input validation helps reject files that are not valid PDFs
  • Session-based file handling ensures isolation
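
As an illustration of the input validation point, an upload check might combine the file extension with the PDF header bytes. This is a sketch under assumptions: looks_like_pdf and check_upload are hypothetical, and the actual checks in app.py may be different.

```python
import os

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(filename, head):
    """True if the name ends in .pdf and the PDF header appears in the first 1024 bytes."""
    has_extension = os.path.splitext(filename)[1].lower() == ".pdf"
    return has_extension and PDF_MAGIC in head[:1024]


def check_upload(file_storage):
    """Validate a Werkzeug FileStorage and return a sanitized filename, or None."""
    # Imported lazily so looks_like_pdf() works without Flask/Werkzeug installed.
    from werkzeug.utils import secure_filename

    head = file_storage.stream.read(1024)
    file_storage.stream.seek(0)  # rewind so the file can still be saved afterwards
    if not looks_like_pdf(file_storage.filename or "", head):
        return None
    return secure_filename(file_storage.filename)
```

A header check like this only filters out obviously wrong files; it does not prove that the rest of the file is a well-formed or harmless PDF.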

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is open source and available under the MIT License.

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the code comments
  3. Create an issue on the repository

Future Enhancements

  • Support for more document formats
  • Advanced text comparison algorithms
  • Machine learning-based difference detection
  • Batch processing capabilities
  • Export functionality for comparison reports