Commit c328680 by Yaz Hobooti · 1 parent: fa64916

Add .gitignore and Hugging Face model card

Files changed (2)
  1. .gitignore +26 -0
  2. README.md +22 -194
.gitignore ADDED
@@ -0,0 +1,26 @@
+ # macOS
+ .DS_Store
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *.egg-info/
+ *.egg
+ *.pyo
+
+ # Environments
+ .venv/
+ venv/
+ env/
+
+ # Editor/IDE
+ .vscode/
+ .idea/
+
+ # Local artifacts
+ ProofCheck/.gradio/
+ *.log
+
+ # OS/Temp
+ Thumbs.db
+ *.tmp
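Review note on the patterns above: the character class in `*.py[cod]` already covers `.pyc`, `.pyo`, and `.pyd`, so the separate `*.pyo` entry appears to be redundant (harmless, but removable). The sketch below checks this with Python's `fnmatch` as a stand-in for Git's matcher. That substitution is an assumption: the two agree for simple basename globs like these, but `fnmatch` does not model gitignore rules such as trailing slashes, `**`, or negation. `matching_patterns` is a hypothetical helper written for this illustration.

```python
from fnmatch import fnmatchcase

# Basename globs copied from the .gitignore added in this commit.
# Directory patterns (those ending in "/") are left out because
# fnmatch does not model gitignore's directory semantics.
PATTERNS = ["*.py[cod]", "*.egg", "*.pyo", "*.log", "*.tmp"]


def matching_patterns(filename):
    """Return every pattern in PATTERNS that matches the given basename."""
    return [p for p in PATTERNS if fnmatchcase(filename, p)]


for name in ["module.pyc", "module.pyo", "module.pyd", "module.py", "debug.log"]:
    print(f"{name}: {matching_patterns(name)}")
```

In this sketch `module.pyo` is matched by both `*.py[cod]` and `*.pyo`, while `module.py` is matched by neither, so source files stay tracked.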
README.md CHANGED
@@ -1,203 +1,31 @@
- # PDF Comparison Tool
-
- A comprehensive web-based tool for comparing PDF documents with advanced features including OCR validation, color difference detection, spelling verification, and barcode/QR code detection.
+ ---
+ title: ProofCheck
+ license: apache-2.0
+ tags:
+ - document-processing
+ - pdf
+ - ocr
+ - comparator
+ task_categories:
+ - other
+ pretty_name: ProofCheck
+ ---
+
+ # ProofCheck
+
+ ProofCheck is a PDF comparison and validation tool that enhances OCR and barcode detection.
 
  ## Features
-
- - **PDF Validation**: Ensures uploaded PDFs contain "50 Carroll" using OCR
- - **Color Difference Detection**: Identifies visual differences between PDFs and highlights them with red boxes
- - **Spelling Verification**: Checks text against both English and French dictionaries
- - **Barcode/QR Code Detection**: Automatically detects and reads barcodes and QR codes
- - **Visual Comparison**: Side-by-side comparison with annotated differences
- - **Modern Web Interface**: Responsive design with Bootstrap and custom styling
-
- ## Requirements
-
- ### System Requirements
- - Python 3.7 or higher
- - macOS, Linux, or Windows
- - Tesseract OCR engine (for text extraction)
-
- ### Python Dependencies
- All dependencies are listed in `requirements.txt`:
- - Flask (web framework)
- - PyPDF2 (PDF processing)
- - pdf2image (PDF to image conversion)
- - OpenCV (image processing)
- - pytesseract (OCR)
- - pyzbar (barcode detection)
- - pyspellchecker (spelling verification)
- - scikit-image (image comparison)
- - Pillow (image manipulation)
-
- ## Installation
-
- ### 1. Install Tesseract OCR
-
- **macOS:**
- ```bash
- brew install tesseract
- ```
-
- **Ubuntu/Debian:**
- ```bash
- sudo apt-get install tesseract-ocr
- ```
-
- **Windows:**
- Download from [Tesseract GitHub](https://github.com/UB-Mannheim/tesseract/wiki)
-
- ### 2. Install Python Dependencies
-
- ```bash
- # Create virtual environment (recommended)
- python -m venv venv
- source venv/bin/activate # On Windows: venv\Scripts\activate
-
- # Install dependencies
- pip install -r requirements.txt
- ```
-
- ### 3. Download Language Data (if needed)
-
- The application will automatically download required NLTK data on first run.
+ - High-DPI PDF rendering (600 DPI) for improved OCR and barcode recognition
+ - Rule-based text and layout comparison
+ - Export of comparison results
 
  ## Usage
-
- ### 1. Start the Application
+ Run locally:
 
  ```bash
- python app.py
+ python run.py
  ```
 
- The application will start on `http://localhost:5000`
-
- ### 2. Upload PDFs
-
- 1. Open your web browser and navigate to `http://localhost:5000`
- 2. Select two PDF files for comparison
- 3. Both PDFs must contain "50 Carroll" for validation
- 4. Click "Compare PDFs" to start the analysis
-
- ### 3. View Results
-
- The comparison results are displayed in three tabs:
-
- - **Visual Comparison**: Side-by-side view with red boxes highlighting differences
- - **Spelling Issues**: Table of spelling errors with suggestions from English and French dictionaries
- - **Barcodes & QR Codes**: List of detected barcodes with their data and positions
-
- ## File Structure
-
- ```
- ProofCheck/
- ├── app.py # Main Flask application
- ├── pdf_comparator.py # PDF comparison logic
- ├── requirements.txt # Python dependencies
- ├── README.md # This file
- ├── templates/
- │   └── index.html # Main web interface
- ├── static/
- │   ├── css/
- │   │   └── style.css # Custom styles
- │   ├── js/
- │   │   └── script.js # Frontend JavaScript
- │   └── results/ # Generated comparison images
- ├── uploads/ # Temporary uploaded files
- └── results/ # Comparison results JSON files
- ```
-
- ## How It Works
-
- ### 1. PDF Validation
- - Converts PDF pages to images using `pdf2image`
- - Uses Tesseract OCR to extract text
- - Validates presence of "50 Carroll" in extracted text
-
- ### 2. Color Difference Detection
- - Converts PDF pages to images
- - Resizes images to same dimensions
- - Uses structural similarity index (SSIM) to detect differences
- - Draws red rectangles around detected differences
-
- ### 3. Spelling Verification
- - Extracts text using OCR
- - Splits text into individual words
- - Checks each word against English and French dictionaries
- - Provides spelling suggestions for incorrect words
-
- ### 4. Barcode/QR Code Detection
- - Uses `pyzbar` library to detect barcodes and QR codes
- - Extracts data and position information
- - Displays results in organized table format
-
- ## Configuration
-
- ### Environment Variables
- - `FLASK_ENV`: Set to `development` for debug mode
- - `MAX_CONTENT_LENGTH`: Maximum file upload size (default: 16MB)
-
- ### Customization
- - Modify `pdf_comparator.py` to change comparison algorithms
- - Update `static/css/style.css` for custom styling
- - Edit `templates/index.html` for interface changes
-
- ## Troubleshooting
-
- ### Common Issues
-
- 1. **Tesseract not found**
- - Ensure Tesseract is installed and in your system PATH
- - On macOS, try: `brew install tesseract`
-
- 2. **PDF processing errors**
- - Check that PDFs are not corrupted
- - Ensure PDFs contain readable text (not just images)
-
- 3. **Memory issues with large PDFs**
- - Reduce DPI in `pdf_comparator.py` (default: 200)
- - Process PDFs page by page for very large documents
-
- 4. **Spelling checker not working**
- - Ensure internet connection for first run (downloads dictionary data)
- - Check that `pyspellchecker` is properly installed
-
- ### Performance Tips
-
- - Use smaller DPI values for faster processing
- - Limit PDF page count for large documents
- - Ensure sufficient RAM for image processing
-
- ## Security Considerations
-
- - Uploaded files are stored temporarily and cleaned up
- - File size limits prevent DoS attacks
- - Input validation prevents malicious file uploads
- - Session-based file handling ensures isolation
-
- ## Contributing
-
- 1. Fork the repository
- 2. Create a feature branch
- 3. Make your changes
- 4. Add tests if applicable
- 5. Submit a pull request
-
  ## License
-
- This project is open source and available under the MIT License.
-
- ## Support
-
- For issues and questions:
- 1. Check the troubleshooting section
- 2. Review the code comments
- 3. Create an issue on the repository
-
- ## Future Enhancements
-
- - Support for more document formats
- - Advanced text comparison algorithms
- - Machine learning-based difference detection
- - Batch processing capabilities
- - Export functionality for comparison reports
+ Apache-2.0
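
Review note on the DPI change: the new Features list cites 600 DPI rendering, while the removed README gave 200 as the default and suggested lowering DPI when memory is tight. The sketch below estimates what that difference could mean per rendered page. It assumes US Letter pages (8.5 x 11 in) and uncompressed 8-bit RGB rasters; both are illustrative assumptions, since the commit does not state page sizes or pixel formats. `raster_size` is a hypothetical helper, not part of ProofCheck.

```python
def raster_size(width_in, height_in, dpi, channels=3):
    """Return (width_px, height_px, size_bytes) for an uncompressed 8-bit raster."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * channels


# Assumed page size: US Letter, 8.5 x 11 inches.
for dpi in (200, 600):
    w, h, n = raster_size(8.5, 11.0, dpi)
    print(f"{dpi} DPI: {w} x {h} px, {n / 1e6:.1f} MB per page")
```

Under these assumptions, tripling the DPI multiplies the pixel count, and therefore the uncompressed memory per page, by nine (roughly 11 MB at 200 DPI versus 101 MB at 600 DPI).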