| title: OCR Time Capsule | |
| emoji: π¦ | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: static | |
| pinned: false | |
| # OCR Time Capsule | |
| A fast, modern web interface for exploring and comparing OCR text improvements in HuggingFace datasets. Browse through pre-processed OCR improvements to see how AI models enhance historical document transcriptions. | |
|  | |
| ## Features | |
| - **Fast Navigation**: Browse through large OCR datasets with keyboard shortcuts (J/K or arrow keys) | |
| - **Side-by-Side Comparison**: View original OCR and improved text simultaneously | |
| - **Advanced Diff Visualization**: Character, word, or line-level differences with color highlighting | |
| - **No Backend Required**: Direct integration with the HuggingFace Dataset Viewer API | |
| - **Responsive Design**: Works seamlessly on desktop and mobile devices | |
| - **Dark Mode**: Easy on the eyes for extended reading sessions | |
| - **URL Sharing**: Share specific dataset samples with direct links | |
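The word-level mode of the diff view can be illustrated with a longest-common-subsequence comparison. This is only a sketch: this README does not say which diff algorithm or library the app uses, and `diffWords` is a hypothetical name.

```javascript
// Hypothetical helper: word-level diff via a longest-common-subsequence table.
// Illustrative only; not taken from the app's source.
function diffWords(original, improved) {
  const a = original.split(/\s+/).filter(Boolean);
  const b = improved.split(/\s+/).filter(Boolean);

  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table and emit equal / removed / added tokens in reading order.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: 'equal', text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: 'removed', text: a[i] });
      i++;
    } else {
      ops.push({ type: 'added', text: b[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ type: 'removed', text: a[i++] });
  while (j < b.length) ops.push({ type: 'added', text: b[j++] });
  return ops;
}
```

Each returned token can then be rendered with a color per `type`. Character-level or line-level modes could reuse the same routine by splitting on characters or newlines instead of whitespace.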
| ## Quick Start | |
| ### Option 1: Local Development | |
| 1. Clone or download this directory | |
| 2. Serve the files using any static web server: | |
| ```bash | |
| # Using Python | |
| python -m http.server 8000 | |
| # Using Node.js | |
| npx serve . | |
| # Using PHP | |
| php -S localhost:8000 | |
| ``` | |
| 3. Open http://localhost:8000 in your browser | |
| ### Option 2: GitHub Pages | |
| 1. Push this directory to a GitHub repository | |
| 2. Enable GitHub Pages in repository settings | |
| 3. Access via `https://[username].github.io/[repo-name]/` | |
| ### Option 3: Direct File Access | |
| Open `index.html` directly in a modern web browser. Note: Some features may be limited because browsers apply stricter CORS rules to pages loaded from `file://`. | |
| ## Usage | |
| ### Loading a Dataset | |
| 1. Enter a HuggingFace dataset ID (e.g., `davanstrien/exams-ocr`) | |
| 2. Click "Load" or press Enter | |
| 3. The explorer will automatically detect text columns | |
| ### Navigation | |
| - **Next**: Press `J` or the `→` arrow key | |
| - **Previous**: Press `K` or the `←` arrow key | |
| - **Switch Views**: Press `1` (comparison), `2` (diff), or `3` (improved only) | |
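The shortcuts above could be wired to actions roughly as follows. This is a minimal sketch, not the code in `js/app.js`: `KEY_ACTIONS`, `actionForKey`, and the action names are hypothetical, and mapping the arrow shortcuts to `ArrowRight`/`ArrowLeft` is an assumption.

```javascript
// Hypothetical mapping from KeyboardEvent.key values to explorer actions.
// Action names and arrow-key directions are assumptions for illustration.
const KEY_ACTIONS = new Map([
  ['j', 'next'],
  ['ArrowRight', 'next'],
  ['k', 'previous'],
  ['ArrowLeft', 'previous'],
  ['1', 'view:comparison'],
  ['2', 'view:diff'],
  ['3', 'view:improved'],
]);

function actionForKey(key) {
  // Single characters are matched case-insensitively; named keys as-is.
  const normalized = key.length === 1 ? key.toLowerCase() : key;
  return KEY_ACTIONS.get(normalized) ?? null;
}

// In a browser, the lookup could be attached like this:
// document.addEventListener('keydown', (event) => {
//   if (event.target.matches('input, textarea')) return; // do not hijack typing
//   const action = actionForKey(event.key);
//   if (action) {
//     event.preventDefault();
//     // dispatch `action` to the UI here
//   }
// });
```

Keeping the lookup separate from the event listener makes it possible to check the mapping without a browser.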
| ### Supported Column Names | |
| The explorer automatically detects these column patterns: | |
| **Original OCR**: `text`, `ocr`, `original_text`, `ground_truth` | |
| **Improved OCR**: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr` | |
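As a rough illustration of the detection step, the sketch below picks the first listed pattern that appears among a dataset's column names. The pattern lists come from this README; the function name `detectTextColumns`, the exact-match rule, and the list-order priority are assumptions and may differ from the real `detectColumns` method.

```javascript
// Column names listed in this README.
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// Hypothetical sketch; the real `detectColumns` in js/dataset-api.js may differ.
function detectTextColumns(columnNames) {
  // Pick the first pattern (in list order) that is present among the columns.
  const pick = (patterns) =>
    patterns.find((pattern) => columnNames.includes(pattern)) ?? null;
  return {
    originalTextColumn: pick(ORIGINAL_PATTERNS),
    improvedTextColumn: pick(IMPROVED_PATTERNS),
  };
}
```

A dataset with columns `image`, `text`, and `markdown` would resolve to `text` as the original column and `markdown` as the improved one; a missing column resolves to `null`.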
| ## Technical Details | |
| ### Architecture | |
| ``` | |
| ┌─────────────────┐     ┌──────────────────────┐ | |
| │   Browser UI    │────▶│ HF Dataset Viewer API│ | |
| │   (Alpine.js)   │     │  (datasets-server)   │ | |
| └─────────────────┘     └──────────────────────┘ | |
|          │ | |
|          ▼ | |
| ┌─────────────────┐ | |
| │   Local Cache   │ | |
| │  (JavaScript)   │ | |
| └─────────────────┘ | |
| ``` | |
| ### API Integration | |
| Uses the HuggingFace Dataset Viewer API: | |
| - Base URL: `https://datasets-server.huggingface.co` | |
| - No authentication required for public datasets | |
| - Automatic handling of image URL expiration | |
| - Smart batching for efficient data loading | |
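A request against the `/rows` endpoint of the Dataset Viewer API might look like the sketch below. The helper names `buildRowsUrl` and `fetchRows` are hypothetical, and the `default` config and `train` split are placeholder assumptions; real values should come from the API's `/splits` endpoint for the dataset in question.

```javascript
const BASE_URL = 'https://datasets-server.huggingface.co';

// Build a /rows request URL. `config` and `split` defaults are placeholders.
function buildRowsUrl(
  dataset,
  { config = 'default', split = 'train', offset = 0, length = 100 } = {}
) {
  const params = new URLSearchParams({
    dataset,
    config,
    split,
    offset: String(offset),
    length: String(Math.min(length, 100)), // the API caps requests at 100 rows
  });
  return `${BASE_URL}/rows?${params}`;
}

// Hypothetical fetch wrapper; no auth header is sent (public datasets only).
async function fetchRows(dataset, options) {
  const response = await fetch(buildRowsUrl(dataset, options));
  if (!response.ok) {
    throw new Error(`Dataset Viewer API returned ${response.status}`);
  }
  return response.json(); // the payload includes `features` and `rows`
}
```

For example, `fetchRows('davanstrien/exams-ocr', { offset: 100 })` would request the second batch of 100 rows, assuming that dataset exposes a `default` config with a `train` split.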
| ### Performance Optimizations | |
| - **Batch Loading**: Fetches 100 rows at a time | |
| - **Smart Caching**: Reduces API calls | |
| - **Lazy Loading**: Only loads visible content | |
| - **URL Refresh**: Automatically refreshes expired image URLs | |
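Batch loading and caching can be combined by keying a cache on the batch offset, as in the sketch below. It is a simplified illustration, not the app's actual cache: `createRowCache` and `batchOffsetFor` are hypothetical names, and `fetchBatch` is a caller-supplied function assumed to return a Promise of an array of rows.

```javascript
const BATCH_SIZE = 100; // rows per request, as stated above

// Offset of the batch that contains a given row index.
function batchOffsetFor(index) {
  return Math.floor(index / BATCH_SIZE) * BATCH_SIZE;
}

// Hypothetical in-memory cache keyed by batch offset.
function createRowCache(fetchBatch) {
  const batches = new Map();
  return async function getRow(index) {
    const offset = batchOffsetFor(index);
    if (!batches.has(offset)) {
      // Cache the promise so concurrent callers share a single request.
      const pending = Promise.resolve(fetchBatch(offset, BATCH_SIZE));
      // Drop failed requests so a later call can retry.
      pending.catch(() => batches.delete(offset));
      batches.set(offset, pending);
    }
    const rows = await batches.get(offset);
    return rows[index - offset];
  };
}
```

With this shape, stepping through rows 0 to 99 triggers one request, and the next request only happens when row 100 is reached.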
| ## Customization | |
| ### Adding New Column Patterns | |
| Edit `js/dataset-api.js` and update the `detectColumns` method: | |
| ```javascript | |
| // Treat 'your_column_name' as an original-OCR column | |
| if (!originalTextColumn && ['your_column_name'].includes(name)) { | |
|   originalTextColumn = name; | |
| } | |
| ``` | |
| ### Styling | |
| The UI uses Tailwind CSS. Modify styles in: | |
| - `css/styles.css` for custom styles | |
| - Tailwind classes directly in `index.html` | |
| ### Keyboard Shortcuts | |
| Add new shortcuts in `js/app.js`: | |
| ```javascript | |
| case 'your_key': | |
|   // Your action here | |
|   break; | |
| ``` | |
| ## Browser Support | |
| - Chrome/Edge: Full support | |
| - Firefox: Full support | |
| - Safari: Full support (14+) | |
| - Mobile browsers: Full support with touch navigation | |
| ## Limitations | |
| - Maximum 100 rows per API request | |
| - Image URLs expire after ~1 hour | |
| - No authentication support for private datasets (yet) | |
| - Read-only interface (no editing capabilities) | |
| ## Future Enhancements | |
| - [ ] Export functionality for improved texts | |
| - [ ] Batch processing capabilities | |
| - [ ] Search within dataset | |
| - [ ] Bookmarking system | |
| - [ ] Authentication for private datasets | |
| - [ ] Confidence scores visualization | |
| - [ ] Multi-dataset comparison | |
| ## Troubleshooting | |
| ### "Dataset viewer is not available" | |
| - Check if the dataset exists on HuggingFace | |
| - Ensure the dataset has the viewer enabled | |
| - Try a known working dataset like `davanstrien/exams-ocr` | |
| ### Images not loading | |
| - Image URLs expire after ~1 hour | |
| - The app automatically refreshes URLs on error | |
| - Check browser console for detailed errors | |
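The refresh-on-error behavior can be pictured as the sketch below, which retries an image once with a newly fetched URL. It is an assumption about how such a retry might be written, not the app's code: `attachUrlRefresh` is a hypothetical name, and `refetchUrl` is a caller-supplied async function that re-requests the row from the API and returns its current image URL.

```javascript
// Hypothetical sketch: retry a failed image once with a fresh URL.
function attachUrlRefresh(img, refetchUrl) {
  let retried = false;
  img.addEventListener('error', async () => {
    if (retried) return; // avoid an endless error/retry loop
    retried = true;
    try {
      img.src = await refetchUrl();
    } catch (err) {
      // Surface the failure in the browser console for debugging.
      console.error('Could not refresh image URL', err);
    }
  });
}
```

If the image still fails after the retry, the logged error in the browser console is the place to look next.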
| ### Slow loading | |
| - Large datasets may take time for initial load | |
| - Consider using datasets with pre-computed statistics | |
| - Check your internet connection | |
| ## Contributing | |
| This is a standalone tool designed for OCR exploration. Feel free to fork and customize for your needs! | |
| ## License | |
| MIT License - Use freely for any purpose | |
| ## Related Projects | |
| - [OCR Time Machine](../app.py) - Interactive OCR improvement with VLMs | |
| - [HuggingFace Datasets](https://huggingface.co/datasets) - Browse available datasets | |
| - [Dataset Viewer Docs](https://huggingface.co/docs/dataset-viewer) - API documentation |