	---
	title: OCR Time Capsule
	emoji: 📦
	colorFrom: blue
	colorTo: indigo
	sdk: static
	pinned: false
	---

	# OCR Time Capsule 📦

	A fast, modern web interface for exploring and comparing OCR text improvements in Hugging Face datasets. Browse pre-processed OCR improvements to see how AI models enhance historical document transcriptions.

	![OCR Time Capsule](https://img.shields.io/badge/OCR-Time%20Capsule-blue)

	## Features

	- Fast Navigation: Browse through large OCR datasets with keyboard shortcuts (J/K or arrow keys)
	- Side-by-Side Comparison: View original OCR and improved text simultaneously
	- Advanced Diff Visualization: Character, word, or line-level differences with color highlighting
	- No Backend Required: Direct integration with HuggingFace Dataset Viewer API
	- Responsive Design: Works seamlessly on desktop and mobile devices
	- Dark Mode: Easy on the eyes for extended reading sessions
	- URL Sharing: Share specific dataset samples with direct links
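
To illustrate the word-level diff mode listed above, here is a minimal sketch based on a longest-common-subsequence table. This is a generic technique chosen for illustration; the app may use a different algorithm or a diff library, and the function name `wordDiff` and the operation labels are assumptions.

```javascript
// Word-level diff via a longest-common-subsequence (LCS) table.
// Returns a list of { type: 'same' | 'removed' | 'added', text } operations.
function wordDiff(original, improved) {
  const a = original.split(/\s+/).filter(Boolean);
  const b = improved.split(/\s+/).filter(Boolean);

  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the start to emit operations in reading order.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: 'same', text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: 'removed', text: a[i++] });
    } else {
      ops.push({ type: 'added', text: b[j++] });
    }
  }
  while (i < a.length) ops.push({ type: 'removed', text: a[i++] });
  while (j < b.length) ops.push({ type: 'added', text: b[j++] });
  return ops;
}
```

A renderer could then wrap `removed` and `added` words in differently colored spans. Character-level and line-level modes would follow the same pattern with a different tokenizer.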

	## Quick Start

	### Option 1: Local Development

	1. Clone or download this directory
	2. Serve the files using any static web server:

	```bash
	# Using Python
	python -m http.server 8000

	# Using Node.js (serve defaults to port 3000, so set the port explicitly)
	npx serve -l 8000 .

	# Using PHP
	php -S localhost:8000
	```

	3. Open http://localhost:8000 in your browser

	### Option 2: GitHub Pages

	1. Push this directory to a GitHub repository
	2. Enable GitHub Pages in repository settings
	3. Access via `https://[username].github.io/[repo-name]/`

	### Option 3: Direct File Access

	Open `index.html` directly in a modern web browser. Note: the page then runs from a `file://` origin, so some features may be limited by browser security restrictions such as CORS.

	## Usage

	### Loading a Dataset

	1. Enter a HuggingFace dataset ID (e.g., `davanstrien/exams-ocr`)
	2. Click "Load" or press Enter
	3. The explorer will automatically detect text columns

	### Navigation

	- Next: Press `J` or `→` arrow key
	- Previous: Press `K` or `←` arrow key
	- Switch Views: Press `1` (comparison), `2` (diff), or `3` (improved only)
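
The shortcuts above can be thought of as a lookup from key to action. The following is a minimal sketch under that assumption; the action names and the `actionForKey` helper are hypothetical, and the actual handler in `js/app.js` may be structured differently.

```javascript
// Hypothetical action names; the real handler lives in js/app.js.
const KEY_ACTIONS = new Map([
  ['j', 'next'],
  ['ArrowRight', 'next'],
  ['k', 'previous'],
  ['ArrowLeft', 'previous'],
  ['1', 'view:comparison'],
  ['2', 'view:diff'],
  ['3', 'view:improved'],
]);

// key is expected to be a KeyboardEvent.key value such as "j" or "ArrowRight".
function actionForKey(key) {
  // Single-character keys are matched case-insensitively;
  // named keys such as "ArrowRight" are kept as-is.
  const normalized = key.length === 1 ? key.toLowerCase() : key;
  return KEY_ACTIONS.get(normalized) ?? null;
}
```

In the browser, a `keydown` listener would pass `event.key` to `actionForKey` and ignore the event when the result is `null`.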

	### Supported Column Names

	The explorer automatically detects these column patterns:

	- Original OCR: `text`, `ocr`, `original_text`, `ground_truth`
	- Improved OCR: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`
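
As a rough sketch of how this detection could work, the function below picks the first column that matches each pattern list. It assumes exact, case-sensitive name matching and first-match priority; `detectTextColumns` is a hypothetical name, and the real `detectColumns` method may apply different rules.

```javascript
// Column name patterns taken from the lists above.
const ORIGINAL_PATTERNS = ['text', 'ocr', 'original_text', 'ground_truth'];
const IMPROVED_PATTERNS = ['markdown', 'new_ocr', 'corrected_text', 'vlm_ocr'];

// columnNames: array of column names from the dataset's features.
// Returns the first matching column for each role, or null if none matches.
function detectTextColumns(columnNames) {
  let originalTextColumn = null;
  let improvedTextColumn = null;
  for (const name of columnNames) {
    if (!originalTextColumn && ORIGINAL_PATTERNS.includes(name)) {
      originalTextColumn = name;
    } else if (!improvedTextColumn && IMPROVED_PATTERNS.includes(name)) {
      improvedTextColumn = name;
    }
  }
  return { originalTextColumn, improvedTextColumn };
}
```

For example, a dataset with columns `image`, `text`, and `markdown` would resolve to `text` as the original and `markdown` as the improved column.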

	## Technical Details

	### Architecture

	```
	┌─────────────────┐     ┌──────────────────────┐
	│   Browser UI    │────▶│ HF Dataset Viewer API│
	│   (Alpine.js)   │     │  (datasets-server)   │
	└─────────────────┘     └──────────────────────┘
	         │
	         ▼
	┌─────────────────┐
	│   Local Cache   │
	│  (JavaScript)   │
	└─────────────────┘
	```

	### API Integration

	Uses the HuggingFace Dataset Viewer API:
	- Base URL: `https://datasets-server.huggingface.co`
	- No authentication required for public datasets
	- Automatic handling of image URL expiration
	- Smart batching for efficient data loading
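
For illustration, a request to the Dataset Viewer API's `/rows` endpoint could be built as shown below. The helper name `buildRowsUrl` and the default `config` and `split` values are assumptions; a real dataset may use different config and split names.

```javascript
const BASE_URL = 'https://datasets-server.huggingface.co';

// Build a /rows request URL. The 'default' config and 'train' split are
// assumed defaults for this sketch, not guaranteed for every dataset.
function buildRowsUrl({ dataset, config = 'default', split = 'train', offset = 0, length = 100 }) {
  const params = new URLSearchParams({
    dataset,
    config,
    split,
    offset: String(offset),
    // The API returns at most 100 rows per request.
    length: String(Math.min(length, 100)),
  });
  return `${BASE_URL}/rows?${params}`;
}
```

Passing the resulting URL to `fetch` and parsing the JSON response should yield the rows for that window; no authentication header is needed for public datasets.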

	### Performance Optimizations

	- Batch Loading: Fetches 100 rows at a time
	- Smart Caching: Reduces API calls
	- Lazy Loading: Only loads visible content
	- URL Refresh: Automatically refreshes expired image URLs
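
The batch loading and caching ideas above can be sketched as follows. This is a simplified illustration, not the app's actual code: `BatchCache`, `batchOffsetFor`, and the injected `fetchBatch` callback are hypothetical names.

```javascript
const BATCH_SIZE = 100;

// Offset of the batch that contains a given row index.
function batchOffsetFor(index, batchSize = BATCH_SIZE) {
  return Math.floor(index / batchSize) * batchSize;
}

class BatchCache {
  // fetchBatch(offset, length) must return a Promise of an array of rows.
  constructor(fetchBatch, batchSize = BATCH_SIZE) {
    this.fetchBatch = fetchBatch;
    this.batchSize = batchSize;
    this.batches = new Map(); // offset -> Promise of rows
  }

  async getRow(index) {
    const offset = batchOffsetFor(index, this.batchSize);
    if (!this.batches.has(offset)) {
      // Cache the promise so concurrent requests share a single fetch.
      const pending = this.fetchBatch(offset, this.batchSize);
      // Drop failed batches so a later call can retry.
      pending.catch(() => this.batches.delete(offset));
      this.batches.set(offset, pending);
    }
    const rows = await this.batches.get(offset);
    return rows[index - offset];
  }
}
```

With this layout, stepping through rows 0 to 99 triggers one request, and row 100 triggers the next.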

	## Customization

	### Adding New Column Patterns

	Edit `js/dataset-api.js` and update the `detectColumns` method:

	```javascript
	// Add your column name to the list checked for the original OCR text
	if (!originalTextColumn && ['your_column_name'].includes(name)) {
	    originalTextColumn = name;
	}
	```

	### Styling

	The UI uses Tailwind CSS. Modify styles in:
	- `css/styles.css` for custom styles
	- Tailwind classes directly in `index.html`

	### Keyboard Shortcuts

	Add new shortcuts in `js/app.js`:

	```javascript
	case 'your_key':
	    // Your action here
	    break;
	```

	## Browser Support

	- Chrome/Edge: Full support
	- Firefox: Full support
	- Safari: Full support (14+)
	- Mobile browsers: Full support with touch navigation

	## Limitations

	- Maximum 100 rows per API request
	- Image URLs expire after ~1 hour
	- No authentication support for private datasets (yet)
	- Read-only interface (no editing capabilities)

	## Future Enhancements

	- [ ] Export functionality for improved texts
	- [ ] Batch processing capabilities
	- [ ] Search within dataset
	- [ ] Bookmarking system
	- [ ] Authentication for private datasets
	- [ ] Confidence scores visualization
	- [ ] Multi-dataset comparison

	## Troubleshooting

	### "Dataset viewer is not available"
	- Check if the dataset exists on HuggingFace
	- Ensure the dataset has viewer enabled
	- Try a known working dataset like `davanstrien/exams-ocr`

	### Images not loading
	- Image URLs expire after ~1 hour
	- The app automatically refreshes URLs on error
	- Check browser console for detailed errors
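
As a rough sketch of the refresh-on-error behavior described above, the helpers below estimate whether a URL has likely expired and retry an image once with a fresh URL. The names `isLikelyExpired` and `refreshOnError`, the five-minute safety margin, and the `refetchUrl` callback are all assumptions for illustration.

```javascript
const URL_TTL_MS = 60 * 60 * 1000; // "~1 hour", as noted above
const SAFETY_MARGIN_MS = 5 * 60 * 1000; // hypothetical margin

// fetchedAt and now are millisecond timestamps.
function isLikelyExpired(fetchedAt, now = Date.now()) {
  return now - fetchedAt >= URL_TTL_MS - SAFETY_MARGIN_MS;
}

// Retry an image once with a freshly fetched URL.
// refetchUrl is a hypothetical async callback that re-requests the row
// and resolves to its new image URL.
async function refreshOnError(img, refetchUrl) {
  if (img.dataset.retried === 'true') return false; // avoid retry loops
  img.dataset.retried = 'true';
  img.src = await refetchUrl();
  return true;
}
```

In the browser, `refreshOnError` would be called from an image's `error` event handler.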

	### Slow loading
	- Large datasets may take time for initial load
	- Consider using datasets with pre-computed statistics
	- Check your internet connection

	## Contributing

	This is a standalone tool designed for OCR exploration. Feel free to fork and customize for your needs!

	## License

	MIT License - Use freely for any purpose

	## Related Projects

	- [OCR Time Machine](../app.py) - Interactive OCR improvement with VLMs
	- [HuggingFace Datasets](https://huggingface.co/datasets) - Browse available datasets
	- [Dataset Viewer Docs](https://huggingface.co/docs/dataset-viewer) - API documentation