Spaces:
Running
Running
| # CLAUDE.md | |
| This file provides guidance to Claude Code (claude.ai/code) when working with the OCR Text Explorer. | |
| ## Project Overview | |
| OCR Text Explorer is a modern, standalone web application for browsing and comparing OCR text improvements in HuggingFace datasets. Built as a lightweight alternative to the Gradio-based OCR Time Machine, it focuses specifically on exploring pre-OCR'd datasets with enhanced user experience. | |
| ## Recent Updates | |
| ### Deep Link Sharing (Added 2025-08-07) | |
| The application now supports deep linking with full state preservation for easy sharing and collaboration: | |
| **Features:** | |
| - Complete URL state management for all view settings | |
| - Copy Link button for one-click sharing | |
| - Automatic restoration of view state from URL parameters | |
| - Success notification when link is copied | |
| **URL Parameters Supported:** | |
| - `dataset` - HuggingFace dataset ID | |
| - `index` - Sample index (0-based) | |
| - `view` - View mode (comparison, diff, improved) | |
| - `diff` - Diff algorithm (char, word, line, markdown) | |
| - `markdown` - Markdown rendering state (true/false) | |
| - `reasoning` - Reasoning panel expansion state (true/false, only for samples with reasoning traces) | |
| **Implementation Details:** | |
| - URL updates automatically as users navigate and change settings | |
| - Prevents double-loading when URL contains specific index | |
| - Fallback clipboard API support for older browsers | |
| - Reasoning state only included in URL when reasoning trace is present | |
| **Example Deep Links:** | |
| - Basic: `/?dataset=davanstrien/exams-ocr&index=5` | |
| - Full state: `/?dataset=davanstrien/india-medical-ocr-test&index=0&view=improved&diff=word&markdown=true&reasoning=true` | |
| ### Reasoning Trace Support (Added 2025-08-07) | |
| The application now supports displaying reasoning traces from models like NuMarkdown-8B-Thinking that include their analysis process in the output: | |
| **Features:** | |
| - Automatic detection of `<think>` and `<answer>` XML-like tags in improved text | |
| - Collapsible "Model Reasoning" panel showing the model's thought process | |
| - Clean separation of reasoning from final output for better readability | |
| - Reasoning statistics (word count, percentage of total output) | |
| - Support for formatted reasoning steps with numbered analysis points | |
| - "Reasoning Trace" badge indicator in the statistics panel | |
| - Deep link support preserves reasoning panel state | |
| **Implementation Details:** | |
| - New `reasoning-parser.js` module handles detection and parsing of reasoning traces | |
| - Supports multiple reasoning formats (`<think>`, `<thinking>`, `<reasoning>` tags) | |
| - **Important:** Only well-formed traces with both opening AND closing tags are parsed | |
| - Malformed traces (missing closing tags) are displayed as plain text | |
| - Formats numbered steps from reasoning content for structured display | |
| - Caches parsed reasoning to avoid reprocessing | |
| - Exports include optional reasoning trace content | |
| - Reasoning panel state included in shareable URLs | |
| **Supported Datasets:** | |
| - `davanstrien/india-medical-ocr-test` - Medical documents processed with NuMarkdown-8B-Thinking | |
| - Any dataset with reasoning traces in supported XML-like formats | |
| **UI Components:** | |
| - Collapsible reasoning panel with smooth animations | |
| - Step-by-step reasoning display with numbered indicators | |
| - "Final Output" label when reasoning is present | |
| - Dark mode optimized styling for reasoning sections | |
| ### Markdown Rendering Support (Added 2025-08-01) | |
| The application now supports rendering markdown-formatted VLM output for improved readability: | |
| **Features:** | |
| - Automatic markdown detection in improved OCR text | |
| - Toggle button to switch between raw markdown and rendered view | |
| - Support for common markdown elements: headers, lists, tables, code blocks, links | |
| - Security-focused implementation with XSS prevention | |
| - Performance optimization with render caching | |
| **Implementation Details:** | |
| - Uses marked.js library for markdown parsing | |
| - Custom renderers for security (sanitizes URLs, prevents script injection) | |
| - Tailwind-styled markdown elements matching the app's design | |
| - HTML table support for VLM outputs that use table tags | |
| - Cache system limits memory usage to 50 rendered items | |
| **UI Changes:** | |
| - Markdown toggle button appears when markdown is detected | |
| - "Markdown Detected" badge in statistics panel | |
| - New "Markdown Diff" mode showing plain vs rendered comparison | |
| - Both "Improved Only" and "Side by Side" views support rendering | |
| ## Architecture | |
| ### Technology Stack | |
| - **Frontend Framework**: Alpine.js (lightweight reactivity, ~15KB) | |
| - **Styling**: Tailwind CSS (utility-first, responsive design) | |
| - **Interactions**: HTMX (server-side rendering capabilities) | |
| - **API**: HuggingFace Dataset Viewer API (no backend required) | |
| - **Language**: Vanilla JavaScript (no build process needed) | |
| ### Core Components | |
| **index.html** - Main application shell | |
| - Split-pane layout (1/3 image, 2/3 text comparison) | |
| - Three view modes: Side-by-side, Inline diff, Improved only | |
| - Dark mode support with proper contrast | |
| - Responsive design for mobile devices | |
| **js/dataset-api.js** - HuggingFace API wrapper | |
| - Smart caching with 45-minute expiration for signed URLs | |
| - Batch loading (100 rows at a time) | |
| - Automatic column detection for different dataset schemas | |
| - Image URL refresh on expiration | |
| **js/app.js** - Alpine.js application logic | |
| - Keyboard navigation (J/K, arrows) | |
| - URL state management for shareable links | |
| - Diff mode switching (character/word/line) | |
| - Dark mode persistence in localStorage | |
| **js/diff-utils.js** - Text comparison algorithms | |
| - Character-level diff with inline highlighting | |
| - Word-level diff preserving whitespace | |
| - Line-level diff for larger changes | |
| - LCS (Longest Common Subsequence) implementation | |
| **css/styles.css** - Custom styling | |
| - Dark mode enhancements | |
| - Diff highlighting with accessibility in mind | |
| - Smooth transitions and animations | |
| - Print-friendly styles | |
| ## Key Design Decisions | |
| ### Why Separate from OCR Time Machine? | |
| 1. **Focused Purpose**: OCR Time Machine is for live OCR processing with VLMs (requires GPU), while this explorer is for browsing pre-processed results | |
| 2. **Performance**: No Python/Gradio overhead - instant loading and navigation | |
| 3. **User Experience**: Custom UI optimized for text comparison workflows | |
| 4. **Deployment**: Static files can be hosted anywhere (GitHub Pages, CDN, etc.) | |
| ### API vs Backend Trade-offs | |
| **Chose HF Dataset Viewer API because:** | |
| - No backend infrastructure needed | |
| - Automatic image serving with CDN | |
| - Built-in pagination support | |
| - Works with any public HF dataset | |
| **Limitations accepted:** | |
| - Image URLs expire (~1 hour) | |
| - 100 rows max per request | |
| - No write capabilities | |
| - Public datasets only (no auth yet) | |
| ### UI/UX Principles | |
| 1. **Keyboard-first**: Professional users prefer keyboard navigation | |
| 2. **Information density**: Show more content, less chrome | |
| 3. **Visual diff**: Color-coded changes are easier to scan than side-by-side | |
| 4. **Dark mode**: Essential for extended reading sessions | |
| 5. **Responsive**: Works on tablets for field work | |
| ## Development Approach | |
| ### Phase 1: MVP (Completed) | |
| - Basic dataset loading and navigation | |
| - Side-by-side text comparison | |
| - Keyboard shortcuts | |
| - Dark mode | |
| ### Phase 2: Enhancements (Completed) | |
| - Three diff algorithms (char/word/line) | |
| - URL state management | |
| - Image error handling with refresh | |
| - Responsive mobile layout | |
| ### Phase 3: Polish (Completed) | |
| - Fixed dark mode contrast issues | |
| - Optimized performance with direct indexing | |
| - Added loading states and error handling | |
| - Comprehensive documentation | |
| ## Common Tasks | |
| ### Adding Column Name Patterns | |
| ```javascript | |
| // In dataset-api.js detectColumns() method | |
| if (!originalTextColumn && ['your_column_name'].includes(name)) { | |
| originalTextColumn = name; | |
| } | |
| ``` | |
| ### Adding Keyboard Shortcuts | |
| ```javascript | |
| // In app.js setupKeyboardNavigation() | |
| case 'your_key': | |
| // Your action | |
| break; | |
| ``` | |
| ### Customizing Diff Colors | |
| ```javascript | |
| // In diff-utils.js | |
| // Light mode: bg-red-200, text-red-800 | |
| // Dark mode: bg-red-950, text-red-300 | |
| ``` | |
| ### Working with Markdown Rendering | |
| ```javascript | |
| // Enable/disable markdown rendering | |
| this.renderMarkdown = true; // Toggle markdown rendering | |
| // Add new markdown patterns to detection | |
| // In app.js detectMarkdown() method | |
| const markdownPatterns = [ | |
| /your_pattern_here/, // Add your pattern | |
| // ... existing patterns | |
| ]; | |
| // Customize markdown styles | |
| // In app.js renderMarkdownText() method | |
| html = html.replace(/<your_element>/g, '<your_element class="your-tailwind-classes">'); | |
| ``` | |
| ## Performance Optimizations | |
| 1. **Direct Dataset Indexing**: Uses `dataset[index]` instead of loading batches into memory | |
| 2. **Smart Caching**: Caches API responses for 45 minutes (conservative for signed URLs) | |
| 3. **Batch Fetching**: Loads 100 rows at once, caches for smooth navigation | |
| 4. **Lazy Loading**: Only fetches data when needed | |
| ## Known Issues & Solutions | |
| ### Issue: Navigation buttons were disabled | |
| **Cause**: API response structure wasn't parsed correctly | |
| **Fix**: Updated getTotalRows() to check `size.config.num_rows` and `size.splits[0].num_rows` | |
| ### Issue: Dark mode text unreadable | |
| **Cause**: Insufficient contrast in diff highlighting and code blocks | |
| **Fix**: | |
| - Changed diff colors to use `dark:bg-red-950` and `dark:text-red-300` | |
| - Added explicit `text-gray-900 dark:text-gray-100` to all text containers | |
| ### Issue: Image loading errors | |
| **Cause**: Signed URLs expire after ~1 hour | |
| **Fix**: Implemented handleImageError() with automatic URL refresh | |
| ### Issue: Markdown tables not rendering | |
| **Cause**: Default marked.js settings and HTML security restrictions | |
| **Fix**: | |
| - Enabled `tables: true` in marked.js options | |
| - Added safe HTML table tag allowlist in renderer | |
| - Applied proper Tailwind CSS classes to table elements | |
| - Added CSS overrides for prose container compatibility | |
| ## Mobile Support Status | |
| While the application claims responsive design, the current mobile support is limited. A comprehensive mobile enhancement is planned but not yet implemented. See [mobile-enhancement-plan.md](mobile-enhancement-plan.md) for detailed technical requirements and implementation approach. | |
| **Current limitations:** | |
| - Fixed desktop layout doesn't adapt well to small screens | |
| - No touch gesture support for navigation | |
| - Small touch targets for buttons and inputs | |
| - Desktop-only interactions (hover states, keyboard shortcuts) | |
| **Planned improvements:** | |
| - Responsive stacked layout for mobile devices | |
| - Touch gestures (swipe for navigation) | |
| - Mobile-optimized navigation bar | |
| - Touch-friendly UI components | |
| ## Future Enhancements | |
| - [ ] Comprehensive mobile support (see mobile-enhancement-plan.md) | |
| - [ ] Search/filter within dataset | |
| - [ ] Bookmark favorite samples | |
| - [ ] Export selected texts | |
| - [ ] Support for private datasets (auth) | |
| - [ ] Metrics display (CER/WER) | |
| - [ ] Batch operations | |
| - [ ] PWA support for offline viewing | |
| ## Deployment | |
| ### Static Hosting (Recommended) | |
| ```bash | |
| # Any static file server works | |
| python3 -m http.server 8000 | |
| npx serve . | |
| ``` | |
| ### GitHub Pages | |
| 1. Push to GitHub repository | |
| 2. Enable Pages in settings | |
| 3. Access at: `https://[username].github.io/[repo]/ocr-text-explorer/` | |
| ### CDN Deployment | |
| - Upload files to any CDN | |
| - No server-side processing needed | |
| - Works with CloudFlare, Netlify, Vercel, etc. | |
| ## Testing Datasets | |
| Known working datasets: | |
| - `davanstrien/exams-ocr` - Default dataset with exam papers (uses `text` and `markdown` columns) | |
| - `davanstrien/rolm-test` - Victorian theatre playbills processed with RolmOCR (uses `text` and `rolmocr_text` columns, includes `inference_info` metadata) | |
| - Any dataset with image + text columns | |
| Column patterns automatically detected: | |
| - Original: `text`, `ocr`, `original_text`, `ground_truth` | |
| - Improved: `markdown`, `new_ocr`, `corrected_text`, `vlm_ocr`, `rolmocr_text` | |
| - Metadata: `inference_info` (JSON array with model details, processing date, parameters) | |
| ## Recent Updates | |
| ### Model Information Display (Added 2025-08-04) | |
| The application now displays model processing information when available: | |
| **Features:** | |
| - Automatic detection of `inference_info` column | |
| - Model metadata panel showing: model name, processing date, batch size, max tokens | |
| - Link to processing script when available | |
| - Positioned prominently below image for immediate visibility | |
| **Implementation Notes:** | |
| - The model info panel only appears when `inference_info` column exists | |
| - Supports datasets processed with UV scripts via HF Jobs | |
| - Gracefully handles datasets without model metadata | |
| ### Reasoning Trace Parsing Fix (Added 2025-08-07) | |
| Fixed an issue where reasoning traces with incomplete or malformed XML tags would cause parsing errors: | |
| **Problem:** | |
| - Some model outputs contained opening `<think>` tags without closing `</think>` tags | |
| - This appeared to be truncated or malformed model output | |
| - The parser would attempt to parse these incomplete traces, causing confusion | |
| **Solution:** | |
| - Updated `detectReasoningTrace()` to require BOTH opening and closing tags | |
| - Added console warnings when incomplete traces are detected | |
| - Malformed traces are now displayed as plain text instead of being parsed | |
| **Benefits:** | |
| - Cleaner handling of incomplete model outputs | |
| - No confusing partial reasoning panels for malformed content | |
| - Maintains full functionality for well-formed reasoning traces | |
| - Helpful console warnings for debugging | |
| **Technical Details:** | |
| - File: `js/reasoning-parser.js` | |
| - Only traces with complete XML tags (`<think>...</think>`, `<thinking>...</thinking>`, etc.) are parsed | |
| - Incomplete traces log: "Incomplete reasoning trace detected - missing closing tags" |