# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
PDF Structure Inspector is a Gradio-based web application designed for debugging PDF accessibility, reading order, and structure. It helps identify issues that affect screen readers and assistive technologies by analyzing PDF structure, text extraction quality, and layout ordering.
**Target deployment**: Hugging Face Spaces (gradio SDK)
## Commands
### Development
```bash
# Run the Gradio app locally
uv run python app.py
# The app will launch at http://localhost:7860 by default
```
### Dependencies
```bash
# Sync environment (after cloning or pulling changes)
uv sync
# Add a new dependency
uv add <package>
# Add a dev dependency
uv add --dev <package>
# Regenerate requirements.txt for Hugging Face deployment (after dependency changes)
uv pip compile pyproject.toml -o requirements.txt
```
The project uses `pyproject.toml` for dependency management with uv lock file support. **Always use `uv run`** for running commands in the development environment.
## Architecture
### Core Libraries
- **PyMuPDF (fitz)**: Layout extraction, block/span detection, rendering pages as images
- **pikepdf**: Low-level PDF structure inspection (tags, MarkInfo, OCProperties, page resources)
- **Gradio**: Web UI framework
- **PIL (Pillow)**: Image manipulation for overlay rendering
### Main Application Flow (app.py)
The application has two main analysis modes: **Single Page Analysis** and **Batch Analysis**. The Advanced Analysis modules are described separately below.
#### Single Page Analysis Pipeline
1. **PDF Structure Analysis** (`pdf_struct_report`):
- Uses pikepdf to inspect PDF-level metadata
- Checks for StructTreeRoot (tagging), MarkInfo, OCProperties (layers)
- Analyzes per-page resources (fonts, XObjects)
2. **Layout Extraction** (`extract_blocks_spans`):
- Uses PyMuPDF's `get_text("dict")` to extract blocks, lines, and spans with bounding boxes
- Returns structured `BlockInfo` objects containing text, bbox, font info, and span details
- Block types: 0=text, 1=image, 2=drawing
3. **Reading Order Analysis** (`order_blocks`):
- Three ordering modes:
- `raw`: Extraction order (as stored in PDF)
- `tblr`: Top-to-bottom, left-to-right sorting by bbox
- `columns`: Simple 2-column heuristic (clusters by x-center, sorts each column separately)
4. **Diagnostic Heuristics** (`diagnose_page`):
- Detects scanned pages (no text + images)
- Identifies text-as-vector-outlines (no text + many drawings)
- Flags Type3 fonts (often correlate with broken text extraction)
- Detects garbled text (replacement characters, missing ToUnicode)
- Guesses multi-column layouts (x-center clustering)
5. **Adaptive Contrast Detection** (for visualization):
- `sample_background_color()`: Samples page at 9 points (corners, edges, center) to determine background
- `calculate_luminance()`: Uses WCAG formula to compute relative luminance (0-1)
- `get_contrast_colors()`: Returns appropriate color palette based on luminance
- Background colors cached per page for performance
6. **Visualization** (`render_page_with_overlay`):
- Renders page at specified DPI using PyMuPDF
- Automatically detects background and chooses contrasting overlay colors
- Overlays numbered block rectangles showing reading order
- Optionally shows span-level boxes
- Flags math-like content using regex heuristics (`_looks_like_math`)
7. **Result Formatting** (`format_diagnostic_summary`):
- Generates Markdown with severity icons (✅, ⚠️, ❌)
- Includes inline explanations from `DIAGNOSTIC_HELP` dictionary
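
The `tblr` and `columns` ordering modes from step 3 can be sketched on plain bounding boxes. This is a minimal illustration, not the code in `app.py`: the function names `order_tblr` and `order_columns` are hypothetical, and the median x-center divider is assumed from the description of the two-column heuristic.

```python
from statistics import median

# Each block is reduced to its bounding box (x0, y0, x1, y1) in PDF points.
BBox = tuple[float, float, float, float]


def _reading_key(bbox: BBox) -> tuple[float, float]:
    """Sort key: top edge first, then left edge."""
    return (bbox[1], bbox[0])


def order_tblr(bboxes: list[BBox]) -> list[int]:
    """Return block indices sorted top-to-bottom, then left-to-right."""
    return sorted(range(len(bboxes)), key=lambda i: _reading_key(bboxes[i]))


def order_columns(bboxes: list[BBox]) -> list[int]:
    """Split blocks at the median x-center, then read the left column
    top-to-bottom followed by the right column."""
    if not bboxes:
        return []
    centers = [(b[0] + b[2]) / 2 for b in bboxes]
    divider = median(centers)
    left = [i for i, c in enumerate(centers) if c <= divider]
    right = [i for i, c in enumerate(centers) if c > divider]
    key = lambda i: _reading_key(bboxes[i])
    return sorted(left, key=key) + sorted(right, key=key)


# Two columns with two blocks each, stored in an interleaved order.
blocks = [
    (320, 100, 560, 150),  # 0: right column, top
    (40, 100, 280, 150),   # 1: left column, top
    (320, 200, 560, 250),  # 2: right column, bottom
    (40, 200, 280, 250),   # 3: left column, bottom
]
tblr = order_tblr(blocks)     # interleaves the two columns row by row
cols = order_columns(blocks)  # finishes the left column before the right
```

On a two-column page, `tblr` jumps between columns on every row, which is why the `columns` mode exists despite being a crude heuristic.
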
#### Batch Analysis Pipeline
1. **Multi-Page Processing** (`diagnose_all_pages`):
- Analyzes multiple pages (configurable max_pages and sample_rate)
- Progress tracking via `gr.Progress()`
- Calls `diagnose_page()` for each page with timing
- Returns `BatchAnalysisResult` dataclass
2. **Aggregation** (`aggregate_results`):
- Counts issues across all pages
- Identifies critical pages (3+ issues)
- Detects common issues (affecting >50% of pages)
3. **Result Formatting**:
- `format_batch_summary_markdown()`: Executive summary with statistics
- `format_batch_results_table()`: Color-coded HTML table per page
- `format_batch_results_chart()`: Plotly bar chart of issue distribution
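
The aggregation thresholds above (critical pages with 3+ issues, common issues on more than 50% of pages) can be sketched as follows. The function name `aggregate`, the input shape, and the issue names are illustrative assumptions, not the actual `aggregate_results` signature.

```python
from collections import Counter


def aggregate(per_page: dict[int, list[str]]) -> dict:
    """per_page maps a page number to the issue names flagged on that page."""
    total = len(per_page)
    # Count each issue once per page, even if a page lists it twice.
    counts = Counter(issue for issues in per_page.values() for issue in set(issues))
    critical = sorted(p for p, issues in per_page.items() if len(set(issues)) >= 3)
    common = sorted(name for name, n in counts.items() if total and n / total > 0.5)
    return {
        "summary_stats": dict(counts),
        "critical_pages": critical,
        "common_issues": common,
    }


# Hypothetical issue names, for illustration only.
pages = {
    1: ["scanned", "type3_fonts", "garbled_text"],
    2: ["type3_fonts"],
    3: [],
    4: ["type3_fonts", "multi_column"],
}
report = aggregate(pages)
```
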
### Advanced Analysis Modules
The application includes specialized modules for advanced PDF accessibility analysis:
**advanced_analysis.py** - Coordinator module
- Provides facade functions with error handling
- `require_structure_tree` decorator: checks for tagged PDFs before execution
- `safe_execute` decorator: comprehensive error handling with user-friendly messages
- Exports high-level functions: `analyze_content_stream`, `analyze_screen_reader`, etc.
**content_stream_parser.py** - PDF operator extraction
- `extract_content_stream_for_block()`: Gets operators for a specific block
- `_parse_text_objects()`: Extracts BT...ET blocks from content stream
- `_parse_operators()`: Regex-based parsing of Tm, Tf, Tj, TJ, Td, color operators
- `_find_matching_text_object()`: Correlates text objects with BlockInfo via text matching
- Returns formatted markdown and raw stream text
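
A rough sketch of the regex approach, run on a small hand-written, already-decompressed content stream. The patterns and the `parse_text_objects` name are assumptions for illustration. The sketch ignores escaped parentheses, hex strings, and `TJ` arrays, so it is at least as approximate as the real parser.

```python
import re

# BT ... ET delimits one text object in a PDF content stream.
TEXT_OBJECT = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# "/F1 12 Tf" selects font resource F1 at size 12.
FONT = re.compile(r"/(\S+)\s+([\d.]+)\s+Tf")
# "(Hello) Tj" shows a literal string (no escaped parentheses handled here).
SHOW_TEXT = re.compile(r"\((.*?)\)\s*Tj")

stream = """
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
ET
BT
/F2 9.5 Tf
72 700 Td
(World) Tj
ET
"""


def parse_text_objects(content: str) -> list[dict]:
    """Return font name, font size, and shown strings for each text object."""
    results = []
    for match in TEXT_OBJECT.finditer(content):
        body = match.group(1)
        font = FONT.search(body)
        results.append({
            "font": font.group(1) if font else None,
            "size": float(font.group(2)) if font else None,
            "text": SHOW_TEXT.findall(body),
        })
    return results


text_objects = parse_text_objects(stream)
```
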
**screen_reader_sim.py** - Accessibility simulation
- `simulate_screen_reader()`: Main simulation function
- `_simulate_tagged()`: Follows structure tree for tagged PDFs
- `_simulate_untagged()`: Falls back to visual order for untagged PDFs
- `_format_element_announcement()`: Generates NVDA/JAWS-style announcements
- Supports heading levels, paragraphs, figures, formulas, lists, tables, links
- Infers headings from font size (>18pt = H1, >14pt = H2) for untagged PDFs
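
The font-size fallback for untagged PDFs can be sketched like this. The size thresholds come from the list above. The function names and the announcement wording are assumptions and may differ from what `_format_element_announcement()` produces.

```python
def infer_role(font_size: float) -> str:
    """Guess a structural role from font size alone (untagged fallback)."""
    if font_size > 18:
        return "H1"
    if font_size > 14:
        return "H2"
    return "P"


def announce(role: str, text: str) -> str:
    """Prefix headings with their level, roughly as a screen reader would."""
    if role.startswith("H"):
        return f"Heading level {role[1:]}, {text}"
    return text


# (text, font size in points) pairs in reading order.
spans = [("Annual Report", 24.0), ("Overview", 16.0), ("Revenue grew.", 11.0)]
transcript = [announce(infer_role(size), text) for text, size in spans]
```
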
**structure_tree.py** - Structure tree analysis
- `StructureNode` dataclass: represents PDF tag hierarchy
- `extract_structure_tree()`: Recursively parses StructTreeRoot with pikepdf
- `_parse_structure_element()`: Handles Dictionary, Array, and MCID elements
- `format_tree_text()`: Creates indented text view with box-drawing characters
- `get_tree_statistics()`: Counts nodes, tags, alt text coverage
- `extract_mcid_for_page()`: Finds marked content IDs in page content stream
- `map_blocks_to_tags()`: Correlates visual blocks with structure elements
- `detect_visual_paragraphs()`: Spacing-based paragraph detection
- `detect_semantic_paragraphs()`: Extracts `<P>` tags for a page
- `compare_paragraphs()`: Calculates match quality between visual and semantic
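
Spacing-based paragraph detection can be sketched as a single pass over vertically sorted boxes. The `group_paragraphs` name is hypothetical, and the 15pt default is taken from the Paragraph Detection feature's documented threshold. The real `detect_visual_paragraphs()` may weigh other signals.

```python
def group_paragraphs(bboxes, max_gap=15.0):
    """Group block indices into paragraphs.

    A block joins the current paragraph when the vertical gap between the
    previous block's bottom edge and its own top edge is below max_gap.
    """
    paragraphs = []
    prev_bottom = None
    for i in sorted(range(len(bboxes)), key=lambda i: bboxes[i][1]):
        top, bottom = bboxes[i][1], bboxes[i][3]
        if prev_bottom is not None and top - prev_bottom < max_gap:
            paragraphs[-1].append(i)
        else:
            paragraphs.append([i])
        prev_bottom = bottom
    return paragraphs


# Bounding boxes as (x0, y0, x1, y1) in PDF points.
lines = [
    (72, 100, 500, 112),
    (72, 116, 500, 128),  # 4pt below the previous block: same paragraph
    (72, 160, 500, 172),  # 32pt gap: starts a new paragraph
    (72, 176, 500, 188),
]
groups = group_paragraphs(lines)
```
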
### Key Data Structures
**Single Page Analysis**:
- `SpanInfo`: Individual text run with bbox, text, font, size
- `BlockInfo`: Text/image block with bbox, text, type, and list of spans
**Batch Analysis**:
- `PageDiagnostic`: Per-page diagnostic results with all issue flags and processing time
- `BatchAnalysisResult`: Aggregated statistics across multiple pages including:
- `summary_stats`: Dictionary of issue counts
- `per_page_results`: List of PageDiagnostic objects
- `common_issues`: Issues affecting >50% of pages
- `critical_pages`: Pages with 3+ issues
- `to_dict()`: Method to convert to JSON-serializable format
**Advanced Analysis**:
- `StructureNode`: Represents a node in the PDF structure tree with:
- `tag_type`: Tag name (P, H1, Document, Figure, etc.)
- `depth`: Nesting level in the tree
- `mcid`: Marked Content ID (links to page content)
- `alt_text`: Alternative text for accessibility
- `actual_text`: Actual text content or replacement text
- `page_ref`: 0-based page index
- `children`: List of child StructureNode objects
- `to_dict()`: Convert to JSON-serializable format
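
A minimal sketch of how such a node might look as a dataclass. The field names follow the list above, while the field types, the defaults, and the `alt_text_coverage` helper are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class StructureNode:
    tag_type: str
    depth: int = 0
    mcid: Optional[int] = None
    alt_text: Optional[str] = None
    actual_text: Optional[str] = None
    page_ref: Optional[int] = None  # 0-based page index
    children: list["StructureNode"] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses, including list items.
        return asdict(self)


def alt_text_coverage(root: StructureNode) -> tuple[int, int]:
    """Return (figure count, figures with alt text) for the whole tree."""
    figures = with_alt = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag_type == "Figure":
            figures += 1
            with_alt += bool(node.alt_text)
        stack.extend(node.children)
    return figures, with_alt


tree = StructureNode("Document", children=[
    StructureNode("H1", depth=1, mcid=0, page_ref=0),
    StructureNode("Figure", depth=1, mcid=1, page_ref=0, alt_text="Bar chart of revenue"),
    StructureNode("Figure", depth=1, mcid=2, page_ref=1),
])
as_json = json.dumps(tree.to_dict())
coverage = alt_text_coverage(tree)
```
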
**UI State**:
- The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
- Background color cache: `_bg_color_cache` dict keyed by (document_path, page_index)
### Gradio UI Flow
The UI is organized into three tabs: **Single Page Analysis**, **Batch Analysis**, and **Advanced Analysis**.
#### Single Page Tab
1. User uploads PDF → `_on_upload` → extracts path and page count
2. User adjusts parameters (page, DPI, order mode, visualization options)
3. Click "Analyze" → `analyze` function:
- Runs structural report (pikepdf)
- Extracts and orders blocks (PyMuPDF)
- Generates diagnostic report with adaptive contrast detection
- Creates overlay image with high-contrast colors
- Returns reading order preview + formatted summary with icons
#### Batch Analysis Tab
1. User sets max_pages (default 100) and sample_rate (default 1)
2. Click "Analyze All Pages" → `analyze_batch_with_progress` function:
- Calls `diagnose_all_pages()` with progress tracking
- Aggregates results across pages
- Returns:
- Summary markdown with statistics and common issues
- Plotly bar chart of issue distribution
- Color-coded HTML table of per-page results
- Full JSON report
#### Advanced Analysis Tab
Power-user features for deep PDF inspection and accessibility debugging. Each feature is in its own accordion:
1. **Content Stream Inspector**:
- Extracts raw PDF content stream operators for a specific block
- Shows low-level commands: text positioning (Tm, Td), fonts (Tf), text display (Tj, TJ)
- Useful for debugging text extraction, font issues, and positioning problems
- Provides both formatted view and raw stream
- Uses regex parsing of content streams (approximate for complex PDFs)
2. **Screen Reader Simulator**:
- Simulates NVDA or JAWS reading behavior for the current page
- Two modes:
- **Tagged PDFs**: Follows structure tree, announces headings/paragraphs/figures with proper semantics
- **Untagged PDFs**: Falls back to visual reading order, infers headings from font size
- Three detail levels: minimal (text only), default (element announcements), verbose (full context)
- Generates transcript + analysis with alt text coverage statistics
- Reading order configurable for untagged fallback (raw/tblr/columns)
3. **Paragraph Detection**:
- Compares visual paragraphs (detected by spacing) vs semantic `<P>` tags
- Visual detection: groups blocks with vertical gap < threshold (default 15pt)
- Semantic detection: extracts `<P>` tags from structure tree
- Generates color-coded overlay (green = visual paragraphs)
- Reports match quality score and mismatches
- Requires tagged PDF for semantic comparison
4. **Structure Tree Visualizer**:
- Extracts complete PDF tag hierarchy from StructTreeRoot
- Three visualization formats:
- **Tree Diagram**: Interactive Plotly sunburst chart
- **Text View**: Indented text with box-drawing characters
- **Statistics**: Node counts, tag distribution, alt text coverage
- Shows tag types (H1-H6, P, Figure, Table, L, LI, etc.)
- Displays alt text, actual text, page references, and MCID markers
- Only works for tagged PDFs
5. **Block-to-Tag Mapping**:
- Maps visual blocks to structure tree elements via MCID (Marked Content ID)
- Shows which blocks have proper semantic tagging
- DataFrame output with block index, tag type, MCID, alt text
- Helps identify untagged content
- Requires tagged PDF with MCID references
#### Help & Documentation
- All UI controls have `info` parameters with inline tooltips
- Expandable "Understanding the Diagnostics" accordion with detailed explanations
- `DIAGNOSTIC_HELP` and `ORDERING_MODE_HELP` dictionaries provide explanation text
- Summary sections use severity icons (✅, ⚠️, ❌) for quick scanning
## Key Features
### Adaptive Contrast Overlays
The overlay visualization automatically adapts to document background colors:
- **Light backgrounds** (luminance > 0.5) → Dark overlays (dark blue #00008B, black text)
- **Dark backgrounds** (luminance ≤ 0.5) → Light overlays (yellow #FFFF00, white text)
- Background sampled at 9 strategic points using low DPI (72) for performance
- Results cached in `_bg_color_cache` to avoid re-sampling
- Color palettes defined in `LIGHT_BG_COLORS` and `DARK_BG_COLORS` constants
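
The luminance test behind this choice can be sketched with the WCAG relative-luminance formula. The palette dictionaries below reuse the colors listed above, but their keys (`box`, `text`) and the `pick_palette` name are assumptions about the shape of `LIGHT_BG_COLORS` and `DARK_BG_COLORS`.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """WCAG relative luminance: 0.0 for black, 1.0 for white."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


# Assumed palette shape; only the color values come from this document.
LIGHT_BG_COLORS = {"box": "#00008B", "text": "#000000"}
DARK_BG_COLORS = {"box": "#FFFF00", "text": "#FFFFFF"}


def pick_palette(background_rgb: tuple[int, int, int]) -> dict:
    """Dark overlays on light pages, light overlays on dark pages."""
    if relative_luminance(background_rgb) > 0.5:
        return LIGHT_BG_COLORS
    return DARK_BG_COLORS


white_page = pick_palette((255, 255, 255))
dark_page = pick_palette((20, 20, 30))
```
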
### Inline Help System
Comprehensive documentation integrated into the UI:
- `info` parameters on all controls provide contextual tooltips
- Expandable accordion with detailed explanations of all diagnostics and modes
- Help text stored in `DIAGNOSTIC_HELP` and `ORDERING_MODE_HELP` dictionaries
- Summary formatting includes severity icons and inline explanations
### Batch Analysis
Multi-page document analysis with aggregate statistics:
- Configurable limits: max_pages (default 100), sample_rate (analyze every Nth page)
- Real-time progress tracking via `gr.Progress()`
- Outputs: summary stats, issue chart, per-page table, full JSON report
- Performance: ~10-50ms per page depending on complexity
- Identifies common issues (>50% of pages) and critical pages (3+ issues)
## Important Implementation Notes
### PDF Handling
- Always use pikepdf for structural queries (tags, resources)
- Always use PyMuPDF (fitz) for layout extraction and rendering
- Page indices are 0-based internally, 1-based in UI (convert with `page_num - 1`)
- Close documents properly using context managers (`with fitz.open()`, `with pikepdf.open()`)
### Coordinate Systems
- PyMuPDF bboxes are (x0, y0, x1, y1) in PDF points (1/72 inch)
- PIL/ImageDraw expects integer pixel coordinates
- Use `_rect_i()` to convert float bboxes to int for drawing
- DPI scaling is handled by PyMuPDF's `get_pixmap(dpi=...)`
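
A sketch of the point-to-pixel conversion, under the assumption that overlay coordinates must be scaled by `dpi / 72` to line up with the rendered image. `bbox_to_pixels` is a hypothetical helper, and whether `_rect_i()` itself applies this scaling or only rounds is not stated here.

```python
def bbox_to_pixels(bbox, dpi):
    """Scale a (x0, y0, x1, y1) bbox from PDF points to integer pixels.

    PDF points are 1/72 inch, so a page rendered at `dpi` has
    dpi / 72 pixels per point.
    """
    scale = dpi / 72
    return tuple(int(round(v * scale)) for v in bbox)


# At 144 DPI every point maps to two pixels.
pixel_box = bbox_to_pixels((72.0, 36.5, 144.3, 72.0), dpi=144)
```
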
### Heuristics Limitations
- Column detection is crude (assumes max 2 columns, uses median x-center as divider)
- Math detection is pattern-based (Unicode symbols + LaTeX-like patterns)
- All diagnostics are heuristic; tagged PDFs with proper structure should be preferred
- Type3 font detection is string-based and may have false positives
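
To make the pattern-based math check concrete, here is one possible shape for such a heuristic. The symbol set and LaTeX patterns are illustrative guesses, not the patterns used by `_looks_like_math`.

```python
import re

# A few Unicode math operators: sum, integral, square root, <=, >=, !=,
# plus-minus, infinity, partial derivative, nabla, element-of.
MATH_SYMBOLS = re.compile(
    "[\u2211\u222b\u221a\u2264\u2265\u2260\u00b1\u221e\u2202\u2207\u2208]"
)
# LaTeX-like fragments: common commands, or ^{ / _{ groupings.
LATEX_LIKE = re.compile(r"\\(?:frac|sum|int|sqrt)\b|[\^_]\{")


def looks_like_math(text: str) -> bool:
    """Return True when the text contains math symbols or LaTeX-like markup."""
    return bool(MATH_SYMBOLS.search(text) or LATEX_LIKE.search(text))


flags = [
    looks_like_math("\u2211 x \u2264 1"),
    looks_like_math(r"\frac{a}{b}"),
    looks_like_math("x^{2} + y_{i}"),
    looks_like_math("Plain sentence with snake_case."),
]
```

Like any pattern list, this misses math written in plain ASCII and can flag prose that happens to contain one of the symbols.
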
### Gradio Patterns
- File upload provides `.name` attribute for file path
- Use `gr.update()` to modify component properties dynamically (e.g., slider maximum)
- State management relies on component values, not session storage
- Use `gr.Progress()` parameter in callbacks for long-running operations (batch analysis)
- Tabs organize related functionality (`gr.Tabs()` with `gr.Tab()` children)
- Accordions (`gr.Accordion()`) for progressive disclosure of help text and detailed results
### Adaptive Contrast Implementation
- Always render at low DPI (72) for background sampling to avoid performance impact
- Sample 9 points: 4 corners + 4 edge midpoints + 1 center (at 5%, 50%, 95% positions)
- Use `statistics.median()` instead of mean to avoid outliers from text/graphics
- Cache key format: `(document.name, page_index)` tuple
- Clear cache on new document upload if memory becomes an issue
- Fallback to `LIGHT_BG_COLORS` if sampling fails or `auto_contrast=False`
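
The sampling grid and the median step can be sketched without any rendering library. `sample_points` and `median_color` are hypothetical names. In the real code the sampled colors would come from a low-DPI render of the page.

```python
from statistics import median


def sample_points(width: int, height: int) -> list[tuple[int, int]]:
    """Return the 3x3 grid of pixel positions at 5%, 50%, and 95%."""
    percents = (5, 50, 95)
    return [
        (width * px // 100, height * py // 100)
        for py in percents
        for px in percents
    ]


def median_color(samples: list[tuple[int, int, int]]) -> tuple[int, ...]:
    """Take the per-channel median so a stray text pixel cannot skew the result."""
    return tuple(int(median(channel)) for channel in zip(*samples))


points = sample_points(600, 800)
# Eight samples land on white paper, one lands on black text.
background = median_color([(255, 255, 255)] * 8 + [(0, 0, 0)])
```

A mean over the same samples would give roughly (227, 227, 227), which is why the median is the safer choice here.
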
### Batch Analysis Performance
- Default max_pages=100 prevents timeout on large documents
- Sample rate allows analyzing every Nth page (useful for 500+ page documents)
- Each page takes ~10-50ms depending on complexity (text extraction + diagnostics)
- Progress updates every page to keep UI responsive
- Use dataclasses instead of dicts for typed, self-documenting result records (add `slots=True` if per-instance memory matters)
- Consider adding timeout protection for very large documents (1000+ pages)
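
One way `max_pages` and `sample_rate` could combine is sketched below. It assumes sampling happens first and the cap applies to the number of pages actually analyzed. The real `diagnose_all_pages()` may apply them in a different order.

```python
from itertools import islice


def pages_to_analyze(page_count: int, max_pages: int = 100, sample_rate: int = 1) -> list[int]:
    """Return 0-based page indices: every Nth page, capped at max_pages."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be >= 1")
    return list(islice(range(0, page_count, sample_rate), max_pages))


every_third = pages_to_analyze(10, sample_rate=3)
large_doc = pages_to_analyze(1200, max_pages=100, sample_rate=5)
```
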
### Result Formatting
- Use Markdown with severity icons for human-readable summaries
- Icons: ✅ (no issues), ⚠️ (warnings), ❌ (critical issues)
- HTML tables for detailed per-page results allow custom styling (color-coded cells)
- Plotly charts via `gr.Plot()` for interactive visualizations
- All batch results have `.to_dict()` method for JSON export
### Advanced Analysis Error Handling
- **Graceful Degradation**: All advanced features check for requirements before execution
- **Structure Tree Required**: Features 2, 4, 5 require tagged PDFs
- `@require_structure_tree` decorator checks for StructTreeRoot
- Returns user-friendly error message if not found
- Explains what tagging is and why it's needed
- **Safe Execution**: All features wrapped in `@safe_execute` decorator
- Catches all exceptions with traceback
- Returns formatted error messages instead of crashing
- **Content Stream Parsing**: Regex-based, may fail on complex/malformed PDFs
- Returns "not matched" status if text object not found
- Shows raw stream even if parsing fails
- **MCID Extraction**: May fail if content stream uses non-standard encoding
- Returns empty list on failure
- Block-to-tag mapping shows "No mappings found" message
- **Performance Limits**: Structure tree extraction is capped at max_depth=20 to guard against unbounded recursion in cyclic or very deeply nested trees
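
The "catch everything and return a message" pattern can be sketched as a small decorator. This is an assumed shape for `safe_execute`, and the message format in `advanced_analysis.py` is likely different.

```python
import functools
import traceback


def safe_execute(func):
    """Return a formatted error string instead of letting exceptions escape."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            details = traceback.format_exc()
            return f"Error in {func.__name__}: {exc}\n\n{details}"

    return wrapper


@safe_execute
def parse_depth(value: str) -> int:
    """Stand-in for an analysis function that may raise."""
    return int(value)


ok = parse_depth("20")
failed = parse_depth("twenty")
```

Because the wrapper returns a string on failure, callers that expect structured output need to check the result type before using it.
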
## Testing
### Manual Testing Checklist
1. **Adaptive Contrast**: Test with light and dark background PDFs, verify overlay colors contrast properly
2. **Help System**: Hover over all controls, expand help accordion, verify all text displays correctly
3. **Batch Analysis**: Test with 1-page, 10-page, and 100+ page documents
4. **Edge Cases**: Scanned PDFs, multi-column layouts, math-heavy documents, Type3 fonts
### Performance Benchmarks
- Single page analysis: <1 second for typical pages
- Batch analysis: ~10-50ms per page (100 pages in 1-5 seconds)
- Background sampling adds ~50-100ms one-time cost per page
- Memory usage: ~10-20MB per 100 pages of diagnostic data
## Deployment to Hugging Face
### Pre-deployment Steps
1. Test locally: `uv run python app.py`
2. Regenerate requirements.txt: `uv pip compile pyproject.toml -o requirements.txt`
3. Commit both `pyproject.toml` and `requirements.txt`
4. Verify `app.py` is set as `app_file` in README.md frontmatter
### Hugging Face Configuration
- SDK: gradio
- SDK version: 6.3.0 (or latest compatible)
- Python version: >=3.12 (as specified in pyproject.toml)
- Main file: app.py
### Known Limitations on Hugging Face
- Very large PDFs (1000+ pages) may hit timeout limits
- Recommend setting max_pages=100 by default
- Consider adding explicit timeout handling for batch analysis