# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

PDF Structure Inspector is a Gradio-based web application designed for debugging PDF accessibility, reading order, and structure. It helps identify issues that affect screen readers and assistive technologies by analyzing PDF structure, text extraction quality, and layout ordering.

**Target deployment**: Hugging Face Spaces (gradio SDK)

## Commands

### Development
```bash
# Run the Gradio app locally
uv run python app.py

# The app will launch at http://localhost:7860 by default
```

### Dependencies
```bash
# Sync environment (after cloning or pulling changes)
uv sync

# Add a new dependency
uv add <package>

# Add a dev dependency
uv add --dev <package>

# Regenerate requirements.txt for Hugging Face deployment (after dependency changes)
uv pip compile pyproject.toml -o requirements.txt
```

The project declares its dependencies in `pyproject.toml` and resolves them through uv's lock file. **Always use `uv run`** to run commands in the development environment.

## Architecture

### Core Libraries
- **PyMuPDF (fitz)**: Layout extraction, block/span detection, rendering pages as images
- **pikepdf**: Low-level PDF structure inspection (tags, MarkInfo, OCProperties, page resources)
- **Gradio**: Web UI framework
- **PIL (Pillow)**: Image manipulation for overlay rendering

### Main Application Flow (app.py)

The application has two main modes: **Single Page Analysis** and **Batch Analysis**.

#### Single Page Analysis Pipeline

1. **PDF Structure Analysis** (`pdf_struct_report`):
   - Uses pikepdf to inspect PDF-level metadata
   - Checks for StructTreeRoot (tagging), MarkInfo, OCProperties (layers)
   - Analyzes per-page resources (fonts, XObjects)

2. **Layout Extraction** (`extract_blocks_spans`):
   - Uses PyMuPDF's `get_text("dict")` to extract blocks, lines, and spans with bounding boxes
   - Returns structured `BlockInfo` objects containing text, bbox, font info, and span details
   - Block types: 0=text, 1=image, 2=drawing

3. **Reading Order Analysis** (`order_blocks`):
   - Three ordering modes:
     - `raw`: Extraction order (as stored in PDF)
     - `tblr`: Top-to-bottom, left-to-right sorting by bbox
     - `columns`: Simple 2-column heuristic (clusters by x-center, sorts each column separately)

4. **Diagnostic Heuristics** (`diagnose_page`):
   - Detects scanned pages (no text + images)
   - Identifies text-as-vector-outlines (no text + many drawings)
   - Flags Type3 fonts (often correlate with broken text extraction)
   - Detects garbled text (replacement characters, missing ToUnicode)
   - Guesses multi-column layouts (x-center clustering)

5. **Adaptive Contrast Detection** (for visualization):
   - `sample_background_color()`: Samples page at 9 points (corners, edges, center) to determine background
   - `calculate_luminance()`: Uses WCAG formula to compute relative luminance (0-1)
   - `get_contrast_colors()`: Returns appropriate color palette based on luminance
   - Background colors cached per page for performance

6. **Visualization** (`render_page_with_overlay`):
   - Renders page at specified DPI using PyMuPDF
   - Automatically detects background and chooses contrasting overlay colors
   - Overlays numbered block rectangles showing reading order
   - Optionally shows span-level boxes
   - Flags math-like content using regex heuristics (`_looks_like_math`)

7. **Result Formatting** (`format_diagnostic_summary`):
   - Generates Markdown with severity icons (✓, ⚠️, ❌)
   - Includes inline explanations from `DIAGNOSTIC_HELP` dictionary
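
The three ordering modes from step 3 can be sketched as below. This is an illustration under stated assumptions, not the code in `app.py`: `Block` is a hypothetical stand-in for `BlockInfo`, `order_blocks_sketch` is a made-up name, and the column split assumes the median x-center acts as the divider between two columns.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Block:
    """Hypothetical stand-in for BlockInfo with only the fields sorting needs."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in PDF points


def _top_left(block):
    # Sort key: top edge first, then left edge.
    return (block.bbox[1], block.bbox[0])


def order_blocks_sketch(blocks, mode="raw"):
    """Return blocks in the requested reading order (illustrative only)."""
    if mode == "raw":
        return list(blocks)  # keep extraction order
    if mode == "tblr":
        return sorted(blocks, key=_top_left)
    if mode == "columns":
        if not blocks:
            return []
        centers = [(b.bbox[0] + b.bbox[2]) / 2 for b in blocks]
        divider = median(centers)  # assumed split point between two columns
        left = [b for b, c in zip(blocks, centers) if c <= divider]
        right = [b for b, c in zip(blocks, centers) if c > divider]
        return sorted(left, key=_top_left) + sorted(right, key=_top_left)
    raise ValueError(f"unknown order mode: {mode!r}")


blocks = [
    Block("right-top", (320, 50, 560, 100)),
    Block("left-bottom", (40, 200, 280, 250)),
    Block("left-top", (40, 50, 280, 100)),
    Block("right-bottom", (320, 200, 560, 250)),
]
print([b.text for b in order_blocks_sketch(blocks, "columns")])
```

On a two-column page, `tblr` interleaves the columns row by row, while `columns` finishes the left column before starting the right one.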

#### Batch Analysis Pipeline

1. **Multi-Page Processing** (`diagnose_all_pages`):
   - Analyzes multiple pages (configurable max_pages and sample_rate)
   - Progress tracking via `gr.Progress()`
   - Calls `diagnose_page()` for each page with timing
   - Returns `BatchAnalysisResult` dataclass

2. **Aggregation** (`aggregate_results`):
   - Counts issues across all pages
   - Identifies critical pages (3+ issues)
   - Detects common issues (affecting >50% of pages)

3. **Result Formatting**:
   - `format_batch_summary_markdown()`: Executive summary with statistics
   - `format_batch_results_table()`: Color-coded HTML table per page
   - `format_batch_results_chart()`: Plotly bar chart of issue distribution
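
The aggregation rules in step 2 (critical pages with 3+ issues, common issues on more than 50% of pages) can be sketched as below. The function name, the input shape, and the issue names are all hypothetical; the real `aggregate_results` works on `PageDiagnostic` objects.

```python
from collections import Counter


def aggregate_sketch(per_page_issues):
    """Summarise issues across pages (illustrative only).

    per_page_issues maps a 0-based page index to the issue names found on it.
    """
    total_pages = len(per_page_issues)
    # Count each issue once per page, even if it was reported twice.
    counts = Counter(
        issue for issues in per_page_issues.values() for issue in set(issues)
    )
    # Critical pages: 3 or more distinct issues on the same page.
    critical = sorted(
        page for page, issues in per_page_issues.items() if len(set(issues)) >= 3
    )
    # Common issues: present on more than half of the analysed pages.
    common = sorted(name for name, n in counts.items() if n > total_pages / 2)
    return {
        "summary_stats": dict(counts),
        "critical_pages": critical,
        "common_issues": common,
    }


result = aggregate_sketch({
    0: ["type3_fonts", "garbled_text", "multi_column"],
    1: ["type3_fonts"],
    2: [],
    3: ["type3_fonts", "multi_column"],
})
print(result["critical_pages"], result["common_issues"])
```

Note that "more than 50%" is strict here: an issue on exactly half of the pages is not reported as common.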

### Advanced Analysis Modules

The application includes specialized modules for advanced PDF accessibility analysis:

**advanced_analysis.py** - Coordinator module
- Provides facade functions with error handling
- `require_structure_tree` decorator: checks for tagged PDFs before execution
- `safe_execute` decorator: comprehensive error handling with user-friendly messages
- Exports high-level functions: `analyze_content_stream`, `analyze_screen_reader`, etc.

**content_stream_parser.py** - PDF operator extraction
- `extract_content_stream_for_block()`: Gets operators for a specific block
- `_parse_text_objects()`: Extracts BT...ET blocks from content stream
- `_parse_operators()`: Regex-based parsing of Tm, Tf, Tj, TJ, Td, color operators
- `_find_matching_text_object()`: Correlates text objects with BlockInfo via text matching
- Returns formatted markdown and raw stream text
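
A rough sketch of regex-based operator parsing, to show why this approach is approximate. The patterns and the `parse_text_objects_sketch` name are assumptions for illustration, not the module's actual expressions, and only `Tf` and `Tj` are covered.

```python
import re

# BT ... ET delimit a text object; DOTALL lets "." span newlines.
TEXT_OBJECT_RE = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# "/F1 12 Tf" selects a font resource and a size.
FONT_RE = re.compile(r"/(\S+)\s+(-?\d+(?:\.\d+)?)\s+Tf\b")
# "(Hello) Tj" shows a literal string; backslash-escaped characters are allowed.
SHOW_RE = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj\b")


def parse_text_objects_sketch(stream):
    """Return one dict per BT...ET object with its fonts and shown strings."""
    objects = []
    for match in TEXT_OBJECT_RE.finditer(stream):
        body = match.group(1)
        objects.append({
            "fonts": [(name, float(size)) for name, size in FONT_RE.findall(body)],
            "strings": SHOW_RE.findall(body),  # escapes are left undecoded
        })
    return objects


sample_stream = r"""
BT
/F1 12 Tf
72 720 Td
(Hello, world) Tj
ET
BT
/F2 9.5 Tf
(Second \(nested\) line) Tj
ET
"""
parsed = parse_text_objects_sketch(sample_stream)
print(parsed)
```

A pattern like this can be fooled by a literal string that itself contains `BT` or `ET`, by hex strings (`<...> Tj`), and by `TJ` arrays, which is one reason text matching against `BlockInfo` can come back unmatched.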

**screen_reader_sim.py** - Accessibility simulation
- `simulate_screen_reader()`: Main simulation function
- `_simulate_tagged()`: Follows structure tree for tagged PDFs
- `_simulate_untagged()`: Falls back to visual order for untagged PDFs
- `_format_element_announcement()`: Generates NVDA/JAWS-style announcements
- Supports heading levels, paragraphs, figures, formulas, lists, tables, links
- Infers headings from font size (>18pt = H1, >14pt = H2) for untagged PDFs
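
The font-size heading inference for untagged PDFs can be sketched as below, using the thresholds stated above. The function names and the announcement wording are hypothetical; the real `_format_element_announcement()` output may differ.

```python
from typing import Optional


def infer_heading_level(font_size: float) -> Optional[str]:
    """Guess a heading tag from font size (>18pt = H1, >14pt = H2)."""
    if font_size > 18:
        return "H1"
    if font_size > 14:
        return "H2"
    return None  # treated as body text


def announce_sketch(text: str, font_size: float) -> str:
    """Build a screen-reader-style announcement (wording is illustrative)."""
    tag = infer_heading_level(font_size)
    if tag is None:
        return text
    return f"Heading level {tag[1]}, {text}"


print(announce_sketch("Introduction", 24.0))
print(announce_sketch("Background", 16.0))
print(announce_sketch("Body text stays as is.", 11.0))
```

Both comparisons are strict, so text at exactly 18pt is treated as H2 and text at exactly 14pt as body text.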

**structure_tree.py** - Structure tree analysis
- `StructureNode` dataclass: represents PDF tag hierarchy
- `extract_structure_tree()`: Recursively parses StructTreeRoot with pikepdf
- `_parse_structure_element()`: Handles Dictionary, Array, and MCID elements
- `format_tree_text()`: Creates indented text view with box-drawing characters
- `get_tree_statistics()`: Counts nodes, tags, alt text coverage
- `extract_mcid_for_page()`: Finds marked content IDs in page content stream
- `map_blocks_to_tags()`: Correlates visual blocks with structure elements
- `detect_visual_paragraphs()`: Spacing-based paragraph detection
- `detect_semantic_paragraphs()`: Extracts `<P>` tags for a page
- `compare_paragraphs()`: Calculates match quality between visual and semantic
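
Spacing-based paragraph detection can be sketched as below. It assumes the blocks are already in top-to-bottom order within a single column and uses a 15pt gap threshold; `detect_visual_paragraphs_sketch` is a hypothetical name and the real `detect_visual_paragraphs()` may group differently.

```python
def detect_visual_paragraphs_sketch(bboxes, gap_threshold=15.0):
    """Group block bboxes into paragraphs by vertical gap (illustrative only).

    bboxes: list of (x0, y0, x1, y1) tuples in PDF points, in reading order.
    """
    paragraphs = []
    for bbox in bboxes:
        if paragraphs:
            previous_bottom = paragraphs[-1][-1][3]
            gap = bbox[1] - previous_bottom
            if gap < gap_threshold:
                # Close enough to the previous block: same paragraph.
                paragraphs[-1].append(bbox)
                continue
        # First block, or a gap at or above the threshold: new paragraph.
        paragraphs.append([bbox])
    return paragraphs


page_blocks = [
    (72, 72, 540, 90),    # first line
    (72, 95, 540, 113),   # 5pt below the previous block
    (72, 140, 540, 158),  # 27pt below the previous block
]
print(len(detect_visual_paragraphs_sketch(page_blocks)))
```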

### Key Data Structures

**Single Page Analysis**:
- `SpanInfo`: Individual text run with bbox, text, font, size
- `BlockInfo`: Text/image block with bbox, text, type, and list of spans

**Batch Analysis**:
- `PageDiagnostic`: Per-page diagnostic results with all issue flags and processing time
- `BatchAnalysisResult`: Aggregated statistics across multiple pages including:
  - `summary_stats`: Dictionary of issue counts
  - `per_page_results`: List of PageDiagnostic objects
  - `common_issues`: Issues affecting >50% of pages
  - `critical_pages`: Pages with 3+ issues
  - `to_dict()`: Method to convert to JSON-serializable format

**Advanced Analysis**:
- `StructureNode`: Represents a node in the PDF structure tree with:
  - `tag_type`: Tag name (P, H1, Document, Figure, etc.)
  - `depth`: Nesting level in the tree
  - `mcid`: Marked Content ID (links to page content)
  - `alt_text`: Alternative text for accessibility
  - `actual_text`: Actual text content or replacement text
  - `page_ref`: 0-based page index
  - `children`: List of child StructureNode objects
  - `to_dict()`: Convert to JSON-serializable format
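
A minimal sketch of a node with the fields listed above. The field types, the defaults, and the use of `dataclasses.asdict` are assumptions; the class is named `StructureNodeSketch` to make clear it is not the definition in `structure_tree.py`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class StructureNodeSketch:
    """Illustrative node in a PDF structure tree."""
    tag_type: str
    depth: int = 0
    mcid: Optional[int] = None
    alt_text: Optional[str] = None
    actual_text: Optional[str] = None
    page_ref: Optional[int] = None  # 0-based page index
    children: List["StructureNodeSketch"] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses, including list items.
        return asdict(self)


root = StructureNodeSketch(
    "Document",
    children=[
        StructureNodeSketch("H1", depth=1, mcid=0, page_ref=0),
        StructureNodeSketch("Figure", depth=1, mcid=1, alt_text="Bar chart", page_ref=0),
    ],
)
print(json.dumps(root.to_dict(), indent=2))
```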

**UI State**:
- The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
- Background color cache: `_bg_color_cache` dict keyed by (document_path, page_index)

### Gradio UI Flow

The UI is organized into three tabs: **Single Page Analysis**, **Batch Analysis**, and **Advanced Analysis**.

#### Single Page Tab
1. User uploads PDF → `_on_upload` → extracts path and page count
2. User adjusts parameters (page, DPI, order mode, visualization options)
3. Click "Analyze" → `analyze` function:
   - Runs structural report (pikepdf)
   - Extracts and orders blocks (PyMuPDF)
   - Generates diagnostic report with adaptive contrast detection
   - Creates overlay image with high-contrast colors
   - Returns reading order preview + formatted summary with icons

#### Batch Analysis Tab
1. User sets max_pages (default 100) and sample_rate (default 1)
2. Click "Analyze All Pages" → `analyze_batch_with_progress` function:
   - Calls `diagnose_all_pages()` with progress tracking
   - Aggregates results across pages
   - Returns:
     - Summary markdown with statistics and common issues
     - Plotly bar chart of issue distribution
     - Color-coded HTML table of per-page results
     - Full JSON report

#### Advanced Analysis Tab

Power-user features for deep PDF inspection and accessibility debugging. Each feature is in its own accordion:

1. **Content Stream Inspector**:
   - Extracts raw PDF content stream operators for a specific block
   - Shows low-level commands: text positioning (Tm, Td), fonts (Tf), text display (Tj, TJ)
   - Useful for debugging text extraction, font issues, and positioning problems
   - Provides both formatted view and raw stream
   - Uses regex parsing of content streams (approximate for complex PDFs)

2. **Screen Reader Simulator**:
   - Simulates NVDA or JAWS reading behavior for the current page
   - Two modes:
     - **Tagged PDFs**: Follows structure tree, announces headings/paragraphs/figures with proper semantics
     - **Untagged PDFs**: Falls back to visual reading order, infers headings from font size
   - Three detail levels: minimal (text only), default (element announcements), verbose (full context)
   - Generates transcript + analysis with alt text coverage statistics
   - Reading order configurable for untagged fallback (raw/tblr/columns)

3. **Paragraph Detection**:
   - Compares visual paragraphs (detected by spacing) vs semantic `<P>` tags
   - Visual detection: groups blocks with vertical gap < threshold (default 15pt)
   - Semantic detection: extracts `<P>` tags from structure tree
   - Generates color-coded overlay (green = visual paragraphs)
   - Reports match quality score and mismatches
   - Requires tagged PDF for semantic comparison

4. **Structure Tree Visualizer**:
   - Extracts complete PDF tag hierarchy from StructTreeRoot
   - Three visualization formats:
     - **Tree Diagram**: Interactive Plotly sunburst chart
     - **Text View**: Indented text with box-drawing characters
     - **Statistics**: Node counts, tag distribution, alt text coverage
   - Shows tag types (H1-H6, P, Figure, Table, L, LI, etc.)
   - Displays alt text, actual text, page references, and MCID markers
   - Only works for tagged PDFs

5. **Block-to-Tag Mapping**:
   - Maps visual blocks to structure tree elements via MCID (Marked Content ID)
   - Shows which blocks have proper semantic tagging
   - DataFrame output with block index, tag type, MCID, alt text
   - Helps identify untagged content
   - Requires tagged PDF with MCID references

#### Help & Documentation
- All UI controls have `info` parameters with inline tooltips
- Expandable "📖 Understanding the Diagnostics" accordion with detailed explanations
- `DIAGNOSTIC_HELP` and `ORDERING_MODE_HELP` dictionaries provide explanation text
- Summary sections use severity icons (✓, ⚠️, ❌) for quick scanning

## Key Features

### Adaptive Contrast Overlays
The overlay visualization automatically adapts to document background colors:
- **Light backgrounds** (luminance > 0.5) → Dark overlays (dark blue #00008B, black text)
- **Dark backgrounds** (luminance ≤ 0.5) → Light overlays (yellow #FFFF00, white text)
- Background sampled at 9 strategic points using low DPI (72) for performance
- Results cached in `_bg_color_cache` to avoid re-sampling
- Color palettes defined in `LIGHT_BG_COLORS` and `DARK_BG_COLORS` constants
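
The luminance test behind the palette choice can be sketched with the WCAG relative luminance formula. The dictionary keys (`"box"`, `"text"`) and function names are hypothetical; only the hex values for the box colors and the 0.5 cutoff come from the notes above.

```python
# Hypothetical palettes; the real LIGHT_BG_COLORS / DARK_BG_COLORS may hold more keys.
LIGHT_BG_PALETTE = {"box": "#00008B", "text": "#000000"}  # dark overlays
DARK_BG_PALETTE = {"box": "#FFFF00", "text": "#FFFFFF"}   # light overlays


def relative_luminance(rgb):
    """WCAG relative luminance of an 8-bit sRGB color, in the range 0-1."""
    def linearize(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def pick_palette(background_rgb):
    """Choose overlay colors that contrast with the sampled background."""
    if relative_luminance(background_rgb) > 0.5:
        return LIGHT_BG_PALETTE
    return DARK_BG_PALETTE


print(pick_palette((255, 255, 255))["box"])
print(pick_palette((20, 20, 40))["box"])
```

Because the formula linearizes sRGB values, a mid-gray such as `(128, 128, 128)` has a luminance of roughly 0.22 and is treated as a dark background.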

### Inline Help System
Comprehensive documentation integrated into the UI:
- `info` parameters on all controls provide contextual tooltips
- Expandable accordion with detailed explanations of all diagnostics and modes
- Help text stored in `DIAGNOSTIC_HELP` and `ORDERING_MODE_HELP` dictionaries
- Summary formatting includes severity icons and inline explanations

### Batch Analysis
Multi-page document analysis with aggregate statistics:
- Configurable limits: max_pages (default 100), sample_rate (analyze every Nth page)
- Real-time progress tracking via `gr.Progress()`
- Outputs: summary stats, issue chart, per-page table, full JSON report
- Performance: ~10-50ms per page depending on complexity
- Identifies common issues (>50% of pages) and critical pages (3+ issues)

## Important Implementation Notes

### PDF Handling
- Always use pikepdf for structural queries (tags, resources)
- Always use PyMuPDF (fitz) for layout extraction and rendering
- Page indices are 0-based internally, 1-based in UI (convert with `page_num - 1`)
- Close documents properly using context managers (`with fitz.open()`, `with pikepdf.open()`)

### Coordinate Systems
- PyMuPDF bboxes are (x0, y0, x1, y1) in PDF points (1/72 inch)
- PIL/ImageDraw expects integer pixel coordinates
- Use `_rect_i()` to convert float bboxes to int for drawing
- DPI scaling is handled by PyMuPDF's `get_pixmap(dpi=...)`
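
A sketch of the point-to-pixel conversion needed when drawing on a rendered page. PDF points are 1/72 inch, so a page rendered at a given DPI is scaled by `dpi / 72`. The `rect_to_pixels` helper is hypothetical and may not match what `_rect_i()` does internally.

```python
def rect_to_pixels(bbox, dpi=72):
    """Convert a float bbox in PDF points to integer pixel coordinates."""
    scale = dpi / 72  # 72 points per inch
    return tuple(int(round(value * scale)) for value in bbox)


bbox = (72.0, 36.4, 144.2, 100.6)
print(rect_to_pixels(bbox))           # same scale as PDF points
print(rect_to_pixels(bbox, dpi=144))  # twice the resolution
```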

### Heuristics Limitations
- Column detection is crude (assumes max 2 columns, uses median x-center as divider)
- Math detection is pattern-based (Unicode symbols + LaTeX-like patterns)
- All diagnostics are heuristic; tagged PDFs with proper structure should be preferred
- Type3 font detection is string-based and may have false positives
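
To illustrate why pattern-based math detection is only a heuristic, here is a sketch that combines a Unicode symbol set with LaTeX-like patterns. The symbol list, the regex, and the function name are assumptions, not the contents of `_looks_like_math`.

```python
import re

# Assumed sample of math symbols; the real heuristic may use a different set.
MATH_SYMBOLS = set("∑∫∏√∞≤≥≠≈±×÷∂∇∈∉⊂⊆∪∩→⇒")
# LaTeX-like fragments: commands such as \frac, or ^{...} / _{...} groups.
LATEX_LIKE_RE = re.compile(r"\\[A-Za-z]+|[_^]\{[^}]*\}")


def looks_like_math_sketch(text):
    """Return True if the text contains math symbols or LaTeX-like patterns."""
    if any(ch in MATH_SYMBOLS for ch in text):
        return True
    return bool(LATEX_LIKE_RE.search(text))


print(looks_like_math_sketch("∑ x_i ≤ 1"))
print(looks_like_math_sketch(r"\frac{a}{b}"))
print(looks_like_math_sketch("A plain sentence."))
print(looks_like_math_sketch(r"C:\Users\me"))
```

The last call returns `True` because a Windows path looks like a LaTeX command to the regex, which is the kind of false positive to expect from this approach.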

### Gradio Patterns
- File upload provides `.name` attribute for file path
- Use `gr.update()` to modify component properties dynamically (e.g., slider maximum)
- State management relies on component values, not session storage
- Use `gr.Progress()` parameter in callbacks for long-running operations (batch analysis)
- Tabs organize related functionality (`gr.Tabs()` with `gr.Tab()` children)
- Accordions (`gr.Accordion()`) for progressive disclosure of help text and detailed results

### Adaptive Contrast Implementation
- Always render at low DPI (72) for background sampling to avoid performance impact
- Sample 9 points: 4 corners + 4 edge midpoints + 1 center (at 5%, 50%, 95% positions)
- Use `statistics.median()` instead of mean to avoid outliers from text/graphics
- Cache key format: `(document.name, page_index)` tuple
- Clear cache on new document upload if memory becomes an issue
- Fallback to `LIGHT_BG_COLORS` if sampling fails or `auto_contrast=False`
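
The sampling grid, the median step, and the cache lookup can be sketched without any rendering code. All names here are hypothetical, and the `sampler` callable stands in for whatever renders the page at 72 DPI and reads pixel colors.

```python
from statistics import median

SAMPLE_PERCENTS = (5, 50, 95)  # positions along each axis


def sample_points(width, height):
    """Return 9 pixel coordinates: 4 corners, 4 edge midpoints, 1 center."""
    return [
        ((width - 1) * px // 100, (height - 1) * py // 100)
        for py in SAMPLE_PERCENTS
        for px in SAMPLE_PERCENTS
    ]


def median_color(samples):
    """Per-channel median, so a few samples landing on text do not skew it."""
    return tuple(int(median(channel)) for channel in zip(*samples))


_bg_cache_sketch = {}


def cached_background(doc_name, page_index, sampler):
    """Look up (doc_name, page_index) in the cache, sampling only on a miss."""
    key = (doc_name, page_index)
    if key not in _bg_cache_sketch:
        _bg_cache_sketch[key] = sampler()
    return _bg_cache_sketch[key]


# Seven samples hit the white page, two hit black text.
samples = [(255, 255, 255)] * 7 + [(0, 0, 0)] * 2
print(sample_points(100, 200))
print(median_color(samples))
```

With these samples the mean would be about 198 per channel, a gray that matches nothing on the page, while the median stays at the white background.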

### Batch Analysis Performance
- Default max_pages=100 prevents timeout on large documents
- Sample rate allows analyzing every Nth page (useful for 500+ page documents)
- Each page takes ~10-50ms depending on complexity (text extraction + diagnostics)
- Progress updates every page to keep UI responsive
- Use dataclasses instead of dicts for better memory efficiency
- Consider adding timeout protection for very large documents (1000+ pages)
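
How `max_pages` and `sample_rate` interact can be sketched as below. This assumes `max_pages` caps the number of pages analyzed after sampling; the app may instead cap the page range before sampling, so treat `pages_to_analyze` as a hypothetical helper.

```python
def pages_to_analyze(page_count, max_pages=100, sample_rate=1):
    """Return the 0-based page indices a batch run would visit."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be at least 1")
    sampled = range(0, page_count, sample_rate)  # every Nth page
    return list(sampled[:max_pages])             # cap the amount of work


print(pages_to_analyze(5))
print(len(pages_to_analyze(1200, max_pages=100, sample_rate=10)))
```

Under this assumption, a 1200-page document with `sample_rate=10` visits 100 pages spread across the first 1000 pages rather than only the first 100.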

### Result Formatting
- Use Markdown with severity icons for human-readable summaries
- Icons: ✓ (no issues), ⚠️ (warnings), ❌ (critical issues)
- HTML tables for detailed per-page results allow custom styling (color-coded cells)
- Plotly charts via `gr.Plot()` for interactive visualizations
- All batch results have `.to_dict()` method for JSON export

### Advanced Analysis Error Handling
- **Graceful Degradation**: All advanced features check for requirements before execution
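- **Structure Tree Required**: Features 4 and 5 require tagged PDFs, as does the semantic comparison in feature 3; feature 2 falls back to visual order for untagged PDFs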
  - `@require_structure_tree` decorator checks for StructTreeRoot
  - Returns user-friendly error message if not found
  - Explains what tagging is and why it's needed
- **Safe Execution**: All features wrapped in `@safe_execute` decorator
  - Catches all exceptions with traceback
  - Returns formatted error messages instead of crashing
- **Content Stream Parsing**: Regex-based, may fail on complex/malformed PDFs
  - Returns "not matched" status if text object not found
  - Shows raw stream even if parsing fails
- **MCID Extraction**: May fail if content stream uses non-standard encoding
  - Returns empty list on failure
  - Block-to-tag mapping shows "No mappings found" message
- **Performance Limits**: Structure tree extraction has max_depth=20 to prevent infinite loops
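
A sketch of the catch-all wrapper pattern described for `@safe_execute`. The error message format and the `safe_execute_sketch` name are assumptions; the real decorator in `advanced_analysis.py` may format its output differently.

```python
import functools
import traceback


def safe_execute_sketch(func):
    """Wrap a feature so any exception becomes a readable message."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # deliberately broad: the UI must not crash
            details = traceback.format_exc()
            return f"Error in {func.__name__}: {exc}\n\nTraceback:\n{details}"
    return wrapper


@safe_execute_sketch
def analyze_page(page_num):
    """Hypothetical feature that fails on an invalid page number."""
    if page_num < 1:
        raise ValueError("page numbers start at 1")
    return f"analysis for page {page_num}"


print(analyze_page(3))
print(analyze_page(0).splitlines()[0])
```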

## Testing

### Manual Testing Checklist
1. **Adaptive Contrast**: Test with light and dark background PDFs, verify overlay colors contrast properly
2. **Help System**: Hover over all controls, expand help accordion, verify all text displays correctly
3. **Batch Analysis**: Test with 1-page, 10-page, and 100+ page documents
4. **Edge Cases**: Scanned PDFs, multi-column layouts, math-heavy documents, Type3 fonts

### Performance Benchmarks
- Single page analysis: <1 second for typical pages
- Batch analysis: ~10-50ms per page (100 pages in 1-5 seconds)
- Background sampling adds ~50-100ms one-time cost per page
- Memory usage: ~10-20MB per 100 pages of diagnostic data

## Deployment to Hugging Face

### Pre-deployment Steps
1. Test locally: `uv run python app.py`
2. Regenerate requirements.txt: `uv pip compile pyproject.toml -o requirements.txt`
3. Commit both `pyproject.toml` and `requirements.txt`
4. Verify `app.py` is set as `app_file` in README.md frontmatter

### Hugging Face Configuration
- SDK: gradio
- SDK version: 6.3.0 (or latest compatible)
- Python version: >=3.12 (as specified in pyproject.toml)
- Main file: app.py

### Known Limitations on Hugging Face
- Very large PDFs (1000+ pages) may hit timeout limits
- Keep the `max_pages=100` default to stay within those limits
- Consider adding explicit timeout handling for batch analysis