CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

PDF Structure Inspector is a Gradio-based web application designed for debugging PDF accessibility, reading order, and structure. It helps identify issues that affect screen readers and assistive technologies by analyzing PDF structure, text extraction quality, and layout ordering.

Target deployment: Hugging Face Spaces (gradio SDK)

Commands

Development

# Run the Gradio app locally
uv run python app.py

# The app will launch at http://localhost:7860 by default

Dependencies

# Sync environment (after cloning or pulling changes)
uv sync

# Add a new dependency
uv add <package>

# Add a dev dependency
uv add --dev <package>

# Regenerate requirements.txt for Hugging Face deployment (after dependency changes)
uv pip compile pyproject.toml -o requirements.txt

The project manages dependencies in pyproject.toml, with resolved versions pinned in uv's lock file (uv.lock). Always use uv run to run commands in the development environment.

Architecture

Core Libraries

  • PyMuPDF (fitz): Layout extraction, block/span detection, rendering pages as images
  • pikepdf: Low-level PDF structure inspection (tags, MarkInfo, OCProperties, page resources)
  • Gradio: Web UI framework
  • PIL (Pillow): Image manipulation for overlay rendering

Main Application Flow (app.py)

The application has two main modes: Single Page Analysis and Batch Analysis.

Single Page Analysis Pipeline

  1. PDF Structure Analysis (pdf_struct_report):

    • Uses pikepdf to inspect PDF-level metadata
    • Checks for StructTreeRoot (tagging), MarkInfo, OCProperties (layers)
    • Analyzes per-page resources (fonts, XObjects)
  2. Layout Extraction (extract_blocks_spans):

    • Uses PyMuPDF's get_text("dict") to extract blocks, lines, and spans with bounding boxes
    • Returns structured BlockInfo objects containing text, bbox, font info, and span details
    • Block types: 0=text, 1=image, 2=drawing (PyMuPDF's get_text("dict") itself reports only types 0 and 1)
  3. Reading Order Analysis (order_blocks):

    • Three ordering modes:
      • raw: Extraction order (as stored in PDF)
      • tblr: Top-to-bottom, left-to-right sorting by bbox
      • columns: Simple 2-column heuristic (clusters by x-center, sorts each column separately)
  4. Diagnostic Heuristics (diagnose_page):

    • Detects scanned pages (no text + images)
    • Identifies text-as-vector-outlines (no text + many drawings)
    • Flags Type3 fonts (often correlate with broken text extraction)
    • Detects garbled text (replacement characters, missing ToUnicode)
    • Guesses multi-column layouts (x-center clustering)
  5. Adaptive Contrast Detection (for visualization):

    • sample_background_color(): Samples page at 9 points (corners, edges, center) to determine background
    • calculate_luminance(): Uses WCAG formula to compute relative luminance (0-1)
    • get_contrast_colors(): Returns appropriate color palette based on luminance
    • Background colors cached per page for performance
  6. Visualization (render_page_with_overlay):

    • Renders page at specified DPI using PyMuPDF
    • Automatically detects background and chooses contrasting overlay colors
    • Overlays numbered block rectangles showing reading order
    • Optionally shows span-level boxes
    • Flags math-like content using regex heuristics (_looks_like_math)
  7. Result Formatting (format_diagnostic_summary):

    • Generates Markdown with severity icons (✓, ⚠️, ❌)
    • Includes inline explanations from DIAGNOSTIC_HELP dictionary
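
The three ordering modes from step 3 can be sketched as follows. This is a minimal illustration, not the code in app.py: the `Block` tuple is a hypothetical stand-in for BlockInfo, and the median x-center divider follows the description under Heuristics Limitations.

```python
from statistics import median
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """Hypothetical stand-in for BlockInfo with only the fields ordering needs."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def _x_center(block: Block) -> float:
    return (block.bbox[0] + block.bbox[2]) / 2


def _top_left(block: Block) -> Tuple[float, float]:
    # Sort key: top edge first, then left edge.
    return (block.bbox[1], block.bbox[0])


def order_blocks(blocks: List[Block], mode: str = "raw") -> List[Block]:
    if mode == "raw":
        return list(blocks)  # extraction order, unchanged
    if mode == "tblr":
        return sorted(blocks, key=_top_left)
    if mode == "columns":
        if not blocks:
            return []
        # Assumed heuristic: split at the median x-center, read left column first.
        divider = median(_x_center(b) for b in blocks)
        left = [b for b in blocks if _x_center(b) < divider]
        right = [b for b in blocks if _x_center(b) >= divider]
        return sorted(left, key=_top_left) + sorted(right, key=_top_left)
    raise ValueError(f"unknown ordering mode: {mode!r}")


# Two-column page whose extraction order is scrambled.
page = [
    Block("B", (300, 90, 500, 200)),
    Block("C", (50, 300, 250, 400)),
    Block("A", (50, 100, 250, 200)),
    Block("D", (300, 310, 500, 400)),
]
for mode in ("raw", "tblr", "columns"):
    print(mode, "".join(b.text for b in order_blocks(page, mode)))
```

On this sample page, tblr interleaves the two columns because the right column starts slightly higher, while columns reads the whole left column before the right one.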

Batch Analysis Pipeline

  1. Multi-Page Processing (diagnose_all_pages):

    • Analyzes multiple pages (configurable max_pages and sample_rate)
    • Progress tracking via gr.Progress()
    • Calls diagnose_page() for each page with timing
    • Returns BatchAnalysisResult dataclass
  2. Aggregation (aggregate_results):

    • Counts issues across all pages
    • Identifies critical pages (3+ issues)
    • Detects common issues (affecting >50% of pages)
  3. Result Formatting:

    • format_batch_summary_markdown(): Executive summary with statistics
    • format_batch_results_table(): Color-coded HTML table per page
    • format_batch_results_chart(): Plotly bar chart of issue distribution
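
The aggregation rules above (critical pages have 3+ issues, common issues affect more than 50% of pages) can be sketched like this. The input shape, a mapping from page index to issue names, and the issue names themselves are assumptions for illustration; the real aggregate_results works on PageDiagnostic objects.

```python
from collections import Counter
from typing import Dict, List


def aggregate_results(page_issues: Dict[int, List[str]]) -> dict:
    """Aggregate per-page issue lists (hypothetical input shape)."""
    total_pages = len(page_issues)
    # Count each issue at most once per page.
    counts = Counter(
        issue for issues in page_issues.values() for issue in set(issues)
    )
    common = sorted(
        issue for issue, n in counts.items()
        if total_pages and n / total_pages > 0.5
    )
    critical = sorted(
        page for page, issues in page_issues.items() if len(set(issues)) >= 3
    )
    return {
        "summary_stats": dict(counts),
        "common_issues": common,
        "critical_pages": critical,
    }


report = aggregate_results({
    0: ["scanned"],
    1: ["scanned", "type3_fonts", "garbled_text"],
    2: [],
    3: ["scanned"],
})
print(report)
```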

Advanced Analysis Modules

The application includes specialized modules for advanced PDF accessibility analysis:

advanced_analysis.py - Coordinator module

  • Provides facade functions with error handling
  • require_structure_tree decorator: checks for tagged PDFs before execution
  • safe_execute decorator: comprehensive error handling with user-friendly messages
  • Exports high-level functions: analyze_content_stream, analyze_screen_reader, etc.
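
A rough sketch of how the two decorators could fit together. The result dictionaries, the message text, and the injectable `has_tree` check are assumptions made so the example runs without pikepdf; the real require_structure_tree is described as a plain decorator that inspects the PDF for StructTreeRoot itself.

```python
import functools
import traceback
from typing import Any, Callable, Dict


def safe_execute(func: Callable[..., Any]) -> Callable[..., Any]:
    """Turn any exception into a structured error result instead of crashing."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return {
                "ok": False,
                "message": f"{func.__name__} failed: {exc}",
                "traceback": traceback.format_exc(),
            }
    return wrapper


def require_structure_tree(has_tree: Callable[[str], bool]):
    """Skip the wrapped function when the PDF is untagged (hypothetical check)."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(pdf_path: str, *args: Any, **kwargs: Any) -> Any:
            if not has_tree(pdf_path):
                return {
                    "ok": False,
                    "message": "This PDF has no StructTreeRoot (it is untagged), "
                               "so tag-based analysis is unavailable.",
                }
            return func(pdf_path, *args, **kwargs)
        return wrapper
    return decorator


TAGGED = {"tagged.pdf"}  # fake lookup standing in for a real StructTreeRoot check


@safe_execute
@require_structure_tree(lambda path: path in TAGGED)
def count_tags(pdf_path: str) -> Dict[str, Any]:
    return {"ok": True, "tags": 3}


@safe_execute
def parse_stream(data: bytes) -> Dict[str, Any]:
    if not data:
        raise ValueError("empty content stream")
    return {"ok": True}


print(count_tags("tagged.pdf"))
print(count_tags("plain.pdf")["message"])
print(parse_stream(b"")["message"])
```

Stacking safe_execute outermost means a failure inside the structure check is also reported as a message rather than raised.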

content_stream_parser.py - PDF operator extraction

  • extract_content_stream_for_block(): Gets operators for a specific block
  • _parse_text_objects(): Extracts BT...ET blocks from content stream
  • _parse_operators(): Regex-based parsing of Tm, Tf, Tj, TJ, Td, color operators
  • _find_matching_text_object(): Correlates text objects with BlockInfo via text matching
  • Returns formatted markdown and raw stream text
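
To make the regex-based approach concrete, here is a simplified sketch of pulling BT...ET text objects and a few operators out of a decoded content stream. The patterns and the `parse_text_objects` name are illustrative assumptions, not the module's actual code, and they deliberately ignore escaped parentheses, TJ arrays, and hex strings, which is one reason this style of parsing is only approximate.

```python
import re
from typing import Dict, List

TEXT_OBJECT_RE = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
FONT_RE = re.compile(r"/(\S+)\s+(\d+(?:\.\d+)?)\s+Tf\b")   # /F1 12 Tf
SHOW_TEXT_RE = re.compile(r"\(([^()]*)\)\s*Tj\b")          # (Hello) Tj


def parse_text_objects(stream: str) -> List[Dict[str, object]]:
    """Return one record per BT...ET block with its font, size, and Tj strings."""
    objects = []
    for match in TEXT_OBJECT_RE.finditer(stream):
        body = match.group(1)
        font = FONT_RE.search(body)
        objects.append({
            "font": font.group(1) if font else None,
            "size": float(font.group(2)) if font else None,
            "text": SHOW_TEXT_RE.findall(body),
        })
    return objects


sample_stream = """BT
/F1 12 Tf
72 720 Td
(Hello) Tj
ET
q 1 0 0 1 0 0 cm Q
BT
/F2 9.5 Tf
(World) Tj
ET
"""
parsed = parse_text_objects(sample_stream)
for obj in parsed:
    print(obj)
```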

screen_reader_sim.py - Accessibility simulation

  • simulate_screen_reader(): Main simulation function
  • _simulate_tagged(): Follows structure tree for tagged PDFs
  • _simulate_untagged(): Falls back to visual order for untagged PDFs
  • _format_element_announcement(): Generates NVDA/JAWS-style announcements
  • Supports heading levels, paragraphs, figures, formulas, lists, tables, links
  • Infers headings from font size (>18pt = H1, >14pt = H2) for untagged PDFs
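
The font-size rule for untagged PDFs is small enough to show directly. The thresholds come from the bullet above; the function names and the announcement wording are assumptions, not the simulator's real output format.

```python
from typing import Optional


def infer_heading_level(font_size: float) -> Optional[int]:
    """Map a font size in points to a guessed heading level, or None for body text."""
    if font_size > 18:
        return 1
    if font_size > 14:
        return 2
    return None


def announce(text: str, font_size: float) -> str:
    # Hypothetical announcement format, loosely modeled on screen reader output.
    level = infer_heading_level(font_size)
    if level is None:
        return text
    return f"Heading level {level}: {text}"


print(announce("Annual Report", 24.0))
print(announce("Methods", 16.0))
print(announce("Body text here.", 11.0))
```

Both comparisons are strict, so text at exactly 18pt is treated as H2 and text at exactly 14pt as body text.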

structure_tree.py - Structure tree analysis

  • StructureNode dataclass: represents PDF tag hierarchy
  • extract_structure_tree(): Recursively parses StructTreeRoot with pikepdf
  • _parse_structure_element(): Handles Dictionary, Array, and MCID elements
  • format_tree_text(): Creates indented text view with box-drawing characters
  • get_tree_statistics(): Counts nodes, tags, alt text coverage
  • extract_mcid_for_page(): Finds marked content IDs in page content stream
  • map_blocks_to_tags(): Correlates visual blocks with structure elements
  • detect_visual_paragraphs(): Spacing-based paragraph detection
  • detect_semantic_paragraphs(): Extracts <P> tags for a page
  • compare_paragraphs(): Calculates match quality between visual and semantic

Key Data Structures

Single Page Analysis:

  • SpanInfo: Individual text run with bbox, text, font, size
  • BlockInfo: Text/image block with bbox, text, type, and list of spans

Batch Analysis:

  • PageDiagnostic: Per-page diagnostic results with all issue flags and processing time
  • BatchAnalysisResult: Aggregated statistics across multiple pages including:
    • summary_stats: Dictionary of issue counts
    • per_page_results: List of PageDiagnostic objects
    • common_issues: Issues affecting >50% of pages
    • critical_pages: Pages with 3+ issues
    • to_dict(): Method to convert to JSON-serializable format

Advanced Analysis:

  • StructureNode: Represents a node in the PDF structure tree with:
    • tag_type: Tag name (P, H1, Document, Figure, etc.)
    • depth: Nesting level in the tree
    • mcid: Marked Content ID (links to page content)
    • alt_text: Alternative text for accessibility
    • actual_text: Actual text content or replacement text
    • page_ref: 0-based page index
    • children: List of child StructureNode objects
    • to_dict(): Convert to JSON-serializable format
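
A sketch of what StructureNode might look like as a dataclass, using the field names listed above. The types, defaults, and the `figure_alt_coverage` helper are assumptions for illustration; the actual definition lives in structure_tree.py.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Iterator, List, Optional


@dataclass
class StructureNode:
    tag_type: str
    depth: int = 0
    mcid: Optional[int] = None
    alt_text: Optional[str] = None
    actual_text: Optional[str] = None
    page_ref: Optional[int] = None  # 0-based page index
    children: List["StructureNode"] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict recurses into nested dataclasses, so children are converted too.
        return asdict(self)


def iter_nodes(node: StructureNode) -> Iterator[StructureNode]:
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def figure_alt_coverage(root: StructureNode) -> float:
    """Fraction of Figure nodes that carry alt text (hypothetical helper)."""
    figures = [n for n in iter_nodes(root) if n.tag_type == "Figure"]
    if not figures:
        return 1.0
    return sum(1 for n in figures if n.alt_text) / len(figures)


doc = StructureNode("Document", children=[
    StructureNode("H1", depth=1, mcid=0, page_ref=0),
    StructureNode("Figure", depth=1, mcid=1, alt_text="Sales chart", page_ref=0),
    StructureNode("Figure", depth=1, mcid=2, page_ref=1),
])
print(json.dumps(doc.to_dict(), indent=2))
print(figure_alt_coverage(doc))
```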

UI State:

  • The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
  • Background color cache: _bg_color_cache dict keyed by (document_path, page_index)

Gradio UI Flow

The UI is organized into three tabs: Single Page Analysis, Batch Analysis, and Advanced Analysis.

Single Page Tab

  1. User uploads PDF → _on_upload → extracts path and page count
  2. User adjusts parameters (page, DPI, order mode, visualization options)
  3. Click "Analyze" → analyze function:
    • Runs structural report (pikepdf)
    • Extracts and orders blocks (PyMuPDF)
    • Generates diagnostic report with adaptive contrast detection
    • Creates overlay image with high-contrast colors
    • Returns reading order preview + formatted summary with icons

Batch Analysis Tab

  1. User sets max_pages (default 100) and sample_rate (default 1)
  2. Click "Analyze All Pages" → analyze_batch_with_progress function:
    • Calls diagnose_all_pages() with progress tracking
    • Aggregates results across pages
    • Returns:
      • Summary markdown with statistics and common issues
      • Plotly bar chart of issue distribution
      • Color-coded HTML table of per-page results
      • Full JSON report

Advanced Analysis Tab

Power-user features for deep PDF inspection and accessibility debugging. Each feature is in its own accordion:

  1. Content Stream Inspector:

    • Extracts raw PDF content stream operators for a specific block
    • Shows low-level commands: text positioning (Tm, Td), fonts (Tf), text display (Tj, TJ)
    • Useful for debugging text extraction, font issues, and positioning problems
    • Provides both formatted view and raw stream
    • Uses regex parsing of content streams (approximate for complex PDFs)
  2. Screen Reader Simulator:

    • Simulates NVDA or JAWS reading behavior for the current page
    • Two modes:
      • Tagged PDFs: Follows structure tree, announces headings/paragraphs/figures with proper semantics
      • Untagged PDFs: Falls back to visual reading order, infers headings from font size
    • Three detail levels: minimal (text only), default (element announcements), verbose (full context)
    • Generates transcript + analysis with alt text coverage statistics
    • Reading order configurable for untagged fallback (raw/tblr/columns)
  3. Paragraph Detection:

    • Compares visual paragraphs (detected by spacing) vs semantic <P> tags
    • Visual detection: groups blocks with vertical gap < threshold (default 15pt)
    • Semantic detection: extracts <P> tags from structure tree
    • Generates color-coded overlay (green = visual paragraphs)
    • Reports match quality score and mismatches
    • Requires tagged PDF for semantic comparison
  4. Structure Tree Visualizer:

    • Extracts complete PDF tag hierarchy from StructTreeRoot
    • Three visualization formats:
      • Tree Diagram: Interactive Plotly sunburst chart
      • Text View: Indented text with box-drawing characters
      • Statistics: Node counts, tag distribution, alt text coverage
    • Shows tag types (H1-H6, P, Figure, Table, L, LI, etc.)
    • Displays alt text, actual text, page references, and MCID markers
    • Only works for tagged PDFs
  5. Block-to-Tag Mapping:

    • Maps visual blocks to structure tree elements via MCID (Marked Content ID)
    • Shows which blocks have proper semantic tagging
    • DataFrame output with block index, tag type, MCID, alt text
    • Helps identify untagged content
    • Requires tagged PDF with MCID references
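
The spacing-based visual detection described under Paragraph Detection can be approximated as below. This is a simplified sketch that assumes a single column and plain bbox tuples; the 15pt default comes from the text, while the signature is hypothetical.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def detect_visual_paragraphs(bboxes: List[BBox], max_gap: float = 15.0) -> List[List[BBox]]:
    """Group blocks whose vertical gap to the previous block is below max_gap."""
    paragraphs: List[List[BBox]] = []
    for bbox in sorted(bboxes, key=lambda b: b[1]):
        if paragraphs and bbox[1] - paragraphs[-1][-1][3] < max_gap:
            paragraphs[-1].append(bbox)   # close enough: same paragraph
        else:
            paragraphs.append([bbox])     # large gap: start a new paragraph
    return paragraphs


blocks = [
    (72, 100, 500, 120),
    (72, 125, 500, 145),   # 5pt below the first block
    (72, 200, 500, 220),   # 55pt below the second block
]
groups = detect_visual_paragraphs(blocks)
print([len(g) for g in groups])
```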

Help & Documentation

  • All UI controls have info parameters with inline tooltips
  • Expandable "📖 Understanding the Diagnostics" accordion with detailed explanations
  • DIAGNOSTIC_HELP and ORDERING_MODE_HELP dictionaries provide explanation text
  • Summary sections use severity icons (✓, ⚠️, ❌) for quick scanning

Key Features

Adaptive Contrast Overlays

The overlay visualization automatically adapts to document background colors:

  • Light backgrounds (luminance > 0.5) → Dark overlays (dark blue #00008B, black text)
  • Dark backgrounds (luminance ≤ 0.5) → Light overlays (yellow #FFFF00, white text)
  • Background sampled at 9 strategic points using low DPI (72) for performance
  • Results cached in _bg_color_cache to avoid re-sampling
  • Color palettes defined in LIGHT_BG_COLORS and DARK_BG_COLORS constants
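
The luminance check and palette choice can be sketched as follows. The formula is the WCAG 2.x relative luminance definition; the palette dictionary keys ("box", "text") are assumptions, since only the color values are given above.

```python
from typing import Dict, Tuple

# Palette contents beyond these two colors are not specified in this document.
LIGHT_BG_COLORS: Dict[str, str] = {"box": "#00008B", "text": "black"}
DARK_BG_COLORS: Dict[str, str] = {"box": "#FFFF00", "text": "white"}


def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x constants)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def calculate_luminance(rgb: Tuple[int, int, int]) -> float:
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def get_contrast_colors(luminance: float) -> Dict[str, str]:
    # Light page -> dark overlay palette; dark page -> light overlay palette.
    return LIGHT_BG_COLORS if luminance > 0.5 else DARK_BG_COLORS


for color in [(255, 255, 255), (128, 128, 128), (0, 0, 64)]:
    lum = calculate_luminance(color)
    print(color, round(lum, 3), get_contrast_colors(lum)["box"])
```

Because relative luminance is not linear in the 8-bit values, a mid-gray page (128, 128, 128) falls below the 0.5 threshold and gets the light overlay palette.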

Inline Help System

Comprehensive documentation integrated into the UI:

  • info parameters on all controls provide contextual tooltips
  • Expandable accordion with detailed explanations of all diagnostics and modes
  • Help text stored in DIAGNOSTIC_HELP and ORDERING_MODE_HELP dictionaries
  • Summary formatting includes severity icons and inline explanations

Batch Analysis

Multi-page document analysis with aggregate statistics:

  • Configurable limits: max_pages (default 100), sample_rate (analyze every Nth page)
  • Real-time progress tracking via gr.Progress()
  • Outputs: summary stats, issue chart, per-page table, full JSON report
  • Performance: ~10-50ms per page depending on complexity
  • Identifies common issues (>50% of pages) and critical pages (3+ issues)

Important Implementation Notes

PDF Handling

  • Always use pikepdf for structural queries (tags, resources)
  • Always use PyMuPDF (fitz) for layout extraction and rendering
  • Page indices are 0-based internally, 1-based in UI (convert with page_num - 1)
  • Close documents properly using context managers (with fitz.open(), with pikepdf.open())

Coordinate Systems

  • PyMuPDF bboxes are (x0, y0, x1, y1) in PDF points (1/72 inch)
  • PIL/ImageDraw expects integer pixel coordinates
  • Use _rect_i() to convert float bboxes to int for drawing
  • DPI scaling is handled by PyMuPDF's get_pixmap(dpi=...)
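
As a sketch of the conversion involved: a page rendered at a given DPI has dpi/72 pixels per PDF point, so a bbox in points is scaled by that factor before being rounded to integers for PIL. Whether the real _rect_i() also scales or only rounds is not stated here; the two helpers below are illustrative names.

```python
from typing import Tuple

FloatBox = Tuple[float, float, float, float]
IntBox = Tuple[int, int, int, int]


def scale_bbox(bbox: FloatBox, dpi: int) -> FloatBox:
    """Convert a bbox from PDF points (1/72 inch) to pixel units at the given DPI."""
    zoom = dpi / 72.0
    x0, y0, x1, y1 = bbox
    return (x0 * zoom, y0 * zoom, x1 * zoom, y1 * zoom)


def rect_i(bbox: FloatBox) -> IntBox:
    """Round float coordinates to the integer pixels that ImageDraw expects."""
    x0, y0, x1, y1 = bbox
    return (int(round(x0)), int(round(y0)), int(round(x1)), int(round(y1)))


block_bbox = (72.0, 36.0, 144.0, 72.4)   # in points
print(rect_i(scale_bbox(block_bbox, dpi=144)))
```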

Heuristics Limitations

  • Column detection is crude (assumes max 2 columns, uses median x-center as divider)
  • Math detection is pattern-based (Unicode symbols + LaTeX-like patterns)
  • All diagnostics are heuristic; tagged PDFs with proper structure should be preferred
  • Type3 font detection is string-based and may have false positives

Gradio Patterns

  • File upload provides .name attribute for file path
  • Use gr.update() to modify component properties dynamically (e.g., slider maximum)
  • State management relies on component values, not session storage
  • Use gr.Progress() parameter in callbacks for long-running operations (batch analysis)
  • Tabs organize related functionality (gr.Tabs() with gr.Tab() children)
  • Accordions (gr.Accordion()) for progressive disclosure of help text and detailed results

Adaptive Contrast Implementation

  • Always render at low DPI (72) for background sampling to avoid performance impact
  • Sample 9 points: 4 corners + 4 edge midpoints + 1 center (at 5%, 50%, 95% positions)
  • Use statistics.median() instead of mean to avoid outliers from text/graphics
  • Cache key format: (document.name, page_index) tuple
  • Clear cache on new document upload if memory becomes an issue
  • Fallback to LIGHT_BG_COLORS if sampling fails or auto_contrast=False
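
The sampling rules above can be sketched without any rendering code by treating the pixel lookup as an injected callable. The 3x3 grid, per-channel median, and cache key follow the bullets; `sample_points`, `median_color`, `cached_background`, and the `sampler` callable are hypothetical names for this example.

```python
from statistics import median
from typing import Callable, Dict, List, Tuple

RGB = Tuple[int, int, int]
_bg_color_cache: Dict[Tuple[str, int], RGB] = {}


def sample_points(width: int, height: int) -> List[Tuple[int, int]]:
    """Nine pixel positions: corners, edge midpoints, and center."""
    fractions = (0.05, 0.5, 0.95)
    return [
        (min(width - 1, round(width * fx)), min(height - 1, round(height * fy)))
        for fy in fractions
        for fx in fractions
    ]


def median_color(samples: List[RGB]) -> RGB:
    """Per-channel median, so one sample landing on text does not skew the result."""
    r, g, b = (int(median(channel)) for channel in zip(*samples))
    return (r, g, b)


def cached_background(doc_name: str, page_index: int, sampler: Callable[[], RGB]) -> RGB:
    key = (doc_name, page_index)
    if key not in _bg_color_cache:
        _bg_color_cache[key] = sampler()  # only sample on a cache miss
    return _bg_color_cache[key]


# Eight samples hit white paper, one lands on black text.
samples = [(255, 255, 255)] * 8 + [(0, 0, 0)]
print(sample_points(612, 792))
print(median_color(samples))
```

With the same samples a mean would give roughly (227, 227, 227), which is why the median is the safer choice here.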

Batch Analysis Performance

  • Default max_pages=100 prevents timeout on large documents
  • Sample rate allows analyzing every Nth page (useful for 500+ page documents)
  • Each page takes ~10-50ms depending on complexity (text extraction + diagnostics)
  • Progress updates every page to keep UI responsive
  • Use dataclasses instead of dicts for better memory efficiency
  • Consider adding timeout protection for very large documents (1000+ pages)
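
One plausible way max_pages and sample_rate combine is shown below: take every Nth page, then cap the number of pages analyzed. How diagnose_all_pages actually combines the two settings is not spelled out here, so treat this as an assumption; the function name is hypothetical.

```python
from typing import List


def pages_to_analyze(page_count: int, max_pages: int = 100, sample_rate: int = 1) -> List[int]:
    """Return the 0-based page indices a batch run would visit."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be at least 1")
    # Every Nth page, capped at max_pages entries.
    return list(range(0, page_count, sample_rate))[:max_pages]


print(pages_to_analyze(10, sample_rate=3))
print(len(pages_to_analyze(500)))
print(pages_to_analyze(1200, sample_rate=10)[-1])
```

Under this interpretation, a 1200-page document sampled at every 10th page still stops after 100 analyzed pages, so the final pages are never visited unless max_pages is raised.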

Result Formatting

  • Use Markdown with severity icons for human-readable summaries
  • Icons: ✓ (no issues), ⚠️ (warnings), ❌ (critical issues)
  • HTML tables for detailed per-page results allow custom styling (color-coded cells)
  • Plotly charts via gr.Plot() for interactive visualizations
  • All batch results have .to_dict() method for JSON export

Advanced Analysis Error Handling

  • Graceful Degradation: All advanced features check for requirements before execution
  • Structure Tree Required: Features 2, 4, 5 require tagged PDFs
    • @require_structure_tree decorator checks for StructTreeRoot
    • Returns user-friendly error message if not found
    • Explains what tagging is and why it's needed
  • Safe Execution: All features wrapped in @safe_execute decorator
    • Catches all exceptions with traceback
    • Returns formatted error messages instead of crashing
  • Content Stream Parsing: Regex-based, may fail on complex/malformed PDFs
    • Returns "not matched" status if text object not found
    • Shows raw stream even if parsing fails
  • MCID Extraction: May fail if content stream uses non-standard encoding
    • Returns empty list on failure
    • Block-to-tag mapping shows "No mappings found" message
  • Performance Limits: Structure tree extraction has max_depth=20 to prevent infinite loops
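
The depth limit matters because a malformed structure tree can contain reference cycles. The sketch below uses plain dictionaries in place of pikepdf objects, and `walk_tags` is a hypothetical name, but it shows how a max_depth guard bounds the recursion.

```python
from typing import Any, Dict, List, Tuple


def walk_tags(node: Dict[str, Any], depth: int = 0, max_depth: int = 20) -> List[Tuple[int, str]]:
    """Collect (depth, tag) pairs, refusing to descend past max_depth."""
    if depth > max_depth:
        return []
    rows = [(depth, node["tag"])]
    for kid in node.get("kids", []):
        rows.extend(walk_tags(kid, depth + 1, max_depth))
    return rows


# A malformed tree in which a section lists itself as its own child.
looping = {"tag": "Sect", "kids": []}
looping["kids"].append(looping)

rows = walk_tags(looping)
print(len(rows), rows[-1])
```

Without the guard this input would recurse until Python raises RecursionError; with it, the walk stops after depth 20.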

Testing

Manual Testing Checklist

  1. Adaptive Contrast: Test with light and dark background PDFs, verify overlay colors contrast properly
  2. Help System: Hover over all controls, expand help accordion, verify all text displays correctly
  3. Batch Analysis: Test with 1-page, 10-page, and 100+ page documents
  4. Edge Cases: Scanned PDFs, multi-column layouts, math-heavy documents, Type3 fonts

Performance Benchmarks

  • Single page analysis: <1 second for typical pages
  • Batch analysis: ~10-50ms per page (100 pages in 1-5 seconds)
  • Background sampling adds ~50-100ms one-time cost per page
  • Memory usage: ~10-20MB per 100 pages of diagnostic data

Deployment to Hugging Face

Pre-deployment Steps

  1. Test locally: uv run python app.py
  2. Regenerate requirements.txt: uv pip compile pyproject.toml -o requirements.txt
  3. Commit both pyproject.toml and requirements.txt
  4. Verify app.py is set as app_file in README.md frontmatter

Hugging Face Configuration

  • SDK: gradio
  • SDK version: 6.3.0 (or latest compatible)
  • Python version: >=3.12 (as specified in pyproject.toml)
  • Main file: app.py

Known Limitations on Hugging Face

  • Very large PDFs (1000+ pages) may hit timeout limits
  • Recommend setting max_pages=100 by default
  • Consider adding explicit timeout handling for batch analysis