CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

PDF Structure Inspector is a Gradio-based web application designed for debugging PDF accessibility, reading order, and structure. It helps identify issues that affect screen readers and assistive technologies by analyzing PDF structure, text extraction quality, and layout ordering.

Target deployment: Hugging Face Spaces (gradio SDK)

Commands

Development

# Run the Gradio app locally
uv run python app.py

# The app will launch at http://localhost:7860 by default

Dependencies

# Sync environment (after cloning or pulling changes)
uv sync

# Add a new dependency
uv add <package>

# Add a dev dependency
uv add --dev <package>

# Regenerate requirements.txt for Hugging Face deployment (after dependency changes)
uv pip compile pyproject.toml -o requirements.txt

The project declares its dependencies in pyproject.toml and pins resolved versions in the uv lock file. Always run commands through uv run in the development environment.

Architecture

Core Libraries

PyMuPDF (fitz): Layout extraction, block/span detection, rendering pages as images
pikepdf: Low-level PDF structure inspection (tags, MarkInfo, OCProperties, page resources)
Gradio: Web UI framework
PIL (Pillow): Image manipulation for overlay rendering

Main Application Flow (app.py)

The application has two main modes: Single Page Analysis and Batch Analysis.

Single Page Analysis Pipeline

PDF Structure Analysis (pdf_struct_report):
- Uses pikepdf to inspect PDF-level metadata
- Checks for StructTreeRoot (tagging), MarkInfo, OCProperties (layers)
- Analyzes per-page resources (fonts, XObjects)
Layout Extraction (extract_blocks_spans):
- Uses PyMuPDF's get_text("dict") to extract blocks, lines, and spans with bounding boxes
- Returns structured BlockInfo objects containing text, bbox, font info, and span details
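- Block types: 0=text, 1=image, 2=drawing (PyMuPDF's get_text("dict") output itself only reports types 0 and 1, so treat 2 as an app-level convention)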
Reading Order Analysis (order_blocks):
- Three ordering modes:
  - raw: Extraction order (as stored in PDF)
  - tblr: Top-to-bottom, left-to-right sorting by bbox
  - columns: Simple 2-column heuristic (clusters by x-center, sorts each column separately)
Diagnostic Heuristics (diagnose_page):
- Detects scanned pages (no text + images)
- Identifies text-as-vector-outlines (no text + many drawings)
- Flags Type3 fonts (which often correlate with broken text extraction)
- Detects garbled text (replacement characters, missing ToUnicode)
- Guesses multi-column layouts (x-center clustering)
Adaptive Contrast Detection (for visualization):
- sample_background_color(): Samples page at 9 points (corners, edges, center) to determine background
- calculate_luminance(): Uses WCAG formula to compute relative luminance (0-1)
- get_contrast_colors(): Returns appropriate color palette based on luminance
- Background colors cached per page for performance
Visualization (render_page_with_overlay):
- Renders page at specified DPI using PyMuPDF
- Automatically detects background and chooses contrasting overlay colors
- Overlays numbered block rectangles showing reading order
- Optionally shows span-level boxes
- Flags math-like content using regex heuristics (_looks_like_math)
Result Formatting (format_diagnostic_summary):
- Generates Markdown with severity icons (✓, ⚠️, ❌)
- Includes inline explanations from DIAGNOSTIC_HELP dictionary
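
The three ordering modes listed under order_blocks can be sketched in plain Python. This is an illustration of the described heuristics, not the code in app.py: the `Block` dataclass is a hypothetical stand-in for BlockInfo, and `order_blocks_sketch` is an assumed name.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Block:
    """Hypothetical stand-in for BlockInfo: only text and bbox."""
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def x_center(block: Block) -> float:
    return (block.bbox[0] + block.bbox[2]) / 2


def top_left(block: Block) -> tuple[float, float]:
    # Sort key: top edge first, then left edge
    return (block.bbox[1], block.bbox[0])


def order_blocks_sketch(blocks: list[Block], mode: str = "raw") -> list[Block]:
    if mode == "raw":
        return list(blocks)  # extraction order, unchanged
    if mode == "tblr":
        return sorted(blocks, key=top_left)
    if mode == "columns":
        if not blocks:
            return []
        # Assumption: the median x-center divides the page into two columns
        divider = median(x_center(b) for b in blocks)
        left = [b for b in blocks if x_center(b) < divider]
        right = [b for b in blocks if x_center(b) >= divider]
        return sorted(left, key=top_left) + sorted(right, key=top_left)
    raise ValueError(f"unknown mode: {mode}")
```

On a two-column page, tblr interleaves the columns line by line, while columns reads the left column to the bottom before starting the right one.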

Batch Analysis Pipeline

Multi-Page Processing (diagnose_all_pages):
- Analyzes multiple pages (configurable max_pages and sample_rate)
- Progress tracking via gr.Progress()
- Calls diagnose_page() for each page with timing
- Returns BatchAnalysisResult dataclass
Aggregation (aggregate_results):
- Counts issues across all pages
- Identifies critical pages (3+ issues)
- Detects common issues (affecting >50% of pages)
Result Formatting:
- format_batch_summary_markdown(): Executive summary with statistics
- format_batch_results_table(): Color-coded HTML table per page
- format_batch_results_chart(): Plotly bar chart of issue distribution
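
The aggregation rules (critical pages have 3+ issues, common issues affect more than 50% of pages) can be sketched as follows. The input shape, a dict from page index to a set of issue names, and the name `aggregate_sketch` are assumptions for illustration; the real aggregate_results works on PageDiagnostic objects.

```python
from collections import Counter


def aggregate_sketch(page_issues: dict[int, set[str]]) -> dict:
    total_pages = len(page_issues)
    # Count on how many pages each issue appears
    counts = Counter(issue for issues in page_issues.values() for issue in issues)
    critical_pages = sorted(p for p, issues in page_issues.items() if len(issues) >= 3)
    common_issues = sorted(
        issue for issue, n in counts.items() if total_pages and n / total_pages > 0.5
    )
    return {
        "summary_stats": dict(counts),
        "critical_pages": critical_pages,
        "common_issues": common_issues,
    }
```

Note that the threshold is strict: an issue on exactly half of the pages is not reported as common.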

Advanced Analysis Modules

The application includes specialized modules for advanced PDF accessibility analysis:

advanced_analysis.py - Coordinator module

Provides facade functions with error handling
require_structure_tree decorator: checks for tagged PDFs before execution
safe_execute decorator: comprehensive error handling with user-friendly messages
Exports high-level functions: analyze_content_stream, analyze_screen_reader, etc.

content_stream_parser.py - PDF operator extraction

extract_content_stream_for_block(): Gets operators for a specific block
_parse_text_objects(): Extracts BT...ET blocks from content stream
_parse_operators(): Regex-based parsing of Tm, Tf, Tj, TJ, Td, color operators
_find_matching_text_object(): Correlates text objects with BlockInfo via text matching
Returns formatted markdown and raw stream text
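
To show what regex-based operator parsing looks like, here is a minimal sketch that works on an already decoded content stream string. The patterns and the name `parse_text_objects` are assumptions, not the module's actual regexes, and they cover only Tf, Td, and Tj.

```python
import re

# BT ... ET delimits a text object in a PDF content stream
TEXT_OBJECT = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# "/F1 12 Tf" selects font F1 at size 12
FONT = re.compile(r"/(\S+)\s+(-?\d*\.?\d+)\s+Tf")
# "72 720 Td" moves the text position
MOVE = re.compile(r"(-?\d*\.?\d+)\s+(-?\d*\.?\d+)\s+Td")
# "(Hello) Tj" shows a literal string; escaped parentheses are tolerated
SHOW_TEXT = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")


def parse_text_objects(stream: str) -> list[dict]:
    results = []
    for match in TEXT_OBJECT.finditer(stream):
        body = match.group(1)
        font = FONT.search(body)
        move = MOVE.search(body)
        results.append({
            "font": font.group(1) if font else None,
            "size": float(font.group(2)) if font else None,
            "position": tuple(float(v) for v in move.groups()) if move else None,
            "text": [m.group(1) for m in SHOW_TEXT.finditer(body)],
        })
    return results
```

A sketch like this breaks on hex strings, TJ arrays, and unbalanced nested parentheses, which is why the parsing is only approximate for complex PDFs.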

screen_reader_sim.py - Accessibility simulation

simulate_screen_reader(): Main simulation function
_simulate_tagged(): Follows structure tree for tagged PDFs
_simulate_untagged(): Falls back to visual order for untagged PDFs
_format_element_announcement(): Generates NVDA/JAWS-style announcements
Supports heading levels, paragraphs, figures, formulas, lists, tables, links
Infers headings from font size (>18pt = H1, >14pt = H2) for untagged PDFs
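
The font-size rule for untagged PDFs can be written as a small function. The thresholds come from the line above; the function names and the announcement wording are hypothetical.

```python
def infer_heading_level(font_size: float) -> str:
    """Map a font size in points to an inferred role for untagged PDFs."""
    if font_size > 18:
        return "H1"
    if font_size > 14:
        return "H2"
    return "P"


def announce(text: str, font_size: float) -> str:
    # Assumed announcement format, loosely modeled on screen reader output
    role = infer_heading_level(font_size)
    if role.startswith("H"):
        return f"Heading level {role[1]}, {text}"
    return text
```

Both comparisons are strict, so text at exactly 18pt is treated as H2 and text at exactly 14pt as a paragraph.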

structure_tree.py - Structure tree analysis

StructureNode dataclass: represents PDF tag hierarchy
extract_structure_tree(): Recursively parses StructTreeRoot with pikepdf
_parse_structure_element(): Handles Dictionary, Array, and MCID elements
format_tree_text(): Creates indented text view with box-drawing characters
get_tree_statistics(): Counts nodes, tags, alt text coverage
extract_mcid_for_page(): Finds marked content IDs in page content stream
map_blocks_to_tags(): Correlates visual blocks with structure elements
detect_visual_paragraphs(): Spacing-based paragraph detection
detect_semantic_paragraphs(): Extracts <P> tags for a page
compare_paragraphs(): Calculates match quality between visual and semantic

Key Data Structures

Single Page Analysis:

SpanInfo: Individual text run with bbox, text, font, size
BlockInfo: Text/image block with bbox, text, type, and list of spans

Batch Analysis:

PageDiagnostic: Per-page diagnostic results with all issue flags and processing time
BatchAnalysisResult: Aggregated statistics across multiple pages including:
- summary_stats: Dictionary of issue counts
- per_page_results: List of PageDiagnostic objects
- common_issues: Issues affecting >50% of pages
- critical_pages: Pages with 3+ issues
- to_dict(): Method to convert to JSON-serializable format

Advanced Analysis:

StructureNode: Represents a node in the PDF structure tree with:
- tag_type: Tag name (P, H1, Document, Figure, etc.)
- depth: Nesting level in the tree
- mcid: Marked Content ID (links to page content)
- alt_text: Alternative text for accessibility
- actual_text: Actual text content or replacement text
- page_ref: 0-based page index
- children: List of child StructureNode objects
- to_dict(): Convert to JSON-serializable format
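
A minimal sketch of how such a node might look as a dataclass, together with a box-drawing text view. The field names follow the list above; the defaults, the use of `asdict`, and this version of `format_tree_text` are assumptions rather than the contents of structure_tree.py.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class StructureNode:
    tag_type: str
    depth: int = 0
    mcid: Optional[int] = None
    alt_text: Optional[str] = None
    actual_text: Optional[str] = None
    page_ref: Optional[int] = None  # 0-based page index
    children: list["StructureNode"] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict recurses into nested dataclasses, including the children list
        return asdict(self)


def format_tree_text(node: StructureNode, prefix: str = "",
                     is_last: bool = True, is_root: bool = True) -> str:
    connector = "" if is_root else ("└── " if is_last else "├── ")
    lines = [prefix + connector + node.tag_type]
    # Children of a non-last node keep a vertical bar in their prefix
    child_prefix = prefix if is_root else prefix + ("    " if is_last else "│   ")
    for i, child in enumerate(node.children):
        last = i == len(node.children) - 1
        lines.append(format_tree_text(child, child_prefix, last, False))
    return "\n".join(lines)
```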

UI State:

The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
Background color cache: _bg_color_cache dict keyed by (document_path, page_index)

Gradio UI Flow

The UI is organized into three tabs: Single Page Analysis, Batch Analysis, and Advanced Analysis.

Single Page Tab

User uploads PDF → _on_upload → extracts path and page count
User adjusts parameters (page, DPI, order mode, visualization options)
Click "Analyze" → analyze function:
- Runs structural report (pikepdf)
- Extracts and orders blocks (PyMuPDF)
- Generates diagnostic report with adaptive contrast detection
- Creates overlay image with high-contrast colors
- Returns reading order preview + formatted summary with icons

Batch Analysis Tab

User sets max_pages (default 100) and sample_rate (default 1)
Click "Analyze All Pages" → analyze_batch_with_progress function:
- Calls diagnose_all_pages() with progress tracking
- Aggregates results across pages
- Returns:
  - Summary markdown with statistics and common issues
  - Plotly bar chart of issue distribution
  - Color-coded HTML table of per-page results
  - Full JSON report

Advanced Analysis Tab

Power-user features for deep PDF inspection and accessibility debugging. Each feature is in its own accordion:

Content Stream Inspector:
- Extracts raw PDF content stream operators for a specific block
- Shows low-level commands: text positioning (Tm, Td), fonts (Tf), text display (Tj, TJ)
- Useful for debugging text extraction, font issues, and positioning problems
- Provides both formatted view and raw stream
- Uses regex parsing of content streams (approximate for complex PDFs)
Screen Reader Simulator:
- Simulates NVDA or JAWS reading behavior for the current page
- Two modes:
  - Tagged PDFs: Follows structure tree, announces headings/paragraphs/figures with proper semantics
  - Untagged PDFs: Falls back to visual reading order, infers headings from font size
- Three detail levels: minimal (text only), default (element announcements), verbose (full context)
- Generates transcript + analysis with alt text coverage statistics
- Reading order configurable for untagged fallback (raw/tblr/columns)
Paragraph Detection:
- Compares visual paragraphs (detected by spacing) vs semantic <P> tags
- Visual detection: groups blocks with vertical gap < threshold (default 15pt)
- Semantic detection: extracts <P> tags from structure tree
- Generates color-coded overlay (green = visual paragraphs)
- Reports match quality score and mismatches
- Requires tagged PDF for semantic comparison
Structure Tree Visualizer:
- Extracts complete PDF tag hierarchy from StructTreeRoot
- Three visualization formats:
  - Tree Diagram: Interactive Plotly sunburst chart
  - Text View: Indented text with box-drawing characters
  - Statistics: Node counts, tag distribution, alt text coverage
- Shows tag types (H1-H6, P, Figure, Table, L, LI, etc.)
- Displays alt text, actual text, page references, and MCID markers
- Only works for tagged PDFs
Block-to-Tag Mapping:
- Maps visual blocks to structure tree elements via MCID (Marked Content ID)
- Shows which blocks have proper semantic tagging
- DataFrame output with block index, tag type, MCID, alt text
- Helps identify untagged content
- Requires tagged PDF with MCID references
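
The spacing-based visual paragraph detection described above can be sketched without any PDF library. This assumes blocks are given as (x0, y0, x1, y1) tuples in points and that the gap is measured from the bottom of the previous block to the top of the next; the name `detect_visual_paragraphs_sketch` is hypothetical.

```python
BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def detect_visual_paragraphs_sketch(
    bboxes: list[BBox], gap_threshold: float = 15.0
) -> list[list[BBox]]:
    # Read top to bottom, then left to right
    ordered = sorted(bboxes, key=lambda b: (b[1], b[0]))
    paragraphs: list[list[BBox]] = []
    for bbox in ordered:
        previous_bottom = paragraphs[-1][-1][3] if paragraphs else None
        if previous_bottom is not None and bbox[1] - previous_bottom < gap_threshold:
            paragraphs[-1].append(bbox)  # small gap: same paragraph
        else:
            paragraphs.append([bbox])  # large gap: start a new paragraph
    return paragraphs
```

A single threshold ignores columns, so on multi-column pages blocks from different columns can end up grouped together.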

Help & Documentation

All UI controls have info parameters with inline tooltips
Expandable "📖 Understanding the Diagnostics" accordion with detailed explanations
DIAGNOSTIC_HELP and ORDERING_MODE_HELP dictionaries provide explanation text
Summary sections use severity icons (✓, ⚠️, ❌) for quick scanning

Key Features

Adaptive Contrast Overlays

The overlay visualization automatically adapts to document background colors:

Light backgrounds (luminance > 0.5) → Dark overlays (dark blue #00008B, black text)
Dark backgrounds (luminance ≤ 0.5) → Light overlays (yellow #FFFF00, white text)
Background sampled at 9 strategic points using low DPI (72) for performance
Results cached in _bg_color_cache to avoid re-sampling
Color palettes defined in LIGHT_BG_COLORS and DARK_BG_COLORS constants
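
The luminance threshold can be illustrated with the WCAG 2.x relative luminance formula. The function and constant names match those mentioned in this file, but the bodies and the palette dictionary keys are assumptions for illustration.

```python
def calculate_luminance(rgb: tuple[int, int, int]) -> float:
    """WCAG 2.x relative luminance of an 8-bit sRGB color, in the range 0-1."""
    def linearize(channel: int) -> float:
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


# Hypothetical palette layout; only the hex values and text colors come from this file
LIGHT_BG_COLORS = {"box": "#00008B", "text": "black"}  # dark overlays for light pages
DARK_BG_COLORS = {"box": "#FFFF00", "text": "white"}   # light overlays for dark pages


def get_contrast_colors(luminance: float) -> dict[str, str]:
    return LIGHT_BG_COLORS if luminance > 0.5 else DARK_BG_COLORS
```

Because luminance is not linear in the channel values, a mid-gray page such as (128, 128, 128) has a luminance of about 0.22 and therefore gets the light overlay palette.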

Inline Help System

Comprehensive documentation integrated into the UI:

info parameters on all controls provide contextual tooltips
Expandable accordion with detailed explanations of all diagnostics and modes
Help text stored in DIAGNOSTIC_HELP and ORDERING_MODE_HELP dictionaries
Summary formatting includes severity icons and inline explanations

Batch Analysis

Multi-page document analysis with aggregate statistics:

Configurable limits: max_pages (default 100), sample_rate (analyze every Nth page)
Real-time progress tracking via gr.Progress()
Outputs: summary stats, issue chart, per-page table, full JSON report
Performance: ~10-50ms per page depending on complexity
Identifies common issues (>50% of pages) and critical pages (3+ issues)

Important Implementation Notes

PDF Handling

Always use pikepdf for structural queries (tags, resources)
Always use PyMuPDF (fitz) for layout extraction and rendering
Page indices are 0-based internally, 1-based in UI (convert with page_num - 1)
Close documents properly using context managers (with fitz.open(), with pikepdf.open())

Coordinate Systems

PyMuPDF bboxes are (x0, y0, x1, y1) in PDF points (1/72 inch)
PIL/ImageDraw expects integer pixel coordinates
Use _rect_i() to convert float bboxes to int for drawing
DPI scaling is handled by PyMuPDF's get_pixmap(dpi=...)
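
Since a page rendered at a given DPI has 72 points per inch mapped to that many pixels per inch, a bbox in points scales by dpi / 72 before drawing. The helper below is a hypothetical sketch of that conversion, not the app's _rect_i.

```python
def points_to_pixels(
    bbox: tuple[float, float, float, float], dpi: int
) -> tuple[int, int, int, int]:
    """Scale a bbox from PDF points to integer pixel coordinates at the given DPI."""
    scale = dpi / 72  # 72 points per inch
    x0, y0, x1, y1 = (round(v * scale) for v in bbox)
    return (x0, y0, x1, y1)
```

At 72 DPI the scale is 1, so the conversion only rounds to integers.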

Heuristics Limitations

Column detection is crude (assumes max 2 columns, uses median x-center as divider)
Math detection is pattern-based (Unicode symbols + LaTeX-like patterns)
All diagnostics are heuristic; tagged PDFs with proper structure should be preferred
Type3 font detection is string-based and may have false positives
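
As an illustration of why pattern-based math detection is only a heuristic, here is a small sketch combining a Unicode symbol check with LaTeX-like patterns. The symbol set, the patterns, and the name `looks_like_math` are assumptions and not the app's _looks_like_math.

```python
import re

# A small, non-exhaustive set of math symbols
MATH_SYMBOLS = re.compile(r"[∑∫√≤≥≠≈∞∂∇±×÷∈∀∃]")
# LaTeX-like commands (\frac), sub/superscript groups (_{ or ^{), or $...$ spans
LATEX_LIKE = re.compile(r"\\[A-Za-z]+|[_^]\{|\$[^$]+\$")


def looks_like_math(text: str) -> bool:
    return bool(MATH_SYMBOLS.search(text) or LATEX_LIKE.search(text))
```

The same patterns also fire on a Windows path such as C:\Users\me, which is the kind of false positive to expect from this approach.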

Gradio Patterns

File upload provides .name attribute for file path
Use gr.update() to modify component properties dynamically (e.g., slider maximum)
State management relies on component values, not session storage
Use gr.Progress() parameter in callbacks for long-running operations (batch analysis)
Tabs organize related functionality (gr.Tabs() with gr.Tab() children)
Accordions (gr.Accordion()) for progressive disclosure of help text and detailed results

Adaptive Contrast Implementation

Always render at low DPI (72) for background sampling to avoid performance impact
Sample 9 points: 4 corners + 4 edge midpoints + 1 center (at 5%, 50%, 95% positions)
Use statistics.median() instead of mean to avoid outliers from text/graphics
Cache key format: (document.name, page_index) tuple
Clear cache on new document upload if memory becomes an issue
Fall back to LIGHT_BG_COLORS if sampling fails or auto_contrast=False
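
The 9-point sampling with a per-channel median can be sketched independently of the renderer. Here `get_pixel` is a hypothetical callable returning an (R, G, B) tuple for a pixel coordinate; in the app the pixels would come from a low-DPI render of the page.

```python
from statistics import median
from typing import Callable

RGB = tuple[int, int, int]


def sample_points(width: int, height: int,
                  fractions: tuple[float, ...] = (0.05, 0.5, 0.95)) -> list[tuple[int, int]]:
    """Return the 3x3 grid of sample coordinates: corners, edge midpoints, center."""
    xs = [min(width - 1, int(width * f)) for f in fractions]
    ys = [min(height - 1, int(height * f)) for f in fractions]
    return [(x, y) for y in ys for x in xs]


def median_background(get_pixel: Callable[[int, int], RGB],
                      width: int, height: int) -> RGB:
    samples = [get_pixel(x, y) for x, y in sample_points(width, height)]
    # Median per channel, so a few samples landing on text do not skew the result
    r, g, b = (int(median(channel)) for channel in zip(*samples))
    return (r, g, b)
```

With a median, up to four of the nine samples can land on text or graphics without changing the detected background.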

Batch Analysis Performance

Default max_pages=100 prevents timeout on large documents
Sample rate allows analyzing every Nth page (useful for 500+ page documents)
Each page takes ~10-50ms depending on complexity (text extraction + diagnostics)
Progress updates every page to keep UI responsive
Use dataclasses instead of dicts for typed, predictable result objects (declaring them with slots=True also reduces per-instance memory)
Consider adding timeout protection for very large documents (1000+ pages)
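
How max_pages and sample_rate interact can be shown with a short sketch. It assumes sampling is applied first and the cap second; the actual order in diagnose_all_pages may differ, and `select_pages` is a hypothetical name.

```python
def select_pages(page_count: int, max_pages: int = 100, sample_rate: int = 1) -> list[int]:
    """Return the 0-based page indices to analyze."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be at least 1")
    # Every Nth page, then cap the number of analyzed pages
    return list(range(0, page_count, sample_rate))[:max_pages]
```

Under this assumption, a 1000-page document with sample_rate=10 is covered end to end within the default cap, while sample_rate=1 only reaches the first 100 pages.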

Result Formatting

Use Markdown with severity icons for human-readable summaries
Icons: ✓ (no issues), ⚠️ (warnings), ❌ (critical issues)
HTML tables for detailed per-page results allow custom styling (color-coded cells)
Plotly charts via gr.Plot() for interactive visualizations
All batch results have .to_dict() method for JSON export

Advanced Analysis Error Handling

Graceful Degradation: All advanced features check for requirements before execution
Structure Tree Required: Features 2, 4, 5 require tagged PDFs
- @require_structure_tree decorator checks for StructTreeRoot
- Returns user-friendly error message if not found
- Explains what tagging is and why it's needed
Safe Execution: All features wrapped in @safe_execute decorator
- Catches all exceptions with traceback
- Returns formatted error messages instead of crashing
Content Stream Parsing: Regex-based, may fail on complex/malformed PDFs
- Returns "not matched" status if text object not found
- Shows raw stream even if parsing fails
MCID Extraction: May fail if content stream uses non-standard encoding
- Returns empty list on failure
- Block-to-tag mapping shows "No mappings found" message
Performance Limits: Structure tree extraction has max_depth=20 to prevent infinite loops
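
The two decorators can be sketched in plain Python. This is not the code in advanced_analysis.py: the message wording is invented, and the structure-tree check is injected as a hypothetical `checker` callable so the example runs without pikepdf (in the app the check would look for StructTreeRoot in the document catalog).

```python
import functools
import traceback


def safe_execute(func):
    """Return a formatted error message instead of letting exceptions propagate."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            details = traceback.format_exc()
            return f"❌ {func.__name__} failed: {exc}\n\n{details}"
    return wrapper


def require_structure_tree(checker):
    """Skip the wrapped function when checker(pdf_path) reports an untagged PDF."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(pdf_path, *args, **kwargs):
            if not checker(pdf_path):
                return "⚠️ This PDF is untagged (no structure tree), so this feature is unavailable."
            return func(pdf_path, *args, **kwargs)
        return wrapper
    return decorator


# Usage with a stand-in checker that only inspects the file name
@safe_execute
@require_structure_tree(lambda path: path.endswith("tagged.pdf"))
def count_tags(pdf_path: str) -> str:
    if "broken" in pdf_path:
        raise RuntimeError("boom")
    return "42 tags"
```

Placing safe_execute outermost means errors raised by the structure-tree check itself are also turned into messages.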

Testing

Manual Testing Checklist

Adaptive Contrast: Test with light and dark background PDFs, verify overlay colors contrast properly
Help System: Hover over all controls, expand help accordion, verify all text displays correctly
Batch Analysis: Test with 1-page, 10-page, and 100+ page documents
Edge Cases: Scanned PDFs, multi-column layouts, math-heavy documents, Type3 fonts

Performance Benchmarks

Single page analysis: <1 second for typical pages
Batch analysis: ~10-50ms per page (100 pages in 1-5 seconds)
Background sampling adds ~50-100ms one-time cost per page
Memory usage: ~10-20MB per 100 pages of diagnostic data

Deployment to Hugging Face

Pre-deployment Steps

Test locally: uv run python app.py
Regenerate requirements.txt: uv pip compile pyproject.toml -o requirements.txt
Commit both pyproject.toml and requirements.txt
Verify app.py is set as app_file in README.md frontmatter

Hugging Face Configuration

SDK: gradio
SDK version: 6.3.0 (or latest compatible)
Python version: >=3.12 (as specified in pyproject.toml)
Main file: app.py

Known Limitations on Hugging Face

Very large PDFs (1000+ pages) may hit timeout limits
Keep max_pages=100 as the default
Consider adding explicit timeout handling for batch analysis
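
One possible shape for explicit timeout handling is a time budget checked between pages. This is a sketch of the idea only: `analyze_with_budget`, its parameters, and the injectable `clock` are hypothetical and not part of the app.

```python
import time
from typing import Callable, Iterable


def analyze_with_budget(pages: Iterable[int],
                        analyze_page: Callable[[int], object],
                        budget_seconds: float,
                        clock: Callable[[], float] = time.monotonic):
    """Analyze pages until done or until the time budget is used up.

    Returns (results, completed), where completed is False if the loop stopped early.
    """
    deadline = clock() + budget_seconds
    results = []
    for page in pages:
        if clock() >= deadline:
            return results, False  # partial results, stopped early
        results.append(analyze_page(page))
    return results, True
```

The budget is only checked between pages, so a single slow page can still overrun it; partial results are returned rather than discarded.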