CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
PDF Structure Inspector is a Gradio-based web application designed for debugging PDF accessibility, reading order, and structure. It helps identify issues that affect screen readers and assistive technologies by analyzing PDF structure, text extraction quality, and layout ordering.
Target deployment: Hugging Face Spaces (gradio SDK)
Commands
Development
# Run the Gradio app locally
uv run python app.py
# The app will launch at http://localhost:7860 by default
Dependencies
# Sync environment (after cloning or pulling changes)
uv sync
# Add a new dependency
uv add <package>
# Add a dev dependency
uv add --dev <package>
# Regenerate requirements.txt for Hugging Face deployment (after dependency changes)
uv pip compile pyproject.toml -o requirements.txt
The project uses pyproject.toml for dependency management with uv lock file support. Always use uv run for running commands in the development environment.
Architecture
Core Libraries
- PyMuPDF (fitz): Layout extraction, block/span detection, rendering pages as images
- pikepdf: Low-level PDF structure inspection (tags, MarkInfo, OCProperties, page resources)
- Gradio: Web UI framework
- PIL (Pillow): Image manipulation for overlay rendering
Main Application Flow (app.py)
The application has two main modes: Single Page Analysis and Batch Analysis.
Single Page Analysis Pipeline
PDF Structure Analysis (pdf_struct_report):
- Uses pikepdf to inspect PDF-level metadata
- Checks for StructTreeRoot (tagging), MarkInfo, OCProperties (layers)
- Analyzes per-page resources (fonts, XObjects)
Layout Extraction (extract_blocks_spans):
- Uses PyMuPDF's get_text("dict") to extract blocks, lines, and spans with bounding boxes
- Returns structured BlockInfo objects containing text, bbox, font info, and span details
- Block types: 0=text, 1=image, 2=drawing
Reading Order Analysis (order_blocks):
- Three ordering modes:
  - raw: Extraction order (as stored in PDF)
  - tblr: Top-to-bottom, left-to-right sorting by bbox
  - columns: Simple 2-column heuristic (clusters by x-center, sorts each column separately)
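The three ordering modes can be sketched as follows. This is a minimal illustration, not the code in app.py: the `Block` tuple is a hypothetical stand-in for BlockInfo, `order_blocks_sketch` is an illustrative name, and the median x-center divider follows the description under Heuristics Limitations.

```python
from statistics import median

# Hypothetical stand-in for BlockInfo: (x0, y0, x1, y1, text)
Block = tuple[float, float, float, float, str]


def order_blocks_sketch(blocks: list[Block], mode: str = "raw") -> list[Block]:
    """Return blocks in the requested reading order."""
    def top_left(block: Block) -> tuple[float, float]:
        return (block[1], block[0])  # sort by y0 first, then x0

    if mode == "raw":
        return list(blocks)  # extraction order, untouched
    if mode == "tblr":
        return sorted(blocks, key=top_left)
    if mode == "columns":
        if not blocks:
            return []
        centers = [(b[0] + b[2]) / 2 for b in blocks]
        divider = median(centers)  # median x-center separates the two columns
        left = [b for b, c in zip(blocks, centers) if c < divider]
        right = [b for b, c in zip(blocks, centers) if c >= divider]
        return sorted(left, key=top_left) + sorted(right, key=top_left)
    raise ValueError(f"unknown ordering mode: {mode}")


blocks = [
    (300, 50, 500, 80, "R1"),
    (50, 100, 250, 130, "L2"),
    (50, 50, 250, 80, "L1"),
    (300, 100, 500, 130, "R2"),
]
print([b[4] for b in order_blocks_sketch(blocks, "columns")])
```

On a two-column page, tblr interleaves the columns row by row, while columns reads the left column to the bottom before starting the right one.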
Diagnostic Heuristics (diagnose_page):
- Detects scanned pages (no text + images)
- Identifies text-as-vector-outlines (no text + many drawings)
- Flags Type3 fonts (often correlate with broken text extraction)
- Detects garbled text (replacement characters, missing ToUnicode)
- Guesses multi-column layouts (x-center clustering)
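A rough sketch of how such flags might be derived from simple page counts. The function name, its inputs, and both thresholds (`drawing_threshold`, `garbled_ratio`) are assumptions made for illustration; the real diagnose_page may use different signals and cutoffs.

```python
def diagnose_page_sketch(
    text: str,
    n_images: int,
    n_drawings: int,
    font_types: list[str],
    drawing_threshold: int = 50,   # assumed cutoff for "many drawings"
    garbled_ratio: float = 0.05,   # assumed share of U+FFFD characters
) -> dict[str, bool]:
    """Derive heuristic issue flags for one page."""
    stripped = text.strip()
    has_text = bool(stripped)
    replacements = stripped.count("\ufffd")  # Unicode replacement character
    return {
        # No extractable text but at least one image: probably a scan.
        "scanned": not has_text and n_images > 0,
        # No text but many vector drawings: text may be drawn as outlines.
        "vector_outlines": not has_text and n_drawings >= drawing_threshold,
        # String-based check, so false positives are possible.
        "type3_fonts": any("Type3" in font_type for font_type in font_types),
        "garbled": has_text and replacements / len(stripped) > garbled_ratio,
    }


print(diagnose_page_sketch("", n_images=1, n_drawings=0, font_types=[]))
```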
Adaptive Contrast Detection (for visualization):
- sample_background_color(): Samples page at 9 points (corners, edges, center) to determine background
- calculate_luminance(): Uses WCAG formula to compute relative luminance (0-1)
- get_contrast_colors(): Returns appropriate color palette based on luminance
- Background colors cached per page for performance
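For reference, the WCAG 2.x relative luminance calculation and the palette choice can be sketched like this. The `_sketch` function names and the palette dictionary keys are hypothetical; only the 0.5 threshold and the two hex colors come from this document.

```python
# Palette contents are illustrative; the keys "box" and "text" are assumptions.
LIGHT_BG_COLORS = {"box": "#00008B", "text": "black"}  # dark overlays
DARK_BG_COLORS = {"box": "#FFFF00", "text": "white"}   # light overlays


def calculate_luminance_sketch(rgb: tuple[int, int, int]) -> float:
    """WCAG 2.x relative luminance of an 8-bit sRGB color, in the range 0-1."""
    def linearize(channel: int) -> float:
        c = channel / 255.0
        # Piecewise sRGB-to-linear conversion as written in WCAG 2.x.
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def get_contrast_colors_sketch(luminance: float) -> dict[str, str]:
    """Pick the palette for a light (> 0.5) or dark (<= 0.5) background."""
    return LIGHT_BG_COLORS if luminance > 0.5 else DARK_BG_COLORS


page_white = calculate_luminance_sketch((255, 255, 255))
print(get_contrast_colors_sketch(page_white)["box"])
```

Note that the threshold applies to linearized luminance, so a mid-gray such as (128, 128, 128) has a luminance well below 0.5 and is treated as a dark background.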
Visualization (render_page_with_overlay):
- Renders page at specified DPI using PyMuPDF
- Automatically detects background and chooses contrasting overlay colors
- Overlays numbered block rectangles showing reading order
- Optionally shows span-level boxes
- Flags math-like content using regex heuristics (_looks_like_math)
Result Formatting (format_diagnostic_summary):
- Generates Markdown with severity icons (✅, ⚠️, ❌)
- Includes inline explanations from DIAGNOSTIC_HELP dictionary
Batch Analysis Pipeline
Multi-Page Processing (diagnose_all_pages):
- Analyzes multiple pages (configurable max_pages and sample_rate)
- Progress tracking via gr.Progress()
- Calls diagnose_page() for each page with timing
- Returns BatchAnalysisResult dataclass
Aggregation (aggregate_results):
- Counts issues across all pages
- Identifies critical pages (3+ issues)
- Detects common issues (affecting >50% of pages)
Result Formatting:
- format_batch_summary_markdown(): Executive summary with statistics
- format_batch_results_table(): Color-coded HTML table per page
- format_batch_results_chart(): Plotly bar chart of issue distribution
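The aggregation rules (issue counts, critical pages with 3+ issues, common issues on more than 50% of pages) can be illustrated with a small sketch. The input shape, a mapping from page index to issue names, and the function name are assumptions; the real code works on PageDiagnostic objects.

```python
from collections import Counter


def aggregate_results_sketch(per_page: dict[int, list[str]]) -> dict:
    """Aggregate per-page issue lists into batch-level statistics."""
    total = len(per_page)
    counts = Counter(issue for issues in per_page.values() for issue in issues)
    return {
        "summary_stats": dict(counts),
        # Pages carrying three or more issues are treated as critical.
        "critical_pages": sorted(p for p, issues in per_page.items() if len(issues) >= 3),
        # An issue is "common" when it affects more than half of the pages.
        "common_issues": sorted(i for i, n in counts.items() if total and n / total > 0.5),
    }


pages = {
    0: ["scanned"],
    1: ["type3_fonts", "garbled", "multi_column"],
    2: ["type3_fonts"],
    3: [],
}
print(aggregate_results_sketch(pages))
```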
Advanced Analysis Modules
The application includes specialized modules for advanced PDF accessibility analysis:
advanced_analysis.py - Coordinator module
- Provides facade functions with error handling
- require_structure_tree decorator: checks for tagged PDFs before execution
- safe_execute decorator: comprehensive error handling with user-friendly messages
- Exports high-level functions: analyze_content_stream, analyze_screen_reader, etc.
content_stream_parser.py - PDF operator extraction
- extract_content_stream_for_block(): Gets operators for a specific block
- _parse_text_objects(): Extracts BT...ET blocks from content stream
- _parse_operators(): Regex-based parsing of Tm, Tf, Tj, TJ, Td, color operators
- _find_matching_text_object(): Correlates text objects with BlockInfo via text matching
- Returns formatted markdown and raw stream text
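To make the regex-based approach concrete, here is a simplified sketch that pulls BT...ET text objects out of a decoded content stream and reads the Tf and Tj operators. The patterns and the function name are illustrative assumptions, cover only a small part of the operator set, and will misread streams where these tokens occur inside string literals.

```python
import re

# Lazily match everything between a BT and the next ET operator.
TEXT_OBJECT_RE = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# "/F1 12 Tf" selects font resource F1 at size 12.
FONT_RE = re.compile(r"/(\S+)\s+([\d.]+)\s+Tf\b")
# "(Hello) Tj" shows a literal string; escaped characters are tolerated.
SHOW_TEXT_RE = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj\b")


def parse_text_objects_sketch(stream: str) -> list[dict]:
    """Return one record per BT...ET block with its font, size, and strings."""
    results = []
    for match in TEXT_OBJECT_RE.finditer(stream):
        body = match.group(1)
        font = FONT_RE.search(body)
        results.append({
            "font": font.group(1) if font else None,
            "size": float(font.group(2)) if font else None,
            "text": SHOW_TEXT_RE.findall(body),
        })
    return results


sample_stream = (
    "q\nBT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\nET\nQ\n"
    "BT\n/F2 9 Tf\n(World) Tj\nET\n"
)
print(parse_text_objects_sketch(sample_stream))
```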
screen_reader_sim.py - Accessibility simulation
- simulate_screen_reader(): Main simulation function
- _simulate_tagged(): Follows structure tree for tagged PDFs
- _simulate_untagged(): Falls back to visual order for untagged PDFs
- _format_element_announcement(): Generates NVDA/JAWS-style announcements
- Supports heading levels, paragraphs, figures, formulas, lists, tables, links
- Infers headings from font size (>18pt = H1, >14pt = H2) for untagged PDFs
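The font-size heading inference for untagged PDFs can be sketched as below. The size thresholds come from the line above; the function names and the announcement wording are hypothetical and only approximate what a screen reader would say.

```python
def infer_role_sketch(font_size: float) -> str:
    """Guess a structural role from font size alone (untagged fallback)."""
    if font_size > 18:
        return "H1"
    if font_size > 14:
        return "H2"
    return "P"


def announce_sketch(text: str, font_size: float) -> str:
    """Format an announcement line; the wording here is illustrative."""
    role = infer_role_sketch(font_size)
    if role.startswith("H"):
        return f"Heading level {role[1]}: {text}"
    return text


print(announce_sketch("Introduction", 20.0))
print(announce_sketch("Body text follows.", 11.0))
```

Because the comparison is strict, text at exactly 18pt is treated as H2 and text at exactly 14pt as a paragraph.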
structure_tree.py - Structure tree analysis
- StructureNode dataclass: represents PDF tag hierarchy
- extract_structure_tree(): Recursively parses StructTreeRoot with pikepdf
- _parse_structure_element(): Handles Dictionary, Array, and MCID elements
- format_tree_text(): Creates indented text view with box-drawing characters
- get_tree_statistics(): Counts nodes, tags, alt text coverage
- extract_mcid_for_page(): Finds marked content IDs in page content stream
- map_blocks_to_tags(): Correlates visual blocks with structure elements
- detect_visual_paragraphs(): Spacing-based paragraph detection
- detect_semantic_paragraphs(): Extracts <P> tags for a page
- compare_paragraphs(): Calculates match quality between visual and semantic
Key Data Structures
Single Page Analysis:
- SpanInfo: Individual text run with bbox, text, font, size
- BlockInfo: Text/image block with bbox, text, type, and list of spans
Batch Analysis:
- PageDiagnostic: Per-page diagnostic results with all issue flags and processing time
- BatchAnalysisResult: Aggregated statistics across multiple pages including:
  - summary_stats: Dictionary of issue counts
  - per_page_results: List of PageDiagnostic objects
  - common_issues: Issues affecting >50% of pages
  - critical_pages: Pages with 3+ issues
  - to_dict(): Method to convert to JSON-serializable format
Advanced Analysis:
- StructureNode: Represents a node in the PDF structure tree with:
  - tag_type: Tag name (P, H1, Document, Figure, etc.)
  - depth: Nesting level in the tree
  - mcid: Marked Content ID (links to page content)
  - alt_text: Alternative text for accessibility
  - actual_text: Actual text content or replacement text
  - page_ref: 0-based page index
  - children: List of child StructureNode objects
  - to_dict(): Convert to JSON-serializable format
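As an illustration of the StructureNode shape and the box-drawing text view, here is a self-contained sketch. The field names follow the list above, but the defaults, the `Sketch` class name, and the rendering function are assumptions rather than the code in structure_tree.py.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class StructureNodeSketch:
    tag_type: str
    depth: int = 0
    mcid: Optional[int] = None
    alt_text: Optional[str] = None
    actual_text: Optional[str] = None
    page_ref: Optional[int] = None  # 0-based page index
    children: list["StructureNodeSketch"] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses and lists.
        return asdict(self)


def format_tree_text_sketch(node, prefix="", is_last=True, is_root=True) -> list[str]:
    """Render the tree as indented lines with box-drawing connectors."""
    if is_root:
        lines = [node.tag_type]
        child_prefix = ""
    else:
        lines = [prefix + ("└── " if is_last else "├── ") + node.tag_type]
        child_prefix = prefix + ("    " if is_last else "│   ")
    for index, child in enumerate(node.children):
        last = index == len(node.children) - 1
        lines.extend(format_tree_text_sketch(child, child_prefix, last, False))
    return lines


root = StructureNodeSketch("Document", children=[
    StructureNodeSketch("H1", depth=1, mcid=0, page_ref=0),
    StructureNodeSketch("Sect", depth=1, children=[
        StructureNodeSketch("P", depth=2, mcid=1, page_ref=0),
        StructureNodeSketch("Figure", depth=2, alt_text="Chart", page_ref=0),
    ]),
])
print("\n".join(format_tree_text_sketch(root)))
```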
UI State:
- The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
- Background color cache: _bg_color_cache dict keyed by (document_path, page_index)
Gradio UI Flow
The UI is organized into three tabs: Single Page Analysis, Batch Analysis, and Advanced Analysis.
Single Page Tab
- User uploads PDF → _on_upload → extracts path and page count
- User adjusts parameters (page, DPI, order mode, visualization options)
- Click "Analyze" → analyze function:
  - Runs structural report (pikepdf)
  - Extracts and orders blocks (PyMuPDF)
  - Generates diagnostic report with adaptive contrast detection
  - Creates overlay image with high-contrast colors
  - Returns reading order preview + formatted summary with icons
Batch Analysis Tab
- User sets max_pages (default 100) and sample_rate (default 1)
- Click "Analyze All Pages" → analyze_batch_with_progress function:
  - Calls diagnose_all_pages() with progress tracking
  - Aggregates results across pages
  - Returns:
    - Summary markdown with statistics and common issues
    - Plotly bar chart of issue distribution
    - Color-coded HTML table of per-page results
    - Full JSON report
Advanced Analysis Tab
Power-user features for deep PDF inspection and accessibility debugging. Each feature is in its own accordion:
Content Stream Inspector:
- Extracts raw PDF content stream operators for a specific block
- Shows low-level commands: text positioning (Tm, Td), fonts (Tf), text display (Tj, TJ)
- Useful for debugging text extraction, font issues, and positioning problems
- Provides both formatted view and raw stream
- Uses regex parsing of content streams (approximate for complex PDFs)
Screen Reader Simulator:
- Simulates NVDA or JAWS reading behavior for the current page
- Two modes:
  - Tagged PDFs: Follows structure tree, announces headings/paragraphs/figures with proper semantics
  - Untagged PDFs: Falls back to visual reading order, infers headings from font size
- Three detail levels: minimal (text only), default (element announcements), verbose (full context)
- Generates transcript + analysis with alt text coverage statistics
- Reading order configurable for untagged fallback (raw/tblr/columns)
Paragraph Detection:
- Compares visual paragraphs (detected by spacing) vs semantic <P> tags
- Visual detection: groups blocks with vertical gap < threshold (default 15pt)
- Semantic detection: extracts <P> tags from structure tree
- Generates color-coded overlay (green = visual paragraphs)
- Reports match quality score and mismatches
- Requires tagged PDF for semantic comparison
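The spacing-based visual grouping can be sketched as follows. It assumes blocks are given as (x0, y0, x1, y1) bboxes in points and groups them purely on the vertical gap to the previous block; the function name is hypothetical, and the real detect_visual_paragraphs() may also consider indentation or columns.

```python
def detect_visual_paragraphs_sketch(
    bboxes: list[tuple[float, float, float, float]],
    gap_threshold: float = 15.0,
) -> list[list[int]]:
    """Group block indices into paragraphs by vertical proximity."""
    # Visit blocks from top to bottom by their y0 coordinate.
    order = sorted(range(len(bboxes)), key=lambda i: bboxes[i][1])
    paragraphs: list[list[int]] = []
    previous_bottom = None
    for index in order:
        top, bottom = bboxes[index][1], bboxes[index][3]
        if previous_bottom is not None and top - previous_bottom < gap_threshold:
            paragraphs[-1].append(index)  # small gap: same paragraph
        else:
            paragraphs.append([index])    # large gap or first block: new paragraph
        previous_bottom = bottom
    return paragraphs


page_blocks = [
    (72, 72, 500, 90),    # line 1
    (72, 95, 500, 113),   # 5pt below line 1
    (72, 140, 500, 158),  # 27pt below line 2
]
print(detect_visual_paragraphs_sketch(page_blocks))
```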
Structure Tree Visualizer:
- Extracts complete PDF tag hierarchy from StructTreeRoot
- Three visualization formats:
  - Tree Diagram: Interactive Plotly sunburst chart
  - Text View: Indented text with box-drawing characters
  - Statistics: Node counts, tag distribution, alt text coverage
- Shows tag types (H1-H6, P, Figure, Table, L, LI, etc.)
- Displays alt text, actual text, page references, and MCID markers
- Only works for tagged PDFs
Block-to-Tag Mapping:
- Maps visual blocks to structure tree elements via MCID (Marked Content ID)
- Shows which blocks have proper semantic tagging
- DataFrame output with block index, tag type, MCID, alt text
- Helps identify untagged content
- Requires tagged PDF with MCID references
Help & Documentation
- All UI controls have info parameters with inline tooltips
- Expandable "Understanding the Diagnostics" accordion with detailed explanations
- DIAGNOSTIC_HELP and ORDERING_MODE_HELP dictionaries provide explanation text
- Summary sections use severity icons (✅, ⚠️, ❌) for quick scanning
Key Features
Adaptive Contrast Overlays
The overlay visualization automatically adapts to document background colors:
- Light backgrounds (luminance > 0.5) → Dark overlays (dark blue #00008B, black text)
- Dark backgrounds (luminance ≤ 0.5) → Light overlays (yellow #FFFF00, white text)
- Background sampled at 9 strategic points using low DPI (72) for performance
- Results cached in _bg_color_cache to avoid re-sampling
- Color palettes defined in LIGHT_BG_COLORS and DARK_BG_COLORS constants
Inline Help System
Comprehensive documentation integrated into the UI:
- info parameters on all controls provide contextual tooltips
- Expandable accordion with detailed explanations of all diagnostics and modes
- Help text stored in DIAGNOSTIC_HELP and ORDERING_MODE_HELP dictionaries
- Summary formatting includes severity icons and inline explanations
Batch Analysis
Multi-page document analysis with aggregate statistics:
- Configurable limits: max_pages (default 100), sample_rate (analyze every Nth page)
- Real-time progress tracking via gr.Progress()
- Outputs: summary stats, issue chart, per-page table, full JSON report
- Performance: ~10-50ms per page depending on complexity
- Identifies common issues (>50% of pages) and critical pages (3+ issues)
Important Implementation Notes
PDF Handling
- Always use pikepdf for structural queries (tags, resources)
- Always use PyMuPDF (fitz) for layout extraction and rendering
- Page indices are 0-based internally, 1-based in UI (convert with page_num - 1)
- Close documents properly using context managers (with fitz.open(), with pikepdf.open())
Coordinate Systems
- PyMuPDF bboxes are (x0, y0, x1, y1) in PDF points (1/72 inch)
- PIL/ImageDraw expects integer pixel coordinates
- Use _rect_i() to convert float bboxes to int for drawing
- DPI scaling is handled by PyMuPDF's get_pixmap(dpi=...)
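A sketch of the point-to-pixel conversion, assuming overlays are drawn on an image rendered at a given DPI, so that a bbox in PDF points must be scaled by dpi / 72 before rounding to integers. The helper name is hypothetical; whether the real _rect_i() applies the scale itself or receives already-scaled values is not stated here.

```python
def rect_to_pixels_sketch(
    bbox: tuple[float, float, float, float],
    dpi: int = 72,
) -> tuple[int, ...]:
    """Convert a bbox in PDF points to integer pixel coordinates."""
    scale = dpi / 72.0  # 72 points per inch, so 72 DPI maps 1pt to 1px
    return tuple(int(round(value * scale)) for value in bbox)


print(rect_to_pixels_sketch((72.0, 36.0, 144.0, 108.0), dpi=144))
```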
Heuristics Limitations
- Column detection is crude (assumes max 2 columns, uses median x-center as divider)
- Math detection is pattern-based (Unicode symbols + LaTeX-like patterns)
- All diagnostics are heuristic; tagged PDFs with proper structure should be preferred
- Type3 font detection is string-based and may have false positives
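A pattern-based math check in the spirit described above might look like this. The symbol set, the LaTeX-like patterns, and the function name are illustrative assumptions, not the contents of _looks_like_math; any such check will flag ordinary Greek text or Windows-style paths as math.

```python
import re

# A small sample of math-related Unicode symbols and Greek letters.
MATH_SYMBOLS_RE = re.compile(r"[∑∫√≤≥≠±∞∂∇αβγθλμπσ]")
# Backslash commands such as \frac, or sub/superscripts opening a brace group.
LATEX_LIKE_RE = re.compile(r"\\[A-Za-z]+|[_^]\{")


def looks_like_math_sketch(text: str) -> bool:
    """Return True when the text contains math-like symbols or patterns."""
    return bool(MATH_SYMBOLS_RE.search(text) or LATEX_LIKE_RE.search(text))


print(looks_like_math_sketch("∑ x ≤ 1"))
print(looks_like_math_sketch("A plain sentence."))
```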
Gradio Patterns
- File upload provides .name attribute for file path
- Use gr.update() to modify component properties dynamically (e.g., slider maximum)
- State management relies on component values, not session storage
- Use gr.Progress() parameter in callbacks for long-running operations (batch analysis)
- Tabs organize related functionality (gr.Tabs() with gr.Tab() children)
- Accordions (gr.Accordion()) for progressive disclosure of help text and detailed results
Adaptive Contrast Implementation
- Always render at low DPI (72) for background sampling to avoid performance impact
- Sample 9 points: 4 corners + 4 edge midpoints + 1 center (at 5%, 50%, 95% positions)
- Use statistics.median() instead of mean to avoid outliers from text/graphics
- Cache key format: (document.name, page_index) tuple
- Clear cache on new document upload if memory becomes an issue
- Fallback to LIGHT_BG_COLORS if sampling fails or auto_contrast=False
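The 9-point sampling and per-channel median can be sketched without any rendering library. The function names are hypothetical and the pixel lookup is left out: in the app the samples would be read from a low-DPI render of the page.

```python
from statistics import median


def sample_points_sketch(width: int, height: int) -> list[tuple[int, int]]:
    """Return 9 (x, y) pixel positions at 5%, 50%, and 95% of each axis."""
    percents = (5, 50, 95)
    # Integer arithmetic avoids float rounding; clamp to stay inside the image.
    xs = [min(width * p // 100, width - 1) for p in percents]
    ys = [min(height * p // 100, height - 1) for p in percents]
    return [(x, y) for y in ys for x in xs]


def median_color_sketch(samples: list[tuple[int, int, int]]) -> tuple[int, ...]:
    """Per-channel median, so a few samples landing on text do not skew the result."""
    return tuple(int(median(channel)) for channel in zip(*samples))


points = sample_points_sketch(100, 200)
# Eight samples hit white paper, one lands on black text.
samples = [(255, 255, 255)] * 8 + [(0, 0, 0)]
print(points[4], median_color_sketch(samples))
```

With a mean, the single black sample would pull the estimate toward gray; the median ignores it.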
Batch Analysis Performance
- Default max_pages=100 prevents timeout on large documents
- Sample rate allows analyzing every Nth page (useful for 500+ page documents)
- Each page takes ~10-50ms depending on complexity (text extraction + diagnostics)
- Progress updates every page to keep UI responsive
- Use dataclasses instead of dicts for better memory efficiency
- Consider adding timeout protection for very large documents (1000+ pages)
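How max_pages and sample_rate might combine when choosing pages is sketched below. The function name is hypothetical, and it assumes sampling happens first and the cap applies to the sampled list; the real diagnose_all_pages() may apply them in the other order.

```python
def select_pages_sketch(page_count: int, max_pages: int = 100, sample_rate: int = 1) -> list[int]:
    """Return the 0-based page indices to analyze."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be at least 1")
    # Take every Nth page, then cap the number of pages analyzed.
    return list(range(0, page_count, sample_rate))[:max_pages]


print(select_pages_sketch(10, sample_rate=3))
print(len(select_pages_sketch(1000, max_pages=100, sample_rate=5)))
```

Under this assumption, a 1000-page document with sample_rate=5 is covered up to page index 495 within the default 100-page limit.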
Result Formatting
- Use Markdown with severity icons for human-readable summaries
- Icons: ✅ (no issues), ⚠️ (warnings), ❌ (critical issues)
- HTML tables for detailed per-page results allow custom styling (color-coded cells)
- Plotly charts via gr.Plot() for interactive visualizations
- All batch results have .to_dict() method for JSON export
Advanced Analysis Error Handling
- Graceful Degradation: All advanced features check for requirements before execution
- Structure Tree Required: Features 2, 4, 5 require tagged PDFs
  - @require_structure_tree decorator checks for StructTreeRoot
  - Returns user-friendly error message if not found
  - Explains what tagging is and why it's needed
- Safe Execution: All features wrapped in @safe_execute decorator
  - Catches all exceptions with traceback
  - Returns formatted error messages instead of crashing
- Content Stream Parsing: Regex-based, may fail on complex/malformed PDFs
  - Returns "not matched" status if text object not found
  - Shows raw stream even if parsing fails
- MCID Extraction: May fail if content stream uses non-standard encoding
  - Returns empty list on failure
  - Block-to-tag mapping shows "No mappings found" message
- Performance Limits: Structure tree extraction has max_depth=20 to prevent infinite loops
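The two decorators can be sketched as below. This is an assumption-laden illustration: the `_sketch` names, the message wording, and the injected `has_structure_tree` callable are hypothetical, whereas the real @require_structure_tree checks for StructTreeRoot itself.

```python
import functools
import traceback


def safe_execute_sketch(func):
    """Return a formatted error message instead of letting exceptions escape."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            details = traceback.format_exc()
            return f"Error in {func.__name__}: {exc}\n\n{details}"
    return wrapper


def require_structure_tree_sketch(has_structure_tree):
    """Skip the wrapped feature when the PDF has no structure tree.

    has_structure_tree is a hypothetical callable: path -> bool.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(pdf_path, *args, **kwargs):
            if not has_structure_tree(pdf_path):
                return "This PDF is not tagged, so this feature is unavailable."
            return func(pdf_path, *args, **kwargs)
        return wrapper
    return decorator


TAGGED = {"tagged.pdf"}  # stand-in for a real StructTreeRoot check


@safe_execute_sketch
@require_structure_tree_sketch(lambda path: path in TAGGED)
def tree_stats(pdf_path):
    return f"stats for {pdf_path}"


@safe_execute_sketch
def broken_feature():
    raise ValueError("bad stream")


print(tree_stats("scan.pdf"))
print(broken_feature().splitlines()[0])
```

Placing the safe-execution wrapper outermost means errors raised by the structure check itself are also turned into messages.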
Testing
Manual Testing Checklist
- Adaptive Contrast: Test with light and dark background PDFs, verify overlay colors contrast properly
- Help System: Hover over all controls, expand help accordion, verify all text displays correctly
- Batch Analysis: Test with 1-page, 10-page, and 100+ page documents
- Edge Cases: Scanned PDFs, multi-column layouts, math-heavy documents, Type3 fonts
Performance Benchmarks
- Single page analysis: <1 second for typical pages
- Batch analysis: ~10-50ms per page (100 pages in 1-5 seconds)
- Background sampling adds ~50-100ms one-time cost per page
- Memory usage: ~10-20MB per 100 pages of diagnostic data
Deployment to Hugging Face
Pre-deployment Steps
- Test locally: uv run python app.py
- Regenerate requirements.txt: uv pip compile pyproject.toml -o requirements.txt
- Commit both pyproject.toml and requirements.txt
- Verify app.py is set as app_file in README.md frontmatter
Hugging Face Configuration
- SDK: gradio
- SDK version: 6.3.0 (or latest compatible)
- Python version: >=3.12 (as specified in pyproject.toml)
- Main file: app.py
Known Limitations on Hugging Face
- Very large PDFs (1000+ pages) may hit timeout limits
- Recommend setting max_pages=100 by default
- Consider adding explicit timeout handling for batch analysis