rianders committed on
Commit
27fda3f
·
1 Parent(s): 9d99474

advanced features

Files changed (6)
  1. CLAUDE.md +111 -1
  2. advanced_analysis.py +430 -0
  3. app.py +353 -1
  4. content_stream_parser.py +322 -0
  5. screen_reader_sim.py +398 -0
  6. structure_tree.py +493 -0
CLAUDE.md CHANGED
@@ -107,6 +107,43 @@ The application has two main modes: **Single Page Analysis** and **Batch Analysi
107
  - `format_batch_results_table()`: Color-coded HTML table per page
108
  - `format_batch_results_chart()`: Plotly bar chart of issue distribution
109
 
110
  ### Key Data Structures
111
 
112
  **Single Page Analysis**:
@@ -122,13 +159,24 @@ The application has two main modes: **Single Page Analysis** and **Batch Analysi
122
  - `critical_pages`: Pages with 3+ issues
123
  - `to_dict()`: Method to convert to JSON-serializable format
124
 
125
  **UI State**:
126
  - The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
127
  - Background color cache: `_bg_color_cache` dict keyed by (document_path, page_index)
128
 
129
  ### Gradio UI Flow
130
 
131
- The UI is organized into two tabs: **Single Page Analysis** and **Batch Analysis**.
132
 
133
  #### Single Page Tab
134
  1. User uploads PDF → `_on_upload` → extracts path and page count
@@ -151,6 +199,51 @@ The UI is organized into two tabs: **Single Page Analysis** and **Batch Analysis
151
  - Color-coded HTML table of per-page results
152
  - Full JSON report
153
 
154
  #### Help & Documentation
155
  - All UI controls have `info` parameters with inline tooltips
156
  - Expandable "📖 Understanding the Diagnostics" accordion with detailed explanations
@@ -233,6 +326,23 @@ Multi-page document analysis with aggregate statistics:
233
  - Plotly charts via `gr.Plot()` for interactive visualizations
234
  - All batch results have `.to_dict()` method for JSON export
235
 
236
  ## Testing
237
 
238
  ### Manual Testing Checklist
 
107
  - `format_batch_results_table()`: Color-coded HTML table per page
108
  - `format_batch_results_chart()`: Plotly bar chart of issue distribution
109
 
110
+ ### Advanced Analysis Modules
111
+
112
+ The application includes specialized modules for advanced PDF accessibility analysis:
113
+
114
+ **advanced_analysis.py** - Coordinator module
115
+ - Provides facade functions with error handling
116
+ - `require_structure_tree` decorator: checks for tagged PDFs before execution
117
+ - `safe_execute` decorator: comprehensive error handling with user-friendly messages
118
+ - Exports high-level functions: `analyze_content_stream`, `analyze_screen_reader`, etc.
119
+
120
+ **content_stream_parser.py** - PDF operator extraction
121
+ - `extract_content_stream_for_block()`: Gets operators for a specific block
122
+ - `_parse_text_objects()`: Extracts BT...ET blocks from content stream
123
+ - `_parse_operators()`: Regex-based parsing of Tm, Tf, Tj, TJ, Td, color operators
124
+ - `_find_matching_text_object()`: Correlates text objects with BlockInfo via text matching
125
+ - Returns formatted markdown and raw stream text
126
+
127
+ **screen_reader_sim.py** - Accessibility simulation
128
+ - `simulate_screen_reader()`: Main simulation function
129
+ - `_simulate_tagged()`: Follows structure tree for tagged PDFs
130
+ - `_simulate_untagged()`: Falls back to visual order for untagged PDFs
131
+ - `_format_element_announcement()`: Generates NVDA/JAWS-style announcements
132
+ - Supports heading levels, paragraphs, figures, formulas, lists, tables, links
133
+ - Infers headings from font size (>18pt = H1, >14pt = H2) for untagged PDFs
134
+
135
+ **structure_tree.py** - Structure tree analysis
136
+ - `StructureNode` dataclass: represents PDF tag hierarchy
137
+ - `extract_structure_tree()`: Recursively parses StructTreeRoot with pikepdf
138
+ - `_parse_structure_element()`: Handles Dictionary, Array, and MCID elements
139
+ - `format_tree_text()`: Creates indented text view with box-drawing characters
140
+ - `get_tree_statistics()`: Counts nodes, tags, alt text coverage
141
+ - `extract_mcid_for_page()`: Finds marked content IDs in page content stream
142
+ - `map_blocks_to_tags()`: Correlates visual blocks with structure elements
143
+ - `detect_visual_paragraphs()`: Spacing-based paragraph detection
144
+ - `detect_semantic_paragraphs()`: Extracts <P> tags for a page
145
+ - `compare_paragraphs()`: Calculates match quality between visual and semantic
146
+
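
The font-size heading heuristic used by the untagged fallback in `screen_reader_sim.py` can be sketched as follows. This is a minimal illustration of the thresholds listed above (>18pt = H1, >14pt = H2); the function name and signature here are illustrative, not the module's actual API:

```python
from typing import Optional

def infer_heading_level(font_size: float) -> Optional[str]:
    """Untagged-PDF fallback heuristic: >18pt reads as H1,
    >14pt as H2, anything smaller as body text."""
    if font_size > 18:
        return "H1"
    if font_size > 14:
        return "H2"
    return None
```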
147
  ### Key Data Structures
148
 
149
  **Single Page Analysis**:
 
159
  - `critical_pages`: Pages with 3+ issues
160
  - `to_dict()`: Method to convert to JSON-serializable format
161
 
162
+ **Advanced Analysis**:
163
+ - `StructureNode`: Represents a node in the PDF structure tree with:
164
+ - `tag_type`: Tag name (P, H1, Document, Figure, etc.)
165
+ - `depth`: Nesting level in the tree
166
+ - `mcid`: Marked Content ID (links to page content)
167
+ - `alt_text`: Alternative text for accessibility
168
+ - `actual_text`: Actual text content or replacement text
169
+ - `page_ref`: 0-based page index
170
+ - `children`: List of child StructureNode objects
171
+ - `to_dict()`: Convert to JSON-serializable format
172
+
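
A minimal sketch of `StructureNode` matching the fields listed above; the actual dataclass in `structure_tree.py` may carry additional helpers:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class StructureNode:
    tag_type: str                   # e.g. "P", "H1", "Document", "Figure"
    depth: int = 0                  # nesting level in the tree
    mcid: Optional[int] = None      # Marked Content ID (links to page content)
    alt_text: Optional[str] = None
    actual_text: Optional[str] = None
    page_ref: Optional[int] = None  # 0-based page index
    children: List["StructureNode"] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        """JSON-serializable view of the subtree."""
        return {
            "tag_type": self.tag_type,
            "depth": self.depth,
            "mcid": self.mcid,
            "alt_text": self.alt_text,
            "actual_text": self.actual_text,
            "page_ref": self.page_ref,
            "children": [c.to_dict() for c in self.children],
        }
```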
173
  **UI State**:
174
  - The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
175
  - Background color cache: `_bg_color_cache` dict keyed by (document_path, page_index)
176
 
177
  ### Gradio UI Flow
178
 
179
+ The UI is organized into three tabs: **Single Page Analysis**, **Batch Analysis**, and **Advanced Analysis**.
180
 
181
  #### Single Page Tab
182
  1. User uploads PDF → `_on_upload` → extracts path and page count
 
199
  - Color-coded HTML table of per-page results
200
  - Full JSON report
201
 
202
+ #### Advanced Analysis Tab
203
+
204
+ Power-user features for deep PDF inspection and accessibility debugging. Each feature is in its own accordion:
205
+
206
+ 1. **Content Stream Inspector**:
207
+ - Extracts raw PDF content stream operators for a specific block
208
+ - Shows low-level commands: text positioning (Tm, Td), fonts (Tf), text display (Tj, TJ)
209
+ - Useful for debugging text extraction, font issues, and positioning problems
210
+ - Provides both formatted view and raw stream
211
+ - Uses regex parsing of content streams (approximate for complex PDFs)
212
+
213
+ 2. **Screen Reader Simulator**:
214
+ - Simulates NVDA or JAWS reading behavior for the current page
215
+ - Two modes:
216
+ - **Tagged PDFs**: Follows structure tree, announces headings/paragraphs/figures with proper semantics
217
+ - **Untagged PDFs**: Falls back to visual reading order, infers headings from font size
218
+ - Three detail levels: minimal (text only), default (element announcements), verbose (full context)
219
+ - Generates transcript + analysis with alt text coverage statistics
220
+ - Reading order configurable for untagged fallback (raw/tblr/columns)
221
+
222
+ 3. **Paragraph Detection**:
223
+ - Compares visual paragraphs (detected by spacing) vs semantic &lt;P&gt; tags
224
+ - Visual detection: groups blocks with vertical gap < threshold (default 15pt)
225
+ - Semantic detection: extracts &lt;P&gt; tags from structure tree
226
+ - Generates color-coded overlay (green = visual paragraphs)
227
+ - Reports match quality score and mismatches
228
+ - Requires tagged PDF for semantic comparison
229
+
230
+ 4. **Structure Tree Visualizer**:
231
+ - Extracts complete PDF tag hierarchy from StructTreeRoot
232
+ - Three visualization formats:
233
+ - **Tree Diagram**: Interactive Plotly sunburst chart
234
+ - **Text View**: Indented text with box-drawing characters
235
+ - **Statistics**: Node counts, tag distribution, alt text coverage
236
+ - Shows tag types (H1-H6, P, Figure, Table, L, LI, etc.)
237
+ - Displays alt text, actual text, page references, and MCID markers
238
+ - Only works for tagged PDFs
239
+
240
+ 5. **Block-to-Tag Mapping**:
241
+ - Maps visual blocks to structure tree elements via MCID (Marked Content ID)
242
+ - Shows which blocks have proper semantic tagging
243
+ - DataFrame output with block index, tag type, MCID, alt text
244
+ - Helps identify untagged content
245
+ - Requires tagged PDF with MCID references
246
+
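
The spacing-based grouping behind feature 3 can be sketched like this; plain bounding-box tuples stand in for the app's `BlockInfo` objects, and the function name here is illustrative:

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # x0, y0, x1, y1 in PDF points

def group_visual_paragraphs(bboxes: List[BBox],
                            gap_threshold: float = 15.0) -> List[List[int]]:
    """Group consecutive blocks into paragraphs: start a new paragraph
    whenever the vertical gap to the previous block meets the threshold."""
    paragraphs: List[List[int]] = []
    current: List[int] = []
    prev_bottom = None
    for i, (_, y0, _, y1) in enumerate(bboxes):
        if prev_bottom is not None and (y0 - prev_bottom) >= gap_threshold:
            paragraphs.append(current)
            current = []
        current.append(i)
        prev_bottom = y1
    if current:
        paragraphs.append(current)
    return paragraphs
```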
247
  #### Help & Documentation
248
  - All UI controls have `info` parameters with inline tooltips
249
  - Expandable "📖 Understanding the Diagnostics" accordion with detailed explanations
 
326
  - Plotly charts via `gr.Plot()` for interactive visualizations
327
  - All batch results have `.to_dict()` method for JSON export
328
 
329
+ ### Advanced Analysis Error Handling
330
+ - **Graceful Degradation**: All advanced features check for requirements before execution
331
+ - **Structure Tree Required**: Features 4 and 5 require tagged PDFs
332
+ - `@require_structure_tree` decorator checks for StructTreeRoot
333
+ - Returns user-friendly error message if not found
334
+ - Explains what tagging is and why it's needed
335
+ - **Safe Execution**: All features wrapped in `@safe_execute` decorator
336
+ - Catches all exceptions with traceback
337
+ - Returns formatted error messages instead of crashing
338
+ - **Content Stream Parsing**: Regex-based, may fail on complex/malformed PDFs
339
+ - Returns "not matched" status if text object not found
340
+ - Shows raw stream even if parsing fails
341
+ - **MCID Extraction**: May fail if content stream uses non-standard encoding
342
+ - Returns empty list on failure
343
+ - Block-to-tag mapping shows "No mappings found" message
344
+ - **Performance Limits**: Structure tree extraction has max_depth=20 to prevent infinite loops
345
+
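
The error-dict convention described above follows a standard decorator pattern. This is a generic sketch that mirrors, but does not reproduce, the exact code in `advanced_analysis.py`; the `analyze` function and path are hypothetical:

```python
import traceback
from functools import wraps

def safe_execute(func):
    """Turn any exception into an error dict instead of crashing the UI."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            return {"error": True,
                    "message": f"## Error\n\n{e}\n\n```\n{traceback.format_exc()}\n```"}
    return wrapper

@safe_execute
def analyze(path: str):
    # Stand-in for a feature function that fails on a malformed PDF
    raise ValueError(f"could not parse {path}")

result = analyze("broken.pdf")
```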
346
  ## Testing
347
 
348
  ### Manual Testing Checklist
advanced_analysis.py ADDED
@@ -0,0 +1,430 @@
1
+ """
2
+ Advanced Analysis Coordinator Module
3
+
4
+ Provides high-level facade functions for advanced PDF accessibility features,
5
+ with error handling and graceful degradation.
6
+ """
7
+
8
+ from typing import Dict, List, Any, Optional, Callable
9
+ from functools import wraps
10
+ import pikepdf
11
+ import traceback
12
+
13
+ # Import feature modules
14
+ from content_stream_parser import (
15
+ extract_content_stream_for_block,
16
+ format_operators_markdown,
17
+ format_raw_stream
18
+ )
19
+ from screen_reader_sim import (
20
+ simulate_screen_reader,
21
+ format_transcript
22
+ )
23
+ from structure_tree import (
24
+ extract_structure_tree,
25
+ format_tree_text,
26
+ get_tree_statistics,
27
+ format_statistics_markdown,
28
+ map_blocks_to_tags,
29
+ detect_visual_paragraphs,
30
+ detect_semantic_paragraphs,
31
+ compare_paragraphs
32
+ )
33
+
34
+
35
+ def require_structure_tree(func: Callable) -> Callable:
36
+ """
37
+ Decorator to check for structure tree before executing function.
38
+
39
+ Functions decorated with this will return an error message if the PDF
40
+ does not have a tagged structure tree.
41
+ """
42
+ @wraps(func)
43
+ def wrapper(pdf_path: str, *args, **kwargs):
44
+ try:
45
+ with pikepdf.open(pdf_path) as pdf:
46
+ if '/StructTreeRoot' not in pdf.Root:
47
+ return {
48
+ 'error': True,
49
+ 'message': '## No Structure Tree Found\n\n'
50
+ 'This PDF does not have a tagged structure tree. '
51
+ 'This feature requires a tagged PDF.\n\n'
52
+ '**What this means**: The PDF was not created with '
53
+ 'accessibility tagging, so semantic structure information '
54
+ '(headings, paragraphs, alt text) is not available.\n\n'
55
+ '**Recommendation**: Use authoring tools that support '
56
+ 'PDF/UA tagging (Adobe Acrobat, MS Word with "Save as Tagged PDF").'
57
+ }
58
+ except Exception as e:
59
+ return {
60
+ 'error': True,
61
+ 'message': f'## Error\n\nCould not open PDF: {str(e)}'
62
+ }
63
+
64
+ return func(pdf_path, *args, **kwargs)
65
+
66
+ return wrapper
67
+
68
+
69
+ def safe_execute(func: Callable) -> Callable:
70
+ """
71
+ Decorator for safe execution with comprehensive error handling.
72
+
73
+ Catches all exceptions and returns user-friendly error messages.
74
+ """
75
+ @wraps(func)
76
+ def wrapper(*args, **kwargs):
77
+ try:
78
+ return func(*args, **kwargs)
79
+ except Exception as e:
80
+ error_trace = traceback.format_exc()
81
+ return {
82
+ 'error': True,
83
+ 'message': f'## Error\n\n{str(e)}\n\n**Details**:\n```\n{error_trace}\n```'
84
+ }
85
+
86
+ return wrapper
87
+
88
+
89
+ # Feature 1: Content Stream Inspector
90
+
91
+ @safe_execute
92
+ def analyze_content_stream(
93
+ pdf_path: str,
94
+ page_index: int,
95
+ block_index: int,
96
+ blocks: List[Any]
97
+ ) -> Dict[str, Any]:
98
+ """
99
+ Analyze content stream operators for a specific block.
100
+
101
+ Args:
102
+ pdf_path: Path to PDF file
103
+ page_index: 0-based page index
104
+ block_index: Index of block to analyze
105
+ blocks: List of BlockInfo objects
106
+
107
+ Returns:
108
+ Dictionary with formatted operators and raw stream
109
+ """
110
+ result = extract_content_stream_for_block(pdf_path, page_index, block_index, blocks)
111
+
112
+ if 'error' in result:
113
+ return {
114
+ 'error': True,
115
+ 'message': f"## Error\n\n{result['error']}"
116
+ }
117
+
118
+ return {
119
+ 'error': False,
120
+ 'formatted': format_operators_markdown(result),
121
+ 'raw': format_raw_stream(result.get('raw_stream', '')),
122
+ 'matched': result.get('matched', False)
123
+ }
124
+
125
+
126
+ # Feature 2: Screen Reader Simulator
127
+
128
+ @safe_execute
129
+ def analyze_screen_reader(
130
+ pdf_path: str,
131
+ page_index: int,
132
+ blocks: List[Any],
133
+ reader_type: str = "NVDA",
134
+ detail_level: str = "default",
135
+ order_mode: str = "tblr"
136
+ ) -> Dict[str, Any]:
137
+ """
138
+ Simulate screen reader output for a page.
139
+
140
+ Args:
141
+ pdf_path: Path to PDF file
142
+ page_index: 0-based page index
143
+ blocks: List of BlockInfo objects
144
+ reader_type: "NVDA" or "JAWS"
145
+ detail_level: "minimal", "default", or "verbose"
146
+ order_mode: Reading order for untagged fallback
147
+
148
+ Returns:
149
+ Dictionary with transcript and analysis
150
+ """
151
+ result = simulate_screen_reader(
152
+ pdf_path, page_index, blocks, reader_type, detail_level, order_mode
153
+ )
154
+
155
+ return {
156
+ 'error': False,
157
+ 'transcript': format_transcript(result),
158
+ 'analysis': result['analysis'],
159
+ 'mode': result['mode']
160
+ }
161
+
162
+
163
+ # Feature 3: Paragraph Detection
164
+
165
+ @safe_execute
166
+ def analyze_paragraphs(
167
+ pdf_path: str,
168
+ page_index: int,
169
+ blocks: List[Any],
170
+ vertical_gap_threshold: float = 15.0
171
+ ) -> Dict[str, Any]:
172
+ """
173
+ Compare visual and semantic paragraph detection.
174
+
175
+ Args:
176
+ pdf_path: Path to PDF file
177
+ page_index: 0-based page index
178
+ blocks: List of BlockInfo objects
179
+ vertical_gap_threshold: Spacing threshold for visual paragraphs
180
+
181
+ Returns:
182
+ Dictionary with comparison results
183
+ """
184
+ # Detect visual paragraphs
185
+ visual_paragraphs = detect_visual_paragraphs(blocks, vertical_gap_threshold)
186
+
187
+ # Detect semantic paragraphs
188
+ semantic_paragraphs = detect_semantic_paragraphs(pdf_path, page_index)
189
+
190
+ # Compare
191
+ comparison = compare_paragraphs(visual_paragraphs, semantic_paragraphs)
192
+
193
+ # Format mismatches
194
+ mismatch_lines = [
195
+ "## Paragraph Comparison",
196
+ "",
197
+ f"**Visual Paragraphs Detected**: {comparison['visual_count']}",
198
+ f"**Semantic &lt;P&gt; Tags Found**: {comparison['semantic_count']}",
199
+ f"**Match Quality Score**: {comparison['match_score']:.2%}",
200
+ ""
201
+ ]
202
+
203
+ if comparison['count_mismatch'] == 0:
204
+ mismatch_lines.append("✓ Count matches between visual and semantic paragraphs")
205
+ else:
206
+ mismatch_lines.append(f"⚠️ Count mismatch: {comparison['count_mismatch']} difference")
207
+
208
+ if comparison['visual_count'] > comparison['semantic_count']:
209
+ mismatch_lines.extend([
210
+ "",
211
+ "**Issue**: More visual paragraphs than semantic tags",
212
+ "- Some paragraphs may be missing &lt;P&gt; tags",
213
+ "- Screen readers may not announce paragraph boundaries properly"
214
+ ])
215
+ elif comparison['semantic_count'] > comparison['visual_count']:
216
+ mismatch_lines.extend([
217
+ "",
218
+ "**Issue**: More semantic tags than visual paragraphs",
219
+ "- Tags may not correspond to actual visual layout",
220
+ "- May cause confusion for users comparing visual and audio presentation"
221
+ ])
222
+
223
+ if not semantic_paragraphs and visual_paragraphs:
224
+ mismatch_lines.extend([
225
+ "",
226
+ "❌ **No semantic tagging found**",
227
+ "This page has no &lt;P&gt; tags. Screen readers will not announce paragraphs."
228
+ ])
229
+
230
+ return {
231
+ 'error': False,
232
+ 'visual_count': comparison['visual_count'],
233
+ 'semantic_count': comparison['semantic_count'],
234
+ 'match_score': comparison['match_score'],
235
+ 'mismatches': '\n'.join(mismatch_lines),
236
+ 'visual_paragraphs': visual_paragraphs,
237
+ 'semantic_paragraphs': semantic_paragraphs
238
+ }
239
+
240
+
241
+ # Feature 4: Structure Tree Visualizer
242
+
243
+ @require_structure_tree
244
+ @safe_execute
245
+ def analyze_structure_tree(pdf_path: str) -> Dict[str, Any]:
246
+ """
247
+ Extract and visualize the PDF structure tree.
248
+
249
+ Args:
250
+ pdf_path: Path to PDF file
251
+
252
+ Returns:
253
+ Dictionary with tree visualization and statistics
254
+ """
255
+ root = extract_structure_tree(pdf_path)
256
+
257
+ if not root:
258
+ return {
259
+ 'error': True,
260
+ 'message': '## Error\n\nCould not extract structure tree'
261
+ }
262
+
263
+ # Generate text view
264
+ text_view = format_tree_text(root, max_nodes=500)
265
+
266
+ # Generate statistics
267
+ stats = get_tree_statistics(root)
268
+ stats_markdown = format_statistics_markdown(stats)
269
+
270
+ # Generate plotly diagram
271
+ plot_data = _create_tree_plot(root)
272
+
273
+ return {
274
+ 'error': False,
275
+ 'text_view': text_view,
276
+ 'statistics': stats_markdown,
277
+ 'plot_data': plot_data,
278
+ 'stats': stats
279
+ }
280
+
281
+
282
+ def _create_tree_plot(root):
283
+ """
284
+ Create Plotly sunburst diagram data from structure tree.
285
+
286
+ Args:
287
+ root: Root StructureNode
288
+
289
+ Returns:
290
+ Plotly figure
291
+ """
292
+ import plotly.graph_objects as go
293
+
294
+ labels = []
295
+ parents = []
296
+ values = []
297
+ colors = []
298
+
299
+ # Color map for common tag types
300
+ color_map = {
301
+ 'Document': '#1f77b4',
302
+ 'Part': '#ff7f0e',
303
+ 'Sect': '#2ca02c',
304
+ 'H1': '#d62728',
305
+ 'H2': '#9467bd',
306
+ 'H3': '#8c564b',
307
+ 'H4': '#e377c2',
308
+ 'H5': '#7f7f7f',
309
+ 'H6': '#bcbd22',
310
+ 'P': '#17becf',
311
+ 'Figure': '#ff9896',
312
+ 'Table': '#c5b0d5',
313
+ 'L': '#c49c94',
314
+ 'LI': '#f7b6d2',
315
+ 'Link': '#c7c7c7',
316
+ }
317
+
318
+ def _traverse(node, parent_label=None):
319
+ # Create unique label
320
+ if node.depth == 0:
321
+ label = node.tag_type
322
+ else:
323
+ label = f"{node.tag_type}_{len(labels)}"
324
+
325
+ labels.append(label)
326
+ parents.append(parent_label if parent_label else "")
327
+ values.append(1)
328
+
329
+ # Assign color
330
+ base_tag = node.tag_type.split('_')[0]
331
+ color = color_map.get(base_tag, '#d3d3d3')
332
+ colors.append(color)
333
+
334
+ # Process children
335
+ for child in node.children:
336
+ _traverse(child, label)
337
+
338
+ _traverse(root)
339
+
340
+ fig = go.Figure(go.Sunburst(
341
+ labels=labels,
342
+ parents=parents,
343
+ values=values,
344
+ marker=dict(colors=colors),
345
+ branchvalues="remainder"
346
+ ))
347
+
348
+ fig.update_layout(
349
+ title="PDF Structure Tree Hierarchy",
350
+ height=600,
351
+ margin=dict(t=50, l=0, r=0, b=0)
352
+ )
353
+
354
+ return fig
355
+
356
+
357
+ # Feature 5: Block-to-Tag Mapping
358
+
359
+ @require_structure_tree
360
+ @safe_execute
361
+ def analyze_block_tag_mapping(
362
+ pdf_path: str,
363
+ page_index: int,
364
+ blocks: List[Any]
365
+ ) -> Dict[str, Any]:
366
+ """
367
+ Map visual blocks to structure tree tags.
368
+
369
+ Args:
370
+ pdf_path: Path to PDF file
371
+ page_index: 0-based page index
372
+ blocks: List of BlockInfo objects
373
+
374
+ Returns:
375
+ Dictionary with mapping table
376
+ """
377
+ mappings = map_blocks_to_tags(pdf_path, page_index, blocks)
378
+
379
+ if not mappings:
380
+ return {
381
+ 'error': False,
382
+ 'mappings': [],
383
+ 'message': '## No Mappings Found\n\n'
384
+ 'Could not find block-to-tag correlations for this page. '
385
+ 'This may occur if:\n'
386
+ '- The page has no marked content IDs (MCIDs)\n'
387
+ '- The structure tree is not properly linked to content\n'
388
+ '- The page uses a non-standard tagging approach'
389
+ }
390
+
391
+ # Format as table data
392
+ table_data = []
393
+ for m in mappings:
394
+ table_data.append([
395
+ str(m['block_index']),
396
+ m['tag_type'],
397
+ str(m['mcid']),
398
+ m['alt_text'][:50] if m['alt_text'] else ""
399
+ ])
400
+
401
+ return {
402
+ 'error': False,
403
+ 'mappings': table_data,
404
+ 'count': len(mappings),
405
+ 'message': f'## Block-to-Tag Mapping\n\nFound {len(mappings)} correlations'
406
+ }
407
+
408
+
409
+ # Utility function for creating block dropdown choices
410
+
411
+ def create_block_choices(blocks: List[Any]) -> List[tuple]:
412
+ """
413
+ Create dropdown choices from blocks for UI.
414
+
415
+ Args:
416
+ blocks: List of BlockInfo objects
417
+
418
+ Returns:
419
+ List of (label, value) tuples
420
+ """
421
+ choices = []
422
+ for i, block in enumerate(blocks):
423
+ text_preview = block.text[:50].replace('\n', ' ').strip()
424
+ if len(block.text) > 50:
425
+ text_preview += "..."
426
+
427
+ label = f"Block {i}: {text_preview}" if text_preview else f"Block {i} [Image]"
428
+ choices.append((label, i))
429
+
430
+ return choices
app.py CHANGED
@@ -13,6 +13,16 @@ import pymupdf as fitz # PyMuPDF
13
  import pikepdf
14
  from PIL import Image, ImageDraw, ImageFont
15
 
16
  # -----------------------------
17
  # Color Palettes for Adaptive Contrast
18
  # -----------------------------
@@ -413,8 +423,73 @@ def render_page_with_overlay(
413
 
414
  return img
415
 
416
  # -----------------------------
417
- # Heuristic problems report
418
  # -----------------------------
419
 
420
  def diagnose_page(doc: fitz.Document, page_index: int, struct: Dict[str, Any]) -> Dict[str, Any]:
@@ -929,6 +1004,141 @@ Upload a PDF and inspect:
929
 
930
  batch_json = gr.JSON(label="Full Batch Report", visible=False)
931
 
932
  def _on_upload(f):
933
  path, n, msg = load_pdf(f)
934
  return path, n, msg, gr.update(maximum=n, value=1)
@@ -947,6 +1157,148 @@ Upload a PDF and inspect:
947
  outputs=[batch_summary_md, batch_chart, batch_table, batch_json, batch_progress]
948
  )
949
 
950
  if __name__ == "__main__":
951
  demo.launch()
952
 
13
  import pikepdf
14
  from PIL import Image, ImageDraw, ImageFont
15
 
16
+ # Advanced analysis modules
17
+ from advanced_analysis import (
18
+ analyze_content_stream,
19
+ analyze_screen_reader,
20
+ analyze_paragraphs,
21
+ analyze_structure_tree,
22
+ analyze_block_tag_mapping,
23
+ create_block_choices
24
+ )
25
+
26
  # -----------------------------
27
  # Color Palettes for Adaptive Contrast
28
  # -----------------------------
 
423
 
424
  return img
425
 
426
+
427
+ def render_paragraph_overlay(
428
+ pdf_path: str,
429
+ page_index: int,
430
+ dpi: int,
431
+ visual_paragraphs: List[List[int]],
432
+ semantic_paragraphs: List[Any]
433
+ ) -> Image.Image:
434
+ """
435
+ Render page with color-coded paragraph visualizations.
436
+
437
+ Args:
438
+ pdf_path: Path to PDF file
439
+ page_index: 0-based page index
440
+ dpi: Rendering DPI
441
+ visual_paragraphs: List of visual paragraph groups (block indices)
442
+ semantic_paragraphs: List of semantic paragraph StructureNodes
443
+
444
+ Returns:
445
+ PIL Image with paragraph overlays
446
+ """
447
+ doc = fitz.open(pdf_path)
448
+ page = doc[page_index]
449
+
450
+ # Render base image
451
+ pix = page.get_pixmap(dpi=dpi)
452
+ img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
453
+ draw = ImageDraw.Draw(img, 'RGBA')
454
+
455
+ # Extract blocks for bounding boxes
456
+ blocks = extract_blocks_spans(pdf_path, page_index)
457
+
458
+ # Scale factor from PDF points to pixels
459
+ scale = dpi / 72.0
460
+
461
+ def _rect_i(bbox):
462
+ """Convert PDF bbox to pixel coordinates."""
463
+ x0, y0, x1, y1 = bbox
464
+ return (int(x0 * scale), int(y0 * scale), int(x1 * scale), int(y1 * scale))
465
+
466
+ # Draw visual paragraphs (green = matched, red = unmatched)
467
+ # For simplicity, we'll draw all visual paragraphs in green with transparency
468
+ for para_blocks in visual_paragraphs:
469
+ # Calculate bounding box for entire paragraph
470
+ if not any(i < len(blocks) for i in para_blocks):
471
+ continue
472
+
473
+ min_x0 = min(blocks[i].bbox[0] for i in para_blocks if i < len(blocks))
474
+ min_y0 = min(blocks[i].bbox[1] for i in para_blocks if i < len(blocks))
475
+ max_x1 = max(blocks[i].bbox[2] for i in para_blocks if i < len(blocks))
476
+ max_y1 = max(blocks[i].bbox[3] for i in para_blocks if i < len(blocks))
477
+
478
+ r = _rect_i((min_x0, min_y0, max_x1, max_y1))
479
+
480
+ # Green with transparency for visual paragraphs
481
+ draw.rectangle(r, outline=(0, 255, 0, 255), width=3, fill=(0, 255, 0, 30))
482
+
483
+ # Draw semantic paragraph indicators (blue borders)
484
+ # Note: semantic_paragraphs don't have direct bboxes, so we'll just count them
485
+ # In a more complete implementation, we'd map MCIDs to blocks
486
+
487
+ doc.close()
488
+ return img
489
+
490
+
491
  # -----------------------------
492
+ # Heuristic "problems" report
493
  # -----------------------------
494
 
495
  def diagnose_page(doc: fitz.Document, page_index: int, struct: Dict[str, Any]) -> Dict[str, Any]:
 
1004
 
1005
  batch_json = gr.JSON(label="Full Batch Report", visible=False)
1006
 
1007
+ # Advanced Analysis Tab
1008
+ with gr.Tab("Advanced Analysis"):
1009
+ gr.Markdown("""
1010
+ # Advanced PDF Accessibility Analysis
1011
+
1012
+ Power-user features for deep PDF inspection and accessibility debugging.
1013
+ These tools help diagnose complex accessibility issues and understand internal PDF structure.
1014
+ """)
1015
+
1016
+ with gr.Accordion("1. Content Stream Inspector", open=False):
1017
+ gr.Markdown("""
1018
+ **Purpose**: Inspect raw PDF content stream operators for a specific block
1019
+
1020
+ Shows the low-level PDF commands that render text and graphics. Useful for debugging
1021
+ text extraction issues, font problems, and positioning.
1022
+ """)
1023
+
1024
+ cs_block_dropdown = gr.Dropdown(
1025
+ label="Select Block",
1026
+ choices=[],
1027
+ info="Choose a text or image block to inspect"
1028
+ )
1029
+ cs_inspect_btn = gr.Button("Extract Operators", variant="primary")
1030
+
1031
+ with gr.Tabs():
1032
+ with gr.Tab("Formatted"):
1033
+ cs_operator_display = gr.Markdown()
1034
+ with gr.Tab("Raw Stream"):
1035
+ cs_raw_stream = gr.Code(label="Raw PDF Content Stream")
1036
+
1037
+ with gr.Accordion("2. Screen Reader Simulator", open=False):
1038
+ gr.Markdown("""
1039
+ **Purpose**: Simulate how NVDA or JAWS would read this page
1040
+
1041
+ Generates a transcript showing what a screen reader user would hear, including
1042
+ element announcements and reading order. Works with both tagged and untagged PDFs.
1043
+ """)
1044
+
1045
+ with gr.Row():
1046
+ sr_reader = gr.Radio(
1047
+ ["NVDA", "JAWS"],
1048
+ value="NVDA",
1049
+ label="Screen Reader",
1050
+ info="Choose which screen reader to simulate"
1051
+ )
1052
+ sr_detail = gr.Radio(
1053
+ ["minimal", "default", "verbose"],
1054
+ value="default",
1055
+ label="Detail Level",
1056
+ info="How much context information to include"
1057
+ )
1058
+ sr_order = gr.Radio(
1059
+ ["raw", "tblr", "columns"],
1060
+ value="tblr",
1061
+ label="Reading Order (untagged fallback)",
1062
+ info="Used only if PDF has no structure tree"
1063
+ )
1064
+
1065
+ sr_btn = gr.Button("Generate Transcript", variant="primary")
1066
+
1067
+ with gr.Tabs():
1068
+ with gr.Tab("Transcript"):
1069
+ sr_transcript = gr.Textbox(
1070
+ lines=20,
1071
+ label="Screen Reader Output",
1072
+                        interactive=False
+                    )
+                with gr.Tab("Analysis"):
+                    sr_analysis = gr.Markdown()
+
+        with gr.Accordion("3. Paragraph Detection", open=False):
+            gr.Markdown("""
+            **Purpose**: Compare visual paragraphs vs semantic paragraph tags
+
+            Identifies paragraphs based on spacing (visual) and compares them to <P> tags
+            in the structure tree (semantic). Mismatches can cause confusion for screen reader users.
+            """)
+
+            para_threshold = gr.Slider(
+                label="Vertical Gap Threshold (points)",
+                minimum=5,
+                maximum=30,
+                value=15,
+                step=1,
+                info="Minimum vertical spacing to consider a paragraph break"
+            )
+
+            para_btn = gr.Button("Analyze Paragraphs", variant="primary")
+            para_overlay = gr.Image(label="Paragraph Visualization", type="pil")
+
+            with gr.Row():
+                para_visual = gr.Number(label="Visual Paragraphs", interactive=False)
+                para_semantic = gr.Number(label="Semantic <P> Tags", interactive=False)
+                para_score = gr.Number(label="Match Quality", interactive=False)
+
+            para_mismatches = gr.Markdown()
+
+        with gr.Accordion("4. Structure Tree Visualizer", open=False):
+            gr.Markdown("""
+            **Purpose**: Display the complete PDF tag hierarchy
+
+            Shows the entire structure tree for tagged PDFs, including tag types, alt text,
+            and page references. Only works for PDFs with accessibility tagging.
+            """)
+
+            struct_btn = gr.Button("Extract Structure Tree", variant="primary")
+
+            with gr.Tabs():
+                with gr.Tab("Tree Diagram"):
+                    struct_plot = gr.Plot(label="Interactive Hierarchy")
+                with gr.Tab("Text View"):
+                    struct_text = gr.Textbox(
+                        lines=30,
+                        label="Structure Tree",
+                        interactive=False
+                    )
+                with gr.Tab("Statistics"):
+                    struct_stats = gr.Markdown()
+
+        with gr.Accordion("5. Block-to-Tag Mapping", open=False):
+            gr.Markdown("""
+            **Purpose**: Link visual blocks to structure tree elements
+
+            Maps each visual block to its corresponding tag in the structure tree via
+            MCID (Marked Content ID) references. Shows which content is properly tagged.
+            """)
+
+            map_btn = gr.Button("Map Blocks to Tags", variant="primary")
+            map_message = gr.Markdown()
+            map_table = gr.DataFrame(
+                headers=["Block #", "Tag Type", "MCID", "Alt Text"],
+                label="Block-to-Tag Correlations",
+                interactive=False
+            )
+
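The gap-based detection behind the Paragraph Detection accordion can be sketched in a few lines. This is a simplified illustration, not the app's implementation: `detect_visual_paragraphs` and the `(y0, y1, text)` block tuples are invented here, while the real code works on PyMuPDF block objects.

```python
# Sketch of threshold-based visual paragraph detection.
# blocks: iterable of (y0, y1, text) tuples in page coordinates (points).
def detect_visual_paragraphs(blocks, gap_threshold=15.0):
    """Group blocks into paragraphs wherever the vertical gap >= gap_threshold."""
    paragraphs, current = [], []
    prev_y1 = None
    for y0, y1, text in sorted(blocks, key=lambda b: b[0]):
        # A gap at least as large as the threshold starts a new paragraph
        if prev_y1 is not None and (y0 - prev_y1) >= gap_threshold:
            paragraphs.append(current)
            current = []
        current.append(text)
        prev_y1 = y1
    if current:
        paragraphs.append(current)
    return paragraphs

blocks = [(0, 10, "a"), (12, 22, "b"), (40, 50, "c")]
print(detect_visual_paragraphs(blocks))  # -> [['a', 'b'], ['c']]
```

The slider's default of 15 points corresponds to `gap_threshold=15.0` here; raising it merges more blocks into a single visual paragraph.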
    def _on_upload(f):
        path, n, msg = load_pdf(f)
        return path, n, msg, gr.update(maximum=n, value=1)

        outputs=[batch_summary_md, batch_chart, batch_table, batch_json, batch_progress]
    )

+    # Advanced Analysis Callbacks
+
+    def update_block_dropdown(pdf_path_val, page_num_val):
+        """Update block dropdown when page changes."""
+        if not pdf_path_val:
+            return gr.update(choices=[], value=None)
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            choices = create_block_choices(blocks)
+            return gr.update(choices=choices, value=0 if choices else None)
+        except Exception:
+            return gr.update(choices=[], value=None)
+
+    def run_content_stream_inspector(pdf_path_val, page_num_val, block_idx):
+        """Run content stream analysis for the selected block."""
+        if not pdf_path_val or block_idx is None:
+            return "Please select a block", ""
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_content_stream(pdf_path_val, page_num_val - 1, block_idx, blocks)
+
+            if result.get('error'):
+                return result['message'], ""
+
+            return result['formatted'], result['raw']
+        except Exception as e:
+            return f"## Error\n\n{str(e)}", ""
+
+    def run_screen_reader_sim(pdf_path_val, page_num_val, reader, detail, order):
+        """Run screen reader simulation."""
+        if not pdf_path_val:
+            return "Please upload a PDF first", ""
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_screen_reader(pdf_path_val, page_num_val - 1, blocks, reader, detail, order)
+
+            if result.get('error'):
+                return result.get('message', 'Error'), ""
+
+            return result['transcript'], result['analysis']
+        except Exception as e:
+            return f"## Error\n\n{str(e)}", ""
+
+    def run_paragraph_detection(pdf_path_val, page_num_val, dpi_val, threshold):
+        """Run paragraph detection and comparison."""
+        if not pdf_path_val:
+            return None, 0, 0, 0.0, "Please upload a PDF first"
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_paragraphs(pdf_path_val, page_num_val - 1, blocks, threshold)
+
+            if result.get('error'):
+                return None, 0, 0, 0.0, result.get('message', 'Error')
+
+            # Create visualization overlay
+            overlay = render_paragraph_overlay(
+                pdf_path_val, page_num_val - 1, dpi_val,
+                result['visual_paragraphs'], result['semantic_paragraphs']
+            )
+
+            return (
+                overlay,
+                result['visual_count'],
+                result['semantic_count'],
+                result['match_score'],
+                result['mismatches']
+            )
+        except Exception as e:
+            return None, 0, 0, 0.0, f"## Error\n\n{str(e)}"
+
+    def run_structure_tree_extraction(pdf_path_val):
+        """Extract and visualize the structure tree."""
+        if not pdf_path_val:
+            return None, "Please upload a PDF first", ""
+
+        try:
+            result = analyze_structure_tree(pdf_path_val)
+
+            if result.get('error'):
+                return None, result['message'], ""
+
+            return result['plot_data'], result['text_view'], result['statistics']
+        except Exception as e:
+            return None, f"## Error\n\n{str(e)}", ""
+
+    def run_block_tag_mapping(pdf_path_val, page_num_val):
+        """Map blocks to structure tags."""
+        if not pdf_path_val:
+            return "Please upload a PDF first", []
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_block_tag_mapping(pdf_path_val, page_num_val - 1, blocks)
+
+            if result.get('error'):
+                return result.get('message', 'Error'), []
+
+            return result['message'], result['mappings']
+        except Exception as e:
+            return f"## Error\n\n{str(e)}", []
+
+    # Wire up Advanced Analysis callbacks
+    page_num.change(
+        update_block_dropdown,
+        inputs=[pdf_path, page_num],
+        outputs=[cs_block_dropdown]
+    )
+
+    cs_inspect_btn.click(
+        run_content_stream_inspector,
+        inputs=[pdf_path, page_num, cs_block_dropdown],
+        outputs=[cs_operator_display, cs_raw_stream]
+    )
+
+    sr_btn.click(
+        run_screen_reader_sim,
+        inputs=[pdf_path, page_num, sr_reader, sr_detail, sr_order],
+        outputs=[sr_transcript, sr_analysis]
+    )
+
+    para_btn.click(
+        run_paragraph_detection,
+        inputs=[pdf_path, page_num, dpi, para_threshold],
+        outputs=[para_overlay, para_visual, para_semantic, para_score, para_mismatches]
+    )
+
+    struct_btn.click(
+        run_structure_tree_extraction,
+        inputs=[pdf_path],
+        outputs=[struct_plot, struct_text, struct_stats]
+    )
+
+    map_btn.click(
+        run_block_tag_mapping,
+        inputs=[pdf_path, page_num],
+        outputs=[map_message, map_table]
+    )
+
 if __name__ == "__main__":
     demo.launch()

content_stream_parser.py ADDED
@@ -0,0 +1,322 @@
+"""
+Content Stream Parser Module
+
+Provides functionality for extracting and analyzing PDF content stream operators,
+correlating them with visual blocks.
+"""
+
+import re
+from typing import Dict, List, Optional, Any, Tuple
+import fitz  # PyMuPDF
+
+
+def extract_content_stream_for_block(
+    pdf_path: str,
+    page_index: int,
+    block_index: int,
+    blocks: List[Any]
+) -> Dict[str, Any]:
+    """
+    Extract content stream operators for a specific block.
+
+    Args:
+        pdf_path: Path to the PDF file
+        page_index: 0-based page index
+        block_index: Index of the block to analyze
+        blocks: List of BlockInfo objects from extract_blocks_spans
+
+    Returns:
+        Dictionary with operators, raw stream, and metadata
+    """
+    if block_index < 0 or block_index >= len(blocks):
+        return {
+            'error': 'Invalid block index',
+            'operators': [],
+            'raw_stream': ''
+        }
+
+    target_block = blocks[block_index]
+
+    try:
+        doc = fitz.open(pdf_path)
+        page = doc[page_index]
+
+        # Clean and consolidate content streams
+        page.clean_contents()
+
+        # Get the page's content stream xref (clean_contents merges them into one)
+        xref = page.get_contents()[0]
+
+        # Extract raw stream data
+        stream_data = doc.xref_stream(xref)
+        try:
+            raw_stream = stream_data.decode('latin-1')
+        except UnicodeDecodeError:
+            raw_stream = stream_data.decode('utf-8', errors='ignore')
+
+        # Parse text objects from the stream
+        text_objects = _parse_text_objects(raw_stream)
+
+        # Find the text object that matches our target block
+        matching_object = _find_matching_text_object(text_objects, target_block)
+
+        doc.close()
+
+        if matching_object:
+            return {
+                'operators': matching_object['operators'],
+                'raw_stream': raw_stream,
+                'matched': True,
+                'block_text': target_block.text[:100]
+            }
+        else:
+            return {
+                'operators': [],
+                'raw_stream': raw_stream,
+                'matched': False,
+                'block_text': target_block.text[:100],
+                'message': 'Could not find matching text object in content stream'
+            }
+
+    except Exception as e:
+        return {
+            'error': str(e),
+            'operators': [],
+            'raw_stream': ''
+        }
+
+
+def _parse_text_objects(content_stream: str) -> List[Dict[str, Any]]:
+    """
+    Parse text objects (BT...ET blocks) from a content stream.
+
+    Args:
+        content_stream: Raw PDF content stream text
+
+    Returns:
+        List of text objects with their operators
+    """
+    text_objects = []
+
+    # Find all BT...ET blocks
+    bt_et_pattern = r'BT\s+(.*?)\s+ET'
+    matches = re.finditer(bt_et_pattern, content_stream, re.DOTALL)
+
+    for match in matches:
+        text_block = match.group(1)
+        operators = _parse_operators(text_block)
+        text_objects.append({
+            'operators': operators,
+            'text': _extract_text_from_operators(operators)
+        })
+
+    return text_objects
+
+
+def _parse_operators(text_block: str) -> List[Dict[str, str]]:
+    """
+    Parse individual operators from a text block.
+
+    Args:
+        text_block: Text between BT and ET
+
+    Returns:
+        List of operator dictionaries with type and value
+    """
+    operators = []
+
+    # Text matrix (Tm)
+    tm_pattern = r'([\d.\-\s]+)\s+Tm'
+    for match in re.finditer(tm_pattern, text_block):
+        operators.append({
+            'type': 'Tm',
+            'value': match.group(1).strip(),
+            'description': 'Text Matrix'
+        })
+
+    # Font (Tf)
+    tf_pattern = r'/(\S+)\s+([\d.]+)\s+Tf'
+    for match in re.finditer(tf_pattern, text_block):
+        operators.append({
+            'type': 'Tf',
+            'value': f'/{match.group(1)} {match.group(2)}',
+            'description': f'Font: {match.group(1)}, Size: {match.group(2)}'
+        })
+
+    # Text positioning (Td, TD)
+    td_pattern = r'([\d.\-]+)\s+([\d.\-]+)\s+T[dD]'
+    for match in re.finditer(td_pattern, text_block):
+        operators.append({
+            'type': 'Td',
+            'value': f'{match.group(1)} {match.group(2)}',
+            'description': f'Move text position ({match.group(1)}, {match.group(2)})'
+        })
+
+    # Text showing (Tj)
+    tj_pattern = r'\((.*?)\)\s*Tj'
+    for match in re.finditer(tj_pattern, text_block):
+        text = match.group(1)
+        operators.append({
+            'type': 'Tj',
+            'value': f'({text})',
+            'description': f'Show text: {text[:50]}'
+        })
+
+    # Text showing (TJ - array)
+    tj_array_pattern = r'\[(.*?)\]\s*TJ'
+    for match in re.finditer(tj_array_pattern, text_block, re.DOTALL):
+        array_content = match.group(1)
+        operators.append({
+            'type': 'TJ',
+            'value': f'[{array_content[:100]}]',
+            'description': 'Show text array'
+        })
+
+    # Text leading (TL)
+    tl_pattern = r'([\d.\-]+)\s+TL'
+    for match in re.finditer(tl_pattern, text_block):
+        operators.append({
+            'type': 'TL',
+            'value': match.group(1),
+            'description': f'Text leading: {match.group(1)}'
+        })
+
+    # Color operators (rg, RG, g, G)
+    color_pattern = r'([\d.\s]+)\s+(rg|RG|g|G)'
+    for match in re.finditer(color_pattern, text_block):
+        operators.append({
+            'type': match.group(2),
+            'value': match.group(1).strip(),
+            'description': f'Color: {match.group(1).strip()}'
+        })
+
+    return operators
+
+
+def _extract_text_from_operators(operators: List[Dict[str, str]]) -> str:
+    """
+    Extract visible text from an operator list.
+
+    Args:
+        operators: List of operator dictionaries
+
+    Returns:
+        Concatenated text content
+    """
+    text_parts = []
+
+    for op in operators:
+        if op['type'] in ['Tj', 'TJ']:
+            # Extract text from parentheses or array
+            value = op['value']
+            # Simple extraction - just get content in parentheses
+            matches = re.findall(r'\((.*?)\)', value)
+            text_parts.extend(matches)
+
+    return ' '.join(text_parts)
+
+
+def _find_matching_text_object(
+    text_objects: List[Dict[str, Any]],
+    target_block: Any
+) -> Optional[Dict[str, Any]]:
+    """
+    Find the text object that best matches the target block.
+
+    Args:
+        text_objects: List of parsed text objects
+        target_block: BlockInfo object to match
+
+    Returns:
+        Matching text object or None
+    """
+    target_text = target_block.text.strip()
+    if not target_text:
+        return None
+
+    best_match = None
+    best_score = 0.0
+
+    for text_obj in text_objects:
+        obj_text = text_obj['text'].strip()
+        if not obj_text:
+            continue
+
+        # Simple similarity score: if either text contains the other,
+        # score by the ratio of the shorter length to the longer
+        if target_text in obj_text or obj_text in target_text:
+            score = min(len(target_text), len(obj_text)) / max(len(target_text), len(obj_text))
+            if score > best_score:
+                best_score = score
+                best_match = text_obj
+
+    # Only return a match if the score is reasonable
+    if best_score > 0.3:
+        return best_match
+
+    return None
+
+
+def format_operators_markdown(result: Dict[str, Any]) -> str:
+    """
+    Format operators as readable Markdown.
+
+    Args:
+        result: Result dictionary from extract_content_stream_for_block
+
+    Returns:
+        Formatted Markdown string
+    """
+    if 'error' in result:
+        return f"## Error\n\n{result['error']}"
+
+    lines = [
+        "## Content Stream Operators",
+        "",
+        f"**Block Text**: {result.get('block_text', 'N/A')}",
+        ""
+    ]
+
+    if not result.get('matched'):
+        lines.extend([
+            "⚠️ **Warning**: Could not find exact matching text object in content stream.",
+            "",
+            result.get('message', ''),
+            ""
+        ])
+
+    operators = result.get('operators', [])
+    if operators:
+        lines.extend([
+            "### Operators Found",
+            ""
+        ])
+
+        for i, op in enumerate(operators, 1):
+            lines.append(f"**{i}. {op['type']}**")
+            lines.append(f"   - Value: `{op['value']}`")
+            lines.append(f"   - {op['description']}")
+            lines.append("")
+    else:
+        lines.append("No operators found.")
+
+    return "\n".join(lines)
+
+
+def format_raw_stream(raw_stream: str, max_lines: int = 100) -> str:
+    """
+    Format a raw content stream for display.
+
+    Args:
+        raw_stream: Raw PDF content stream text
+        max_lines: Maximum number of lines to display
+
+    Returns:
+        Formatted string
+    """
+    lines = raw_stream.split('\n')
+    total_lines = len(lines)
+    if total_lines > max_lines:
+        lines = lines[:max_lines]
+        lines.append(f"\n... (truncated, {total_lines - max_lines} more lines)")
+
+    return '\n'.join(lines)
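The BT…ET parsing above can be exercised standalone on a synthetic stream. A minimal sketch using the same two regexes as `_parse_text_objects` and `_parse_operators` (the stream string is hand-written for illustration; real streams come from `doc.xref_stream`):

```python
import re

# A minimal, hand-written content stream: one text object with font,
# position, and a single text-showing (Tj) operator.
stream = "BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\nET"

# Same BT...ET extraction used by _parse_text_objects
text_blocks = re.findall(r"BT\s+(.*?)\s+ET", stream, re.DOTALL)

# Same Tj extraction used by _parse_operators
shown = []
for block in text_blocks:
    shown.extend(re.findall(r"\((.*?)\)\s*Tj", block))

print(shown)  # -> ['Hello']
```

Note the lazy `(.*?)` with `re.DOTALL` is what lets a text object span multiple lines; without `DOTALL` the dot would stop at the first newline.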
screen_reader_sim.py ADDED
@@ -0,0 +1,398 @@
+"""
+Screen Reader Simulator Module
+
+Simulates how NVDA and JAWS would read a PDF page, supporting both
+tagged (structure tree) and untagged (visual order fallback) PDFs.
+"""
+
+from typing import Dict, List, Any, Optional, Tuple
+import pikepdf
+from structure_tree import extract_structure_tree, StructureNode
+
+
+def simulate_screen_reader(
+    pdf_path: str,
+    page_index: int,
+    blocks: List[Any],
+    reader_type: str = "NVDA",
+    detail_level: str = "default",
+    order_mode: str = "tblr"
+) -> Dict[str, Any]:
+    """
+    Simulate screen reader output for a PDF page.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_index: 0-based page index
+        blocks: List of BlockInfo objects from extract_blocks_spans
+        reader_type: "NVDA" or "JAWS"
+        detail_level: "minimal", "default", or "verbose"
+        order_mode: Reading order mode for untagged fallback ("raw", "tblr", "columns")
+
+    Returns:
+        Dictionary with transcript, analysis, and metadata
+    """
+    # Try the tagged approach first
+    root = extract_structure_tree(pdf_path)
+
+    if root:
+        # Use the structure tree
+        transcript, analysis = _simulate_tagged(
+            root, page_index, reader_type, detail_level
+        )
+        mode = "tagged"
+    else:
+        # Fall back to visual order
+        transcript, analysis = _simulate_untagged(
+            blocks, reader_type, detail_level, order_mode
+        )
+        mode = "untagged"
+
+    return {
+        'transcript': transcript,
+        'analysis': analysis,
+        'mode': mode,
+        'reader_type': reader_type,
+        'detail_level': detail_level
+    }
+
+
+def _simulate_tagged(
+    root: StructureNode,
+    page_index: int,
+    reader_type: str,
+    detail_level: str
+) -> Tuple[str, str]:
+    """
+    Simulate a screen reader for a tagged PDF using the structure tree.
+
+    Args:
+        root: Root StructureNode
+        page_index: Page to simulate (0-based)
+        reader_type: "NVDA" or "JAWS"
+        detail_level: Detail level
+
+    Returns:
+        Tuple of (transcript, analysis)
+    """
+    # Collect structure elements for this page
+    page_elements = []
+
+    def _collect_page_elements(node: StructureNode):
+        # Include the node if it's for this page or has no page ref (document-level)
+        if node.page_ref is None or node.page_ref == page_index:
+            if node.tag_type not in ['StructTreeRoot', 'MCID']:
+                page_elements.append(node)
+
+        for child in node.children:
+            _collect_page_elements(child)
+
+    _collect_page_elements(root)
+
+    # Generate transcript
+    transcript_lines = []
+    element_count = 0
+
+    for element in page_elements:
+        announcement = _format_element_announcement(
+            element, reader_type, detail_level
+        )
+        if announcement:
+            transcript_lines.append(announcement)
+            element_count += 1
+
+    transcript = '\n\n'.join(transcript_lines)
+
+    # Generate analysis
+    analysis_lines = [
+        "## Screen Reader Analysis (Tagged Mode)",
+        "",
+        "**Structure**: This page uses PDF tagging (accessible structure tree)",
+        f"**Elements Found**: {element_count}",
+        ""
+    ]
+
+    # Count element types
+    tag_counts = {}
+    for element in page_elements:
+        tag_counts[element.tag_type] = tag_counts.get(element.tag_type, 0) + 1
+
+    if tag_counts:
+        analysis_lines.extend([
+            "### Element Types",
+            ""
+        ])
+        for tag, count in sorted(tag_counts.items()):
+            analysis_lines.append(f"- **{tag}**: {count}")
+
+    # Check alt text coverage
+    elements_needing_alt = [e for e in page_elements if e.tag_type in ['Figure', 'Formula', 'Artifact']]
+    elements_with_alt = [e for e in elements_needing_alt if e.alt_text]
+
+    if elements_needing_alt:
+        coverage = len(elements_with_alt) / len(elements_needing_alt) * 100
+        analysis_lines.extend([
+            "",
+            "### Alt Text Coverage",
+            "",
+            f"**Elements needing alt text**: {len(elements_needing_alt)}",
+            f"**Elements with alt text**: {len(elements_with_alt)}",
+            f"**Coverage**: {coverage:.1f}%",
+            ""
+        ])
+
+        if coverage < 100:
+            analysis_lines.append("⚠️ Some elements are missing alt text")
+
+    analysis = '\n'.join(analysis_lines)
+
+    return transcript, analysis
+
+
+def _simulate_untagged(
+    blocks: List[Any],
+    reader_type: str,
+    detail_level: str,
+    order_mode: str
+) -> Tuple[str, str]:
+    """
+    Simulate a screen reader for an untagged PDF using visual order.
+
+    Args:
+        blocks: List of BlockInfo objects
+        reader_type: "NVDA" or "JAWS"
+        detail_level: Detail level
+        order_mode: Reading order mode
+
+    Returns:
+        Tuple of (transcript, analysis)
+    """
+    from app import order_blocks  # Imported here to avoid a circular import
+
+    # Order blocks according to mode
+    ordered_blocks = order_blocks(blocks, order_mode)
+
+    # Generate transcript
+    transcript_lines = []
+    text_block_count = 0
+    image_block_count = 0
+
+    for block in ordered_blocks:
+        if block.block_type == 0:  # Text block
+            # Infer heading from font size
+            is_heading = False
+            heading_level = None
+
+            if block.spans:
+                avg_size = sum(s.size for s in block.spans) / len(block.spans)
+                if avg_size > 18:
+                    is_heading = True
+                    heading_level = 1
+                elif avg_size > 14:
+                    is_heading = True
+                    heading_level = 2
+
+            # Format announcement
+            if is_heading and detail_level != "minimal":
+                if reader_type == "NVDA":
+                    transcript_lines.append(f"Heading level {heading_level}")
+                    transcript_lines.append(block.text.strip())
+                else:  # JAWS
+                    transcript_lines.append(f"Heading {heading_level}: {block.text.strip()}")
+            else:
+                transcript_lines.append(block.text.strip())
+
+            text_block_count += 1
+
+        elif block.block_type == 1:  # Image block
+            if detail_level != "minimal":
+                transcript_lines.append("[Image - no alt text available]")
+            image_block_count += 1
+
+    transcript = '\n\n'.join(transcript_lines)
+
+    # Generate analysis
+    analysis_lines = [
+        "## Screen Reader Analysis (Untagged Mode)",
+        "",
+        "⚠️ **No Structure**: This page does not use PDF tagging",
+        "",
+        "Screen readers will read text in visual order with limited context.",
+        "",
+        f"**Reading Order Mode**: {order_mode}",
+        f"**Text Blocks**: {text_block_count}",
+        f"**Images**: {image_block_count}",
+        "",
+        "### Limitations",
+        "",
+        "- No semantic information (headings, lists, tables)",
+        "- No alt text for images",
+        "- Reading order may not match intended flow",
+        "- Navigation by elements not possible",
+        "",
+        "**Recommendation**: Add PDF tagging for better accessibility"
+    ]
+
+    analysis = '\n'.join(analysis_lines)
+
+    return transcript, analysis
+
+
+def _format_element_announcement(
+    element: StructureNode,
+    reader_type: str,
+    detail_level: str
+) -> Optional[str]:
+    """
+    Format a structure element as a screen reader announcement.
+
+    Args:
+        element: StructureNode to announce
+        reader_type: "NVDA" or "JAWS"
+        detail_level: "minimal", "default", or "verbose"
+
+    Returns:
+        Formatted announcement string or None
+    """
+    tag = element.tag_type
+    lines = []
+
+    # Map PDF tag types to screen reader announcements
+    if tag.startswith('H'):
+        # Heading
+        level = tag[1:] if len(tag) > 1 else '1'
+        text = element.actual_text or "[Heading]"
+
+        if detail_level == "minimal":
+            return text
+
+        if reader_type == "NVDA":
+            lines.append(f"Heading level {level}")
+            lines.append(text)
+        else:  # JAWS
+            lines.append(f"Heading {level}: {text}")
+
+    elif tag == 'P':
+        # Paragraph
+        text = element.actual_text or "[Paragraph]"
+
+        if detail_level == "minimal":
+            return text
+
+        if detail_level == "verbose":
+            if reader_type == "NVDA":
+                lines.append("Paragraph")
+            lines.append(text)
+            if reader_type == "NVDA":
+                lines.append("Out of paragraph")
+        else:
+            lines.append(text)
+
+    elif tag == 'Figure':
+        # Figure/Image
+        alt_text = element.alt_text or "[Image - no alt text]"
+
+        if detail_level == "minimal":
+            return None
+
+        if reader_type == "NVDA":
+            lines.append("Graphic")
+            lines.append(alt_text)
+        else:  # JAWS
+            lines.append(f"Graphic: {alt_text}")
+
+    elif tag == 'Formula':
+        # Math formula
+        alt_text = element.alt_text or element.actual_text or "[Formula]"
+
+        if detail_level == "minimal":
+            return alt_text
+
+        if reader_type == "NVDA":
+            lines.append("Formula")
+            lines.append(alt_text)
+        else:  # JAWS
+            lines.append(f"Formula: {alt_text}")
+
+    elif tag in ['L', 'LI']:
+        # List/List item
+        text = element.actual_text or "[List item]"
+
+        if detail_level == "minimal":
+            return text
+
+        if tag == 'L' and detail_level == "verbose":
+            lines.append("List start")
+        else:
+            if reader_type == "NVDA":
+                lines.append("List item")
+                lines.append(text)
+            else:  # JAWS
+                lines.append(f"Bullet: {text}")
+
+    elif tag == 'Table':
+        # Table
+        if detail_level != "minimal":
+            if reader_type == "NVDA":
+                lines.append("Table")
+            else:  # JAWS
+                lines.append("Table start")
+
+    elif tag in ['TR', 'TD', 'TH']:
+        # Table row/cell
+        text = element.actual_text or ""
+        if text and detail_level != "minimal":
+            lines.append(text)
+
+    elif tag == 'Link':
+        # Link
+        text = element.actual_text or "[Link]"
+
+        if detail_level == "minimal":
+            return text
+
+        if reader_type == "NVDA":
+            lines.append("Link")
+            lines.append(text)
+        else:  # JAWS
+            lines.append(f"Link: {text}")
+
+    elif tag == 'Span':
+        # Inline text
+        text = element.actual_text or ""
+        if text:
+            return text
+
+    elif tag in ['Document', 'Part', 'Sect', 'Div', 'Art']:
+        # Container elements - usually not announced
+        return None
+
+    else:
+        # Unknown tag type
+        if element.actual_text:
+            return element.actual_text
+
+    if lines:
+        return '\n'.join(lines)
+
+    return None
+
+
+def format_transcript(result: Dict[str, Any]) -> str:
+    """
+    Format a screen reader transcript for display.
+
+    Args:
+        result: Result from simulate_screen_reader
+
+    Returns:
+        Formatted transcript string
+    """
+    header = f"# {result['reader_type']} Transcript ({result['detail_level']} detail)\n\n"
+
+    if result['mode'] == 'untagged':
+        header += "⚠️ Simulated from visual order (PDF not tagged)\n\n"
+
+    header += "---\n\n"
+
+    return header + result['transcript']
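The font-size heuristic that the untagged fallback uses to guess headings can be isolated for testing. A sketch under the same thresholds as `_simulate_untagged` (the helper name `infer_heading_level` is invented here; the real code works inline on block spans):

```python
def infer_heading_level(span_sizes):
    """Return 1, 2, or None from a block's span sizes in points.

    Mirrors the untagged fallback: average size > 18pt -> H1, > 14pt -> H2.
    """
    if not span_sizes:
        return None
    avg = sum(span_sizes) / len(span_sizes)
    if avg > 18:
        return 1
    if avg > 14:
        return 2
    return None

print(infer_heading_level([24.0]))        # -> 1
print(infer_heading_level([15.0, 16.0]))  # -> 2
print(infer_heading_level([11.0]))        # -> None
```

This is a coarse heuristic: a large drop cap or an oversized pull quote will be misread as a heading, which is exactly why the tagged path is preferred when a structure tree exists.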
structure_tree.py ADDED
@@ -0,0 +1,493 @@
+ """
2
+ Structure Tree Analysis Module
3
+
4
+ Provides functionality for extracting and analyzing PDF structure trees,
5
+ including paragraph detection and block-to-tag mapping.
6
+ """
7
+
8
+ from dataclasses import dataclass, field
9
+ from typing import List, Optional, Dict, Any, Tuple
10
+ import pikepdf
11
+ import statistics
12
+ from collections import Counter
13
+
14
+
15
+ @dataclass
16
+ class StructureNode:
17
+ """Represents a node in the PDF structure tree."""
18
+ tag_type: str # P, H1, Document, etc.
19
+ depth: int
20
+ mcid: Optional[int] = None
21
+ alt_text: Optional[str] = None
22
+ actual_text: Optional[str] = None
23
+ page_ref: Optional[int] = None
24
+ children: List['StructureNode'] = field(default_factory=list)
25
+
26
+ def to_dict(self) -> Dict[str, Any]:
27
+ """Convert to dictionary for JSON serialization."""
28
+ return {
29
+ 'tag_type': self.tag_type,
30
+ 'depth': self.depth,
31
+ 'mcid': self.mcid,
32
+ 'alt_text': self.alt_text,
33
+ 'actual_text': self.actual_text,
34
+ 'page_ref': self.page_ref,
35
+ 'children': [child.to_dict() for child in self.children]
36
+ }
37
+
38
+
39
+ def extract_structure_tree(pdf_path: str) -> Optional[StructureNode]:
40
+ """
41
+ Extract the complete structure tree from a PDF.
42
+
43
+ Args:
44
+ pdf_path: Path to the PDF file
45
+
46
+ Returns:
47
+ Root StructureNode or None if no structure tree exists
48
+ """
49
+ try:
50
+ with pikepdf.open(pdf_path) as pdf:
51
+ if '/StructTreeRoot' not in pdf.Root:
52
+ return None
53
+
54
+ struct_root = pdf.Root.StructTreeRoot
55
+
56
+ # Create root node
57
+ root_node = StructureNode(
58
+ tag_type="StructTreeRoot",
59
+ depth=0
60
+ )
61
+
62
+ # Recursively parse the tree
63
+ if '/K' in struct_root:
64
+ _parse_structure_element(struct_root.K, root_node, 1, pdf)
65
+
66
+ return root_node
67
+
68
+ except Exception as e:
69
+ print(f"Error extracting structure tree: {e}")
70
+ return None
71
+
72
+
73
+ def _parse_structure_element(element, parent_node: StructureNode, depth: int, pdf: pikepdf.Pdf, max_depth: int = 20):
74
+ """
75
+ Recursively parse a structure element and its children.
76
+
77
+ Args:
78
+ element: pikepdf object representing the element
79
+ parent_node: Parent StructureNode to attach children to
80
+ depth: Current depth in the tree
81
+ pdf: pikepdf.Pdf object for resolving references
82
+ max_depth: Maximum recursion depth to prevent infinite loops
83
+ """
84
+ if depth > max_depth:
85
+ return
86
+
87
+ # Handle arrays of elements
88
+ if isinstance(element, pikepdf.Array):
89
+ for item in element:
90
+ _parse_structure_element(item, parent_node, depth, pdf, max_depth)
91
+ return
92
+
93
+ # Handle MCID (Marked Content ID) - leaf node
94
+ if isinstance(element, int):
95
+ node = StructureNode(
96
+ tag_type="MCID",
97
+ depth=depth,
98
+ mcid=element
99
+ )
100
+ parent_node.children.append(node)
101
+ return
102
+
103
+ # Handle dictionary (structure element)
104
+ if isinstance(element, pikepdf.Dictionary):
105
+ # Extract tag type
106
+ tag_type = str(element.get('/S', 'Unknown'))
107
+ if tag_type.startswith('/'):
108
+ tag_type = tag_type[1:] # Remove leading slash
109
+
110
+ # Extract attributes
111
+ mcid = None
112
+ if '/MCID' in element:
113
+ mcid = int(element.MCID)
114
+
+        alt_text = None
+        if '/Alt' in element:
+            try:
+                alt_text = str(element.Alt)
+            except Exception:
+                pass
+
+        actual_text = None
+        if '/ActualText' in element:
+            try:
+                actual_text = str(element.ActualText)
+            except Exception:
+                pass
+
+        page_ref = None
+        if '/Pg' in element:
+            try:
+                # Find the page number
+                page_obj = element.Pg
+                for i, page in enumerate(pdf.pages):
+                    if page.obj == page_obj:
+                        page_ref = i
+                        break
+            except Exception:
+                pass
+
141
+ # Create node
142
+ node = StructureNode(
143
+ tag_type=tag_type,
144
+ depth=depth,
145
+ mcid=mcid,
146
+ alt_text=alt_text,
147
+ actual_text=actual_text,
148
+ page_ref=page_ref
149
+ )
150
+ parent_node.children.append(node)
151
+
152
+ # Recursively process children
153
+ if '/K' in element:
154
+ _parse_structure_element(element.K, node, depth + 1, pdf, max_depth)
155
+
156
+
157
+def format_tree_text(root: StructureNode, max_nodes: int = 500) -> str:
+    """
+    Format structure tree as indented text with box-drawing characters.
+
+    Args:
+        root: Root StructureNode
+        max_nodes: Maximum number of nodes to display
+
+    Returns:
+        Formatted text representation
+    """
+    lines = []
+    node_count = [0]  # Use a list so the nested function can mutate it
+
+    def _format_node(node: StructureNode, prefix: str = "", is_last: bool = True):
+        if node_count[0] >= max_nodes:
+            if node_count[0] == max_nodes:
+                lines.append(f"{prefix}... (truncated, tree too large)")
+                node_count[0] += 1
+            return
+
+        # Format node info
+        info = node.tag_type
+        if node.mcid is not None:
+            info += f" [MCID: {node.mcid}]"
+        if node.alt_text:
+            alt = node.alt_text if len(node.alt_text) <= 30 else node.alt_text[:30] + "..."
+            info += f" (Alt: {alt})"
+        if node.actual_text:
+            text = node.actual_text if len(node.actual_text) <= 30 else node.actual_text[:30] + "..."
+            info += f" (Text: {text})"
+        if node.page_ref is not None:
+            info += f" [Page {node.page_ref + 1}]"
+
+        # Add line with appropriate prefix
+        if node.depth == 0:
+            lines.append(info)
+        else:
+            connector = "└── " if is_last else "├── "
+            lines.append(f"{prefix}{connector}{info}")
+
+        node_count[0] += 1
+
+        # Process children
+        if node.children:
+            extension = "    " if is_last else "│   "
+            new_prefix = prefix + extension if node.depth > 0 else ""
+
+            for i, child in enumerate(node.children):
+                is_last_child = (i == len(node.children) - 1)
+                _format_node(child, new_prefix, is_last_child)
+
+    _format_node(root)
+    return "\n".join(lines)
+
+
+def get_tree_statistics(root: StructureNode) -> Dict[str, Any]:
+    """
+    Calculate statistics about the structure tree.
+
+    Args:
+        root: Root StructureNode
+
+    Returns:
+        Dictionary of statistics
+    """
+    node_count = 0
+    max_depth = 0
+    tag_counts = Counter()
+    pages_with_structure = set()
+    nodes_with_alt_text = 0
+    nodes_with_actual_text = 0
+    mcid_count = 0
+
+    def _traverse(node: StructureNode):
+        nonlocal node_count, max_depth, nodes_with_alt_text, nodes_with_actual_text, mcid_count
+
+        node_count += 1
+        max_depth = max(max_depth, node.depth)
+        tag_counts[node.tag_type] += 1
+
+        if node.page_ref is not None:
+            pages_with_structure.add(node.page_ref)
+        if node.alt_text:
+            nodes_with_alt_text += 1
+        if node.actual_text:
+            nodes_with_actual_text += 1
+        if node.mcid is not None:
+            mcid_count += 1
+
+        for child in node.children:
+            _traverse(child)
+
+    _traverse(root)
+
+    return {
+        'total_nodes': node_count,
+        'max_depth': max_depth,
+        'tag_type_counts': dict(tag_counts.most_common()),
+        'pages_with_structure': sorted(pages_with_structure),
+        'nodes_with_alt_text': nodes_with_alt_text,
+        'nodes_with_actual_text': nodes_with_actual_text,
+        'mcid_count': mcid_count
+    }
+
+
+def format_statistics_markdown(stats: Dict[str, Any]) -> str:
+    """Format tree statistics as Markdown."""
+    lines = [
+        "## Structure Tree Statistics",
+        "",
+        f"**Total Nodes**: {stats['total_nodes']}",
+        f"**Maximum Depth**: {stats['max_depth']}",
+        f"**Nodes with Alt Text**: {stats['nodes_with_alt_text']}",
+        f"**Nodes with Actual Text**: {stats['nodes_with_actual_text']}",
+        f"**MCID References**: {stats['mcid_count']}",
+        "",
+        "### Tag Type Distribution",
+        ""
+    ]
+
+    for tag, count in stats['tag_type_counts'].items():
+        lines.append(f"- **{tag}**: {count}")
+
+    lines.extend([
+        "",
+        f"**Pages with Structure**: {len(stats['pages_with_structure'])}"
+    ])
+
+    if stats['pages_with_structure']:
+        page_list = ", ".join(str(p + 1) for p in stats['pages_with_structure'][:20])
+        if len(stats['pages_with_structure']) > 20:
+            page_list += f" ... ({len(stats['pages_with_structure']) - 20} more)"
+        lines.append(f"({page_list})")
+
+    return "\n".join(lines)
+
+
+def extract_mcid_for_page(pdf_path: str, page_index: int) -> List[int]:
+    """
+    Extract all MCIDs (Marked Content IDs) from a page's content stream.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_index: 0-based page index
+
+    Returns:
+        List of MCIDs found in the page
+    """
+    import re
+
+    try:
+        with pikepdf.open(pdf_path) as pdf:
+            page = pdf.pages[page_index]
+
+            # Get the content stream
+            if '/Contents' not in page:
+                return []
+
+            contents = page.Contents
+            if isinstance(contents, pikepdf.Array):
+                # Multiple content streams: concatenate them
+                stream_data = b"".join(bytes(stream.read_bytes()) for stream in contents)
+            else:
+                stream_data = bytes(contents.read_bytes())
+
+            # Decode the content stream. Latin-1 maps every byte value, so this
+            # cannot fail, and PDF operator names are ASCII anyway.
+            content_text = stream_data.decode('latin-1')
+
+            # Extract MCIDs using regex
+            # Pattern: /MCID <number>, as in "/MCID 3 BDC" or "<< /MCID 3 >> BDC"
+            mcid_pattern = r'/MCID\s+(\d+)'
+            matches = re.findall(mcid_pattern, content_text)
+
+            return [int(m) for m in matches]
+
+    except Exception as e:
+        print(f"Error extracting MCIDs: {e}")
+        return []
+
+
+def map_blocks_to_tags(pdf_path: str, page_index: int, blocks) -> List[Dict[str, Any]]:
+    """
+    Map visual blocks to structure tree tags via MCIDs.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_index: 0-based page index
+        blocks: List of BlockInfo objects from extract_blocks_spans
+
+    Returns:
+        List of mappings with block index, tag info, MCID, alt text
+    """
+    # Extract structure tree
+    root = extract_structure_tree(pdf_path)
+    if not root:
+        return []
+
+    # Get MCIDs from page
+    page_mcids = extract_mcid_for_page(pdf_path, page_index)
+
+    # Build MCID to structure node mapping
+    mcid_to_node = {}
+
+    def _find_mcids(node: StructureNode):
+        if node.mcid is not None and (node.page_ref is None or node.page_ref == page_index):
+            mcid_to_node[node.mcid] = node
+        for child in node.children:
+            _find_mcids(child)
+
+    _find_mcids(root)
+
+    # Create mappings. Pairing the i-th MCID with the i-th block is a
+    # positional heuristic: it assumes blocks and marked-content sequences
+    # occur in the same order, which is common but not guaranteed.
+    mappings = []
+    for i, mcid in enumerate(page_mcids):
+        if i < len(blocks) and mcid in mcid_to_node:
+            node = mcid_to_node[mcid]
+            mappings.append({
+                'block_index': i,
+                'tag_type': node.tag_type,
+                'mcid': mcid,
+                'alt_text': node.alt_text or "",
+                'actual_text': node.actual_text or ""
+            })
+
+    return mappings
+
+
+def detect_visual_paragraphs(blocks, vertical_gap_threshold: float = 15.0) -> List[List[int]]:
+    """
+    Detect visual paragraphs based on spacing heuristics.
+
+    Args:
+        blocks: List of BlockInfo objects
+        vertical_gap_threshold: Minimum vertical gap to consider a paragraph break
+
+    Returns:
+        List of paragraph groups, where each group is a list of block indices
+    """
+    # Filter to non-empty text blocks
+    text_blocks = [(i, b) for i, b in enumerate(blocks) if b.block_type == 0 and b.text.strip()]
+    if not text_blocks:
+        return []
+
+    # Sort by vertical position (top to bottom)
+    text_blocks.sort(key=lambda x: x[1].bbox[1])
+
+    paragraphs = []
+    current_paragraph = [text_blocks[0][0]]
+    prev_bbox = text_blocks[0][1].bbox
+
+    for idx, block in text_blocks[1:]:
+        bbox = block.bbox
+
+        # Calculate vertical gap to the previous block
+        vertical_gap = bbox[1] - prev_bbox[3]
+
+        # Check if blocks are roughly aligned horizontally (same column)
+        horizontal_overlap = min(bbox[2], prev_bbox[2]) - max(bbox[0], prev_bbox[0])
+
+        if vertical_gap < vertical_gap_threshold and horizontal_overlap > 0:
+            # Same paragraph
+            current_paragraph.append(idx)
+        else:
+            # New paragraph
+            paragraphs.append(current_paragraph)
+            current_paragraph = [idx]
+
+        prev_bbox = bbox
+
+    # Add last paragraph
+    if current_paragraph:
+        paragraphs.append(current_paragraph)
+
+    return paragraphs
+
+
+def detect_semantic_paragraphs(pdf_path: str, page_index: int) -> List[StructureNode]:
+    """
+    Extract semantic paragraph tags (<P>) from the structure tree.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_index: 0-based page index
+
+    Returns:
+        List of StructureNode objects with tag_type='P' for the page
+    """
+    root = extract_structure_tree(pdf_path)
+    if not root:
+        return []
+
+    paragraphs = []
+
+    def _find_paragraphs(node: StructureNode):
+        if node.tag_type == 'P' and (node.page_ref is None or node.page_ref == page_index):
+            paragraphs.append(node)
+        for child in node.children:
+            _find_paragraphs(child)
+
+    _find_paragraphs(root)
+    return paragraphs
+
+
+def compare_paragraphs(visual_paragraphs: List[List[int]], semantic_paragraphs: List[StructureNode]) -> Dict[str, Any]:
+    """
+    Compare visual and semantic paragraph detection.
+
+    Args:
+        visual_paragraphs: List of visual paragraph groups (block indices)
+        semantic_paragraphs: List of semantic <P> tags
+
+    Returns:
+        Dictionary with comparison statistics
+    """
+    visual_count = len(visual_paragraphs)
+    semantic_count = len(semantic_paragraphs)
+
+    # Calculate match quality score (simple heuristic)
+    if visual_count == 0 and semantic_count == 0:
+        match_score = 1.0
+    elif visual_count == 0 or semantic_count == 0:
+        match_score = 0.0
+    else:
+        # Score based on count similarity
+        match_score = min(visual_count, semantic_count) / max(visual_count, semantic_count)
+
+    return {
+        'visual_count': visual_count,
+        'semantic_count': semantic_count,
+        'match_score': match_score,
+        'count_mismatch': abs(visual_count - semantic_count)
+    }
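The scoring branch in `compare_paragraphs` reduces to a small pure function; a minimal sketch (the `match_score` name is illustrative, not part of the module):

```python
# Standalone sketch of the count-similarity score used by compare_paragraphs.

def match_score(visual_count, semantic_count):
    if visual_count == 0 and semantic_count == 0:
        return 1.0   # nothing to detect on either side: trivially consistent
    if visual_count == 0 or semantic_count == 0:
        return 0.0   # one side found paragraphs, the other none
    return min(visual_count, semantic_count) / max(visual_count, semantic_count)

print(match_score(8, 8))  # 1.0 -> tagging matches the visual layout
print(match_score(8, 4))  # 0.5 -> half the visual paragraphs are tagged
print(match_score(8, 0))  # 0.0 -> untagged document
```

Note this compares only counts, not positions, so a document with eight visual paragraphs and eight misordered `<P>` tags still scores 1.0; it is a coarse tagging-coverage signal rather than an alignment check.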