Space: rianders/pdfinspector

rianders committed on Jan 22

Commit

0d61aa0

1 Parent(s): 27fda3f

Fix file load errors and implement auto-refresh functionality

Files changed (7)

AGENTS.md +377 -0
DEBUGGING_WORKFLOW.md +72 -0
TEST_PLAN.md +76 -0
USABILITY_AUDIT.md +45 -0
app.py +316 -461
layout_utils.py +174 -0
screen_reader_sim.py +2 -2

AGENTS.md ADDED

	@@ -0,0 +1,377 @@

+# CLAUDE.md
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+## Project Overview
+PDF Structure Inspector is a Gradio-based web application designed for debugging PDF accessibility, reading order, and structure. It helps identify issues that affect screen readers and assistive technologies by analyzing PDF structure, text extraction quality, and layout ordering.
+**Target deployment**: Hugging Face Spaces (gradio SDK)
+## Commands
+### Development
+```bash
+# Run the Gradio app locally
+uv run python app.py
+# The app will launch at http://localhost:7860 by default
+```
+### Dependencies
+```bash
+# Sync environment (after cloning or pulling changes)
+uv sync
+# Add a new dependency
+uv add <package>
+# Add a dev dependency
+uv add --dev <package>
+# Regenerate requirements.txt for Hugging Face deployment (after dependency changes)
+uv pip compile pyproject.toml -o requirements.txt
+```
+The project uses `pyproject.toml` for dependency management with uv lock file support. **Always use `uv run`** for running commands in the development environment.
+## Architecture
+### Core Libraries
+- **PyMuPDF (fitz)**: Layout extraction, block/span detection, rendering pages as images
+- **pikepdf**: Low-level PDF structure inspection (tags, MarkInfo, OCProperties, page resources)
+- **Gradio**: Web UI framework
+- **PIL (Pillow)**: Image manipulation for overlay rendering
+### Main Application Flow (app.py)
+The application has two main modes: **Single Page Analysis** and **Batch Analysis**.
+#### Single Page Analysis Pipeline
+1. **PDF Structure Analysis** (`pdf_struct_report`):
+   - Uses pikepdf to inspect PDF-level metadata
+   - Checks for StructTreeRoot (tagging), MarkInfo, OCProperties (layers)
+   - Analyzes per-page resources (fonts, XObjects)
+2. **Layout Extraction** (`extract_blocks_spans`):
+   - Uses PyMuPDF's `get_text("dict")` to extract blocks, lines, and spans with bounding boxes
+   - Returns structured `BlockInfo` objects containing text, bbox, font info, and span details
+   - Block types: 0=text, 1=image, 2=drawing
+3. **Reading Order Analysis** (`order_blocks`):
+   - Three ordering modes:
+     - `raw`: Extraction order (as stored in PDF)
+     - `tblr`: Top-to-bottom, left-to-right sorting by bbox
+     - `columns`: Simple 2-column heuristic (clusters by x-center, sorts each column separately)
+4. **Diagnostic Heuristics** (`diagnose_page`):
+   - Detects scanned pages (no text + images)
+   - Identifies text-as-vector-outlines (no text + many drawings)
+   - Flags Type3 fonts (often correlate with broken text extraction)
+   - Detects garbled text (replacement characters, missing ToUnicode)
+   - Guesses multi-column layouts (x-center clustering)
+5. **Adaptive Contrast Detection** (for visualization):
+   - `sample_background_color()`: Samples page at 9 points (corners, edges, center) to determine background
+   - `calculate_luminance()`: Uses WCAG formula to compute relative luminance (0-1)
+   - `get_contrast_colors()`: Returns appropriate color palette based on luminance
+   - Background colors cached per page for performance
+6. **Visualization** (`render_page_with_overlay`):
+   - Renders page at specified DPI using PyMuPDF
+   - Automatically detects background and chooses contrasting overlay colors
+   - Overlays numbered block rectangles showing reading order
+   - Optionally shows span-level boxes
+   - Flags math-like content using regex heuristics (`_looks_like_math`)
+7. **Result Formatting** (`format_diagnostic_summary`):
+   - Generates Markdown with severity icons (✓, ⚠️, ❌)
+   - Includes inline explanations from `DIAGNOSTIC_HELP` dictionary
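
The `tblr` and `columns` orderings from step 3 can be sketched as follows. This is a minimal illustration, assuming blocks are reduced to bare `(x0, y0, x1, y1)` tuples; the names `order_tblr` and `order_columns` are hypothetical and are not taken from `app.py`. The column split uses the median x-center as the divider, which is how the notes in this file describe the heuristic.

```python
from statistics import median
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def _top_left_key(bboxes: List[BBox], i: int) -> Tuple[float, float]:
    """Sort key: top edge first, then left edge."""
    return (bboxes[i][1], bboxes[i][0])


def order_tblr(bboxes: List[BBox]) -> List[int]:
    """Return block indices sorted top-to-bottom, then left-to-right."""
    return sorted(range(len(bboxes)), key=lambda i: _top_left_key(bboxes, i))


def order_columns(bboxes: List[BBox]) -> List[int]:
    """Split blocks at the median x-center, then read each column top-down."""
    if not bboxes:
        return []
    centers = [(b[0] + b[2]) / 2 for b in bboxes]
    divider = median(centers)
    left = [i for i, c in enumerate(centers) if c <= divider]
    right = [i for i, c in enumerate(centers) if c > divider]
    key = lambda i: _top_left_key(bboxes, i)
    return sorted(left, key=key) + sorted(right, key=key)


# Two columns with two rows each, stored row by row.
page = [
    (50, 100, 250, 150),   # 0: left column, top
    (300, 100, 500, 150),  # 1: right column, top
    (50, 200, 250, 250),   # 2: left column, bottom
    (300, 200, 500, 250),  # 3: right column, bottom
]
tblr_order = order_tblr(page)        # reads across the page: 0, 1, 2, 3
column_order = order_columns(page)   # reads down each column: 0, 2, 1, 3
```

On this layout `tblr` interleaves the two columns, which is the broken reading order the `columns` mode is meant to avoid.
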
+#### Batch Analysis Pipeline
+1. **Multi-Page Processing** (`diagnose_all_pages`):
+   - Analyzes multiple pages (configurable max_pages and sample_rate)
+   - Progress tracking via `gr.Progress()`
+   - Calls `diagnose_page()` for each page with timing
+   - Returns `BatchAnalysisResult` dataclass
+2. **Aggregation** (`aggregate_results`):
+   - Counts issues across all pages
+   - Identifies critical pages (3+ issues)
+   - Detects common issues (affecting >50% of pages)
+3. **Result Formatting**:
+   - `format_batch_summary_markdown()`: Executive summary with statistics
+   - `format_batch_results_table()`: Color-coded HTML table per page
+   - `format_batch_results_chart()`: Plotly bar chart of issue distribution
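
The aggregation rules above (critical pages have 3+ issues, common issues affect more than 50% of pages) can be sketched like this. `PageFlags` is a hypothetical, trimmed-down stand-in for `PageDiagnostic`, and which flags count as "issues" is an assumption made for the example, not something this file specifies.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PageFlags:
    """Hypothetical subset of the per-page diagnostic flags."""
    page_num: int
    has_type3_fonts: bool = False
    suspicious_garbled_text: bool = False
    likely_scanned_image_page: bool = False
    multi_column_guess: bool = False


# Assumption: every boolean flag above counts as one issue.
ISSUE_FIELDS = (
    "has_type3_fonts",
    "suspicious_garbled_text",
    "likely_scanned_image_page",
    "multi_column_guess",
)


def aggregate(pages: List[PageFlags]) -> Tuple[Dict[str, int], List[str], List[int]]:
    """Return (issue counts, issues on >50% of pages, pages with 3+ issues)."""
    counts = {name: sum(getattr(p, name) for p in pages) for name in ISSUE_FIELDS}
    common = [name for name, n in counts.items() if pages and n / len(pages) > 0.5]
    critical = [
        p.page_num
        for p in pages
        if sum(getattr(p, name) for name in ISSUE_FIELDS) >= 3
    ]
    return counts, common, critical


sample_pages = [
    PageFlags(1, has_type3_fonts=True, suspicious_garbled_text=True,
              likely_scanned_image_page=True),
    PageFlags(2, suspicious_garbled_text=True),
    PageFlags(3, suspicious_garbled_text=True),
    PageFlags(4),
]
counts, common, critical = aggregate(sample_pages)
```

Here garbled text appears on 3 of 4 pages, so it is reported as a common issue, and page 1 is the only critical page.
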
+### Advanced Analysis Modules
+The application includes specialized modules for advanced PDF accessibility analysis:
+**advanced_analysis.py** - Coordinator module
+- Provides facade functions with error handling
+- `require_structure_tree` decorator: checks for tagged PDFs before execution
+- `safe_execute` decorator: comprehensive error handling with user-friendly messages
+- Exports high-level functions: `analyze_content_stream`, `analyze_screen_reader`, etc.
+**content_stream_parser.py** - PDF operator extraction
+- `extract_content_stream_for_block()`: Gets operators for a specific block
+- `_parse_text_objects()`: Extracts BT...ET blocks from content stream
+- `_parse_operators()`: Regex-based parsing of Tm, Tf, Tj, TJ, Td, color operators
+- `_find_matching_text_object()`: Correlates text objects with BlockInfo via text matching
+- Returns formatted markdown and raw stream text
+**screen_reader_sim.py** - Accessibility simulation
+- `simulate_screen_reader()`: Main simulation function
+- `_simulate_tagged()`: Follows structure tree for tagged PDFs
+- `_simulate_untagged()`: Falls back to visual order for untagged PDFs
+- `_format_element_announcement()`: Generates NVDA/JAWS-style announcements
+- Supports heading levels, paragraphs, figures, formulas, lists, tables, links
+- Infers headings from font size (>18pt = H1, >14pt = H2) for untagged PDFs
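
The font-size rule for untagged PDFs (>18pt = H1, >14pt = H2) can be sketched as below. The function names and the announcement wording are hypothetical; the real `_format_element_announcement()` may phrase things differently.

```python
from typing import Optional


def infer_heading_level(font_size_pt: float) -> Optional[int]:
    """Map a font size to a heading level using the thresholds from this file."""
    if font_size_pt > 18:
        return 1
    if font_size_pt > 14:
        return 2
    return None  # treated as body text


def announce(text: str, font_size_pt: float) -> str:
    """Prefix inferred headings the way a screen reader might (wording assumed)."""
    level = infer_heading_level(font_size_pt)
    if level is None:
        return text
    return f"Heading level {level}, {text}"


title_line = announce("Introduction", 20.0)
body_line = announce("This paper studies...", 11.0)
```

Both comparisons are strict, so text set at exactly 18pt is treated as H2 and text at exactly 14pt as body text.
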
+**structure_tree.py** - Structure tree analysis
+- `StructureNode` dataclass: represents PDF tag hierarchy
+- `extract_structure_tree()`: Recursively parses StructTreeRoot with pikepdf
+- `_parse_structure_element()`: Handles Dictionary, Array, and MCID elements
+- `format_tree_text()`: Creates indented text view with box-drawing characters
+- `get_tree_statistics()`: Counts nodes, tags, alt text coverage
+- `extract_mcid_for_page()`: Finds marked content IDs in page content stream
+- `map_blocks_to_tags()`: Correlates visual blocks with structure elements
+- `detect_visual_paragraphs()`: Spacing-based paragraph detection
+- `detect_semantic_paragraphs()`: Extracts `<P>` tags for a page
+- `compare_paragraphs()`: Calculates match quality between visual and semantic
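
Spacing-based paragraph detection, as described for `detect_visual_paragraphs()`, can be approximated with a single pass over blocks sorted by their top edge. This is a sketch under stated assumptions: the function name `group_by_vertical_gap` is hypothetical, blocks are bare `(x0, y0, x1, y1)` tuples, and the 15pt default is the threshold this file gives for the Paragraph Detection feature.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def group_by_vertical_gap(bboxes: List[BBox], max_gap: float = 15.0) -> List[List[int]]:
    """Group block indices into paragraphs when the vertical gap is below max_gap."""
    groups: List[List[int]] = []
    prev_bottom = None
    for i in sorted(range(len(bboxes)), key=lambda i: bboxes[i][1]):
        top, bottom = bboxes[i][1], bboxes[i][3]
        if prev_bottom is not None and top - prev_bottom < max_gap:
            groups[-1].append(i)   # close enough: same paragraph
        else:
            groups.append([i])     # first block, or a large gap: new paragraph
        prev_bottom = bottom
    return groups


blocks = [
    (50, 100, 500, 120),  # 0
    (50, 125, 500, 145),  # 1: 5pt below block 0
    (50, 180, 500, 200),  # 2: 35pt below block 1
]
paragraphs = group_by_vertical_gap(blocks)
```

This ignores columns entirely, so on a multi-column page it would need to run per column.
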
+### Key Data Structures
+**Single Page Analysis**:
+- `SpanInfo`: Individual text run with bbox, text, font, size
+- `BlockInfo`: Text/image block with bbox, text, type, and list of spans
+**Batch Analysis**:
+- `PageDiagnostic`: Per-page diagnostic results with all issue flags and processing time
+- `BatchAnalysisResult`: Aggregated statistics across multiple pages including:
+  - `summary_stats`: Dictionary of issue counts
+  - `per_page_results`: List of PageDiagnostic objects
+  - `common_issues`: Issues affecting >50% of pages
+  - `critical_pages`: Pages with 3+ issues
+  - `to_dict()`: Method to convert to JSON-serializable format
+**Advanced Analysis**:
+- `StructureNode`: Represents a node in the PDF structure tree with:
+  - `tag_type`: Tag name (P, H1, Document, Figure, etc.)
+  - `depth`: Nesting level in the tree
+  - `mcid`: Marked Content ID (links to page content)
+  - `alt_text`: Alternative text for accessibility
+  - `actual_text`: Actual text content or replacement text
+  - `page_ref`: 0-based page index
+  - `children`: List of child StructureNode objects
+  - `to_dict()`: Convert to JSON-serializable format
+**UI State**:
+- The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
+- Background color cache: `_bg_color_cache` dict keyed by (document_path, page_index)
+### Gradio UI Flow
+The UI is organized into three tabs: **Single Page Analysis**, **Batch Analysis**, and **Advanced Analysis**.
+#### Single Page Tab
+1. User uploads PDF → `_on_upload` → extracts path and page count
+2. User adjusts parameters (page, DPI, order mode, visualization options)
+3. Click "Analyze" → `analyze` function:
+   - Runs structural report (pikepdf)
+   - Extracts and orders blocks (PyMuPDF)
+   - Generates diagnostic report with adaptive contrast detection
+   - Creates overlay image with high-contrast colors
+   - Returns reading order preview + formatted summary with icons
+#### Batch Analysis Tab
+1. User sets max_pages (default 100) and sample_rate (default 1)
+2. Click "Analyze All Pages" → `analyze_batch_with_progress` function:
+   - Calls `diagnose_all_pages()` with progress tracking
+   - Aggregates results across pages
+   - Returns:
+     - Summary markdown with statistics and common issues
+     - Plotly bar chart of issue distribution
+     - Color-coded HTML table of per-page results
+     - Full JSON report
+#### Advanced Analysis Tab
+Power-user features for deep PDF inspection and accessibility debugging. Each feature is in its own accordion:
+1. **Content Stream Inspector**:
+   - Extracts raw PDF content stream operators for a specific block
+   - Shows low-level commands: text positioning (Tm, Td), fonts (Tf), text display (Tj, TJ)
+   - Useful for debugging text extraction, font issues, and positioning problems
+   - Provides both formatted view and raw stream
+   - Uses regex parsing of content streams (approximate for complex PDFs)
+2. **Screen Reader Simulator**:
+   - Simulates NVDA or JAWS reading behavior for the current page
+   - Two modes:
+     - **Tagged PDFs**: Follows structure tree, announces headings/paragraphs/figures with proper semantics
+     - **Untagged PDFs**: Falls back to visual reading order, infers headings from font size
+   - Three detail levels: minimal (text only), default (element announcements), verbose (full context)
+   - Generates transcript + analysis with alt text coverage statistics
+   - Reading order configurable for untagged fallback (raw/tblr/columns)
+3. **Paragraph Detection**:
+   - Compares visual paragraphs (detected by spacing) vs semantic `<P>` tags
+   - Visual detection: groups blocks with vertical gap < threshold (default 15pt)
+   - Semantic detection: extracts `<P>` tags from structure tree
+   - Generates color-coded overlay (green = visual paragraphs)
+   - Reports match quality score and mismatches
+   - Requires tagged PDF for semantic comparison
+4. **Structure Tree Visualizer**:
+   - Extracts complete PDF tag hierarchy from StructTreeRoot
+   - Three visualization formats:
+     - **Tree Diagram**: Interactive Plotly sunburst chart
+     - **Text View**: Indented text with box-drawing characters
+     - **Statistics**: Node counts, tag distribution, alt text coverage
+   - Shows tag types (H1-H6, P, Figure, Table, L, LI, etc.)
+   - Displays alt text, actual text, page references, and MCID markers
+   - Only works for tagged PDFs
+5. **Block-to-Tag Mapping**:
+   - Maps visual blocks to structure tree elements via MCID (Marked Content ID)
+   - Shows which blocks have proper semantic tagging
+   - DataFrame output with block index, tag type, MCID, alt text
+   - Helps identify untagged content
+   - Requires tagged PDF with MCID references
+#### Help & Documentation
+- All UI controls have `info` parameters with inline tooltips
+- Expandable "📖 Understanding the Diagnostics" accordion with detailed explanations
+- `DIAGNOSTIC_HELP` and `ORDERING_MODE_HELP` dictionaries provide explanation text
+- Summary sections use severity icons (✓, ⚠️, ❌) for quick scanning
+## Key Features
+### Adaptive Contrast Overlays
+The overlay visualization automatically adapts to document background colors:
+- **Light backgrounds** (luminance > 0.5) → Dark overlays (dark blue #00008B, black text)
+- **Dark backgrounds** (luminance ≤ 0.5) → Light overlays (yellow #FFFF00, white text)
+- Background sampled at 9 strategic points using low DPI (72) for performance
+- Results cached in `_bg_color_cache` to avoid re-sampling
+- Color palettes defined in `LIGHT_BG_COLORS` and `DARK_BG_COLORS` constants
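
The luminance threshold above relies on the WCAG relative-luminance formula. A minimal sketch, assuming 8-bit sRGB input; the function names are hypothetical, and the return values `"dark"` and `"light"` stand in for the `LIGHT_BG_COLORS` and `DARK_BG_COLORS` palettes.

```python
from typing import Tuple


def relative_luminance(rgb: Tuple[int, int, int]) -> float:
    """WCAG relative luminance of an 8-bit sRGB color, in the range 0 to 1."""
    def linearize(channel: int) -> float:
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def overlay_style(luminance: float) -> str:
    """Pick dark overlays on light backgrounds and light overlays otherwise."""
    return "dark" if luminance > 0.5 else "light"


white_style = overlay_style(relative_luminance((255, 255, 255)))
black_style = overlay_style(relative_luminance((0, 0, 0)))
gray_style = overlay_style(relative_luminance((128, 128, 128)))
```

Because the formula is nonlinear, mid-gray (128, 128, 128) has a luminance of about 0.22 and therefore gets light overlays, not dark ones.
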
+### Inline Help System
+Comprehensive documentation integrated into the UI:
+- `info` parameters on all controls provide contextual tooltips
+- Expandable accordion with detailed explanations of all diagnostics and modes
+- Help text stored in `DIAGNOSTIC_HELP` and `ORDERING_MODE_HELP` dictionaries
+- Summary formatting includes severity icons and inline explanations
+### Batch Analysis
+Multi-page document analysis with aggregate statistics:
+- Configurable limits: max_pages (default 100), sample_rate (analyze every Nth page)
+- Real-time progress tracking via `gr.Progress()`
+- Outputs: summary stats, issue chart, per-page table, full JSON report
+- Performance: ~10-50ms per page depending on complexity
+- Identifies common issues (>50% of pages) and critical pages (3+ issues)
+## Important Implementation Notes
+### PDF Handling
+- Always use pikepdf for structural queries (tags, resources)
+- Always use PyMuPDF (fitz) for layout extraction and rendering
+- Page indices are 0-based internally, 1-based in UI (convert with `page_num - 1`)
+- Close documents properly using context managers (`with fitz.open()`, `with pikepdf.open()`)
+### Coordinate Systems
+- PyMuPDF bboxes are (x0, y0, x1, y1) in PDF points (1/72 inch)
+- PIL/ImageDraw expects integer pixel coordinates
+- Use `_rect_i()` to convert float bboxes to int for drawing
+- DPI scaling is handled by PyMuPDF's `get_pixmap(dpi=...)`
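
Since bboxes are in PDF points and the rendered image is in pixels, a box drawn on a page rendered at a given DPI has to be scaled by `dpi / 72`. The sketch below shows that arithmetic; `points_to_pixels` is a hypothetical helper, and whether `app.py` scales boxes exactly this way is an assumption.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def points_to_pixels(bbox: BBox, dpi: int) -> Tuple[int, int, int, int]:
    """Scale a bbox from PDF points (1/72 inch) to integer pixels at the given DPI."""
    scale = dpi / 72.0
    x0, y0, x1, y1 = (int(round(v * scale)) for v in bbox)
    return (x0, y0, x1, y1)


# A one-inch square, half an inch from the top, rendered at 144 DPI.
pixel_box = points_to_pixels((72.0, 36.0, 144.0, 108.0), dpi=144)
```

At 72 DPI the scale factor is 1, so points and pixels coincide.
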
+### Heuristics Limitations
+- Column detection is crude (assumes max 2 columns, uses median x-center as divider)
+- Math detection is pattern-based (Unicode symbols + LaTeX-like patterns)
+- All diagnostics are heuristic; tagged PDFs with proper structure should be preferred
+- Type3 font detection is string-based and may have false positives
+### Gradio Patterns
+- File upload provides `.name` attribute for file path
+- Use `gr.update()` to modify component properties dynamically (e.g., slider maximum)
+- State management relies on component values, not session storage
+- Use `gr.Progress()` parameter in callbacks for long-running operations (batch analysis)
+- Tabs organize related functionality (`gr.Tabs()` with `gr.Tab()` children)
+- Accordions (`gr.Accordion()`) for progressive disclosure of help text and detailed results
+### Adaptive Contrast Implementation
+- Always render at low DPI (72) for background sampling to avoid performance impact
+- Sample 9 points: 4 corners + 4 edge midpoints + 1 center (at 5%, 50%, 95% positions)
+- Use `statistics.median()` instead of mean to avoid outliers from text/graphics
+- Cache key format: `(document.name, page_index)` tuple
+- Clear cache on new document upload if memory becomes an issue
+- Fallback to `LIGHT_BG_COLORS` if sampling fails or `auto_contrast=False`
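
The 9-point sampling and median step can be sketched without any PDF library by working on pixel coordinates and a list of sampled colors. The helper names are hypothetical; the 5%, 50%, 95% positions and the use of `statistics.median()` come from the notes above.

```python
from statistics import median
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]


def sample_positions(width: int, height: int) -> List[Tuple[int, int]]:
    """Return 9 (x, y) pixel positions at 5%, 50%, and 95% of each dimension."""
    fractions = (0.05, 0.50, 0.95)
    return [
        (min(int(width * fx), width - 1), min(int(height * fy), height - 1))
        for fy in fractions
        for fx in fractions
    ]


def median_color(pixels: Sequence[RGB]) -> RGB:
    """Per-channel median, so a few samples that land on text do not skew the result."""
    reds, greens, blues = zip(*pixels)
    return (int(median(reds)), int(median(greens)), int(median(blues)))


positions = sample_positions(612, 792)  # US Letter rendered at 72 DPI
# Seven samples hit white paper, two hit black text.
samples = [(255, 255, 255)] * 7 + [(0, 0, 0)] * 2
background = median_color(samples)
```

With these samples the median is pure white, whereas a mean would give roughly 198 per channel and understate how light the page is.
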
+### Batch Analysis Performance
+- Default max_pages=100 prevents timeout on large documents
+- Sample rate allows analyzing every Nth page (useful for 500+ page documents)
+- Each page takes ~10-50ms depending on complexity (text extraction + diagnostics)
+- Progress updates every page to keep UI responsive
+- Use dataclasses instead of dicts for better memory efficiency
+- Consider adding timeout protection for very large documents (1000+ pages)
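
One way to combine `max_pages` and `sample_rate` is shown below. This is a sketch only: it assumes `max_pages` caps the page range that is scanned and `sample_rate` then picks every Nth page within it, which this file does not spell out. `pages_to_analyze` is a hypothetical name.

```python
from typing import List


def pages_to_analyze(total_pages: int, max_pages: int = 100, sample_rate: int = 1) -> List[int]:
    """Return 0-based page indices: every Nth page within the first max_pages pages."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be at least 1")
    return list(range(0, min(total_pages, max_pages), sample_rate))


short_doc = pages_to_analyze(10, sample_rate=3)
long_doc = pages_to_analyze(500)
sampled_long_doc = pages_to_analyze(500, max_pages=500, sample_rate=100)
```

With the defaults, a 500-page document is cut off after its first 100 pages; raising `max_pages` and `sample_rate` together spreads the same amount of work across the whole document.
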
+### Result Formatting
+- Use Markdown with severity icons for human-readable summaries
+- Icons: ✓ (no issues), ⚠️ (warnings), ❌ (critical issues)
+- HTML tables for detailed per-page results allow custom styling (color-coded cells)
+- Plotly charts via `gr.Plot()` for interactive visualizations
+- All batch results have `.to_dict()` method for JSON export
+### Advanced Analysis Error Handling
+- **Graceful Degradation**: All advanced features check for requirements before execution
+- **Structure Tree Required**: Features 2, 4, 5 require tagged PDFs
+  - `@require_structure_tree` decorator checks for StructTreeRoot
+  - Returns user-friendly error message if not found
+  - Explains what tagging is and why it's needed
+- **Safe Execution**: All features wrapped in `@safe_execute` decorator
+  - Catches all exceptions with traceback
+  - Returns formatted error messages instead of crashing
+- **Content Stream Parsing**: Regex-based, may fail on complex/malformed PDFs
+  - Returns "not matched" status if text object not found
+  - Shows raw stream even if parsing fails
+- **MCID Extraction**: May fail if content stream uses non-standard encoding
+  - Returns empty list on failure
+  - Block-to-tag mapping shows "No mappings found" message
+- **Performance Limits**: Structure tree extraction has max_depth=20 to prevent infinite loops
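
A decorator in the spirit of `@safe_execute` might look like the sketch below. Only the decorator name comes from this file; the body, the Markdown error format, and the choice to return a string are assumptions for illustration, not the code in `advanced_analysis.py`.

```python
import functools
import traceback
from typing import Any, Callable


def safe_execute(func: Callable[..., Any]) -> Callable[..., Any]:
    """Return a formatted error message instead of letting an exception propagate."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            details = traceback.format_exc()
            # Hypothetical message layout: headline plus traceback for debugging.
            return f"## Error\n\n{type(exc).__name__}: {exc}\n\n{details}"
    return wrapper


@safe_execute
def risky(page_index: int) -> int:
    """Toy function that rejects negative page indices."""
    if page_index < 0:
        raise ValueError("page index must be non-negative")
    return page_index * 2


ok_result = risky(2)
error_result = risky(-1)
```

Returning a string keeps the Gradio callback alive, at the cost that callers must tell an error message apart from a normal result.
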
+## Testing
+### Manual Testing Checklist
+1. **Adaptive Contrast**: Test with light and dark background PDFs, verify overlay colors contrast properly
+2. **Help System**: Hover over all controls, expand help accordion, verify all text displays correctly
+3. **Batch Analysis**: Test with 1-page, 10-page, and 100+ page documents
+4. **Edge Cases**: Scanned PDFs, multi-column layouts, math-heavy documents, Type3 fonts
+### Performance Benchmarks
+- Single page analysis: <1 second for typical pages
+- Batch analysis: ~10-50ms per page (100 pages in 1-5 seconds)
+- Background sampling adds ~50-100ms one-time cost per page
+- Memory usage: ~10-20MB per 100 pages of diagnostic data
+## Deployment to Hugging Face
+### Pre-deployment Steps
+1. Test locally: `uv run python app.py`
+2. Regenerate requirements.txt: `uv pip compile pyproject.toml -o requirements.txt`
+3. Commit both `pyproject.toml` and `requirements.txt`
+4. Verify `app.py` is set as `app_file` in README.md frontmatter
+### Hugging Face Configuration
+- SDK: gradio
+- SDK version: 6.3.0 (or latest compatible)
+- Python version: >=3.12 (as specified in pyproject.toml)
+- Main file: app.py
+### Known Limitations on Hugging Face
+- Very large PDFs (1000+ pages) may hit timeout limits
+- Recommend setting max_pages=100 by default
+- Consider adding explicit timeout handling for batch analysis

DEBUGGING_WORKFLOW.md ADDED

	@@ -0,0 +1,72 @@

+# PDF Debugging Workflow
+This guide details how to use the PDF Inspector tool to diagnose and remediate common PDF accessibility issues.
+## 1. Initial Compatibility Check
+**Goal**: Determine if the document requires major remediation before detailed analysis.
+1.  **Upload the PDF**: Use the file uploader or select an example from the list.
+2.  **Run Single Page Analysis**: Click "Analyze".
+3.  **Check for Alerts**: Look for the "Accessibility Alert" box at the top of the summary.
+    *   **Untagged Document**: If you see this, the document lacks the "Structure Tree" required for screen readers.
+        *   *Remediation*: Open the source file (Word/PPT) and "Save as PDF" with tags enabled, or use Adobe Acrobat Pro's "Autotag" feature.
+    *   **Scanned Page**: If you see this, the page is an image with no selectable text.
+        *   *Remediation*: Perform Optical Character Recognition (OCR) using Adobe Acrobat or a similar tool.
+## 2. Detailed Single-Page Inspection
+**Goal**: Verify reading order and content types on a specific page.
+1.  **Visual Inspection**: Look at the "Analysis Results" image.
+    *   **Red Boxes**: Indicate detected text blocks.
+    *   **Numbers**: Show the reading order.
+2.  **Verify Reading Order**:
+    *   Does the order (1, 2, 3...) follow the logical flow of the document?
+    *   *Issue*: If columns are read left-to-right across the page instead of down the column, the reading order is broken.
+    *   *Fix*: This usually requires manual retagging in Acrobat (Order panel).
+3.  **Check for Artifacts**:
+    *   Are headers/footers marked as text blocks? (They should generally be artifacts/ignored by screen readers).
+## 3. Advanced Diagnostics
+**Goal**: Deep dive into specific issues using the "Advanced Analysis" tab.
+### Content Stream Inspector
+*   **Use when**: Text looks correct visually but copies weirdly or reads wrong (e.g., "fi" ligature issues).
+*   **Action**: Select a block and click "Extract Operators".
+*   **Look for**: `TJ` or `Tj` operators showing garbled characters or strange spacing adjustments.
+### Screen Reader Simulator
+*   **Use when**: You want to "hear" what a user hears.
+*   **Action**: Select "NVDA" and click "Generate Transcript".
+*   **Check**:
+    *   Are headings announced as "Heading Level X"?
+    *   Is alt text read for images?
+    *   Is the reading order intelligible?
+### Paragraph Detection
+*   **Use when**: Text seems run-on or broken into too many fragments.
+*   **Action**: Click "Analyze Paragraphs".
+*   **Check**:
+    *   **Visual vs. Semantic**: Large discrepancies suggest the `<P>` tags don't match the visual layout, which can confuse users navigating by paragraph.
+### Structure Tree Visualizer
+*   **Use when**: The document is tagged, but navigation is broken.
+*   **Action**: Click "Extract Structure Tree".
+*   **Check**:
+    *   Hierarchy depth.
+    *   Correct nesting (e.g., `L` -> `LI` -> `LBody`).
+## 4. Batch Analysis for Large Documents
+**Goal**: Identify problematic pages in a long report.
+1.  **Go to Batch Analysis Tab**.
+2.  **Run Batch**: Analyze 50-100 pages.
+3.  **Review the Report**:
+    *   **Issues Found**: Look for "Scanned Pages" or "Garbled Text".
+    *   **Page List**: Use the list of page numbers to target your remediation efforts.
+## Summary Checklist
+- [ ] Document is Tagged (`/StructTreeRoot` exists)
+- [ ] Text is selectable (not an image/scan)
+- [ ] Reading order is logical (columns handled correctly)
+- [ ] Images have Alt Text (or are marked as artifacts)
+- [ ] Headings use Heading tags (`<H1>`, `<H2>`), not just bold text.

TEST_PLAN.md ADDED

	@@ -0,0 +1,76 @@

+# PDF Inspector - Test Plan
+## Overview
+This test plan outlines verification steps for the PDF Inspector application using the provided example documents. Since all currently included examples are **untagged** documents, this plan focuses on verifying the "Untagged" detection logic, fallback heuristics (math detection, reading order), and error handling.
+## Test Environment
+- **URL**: http://127.0.0.1:7860
+- **Browsers**: Chrome / Safari / Firefox (Any modern browser)
+---
+## 1. Test Case: Untagged Document Detection
+**Target Document**: `test_document.pdf`
+| Step | Action | Expected Result | Pass/Fail |
+|------|--------|-----------------|-----------|
+| 1.1 | Select `test_document.pdf` from Examples. | File loads into the input box. | |
+| 1.2 | Click **Analyze** button. | Analysis completes; "Analysis Results" image appears. | |
+| 1.3 | Check Summary Report. | **Alert**: "⚠️ Accessibility Alert: Untagged Document" is visible. | |
+| 1.4 | Go to **Advanced Analysis** tab. | Tab opens. | |
+| 1.5 | Open **4. Structure Tree Visualizer** and click **Extract**. | **Result**: "## No Structure Tree Found" message. | |
+**Success Criteria**: The application correctly identifies the document as untagged and prevents structure-dependent tools from crashing.
+---
+## 2. Test Case: Math & Visual Block Detection
+**Target Document**: `18.1 Notes.pdf` (Handwritten/Math Slides)
+| Step | Action | Expected Result | Pass/Fail |
+|------|--------|-----------------|-----------|
+| 2.1 | Select `18.1 Notes.pdf` from Examples. | File loads. | |
+| 2.2 | Click **Analyze** button. | Analysis completes (~1-2 seconds). | |
+| 2.3 | Inspect "Page overlay" image. | - **Red Boxes**: Detected around text blocks.<br>- **Math Highlight**: Math formulas (e.g., integrals, sums) should have specific bounding boxes. | |
+| 2.4 | Check Summary Report. | **Alert**: "Untagged Document". <br> **Stats**: Should show > 0 "Math-like blocks detected". | |
+**Success Criteria**: The heuristic regex-based math detection works on the text extracted from the slides.
+---
+## 3. Test Case: Screen Reader Simulation (Untagged Fallback)
+**Target Document**: `logic.pdf` (Academic Text)
+| Step | Action | Expected Result | Pass/Fail |
+|------|--------|-----------------|-----------|
+| 3.1 | Select `logic.pdf`. | File loads. | |
+| 3.2 | Click **Analyze**. | Analysis completes. | |
+| 3.3 | Go to **Advanced Analysis** -> **2. Screen Reader Simulator**. | Accordion opens. | |
+| 3.4 | Set **Reading Order** to "Raw" or "TBLR". | Settings accepted. | |
+| 3.5 | Click **Generate Transcript**. | **Result**: Transcript appears in the textbox.<br> **Header**: "⚠️ Simulated from visual order (PDF not tagged)".<br> **Content**: Contains readable text (e.g., "A Logical Interpretation..."). | |
+**Success Criteria**: The simulator successfully uses the fallback logic (visual ordering) instead of crashing when no structure tree is present.
+---
+## 4. Test Case: Feature Availability Check (Negative Testing)
+**Target Document**: Any of the above
+| Step | Action | Expected Result | Pass/Fail |
+|------|--------|-----------------|-----------|
+| 4.1 | Open **5. Block-to-Tag Mapping**. | Accordion opens. | |
+| 4.2 | Click **Map Blocks to Tags**. | **Result**: "## No Mappings Found" (because there are no tags). | |
+| 4.3 | Open **3. Paragraph Detection** and click **Analyze**. | **Result**: Visual paragraphs are detected (green boxes), but **Semantic <P> Tags** count is 0. | |
+**Success Criteria**: Features requiring tags explicitly state that tags are missing rather than showing empty/broken UIs.
+### 1.6 Landscape / Rotated Documents
+- **Why**: Ensure overlays align correctly on rotated pages.
+- **Test**:
+  - Load a PDF with landscape pages (or 90-degree rotation).
+  - Verify that the blue/red bounding boxes align perfectly with the text.
+  - Verify that "reading order" flows logically (e.g., top-left of the *visual* page).
+## Known Limitations / Expected Behavior
+*   **Untagged Alerts**: All examples provided are untagged; the alert is **expected behavior**.
+*   **Reading Order**: Without tags, reading order is a guess. Columns might be read left-to-right across the page in "Raw" mode.

USABILITY_AUDIT.md ADDED

	@@ -0,0 +1,45 @@

+# Usability & Workflow Audit Report
+## 1. Overview
+This audit focused on the "Exploration" workflow: how easily a user can navigate a document, identify issues, and understand the relationship between the visual layout and the underlying structure (tags/order).
+**Tested Document**: `logic.pdf` (Untagged, Academic Text)
+## 2. Critical Friction Points
+### 2.1. Disconnected Views (Split Attention)
+*   **Issue**: The "Visual Map" (colored boxes showing reading order) and the "Tools" (Screen Reader, Structure Tree) live in separate tabs.
+*   **Impact**: A user cannot see *why* the Screen Reader is reading text in a specific order because the visual map disappears when they switch to the "Advanced Analysis" tab.
+*   **Severity**: **High**. It breaks the mental model of "Cause (Visual Block) -> Effect (Screen Reader Output)".
+### 2.2. Broken Feedback Loops (No Auto-Update)
+*   **Issue**: Changing critical exploration controls—specifically **Page Number** and **Order Mode** (Raw/TBLR)—does not immediately update the visualization.
+*   **Impact**: The user must click "Analyze" after every minor adjustment. This discourages exploration and makes "A/B testing" settings (like toggling between Raw and TBLR sorting) frustratingly slow.
+*   **Severity**: **High**.
+### 2.3. Stale State in Advanced Tools
+*   **Issue**: When the global Page Number is changed, the "Screen Reader Simulator" text remains on the previous page's content until "Generate" is manually clicked.
+*   **Impact**: Users may mistakenly analyze the wrong page text.
+*   **Severity**: Medium.
+### 2.4. Hidden Navigation
+*   **Issue**: The "Pages" gallery component is empty/gray, providing no visual cues for navigation. Users are forced to guess page numbers.
+*   **Severity**: Low/Medium.
+## 3. Recommended Solutions
+### 3.1. "Unified Explorer" Layout
+Refactor the UI to a split-screen design:
+*   **Left Panel (Persistent)**: The Main Page Visualizer (Image with overlays). This remains visible at all times.
+*   **Right Panel (Contextual)**: Tabbed interface for "Summary", "Screen Reader", "Structure Tree", and "Paragraphs".
+*   **Benefit**: Users can run the Screen Reader simulation while looking at the visual block numbers to verify the path.
+### 3.2. Reactive Controls
+*   **Fix**: Wire the **Page Number** input and **Order Mode** radio buttons to trigger the analysis function automatically (with a debounce if necessary).
+*   **Fix**: Ensure Advanced Tools listen to the global page number and auto-refresh (or show a "Refresh Needed" indicator).
+### 3.3. Navigation Clarity
+*   Add simple "Previous / Next" buttons next to the page number for easier sequential browsing.
+## 4. Conclusion
+The current tool works correctly but requires excessive clicking and context switching. Implementing the **Unified Explorer Layout** and **Reactive Controls** should remove most of that friction and reduce the cognitive load of debugging.

app.py CHANGED

@@ -1,6 +1,6 @@
 # app.py
 from __future__ import annotations
 import math
 import re
 import time
@@ -76,19 +76,16 @@ def _rect_i(rect: Tuple[float, float, float, float]) -> Tuple[int, int, int, int
     x0, y0, x1, y1 = rect
     return (int(round(x0)), int(round(y0)), int(round(x1)), int(round(y1)))
-def _safe_str(x: Any, max_len: int = 400) -> str:
-    s = str(x)
-    if len(s) > max_len:
-        s = s[:max_len] + "…"
-    return s
-def _looks_like_math(text: str) -> bool:
-    # Heuristic: mathy glyphs/symbols and patterns
-    if not text:
-        return False
-    math_syms = r"[∑∫√≈≠≤≥∞±×÷∂∇∈∩∪⊂⊆⊇⊃→↦∀∃ℝℤℚℕ]"
-    latexy = r"(\\frac|\\sqrt|\\sum|\\int|_|\^|\b(?:sin|cos|tan|log|ln)\b)"
-    return bool(re.search(math_syms, text) or re.search(latexy, text))
 # -----------------------------
 # Background Color Sampling for Adaptive Contrast
@@ -159,72 +156,20 @@ def get_contrast_colors(luminance: float) -> Dict[str, Tuple[int, int, int, int]
     """
     return LIGHT_BG_COLORS if luminance > 0.5 else DARK_BG_COLORS
-@dataclass
-class SpanInfo:
-    bbox: Tuple[float, float, float, float]
-    text: str
-    font: str
-    size: float
-@dataclass
-class BlockInfo:
-    bbox: Tuple[float, float, float, float]
-    text: str
-    block_type: int  # 0 text, 1 image, 2 drawing in PyMuPDF terms for some outputs
-    spans: List[SpanInfo]
-@dataclass
-class PageDiagnostic:
-    """Extended diagnostic for batch processing."""
-    page_num: int
-    tagged_pdf: bool
-    text_len: int
-    image_block_count: int
-    font_count: int
-    has_type3_fonts: bool
-    suspicious_garbled_text: bool
-    likely_scanned_image_page: bool
-    likely_text_as_vector_outlines: bool
-    multi_column_guess: bool
-    processing_time_ms: Optional[int] = None
-@dataclass
-class BatchAnalysisResult:
-    """Aggregate results from all pages."""
-    total_pages: int
-    pages_analyzed: int
-    summary_stats: Dict[str, int]
-    per_page_results: List[PageDiagnostic]
-    common_issues: List[str]
-    critical_pages: List[int]
-    processing_time_sec: float
-    def to_dict(self) -> Dict[str, Any]:
-        """Convert to JSON-serializable dict."""
-        return {
-            "total_pages": self.total_pages,
-            "pages_analyzed": self.pages_analyzed,
-            "summary_stats": self.summary_stats,
-            "per_page_results": [
-                {
-                    "page_num": p.page_num,
-                    "tagged_pdf": p.tagged_pdf,
-                    "text_len": p.text_len,
-                    "image_block_count": p.image_block_count,
-                    "font_count": p.font_count,
-                    "has_type3_fonts": p.has_type3_fonts,
-                    "suspicious_garbled_text": p.suspicious_garbled_text,
-                    "likely_scanned_image_page": p.likely_scanned_image_page,
-                    "likely_text_as_vector_outlines": p.likely_text_as_vector_outlines,
-                    "multi_column_guess": p.multi_column_guess,
-                    "processing_time_ms": p.processing_time_ms,
-                }
-                for p in self.per_page_results
-            ],
-            "common_issues": self.common_issues,
-            "critical_pages": self.critical_pages,
-            "processing_time_sec": self.processing_time_sec,
-        }
 # -----------------------------
 # PDF structural checks (pikepdf)
@@ -282,83 +227,6 @@ def pdf_struct_report(pdf_path: str) -> Dict[str, Any]:
 # Layout extraction + ordering (PyMuPDF)
 # -----------------------------
-def extract_blocks_spans(doc: fitz.Document, page_index: int) -> List[BlockInfo]:
-    page = doc[page_index]
-    raw = page.get_text("dict")  # includes blocks/lines/spans with bboxes
-    blocks: List[BlockInfo] = []
-    for b in raw.get("blocks", []):
-        btype = int(b.get("type", -1))
-        bbox = tuple(b.get("bbox", (0, 0, 0, 0)))
-        text_parts: List[str] = []
-        spans: List[SpanInfo] = []
-        if btype == 0:  # text
-            for line in b.get("lines", []):
-                for sp in line.get("spans", []):
-                    t = sp.get("text", "")
-                    if t:
-                        text_parts.append(t)
-                    spans.append(
-                        SpanInfo(
-                            bbox=tuple(sp.get("bbox", (0, 0, 0, 0))),
-                            text=t,
-                            font=_safe_str(sp.get("font", "")),
-                            size=float(sp.get("size", 0.0)),
-                        )
-                    )
-        text = "".join(text_parts).strip()
-        blocks.append(BlockInfo(bbox=bbox, text=text, block_type=btype, spans=spans))
-    return blocks
-def order_blocks(blocks: List[BlockInfo], mode: str) -> List[Tuple[int, BlockInfo]]:
-    """
-    Return list of (idx, block) in chosen order.
-    """
-    indexed = list(enumerate(blocks))
-    if mode == "raw":
-        return indexed
-    def key_tblr(item: Tuple[int, BlockInfo]) -> Tuple[int, int]:
-        _, b = item
-        x0, y0, x1, y1 = b.bbox
-        return (int(y0), int(x0))
-    if mode == "tblr":
-        return sorted(indexed, key=key_tblr)
-    if mode == "columns":
-        # Simple 2-column heuristic:
-        # cluster by x-center around midline, then sort within each column.
-        # This is a heuristic; tagged PDFs should make this unnecessary.
-        xs = []
-        for _, b in indexed:
-            x0, y0, x1, y1 = b.bbox
-            if (x1 - x0) > 5:
-                xs.append((x0 + x1) / 2.0)
-        if not xs:
-            return sorted(indexed, key=key_tblr)
-        mid = sorted(xs)[len(xs) // 2]
-        left = []
-        right = []
-        for it in indexed:
-            _, b = it
-            x0, y0, x1, y1 = b.bbox
-            cx = (x0 + x1) / 2.0
-            (left if cx < mid else right).append(it)
-        left = sorted(left, key=key_tblr)
-        right = sorted(right, key=key_tblr)
-        # Read left column first, then right
-        return left + right
-    # Fallback
-    return sorted(indexed, key=key_tblr)
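To see how the three ordering modes differ, here is a minimal sketch that applies the same logic to plain bounding-box tuples instead of `BlockInfo` objects. The function name `order_indices` and the sample coordinates are assumptions made for illustration; only the ordering rules are taken from the function above.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def order_indices(bboxes: List[BBox], mode: str) -> List[int]:
    """Return block indices in 'raw', 'tblr', or 'columns' order."""
    indexed = list(enumerate(bboxes))

    def key_tblr(item):
        _, (x0, y0, _x1, _y1) = item
        return (int(y0), int(x0))  # top-to-bottom, then left-to-right

    def center_x(item):
        _, (x0, _y0, x1, _y1) = item
        return (x0 + x1) / 2.0

    if mode == "raw":
        return [i for i, _ in indexed]
    if mode == "columns":
        # Median x-center of non-degenerate blocks splits the page in two.
        centers = [(x0 + x1) / 2.0 for x0, _, x1, _ in bboxes if (x1 - x0) > 5]
        if centers:
            mid = sorted(centers)[len(centers) // 2]
            left = [it for it in indexed if center_x(it) < mid]
            right = [it for it in indexed if center_x(it) >= mid]
            ordered = sorted(left, key=key_tblr) + sorted(right, key=key_tblr)
            return [i for i, _ in ordered]
    return [i for i, _ in sorted(indexed, key=key_tblr)]


# Two columns, two rows, deliberately listed out of reading order.
page = [
    (320, 300, 520, 400),  # 0: right column, bottom
    (50, 100, 250, 200),   # 1: left column, top
    (50, 300, 250, 400),   # 2: left column, bottom
    (320, 100, 520, 200),  # 3: right column, top
]
for mode in ("raw", "tblr", "columns"):
    print(mode, order_indices(page, mode))
```

On this layout, `tblr` interleaves the two columns row by row, while `columns` finishes the left column before starting the right one.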
-# -----------------------------
-# Render overlay images
-# -----------------------------
 def render_page_with_overlay(
     doc: fitz.Document,
     page_index: int,
@@ -453,7 +321,7 @@ def render_paragraph_overlay(
     draw = ImageDraw.Draw(img, 'RGBA')
     # Extract blocks for bounding boxes
-    blocks = extract_blocks_spans(pdf_path, page_index)
     # Scale factor from PDF points to pixels
     scale = dpi / 72.0
@@ -664,23 +532,37 @@ def format_batch_summary_markdown(batch: BatchAnalysisResult) -> str:
 **Issues Found:**
 """
-    for issue, count in batch.summary_stats.items():
-        pct = (count / batch.pages_analyzed) * 100 if batch.pages_analyzed > 0 else 0
-        icon = "❌" if count > 0 else "✓"
-        issue_name = issue.replace('_', ' ').title()
-        md += f"\n- {icon} **{issue_name}**: {count} pages ({pct:.1f}%)"
-    if batch.common_issues:
-        md += f"\n\n**Common Issues (affecting >50% of pages):**\n"
-        for issue in batch.common_issues:
-            md += f"- {issue.replace('_', ' ').title()}\n"
-    if batch.critical_pages:
-        md += f"\n\n**Critical Pages (3+ issues):**\n"
-        pages_str = ', '.join(map(str, batch.critical_pages[:20]))
-        md += f"Pages: {pages_str}"
-        if len(batch.critical_pages) > 20:
-            md += f" ... and {len(batch.critical_pages) - 20} more"
     return md
@@ -797,15 +679,36 @@ def format_diagnostic_summary(diag: Dict[str, Any], struct: Dict[str, Any]) -> s
 # -----------------------------
 def load_pdf(fileobj) -> Tuple[str, int, str]:
-    # fileobj is a gradio UploadedFile-like with .name
-    pdf_path = fileobj.name
-    with fitz.open(pdf_path) as doc:
-        n = doc.page_count
-    return pdf_path, n, f"Loaded: {pdf_path} ({n} pages)"
 def analyze(pdf_path: str, page_num: int, dpi: int, order_mode: str, show_spans: bool, highlight_math: bool):
     if not pdf_path:
-        return None, {}, "Upload a PDF first."
     # page_num is 1-based in UI
     page_index = max(0, int(page_num) - 1)
@@ -860,7 +763,24 @@ def analyze(pdf_path: str, page_num: int, dpi: int, order_mode: str, show_spans:
     # Generate formatted summary with icons and explanations
     summary = format_diagnostic_summary(diag, struct)
-    return overlay, report, summary
 def analyze_batch_with_progress(
     pdf_path: str,
@@ -891,55 +811,76 @@ def analyze_batch_with_progress(
 # UI
 # -----------------------------
 with gr.Blocks(title="PDF Structure Inspector") as demo:
     gr.Markdown(
         """
 # PDF Structure Inspector (screen reader / reading order / math debugging)
-Upload a PDF and inspect:
-- **Tagged vs untagged**
-- **Text/image blocks**
-- Different **reading order heuristics**
-- Red flags for **OCR-needed**, **text-as-outlines**, **Type3 fonts**, **garbled text**
 """
     )
     with gr.Row():
-        pdf_file = gr.File(label="PDF file", file_types=[".pdf"])
-        status = gr.Textbox(label="Status", interactive=False)
-    with gr.Row():
-        pdf_path = gr.Textbox(label="Internal path", visible=False)
-        page_count = gr.Number(label="Pages", value=1, precision=0, interactive=False)
-    with gr.Tabs():
-        with gr.Tab("Single Page Analysis"):
             with gr.Row():
-                page_num = gr.Slider(label="Page", minimum=1, maximum=1, value=1, step=1)
-                dpi = gr.Slider(label="Render DPI", minimum=72, maximum=300, value=150, step=1)
-            with gr.Row():
-                order_mode = gr.Radio(
-                    ["raw", "tblr", "columns"],
-                    value="raw",
-                    label="Overlay order mode",
-                    info="Choose block ordering strategy. Hover options for details.",
-                )
-                show_spans = gr.Checkbox(
-                    value=False,
-                    label="Show span boxes",
-                    info="Display individual text spans (words/fragments) for font-level debugging"
-                )
-                highlight_math = gr.Checkbox(
-                    value=True,
-                    label="Flag blocks that look mathy",
-                    info="Highlights blocks with math notation (needs MathML or alt text)"
-                )
-            run_btn = gr.Button("Analyze")
-            with gr.Accordion("📖 Understanding the Diagnostics", open=False):
-                gr.Markdown("""
 ### What Each Diagnostic Means
 **🏷️ Tagged PDF**: Tagged PDFs include structure tags (headings, lists, reading order) that screen readers use for navigation. Untagged PDFs force assistive technology to guess the reading order based on visual layout, often leading to incorrect results.
@@ -951,7 +892,7 @@ Upload a PDF and inspect:
 - Screen readers cannot pronounce text correctly
 - Text search doesn't work
-**🔀 Garbled Text**: Replacement characters (�) indicate missing or incorrect ToUnicode mappings in the PDF. Screen readers will mispronounce affected text.
 **✏️ Text as Outlines**: When text is rendered as vector paths instead of actual text, screen readers cannot extract or read it. The document appears to have text visually but is inaccessible.
@@ -959,314 +900,222 @@ Upload a PDF and inspect:
 ### Reading Order Modes
-**Raw**: Shows blocks in the order PyMuPDF extracted them from the PDF structure. This often reflects the order content was added to the PDF during creation, not the intended reading order.
-**Top-to-Bottom Left-to-Right (TBLR)**: Simple geometric sorting that reads from the top of the page to the bottom, breaking ties by left-to-right position. Works well for simple single-column documents.
-**Columns**: Attempts to detect two-column layouts by finding the median x-position and reading the left column completely before the right column. This is a heuristic and may fail on complex layouts.
-                """)
-            with gr.Row():
-                overlay_img = gr.Image(label="Page overlay (blocks/spans labeled)", type="pil")
-            summary = gr.Markdown()
-            report = gr.JSON(label="Report (struct + diagnosis + reading order preview)")
-        with gr.Tab("Batch Analysis"):
-            with gr.Row():
-                batch_max_pages = gr.Slider(
-                    label="Max pages to analyze",
-                    minimum=1,
-                    maximum=500,
-                    value=100,
-                    step=1,
-                    info="Limit analysis for very large documents"
-                )
-                batch_sample_rate = gr.Slider(
-                    label="Sample rate",
-                    minimum=1,
-                    maximum=10,
-                    value=1,
-                    step=1,
-                    info="Analyze every Nth page (1 = all pages)"
-                )
-            batch_run_btn = gr.Button("Analyze All Pages", variant="primary")
-            batch_progress = gr.Textbox(label="Progress", interactive=False)
-            with gr.Accordion("Summary Statistics", open=True):
-                batch_summary_md = gr.Markdown()
-            with gr.Accordion("Issue Breakdown", open=True):
-                batch_chart = gr.Plot(label="Issues by Type")
-            with gr.Accordion("Per-Page Results", open=False):
-                batch_table = gr.HTML()
-            batch_json = gr.JSON(label="Full Batch Report", visible=False)
-        # Advanced Analysis Tab
-        with gr.Tab("Advanced Analysis"):
-            gr.Markdown("""
-            # Advanced PDF Accessibility Analysis
-            Power-user features for deep PDF inspection and accessibility debugging.
-            These tools help diagnose complex accessibility issues and understand internal PDF structure.
-            """)
-            with gr.Accordion("1. Content Stream Inspector", open=False):
-                gr.Markdown("""
-                **Purpose**: Inspect raw PDF content stream operators for a specific block
-                Shows the low-level PDF commands that render text and graphics. Useful for debugging
-                text extraction issues, font problems, and positioning.
-                """)
-                cs_block_dropdown = gr.Dropdown(
-                    label="Select Block",
-                    choices=[],
-                    info="Choose a text or image block to inspect"
-                )
-                cs_inspect_btn = gr.Button("Extract Operators", variant="primary")
-                with gr.Tabs():
-                    with gr.Tab("Formatted"):
-                        cs_operator_display = gr.Markdown()
-                    with gr.Tab("Raw Stream"):
-                        cs_raw_stream = gr.Code(label="Raw PDF Content Stream")
-            with gr.Accordion("2. Screen Reader Simulator", open=False):
-                gr.Markdown("""
-                **Purpose**: Simulate how NVDA or JAWS would read this page
-                Generates a transcript showing what a screen reader user would hear, including
-                element announcements and reading order. Works with both tagged and untagged PDFs.
-                """)
-                with gr.Row():
-                    sr_reader = gr.Radio(
-                        ["NVDA", "JAWS"],
-                        value="NVDA",
-                        label="Screen Reader",
-                        info="Choose which screen reader to simulate"
-                    )
-                    sr_detail = gr.Radio(
-                        ["minimal", "default", "verbose"],
-                        value="default",
-                        label="Detail Level",
-                        info="How much context information to include"
-                    )
-                    sr_order = gr.Radio(
-                        ["raw", "tblr", "columns"],
-                        value="tblr",
-                        label="Reading Order (untagged fallback)",
-                        info="Used only if PDF has no structure tree"
-                    )
-                sr_btn = gr.Button("Generate Transcript", variant="primary")
-                with gr.Tabs():
-                    with gr.Tab("Transcript"):
-                        sr_transcript = gr.Textbox(
-                            lines=20,
-                            label="Screen Reader Output",
-                            interactive=False
-                        )
-                    with gr.Tab("Analysis"):
-                        sr_analysis = gr.Markdown()
-            with gr.Accordion("3. Paragraph Detection", open=False):
-                gr.Markdown("""
-                **Purpose**: Compare visual paragraphs vs semantic paragraph tags
-                Identifies paragraphs based on spacing (visual) and compares them to &lt;P&gt; tags
-                in the structure tree (semantic). Mismatches can cause confusion for screen reader users.
-                """)
-                para_threshold = gr.Slider(
-                    label="Vertical Gap Threshold (points)",
-                    minimum=5,
-                    maximum=30,
-                    value=15,
-                    step=1,
-                    info="Minimum vertical spacing to consider a paragraph break"
-                )
-                para_btn = gr.Button("Analyze Paragraphs", variant="primary")
-                para_overlay = gr.Image(label="Paragraph Visualization", type="pil")
-                with gr.Row():
-                    para_visual = gr.Number(label="Visual Paragraphs", interactive=False)
-                    para_semantic = gr.Number(label="Semantic <P> Tags", interactive=False)
-                    para_score = gr.Number(label="Match Quality", interactive=False)
-                para_mismatches = gr.Markdown()
-            with gr.Accordion("4. Structure Tree Visualizer", open=False):
-                gr.Markdown("""
-                **Purpose**: Display the complete PDF tag hierarchy
-                Shows the entire structure tree for tagged PDFs, including tag types, alt text,
-                and page references. Only works for PDFs with accessibility tagging.
-                """)
-                struct_btn = gr.Button("Extract Structure Tree", variant="primary")
-                with gr.Tabs():
-                    with gr.Tab("Tree Diagram"):
-                        struct_plot = gr.Plot(label="Interactive Hierarchy")
-                    with gr.Tab("Text View"):
-                        struct_text = gr.Textbox(
-                            lines=30,
-                            label="Structure Tree",
-                            interactive=False
-                        )
-                    with gr.Tab("Statistics"):
-                        struct_stats = gr.Markdown()
-            with gr.Accordion("5. Block-to-Tag Mapping", open=False):
-                gr.Markdown("""
-                **Purpose**: Link visual blocks to structure tree elements
-                Maps each visual block to its corresponding tag in the structure tree via
-                MCID (Marked Content ID) references. Shows which content is properly tagged.
-                """)
-                map_btn = gr.Button("Map Blocks to Tags", variant="primary")
-                map_message = gr.Markdown()
-                map_table = gr.DataFrame(
-                    headers=["Block #", "Tag Type", "MCID", "Alt Text"],
-                    label="Block-to-Tag Correlations",
-                    interactive=False
-                )
-    def _on_upload(f):
         path, n, msg = load_pdf(f)
         return path, n, msg, gr.update(maximum=n, value=1)
-    pdf_file.change(_on_upload, inputs=[pdf_file], outputs=[pdf_path, page_count, status, page_num])
-    run_btn.click(
-        analyze,
-        inputs=[pdf_path, page_num, dpi, order_mode, show_spans, highlight_math],
-        outputs=[overlay_img, report, summary],
-    )
-    batch_run_btn.click(
-        analyze_batch_with_progress,
-        inputs=[pdf_path, batch_max_pages, batch_sample_rate],
-        outputs=[batch_summary_md, batch_chart, batch_table, batch_json, batch_progress]
-    )
-    # Advanced Analysis Callbacks
     def update_block_dropdown(pdf_path_val, page_num_val):
         """Update block dropdown when page changes."""
         if not pdf_path_val:
             return gr.update(choices=[], value=None)
         try:
-            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
-            choices = create_block_choices(blocks)
-            return gr.update(choices=choices, value=0 if choices else None)
         except:
             return gr.update(choices=[], value=None)
     def run_content_stream_inspector(pdf_path_val, page_num_val, block_idx):
-        """Run content stream analysis for selected block."""
         if not pdf_path_val or block_idx is None:
             return "Please select a block", ""
         try:
-            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
-            result = analyze_content_stream(pdf_path_val, page_num_val - 1, block_idx, blocks)
             if result.get('error'):
                 return result['message'], ""
             return result['formatted'], result['raw']
         except Exception as e:
             return f"## Error\n\n{str(e)}", ""
     def run_screen_reader_sim(pdf_path_val, page_num_val, reader, detail, order):
-        """Run screen reader simulation."""
         if not pdf_path_val:
             return "Please upload a PDF first", ""
         try:
-            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
-            result = analyze_screen_reader(pdf_path_val, page_num_val - 1, blocks, reader, detail, order)
             if result.get('error'):
                 return result.get('message', 'Error'), ""
             return result['transcript'], result['analysis']
         except Exception as e:
             return f"## Error\n\n{str(e)}", ""
     def run_paragraph_detection(pdf_path_val, page_num_val, dpi_val, threshold):
-        """Run paragraph detection and comparison."""
         if not pdf_path_val:
             return None, 0, 0, 0.0, "Please upload a PDF first"
         try:
-            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
-            result = analyze_paragraphs(pdf_path_val, page_num_val - 1, blocks, threshold)
             if result.get('error'):
                 return None, 0, 0, 0.0, result.get('message', 'Error')
-            # Create visualization overlay
             overlay = render_paragraph_overlay(
                 pdf_path_val, page_num_val - 1, dpi_val,
                 result['visual_paragraphs'], result['semantic_paragraphs']
             )
             return (
-                overlay,
-                result['visual_count'],
-                result['semantic_count'],
-                result['match_score'],
-                result['mismatches']
             )
         except Exception as e:
             return None, 0, 0, 0.0, f"## Error\n\n{str(e)}"
     def run_structure_tree_extraction(pdf_path_val):
-        """Extract and visualize structure tree."""
         if not pdf_path_val:
             return None, "Please upload a PDF first", ""
         try:
             result = analyze_structure_tree(pdf_path_val)
             if result.get('error'):
                 return None, result['message'], ""
             return result['plot_data'], result['text_view'], result['statistics']
         except Exception as e:
             return None, f"## Error\n\n{str(e)}", ""
     def run_block_tag_mapping(pdf_path_val, page_num_val):
-        """Map blocks to structure tags."""
         if not pdf_path_val:
             return "Please upload a PDF first", []
         try:
-            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
-            result = analyze_block_tag_mapping(pdf_path_val, page_num_val - 1, blocks)
             if result.get('error'):
                 return result.get('message', 'Error'), []
             return result['message'], result['mappings']
         except Exception as e:
             return f"## Error\n\n{str(e)}", []
-    # Wire up Advanced Analysis callbacks
     page_num.change(
-        update_block_dropdown,
-        inputs=[pdf_path, page_num],
-        outputs=[cs_block_dropdown]
     )
     cs_inspect_btn.click(
@@ -1299,6 +1148,12 @@ Upload a PDF and inspect:
         outputs=[map_message, map_table]
     )
 if __name__ == "__main__":
-    demo.launch()

 # app.py
 from __future__ import annotations
+import os
 import math
 import re
 import time
     x0, y0, x1, y1 = rect
     return (int(round(x0)), int(round(y0)), int(round(x1)), int(round(y1)))
+# _safe_str and _looks_like_math now live in layout_utils and are imported from there
+# further down in this module, so their local definitions have been removed.
+# pdf_struct_report uses _safe_str; render_page_with_overlay uses _looks_like_math.
 # -----------------------------
 # Background Color Sampling for Adaptive Contrast
     """
     return LIGHT_BG_COLORS if luminance > 0.5 else DARK_BG_COLORS
+# Layout logic now lives in layout_utils.py
+from layout_utils import (
+    SpanInfo,
+    BlockInfo,
+    extract_blocks_spans,
+    order_blocks,
+    _safe_str,
+    _looks_like_math,
+    PageDiagnostic,
+    BatchAnalysisResult
+)
+# Re-exporting for compatibility if needed, using the imported names directly from now on.
 # -----------------------------
 # PDF structural checks (pikepdf)
 # Layout extraction + ordering (PyMuPDF)
 # -----------------------------
 def render_page_with_overlay(
     doc: fitz.Document,
     page_index: int,
     draw = ImageDraw.Draw(img, 'RGBA')
     # Extract blocks for bounding boxes
+    blocks = extract_blocks_spans(doc, page_index)
     # Scale factor from PDF points to pixels
     scale = dpi / 72.0
 **Issues Found:**
 """
+    md += "\n\n**Detailed Breakdown:**\n"
+    # Map diagnostic attribute names to human-readable labels
+    issue_map = {
+        'likely_scanned_image_page': 'Scanned Pages',
+        'has_type3_fonts': 'Type3 Fonts',
+        'suspicious_garbled_text': 'Garbled Text',
+        'multi_column_guess': 'Multi-Column (Untagged)',
+        'likely_text_as_vector_outlines': 'Text as Outlines'
+    }
+    for issue_attr, issue_name in issue_map.items():
+        # Find pages with this issue
+        affected_pages = []
+        for p in batch.per_page_results:
+            if getattr(p, issue_attr, False):
+                affected_pages.append(p.page_num)
+        if affected_pages:
+            icon = "❌"
+            count = len(affected_pages)
+            pct = (count / batch.pages_analyzed) * 100 if batch.pages_analyzed > 0 else 0
+            # Format page list (truncate if too long)
+            page_list_str = ", ".join(map(str, affected_pages[:30]))
+            if len(affected_pages) > 30:
+                page_list_str += f" ... ({len(affected_pages) - 30} more)"
+            md += f"\n### {icon} {issue_name}: {count} pages ({pct:.1f}%)\n"
+            md += f"**Pages**: {page_list_str}\n"
     return md
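The per-issue formatting in this hunk comes down to two small calculations: a percentage guarded against division by zero, and a page list truncated after 30 entries. A minimal sketch of both, using hypothetical helper names (`issue_percentage`, `format_page_list`) that do not appear in the commit:

```python
from typing import List


def issue_percentage(count: int, pages_analyzed: int) -> float:
    """Share of analyzed pages affected, or 0.0 when nothing was analyzed."""
    return (count / pages_analyzed) * 100 if pages_analyzed > 0 else 0.0


def format_page_list(pages: List[int], limit: int = 30) -> str:
    """Join page numbers, truncating after `limit` entries with a remainder note."""
    shown = ", ".join(str(p) for p in pages[:limit])
    extra = len(pages) - limit
    if extra > 0:
        shown += f" ... ({extra} more)"
    return shown


affected = [2, 5, 9]
print(f"{len(affected)} pages ({issue_percentage(len(affected), 12):.1f}%)")
print(format_page_list(affected, limit=2))
```

Pulling these out as helpers would also make the truncation limit testable without building a full `BatchAnalysisResult`.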
 # -----------------------------
 def load_pdf(fileobj) -> Tuple[str, int, str]:
+    """
+    Robustly load a PDF file and return its path, page count, and status message.
+    Handles Gradio FileData objects, string paths (from examples), and None.
+    """
+    if fileobj is None:
+        return "", 0, "Waiting for PDF upload..."
+    # Extract path from Gradio FileData or use string directly
+    if isinstance(fileobj, str):
+        pdf_path = fileobj
+    elif hasattr(fileobj, "path"):
+        pdf_path = fileobj.path
+    elif hasattr(fileobj, "name"):
+        pdf_path = fileobj.name
+    else:
+        pdf_path = str(fileobj)
+    if not pdf_path or not os.path.exists(pdf_path):
+        return "", 0, f"Error: File not found at {pdf_path}"
+    try:
+        with fitz.open(pdf_path) as doc:
+            n = doc.page_count
+        return pdf_path, n, f"✓ Loaded: {os.path.basename(pdf_path)} ({n} pages)"
+    except Exception as e:
+        return "", 0, f"❌ Error loading PDF: {str(e)}"
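The path-resolution step at the top of `load_pdf` can be exercised without Gradio or PyMuPDF. The sketch below isolates it in a hypothetical helper, `resolve_upload_path`, and uses `SimpleNamespace` objects as stand-ins for upload objects. One deliberate difference from the diff: it skips an attribute whose value is empty, where the original accepts any attribute that exists.

```python
from types import SimpleNamespace
from typing import Any


def resolve_upload_path(fileobj: Any) -> str:
    """Return a filesystem path for the shapes a file input may deliver."""
    if fileobj is None:
        return ""
    if isinstance(fileobj, str):
        return fileobj
    # Prefer .path, then fall back to .name, mirroring the order in load_pdf.
    for attr in ("path", "name"):
        value = getattr(fileobj, attr, None)
        if value:
            return str(value)
    return str(fileobj)


# Stand-ins for the object shapes; these are not real Gradio classes.
print(resolve_upload_path(None))
print(resolve_upload_path("logic.pdf"))
print(resolve_upload_path(SimpleNamespace(path="/tmp/upload/a.pdf")))
print(resolve_upload_path(SimpleNamespace(name="/tmp/upload/b.pdf")))
```

Keeping this step separate from the `fitz.open` call makes it possible to test each input shape without a real PDF on disk.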
 def analyze(pdf_path: str, page_num: int, dpi: int, order_mode: str, show_spans: bool, highlight_math: bool):
     if not pdf_path:
+        return None, {}, "Upload a PDF first.", ""
     # page_num is 1-based in UI
     page_index = max(0, int(page_num) - 1)
     # Generate formatted summary with icons and explanations
     summary = format_diagnostic_summary(diag, struct)
+    # Check for compatibility and prepend warning if needed
+    if not struct.get("has_struct_tree_root"):
+        summary = "### ⚠️ Accessibility Alert: Untagged Document\n\n" + \
+                  "**This document is likely incompatible with screen readers.**\n\n" + \
+                  "It lacks the 'structure tree' (tags) required for accessibility tools to understand headings, paragraphs, and reading order.\n\n" + \
+                  "**What you can do:**\n" + \
+                  "- **Remediate**: Open the original source file (Word, PowerPoint) and save as 'PDF (Best for electronic distribution and accessibility)'\n" + \
+                  "- **Retrofit**: Use Adobe Acrobat Pro's 'Accessibility' tool to auto-tag the document.\n\n" + \
+                  "---\n\n" + summary
+    if diag["likely_scanned_image_page"]:
+        summary = "### ❌ Critical Issue: Scanned Page\n\n" + \
+                  "**This page appears to be an image with no readable text.**\n\n" + \
+                  "Screen readers cannot read this content at all.\n\n" + \
+                  "**Action Required**: Perform Optical Character Recognition (OCR) using Adobe Acrobat or an OCR tool to make the text selectable and readable.\n\n" + \
+                  "---\n\n" + summary
+    return overlay, report, summary, preview
 def analyze_batch_with_progress(
     pdf_path: str,
 # UI
 # -----------------------------
 with gr.Blocks(title="PDF Structure Inspector") as demo:
     gr.Markdown(
         """
 # PDF Structure Inspector (screen reader / reading order / math debugging)
 """
     )
+    # 1. Top Bar: Loader & Global Stats
     with gr.Row():
+        pdf_file = gr.File(label="Upload PDF", file_types=[".pdf"], scale=1)
+        with gr.Column(scale=2):
+             status = gr.Textbox(label="Status", interactive=False)
+             # Hidden states
+             pdf_path = gr.Textbox(visible=False)
+             page_count = gr.Number(visible=False)
+    gr.Examples(
+        examples=["test_document.pdf", "18.1 Notes.pdf", "logic.pdf"],
+        inputs=pdf_file
+    )
+    # 2. Control Panel
+    with gr.Row(variant="panel"):
+        with gr.Column(scale=2):
+            page_num = gr.Slider(label="Page Number", minimum=1, maximum=1, value=1, step=1)
+        with gr.Column(scale=1):
+            dpi = gr.Slider(label="Zoom (DPI)", minimum=72, maximum=300, value=150, step=1)
+        with gr.Column(scale=1):
+            order_mode = gr.Dropdown(
+                ["raw", "tblr", "columns"], value="raw", label="Reading Order",
+                info="Strategy for untagged content"
+            )
+        with gr.Column(scale=2, min_width=200):
             with gr.Row():
+                show_spans = gr.Checkbox(label="Show Spans", value=False)
+                highlight_math = gr.Checkbox(label="Highlight Math", value=True)
+            run_btn = gr.Button("Force Refresh", variant="secondary", size="sm")
+    # 3. Main Workspace (Split View)
+    with gr.Row():
+        # LEFT: Visualization (Persistent)
+        with gr.Column(scale=6):
+            gr.Markdown("### 1. Visual Inspection")
+            overlay_img = gr.Image(label="Page Analysis Overlay (Live)", type="pil", interactive=False, height=800)
+            summary = gr.Markdown(elem_classes=["result-markdown"])
+        # RIGHT: Tools (Contextual)
+        with gr.Column(scale=5):
+            gr.Markdown("### 2. Deep Dive Tools")
+            with gr.Tabs():
+                # --- TAB 1: DETAILS ---
+                with gr.Tab("Details & Structure"):
+                    with gr.Accordion("Reading Order Preview", open=True):
+                         reading_order_preview = gr.Textbox(
+                             label="Detected text flow",
+                             lines=20,
+                             interactive=False,
+                             info="This is the order in which text will be fed to accessibility tools (if untagged)."
+                         )
+                    with gr.Accordion("Full Technical Report (JSON)", open=False):
+                        report = gr.JSON(label="Page Report")
+                    with gr.Accordion("Help: Understanding Diagnostics", open=False):
+                        gr.Markdown("""
 ### What Each Diagnostic Means
 **🏷️ Tagged PDF**: Tagged PDFs include structure tags (headings, lists, reading order) that screen readers use for navigation. Untagged PDFs force assistive technology to guess the reading order based on visual layout, often leading to incorrect results.
 - Screen readers cannot pronounce text correctly
 - Text search doesn't work
+**🔀 Garbled Text**: Replacement characters (�) indicate missing or incorrect ToUnicode mappings in the PDF. Screen readers will mispronounce affected text.
 **✏️ Text as Outlines**: When text is rendered as vector paths instead of actual text, screen readers cannot extract or read it. The document appears to have text visually but is inaccessible.
 ### Reading Order Modes
+**Raw**: Extraction order, i.e. the order in which PyMuPDF found the blocks (often the order of creation).
+**TBLR**: Top-to-bottom, left-to-right geometric sorting.
+**Columns**: Two-column heuristic (clusters by x-position).
+                        """)
+                # --- TAB 2: ADVANCED ---
+                with gr.Tab("Advanced Tools"):
+                     gr.Markdown("Power-user features for deep PDF inspection.")
+                     # 1. Content Stream
+                     with gr.Accordion("1. Content Stream Inspector", open=False):
+                        gr.Markdown("**Inspect raw PDF content stream operators for a specific block**")
+                        cs_block_dropdown = gr.Dropdown(label="Select Block", choices=[], info="Choose a block to inspect")
+                        cs_inspect_btn = gr.Button("Extract Operators", size="sm")
+                        with gr.Tabs():
+                            with gr.Tab("Formatted"):
+                                cs_operator_display = gr.Markdown()
+                            with gr.Tab("Raw"):
+                                cs_raw_stream = gr.Code(label="Raw Stream")
+                     # 2. Screen Reader
+                     with gr.Accordion("2. Screen Reader Simulator", open=True):
+                        gr.Markdown("**Simulate how NVDA or JAWS would read this page**")
+                        with gr.Row():
+                            sr_reader = gr.Radio(["NVDA", "JAWS"], value="NVDA", label="Reader", scale=1)
+                            sr_detail = gr.Radio(["minimal", "default", "verbose"], value="default", label="Detail", scale=1)
+                            sr_order = gr.Radio(["raw", "tblr", "columns"], value="tblr", label="Fallback Order", scale=1)
+                        sr_btn = gr.Button("Generate Transcript", variant="primary")
+                        with gr.Tabs():
+                            with gr.Tab("Transcript"):
+                                sr_transcript = gr.Textbox(lines=15, label="Output", interactive=False)
+                            with gr.Tab("Analysis"):
+                                sr_analysis = gr.Markdown()
+                     # 3. Paragraph Detection
+                     with gr.Accordion("3. Paragraph Detection", open=False):
+                        gr.Markdown("**Compare visual paragraphs vs semantic paragraph tags**")
+                        para_threshold = gr.Slider(label="Gap Threshold", minimum=5, maximum=30, value=15, step=1)
+                        para_btn = gr.Button("Analyze Paragraphs")
+                        para_overlay = gr.Image(label="Paragraph Visualization", type="pil", height=400)
+                        with gr.Row():
+                            para_visual = gr.Number(label="Visual", interactive=False)
+                            para_semantic = gr.Number(label="Semantic <P>", interactive=False)
+                            para_score = gr.Number(label="Match Quality", interactive=False)
+                        para_mismatches = gr.Markdown()
+                     # 4. Structure Tree
+                     with gr.Accordion("4. Structure Tree Visualizer", open=False):
+                        gr.Markdown("**Display the complete PDF tag hierarchy**")
+                        struct_btn = gr.Button("Extract Tree")
+                        with gr.Tabs():
+                            with gr.Tab("Diagram"):
+                                struct_plot = gr.Plot()
+                            with gr.Tab("Text View"):
+                                struct_text = gr.Textbox(lines=20)
+                            with gr.Tab("Stats"):
+                                struct_stats = gr.Markdown()
+                     # 5. Mapping
+                     with gr.Accordion("5. Block-to-Tag Mapping", open=False):
+                        gr.Markdown("**Link visual blocks to structure tree elements**")
+                        map_btn = gr.Button("Map Blocks")
+                        map_message = gr.Markdown()
+                        map_table = gr.DataFrame(headers=["Block #", "Tag Type", "MCID", "Alt Text"])
+                # --- TAB 3: BATCH ---
+                with gr.Tab("Batch Analysis"):
+                    with gr.Row():
+                        batch_max_pages = gr.Slider(label="Max pages", minimum=1, maximum=500, value=100)
+                        batch_sample_rate = gr.Slider(label="Sample rate", minimum=1, maximum=10, value=1)
+                    batch_run_btn = gr.Button("Analyze All Pages", variant="primary")
+                    batch_progress = gr.Textbox(label="Progress", interactive=False)
+                    with gr.Accordion("Summary", open=True):
+                        batch_summary_md = gr.Markdown()
+                    with gr.Accordion("Details", open=False):
+                        batch_chart = gr.Plot()
+                        batch_table = gr.HTML()
+                        batch_json = gr.JSON(visible=False)
+    # --- CALLBACKS & WIRING ---
+    def _on_file_change(f):
         path, n, msg = load_pdf(f)
+        if not path:
+             return path, n, msg, gr.update(maximum=1, value=1)
         return path, n, msg, gr.update(maximum=n, value=1)
+    # Main Analysis Inputs/Outputs
+    # Note: analyze() now returns (overlay, report, summary, preview)
+    analysis_inputs = [pdf_path, page_num, dpi, order_mode, show_spans, highlight_math]
+    analysis_outputs = [overlay_img, report, summary, reading_order_preview]
+    # Upload & Example Triggers
+    pdf_file.change(_on_file_change, inputs=[pdf_file], outputs=[pdf_path, page_count, status, page_num]) \
+            .then(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
+    # Reactive Event Listeners
+    # Note: page_num.change is strictly better for 'Exploration' than release,
+    # as it updates while typing or stepping.
+    page_num.change(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
+    dpi.release(analyze, inputs=analysis_inputs, outputs=analysis_outputs) # DPI is heavy, use release
+    order_mode.change(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
+    show_spans.change(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
+    highlight_math.change(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
+    run_btn.click(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
+    # Advanced Analysis Helper Functions (Closures to capture inputs if needed, or just pure)
     def update_block_dropdown(pdf_path_val, page_num_val):
         """Update block dropdown when page changes."""
         if not pdf_path_val:
             return gr.update(choices=[], value=None)
         try:
+            with fitz.open(pdf_path_val) as doc:
+                blocks = extract_blocks_spans(doc, page_num_val - 1)
+                if not blocks:
+                    return gr.update(choices=[], value=None)
+                choices = create_block_choices(blocks)
+                return gr.update(choices=choices, value=0 if choices else None)
         except Exception:
             return gr.update(choices=[], value=None)
     def run_content_stream_inspector(pdf_path_val, page_num_val, block_idx):
         if not pdf_path_val or block_idx is None:
             return "Please select a block", ""
         try:
+            with fitz.open(pdf_path_val) as doc:
+                blocks = extract_blocks_spans(doc, page_num_val - 1)
+                result = analyze_content_stream(pdf_path_val, page_num_val - 1, block_idx, blocks)
             if result.get('error'):
                 return result['message'], ""
             return result['formatted'], result['raw']
         except Exception as e:
             return f"## Error\n\n{str(e)}", ""
     def run_screen_reader_sim(pdf_path_val, page_num_val, reader, detail, order):
         if not pdf_path_val:
             return "Please upload a PDF first", ""
         try:
+            with fitz.open(pdf_path_val) as doc:
+                blocks = extract_blocks_spans(doc, page_num_val - 1)
+                result = analyze_screen_reader(pdf_path_val, page_num_val - 1, blocks, reader, detail, order)
             if result.get('error'):
                 return result.get('message', 'Error'), ""
             return result['transcript'], result['analysis']
         except Exception as e:
             return f"## Error\n\n{str(e)}", ""
     def run_paragraph_detection(pdf_path_val, page_num_val, dpi_val, threshold):
         if not pdf_path_val:
             return None, 0, 0, 0.0, "Please upload a PDF first"
         try:
+            with fitz.open(pdf_path_val) as doc:
+                blocks = extract_blocks_spans(doc, page_num_val - 1)
+                result = analyze_paragraphs(pdf_path_val, page_num_val - 1, blocks, threshold)
             if result.get('error'):
                 return None, 0, 0, 0.0, result.get('message', 'Error')
             overlay = render_paragraph_overlay(
                 pdf_path_val, page_num_val - 1, dpi_val,
                 result['visual_paragraphs'], result['semantic_paragraphs']
             )
             return (
+                overlay, result['visual_count'], result['semantic_count'],
+                result['match_score'], result['mismatches']
             )
         except Exception as e:
             return None, 0, 0, 0.0, f"## Error\n\n{str(e)}"
     def run_structure_tree_extraction(pdf_path_val):
         if not pdf_path_val:
             return None, "Please upload a PDF first", ""
         try:
             result = analyze_structure_tree(pdf_path_val)
             if result.get('error'):
                 return None, result['message'], ""
             return result['plot_data'], result['text_view'], result['statistics']
         except Exception as e:
             return None, f"## Error\n\n{str(e)}", ""
     def run_block_tag_mapping(pdf_path_val, page_num_val):
         if not pdf_path_val:
             return "Please upload a PDF first", []
         try:
+            with fitz.open(pdf_path_val) as doc:
+                blocks = extract_blocks_spans(doc, page_num_val - 1)
+                result = analyze_block_tag_mapping(pdf_path_val, page_num_val - 1, blocks)
             if result.get('error'):
                 return result.get('message', 'Error'), []
             return result['message'], result['mappings']
         except Exception as e:
             return f"## Error\n\n{str(e)}", []
+    # 5. Advanced Tool Wiring
+    # Update dropdown when page changes
+    page_num.change(update_block_dropdown, inputs=[pdf_path, page_num], outputs=[cs_block_dropdown])
+    # Clear stale results when page changes (User Request: "Did it reset?")
+    # We clear the outputs of advanced tools so users know they need to regenerate
+    def clear_stale():
+        return None, "", None, "", None, 0, 0, 0, "", None, "", ""
+    # Actually, let's keep it simple. Just clearing the main ones users look at.
     page_num.change(
+        lambda: ("", ""),
+        outputs=[sr_transcript, sr_analysis]
+    )
+    # Also clear paragraph overlay?
+    page_num.change(
+        lambda: None,
+        outputs=[para_overlay]
     )
     cs_inspect_btn.click(
         outputs=[map_message, map_table]
     )
+    batch_run_btn.click(
+        analyze_batch_with_progress,
+        inputs=[pdf_path, batch_max_pages, batch_sample_rate],
+        outputs=[batch_summary_md, batch_chart, batch_table, batch_json, batch_progress]
+    )
 if __name__ == "__main__":
+    demo.launch(css=".result-markdown { font-size: 14px; } .help-md { font-size: 12px; color: #666; }")
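
Review note: the `run_*` callbacks above share one shape: guard against a missing path, call the analyzer inside `try`, map an error dict to a message plus placeholder outputs, and turn exceptions into a Markdown error string. A minimal sketch of that shape follows. `run_tool`, `fake_analyzer`, and `broken_analyzer` are hypothetical names used only for illustration and are not part of this commit; the real analyzers and `fitz` are replaced by stubs so the snippet runs on its own.

```python
from typing import Any, Callable, Dict, Tuple


def run_tool(
    pdf_path: Any,
    analyzer: Callable[[str], Dict[str, Any]],
    on_success: Callable[[Dict[str, Any]], Tuple[Any, ...]],
    fallback: Tuple[Any, ...],
) -> Tuple[Any, ...]:
    """Hypothetical helper mirroring the guard / try / error-dict pattern."""
    if not pdf_path:
        # Same guard the callbacks use before touching the file.
        return ("Please upload a PDF first",) + fallback
    try:
        result = analyzer(pdf_path)
        if result.get("error"):
            # Analyzers signal failure with an 'error' flag and a 'message'.
            return (result.get("message", "Error"),) + fallback
        return on_success(result)
    except Exception as e:
        # Unexpected failures are shown as a Markdown error block.
        return (f"## Error\n\n{e}",) + fallback


def fake_analyzer(path: str) -> Dict[str, Any]:
    # Stub standing in for something like analyze_screen_reader.
    return {"transcript": "hello", "analysis": "ok"}


def broken_analyzer(path: str) -> Dict[str, Any]:
    raise ValueError("boom")


def pick(result: Dict[str, Any]) -> Tuple[Any, ...]:
    return (result["transcript"], result["analysis"])


no_file = run_tool(None, fake_analyzer, pick, ("",))
success = run_tool("doc.pdf", fake_analyzer, pick, ("",))
flagged = run_tool("doc.pdf", lambda p: {"error": True, "message": "No text"}, pick, ("",))
crashed = run_tool("doc.pdf", broken_analyzer, pick, ("",))
```

The `fallback` tuple carries whatever placeholder values the remaining output components expect, which is the part that differs between the callbacks in this diff (for example `""` for the transcript tools and `[]` for the mapping table).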

layout_utils.py ADDED

	@@ -0,0 +1,174 @@

+"""
+Layout Utilities Module
+Contains shared logic for block extraction, ordering, and data structures
+to avoid circular dependencies between app.py and other modules.
+"""
+from dataclasses import dataclass
+from typing import List, Tuple, Any, Dict, Optional
+import pymupdf as fitz
+import re
+@dataclass
+class SpanInfo:
+    bbox: Tuple[float, float, float, float]
+    text: str
+    font: str
+    size: float
+@dataclass
+class BlockInfo:
+    bbox: Tuple[float, float, float, float]
+    text: str
+    block_type: int  # 0 text, 1 image, 2 drawing in PyMuPDF terms for some outputs
+    spans: List[SpanInfo]
+@dataclass
+class PageDiagnostic:
+    """Extended diagnostic for batch processing."""
+    page_num: int
+    tagged_pdf: bool
+    text_len: int
+    image_block_count: int
+    font_count: int
+    has_type3_fonts: bool
+    suspicious_garbled_text: bool
+    likely_scanned_image_page: bool
+    likely_text_as_vector_outlines: bool
+    multi_column_guess: bool
+    processing_time_ms: Optional[int] = None
+@dataclass
+class BatchAnalysisResult:
+    """Aggregate results from all pages."""
+    total_pages: int
+    pages_analyzed: int
+    summary_stats: Dict[str, int]
+    per_page_results: List[PageDiagnostic]
+    common_issues: List[str]
+    critical_pages: List[int]
+    processing_time_sec: float
+    def to_dict(self) -> Dict[str, Any]:
+        """Convert to JSON-serializable dict."""
+        return {
+            "total_pages": self.total_pages,
+            "pages_analyzed": self.pages_analyzed,
+            "summary_stats": self.summary_stats,
+            "per_page_results": [
+                {
+                    "page_num": p.page_num,
+                    "tagged_pdf": p.tagged_pdf,
+                    "text_len": p.text_len,
+                    "image_block_count": p.image_block_count,
+                    "font_count": p.font_count,
+                    "has_type3_fonts": p.has_type3_fonts,
+                    "suspicious_garbled_text": p.suspicious_garbled_text,
+                    "likely_scanned_image_page": p.likely_scanned_image_page,
+                    "likely_text_as_vector_outlines": p.likely_text_as_vector_outlines,
+                    "multi_column_guess": p.multi_column_guess,
+                    "processing_time_ms": p.processing_time_ms,
+                }
+                for p in self.per_page_results
+            ],
+            "common_issues": self.common_issues,
+            "critical_pages": self.critical_pages,
+            "processing_time_sec": self.processing_time_sec,
+        }
+def _safe_str(x: Any, max_len: int = 400) -> str:
+    s = str(x)
+    if len(s) > max_len:
+        s = s[:max_len] + "…"
+    return s
+def _looks_like_math(text: str) -> bool:
+    # Heuristic: mathy glyphs/symbols and patterns
+    if not text:
+        return False
+    math_syms = r"[∑∫√≈≠≤≥∞±×÷∂∇∈∩∪⊂⊆⊇⊃→↦∀∃ℝℤℚℕ]"
+    latexy = r"(\\frac|\\sqrt|\\sum|\\int|_|\^|\b(?:sin|cos|tan|log|ln)\b)"
+    return bool(re.search(math_syms, text) or re.search(latexy, text))
+def extract_blocks_spans(doc: fitz.Document, page_index: int) -> List[BlockInfo]:
+    page = doc[page_index]
+    raw = page.get_text("dict")  # includes blocks/lines/spans with bboxes
+    mat = page.rotation_matrix
+    blocks: List[BlockInfo] = []
+    for b in raw.get("blocks", []):
+        btype = int(b.get("type", -1))
+        # Transform block bbox to visual coordinates
+        bbox_rect = fitz.Rect(b.get("bbox", (0, 0, 0, 0))) * mat
+        bbox = tuple(bbox_rect)
+        text_parts: List[str] = []
+        spans: List[SpanInfo] = []
+        if btype == 0:  # text
+            for line in b.get("lines", []):
+                for sp in line.get("spans", []):
+                    t = sp.get("text", "")
+                    if t:
+                        text_parts.append(t)
+                    # Transform span bbox to visual coordinates
+                    sp_bbox_rect = fitz.Rect(sp.get("bbox", (0, 0, 0, 0))) * mat
+                    spans.append(
+                        SpanInfo(
+                            bbox=tuple(sp_bbox_rect),
+                            text=t,
+                            font=_safe_str(sp.get("font", "")),
+                            size=float(sp.get("size", 0.0)),
+                        )
+                    )
+        text = "".join(text_parts).strip()
+        blocks.append(BlockInfo(bbox=bbox, text=text, block_type=btype, spans=spans))
+    return blocks
+def order_blocks(blocks: List[BlockInfo], mode: str) -> List[Tuple[int, BlockInfo]]:
+    """
+    Return list of (idx, block) in chosen order.
+    """
+    indexed = list(enumerate(blocks))
+    if mode == "raw":
+        return indexed
+    def key_tblr(item: Tuple[int, BlockInfo]) -> Tuple[int, int]:
+        _, b = item
+        x0, y0, x1, y1 = b.bbox
+        return (int(y0), int(x0))
+    if mode == "tblr":
+        return sorted(indexed, key=key_tblr)
+    if mode == "columns":
+        # Simple 2-column heuristic:
+        # cluster by x-center around midline, then sort within each column.
+        # This is a heuristic; tagged PDFs should make this unnecessary.
+        xs = []
+        for _, b in indexed:
+            x0, y0, x1, y1 = b.bbox
+            if (x1 - x0) > 5:
+                xs.append((x0 + x1) / 2.0)
+        if not xs:
+            return sorted(indexed, key=key_tblr)
+        mid = sorted(xs)[len(xs) // 2]
+        left = []
+        right = []
+        for it in indexed:
+            _, b = it
+            x0, y0, x1, y1 = b.bbox
+            cx = (x0 + x1) / 2.0
+            (left if cx < mid else right).append(it)
+        left = sorted(left, key=key_tblr)
+        right = sorted(right, key=key_tblr)
+        # Read left column first, then right
+        return left + right
+    # Fallback
+    return sorted(indexed, key=key_tblr)
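
Review note: the difference between the `tblr` and `columns` modes is easiest to see on a small two-column page. The sketch below re-implements just those two orderings on a reduced `Block` stand-in (an assumption made so the snippet runs without PyMuPDF; the real code uses `BlockInfo` and `order_blocks`). The sample coordinates are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Reduced stand-in for BlockInfo: only the fields ordering needs."""
    bbox: Tuple[float, float, float, float]
    text: str


def _tblr_key(item: Tuple[int, Block]) -> Tuple[int, int]:
    _, b = item
    return (int(b.bbox[1]), int(b.bbox[0]))  # (y0, x0)


def _center_x(b: Block) -> float:
    return (b.bbox[0] + b.bbox[2]) / 2.0


def order_tblr(blocks: List[Block]) -> List[Tuple[int, Block]]:
    return sorted(enumerate(blocks), key=_tblr_key)


def order_columns(blocks: List[Block]) -> List[Tuple[int, Block]]:
    indexed = list(enumerate(blocks))
    # Median x-center of blocks wider than 5 units acts as the column split.
    centers = [_center_x(b) for _, b in indexed if (b.bbox[2] - b.bbox[0]) > 5]
    if not centers:
        return sorted(indexed, key=_tblr_key)
    mid = sorted(centers)[len(centers) // 2]
    left = [it for it in indexed if _center_x(it[1]) < mid]
    right = [it for it in indexed if _center_x(it[1]) >= mid]
    return sorted(left, key=_tblr_key) + sorted(right, key=_tblr_key)


# Two columns, two blocks each, listed in an arbitrary "raw" order.
page = [
    Block((320, 220, 550, 320), "R2"),
    Block((50, 100, 280, 200), "L1"),
    Block((320, 100, 550, 200), "R1"),
    Block((50, 220, 280, 320), "L2"),
]

tblr_texts = [b.text for _, b in order_tblr(page)]        # L1, R1, L2, R2
column_texts = [b.text for _, b in order_columns(page)]   # L1, L2, R1, R2

# A full-width title has its x-center exactly on the median split.
with_title = page + [Block((50, 40, 550, 80), "Title")]
title_texts = [b.text for _, b in order_columns(with_title)]  # L1, L2, Title, R1, R2
```

In this sketch the `tblr` mode interleaves the two columns row by row, while `columns` reads the left column to the end first. The last case shows a limit of the median split: a full-width title is not less than its own center, so it is grouped with the right column and read after the left column's body text.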

screen_reader_sim.py CHANGED

@@ -167,7 +167,7 @@ def _simulate_untagged(
     Returns:
         Tuple of (transcript, analysis)
     """
-    from app import order_blocks  # Import the ordering function
+    from layout_utils import order_blocks  # Import the ordering function
     # Order blocks according to mode
     ordered_blocks = order_blocks(blocks, order_mode)
@@ -177,7 +177,7 @@ def _simulate_untagged(
     text_block_count = 0
     image_block_count = 0
-    for block in ordered_blocks:
+    for idx, block in ordered_blocks:
         if block.block_type == 0:  # Text block
             # Infer heading from font size
             is_heading = False
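
Review note: the loop change matters because `order_blocks` in `layout_utils.py` returns `(idx, block)` pairs, not bare blocks. The sketch below shows the failure the old loop shape would hit, assuming `order_blocks` returned the same pairs before the move. `Block` and `order_blocks_stub` are hypothetical stand-ins so the snippet runs without the project modules.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    """Reduced stand-in for BlockInfo."""
    block_type: int  # 0 = text, 1 = image
    text: str


def order_blocks_stub(blocks: List[Block]) -> List[Tuple[int, Block]]:
    # Stand-in that returns (idx, block) pairs, like order_blocks in "raw" mode.
    return list(enumerate(blocks))


ordered = order_blocks_stub([Block(0, "Title"), Block(1, "")])

# Old loop shape: each item is a tuple, so attribute access fails.
old_error: Optional[str] = None
try:
    for block in ordered:
        block.block_type
except AttributeError as exc:
    old_error = str(exc)

# New loop shape: unpack the pair, then use the block.
text_blocks = [block.text for idx, block in ordered if block.block_type == 0]
```

Unpacking also keeps the original block index available, which the advanced tools can use to refer back to the block numbers shown in the overlay.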