rianders committed
Commit 0d61aa0 · 1 Parent(s): 27fda3f

Fix file load errors and implement auto-refresh functionality

Files changed (7)
  1. AGENTS.md +377 -0
  2. DEBUGGING_WORKFLOW.md +72 -0
  3. TEST_PLAN.md +76 -0
  4. USABILITY_AUDIT.md +45 -0
  5. app.py +316 -461
  6. layout_utils.py +174 -0
  7. screen_reader_sim.py +2 -2
AGENTS.md ADDED
@@ -0,0 +1,377 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ PDF Structure Inspector is a Gradio-based web application designed for debugging PDF accessibility, reading order, and structure. It helps identify issues that affect screen readers and assistive technologies by analyzing PDF structure, text extraction quality, and layout ordering.
+
+ **Target deployment**: Hugging Face Spaces (gradio SDK)
+
+ ## Commands
+
+ ### Development
+ ```bash
+ # Run the Gradio app locally
+ uv run python app.py
+
+ # The app will launch at http://localhost:7860 by default
+ ```
+
+ ### Dependencies
+ ```bash
+ # Sync environment (after cloning or pulling changes)
+ uv sync
+
+ # Add a new dependency
+ uv add <package>
+
+ # Add a dev dependency
+ uv add --dev <package>
+
+ # Regenerate requirements.txt for Hugging Face deployment (after dependency changes)
+ uv pip compile pyproject.toml -o requirements.txt
+ ```
+
+ The project uses `pyproject.toml` for dependency management with uv lock file support. **Always use `uv run`** for running commands in the development environment.
+
+ ## Architecture
+
+ ### Core Libraries
+ - **PyMuPDF (fitz)**: Layout extraction, block/span detection, rendering pages as images
+ - **pikepdf**: Low-level PDF structure inspection (tags, MarkInfo, OCProperties, page resources)
+ - **Gradio**: Web UI framework
+ - **PIL (Pillow)**: Image manipulation for overlay rendering
+
+ ### Main Application Flow (app.py)
+
+ The application has two main modes: **Single Page Analysis** and **Batch Analysis**.
+
+ #### Single Page Analysis Pipeline
+
+ 1. **PDF Structure Analysis** (`pdf_struct_report`):
+    - Uses pikepdf to inspect PDF-level metadata
+    - Checks for StructTreeRoot (tagging), MarkInfo, OCProperties (layers)
+    - Analyzes per-page resources (fonts, XObjects)
+
+ 2. **Layout Extraction** (`extract_blocks_spans`):
+    - Uses PyMuPDF's `get_text("dict")` to extract blocks, lines, and spans with bounding boxes
+    - Returns structured `BlockInfo` objects containing text, bbox, font info, and span details
+    - Block types: 0=text, 1=image, 2=drawing
+
+ 3. **Reading Order Analysis** (`order_blocks`):
+    - Three ordering modes:
+      - `raw`: Extraction order (as stored in PDF)
+      - `tblr`: Top-to-bottom, left-to-right sorting by bbox
+      - `columns`: Simple 2-column heuristic (clusters by x-center, sorts each column separately)
+
+ 4. **Diagnostic Heuristics** (`diagnose_page`):
+    - Detects scanned pages (no text + images)
+    - Identifies text-as-vector-outlines (no text + many drawings)
+    - Flags Type3 fonts (often correlate with broken text extraction)
+    - Detects garbled text (replacement characters, missing ToUnicode)
+    - Guesses multi-column layouts (x-center clustering)
+
+ 5. **Adaptive Contrast Detection** (for visualization):
+    - `sample_background_color()`: Samples page at 9 points (corners, edges, center) to determine background
+    - `calculate_luminance()`: Uses WCAG formula to compute relative luminance (0-1)
+    - `get_contrast_colors()`: Returns appropriate color palette based on luminance
+    - Background colors cached per page for performance
+
+ 6. **Visualization** (`render_page_with_overlay`):
+    - Renders page at specified DPI using PyMuPDF
+    - Automatically detects background and chooses contrasting overlay colors
+    - Overlays numbered block rectangles showing reading order
+    - Optionally shows span-level boxes
+    - Flags math-like content using regex heuristics (`_looks_like_math`)
+
+ 7. **Result Formatting** (`format_diagnostic_summary`):
+    - Generates Markdown with severity icons (✓, ⚠️, ❌)
+    - Includes inline explanations from `DIAGNOSTIC_HELP` dictionary
+
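Reviewer sketch (not part of the commit): the `tblr` and `columns` ordering modes described in step 3 can be illustrated as follows. This is a minimal sketch, assuming blocks are plain `(x0, y0, x1, y1)` tuples and that the column divider is the median x-center; the helper names `order_tblr` and `order_columns` are hypothetical and are not taken from `app.py`.

```python
import statistics
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def _top_left_key(bbox: BBox) -> Tuple[float, float]:
    """Sort key: top edge first, then left edge."""
    return (bbox[1], bbox[0])


def order_tblr(bboxes: List[BBox]) -> List[int]:
    """Return block indices sorted top-to-bottom, then left-to-right."""
    return sorted(range(len(bboxes)), key=lambda i: _top_left_key(bboxes[i]))


def order_columns(bboxes: List[BBox]) -> List[int]:
    """Two-column heuristic: split at the median x-center, sort each side."""
    if not bboxes:
        return []
    centers = [(b[0] + b[2]) / 2 for b in bboxes]
    divider = statistics.median(centers)
    left = [i for i, c in enumerate(centers) if c <= divider]
    right = [i for i, c in enumerate(centers) if c > divider]
    return (sorted(left, key=lambda i: _top_left_key(bboxes[i]))
            + sorted(right, key=lambda i: _top_left_key(bboxes[i])))


# Hypothetical two-column page: the right column starts slightly higher.
sample_blocks = [
    (50, 100, 250, 200),   # 0: left column, top
    (300, 90, 500, 200),   # 1: right column, top
    (50, 300, 250, 400),   # 2: left column, bottom
    (300, 310, 500, 400),  # 3: right column, bottom
]
tblr_order = order_tblr(sample_blocks)
column_order = order_columns(sample_blocks)
```

On this sample, `tblr` interleaves the two columns while `columns` reads the left column fully before the right one, which is the difference the overlay numbering is meant to expose.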
+ #### Batch Analysis Pipeline
+
+ 1. **Multi-Page Processing** (`diagnose_all_pages`):
+    - Analyzes multiple pages (configurable max_pages and sample_rate)
+    - Progress tracking via `gr.Progress()`
+    - Calls `diagnose_page()` for each page with timing
+    - Returns `BatchAnalysisResult` dataclass
+
+ 2. **Aggregation** (`aggregate_results`):
+    - Counts issues across all pages
+    - Identifies critical pages (3+ issues)
+    - Detects common issues (affecting >50% of pages)
+
+ 3. **Result Formatting**:
+    - `format_batch_summary_markdown()`: Executive summary with statistics
+    - `format_batch_results_table()`: Color-coded HTML table per page
+    - `format_batch_results_chart()`: Plotly bar chart of issue distribution
+
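Reviewer sketch (not part of the commit): the aggregation rules in step 2 (critical pages have 3+ issues, common issues affect more than 50% of pages) could look roughly like this. The function name `aggregate_issues`, the input shape (page number mapped to a list of issue labels), and the issue labels themselves are assumptions for illustration, not the real `aggregate_results` signature.

```python
from collections import Counter
from typing import Dict, List


def aggregate_issues(
    pages: Dict[int, List[str]],
    critical_threshold: int = 3,
    common_fraction: float = 0.5,
) -> Dict[str, object]:
    """Summarise per-page issue labels into counts, critical pages, common issues."""
    # Count each issue at most once per page.
    counts = Counter(issue for issues in pages.values() for issue in set(issues))
    critical_pages = sorted(
        page for page, issues in pages.items()
        if len(set(issues)) >= critical_threshold
    )
    common_issues = sorted(
        issue for issue, n in counts.items()
        if n > common_fraction * len(pages)
    )
    return {
        "issue_counts": dict(counts),
        "critical_pages": critical_pages,
        "common_issues": common_issues,
    }


# Hypothetical diagnostics for a four-page document.
summary = aggregate_issues({
    1: ["scanned", "garbled_text"],
    2: ["garbled_text", "type3_fonts", "multi_column"],
    3: ["garbled_text"],
    4: [],
})
```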
+ ### Advanced Analysis Modules
+
+ The application includes specialized modules for advanced PDF accessibility analysis:
+
+ **advanced_analysis.py** - Coordinator module
+ - Provides facade functions with error handling
+ - `require_structure_tree` decorator: checks for tagged PDFs before execution
+ - `safe_execute` decorator: comprehensive error handling with user-friendly messages
+ - Exports high-level functions: `analyze_content_stream`, `analyze_screen_reader`, etc.
+
+ **content_stream_parser.py** - PDF operator extraction
+ - `extract_content_stream_for_block()`: Gets operators for a specific block
+ - `_parse_text_objects()`: Extracts BT...ET blocks from content stream
+ - `_parse_operators()`: Regex-based parsing of Tm, Tf, Tj, TJ, Td, color operators
+ - `_find_matching_text_object()`: Correlates text objects with BlockInfo via text matching
+ - Returns formatted markdown and raw stream text
+
+ **screen_reader_sim.py** - Accessibility simulation
+ - `simulate_screen_reader()`: Main simulation function
+ - `_simulate_tagged()`: Follows structure tree for tagged PDFs
+ - `_simulate_untagged()`: Falls back to visual order for untagged PDFs
+ - `_format_element_announcement()`: Generates NVDA/JAWS-style announcements
+ - Supports heading levels, paragraphs, figures, formulas, lists, tables, links
+ - Infers headings from font size (>18pt = H1, >14pt = H2) for untagged PDFs
+
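Reviewer sketch (not part of the commit): the font-size rule for untagged PDFs (>18pt = H1, >14pt = H2) is simple enough to show directly. The names `infer_heading_level` and `announce` are hypothetical, and the announcement wording is an assumption rather than the output of `_format_element_announcement()`.

```python
from typing import Optional


def infer_heading_level(font_size_pt: float) -> Optional[int]:
    """Guess a heading level from font size; None means body text."""
    if font_size_pt > 18:
        return 1
    if font_size_pt > 14:
        return 2
    return None


def announce(text: str, font_size_pt: float) -> str:
    """Format a screen-reader-style line (wording is illustrative only)."""
    level = infer_heading_level(font_size_pt)
    if level is None:
        return text
    return f"Heading level {level}, {text}"


title_line = announce("Introduction", 20.0)
body_line = announce("Plain paragraph text.", 11.0)
```

Both comparisons are strict, so text at exactly 18pt is treated as H2 and text at exactly 14pt as body text under this reading of the rule.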
+ **structure_tree.py** - Structure tree analysis
+ - `StructureNode` dataclass: represents PDF tag hierarchy
+ - `extract_structure_tree()`: Recursively parses StructTreeRoot with pikepdf
+ - `_parse_structure_element()`: Handles Dictionary, Array, and MCID elements
+ - `format_tree_text()`: Creates indented text view with box-drawing characters
+ - `get_tree_statistics()`: Counts nodes, tags, alt text coverage
+ - `extract_mcid_for_page()`: Finds marked content IDs in page content stream
+ - `map_blocks_to_tags()`: Correlates visual blocks with structure elements
+ - `detect_visual_paragraphs()`: Spacing-based paragraph detection
+ - `detect_semantic_paragraphs()`: Extracts &lt;P&gt; tags for a page
+ - `compare_paragraphs()`: Calculates match quality between visual and semantic
+
+ ### Key Data Structures
+
+ **Single Page Analysis**:
+ - `SpanInfo`: Individual text run with bbox, text, font, size
+ - `BlockInfo`: Text/image block with bbox, text, type, and list of spans
+
+ **Batch Analysis**:
+ - `PageDiagnostic`: Per-page diagnostic results with all issue flags and processing time
+ - `BatchAnalysisResult`: Aggregated statistics across multiple pages including:
+   - `summary_stats`: Dictionary of issue counts
+   - `per_page_results`: List of PageDiagnostic objects
+   - `common_issues`: Issues affecting >50% of pages
+   - `critical_pages`: Pages with 3+ issues
+   - `to_dict()`: Method to convert to JSON-serializable format
+
+ **Advanced Analysis**:
+ - `StructureNode`: Represents a node in the PDF structure tree with:
+   - `tag_type`: Tag name (P, H1, Document, Figure, etc.)
+   - `depth`: Nesting level in the tree
+   - `mcid`: Marked Content ID (links to page content)
+   - `alt_text`: Alternative text for accessibility
+   - `actual_text`: Actual text content or replacement text
+   - `page_ref`: 0-based page index
+   - `children`: List of child StructureNode objects
+   - `to_dict()`: Convert to JSON-serializable format
+
+ **UI State**:
+ - The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
+ - Background color cache: `_bg_color_cache` dict keyed by (document_path, page_index)
+
+ ### Gradio UI Flow
+
+ The UI is organized into three tabs: **Single Page Analysis**, **Batch Analysis**, and **Advanced Analysis**.
+
+ #### Single Page Tab
+ 1. User uploads PDF → `_on_upload` → extracts path and page count
+ 2. User adjusts parameters (page, DPI, order mode, visualization options)
+ 3. Click "Analyze" → `analyze` function:
+    - Runs structural report (pikepdf)
+    - Extracts and orders blocks (PyMuPDF)
+    - Generates diagnostic report with adaptive contrast detection
+    - Creates overlay image with high-contrast colors
+    - Returns reading order preview + formatted summary with icons
+
+ #### Batch Analysis Tab
+ 1. User sets max_pages (default 100) and sample_rate (default 1)
+ 2. Click "Analyze All Pages" → `analyze_batch_with_progress` function:
+    - Calls `diagnose_all_pages()` with progress tracking
+    - Aggregates results across pages
+    - Returns:
+      - Summary markdown with statistics and common issues
+      - Plotly bar chart of issue distribution
+      - Color-coded HTML table of per-page results
+      - Full JSON report
+
+ #### Advanced Analysis Tab
+
+ Power-user features for deep PDF inspection and accessibility debugging. Each feature is in its own accordion:
+
+ 1. **Content Stream Inspector**:
+    - Extracts raw PDF content stream operators for a specific block
+    - Shows low-level commands: text positioning (Tm, Td), fonts (Tf), text display (Tj, TJ)
+    - Useful for debugging text extraction, font issues, and positioning problems
+    - Provides both formatted view and raw stream
+    - Uses regex parsing of content streams (approximate for complex PDFs)
+
+ 2. **Screen Reader Simulator**:
+    - Simulates NVDA or JAWS reading behavior for the current page
+    - Two modes:
+      - **Tagged PDFs**: Follows structure tree, announces headings/paragraphs/figures with proper semantics
+      - **Untagged PDFs**: Falls back to visual reading order, infers headings from font size
+    - Three detail levels: minimal (text only), default (element announcements), verbose (full context)
+    - Generates transcript + analysis with alt text coverage statistics
+    - Reading order configurable for untagged fallback (raw/tblr/columns)
+
+ 3. **Paragraph Detection**:
+    - Compares visual paragraphs (detected by spacing) vs semantic &lt;P&gt; tags
+    - Visual detection: groups blocks with vertical gap < threshold (default 15pt)
+    - Semantic detection: extracts &lt;P&gt; tags from structure tree
+    - Generates color-coded overlay (green = visual paragraphs)
+    - Reports match quality score and mismatches
+    - Requires tagged PDF for semantic comparison
+
+ 4. **Structure Tree Visualizer**:
+    - Extracts complete PDF tag hierarchy from StructTreeRoot
+    - Three visualization formats:
+      - **Tree Diagram**: Interactive Plotly sunburst chart
+      - **Text View**: Indented text with box-drawing characters
+      - **Statistics**: Node counts, tag distribution, alt text coverage
+    - Shows tag types (H1-H6, P, Figure, Table, L, LI, etc.)
+    - Displays alt text, actual text, page references, and MCID markers
+    - Only works for tagged PDFs
+
+ 5. **Block-to-Tag Mapping**:
+    - Maps visual blocks to structure tree elements via MCID (Marked Content ID)
+    - Shows which blocks have proper semantic tagging
+    - DataFrame output with block index, tag type, MCID, alt text
+    - Helps identify untagged content
+    - Requires tagged PDF with MCID references
+
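Reviewer sketch (not part of the commit): the visual half of Paragraph Detection (group blocks whose vertical gap is below a threshold, default 15pt) might be implemented along these lines. `group_visual_paragraphs` is a hypothetical name, and measuring the gap as "top of this block minus bottom of the previous block" is an assumption about how `detect_visual_paragraphs()` defines it.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def group_visual_paragraphs(bboxes: List[BBox], max_gap: float = 15.0) -> List[List[int]]:
    """Group block indices into paragraphs when the vertical gap is < max_gap."""
    groups: List[List[int]] = []
    # Walk blocks from top to bottom.
    for i in sorted(range(len(bboxes)), key=lambda k: bboxes[k][1]):
        if groups:
            previous = bboxes[groups[-1][-1]]
            gap = bboxes[i][1] - previous[3]  # this top minus previous bottom
            if gap < max_gap:
                groups[-1].append(i)
                continue
        groups.append([i])
    return groups


# Hypothetical page: two tightly spaced blocks, then one after a large gap.
paragraphs = group_visual_paragraphs([
    (50, 100, 500, 120),
    (50, 125, 500, 145),
    (50, 180, 500, 200),
])
```

This single-column version ignores horizontal position, so on multi-column pages it would need to run per column.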
+ #### Help & Documentation
+ - All UI controls have `info` parameters with inline tooltips
+ - Expandable "📖 Understanding the Diagnostics" accordion with detailed explanations
+ - `DIAGNOSTIC_HELP` and `ORDERING_MODE_HELP` dictionaries provide explanation text
+ - Summary sections use severity icons (✓, ⚠️, ❌) for quick scanning
+
+ ## Key Features
+
+ ### Adaptive Contrast Overlays
+ The overlay visualization automatically adapts to document background colors:
+ - **Light backgrounds** (luminance > 0.5) → Dark overlays (dark blue #00008B, black text)
+ - **Dark backgrounds** (luminance ≤ 0.5) → Light overlays (yellow #FFFF00, white text)
+ - Background sampled at 9 strategic points using low DPI (72) for performance
+ - Results cached in `_bg_color_cache` to avoid re-sampling
+ - Color palettes defined in `LIGHT_BG_COLORS` and `DARK_BG_COLORS` constants
+
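Reviewer sketch (not part of the commit): the luminance threshold above relies on the WCAG relative luminance formula. The sketch below implements that formula for 8-bit sRGB values; `relative_luminance` and `overlay_palette` are hypothetical stand-ins for `calculate_luminance()` and `get_contrast_colors()`, and the returned palette labels are illustrative.

```python
from typing import Tuple


def relative_luminance(rgb: Tuple[int, int, int]) -> float:
    """WCAG relative luminance (0.0 to 1.0) of an 8-bit sRGB color."""
    def linearize(channel_8bit: int) -> float:
        c = channel_8bit / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def overlay_palette(rgb: Tuple[int, int, int]) -> str:
    """Pick dark overlays for light backgrounds and light overlays otherwise."""
    return "dark-overlay" if relative_luminance(rgb) > 0.5 else "light-overlay"


white_choice = overlay_palette((255, 255, 255))
black_choice = overlay_palette((0, 0, 0))
```

Because the formula is non-linear, a mid-gray background such as `(128, 128, 128)` has a luminance well below 0.5 and would therefore get the light overlay palette.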
+ ### Inline Help System
+ Comprehensive documentation integrated into the UI:
+ - `info` parameters on all controls provide contextual tooltips
+ - Expandable accordion with detailed explanations of all diagnostics and modes
+ - Help text stored in `DIAGNOSTIC_HELP` and `ORDERING_MODE_HELP` dictionaries
+ - Summary formatting includes severity icons and inline explanations
+
+ ### Batch Analysis
+ Multi-page document analysis with aggregate statistics:
+ - Configurable limits: max_pages (default 100), sample_rate (analyze every Nth page)
+ - Real-time progress tracking via `gr.Progress()`
+ - Outputs: summary stats, issue chart, per-page table, full JSON report
+ - Performance: ~10-50ms per page depending on complexity
+ - Identifies common issues (>50% of pages) and critical pages (3+ issues)
+
+ ## Important Implementation Notes
+
+ ### PDF Handling
+ - Always use pikepdf for structural queries (tags, resources)
+ - Always use PyMuPDF (fitz) for layout extraction and rendering
+ - Page indices are 0-based internally, 1-based in UI (convert with `page_num - 1`)
+ - Close documents properly using context managers (`with fitz.open()`, `with pikepdf.open()`)
+
+ ### Coordinate Systems
+ - PyMuPDF bboxes are (x0, y0, x1, y1) in PDF points (1/72 inch)
+ - PIL/ImageDraw expects integer pixel coordinates
+ - Use `_rect_i()` to convert float bboxes to int for drawing
+ - DPI scaling is handled by PyMuPDF's `get_pixmap(dpi=...)`
+
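Reviewer sketch (not part of the commit): since bboxes are in PDF points (1/72 inch) and the page image is rendered at a chosen DPI, overlay rectangles presumably need scaling by `dpi / 72` before being rounded to integers. The helper `bbox_to_pixels` is hypothetical; it combines that assumed scaling with the rounding that `_rect_i()` performs.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def bbox_to_pixels(bbox: BBox, dpi: int) -> Tuple[int, int, int, int]:
    """Scale a bbox from PDF points to pixel coordinates at the given DPI."""
    scale = dpi / 72.0  # 72 points per inch
    x0, y0, x1, y1 = (int(round(v * scale)) for v in bbox)
    return (x0, y0, x1, y1)


# A one-inch-wide box, rendered at 144 DPI, is 144 pixels wide.
pixel_box = bbox_to_pixels((72.0, 36.0, 144.0, 108.5), dpi=144)
```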
+ ### Heuristics Limitations
+ - Column detection is crude (assumes max 2 columns, uses median x-center as divider)
+ - Math detection is pattern-based (Unicode symbols + LaTeX-like patterns)
+ - All diagnostics are heuristic; tagged PDFs with proper structure should be preferred
+ - Type3 font detection is string-based and may have false positives
+
+ ### Gradio Patterns
+ - File upload provides `.name` attribute for file path
+ - Use `gr.update()` to modify component properties dynamically (e.g., slider maximum)
+ - State management relies on component values, not session storage
+ - Use `gr.Progress()` parameter in callbacks for long-running operations (batch analysis)
+ - Tabs organize related functionality (`gr.Tabs()` with `gr.Tab()` children)
+ - Accordions (`gr.Accordion()`) for progressive disclosure of help text and detailed results
+
+ ### Adaptive Contrast Implementation
+ - Always render at low DPI (72) for background sampling to avoid performance impact
+ - Sample 9 points: 4 corners + 4 edge midpoints + 1 center (at 5%, 50%, 95% positions)
+ - Use `statistics.median()` instead of mean to avoid outliers from text/graphics
+ - Cache key format: `(document.name, page_index)` tuple
+ - Clear cache on new document upload if memory becomes an issue
+ - Fallback to `LIGHT_BG_COLORS` if sampling fails or `auto_contrast=False`
+
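Reviewer sketch (not part of the commit): the 9-point sampling with a per-channel median can be shown without any rendering library. Here `get_pixel` is a hypothetical callable standing in for whatever pixel access `sample_background_color()` uses, and the exact pixel positions derived from the 5%/50%/95% fractions are an assumption.

```python
import statistics
from typing import Callable, List, Tuple

RGB = Tuple[int, int, int]


def sample_points(width: int, height: int,
                  fractions: Tuple[float, ...] = (0.05, 0.5, 0.95)) -> List[Tuple[int, int]]:
    """Return the 3x3 grid of (x, y) pixel positions used for sampling."""
    return [(int(fx * (width - 1)), int(fy * (height - 1)))
            for fy in fractions for fx in fractions]


def median_background(get_pixel: Callable[[int, int], RGB],
                      width: int, height: int) -> RGB:
    """Estimate the background color as the per-channel median of 9 samples."""
    samples = [get_pixel(x, y) for x, y in sample_points(width, height)]
    r, g, b = (int(statistics.median(channel)) for channel in zip(*samples))
    return (r, g, b)


def fake_page(x: int, y: int) -> RGB:
    """Hypothetical white page with dark content exactly at the center sample."""
    return (0, 0, 0) if (x, y) == (49, 99) else (255, 255, 255)


points = sample_points(100, 200)
background = median_background(fake_page, 100, 200)
```

The median keeps the estimate at white even though one of the nine samples lands on dark content, which is the motivation the notes give for preferring it over the mean.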
+ ### Batch Analysis Performance
+ - Default max_pages=100 prevents timeout on large documents
+ - Sample rate allows analyzing every Nth page (useful for 500+ page documents)
+ - Each page takes ~10-50ms depending on complexity (text extraction + diagnostics)
+ - Progress updates every page to keep UI responsive
+ - Use dataclasses instead of dicts for better memory efficiency
+ - Consider adding timeout protection for very large documents (1000+ pages)
+
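Reviewer sketch (not part of the commit): one plausible reading of how `max_pages` and `sample_rate` combine is "take every Nth page, then cap the list". The function name `pages_to_analyze` and the order in which the two limits apply are assumptions; `diagnose_all_pages()` may apply them differently.

```python
from typing import List


def pages_to_analyze(page_count: int, max_pages: int = 100, sample_rate: int = 1) -> List[int]:
    """Return 0-based page indices: every sample_rate-th page, capped at max_pages."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be at least 1")
    return list(range(0, page_count, sample_rate))[:max_pages]


# A 500-page document sampled every 5th page fits exactly into the default cap.
sampled = pages_to_analyze(500, max_pages=100, sample_rate=5)
```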
+ ### Result Formatting
+ - Use Markdown with severity icons for human-readable summaries
+ - Icons: ✓ (no issues), ⚠️ (warnings), ❌ (critical issues)
+ - HTML tables for detailed per-page results allow custom styling (color-coded cells)
+ - Plotly charts via `gr.Plot()` for interactive visualizations
+ - All batch results have `.to_dict()` method for JSON export
+
+ ### Advanced Analysis Error Handling
+ - **Graceful Degradation**: All advanced features check for requirements before execution
+ - **Structure Tree Required**: Features 2, 4, 5 require tagged PDFs
+   - `@require_structure_tree` decorator checks for StructTreeRoot
+   - Returns user-friendly error message if not found
+   - Explains what tagging is and why it's needed
+ - **Safe Execution**: All features wrapped in `@safe_execute` decorator
+   - Catches all exceptions with traceback
+   - Returns formatted error messages instead of crashing
+ - **Content Stream Parsing**: Regex-based, may fail on complex/malformed PDFs
+   - Returns "not matched" status if text object not found
+   - Shows raw stream even if parsing fails
+ - **MCID Extraction**: May fail if content stream uses non-standard encoding
+   - Returns empty list on failure
+   - Block-to-tag mapping shows "No mappings found" message
+ - **Performance Limits**: Structure tree extraction has max_depth=20 to prevent infinite loops
+
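Reviewer sketch (not part of the commit): the two decorators described above can be illustrated with plain dictionaries standing in for the PDF catalog. This is only a sketch under that assumption; the real `require_structure_tree` and `safe_execute` in `advanced_analysis.py` work with pikepdf objects and may differ in signature and message text. `count_top_level_tags` is a hypothetical feature function.

```python
import functools
import traceback
from typing import Any, Callable, Mapping


def safe_execute(func: Callable[..., Any]) -> Callable[..., Any]:
    """Return a formatted error message instead of letting exceptions escape."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            details = traceback.format_exc()
            return (f"## Error in {func.__name__}\n\n"
                    f"{type(exc).__name__}: {exc}\n\n{details}")
    return wrapper


def require_structure_tree(func: Callable[..., Any]) -> Callable[..., Any]:
    """Short-circuit with a friendly message when the catalog has no tags."""
    @functools.wraps(func)
    def wrapper(catalog: Mapping[str, Any], *args: Any, **kwargs: Any) -> Any:
        if "/StructTreeRoot" not in catalog:
            return "## No Structure Tree Found\n\nThis feature needs a tagged PDF."
        return func(catalog, *args, **kwargs)
    return wrapper


@safe_execute
@require_structure_tree
def count_top_level_tags(catalog: Mapping[str, Any]) -> int:
    """Hypothetical feature: count the children of the structure tree root."""
    return len(catalog["/StructTreeRoot"]["/K"])


untagged_result = count_top_level_tags({})
tagged_result = count_top_level_tags({"/StructTreeRoot": {"/K": ["Document", "P"]}})
broken_result = count_top_level_tags({"/StructTreeRoot": {}})
```

Stacking `safe_execute` outermost means a malformed structure tree produces an error message rather than an exception, while a missing tree is reported separately by the inner check.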
+ ## Testing
+
+ ### Manual Testing Checklist
+ 1. **Adaptive Contrast**: Test with light and dark background PDFs, verify overlay colors contrast properly
+ 2. **Help System**: Hover over all controls, expand help accordion, verify all text displays correctly
+ 3. **Batch Analysis**: Test with 1-page, 10-page, and 100+ page documents
+ 4. **Edge Cases**: Scanned PDFs, multi-column layouts, math-heavy documents, Type3 fonts
+
+ ### Performance Benchmarks
+ - Single page analysis: <1 second for typical pages
+ - Batch analysis: ~10-50ms per page (100 pages in 1-5 seconds)
+ - Background sampling adds ~50-100ms one-time cost per page
+ - Memory usage: ~10-20MB per 100 pages of diagnostic data
+
+ ## Deployment to Hugging Face
+
+ ### Pre-deployment Steps
+ 1. Test locally: `uv run python app.py`
+ 2. Regenerate requirements.txt: `uv pip compile pyproject.toml -o requirements.txt`
+ 3. Commit both `pyproject.toml` and `requirements.txt`
+ 4. Verify `app.py` is set as `app_file` in README.md frontmatter
+
+ ### Hugging Face Configuration
+ - SDK: gradio
+ - SDK version: 6.3.0 (or latest compatible)
+ - Python version: >=3.12 (as specified in pyproject.toml)
+ - Main file: app.py
+
+ ### Known Limitations on Hugging Face
+ - Very large PDFs (1000+ pages) may hit timeout limits
+ - Recommend setting max_pages=100 by default
+ - Consider adding explicit timeout handling for batch analysis
DEBUGGING_WORKFLOW.md ADDED
@@ -0,0 +1,72 @@
+ # PDF Debugging Workflow
+
+ This guide details how to use the PDF Inspector tool to diagnose and remediate common PDF accessibility issues.
+
+ ## 1. Initial Compatibility Check
+ **Goal**: Determine if the document requires major remediation before detailed analysis.
+
+ 1. **Upload the PDF**: Use the file uploader or select an example from the list.
+ 2. **Run Single Page Analysis**: Click "Analyze".
+ 3. **Check for Alerts**: Look for the "Accessibility Alert" box at the top of the summary.
+     * **Untagged Document**: If you see this, the document lacks the "Structure Tree" required for screen readers.
+         * *Remediation*: Open the source file (Word/PPT) and "Save as PDF" with tags enabled, or use Adobe Acrobat Pro's "Autotag" feature.
+     * **Scanned Page**: If you see this, the page is an image with no selectable text.
+         * *Remediation*: Perform Optical Character Recognition (OCR) using Adobe Acrobat or a similar tool.
+
+ ## 2. Detailed Single-Page Inspection
+ **Goal**: Verify reading order and content types on a specific page.
+
+ 1. **Visual Inspection**: Look at the "Analysis Results" image.
+     * **Red Boxes**: Indicate detected text blocks.
+     * **Numbers**: Show the reading order.
+ 2. **Verify Reading Order**:
+     * Does the order (1, 2, 3...) follow the logical flow of the document?
+     * *Issue*: If columns are read left-to-right across the page instead of down the column, the reading order is broken.
+     * *Fix*: This usually requires manual retagging in Acrobat (Order panel).
+ 3. **Check for Artifacts**:
+     * Are headers/footers marked as text blocks? (They should generally be artifacts/ignored by screen readers).
+
+ ## 3. Advanced Diagnostics
+ **Goal**: Deep dive into specific issues using the "Advanced Analysis" tab.
+
+ ### Content Stream Inspector
+ * **Use when**: Text looks correct visually but copies incorrectly or reads wrong (e.g., "fi" ligature issues).
+ * **Action**: Select a block and click "Extract Operators".
+ * **Look for**: `TJ` or `Tj` operators showing garbled characters or strange spacing adjustments.
+
+ ### Screen Reader Simulator
+ * **Use when**: You want to "hear" what a user hears.
+ * **Action**: Select "NVDA" and click "Generate Transcript".
+ * **Check**:
+     * Are headings announced as "Heading Level X"?
+     * Is alt text read for images?
+     * Is the reading order intelligible?
+
+ ### Paragraph Detection
+ * **Use when**: Text seems run-on or broken into too many fragments.
+ * **Action**: Click "Analyze Paragraphs".
+ * **Check**:
+     * **Visual vs. Semantic**: Large discrepancies suggest the `<P>` tags don't match the visual layout, which can confuse users navigating by paragraph.
+
+ ### Structure Tree Visualizer
+ * **Use when**: The document is tagged, but navigation is broken.
+ * **Action**: Click "Extract Structure Tree".
+ * **Check**:
+     * Hierarchy depth.
+     * Correct nesting (e.g., `L` -> `LI` -> `LBody`).
+
+ ## 4. Batch Analysis for Large Documents
+ **Goal**: Identify problematic pages in a long report.
+
+ 1. **Go to Batch Analysis Tab**.
+ 2. **Run Batch**: Analyze 50-100 pages.
+ 3. **Review the Report**:
+     * **Issues Found**: Look for "Scanned Pages" or "Garbled Text".
+     * **Page List**: Use the list of page numbers to target your remediation efforts.
+
+ ## Summary Checklist
+ - [ ] Document is Tagged (`/StructTreeRoot` exists)
+ - [ ] Text is selectable (not an image/scan)
+ - [ ] Reading order is logical (columns handled correctly)
+ - [ ] Images have Alt Text (or are marked as artifacts)
+ - [ ] Headings use Heading tags (`<H1>`, `<H2>`), not just bold text.
TEST_PLAN.md ADDED
@@ -0,0 +1,76 @@
+ # PDF Inspector - Test Plan
+
+ ## Overview
+ This test plan outlines verification steps for the PDF Inspector application using the provided example documents. Since all currently included examples are **untagged** documents, this plan focuses on verifying the "Untagged" detection logic, fallback heuristics (math detection, reading order), and error handling.
+
+ ## Test Environment
+ - **URL**: http://127.0.0.1:7860
+ - **Browsers**: Chrome / Safari / Firefox (Any modern browser)
+
+ ---
+
+ ## 1. Test Case: Untagged Document Detection
+ **Target Document**: `test_document.pdf`
+
+ | Step | Action | Expected Result | Pass/Fail |
+ |------|--------|-----------------|-----------|
+ | 1.1 | Select `test_document.pdf` from Examples. | File loads into the input box. | |
+ | 1.2 | Click **Analyze** button. | Analysis completes; "Analysis Results" image appears. | |
+ | 1.3 | Check Summary Report. | **Alert**: "⚠️ Accessibility Alert: Untagged Document" is visible. | |
+ | 1.4 | Go to **Advanced Analysis** tab. | Tab opens. | |
+ | 1.5 | Open **4. Structure Tree Visualizer** and click **Extract**. | **Result**: "## No Structure Tree Found" message. | |
+
+ **Success Criteria**: The application correctly identifies the document as untagged and prevents structure-dependent tools from crashing.
+
+ ---
+
+ ## 2. Test Case: Math & Visual Block Detection
+ **Target Document**: `18.1 Notes.pdf` (Handwritten/Math Slides)
+
+ | Step | Action | Expected Result | Pass/Fail |
+ |------|--------|-----------------|-----------|
+ | 2.1 | Select `18.1 Notes.pdf` from Examples. | File loads. | |
+ | 2.2 | Click **Analyze** button. | Analysis completes (~1-2 seconds). | |
+ | 2.3 | Inspect "Page overlay" image. | - **Red Boxes**: Detected around text blocks.<br>- **Math Highlight**: Math formulas (e.g., integrals, sums) should have specific bounding boxes. | |
+ | 2.4 | Check Summary Report. | **Alert**: "Untagged Document". <br> **Stats**: Should show > 0 "Math-like blocks detected". | |
+
+ **Success Criteria**: The heuristic regex-based math detection works on the text extracted from the slides.
+
+ ---
+
+ ## 3. Test Case: Screen Reader Simulation (Untagged Fallback)
+ **Target Document**: `logic.pdf` (Academic Text)
+
+ | Step | Action | Expected Result | Pass/Fail |
+ |------|--------|-----------------|-----------|
+ | 3.1 | Select `logic.pdf`. | File loads. | |
+ | 3.2 | Click **Analyze**. | Analysis completes. | |
+ | 3.3 | Go to **Advanced Analysis** -> **2. Screen Reader Simulator**. | Accordion opens. | |
+ | 3.4 | Set **Reading Order** to "Raw" or "TBLR". | Settings accepted. | |
+ | 3.5 | Click **Generate Transcript**. | **Result**: Transcript appears in the text box.<br> **Header**: "⚠️ Simulated from visual order (PDF not tagged)".<br> **Content**: Contains readable text (e.g., "A Logical Interpretation..."). | |
+
+ **Success Criteria**: The simulator successfully uses the fallback logic (visual ordering) instead of crashing when no structure tree is present.
+
+ ---
+
+ ## 4. Test Case: Feature Availability Check (Negative Testing)
+ **Target Document**: Any of the above
+
+ | Step | Action | Expected Result | Pass/Fail |
+ |------|--------|-----------------|-----------|
+ | 4.1 | Open **5. Block-to-Tag Mapping**. | Accordion opens. | |
+ | 4.2 | Click **Map Blocks to Tags**. | **Result**: "## No Mappings Found" (because there are no tags). | |
+ | 4.3 | Open **3. Paragraph Detection** and click **Analyze**. | **Result**: Visual paragraphs are detected (green boxes), but **Semantic <P> Tags** count is 0. | |
+
+ **Success Criteria**: Features requiring tags explicitly state that tags are missing rather than showing empty/broken UIs.
+
+ ### 1.6 Landscape / Rotated Documents
+ - **Why**: Ensure overlays align correctly on rotated pages.
+ - **Test**:
+   - Load a PDF with landscape pages (or 90-degree rotation).
+   - Verify that the blue/red bounding boxes align perfectly with the text.
+   - Verify that "reading order" flows logically (e.g., top-left of the *visual* page).
+
+ ## Known Limitations / Expected Behavior
+ * **Untagged Alerts**: All examples provided are untagged; the alert is **expected behavior**.
+ * **Reading Order**: Without tags, reading order is a guess. Columns might be read left-to-right across the page in "Raw" mode.
USABILITY_AUDIT.md ADDED
@@ -0,0 +1,45 @@
1
+ # Usability & Workflow Audit Report
2
+
3
+ ## 1. Overview
4
+ This audit focused on the "Exploration" workflow: how easily a user can navigate a document, identify issues, and understand the relationship between the visual layout and the underlying structure (tags/order).
5
+
6
+ **Tested Document**: `logic.pdf` (Untagged, Academic Text)
7
+
8
+ ## 2. Critical Friction Points
9
+
10
+ ### 2.1. Disconnected Views (Split Attention)
11
+ * **Issue**: The "Visual Map" (colored boxes showing reading order) and the "Tools" (Screen Reader, Structure Tree) live in separate tabs.
12
+ * **Impact**: A user cannot see *why* the Screen Reader is reading text in a specific order because the visual map disappears when they switch to the "Advanced Analysis" tab.
13
+ * **Severity**: **High**. It breaks the mental model of "Cause (Visual Block) -> Effect (Screen Reader Output)".
14
+
15
+ ### 2.2. Broken Feedback Loops (No Auto-Update)
16
+ * **Issue**: Changing critical exploration controlsβ€”specifically **Page Number** and **Order Mode** (Raw/TBLR)β€”does not immediately update the visualization.
17
+ * **Impact**: The user must click "Analyze" after every minor adjustment. This discourages exploration and makes "A/B testing" settings (like toggling between Raw and TBLR sorting) frustratingly slow.
18
+ * **Severity**: **High**.
19
+
20
+ ### 2.3. Stale State in Advanced Tools
21
+ * **Issue**: When the global Page Number is changed, the "Screen Reader Simulator" text remains on the previous page's content until "Generate" is manually clicked.
22
+ * **Impact**: Users may mistakenly analyze the wrong page text.
23
+ * **Severity**: Medium.
24
+
25
+ ### 2.4. Hidden Navigation
26
+ * **Issue**: The "Pages" gallery component is empty/gray, providing no visual cues for navigation. Users are forced to guess page numbers.
27
+ * **Severity**: Low/Medium.
28
+
29
+ ## 3. Recommended Solutions
30
+
31
+ ### 3.1. "Unified Explorer" Layout
32
+ Refactor the UI to a split-screen design:
33
+ * **Left Panel (Persistent)**: The Main Page Visualizer (Image with overlays). This remains visible at all times.
34
+ * **Right Panel (Contextual)**: Tabbed interface for "Summary", "Screen Reader", "Structure Tree", and "Paragraphs".
35
+ * **Benefit**: Users can run the Screen Reader simulation while looking at the visual block numbers to verify the path.
36
+
37
+ ### 3.2. Reactive Controls
38
+ * **Fix**: Wire the **Page Number** input and **Order Mode** radio buttons to trigger the analysis function automatically (with a debounce if necessary).
39
+ * **Fix**: Ensure Advanced Tools listen to the global page number and auto-refresh (or show a "Refresh Needed" indicator).
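
The debounce mentioned above can be sketched in plain Python, independent of the UI framework. This is a minimal illustration under stated assumptions, not code from `app.py`: `Debouncer`, `on_change`, and `tick` are hypothetical names, and timestamps are passed in explicitly so the behaviour is deterministic. In the real app the `action` would be the analysis function.

```python
from typing import Callable, Optional, Tuple


class Debouncer:
    """Collapse a burst of control changes into one call (hypothetical helper)."""

    def __init__(self, wait_s: float, action: Callable[[int, str], None]) -> None:
        self.wait_s = wait_s          # quiet period required before firing
        self.action = action          # e.g. the page analysis function
        self._pending: Optional[Tuple[int, str]] = None
        self._last_event_at: float = 0.0

    def on_change(self, page: int, order_mode: str, now: float) -> None:
        # Remember only the latest values; earlier ones in the burst are dropped.
        self._pending = (page, order_mode)
        self._last_event_at = now

    def tick(self, now: float) -> bool:
        # Fire once the controls have been quiet for at least wait_s seconds.
        if self._pending is None or now - self._last_event_at < self.wait_s:
            return False
        page, order_mode = self._pending
        self._pending = None
        self.action(page, order_mode)
        return True
```

Only the last change in a burst reaches the analysis function, so dragging the page slider across several pages should trigger one render rather than one per intermediate value. Whether a debounce is needed at all depends on how quickly a page renders.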
40
+
41
+ ### 3.3. Navigation Clarity
42
+ * Add simple "Previous / Next" buttons next to the page number for easier sequential browsing.
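
One way the "Previous / Next" buttons could compute their target page is sketched below. `step_page` is a hypothetical helper, not a function in `app.py`; it assumes the 1-based page numbering used by the UI and clamps the result to the document's page range.

```python
def step_page(current: int, delta: int, page_count: int) -> int:
    """Return the 1-based page reached by moving `delta` pages, clamped to the document."""
    if page_count < 1:
        # No document loaded yet: stay on the first page.
        return 1
    return max(1, min(page_count, current + delta))
```

A "Previous" button would call it with `delta=-1` and a "Next" button with `delta=1`, writing the result back to the page number control.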
43
+
44
+ ## 4. Conclusion
45
+ The current tool works correctly but requires excessive clicking and context switching. Implementing the **Unified Explorer Layout** and **Reactive Controls** should substantially reduce cognitive load and make the tool considerably more effective for debugging.
app.py CHANGED
@@ -1,6 +1,6 @@
1
  # app.py
2
  from __future__ import annotations
3
-
4
  import math
5
  import re
6
  import time
@@ -76,19 +76,16 @@ def _rect_i(rect: Tuple[float, float, float, float]) -> Tuple[int, int, int, int
76
  x0, y0, x1, y1 = rect
77
  return (int(round(x0)), int(round(y0)), int(round(x1)), int(round(y1)))
78
 
79
- def _safe_str(x: Any, max_len: int = 400) -> str:
80
- s = str(x)
81
- if len(s) > max_len:
82
- s = s[:max_len] + "…"
83
- return s
 
84
 
85
- def _looks_like_math(text: str) -> bool:
86
- # Heuristic: mathy glyphs/symbols and patterns
87
- if not text:
88
- return False
89
- math_syms = r"[∑∫√≈≠≤≥∞±×÷∂∇∈∩∪⊂⊆⊇⊃→↦∀∃ℝℤℚℕ]"
90
- latexy = r"(\\frac|\\sqrt|\\sum|\\int|_|\^|\b(?:sin|cos|tan|log|ln)\b)"
91
- return bool(re.search(math_syms, text) or re.search(latexy, text))
92
 
93
  # -----------------------------
94
  # Background Color Sampling for Adaptive Contrast
@@ -159,72 +156,20 @@ def get_contrast_colors(luminance: float) -> Dict[str, Tuple[int, int, int, int]
159
  """
160
  return LIGHT_BG_COLORS if luminance > 0.5 else DARK_BG_COLORS
161
 
162
- @dataclass
163
- class SpanInfo:
164
- bbox: Tuple[float, float, float, float]
165
- text: str
166
- font: str
167
- size: float
168
-
169
- @dataclass
170
- class BlockInfo:
171
- bbox: Tuple[float, float, float, float]
172
- text: str
173
- block_type: int # 0 text, 1 image, 2 drawing in PyMuPDF terms for some outputs
174
- spans: List[SpanInfo]
175
-
176
- @dataclass
177
- class PageDiagnostic:
178
- """Extended diagnostic for batch processing."""
179
- page_num: int
180
- tagged_pdf: bool
181
- text_len: int
182
- image_block_count: int
183
- font_count: int
184
- has_type3_fonts: bool
185
- suspicious_garbled_text: bool
186
- likely_scanned_image_page: bool
187
- likely_text_as_vector_outlines: bool
188
- multi_column_guess: bool
189
- processing_time_ms: Optional[int] = None
190
-
191
- @dataclass
192
- class BatchAnalysisResult:
193
- """Aggregate results from all pages."""
194
- total_pages: int
195
- pages_analyzed: int
196
- summary_stats: Dict[str, int]
197
- per_page_results: List[PageDiagnostic]
198
- common_issues: List[str]
199
- critical_pages: List[int]
200
- processing_time_sec: float
201
-
202
- def to_dict(self) -> Dict[str, Any]:
203
- """Convert to JSON-serializable dict."""
204
- return {
205
- "total_pages": self.total_pages,
206
- "pages_analyzed": self.pages_analyzed,
207
- "summary_stats": self.summary_stats,
208
- "per_page_results": [
209
- {
210
- "page_num": p.page_num,
211
- "tagged_pdf": p.tagged_pdf,
212
- "text_len": p.text_len,
213
- "image_block_count": p.image_block_count,
214
- "font_count": p.font_count,
215
- "has_type3_fonts": p.has_type3_fonts,
216
- "suspicious_garbled_text": p.suspicious_garbled_text,
217
- "likely_scanned_image_page": p.likely_scanned_image_page,
218
- "likely_text_as_vector_outlines": p.likely_text_as_vector_outlines,
219
- "multi_column_guess": p.multi_column_guess,
220
- "processing_time_ms": p.processing_time_ms,
221
- }
222
- for p in self.per_page_results
223
- ],
224
- "common_issues": self.common_issues,
225
- "critical_pages": self.critical_pages,
226
- "processing_time_sec": self.processing_time_sec,
227
- }
228
 
229
  # -----------------------------
230
  # PDF structural checks (pikepdf)
@@ -282,83 +227,6 @@ def pdf_struct_report(pdf_path: str) -> Dict[str, Any]:
282
  # Layout extraction + ordering (PyMuPDF)
283
  # -----------------------------
284
 
285
- def extract_blocks_spans(doc: fitz.Document, page_index: int) -> List[BlockInfo]:
286
- page = doc[page_index]
287
- raw = page.get_text("dict") # includes blocks/lines/spans with bboxes
288
- blocks: List[BlockInfo] = []
289
- for b in raw.get("blocks", []):
290
- btype = int(b.get("type", -1))
291
- bbox = tuple(b.get("bbox", (0, 0, 0, 0)))
292
- text_parts: List[str] = []
293
- spans: List[SpanInfo] = []
294
- if btype == 0: # text
295
- for line in b.get("lines", []):
296
- for sp in line.get("spans", []):
297
- t = sp.get("text", "")
298
- if t:
299
- text_parts.append(t)
300
- spans.append(
301
- SpanInfo(
302
- bbox=tuple(sp.get("bbox", (0, 0, 0, 0))),
303
- text=t,
304
- font=_safe_str(sp.get("font", "")),
305
- size=float(sp.get("size", 0.0)),
306
- )
307
- )
308
- text = "".join(text_parts).strip()
309
- blocks.append(BlockInfo(bbox=bbox, text=text, block_type=btype, spans=spans))
310
- return blocks
311
-
312
- def order_blocks(blocks: List[BlockInfo], mode: str) -> List[Tuple[int, BlockInfo]]:
313
- """
314
- Return list of (idx, block) in chosen order.
315
- """
316
- indexed = list(enumerate(blocks))
317
- if mode == "raw":
318
- return indexed
319
-
320
- def key_tblr(item: Tuple[int, BlockInfo]) -> Tuple[int, int]:
321
- _, b = item
322
- x0, y0, x1, y1 = b.bbox
323
- return (int(y0), int(x0))
324
-
325
- if mode == "tblr":
326
- return sorted(indexed, key=key_tblr)
327
-
328
- if mode == "columns":
329
- # Simple 2-column heuristic:
330
- # cluster by x-center around midline, then sort within each column.
331
- # This is a heuristic; tagged PDFs should make this unnecessary.
332
- xs = []
333
- for _, b in indexed:
334
- x0, y0, x1, y1 = b.bbox
335
- if (x1 - x0) > 5:
336
- xs.append((x0 + x1) / 2.0)
337
- if not xs:
338
- return sorted(indexed, key=key_tblr)
339
- mid = sorted(xs)[len(xs) // 2]
340
-
341
- left = []
342
- right = []
343
- for it in indexed:
344
- _, b = it
345
- x0, y0, x1, y1 = b.bbox
346
- cx = (x0 + x1) / 2.0
347
- (left if cx < mid else right).append(it)
348
-
349
- left = sorted(left, key=key_tblr)
350
- right = sorted(right, key=key_tblr)
351
-
352
- # Read left column first, then right
353
- return left + right
354
-
355
- # Fallback
356
- return sorted(indexed, key=key_tblr)
357
-
358
- # -----------------------------
359
- # Render overlay images
360
- # -----------------------------
361
-
362
  def render_page_with_overlay(
363
  doc: fitz.Document,
364
  page_index: int,
@@ -453,7 +321,7 @@ def render_paragraph_overlay(
453
  draw = ImageDraw.Draw(img, 'RGBA')
454
 
455
  # Extract blocks for bounding boxes
456
- blocks = extract_blocks_spans(pdf_path, page_index)
457
 
458
  # Scale factor from PDF points to pixels
459
  scale = dpi / 72.0
@@ -664,23 +532,37 @@ def format_batch_summary_markdown(batch: BatchAnalysisResult) -> str:
664
  **Issues Found:**
665
  """
666
 
667
- for issue, count in batch.summary_stats.items():
668
- pct = (count / batch.pages_analyzed) * 100 if batch.pages_analyzed > 0 else 0
669
- icon = "❌" if count > 0 else "✓"
670
- issue_name = issue.replace('_', ' ').title()
671
- md += f"\n- {icon} **{issue_name}**: {count} pages ({pct:.1f}%)"
672
 
673
- if batch.common_issues:
674
- md += f"\n\n**Common Issues (affecting >50% of pages):**\n"
675
- for issue in batch.common_issues:
676
- md += f"- {issue.replace('_', ' ').title()}\n"
 
 
 
 
 
677
 
678
- if batch.critical_pages:
679
- md += f"\n\n**Critical Pages (3+ issues):**\n"
680
- pages_str = ', '.join(map(str, batch.critical_pages[:20]))
681
- md += f"Pages: {pages_str}"
682
- if len(batch.critical_pages) > 20:
683
- md += f" ... and {len(batch.critical_pages) - 20} more"
 
 
 
 
 
 
 
 
 
 
 
 
 
684
 
685
  return md
686
 
@@ -797,15 +679,36 @@ def format_diagnostic_summary(diag: Dict[str, Any], struct: Dict[str, Any]) -> s
797
  # -----------------------------
798
 
799
  def load_pdf(fileobj) -> Tuple[str, int, str]:
800
- # fileobj is a gradio UploadedFile-like with .name
801
- pdf_path = fileobj.name
802
- with fitz.open(pdf_path) as doc:
803
- n = doc.page_count
804
- return pdf_path, n, f"Loaded: {pdf_path} ({n} pages)"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
805
 
806
  def analyze(pdf_path: str, page_num: int, dpi: int, order_mode: str, show_spans: bool, highlight_math: bool):
807
  if not pdf_path:
808
- return None, {}, "Upload a PDF first."
809
 
810
  # page_num is 1-based in UI
811
  page_index = max(0, int(page_num) - 1)
@@ -860,7 +763,24 @@ def analyze(pdf_path: str, page_num: int, dpi: int, order_mode: str, show_spans:
860
  # Generate formatted summary with icons and explanations
861
  summary = format_diagnostic_summary(diag, struct)
862
 
863
- return overlay, report, summary
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
864
 
865
  def analyze_batch_with_progress(
866
  pdf_path: str,
@@ -891,55 +811,76 @@ def analyze_batch_with_progress(
891
  # UI
892
  # -----------------------------
893
 
 
 
 
 
894
  with gr.Blocks(title="PDF Structure Inspector") as demo:
895
  gr.Markdown(
896
  """
897
  # PDF Structure Inspector (screen reader / reading order / math debugging)
898
-
899
- Upload a PDF and inspect:
900
- - **Tagged vs untagged**
901
- - **Text/image blocks**
902
- - Different **reading order heuristics**
903
- - Red flags for **OCR-needed**, **text-as-outlines**, **Type3 fonts**, **garbled text**
904
  """
905
  )
906
 
 
907
  with gr.Row():
908
- pdf_file = gr.File(label="PDF file", file_types=[".pdf"])
909
- status = gr.Textbox(label="Status", interactive=False)
910
-
911
- with gr.Row():
912
- pdf_path = gr.Textbox(label="Internal path", visible=False)
913
- page_count = gr.Number(label="Pages", value=1, precision=0, interactive=False)
 
 
 
 
 
914
 
915
- with gr.Tabs():
916
- with gr.Tab("Single Page Analysis"):
 
 
 
 
 
 
 
 
 
 
917
  with gr.Row():
918
- page_num = gr.Slider(label="Page", minimum=1, maximum=1, value=1, step=1)
919
- dpi = gr.Slider(label="Render DPI", minimum=72, maximum=300, value=150, step=1)
 
920
 
921
- with gr.Row():
922
- order_mode = gr.Radio(
923
- ["raw", "tblr", "columns"],
924
- value="raw",
925
- label="Overlay order mode",
926
- info="Choose block ordering strategy. Hover options for details.",
927
- )
928
- show_spans = gr.Checkbox(
929
- value=False,
930
- label="Show span boxes",
931
- info="Display individual text spans (words/fragments) for font-level debugging"
932
- )
933
- highlight_math = gr.Checkbox(
934
- value=True,
935
- label="Flag blocks that look mathy",
936
- info="Highlights blocks with math notation (needs MathML or alt text)"
937
- )
938
-
939
- run_btn = gr.Button("Analyze")
940
-
941
- with gr.Accordion("📖 Understanding the Diagnostics", open=False):
942
- gr.Markdown("""
 
 
 
 
 
 
943
  ### What Each Diagnostic Means
944
 
945
  **🏷️ Tagged PDF**: Tagged PDFs include structure tags (headings, lists, reading order) that screen readers use for navigation. Untagged PDFs force assistive technology to guess the reading order based on visual layout, often leading to incorrect results.
@@ -951,7 +892,7 @@ Upload a PDF and inspect:
951
  - Screen readers cannot pronounce text correctly
952
  - Text search doesn't work
953
 
954
- **πŸ”€ Garbled Text**: Replacement characters (οΏ½) indicate missing or incorrect ToUnicode mappings in the PDF. Screen readers will mispronounce affected text.
955
 
956
  **✏️ Text as Outlines**: When text is rendered as vector paths instead of actual text, screen readers cannot extract or read it. The document appears to have text visually but is inaccessible.
957
 
@@ -959,314 +900,222 @@ Upload a PDF and inspect:
959
 
960
  ### Reading Order Modes
961
 
962
- **Raw**: Shows blocks in the order PyMuPDF extracted them from the PDF structure. This often reflects the order content was added to the PDF during creation, not the intended reading order.
963
-
964
- **Top-to-Bottom Left-to-Right (TBLR)**: Simple geometric sorting that reads from the top of the page to the bottom, breaking ties by left-to-right position. Works well for simple single-column documents.
965
-
966
- **Columns**: Attempts to detect two-column layouts by finding the median x-position and reading the left column completely before the right column. This is a heuristic and may fail on complex layouts.
967
- """)
968
-
969
- with gr.Row():
970
- overlay_img = gr.Image(label="Page overlay (blocks/spans labeled)", type="pil")
971
- summary = gr.Markdown()
972
- report = gr.JSON(label="Report (struct + diagnosis + reading order preview)")
973
-
974
- with gr.Tab("Batch Analysis"):
975
- with gr.Row():
976
- batch_max_pages = gr.Slider(
977
- label="Max pages to analyze",
978
- minimum=1,
979
- maximum=500,
980
- value=100,
981
- step=1,
982
- info="Limit analysis for very large documents"
983
- )
984
- batch_sample_rate = gr.Slider(
985
- label="Sample rate",
986
- minimum=1,
987
- maximum=10,
988
- value=1,
989
- step=1,
990
- info="Analyze every Nth page (1 = all pages)"
991
- )
992
-
993
- batch_run_btn = gr.Button("Analyze All Pages", variant="primary")
994
- batch_progress = gr.Textbox(label="Progress", interactive=False)
995
-
996
- with gr.Accordion("Summary Statistics", open=True):
997
- batch_summary_md = gr.Markdown()
998
-
999
- with gr.Accordion("Issue Breakdown", open=True):
1000
- batch_chart = gr.Plot(label="Issues by Type")
1001
-
1002
- with gr.Accordion("Per-Page Results", open=False):
1003
- batch_table = gr.HTML()
1004
-
1005
- batch_json = gr.JSON(label="Full Batch Report", visible=False)
1006
-
1007
- # Advanced Analysis Tab
1008
- with gr.Tab("Advanced Analysis"):
1009
- gr.Markdown("""
1010
- # Advanced PDF Accessibility Analysis
1011
-
1012
- Power-user features for deep PDF inspection and accessibility debugging.
1013
- These tools help diagnose complex accessibility issues and understand internal PDF structure.
1014
- """)
1015
-
1016
- with gr.Accordion("1. Content Stream Inspector", open=False):
1017
- gr.Markdown("""
1018
- **Purpose**: Inspect raw PDF content stream operators for a specific block
1019
-
1020
- Shows the low-level PDF commands that render text and graphics. Useful for debugging
1021
- text extraction issues, font problems, and positioning.
1022
- """)
1023
-
1024
- cs_block_dropdown = gr.Dropdown(
1025
- label="Select Block",
1026
- choices=[],
1027
- info="Choose a text or image block to inspect"
1028
- )
1029
- cs_inspect_btn = gr.Button("Extract Operators", variant="primary")
1030
-
1031
- with gr.Tabs():
1032
- with gr.Tab("Formatted"):
1033
- cs_operator_display = gr.Markdown()
1034
- with gr.Tab("Raw Stream"):
1035
- cs_raw_stream = gr.Code(label="Raw PDF Content Stream")
1036
-
1037
- with gr.Accordion("2. Screen Reader Simulator", open=False):
1038
- gr.Markdown("""
1039
- **Purpose**: Simulate how NVDA or JAWS would read this page
1040
-
1041
- Generates a transcript showing what a screen reader user would hear, including
1042
- element announcements and reading order. Works with both tagged and untagged PDFs.
1043
- """)
1044
-
1045
- with gr.Row():
1046
- sr_reader = gr.Radio(
1047
- ["NVDA", "JAWS"],
1048
- value="NVDA",
1049
- label="Screen Reader",
1050
- info="Choose which screen reader to simulate"
1051
- )
1052
- sr_detail = gr.Radio(
1053
- ["minimal", "default", "verbose"],
1054
- value="default",
1055
- label="Detail Level",
1056
- info="How much context information to include"
1057
- )
1058
- sr_order = gr.Radio(
1059
- ["raw", "tblr", "columns"],
1060
- value="tblr",
1061
- label="Reading Order (untagged fallback)",
1062
- info="Used only if PDF has no structure tree"
1063
- )
1064
-
1065
- sr_btn = gr.Button("Generate Transcript", variant="primary")
1066
-
1067
- with gr.Tabs():
1068
- with gr.Tab("Transcript"):
1069
- sr_transcript = gr.Textbox(
1070
- lines=20,
1071
- label="Screen Reader Output",
1072
- interactive=False
1073
- )
1074
- with gr.Tab("Analysis"):
1075
- sr_analysis = gr.Markdown()
1076
-
1077
- with gr.Accordion("3. Paragraph Detection", open=False):
1078
- gr.Markdown("""
1079
- **Purpose**: Compare visual paragraphs vs semantic paragraph tags
1080
-
1081
- Identifies paragraphs based on spacing (visual) and compares them to &lt;P&gt; tags
1082
- in the structure tree (semantic). Mismatches can cause confusion for screen reader users.
1083
- """)
1084
-
1085
- para_threshold = gr.Slider(
1086
- label="Vertical Gap Threshold (points)",
1087
- minimum=5,
1088
- maximum=30,
1089
- value=15,
1090
- step=1,
1091
- info="Minimum vertical spacing to consider a paragraph break"
1092
- )
1093
-
1094
- para_btn = gr.Button("Analyze Paragraphs", variant="primary")
1095
- para_overlay = gr.Image(label="Paragraph Visualization", type="pil")
1096
-
1097
- with gr.Row():
1098
- para_visual = gr.Number(label="Visual Paragraphs", interactive=False)
1099
- para_semantic = gr.Number(label="Semantic <P> Tags", interactive=False)
1100
- para_score = gr.Number(label="Match Quality", interactive=False)
1101
-
1102
- para_mismatches = gr.Markdown()
1103
-
1104
- with gr.Accordion("4. Structure Tree Visualizer", open=False):
1105
- gr.Markdown("""
1106
- **Purpose**: Display the complete PDF tag hierarchy
1107
-
1108
- Shows the entire structure tree for tagged PDFs, including tag types, alt text,
1109
- and page references. Only works for PDFs with accessibility tagging.
1110
- """)
1111
-
1112
- struct_btn = gr.Button("Extract Structure Tree", variant="primary")
1113
-
1114
- with gr.Tabs():
1115
- with gr.Tab("Tree Diagram"):
1116
- struct_plot = gr.Plot(label="Interactive Hierarchy")
1117
- with gr.Tab("Text View"):
1118
- struct_text = gr.Textbox(
1119
- lines=30,
1120
- label="Structure Tree",
1121
- interactive=False
1122
- )
1123
- with gr.Tab("Statistics"):
1124
- struct_stats = gr.Markdown()
1125
-
1126
- with gr.Accordion("5. Block-to-Tag Mapping", open=False):
1127
- gr.Markdown("""
1128
- **Purpose**: Link visual blocks to structure tree elements
1129
-
1130
- Maps each visual block to its corresponding tag in the structure tree via
1131
- MCID (Marked Content ID) references. Shows which content is properly tagged.
1132
- """)
1133
-
1134
- map_btn = gr.Button("Map Blocks to Tags", variant="primary")
1135
- map_message = gr.Markdown()
1136
- map_table = gr.DataFrame(
1137
- headers=["Block #", "Tag Type", "MCID", "Alt Text"],
1138
- label="Block-to-Tag Correlations",
1139
- interactive=False
1140
- )
1141
-
1142
- def _on_upload(f):
1143
  path, n, msg = load_pdf(f)
 
 
1144
  return path, n, msg, gr.update(maximum=n, value=1)
1145
-
1146
- pdf_file.change(_on_upload, inputs=[pdf_file], outputs=[pdf_path, page_count, status, page_num])
1147
-
1148
- run_btn.click(
1149
- analyze,
1150
- inputs=[pdf_path, page_num, dpi, order_mode, show_spans, highlight_math],
1151
- outputs=[overlay_img, report, summary],
1152
- )
1153
-
1154
- batch_run_btn.click(
1155
- analyze_batch_with_progress,
1156
- inputs=[pdf_path, batch_max_pages, batch_sample_rate],
1157
- outputs=[batch_summary_md, batch_chart, batch_table, batch_json, batch_progress]
1158
- )
1159
-
1160
- # Advanced Analysis Callbacks
1161
-
 
 
 
 
 
1162
  def update_block_dropdown(pdf_path_val, page_num_val):
1163
  """Update block dropdown when page changes."""
1164
  if not pdf_path_val:
1165
  return gr.update(choices=[], value=None)
1166
-
1167
  try:
1168
- blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
1169
- choices = create_block_choices(blocks)
1170
- return gr.update(choices=choices, value=0 if choices else None)
 
 
 
1171
  except:
1172
  return gr.update(choices=[], value=None)
1173
 
1174
  def run_content_stream_inspector(pdf_path_val, page_num_val, block_idx):
1175
- """Run content stream analysis for selected block."""
1176
  if not pdf_path_val or block_idx is None:
1177
  return "Please select a block", ""
1178
-
1179
  try:
1180
- blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
1181
- result = analyze_content_stream(pdf_path_val, page_num_val - 1, block_idx, blocks)
1182
-
1183
  if result.get('error'):
1184
  return result['message'], ""
1185
-
1186
  return result['formatted'], result['raw']
1187
  except Exception as e:
1188
  return f"## Error\n\n{str(e)}", ""
1189
 
1190
  def run_screen_reader_sim(pdf_path_val, page_num_val, reader, detail, order):
1191
- """Run screen reader simulation."""
1192
  if not pdf_path_val:
1193
  return "Please upload a PDF first", ""
1194
-
1195
  try:
1196
- blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
1197
- result = analyze_screen_reader(pdf_path_val, page_num_val - 1, blocks, reader, detail, order)
1198
-
1199
  if result.get('error'):
1200
  return result.get('message', 'Error'), ""
1201
-
1202
  return result['transcript'], result['analysis']
1203
  except Exception as e:
1204
  return f"## Error\n\n{str(e)}", ""
1205
 
1206
  def run_paragraph_detection(pdf_path_val, page_num_val, dpi_val, threshold):
1207
- """Run paragraph detection and comparison."""
1208
  if not pdf_path_val:
1209
  return None, 0, 0, 0.0, "Please upload a PDF first"
1210
-
1211
  try:
1212
- blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
1213
- result = analyze_paragraphs(pdf_path_val, page_num_val - 1, blocks, threshold)
1214
-
1215
  if result.get('error'):
1216
  return None, 0, 0, 0.0, result.get('message', 'Error')
1217
-
1218
- # Create visualization overlay
1219
  overlay = render_paragraph_overlay(
1220
  pdf_path_val, page_num_val - 1, dpi_val,
1221
  result['visual_paragraphs'], result['semantic_paragraphs']
1222
  )
1223
-
1224
  return (
1225
- overlay,
1226
- result['visual_count'],
1227
- result['semantic_count'],
1228
- result['match_score'],
1229
- result['mismatches']
1230
  )
1231
  except Exception as e:
1232
  return None, 0, 0, 0.0, f"## Error\n\n{str(e)}"
1233
 
1234
  def run_structure_tree_extraction(pdf_path_val):
1235
- """Extract and visualize structure tree."""
1236
  if not pdf_path_val:
1237
  return None, "Please upload a PDF first", ""
1238
-
1239
  try:
1240
  result = analyze_structure_tree(pdf_path_val)
1241
-
1242
  if result.get('error'):
1243
  return None, result['message'], ""
1244
-
1245
  return result['plot_data'], result['text_view'], result['statistics']
1246
  except Exception as e:
1247
  return None, f"## Error\n\n{str(e)}", ""
1248
 
1249
  def run_block_tag_mapping(pdf_path_val, page_num_val):
1250
- """Map blocks to structure tags."""
1251
  if not pdf_path_val:
1252
  return "Please upload a PDF first", []
1253
-
1254
  try:
1255
- blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
1256
- result = analyze_block_tag_mapping(pdf_path_val, page_num_val - 1, blocks)
1257
-
1258
  if result.get('error'):
1259
  return result.get('message', 'Error'), []
1260
-
1261
  return result['message'], result['mappings']
1262
  except Exception as e:
1263
  return f"## Error\n\n{str(e)}", []
1264
-
1265
- # Wire up Advanced Analysis callbacks
 
 
 
 
 
 
 
 
 
 
1266
  page_num.change(
1267
- update_block_dropdown,
1268
- inputs=[pdf_path, page_num],
1269
- outputs=[cs_block_dropdown]
 
 
 
 
1270
  )
1271
 
1272
  cs_inspect_btn.click(
@@ -1299,6 +1148,12 @@ Upload a PDF and inspect:
1299
  outputs=[map_message, map_table]
1300
  )
1301
 
 
 
 
 
 
 
1302
  if __name__ == "__main__":
1303
- demo.launch()
1304
 
 
1
  # app.py
2
  from __future__ import annotations
3
+ import os
4
  import math
5
  import re
6
  import time
 
76
  x0, y0, x1, y1 = rect
77
  return (int(round(x0)), int(round(y0)), int(round(x1)), int(round(y1)))
78
 
79
+ # Removed _safe_str and _looks_like_math from app.py as they are now in layout_utils
80
+ # but keeping them here might be needed if other local functions use them without prefix.
81
+ # Checking usage...
82
+ # _safe_str is used in pdf_struct_report
83
+ # _looks_like_math is used in render_page_with_overlay
84
+ # Since we imported them from layout_utils above, we can remove the definitions here.
85
 
86
+ # -----------------------------
87
+ # Background Color Sampling for Adaptive Contrast
88
+ # -----------------------------
 
 
 
 
89
 
90
  # -----------------------------
91
  # Background Color Sampling for Adaptive Contrast
 
156
  """
157
  return LIGHT_BG_COLORS if luminance > 0.5 else DARK_BG_COLORS
158
 
159
+ # Moving layout logic to layout_utils.py
160
+ from layout_utils import (
161
+ SpanInfo,
162
+ BlockInfo,
163
+ extract_blocks_spans,
164
+ order_blocks,
165
+ _safe_str,
166
+ _looks_like_math,
167
+ PageDiagnostic,
168
+ BatchAnalysisResult
169
+ )
170
+
171
+
172
+ # Re-exporting for compatibility if needed, using the imported names directly from now on.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
173
 
174
  # -----------------------------
175
  # PDF structural checks (pikepdf)
 
227
  # Layout extraction + ordering (PyMuPDF)
228
  # -----------------------------
229
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
230
  def render_page_with_overlay(
231
  doc: fitz.Document,
232
  page_index: int,
 
321
  draw = ImageDraw.Draw(img, 'RGBA')
322
 
323
  # Extract blocks for bounding boxes
324
+ blocks = extract_blocks_spans(doc, page_index)
325
 
326
  # Scale factor from PDF points to pixels
327
  scale = dpi / 72.0
 
532
  **Issues Found:**
533
  """
534
 
535
+ md += "\n\n**Detailed Breakdown:**\n"
 
 
 
 
536
 
537
+ # Define issues and their readable names
538
+ from layout_utils import PageDiagnostic
539
+ issue_map = {
540
+ 'likely_scanned_image_page': 'Scanned Pages',
541
+ 'has_type3_fonts': 'Type3 Fonts',
542
+ 'suspicious_garbled_text': 'Garbled Text',
543
+ 'multi_column_guess': 'Multi-Column (Untagged)',
544
+ 'likely_text_as_vector_outlines': 'Text as Outlines'
545
+ }
546
 
547
+ for issue_attr, issue_name in issue_map.items():
548
+ # Find pages with this issue
549
+ affected_pages = []
550
+ for p in batch.per_page_results:
551
+ if getattr(p, issue_attr, False):
552
+ affected_pages.append(p.page_num)
553
+
554
+ if affected_pages:
555
+ icon = "❌"
556
+ count = len(affected_pages)
557
+ pct = (count / batch.pages_analyzed) * 100 if batch.pages_analyzed > 0 else 0
558
+
559
+ # Format page list (truncate if too long)
560
+ page_list_str = ", ".join(map(str, affected_pages[:30]))
561
+ if len(affected_pages) > 30:
562
+ page_list_str += f" ... ({len(affected_pages) - 30} more)"
563
+
564
+ md += f"\n### {icon} {issue_name}: {count} pages ({pct:.1f}%)\n"
565
+ md += f"**Pages**: {page_list_str}\n"
566
 
567
  return md
568
 
 
679
  # -----------------------------
680
 
681
  def load_pdf(fileobj) -> Tuple[str, int, str]:
682
+ """
683
+ Robustly load a PDF file and return its path, page count, and status message.
684
+ Handles Gradio FileData objects, string paths (from examples), and None.
685
+ """
686
+ if fileobj is None:
687
+ return "", 0, "Waiting for PDF upload..."
688
+
689
+ # Extract path from Gradio FileData or use string directly
690
+ if isinstance(fileobj, str):
691
+ pdf_path = fileobj
692
+ elif hasattr(fileobj, "path"):
693
+ pdf_path = fileobj.path
694
+ elif hasattr(fileobj, "name"):
695
+ pdf_path = fileobj.name
696
+ else:
697
+ pdf_path = str(fileobj)
698
+
699
+ if not pdf_path or not os.path.exists(pdf_path):
700
+ return "", 0, f"Error: File not found at {pdf_path}"
701
+
702
+ try:
703
+ with fitz.open(pdf_path) as doc:
704
+ n = doc.page_count
705
+ return pdf_path, n, f"✓ Loaded: {os.path.basename(pdf_path)} ({n} pages)"
706
+ except Exception as e:
707
+ return "", 0, f"❌ Error loading PDF: {str(e)}"
708
 
709
  def analyze(pdf_path: str, page_num: int, dpi: int, order_mode: str, show_spans: bool, highlight_math: bool):
710
  if not pdf_path:
711
+ return None, {}, "Upload a PDF first.", ""
712
 
713
  # page_num is 1-based in UI
714
  page_index = max(0, int(page_num) - 1)
 
763
  # Generate formatted summary with icons and explanations
764
  summary = format_diagnostic_summary(diag, struct)
765
 
766
+ # Check for compatibility and prepend warning if needed
767
+ if not struct.get("has_struct_tree_root"):
768
+ summary = "### ⚠️ Accessibility Alert: Untagged Document\n\n" + \
769
+ "**This document is likely incompatible with screen readers.**\n\n" + \
770
+ "It lacks the 'structure tree' (tags) required for accessibility tools to understand headings, paragraphs, and reading order.\n\n" + \
771
+ "**What you can do:**\n" + \
772
+ "- **Remediate**: Open the original source file (Word, PowerPoint) and save as 'PDF (Best for electronic distribution and accessibility)'\n" + \
773
+ "- **Retrofit**: Use Adobe Acrobat Pro's 'Accessibility' tool to auto-tag the document.\n\n" + \
774
+ "---\n\n" + summary
775
+
776
+ if diag["likely_scanned_image_page"]:
777
+ summary = "### ❌ Critical Issue: Scanned Page\n\n" + \
778
+ "**This page appears to be an image with no readable text.**\n\n" + \
779
+ "Screen readers cannot read this content at all.\n\n" + \
780
+ "**Action Required**: Perform Optical Character Recognition (OCR) using Adobe Acrobat or an OCR tool to make the text selectable and readable.\n\n" + \
781
+ "---\n\n" + summary
782
+
783
+ return overlay, report, summary, preview
784
 
785
  def analyze_batch_with_progress(
786
  pdf_path: str,
 
811
  # UI
812
  # -----------------------------
813
 
814
+ # -----------------------------
815
+ # UI
816
+ # -----------------------------
817
+
818
  with gr.Blocks(title="PDF Structure Inspector") as demo:
819
  gr.Markdown(
820
  """
821
  # PDF Structure Inspector (screen reader / reading order / math debugging)
 
 
 
 
 
 
822
  """
823
  )
824
 
825
+ # 1. Top Bar: Loader & Global Stats
826
  with gr.Row():
827
+ pdf_file = gr.File(label="Upload PDF", file_types=[".pdf"], scale=1)
828
+ with gr.Column(scale=2):
829
+ status = gr.Textbox(label="Status", interactive=False)
830
+ # Hidden states
831
+ pdf_path = gr.Textbox(visible=False)
832
+ page_count = gr.Number(visible=False)
833
+
834
+ gr.Examples(
835
+ examples=["test_document.pdf", "18.1 Notes.pdf", "logic.pdf"],
836
+ inputs=pdf_file
837
+ )
838
 
839
+ # 2. Control Panel
840
+ with gr.Row(variant="panel"):
841
+ with gr.Column(scale=2):
842
+ page_num = gr.Slider(label="Page Number", minimum=1, maximum=1, value=1, step=1)
843
+ with gr.Column(scale=1):
844
+ dpi = gr.Slider(label="Zoom (DPI)", minimum=72, maximum=300, value=150, step=1)
845
+ with gr.Column(scale=1):
846
+ order_mode = gr.Dropdown(
847
+ ["raw", "tblr", "columns"], value="raw", label="Reading Order",
848
+ info="Strategy for untagged content"
849
+ )
850
+ with gr.Column(scale=2, min_width=200):
851
  with gr.Row():
852
+ show_spans = gr.Checkbox(label="Show Spans", value=False)
853
+ highlight_math = gr.Checkbox(label="Highlight Math", value=True)
854
+ run_btn = gr.Button("Forced Refresh", variant="secondary", size="sm")
855
 
856
+ # 3. Main Workspace (Split View)
857
+ with gr.Row():
858
+ # LEFT: Visualization (Persistent)
859
+ with gr.Column(scale=6):
860
+ gr.Markdown("### 1. Visual Inspection")
861
+ overlay_img = gr.Image(label="Page Analysis Overlay (Live)", type="pil", interactive=False, height=800)
862
+ summary = gr.Markdown(elem_classes=["result-markdown"])
863
+
864
+ # RIGHT: Tools (Contextual)
865
+ with gr.Column(scale=5):
866
+ gr.Markdown("### 2. Deep Dive Tools")
867
+
868
+ with gr.Tabs():
869
+ # --- TAB 1: DETAILS ---
870
+ with gr.Tab("Details & Structure"):
871
+ with gr.Accordion("Reading Order Preview", open=True):
872
+ reading_order_preview = gr.Textbox(
873
+ label="Detected text flow",
874
+ lines=20,
875
+ interactive=False,
876
+ info="This is the order text will be fed to accessibility tools (if untagged)."
877
+ )
878
+
879
+ with gr.Accordion("Full Technical Report (JSON)", open=False):
880
+ report = gr.JSON(label="Page Report")
881
+
882
+ with gr.Accordion("Help: Understanding Diagnostics", open=False):
883
+ gr.Markdown("""
884
  ### What Each Diagnostic Means
885
 
886
  **🏷️ Tagged PDF**: Tagged PDFs include structure tags (headings, lists, reading order) that screen readers use for navigation. Untagged PDFs force assistive technology to guess the reading order based on visual layout, often leading to incorrect results.
 
892
  - Screen readers cannot pronounce text correctly
893
  - Text search doesn't work
894
 
895
+ **πŸ”€ Garbled Text**: Replacement characters () indicate missing or incorrect ToUnicode mappings in the PDF. Screen readers will mispronounce affected text.
896
 
897
  **✏️ Text as Outlines**: When text is rendered as vector paths instead of actual text, screen readers cannot extract or read it. The document appears to have text visually but is inaccessible.
898
 
 
900
 
901
  ### Reading Order Modes
902
 
903
+ **Raw**: Extraction order, how PyMuPDF found blocks (often = creation order).
904
+ **TBLR**: Top-to-bottom, left-to-right geometric sorting.
905
+ **Columns**: Two-column heuristic (clusters by x-position).
906
+ """)
907
+
908
+ # --- TAB 2: ADVANCED ---
909
+ with gr.Tab("Advanced Tools"):
910
+ gr.Markdown("Power-user features for deep PDF inspection.")
911
+
912
+ # 1. Content Stream
913
+ with gr.Accordion("1. Content Stream Inspector", open=False):
914
+ gr.Markdown("**Inspect raw PDF content stream operators for a specific block**")
915
+ cs_block_dropdown = gr.Dropdown(label="Select Block", choices=[], info="Choose a block to inspect")
916
+ cs_inspect_btn = gr.Button("Extract Operators", size="sm")
917
+ with gr.Tabs():
918
+ with gr.Tab("Formatted"):
919
+ cs_operator_display = gr.Markdown()
920
+ with gr.Tab("Raw"):
921
+ cs_raw_stream = gr.Code(label="Raw Stream")
922
+
923
+ # 2. Screen Reader
924
+ with gr.Accordion("2. Screen Reader Simulator", open=True):
925
+ gr.Markdown("**Simulate how NVDA or JAWS would read this page**")
926
+ with gr.Row():
927
+ sr_reader = gr.Radio(["NVDA", "JAWS"], value="NVDA", label="Reader", scale=1)
928
+ sr_detail = gr.Radio(["minimal", "default", "verbose"], value="default", label="Detail", scale=1)
929
+ sr_order = gr.Radio(["raw", "tblr", "columns"], value="tblr", label="Fallback Order", scale=1)
930
+
931
+ sr_btn = gr.Button("Generate Transcript", variant="primary")
932
+ with gr.Tabs():
933
+ with gr.Tab("Transcript"):
934
+ sr_transcript = gr.Textbox(lines=15, label="Output", interactive=False)
935
+ with gr.Tab("Analysis"):
936
+ sr_analysis = gr.Markdown()
937
+
938
+ # 3. Paragraph Detection
939
+ with gr.Accordion("3. Paragraph Detection", open=False):
940
+ gr.Markdown("**Compare visual paragraphs vs semantic paragraph tags**")
941
+ para_threshold = gr.Slider(label="Gap Threshold", minimum=5, maximum=30, value=15, step=1)
942
+ para_btn = gr.Button("Analyze Paragraphs")
943
+
944
+ para_overlay = gr.Image(label="Paragraph Visualization", type="pil", height=400)
945
+
946
+ with gr.Row():
947
+ para_visual = gr.Number(label="Visual", interactive=False)
948
+ para_semantic = gr.Number(label="Semantic <P>", interactive=False)
949
+ para_score = gr.Number(label="Match Quality", interactive=False)
950
+ para_mismatches = gr.Markdown()
951
+
952
+ # 4. Structure Tree
953
+ with gr.Accordion("4. Structure Tree Visualizer", open=False):
954
+ gr.Markdown("**Display the complete PDF tag hierarchy**")
955
+ struct_btn = gr.Button("Extract Tree")
956
+ with gr.Tabs():
957
+ with gr.Tab("Diagram"):
958
+ struct_plot = gr.Plot()
959
+ with gr.Tab("Text View"):
960
+ struct_text = gr.Textbox(lines=20)
961
+ with gr.Tab("Stats"):
962
+ struct_stats = gr.Markdown()
963
+
964
+ # 5. Mapping
965
+ with gr.Accordion("5. Block-to-Tag Mapping", open=False):
966
+ gr.Markdown("**Link visual blocks to structure tree elements**")
967
+ map_btn = gr.Button("Map Blocks")
968
+ map_message = gr.Markdown()
969
+ map_table = gr.DataFrame(headers=["Block #", "Tag Type", "MCID", "Alt Text"])
970
+
971
+
972
+ # --- TAB 3: BATCH ---
973
+ with gr.Tab("Batch Analysis"):
974
+ with gr.Row():
975
+ batch_max_pages = gr.Slider(label="Max pages", minimum=1, maximum=500, value=100)
976
+ batch_sample_rate = gr.Slider(label="Sample rate", minimum=1, maximum=10, value=1)
977
+ batch_run_btn = gr.Button("Analyze All Pages", variant="primary")
978
+ batch_progress = gr.Textbox(label="Progress", interactive=False)
979
+
980
+ with gr.Accordion("Summary", open=True):
981
+ batch_summary_md = gr.Markdown()
982
+ with gr.Accordion("Details", open=False):
983
+ batch_chart = gr.Plot()
984
+ batch_table = gr.HTML()
985
+ batch_json = gr.JSON(visible=False)
986
+
987
+ # --- CALLBACKS & WIRING ---
988
+
989
+ def _on_file_change(f):
990
  path, n, msg = load_pdf(f)
991
+ if not path:
992
+ return path, n, msg, gr.update(maximum=1, value=1)
993
  return path, n, msg, gr.update(maximum=n, value=1)
994
+
995
+ # Main Analysis Inputs/Outputs
996
+ # Note: analyze() now returns (overlay, report, summary, preview)
997
+ analysis_inputs = [pdf_path, page_num, dpi, order_mode, show_spans, highlight_math]
998
+ analysis_outputs = [overlay_img, report, summary, reading_order_preview]
999
+
1000
+ # Upload & Example Triggers
1001
+ pdf_file.change(_on_file_change, inputs=[pdf_file], outputs=[pdf_path, page_count, status, page_num]) \
1002
+ .then(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
1003
+
1004
+ # Reactive Event Listeners
1005
+ # Note: page_num.change is strictly better for 'Exploration' than release,
1006
+ # as it updates while typing or stepping.
1007
+ page_num.change(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
1008
+ dpi.release(analyze, inputs=analysis_inputs, outputs=analysis_outputs) # DPI is heavy, use release
1009
+ order_mode.change(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
1010
+ show_spans.change(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
1011
+ highlight_math.change(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
1012
+ run_btn.click(analyze, inputs=analysis_inputs, outputs=analysis_outputs)
1013
+
1014
+ # Advanced Analysis Helper Functions (Closures to capture inputs if needed, or just pure)
1015
+
1016
  def update_block_dropdown(pdf_path_val, page_num_val):
1017
  """Update block dropdown when page changes."""
1018
  if not pdf_path_val:
1019
  return gr.update(choices=[], value=None)
 
1020
  try:
1021
+ with fitz.open(pdf_path_val) as doc:
1022
+ blocks = extract_blocks_spans(doc, page_num_val - 1)
1023
+ if not blocks:
1024
+ return gr.update(choices=[], value=None)
1025
+ choices = create_block_choices(blocks)
1026
+ return gr.update(choices=choices, value=0 if choices else None)
1027
  except Exception:
1028
  return gr.update(choices=[], value=None)
1029
 
1030
  def run_content_stream_inspector(pdf_path_val, page_num_val, block_idx):
 
1031
  if not pdf_path_val or block_idx is None:
1032
  return "Please select a block", ""
 
1033
  try:
1034
+ with fitz.open(pdf_path_val) as doc:
1035
+ blocks = extract_blocks_spans(doc, page_num_val - 1)
1036
+ result = analyze_content_stream(pdf_path_val, page_num_val - 1, block_idx, blocks)
1037
  if result.get('error'):
1038
  return result['message'], ""
 
1039
  return result['formatted'], result['raw']
1040
  except Exception as e:
1041
  return f"## Error\n\n{str(e)}", ""
1042
 
1043
  def run_screen_reader_sim(pdf_path_val, page_num_val, reader, detail, order):
 
1044
  if not pdf_path_val:
1045
  return "Please upload a PDF first", ""
 
1046
  try:
1047
+ with fitz.open(pdf_path_val) as doc:
1048
+ blocks = extract_blocks_spans(doc, page_num_val - 1)
1049
+ result = analyze_screen_reader(pdf_path_val, page_num_val - 1, blocks, reader, detail, order)
1050
  if result.get('error'):
1051
  return result.get('message', 'Error'), ""
 
1052
  return result['transcript'], result['analysis']
1053
  except Exception as e:
1054
  return f"## Error\n\n{str(e)}", ""
1055
 
1056
  def run_paragraph_detection(pdf_path_val, page_num_val, dpi_val, threshold):
 
1057
  if not pdf_path_val:
1058
  return None, 0, 0, 0.0, "Please upload a PDF first"
 
1059
  try:
1060
+ with fitz.open(pdf_path_val) as doc:
1061
+ blocks = extract_blocks_spans(doc, page_num_val - 1)
1062
+ result = analyze_paragraphs(pdf_path_val, page_num_val - 1, blocks, threshold)
1063
  if result.get('error'):
1064
  return None, 0, 0, 0.0, result.get('message', 'Error')
 
 
1065
  overlay = render_paragraph_overlay(
1066
  pdf_path_val, page_num_val - 1, dpi_val,
1067
  result['visual_paragraphs'], result['semantic_paragraphs']
1068
  )
 
1069
  return (
1070
+ overlay, result['visual_count'], result['semantic_count'],
1071
+ result['match_score'], result['mismatches']
 
 
 
1072
  )
1073
  except Exception as e:
1074
  return None, 0, 0, 0.0, f"## Error\n\n{str(e)}"
1075
 
1076
  def run_structure_tree_extraction(pdf_path_val):
 
1077
  if not pdf_path_val:
1078
  return None, "Please upload a PDF first", ""
 
1079
  try:
1080
  result = analyze_structure_tree(pdf_path_val)
 
1081
  if result.get('error'):
1082
  return None, result['message'], ""
 
1083
  return result['plot_data'], result['text_view'], result['statistics']
1084
  except Exception as e:
1085
  return None, f"## Error\n\n{str(e)}", ""
1086
 
1087
  def run_block_tag_mapping(pdf_path_val, page_num_val):
 
1088
  if not pdf_path_val:
1089
  return "Please upload a PDF first", []
 
1090
  try:
1091
+ with fitz.open(pdf_path_val) as doc:
1092
+ blocks = extract_blocks_spans(doc, page_num_val - 1)
1093
+ result = analyze_block_tag_mapping(pdf_path_val, page_num_val - 1, blocks)
1094
  if result.get('error'):
1095
  return result.get('message', 'Error'), []
 
1096
  return result['message'], result['mappings']
1097
  except Exception as e:
1098
  return f"## Error\n\n{str(e)}", []
1099
+
1100
+ # 5. Advanced Tool Wiring
1101
+
1102
+ # Update dropdown when page changes
1103
+ page_num.change(update_block_dropdown, inputs=[pdf_path, page_num], outputs=[cs_block_dropdown])
1104
+
1105
+ # Clear stale results when page changes (User Request: "Did it reset?")
1106
+ # We clear the outputs of advanced tools so users know they need to regenerate
1107
+ def clear_stale():
1108
+ return None, "", None, "", None, 0, 0, 0, "", None, "", ""
1109
+
1110
+ # Actually, let's keep it simple. Just clearing the main ones users look at.
1111
  page_num.change(
1112
+ lambda: ("", ""),
1113
+ outputs=[sr_transcript, sr_analysis]
1114
+ )
1115
+ # Also clear paragraph overlay?
1116
+ page_num.change(
1117
+ lambda: None,
1118
+ outputs=[para_overlay]
1119
  )
1120
 
1121
  cs_inspect_btn.click(
 
1148
  outputs=[map_message, map_table]
1149
  )
1150
 
1151
+ batch_run_btn.click(
1152
+ analyze_batch_with_progress,
1153
+ inputs=[pdf_path, batch_max_pages, batch_sample_rate],
1154
+ outputs=[batch_summary_md, batch_chart, batch_table, batch_json, batch_progress]
1155
+ )
1156
+
1157
  if __name__ == "__main__":
1158
+ demo.launch(css=".result-markdown { font-size: 14px; } .help-md { font-size: 12px; color: #666; }")
1159
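The `run_*` helpers in this file repeat the same shape: return a placeholder when no PDF is loaded, otherwise run the analysis inside `try`/`except` and turn any exception into a Markdown error message. A minimal sketch of how that guard could be factored out is shown below. The `guarded` decorator and `run_block_tag_mapping_sketch` are hypothetical names that do not appear in the commit, and the sketch does not cover the separate `result.get('error')` branch the real helpers handle.

```python
from functools import wraps
from typing import Any, Callable, Tuple


def guarded(on_error: Callable[[str], Tuple[Any, ...]]):
    """Hypothetical decorator: on_error maps a message to the handler's output tuple."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(pdf_path_val, *args):
            # Same early exit the helpers use when nothing has been uploaded.
            if not pdf_path_val:
                return on_error("Please upload a PDF first")
            try:
                return fn(pdf_path_val, *args)
            except Exception as e:
                # Same Markdown error format as the existing helpers.
                return on_error(f"## Error\n\n{e}")
        return wrapper
    return decorate


@guarded(lambda msg: (msg, []))
def run_block_tag_mapping_sketch(pdf_path_val, page_num_val):
    # Stand-in body: the real helper opens the PDF and calls the analysis.
    if page_num_val < 1:
        raise ValueError("page numbers start at 1")
    return ("ok", [[0, "P", 1, ""]])
```

Each handler passes its own `on_error` because the output tuples differ in length and in where the message sits.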
 
layout_utils.py ADDED
@@ -0,0 +1,174 @@
+"""
+Layout Utilities Module
+
+Contains shared logic for block extraction, ordering, and data structures
+to avoid circular dependencies between app.py and other modules.
+"""
+
+from dataclasses import dataclass
+from typing import List, Tuple, Any, Dict, Optional
+import pymupdf as fitz
+import re
+
+@dataclass
+class SpanInfo:
+    bbox: Tuple[float, float, float, float]
+    text: str
+    font: str
+    size: float
+
+@dataclass
+class BlockInfo:
+    bbox: Tuple[float, float, float, float]
+    text: str
+    block_type: int  # 0 text, 1 image, 2 drawing in PyMuPDF terms for some outputs
+    spans: List[SpanInfo]
+
+@dataclass
+class PageDiagnostic:
+    """Extended diagnostic for batch processing."""
+    page_num: int
+    tagged_pdf: bool
+    text_len: int
+    image_block_count: int
+    font_count: int
+    has_type3_fonts: bool
+    suspicious_garbled_text: bool
+    likely_scanned_image_page: bool
+    likely_text_as_vector_outlines: bool
+    multi_column_guess: bool
+    processing_time_ms: Optional[int] = None
+
+@dataclass
+class BatchAnalysisResult:
+    """Aggregate results from all pages."""
+    total_pages: int
+    pages_analyzed: int
+    summary_stats: Dict[str, int]
+    per_page_results: List[PageDiagnostic]
+    common_issues: List[str]
+    critical_pages: List[int]
+    processing_time_sec: float
+
+    def to_dict(self) -> Dict[str, Any]:
+        """Convert to JSON-serializable dict."""
+        return {
+            "total_pages": self.total_pages,
+            "pages_analyzed": self.pages_analyzed,
+            "summary_stats": self.summary_stats,
+            "per_page_results": [
+                {
+                    "page_num": p.page_num,
+                    "tagged_pdf": p.tagged_pdf,
+                    "text_len": p.text_len,
+                    "image_block_count": p.image_block_count,
+                    "font_count": p.font_count,
+                    "has_type3_fonts": p.has_type3_fonts,
+                    "suspicious_garbled_text": p.suspicious_garbled_text,
+                    "likely_scanned_image_page": p.likely_scanned_image_page,
+                    "likely_text_as_vector_outlines": p.likely_text_as_vector_outlines,
+                    "multi_column_guess": p.multi_column_guess,
+                    "processing_time_ms": p.processing_time_ms,
+                }
+                for p in self.per_page_results
+            ],
+            "common_issues": self.common_issues,
+            "critical_pages": self.critical_pages,
+            "processing_time_sec": self.processing_time_sec,
+        }
+
+def _safe_str(x: Any, max_len: int = 400) -> str:
+    s = str(x)
+    if len(s) > max_len:
+        s = s[:max_len] + "…"
+    return s
+
+def _looks_like_math(text: str) -> bool:
+    # Heuristic: mathy glyphs/symbols and patterns
+    if not text:
+        return False
+    math_syms = r"[∑∫√≈≠≤≥∞±×÷∂∇∈∩∪⊂⊆⊇⊃→↦∀∃ℝℤℚℕ]"
+    latexy = r"(\\frac|\\sqrt|\\sum|\\int|_|\^|\b(?:sin|cos|tan|log|ln)\b)"
+    return bool(re.search(math_syms, text) or re.search(latexy, text))
+
+def extract_blocks_spans(doc: fitz.Document, page_index: int) -> List[BlockInfo]:
+    page = doc[page_index]
+    raw = page.get_text("dict")  # includes blocks/lines/spans with bboxes
+    mat = page.rotation_matrix
+    blocks: List[BlockInfo] = []
+    for b in raw.get("blocks", []):
+        btype = int(b.get("type", -1))
+
+        # Transform block bbox to visual coordinates
+        bbox_rect = fitz.Rect(b.get("bbox", (0, 0, 0, 0))) * mat
+        bbox = tuple(bbox_rect)
+
+        text_parts: List[str] = []
+        spans: List[SpanInfo] = []
+        if btype == 0:  # text
+            for line in b.get("lines", []):
+                for sp in line.get("spans", []):
+                    t = sp.get("text", "")
+                    if t:
+                        text_parts.append(t)
+
+                    # Transform span bbox to visual coordinates
+                    sp_bbox_rect = fitz.Rect(sp.get("bbox", (0, 0, 0, 0))) * mat
+
+                    spans.append(
+                        SpanInfo(
+                            bbox=tuple(sp_bbox_rect),
+                            text=t,
+                            font=_safe_str(sp.get("font", "")),
+                            size=float(sp.get("size", 0.0)),
+                        )
+                    )
+        text = "".join(text_parts).strip()
+        blocks.append(BlockInfo(bbox=bbox, text=text, block_type=btype, spans=spans))
+    return blocks
+
+def order_blocks(blocks: List[BlockInfo], mode: str) -> List[Tuple[int, BlockInfo]]:
+    """
+    Return list of (idx, block) in chosen order.
+    """
+    indexed = list(enumerate(blocks))
+    if mode == "raw":
+        return indexed
+
+    def key_tblr(item: Tuple[int, BlockInfo]) -> Tuple[int, int]:
+        _, b = item
+        x0, y0, x1, y1 = b.bbox
+        return (int(y0), int(x0))
+
+    if mode == "tblr":
+        return sorted(indexed, key=key_tblr)
+
+    if mode == "columns":
+        # Simple 2-column heuristic:
+        # cluster by x-center around midline, then sort within each column.
+        # This is a heuristic; tagged PDFs should make this unnecessary.
+        xs = []
+        for _, b in indexed:
+            x0, y0, x1, y1 = b.bbox
+            if (x1 - x0) > 5:
+                xs.append((x0 + x1) / 2.0)
+        if not xs:
+            return sorted(indexed, key=key_tblr)
+        mid = sorted(xs)[len(xs) // 2]
+
+        left = []
+        right = []
+        for it in indexed:
+            _, b = it
+            x0, y0, x1, y1 = b.bbox
+            cx = (x0 + x1) / 2.0
+            (left if cx < mid else right).append(it)
+
+        left = sorted(left, key=key_tblr)
+        right = sorted(right, key=key_tblr)
+
+        # Read left column first, then right
+        return left + right
+
+    # Fallback
+    return sorted(indexed, key=key_tblr)
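To illustrate how the `tblr` and `columns` modes of `order_blocks` differ, here is a small self-contained sketch on a synthetic two-column page. It re-implements the same sort key and median split rather than importing `layout_utils`; the `Block` class and `order_two_columns` are simplified, hypothetical stand-ins for `BlockInfo` and the `columns` branch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:  # hypothetical stand-in for BlockInfo (bbox and text only)
    bbox: Tuple[float, float, float, float]
    text: str


def key_tblr(item):
    # Sort by top edge, then left edge, truncated to ints as in order_blocks.
    _, b = item
    x0, y0, _, _ = b.bbox
    return (int(y0), int(x0))


def center_x(b: Block) -> float:
    return (b.bbox[0] + b.bbox[2]) / 2.0


def order_two_columns(blocks: List[Block]):
    indexed = list(enumerate(blocks))
    # Ignore very narrow blocks when estimating the split point.
    centers = [center_x(b) for _, b in indexed if b.bbox[2] - b.bbox[0] > 5]
    if not centers:
        return sorted(indexed, key=key_tblr)
    mid = sorted(centers)[len(centers) // 2]  # upper median of x-centers
    left = [it for it in indexed if center_x(it[1]) < mid]
    right = [it for it in indexed if center_x(it[1]) >= mid]
    return sorted(left, key=key_tblr) + sorted(right, key=key_tblr)


# Two columns, two rows; the blocks are listed row by row.
blocks = [
    Block((50, 100, 250, 150), "L1"),
    Block((300, 100, 500, 150), "R1"),
    Block((50, 200, 250, 250), "L2"),
    Block((300, 200, 500, 250), "R2"),
]

tblr_text = [b.text for _, b in sorted(enumerate(blocks), key=key_tblr)]
columns_text = [b.text for _, b in order_two_columns(blocks)]
```

On this input `tblr` interleaves the columns (`L1, R1, L2, R2`), while the column heuristic reads the left column first (`L1, L2, R1, R2`). Because the split point is a median of block centers, a page where one column holds most of the blocks can end up with every block on one side, in which case the result is the same as plain `tblr`.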
screen_reader_sim.py CHANGED
@@ -167,7 +167,7 @@ def _simulate_untagged(
     Returns:
         Tuple of (transcript, analysis)
     """
-    from app import order_blocks  # Import the ordering function
+    from layout_utils import order_blocks  # Import the ordering function
 
     # Order blocks according to mode
     ordered_blocks = order_blocks(blocks, order_mode)
@@ -177,7 +177,7 @@ def _simulate_untagged(
     text_block_count = 0
     image_block_count = 0
 
-    for block in ordered_blocks:
+    for idx, block in ordered_blocks:
         if block.block_type == 0:  # Text block
             # Infer heading from font size
             is_heading = False
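The loop change matters because `order_blocks` returns `(idx, block)` pairs rather than bare blocks. The sketch below reproduces the failure with a hypothetical `Block` stand-in, assuming the function already returned pairs before it moved to `layout_utils`.

```python
from dataclasses import dataclass


@dataclass
class Block:  # hypothetical stand-in for BlockInfo
    block_type: int
    text: str


# Same shape as order_blocks output: a list of (idx, block) tuples.
ordered_blocks = list(enumerate([Block(0, "Title"), Block(1, "")]))

# Old loop shape: each item is a tuple, so attribute access fails.
try:
    for block in ordered_blocks:
        block.block_type
    old_loop_error = None
except AttributeError as e:
    old_loop_error = type(e).__name__

# Fixed loop shape: unpack the pair before touching block attributes.
text_blocks = [block.text for idx, block in ordered_blocks if block.block_type == 0]
```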