rianders committed on
Commit
27fda3f
·
1 Parent(s): 9d99474

advanced features

Files changed (6)
  1. CLAUDE.md +111 -1
  2. advanced_analysis.py +430 -0
  3. app.py +353 -1
  4. content_stream_parser.py +322 -0
  5. screen_reader_sim.py +398 -0
  6. structure_tree.py +493 -0
CLAUDE.md CHANGED
@@ -107,6 +107,43 @@ The application has two main modes: **Single Page Analysis** and **Batch Analysi
107
  - `format_batch_results_table()`: Color-coded HTML table per page
108
  - `format_batch_results_chart()`: Plotly bar chart of issue distribution
109
 
110
  ### Key Data Structures
111
 
112
  **Single Page Analysis**:
@@ -122,13 +159,24 @@ The application has two main modes: **Single Page Analysis** and **Batch Analysi
122
  - `critical_pages`: Pages with 3+ issues
123
  - `to_dict()`: Method to convert to JSON-serializable format
124
 
125
  **UI State**:
126
  - The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
127
  - Background color cache: `_bg_color_cache` dict keyed by (document_path, page_index)
128
 
129
  ### Gradio UI Flow
130
 
131
- The UI is organized into two tabs: **Single Page Analysis** and **Batch Analysis**.
132
 
133
  #### Single Page Tab
134
  1. User uploads PDF → `_on_upload` → extracts path and page count
@@ -151,6 +199,51 @@ The UI is organized into two tabs: **Single Page Analysis** and **Batch Analysis
151
  - Color-coded HTML table of per-page results
152
  - Full JSON report
153
 
154
  #### Help & Documentation
155
  - All UI controls have `info` parameters with inline tooltips
156
  - Expandable "📖 Understanding the Diagnostics" accordion with detailed explanations
@@ -233,6 +326,23 @@ Multi-page document analysis with aggregate statistics:
233
  - Plotly charts via `gr.Plot()` for interactive visualizations
234
  - All batch results have `.to_dict()` method for JSON export
235
 
236
  ## Testing
237
 
238
  ### Manual Testing Checklist
 
107
  - `format_batch_results_table()`: Color-coded HTML table per page
108
  - `format_batch_results_chart()`: Plotly bar chart of issue distribution
109
 
110
+ ### Advanced Analysis Modules
111
+
112
+ The application includes specialized modules for advanced PDF accessibility analysis:
113
+
114
+ **advanced_analysis.py** - Coordinator module
115
+ - Provides facade functions with error handling
116
+ - `require_structure_tree` decorator: checks for tagged PDFs before execution
117
+ - `safe_execute` decorator: comprehensive error handling with user-friendly messages
118
+ - Exports high-level functions: `analyze_content_stream`, `analyze_screen_reader`, etc.
119
+
120
+ **content_stream_parser.py** - PDF operator extraction
121
+ - `extract_content_stream_for_block()`: Gets operators for a specific block
122
+ - `_parse_text_objects()`: Extracts BT...ET blocks from content stream
123
+ - `_parse_operators()`: Regex-based parsing of Tm, Tf, Tj, TJ, Td, color operators
124
+ - `_find_matching_text_object()`: Correlates text objects with BlockInfo via text matching
125
+ - Returns formatted markdown and raw stream text
126
+
127
+ **screen_reader_sim.py** - Accessibility simulation
128
+ - `simulate_screen_reader()`: Main simulation function
129
+ - `_simulate_tagged()`: Follows structure tree for tagged PDFs
130
+ - `_simulate_untagged()`: Falls back to visual order for untagged PDFs
131
+ - `_format_element_announcement()`: Generates NVDA/JAWS-style announcements
132
+ - Supports heading levels, paragraphs, figures, formulas, lists, tables, links
133
+ - Infers headings from font size (>18pt = H1, >14pt = H2) for untagged PDFs
134
+
135
+ **structure_tree.py** - Structure tree analysis
136
+ - `StructureNode` dataclass: represents PDF tag hierarchy
137
+ - `extract_structure_tree()`: Recursively parses StructTreeRoot with pikepdf
138
+ - `_parse_structure_element()`: Handles Dictionary, Array, and MCID elements
139
+ - `format_tree_text()`: Creates indented text view with box-drawing characters
140
+ - `get_tree_statistics()`: Counts nodes, tags, alt text coverage
141
+ - `extract_mcid_for_page()`: Finds marked content IDs in page content stream
142
+ - `map_blocks_to_tags()`: Correlates visual blocks with structure elements
143
+ - `detect_visual_paragraphs()`: Spacing-based paragraph detection
144
+ - `detect_semantic_paragraphs()`: Extracts <P> tags for a page
145
+ - `compare_paragraphs()`: Calculates match quality between visual and semantic
146
+
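
The font-size heading heuristic used by the untagged fallback in `screen_reader_sim.py` can be sketched as follows. This is a minimal illustration of the thresholds listed above (>18pt = H1, >14pt = H2); the function name and signature here are illustrative, not the module's actual API:

```python
from typing import Optional

def infer_heading_level(font_size: float) -> Optional[str]:
    """Untagged-PDF fallback heuristic: >18pt reads as H1,
    >14pt as H2, anything smaller as body text."""
    if font_size > 18:
        return "H1"
    if font_size > 14:
        return "H2"
    return None
```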
147
  ### Key Data Structures
148
 
149
  **Single Page Analysis**:
 
159
  - `critical_pages`: Pages with 3+ issues
160
  - `to_dict()`: Method to convert to JSON-serializable format
161
 
162
+ **Advanced Analysis**:
163
+ - `StructureNode`: Represents a node in the PDF structure tree with:
164
+ - `tag_type`: Tag name (P, H1, Document, Figure, etc.)
165
+ - `depth`: Nesting level in the tree
166
+ - `mcid`: Marked Content ID (links to page content)
167
+ - `alt_text`: Alternative text for accessibility
168
+ - `actual_text`: Actual text content or replacement text
169
+ - `page_ref`: 0-based page index
170
+ - `children`: List of child StructureNode objects
171
+ - `to_dict()`: Convert to JSON-serializable format
172
+
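
A minimal sketch of `StructureNode` matching the fields listed above; the actual dataclass in `structure_tree.py` may carry additional helpers:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class StructureNode:
    tag_type: str                   # e.g. "P", "H1", "Document", "Figure"
    depth: int = 0                  # nesting level in the tree
    mcid: Optional[int] = None      # Marked Content ID (links to page content)
    alt_text: Optional[str] = None
    actual_text: Optional[str] = None
    page_ref: Optional[int] = None  # 0-based page index
    children: List["StructureNode"] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        """JSON-serializable view of the subtree."""
        return {
            "tag_type": self.tag_type,
            "depth": self.depth,
            "mcid": self.mcid,
            "alt_text": self.alt_text,
            "actual_text": self.actual_text,
            "page_ref": self.page_ref,
            "children": [c.to_dict() for c in self.children],
        }
```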
173
  **UI State**:
174
  - The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
175
  - Background color cache: `_bg_color_cache` dict keyed by (document_path, page_index)
176
 
177
  ### Gradio UI Flow
178
 
179
+ The UI is organized into three tabs: **Single Page Analysis**, **Batch Analysis**, and **Advanced Analysis**.
180
 
181
  #### Single Page Tab
182
  1. User uploads PDF → `_on_upload` → extracts path and page count
 
199
  - Color-coded HTML table of per-page results
200
  - Full JSON report
201
 
202
+ #### Advanced Analysis Tab
203
+
204
+ Power-user features for deep PDF inspection and accessibility debugging. Each feature is in its own accordion:
205
+
206
+ 1. **Content Stream Inspector**:
207
+ - Extracts raw PDF content stream operators for a specific block
208
+ - Shows low-level commands: text positioning (Tm, Td), fonts (Tf), text display (Tj, TJ)
209
+ - Useful for debugging text extraction, font issues, and positioning problems
210
+ - Provides both formatted view and raw stream
211
+ - Uses regex parsing of content streams (approximate for complex PDFs)
212
+
213
+ 2. **Screen Reader Simulator**:
214
+ - Simulates NVDA or JAWS reading behavior for the current page
215
+ - Two modes:
216
+ - **Tagged PDFs**: Follows structure tree, announces headings/paragraphs/figures with proper semantics
217
+ - **Untagged PDFs**: Falls back to visual reading order, infers headings from font size
218
+ - Three detail levels: minimal (text only), default (element announcements), verbose (full context)
219
+ - Generates transcript + analysis with alt text coverage statistics
220
+ - Reading order configurable for untagged fallback (raw/tblr/columns)
221
+
222
+ 3. **Paragraph Detection**:
223
+ - Compares visual paragraphs (detected by spacing) vs semantic &lt;P&gt; tags
224
+ - Visual detection: groups blocks with vertical gap < threshold (default 15pt)
225
+ - Semantic detection: extracts &lt;P&gt; tags from structure tree
226
+ - Generates color-coded overlay (green = visual paragraphs)
227
+ - Reports match quality score and mismatches
228
+ - Requires tagged PDF for semantic comparison
229
+
230
+ 4. **Structure Tree Visualizer**:
231
+ - Extracts complete PDF tag hierarchy from StructTreeRoot
232
+ - Three visualization formats:
233
+ - **Tree Diagram**: Interactive Plotly sunburst chart
234
+ - **Text View**: Indented text with box-drawing characters
235
+ - **Statistics**: Node counts, tag distribution, alt text coverage
236
+ - Shows tag types (H1-H6, P, Figure, Table, L, LI, etc.)
237
+ - Displays alt text, actual text, page references, and MCID markers
238
+ - Only works for tagged PDFs
239
+
240
+ 5. **Block-to-Tag Mapping**:
241
+ - Maps visual blocks to structure tree elements via MCID (Marked Content ID)
242
+ - Shows which blocks have proper semantic tagging
243
+ - DataFrame output with block index, tag type, MCID, alt text
244
+ - Helps identify untagged content
245
+ - Requires tagged PDF with MCID references
246
+
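
The spacing-based grouping behind feature 3 can be sketched like this; plain bounding-box tuples stand in for the app's `BlockInfo` objects, and the function name here is illustrative:

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # x0, y0, x1, y1 in PDF points

def group_visual_paragraphs(bboxes: List[BBox],
                            gap_threshold: float = 15.0) -> List[List[int]]:
    """Group consecutive blocks into paragraphs: start a new paragraph
    whenever the vertical gap to the previous block meets the threshold."""
    paragraphs: List[List[int]] = []
    current: List[int] = []
    prev_bottom = None
    for i, (_, y0, _, y1) in enumerate(bboxes):
        if prev_bottom is not None and (y0 - prev_bottom) >= gap_threshold:
            paragraphs.append(current)
            current = []
        current.append(i)
        prev_bottom = y1
    if current:
        paragraphs.append(current)
    return paragraphs
```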
247
  #### Help & Documentation
248
  - All UI controls have `info` parameters with inline tooltips
249
  - Expandable "📖 Understanding the Diagnostics" accordion with detailed explanations
 
326
  - Plotly charts via `gr.Plot()` for interactive visualizations
327
  - All batch results have `.to_dict()` method for JSON export
328
 
329
+ ### Advanced Analysis Error Handling
330
+ - **Graceful Degradation**: All advanced features check for requirements before execution
331
+ - **Structure Tree Required**: Features 4 and 5 require tagged PDFs
332
+ - `@require_structure_tree` decorator checks for StructTreeRoot
333
+ - Returns user-friendly error message if not found
334
+ - Explains what tagging is and why it's needed
335
+ - **Safe Execution**: All features wrapped in `@safe_execute` decorator
336
+ - Catches all exceptions with traceback
337
+ - Returns formatted error messages instead of crashing
338
+ - **Content Stream Parsing**: Regex-based, may fail on complex/malformed PDFs
339
+ - Returns "not matched" status if text object not found
340
+ - Shows raw stream even if parsing fails
341
+ - **MCID Extraction**: May fail if content stream uses non-standard encoding
342
+ - Returns empty list on failure
343
+ - Block-to-tag mapping shows "No mappings found" message
344
+ - **Performance Limits**: Structure tree extraction has max_depth=20 to prevent infinite loops
345
+
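
The error-dict convention described above follows a standard decorator pattern. This is a generic sketch that mirrors, but does not reproduce, the exact code in `advanced_analysis.py`; the `analyze` function and path are hypothetical:

```python
import traceback
from functools import wraps

def safe_execute(func):
    """Turn any exception into an error dict instead of crashing the UI."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            return {"error": True,
                    "message": f"## Error\n\n{e}\n\n```\n{traceback.format_exc()}\n```"}
    return wrapper

@safe_execute
def analyze(path: str):
    # Stand-in for a feature function that fails on a malformed PDF
    raise ValueError(f"could not parse {path}")

result = analyze("broken.pdf")
```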
346
  ## Testing
347
 
348
  ### Manual Testing Checklist
advanced_analysis.py ADDED
@@ -0,0 +1,430 @@
1
+ """
2
+ Advanced Analysis Coordinator Module
3
+
4
+ Provides high-level facade functions for advanced PDF accessibility features,
5
+ with error handling and graceful degradation.
6
+ """
7
+
8
+ from typing import Dict, List, Any, Optional, Callable
9
+ from functools import wraps
10
+ import pikepdf
11
+ import traceback
12
+
13
+ # Import feature modules
14
+ from content_stream_parser import (
15
+ extract_content_stream_for_block,
16
+ format_operators_markdown,
17
+ format_raw_stream
18
+ )
19
+ from screen_reader_sim import (
20
+ simulate_screen_reader,
21
+ format_transcript
22
+ )
23
+ from structure_tree import (
24
+ extract_structure_tree,
25
+ format_tree_text,
26
+ get_tree_statistics,
27
+ format_statistics_markdown,
28
+ map_blocks_to_tags,
29
+ detect_visual_paragraphs,
30
+ detect_semantic_paragraphs,
31
+ compare_paragraphs
32
+ )
33
+
34
+
35
+ def require_structure_tree(func: Callable) -> Callable:
36
+ """
37
+ Decorator to check for structure tree before executing function.
38
+
39
+ Functions decorated with this will return an error message if the PDF
40
+ does not have a tagged structure tree.
41
+ """
42
+ @wraps(func)
43
+ def wrapper(pdf_path: str, *args, **kwargs):
44
+ try:
45
+ with pikepdf.open(pdf_path) as pdf:
46
+ if '/StructTreeRoot' not in pdf.Root:
47
+ return {
48
+ 'error': True,
49
+ 'message': '## No Structure Tree Found\n\n'
50
+ 'This PDF does not have a tagged structure tree. '
51
+ 'This feature requires a tagged PDF.\n\n'
52
+ '**What this means**: The PDF was not created with '
53
+ 'accessibility tagging, so semantic structure information '
54
+ '(headings, paragraphs, alt text) is not available.\n\n'
55
+ '**Recommendation**: Use authoring tools that support '
56
+ 'PDF/UA tagging (Adobe Acrobat, MS Word with "Save as Tagged PDF").'
57
+ }
58
+ except Exception as e:
59
+ return {
60
+ 'error': True,
61
+ 'message': f'## Error\n\nCould not open PDF: {str(e)}'
62
+ }
63
+
64
+ return func(pdf_path, *args, **kwargs)
65
+
66
+ return wrapper
67
+
68
+
69
+ def safe_execute(func: Callable) -> Callable:
70
+ """
71
+ Decorator for safe execution with comprehensive error handling.
72
+
73
+ Catches all exceptions and returns user-friendly error messages.
74
+ """
75
+ @wraps(func)
76
+ def wrapper(*args, **kwargs):
77
+ try:
78
+ return func(*args, **kwargs)
79
+ except Exception as e:
80
+ error_trace = traceback.format_exc()
81
+ return {
82
+ 'error': True,
83
+ 'message': f'## Error\n\n{str(e)}\n\n**Details**:\n```\n{error_trace}\n```'
84
+ }
85
+
86
+ return wrapper
87
+
88
+
89
+ # Feature 1: Content Stream Inspector
90
+
91
+ @safe_execute
92
+ def analyze_content_stream(
93
+ pdf_path: str,
94
+ page_index: int,
95
+ block_index: int,
96
+ blocks: List[Any]
97
+ ) -> Dict[str, Any]:
98
+ """
99
+ Analyze content stream operators for a specific block.
100
+
101
+ Args:
102
+ pdf_path: Path to PDF file
103
+ page_index: 0-based page index
104
+ block_index: Index of block to analyze
105
+ blocks: List of BlockInfo objects
106
+
107
+ Returns:
108
+ Dictionary with formatted operators and raw stream
109
+ """
110
+ result = extract_content_stream_for_block(pdf_path, page_index, block_index, blocks)
111
+
112
+ if 'error' in result:
113
+ return {
114
+ 'error': True,
115
+ 'message': f"## Error\n\n{result['error']}"
116
+ }
117
+
118
+ return {
119
+ 'error': False,
120
+ 'formatted': format_operators_markdown(result),
121
+ 'raw': format_raw_stream(result.get('raw_stream', '')),
122
+ 'matched': result.get('matched', False)
123
+ }
124
+
125
+
126
+ # Feature 2: Screen Reader Simulator
127
+
128
+ @safe_execute
129
+ def analyze_screen_reader(
130
+ pdf_path: str,
131
+ page_index: int,
132
+ blocks: List[Any],
133
+ reader_type: str = "NVDA",
134
+ detail_level: str = "default",
135
+ order_mode: str = "tblr"
136
+ ) -> Dict[str, Any]:
137
+ """
138
+ Simulate screen reader output for a page.
139
+
140
+ Args:
141
+ pdf_path: Path to PDF file
142
+ page_index: 0-based page index
143
+ blocks: List of BlockInfo objects
144
+ reader_type: "NVDA" or "JAWS"
145
+ detail_level: "minimal", "default", or "verbose"
146
+ order_mode: Reading order for untagged fallback
147
+
148
+ Returns:
149
+ Dictionary with transcript and analysis
150
+ """
151
+ result = simulate_screen_reader(
152
+ pdf_path, page_index, blocks, reader_type, detail_level, order_mode
153
+ )
154
+
155
+ return {
156
+ 'error': False,
157
+ 'transcript': format_transcript(result),
158
+ 'analysis': result['analysis'],
159
+ 'mode': result['mode']
160
+ }
161
+
162
+
163
+ # Feature 3: Paragraph Detection
164
+
165
+ @safe_execute
166
+ def analyze_paragraphs(
167
+ pdf_path: str,
168
+ page_index: int,
169
+ blocks: List[Any],
170
+ vertical_gap_threshold: float = 15.0
171
+ ) -> Dict[str, Any]:
172
+ """
173
+ Compare visual and semantic paragraph detection.
174
+
175
+ Args:
176
+ pdf_path: Path to PDF file
177
+ page_index: 0-based page index
178
+ blocks: List of BlockInfo objects
179
+ vertical_gap_threshold: Spacing threshold for visual paragraphs
180
+
181
+ Returns:
182
+ Dictionary with comparison results
183
+ """
184
+ # Detect visual paragraphs
185
+ visual_paragraphs = detect_visual_paragraphs(blocks, vertical_gap_threshold)
186
+
187
+ # Detect semantic paragraphs
188
+ semantic_paragraphs = detect_semantic_paragraphs(pdf_path, page_index)
189
+
190
+ # Compare
191
+ comparison = compare_paragraphs(visual_paragraphs, semantic_paragraphs)
192
+
193
+ # Format mismatches
194
+ mismatch_lines = [
195
+ "## Paragraph Comparison",
196
+ "",
197
+ f"**Visual Paragraphs Detected**: {comparison['visual_count']}",
198
+ f"**Semantic &lt;P&gt; Tags Found**: {comparison['semantic_count']}",
199
+ f"**Match Quality Score**: {comparison['match_score']:.2%}",
200
+ ""
201
+ ]
202
+
203
+ if comparison['count_mismatch'] == 0:
204
+ mismatch_lines.append("✓ Count matches between visual and semantic paragraphs")
205
+ else:
206
+ mismatch_lines.append(f"⚠️ Count mismatch: {comparison['count_mismatch']} difference")
207
+
208
+ if comparison['visual_count'] > comparison['semantic_count']:
209
+ mismatch_lines.extend([
210
+ "",
211
+ "**Issue**: More visual paragraphs than semantic tags",
212
+ "- Some paragraphs may be missing &lt;P&gt; tags",
213
+ "- Screen readers may not announce paragraph boundaries properly"
214
+ ])
215
+ elif comparison['semantic_count'] > comparison['visual_count']:
216
+ mismatch_lines.extend([
217
+ "",
218
+ "**Issue**: More semantic tags than visual paragraphs",
219
+ "- Tags may not correspond to actual visual layout",
220
+ "- May cause confusion for users comparing visual and audio presentation"
221
+ ])
222
+
223
+ if not semantic_paragraphs and visual_paragraphs:
224
+ mismatch_lines.extend([
225
+ "",
226
+ "❌ **No semantic tagging found**",
227
+ "This page has no &lt;P&gt; tags. Screen readers will not announce paragraphs."
228
+ ])
229
+
230
+ return {
231
+ 'error': False,
232
+ 'visual_count': comparison['visual_count'],
233
+ 'semantic_count': comparison['semantic_count'],
234
+ 'match_score': comparison['match_score'],
235
+ 'mismatches': '\n'.join(mismatch_lines),
236
+ 'visual_paragraphs': visual_paragraphs,
237
+ 'semantic_paragraphs': semantic_paragraphs
238
+ }
239
+
240
+
241
+ # Feature 4: Structure Tree Visualizer
242
+
243
+ @require_structure_tree
244
+ @safe_execute
245
+ def analyze_structure_tree(pdf_path: str) -> Dict[str, Any]:
246
+ """
247
+ Extract and visualize the PDF structure tree.
248
+
249
+ Args:
250
+ pdf_path: Path to PDF file
251
+
252
+ Returns:
253
+ Dictionary with tree visualization and statistics
254
+ """
255
+ root = extract_structure_tree(pdf_path)
256
+
257
+ if not root:
258
+ return {
259
+ 'error': True,
260
+ 'message': '## Error\n\nCould not extract structure tree'
261
+ }
262
+
263
+ # Generate text view
264
+ text_view = format_tree_text(root, max_nodes=500)
265
+
266
+ # Generate statistics
267
+ stats = get_tree_statistics(root)
268
+ stats_markdown = format_statistics_markdown(stats)
269
+
270
+ # Generate plotly diagram
271
+ plot_data = _create_tree_plot(root)
272
+
273
+ return {
274
+ 'error': False,
275
+ 'text_view': text_view,
276
+ 'statistics': stats_markdown,
277
+ 'plot_data': plot_data,
278
+ 'stats': stats
279
+ }
280
+
281
+
282
+ def _create_tree_plot(root):
283
+ """
284
+ Create Plotly sunburst diagram data from structure tree.
285
+
286
+ Args:
287
+ root: Root StructureNode
288
+
289
+ Returns:
290
+ Plotly figure
291
+ """
292
+ import plotly.graph_objects as go
293
+
294
+ labels = []
295
+ parents = []
296
+ values = []
297
+ colors = []
298
+
299
+ # Color map for common tag types
300
+ color_map = {
301
+ 'Document': '#1f77b4',
302
+ 'Part': '#ff7f0e',
303
+ 'Sect': '#2ca02c',
304
+ 'H1': '#d62728',
305
+ 'H2': '#9467bd',
306
+ 'H3': '#8c564b',
307
+ 'H4': '#e377c2',
308
+ 'H5': '#7f7f7f',
309
+ 'H6': '#bcbd22',
310
+ 'P': '#17becf',
311
+ 'Figure': '#ff9896',
312
+ 'Table': '#c5b0d5',
313
+ 'L': '#c49c94',
314
+ 'LI': '#f7b6d2',
315
+ 'Link': '#c7c7c7',
316
+ }
317
+
318
+ def _traverse(node, parent_label=None):
319
+ # Create unique label
320
+ if node.depth == 0:
321
+ label = node.tag_type
322
+ else:
323
+ label = f"{node.tag_type}_{len(labels)}"
324
+
325
+ labels.append(label)
326
+ parents.append(parent_label if parent_label else "")
327
+ values.append(1)
328
+
329
+ # Assign color
330
+ base_tag = node.tag_type.split('_')[0]
331
+ color = color_map.get(base_tag, '#d3d3d3')
332
+ colors.append(color)
333
+
334
+ # Process children
335
+ for child in node.children:
336
+ _traverse(child, label)
337
+
338
+ _traverse(root)
339
+
340
+ fig = go.Figure(go.Sunburst(
341
+ labels=labels,
342
+ parents=parents,
343
+ values=values,
344
+ marker=dict(colors=colors),
345
+ branchvalues="remainder"
346
+ ))
347
+
348
+ fig.update_layout(
349
+ title="PDF Structure Tree Hierarchy",
350
+ height=600,
351
+ margin=dict(t=50, l=0, r=0, b=0)
352
+ )
353
+
354
+ return fig
355
+
356
+
357
+ # Feature 5: Block-to-Tag Mapping
358
+
359
+ @require_structure_tree
360
+ @safe_execute
361
+ def analyze_block_tag_mapping(
362
+ pdf_path: str,
363
+ page_index: int,
364
+ blocks: List[Any]
365
+ ) -> Dict[str, Any]:
366
+ """
367
+ Map visual blocks to structure tree tags.
368
+
369
+ Args:
370
+ pdf_path: Path to PDF file
371
+ page_index: 0-based page index
372
+ blocks: List of BlockInfo objects
373
+
374
+ Returns:
375
+ Dictionary with mapping table
376
+ """
377
+ mappings = map_blocks_to_tags(pdf_path, page_index, blocks)
378
+
379
+ if not mappings:
380
+ return {
381
+ 'error': False,
382
+ 'mappings': [],
383
+ 'message': '## No Mappings Found\n\n'
384
+ 'Could not find block-to-tag correlations for this page. '
385
+ 'This may occur if:\n'
386
+ '- The page has no marked content IDs (MCIDs)\n'
387
+ '- The structure tree is not properly linked to content\n'
388
+ '- The page uses a non-standard tagging approach'
389
+ }
390
+
391
+ # Format as table data
392
+ table_data = []
393
+ for m in mappings:
394
+ table_data.append([
395
+ str(m['block_index']),
396
+ m['tag_type'],
397
+ str(m['mcid']),
398
+ m['alt_text'][:50] if m['alt_text'] else ""
399
+ ])
400
+
401
+ return {
402
+ 'error': False,
403
+ 'mappings': table_data,
404
+ 'count': len(mappings),
405
+ 'message': f'## Block-to-Tag Mapping\n\nFound {len(mappings)} correlations'
406
+ }
407
+
408
+
409
+ # Utility function for creating block dropdown choices
410
+
411
+ def create_block_choices(blocks: List[Any]) -> List[tuple]:
412
+ """
413
+ Create dropdown choices from blocks for UI.
414
+
415
+ Args:
416
+ blocks: List of BlockInfo objects
417
+
418
+ Returns:
419
+ List of (label, value) tuples
420
+ """
421
+ choices = []
422
+ for i, block in enumerate(blocks):
423
+ text_preview = block.text[:50].replace('\n', ' ').strip()
424
+ if len(block.text) > 50:
425
+ text_preview += "..."
426
+
427
+ label = f"Block {i}: {text_preview}" if text_preview else f"Block {i} [Image]"
428
+ choices.append((label, i))
429
+
430
+ return choices
app.py CHANGED
@@ -13,6 +13,16 @@ import pymupdf as fitz # PyMuPDF
13
  import pikepdf
14
  from PIL import Image, ImageDraw, ImageFont
15
 
16
  # -----------------------------
17
  # Color Palettes for Adaptive Contrast
18
  # -----------------------------
@@ -413,8 +423,73 @@ def render_page_with_overlay(
413
 
414
  return img
415
 
416
  # -----------------------------
417
- # Heuristic problems report
418
  # -----------------------------
419
 
420
  def diagnose_page(doc: fitz.Document, page_index: int, struct: Dict[str, Any]) -> Dict[str, Any]:
@@ -929,6 +1004,141 @@ Upload a PDF and inspect:
929
 
930
  batch_json = gr.JSON(label="Full Batch Report", visible=False)
931
 
932
  def _on_upload(f):
933
  path, n, msg = load_pdf(f)
934
  return path, n, msg, gr.update(maximum=n, value=1)
@@ -947,6 +1157,148 @@ Upload a PDF and inspect:
947
  outputs=[batch_summary_md, batch_chart, batch_table, batch_json, batch_progress]
948
  )
949
 
950
  if __name__ == "__main__":
951
  demo.launch()
952
 
13
  import pikepdf
14
  from PIL import Image, ImageDraw, ImageFont
15
 
16
+ # Advanced analysis modules
17
+ from advanced_analysis import (
18
+ analyze_content_stream,
19
+ analyze_screen_reader,
20
+ analyze_paragraphs,
21
+ analyze_structure_tree,
22
+ analyze_block_tag_mapping,
23
+ create_block_choices
24
+ )
25
+
26
  # -----------------------------
27
  # Color Palettes for Adaptive Contrast
28
  # -----------------------------
 
423
 
424
  return img
425
 
426
+
427
+ def render_paragraph_overlay(
428
+ pdf_path: str,
429
+ page_index: int,
430
+ dpi: int,
431
+ visual_paragraphs: List[List[int]],
432
+ semantic_paragraphs: List[Any]
433
+ ) -> Image.Image:
434
+ """
435
+ Render page with color-coded paragraph visualizations.
436
+
437
+ Args:
438
+ pdf_path: Path to PDF file
439
+ page_index: 0-based page index
440
+ dpi: Rendering DPI
441
+ visual_paragraphs: List of visual paragraph groups (block indices)
442
+ semantic_paragraphs: List of semantic paragraph StructureNodes
443
+
444
+ Returns:
445
+ PIL Image with paragraph overlays
446
+ """
447
+ doc = fitz.open(pdf_path)
448
+ page = doc[page_index]
449
+
450
+ # Render base image
451
+ pix = page.get_pixmap(dpi=dpi)
452
+ img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
453
+ draw = ImageDraw.Draw(img, 'RGBA')
454
+
455
+ # Extract blocks for bounding boxes
456
+ blocks = extract_blocks_spans(pdf_path, page_index)
457
+
458
+ # Scale factor from PDF points to pixels
459
+ scale = dpi / 72.0
460
+
461
+ def _rect_i(bbox):
462
+ """Convert PDF bbox to pixel coordinates."""
463
+ x0, y0, x1, y1 = bbox
464
+ return (int(x0 * scale), int(y0 * scale), int(x1 * scale), int(y1 * scale))
465
+
466
+ # Draw visual paragraphs (green = matched, red = unmatched)
467
+ # For simplicity, we'll draw all visual paragraphs in green with transparency
468
+ for para_blocks in visual_paragraphs:
469
+ # Calculate bounding box for entire paragraph
470
+ if not any(i < len(blocks) for i in para_blocks):
471
+ continue
472
+
473
+ min_x0 = min(blocks[i].bbox[0] for i in para_blocks if i < len(blocks))
474
+ min_y0 = min(blocks[i].bbox[1] for i in para_blocks if i < len(blocks))
475
+ max_x1 = max(blocks[i].bbox[2] for i in para_blocks if i < len(blocks))
476
+ max_y1 = max(blocks[i].bbox[3] for i in para_blocks if i < len(blocks))
477
+
478
+ r = _rect_i((min_x0, min_y0, max_x1, max_y1))
479
+
480
+ # Green with transparency for visual paragraphs
481
+ draw.rectangle(r, outline=(0, 255, 0, 255), width=3, fill=(0, 255, 0, 30))
482
+
483
+ # Draw semantic paragraph indicators (blue borders)
484
+ # Note: semantic_paragraphs don't have direct bboxes, so we'll just count them
485
+ # In a more complete implementation, we'd map MCIDs to blocks
486
+
487
+ doc.close()
488
+ return img
489
+
490
+
491
  # -----------------------------
492
+ # Heuristic "problems" report
493
  # -----------------------------
494
 
495
  def diagnose_page(doc: fitz.Document, page_index: int, struct: Dict[str, Any]) -> Dict[str, Any]:
 
1004
 
1005
  batch_json = gr.JSON(label="Full Batch Report", visible=False)
1006
 
1007
+ # Advanced Analysis Tab
1008
+ with gr.Tab("Advanced Analysis"):
1009
+ gr.Markdown("""
1010
+ # Advanced PDF Accessibility Analysis
1011
+
1012
+ Power-user features for deep PDF inspection and accessibility debugging.
1013
+ These tools help diagnose complex accessibility issues and understand internal PDF structure.
1014
+ """)
1015
+
1016
+ with gr.Accordion("1. Content Stream Inspector", open=False):
1017
+ gr.Markdown("""
1018
+ **Purpose**: Inspect raw PDF content stream operators for a specific block
1019
+
1020
+ Shows the low-level PDF commands that render text and graphics. Useful for debugging
1021
+ text extraction issues, font problems, and positioning.
1022
+ """)
1023
+
1024
+ cs_block_dropdown = gr.Dropdown(
1025
+ label="Select Block",
1026
+ choices=[],
1027
+ info="Choose a text or image block to inspect"
1028
+ )
1029
+ cs_inspect_btn = gr.Button("Extract Operators", variant="primary")
1030
+
1031
+ with gr.Tabs():
1032
+ with gr.Tab("Formatted"):
1033
+ cs_operator_display = gr.Markdown()
1034
+ with gr.Tab("Raw Stream"):
1035
+ cs_raw_stream = gr.Code(label="Raw PDF Content Stream")
1036
+
1037
+ with gr.Accordion("2. Screen Reader Simulator", open=False):
1038
+ gr.Markdown("""
1039
+ **Purpose**: Simulate how NVDA or JAWS would read this page
1040
+
1041
+ Generates a transcript showing what a screen reader user would hear, including
1042
+ element announcements and reading order. Works with both tagged and untagged PDFs.
1043
+ """)
1044
+
1045
+ with gr.Row():
1046
+ sr_reader = gr.Radio(
1047
+ ["NVDA", "JAWS"],
1048
+ value="NVDA",
1049
+ label="Screen Reader",
1050
+ info="Choose which screen reader to simulate"
1051
+ )
1052
+ sr_detail = gr.Radio(
1053
+ ["minimal", "default", "verbose"],
1054
+ value="default",
1055
+ label="Detail Level",
1056
+ info="How much context information to include"
1057
+ )
1058
+ sr_order = gr.Radio(
1059
+ ["raw", "tblr", "columns"],
1060
+ value="tblr",
1061
+ label="Reading Order (untagged fallback)",
1062
+ info="Used only if PDF has no structure tree"
1063
+ )
1064
+
1065
+ sr_btn = gr.Button("Generate Transcript", variant="primary")
1066
+
1067
+ with gr.Tabs():
1068
+ with gr.Tab("Transcript"):
1069
+ sr_transcript = gr.Textbox(
1070
+ lines=20,
1071
+ label="Screen Reader Output",
1072
+                        interactive=False
+                    )
+                with gr.Tab("Analysis"):
+                    sr_analysis = gr.Markdown()
+
+        with gr.Accordion("3. Paragraph Detection", open=False):
+            gr.Markdown("""
+            **Purpose**: Compare visual paragraphs vs semantic paragraph tags
+
+            Identifies paragraphs based on spacing (visual) and compares them to <P> tags
+            in the structure tree (semantic). Mismatches can cause confusion for screen reader users.
+            """)
+
+            para_threshold = gr.Slider(
+                label="Vertical Gap Threshold (points)",
+                minimum=5,
+                maximum=30,
+                value=15,
+                step=1,
+                info="Minimum vertical spacing to consider a paragraph break"
+            )
+
+            para_btn = gr.Button("Analyze Paragraphs", variant="primary")
+            para_overlay = gr.Image(label="Paragraph Visualization", type="pil")
+
+            with gr.Row():
+                para_visual = gr.Number(label="Visual Paragraphs", interactive=False)
+                para_semantic = gr.Number(label="Semantic <P> Tags", interactive=False)
+                para_score = gr.Number(label="Match Quality", interactive=False)
+
+            para_mismatches = gr.Markdown()
+
+        with gr.Accordion("4. Structure Tree Visualizer", open=False):
+            gr.Markdown("""
+            **Purpose**: Display the complete PDF tag hierarchy
+
+            Shows the entire structure tree for tagged PDFs, including tag types, alt text,
+            and page references. Only works for PDFs with accessibility tagging.
+            """)
+
+            struct_btn = gr.Button("Extract Structure Tree", variant="primary")
+
+            with gr.Tabs():
+                with gr.Tab("Tree Diagram"):
+                    struct_plot = gr.Plot(label="Interactive Hierarchy")
+                with gr.Tab("Text View"):
+                    struct_text = gr.Textbox(
+                        lines=30,
+                        label="Structure Tree",
+                        interactive=False
+                    )
+                with gr.Tab("Statistics"):
+                    struct_stats = gr.Markdown()
+
+        with gr.Accordion("5. Block-to-Tag Mapping", open=False):
+            gr.Markdown("""
+            **Purpose**: Link visual blocks to structure tree elements
+
+            Maps each visual block to its corresponding tag in the structure tree via
+            MCID (Marked Content ID) references. Shows which content is properly tagged.
+            """)
+
+            map_btn = gr.Button("Map Blocks to Tags", variant="primary")
+            map_message = gr.Markdown()
+            map_table = gr.DataFrame(
+                headers=["Block #", "Tag Type", "MCID", "Alt Text"],
+                label="Block-to-Tag Correlations",
+                interactive=False
+            )
+
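The gap-based detection behind the Paragraph Detection accordion can be sketched in a few lines. This is a simplified illustration, not the app's implementation: `detect_visual_paragraphs` and the `(y0, y1, text)` block tuples are invented here, while the real code works on PyMuPDF block objects.

```python
# Sketch of threshold-based visual paragraph detection.
# blocks: iterable of (y0, y1, text) tuples in page coordinates (points).
def detect_visual_paragraphs(blocks, gap_threshold=15.0):
    """Group blocks into paragraphs wherever the vertical gap >= gap_threshold."""
    paragraphs, current = [], []
    prev_y1 = None
    for y0, y1, text in sorted(blocks, key=lambda b: b[0]):
        # A gap at least as large as the threshold starts a new paragraph
        if prev_y1 is not None and (y0 - prev_y1) >= gap_threshold:
            paragraphs.append(current)
            current = []
        current.append(text)
        prev_y1 = y1
    if current:
        paragraphs.append(current)
    return paragraphs

blocks = [(0, 10, "a"), (12, 22, "b"), (40, 50, "c")]
print(detect_visual_paragraphs(blocks))  # -> [['a', 'b'], ['c']]
```

The slider's default of 15 points corresponds to `gap_threshold=15.0` here; raising it merges more blocks into a single visual paragraph.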
    def _on_upload(f):
        path, n, msg = load_pdf(f)
        return path, n, msg, gr.update(maximum=n, value=1)

        outputs=[batch_summary_md, batch_chart, batch_table, batch_json, batch_progress]
    )

+    # Advanced Analysis Callbacks
+
+    def update_block_dropdown(pdf_path_val, page_num_val):
+        """Update block dropdown when page changes."""
+        if not pdf_path_val:
+            return gr.update(choices=[], value=None)
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            choices = create_block_choices(blocks)
+            return gr.update(choices=choices, value=0 if choices else None)
+        except Exception:
+            return gr.update(choices=[], value=None)
+
+    def run_content_stream_inspector(pdf_path_val, page_num_val, block_idx):
+        """Run content stream analysis for the selected block."""
+        if not pdf_path_val or block_idx is None:
+            return "Please select a block", ""
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_content_stream(pdf_path_val, page_num_val - 1, block_idx, blocks)
+
+            if result.get('error'):
+                return result['message'], ""
+
+            return result['formatted'], result['raw']
+        except Exception as e:
+            return f"## Error\n\n{str(e)}", ""
+
+    def run_screen_reader_sim(pdf_path_val, page_num_val, reader, detail, order):
+        """Run screen reader simulation."""
+        if not pdf_path_val:
+            return "Please upload a PDF first", ""
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_screen_reader(pdf_path_val, page_num_val - 1, blocks, reader, detail, order)
+
+            if result.get('error'):
+                return result.get('message', 'Error'), ""
+
+            return result['transcript'], result['analysis']
+        except Exception as e:
+            return f"## Error\n\n{str(e)}", ""
+
+    def run_paragraph_detection(pdf_path_val, page_num_val, dpi_val, threshold):
+        """Run paragraph detection and comparison."""
+        if not pdf_path_val:
+            return None, 0, 0, 0.0, "Please upload a PDF first"
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_paragraphs(pdf_path_val, page_num_val - 1, blocks, threshold)
+
+            if result.get('error'):
+                return None, 0, 0, 0.0, result.get('message', 'Error')
+
+            # Create visualization overlay
+            overlay = render_paragraph_overlay(
+                pdf_path_val, page_num_val - 1, dpi_val,
+                result['visual_paragraphs'], result['semantic_paragraphs']
+            )
+
+            return (
+                overlay,
+                result['visual_count'],
+                result['semantic_count'],
+                result['match_score'],
+                result['mismatches']
+            )
+        except Exception as e:
+            return None, 0, 0, 0.0, f"## Error\n\n{str(e)}"
+
+    def run_structure_tree_extraction(pdf_path_val):
+        """Extract and visualize the structure tree."""
+        if not pdf_path_val:
+            return None, "Please upload a PDF first", ""
+
+        try:
+            result = analyze_structure_tree(pdf_path_val)
+
+            if result.get('error'):
+                return None, result['message'], ""
+
+            return result['plot_data'], result['text_view'], result['statistics']
+        except Exception as e:
+            return None, f"## Error\n\n{str(e)}", ""
+
+    def run_block_tag_mapping(pdf_path_val, page_num_val):
+        """Map blocks to structure tags."""
+        if not pdf_path_val:
+            return "Please upload a PDF first", []
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_block_tag_mapping(pdf_path_val, page_num_val - 1, blocks)
+
+            if result.get('error'):
+                return result.get('message', 'Error'), []
+
+            return result['message'], result['mappings']
+        except Exception as e:
+            return f"## Error\n\n{str(e)}", []
+
+    # Wire up Advanced Analysis callbacks
+    page_num.change(
+        update_block_dropdown,
+        inputs=[pdf_path, page_num],
+        outputs=[cs_block_dropdown]
+    )
+
+    cs_inspect_btn.click(
+        run_content_stream_inspector,
+        inputs=[pdf_path, page_num, cs_block_dropdown],
+        outputs=[cs_operator_display, cs_raw_stream]
+    )
+
+    sr_btn.click(
+        run_screen_reader_sim,
+        inputs=[pdf_path, page_num, sr_reader, sr_detail, sr_order],
+        outputs=[sr_transcript, sr_analysis]
+    )
+
+    para_btn.click(
+        run_paragraph_detection,
+        inputs=[pdf_path, page_num, dpi, para_threshold],
+        outputs=[para_overlay, para_visual, para_semantic, para_score, para_mismatches]
+    )
+
+    struct_btn.click(
+        run_structure_tree_extraction,
+        inputs=[pdf_path],
+        outputs=[struct_plot, struct_text, struct_stats]
+    )
+
+    map_btn.click(
+        run_block_tag_mapping,
+        inputs=[pdf_path, page_num],
+        outputs=[map_message, map_table]
+    )
+
 if __name__ == "__main__":
     demo.launch()

content_stream_parser.py ADDED
@@ -0,0 +1,322 @@
+"""
+Content Stream Parser Module
+
+Provides functionality for extracting and analyzing PDF content stream operators,
+correlating them with visual blocks.
+"""
+
+import re
+from typing import Dict, List, Optional, Any, Tuple
+import fitz  # PyMuPDF
+
+
+def extract_content_stream_for_block(
+    pdf_path: str,
+    page_index: int,
+    block_index: int,
+    blocks: List[Any]
+) -> Dict[str, Any]:
+    """
+    Extract content stream operators for a specific block.
+
+    Args:
+        pdf_path: Path to the PDF file
+        page_index: 0-based page index
+        block_index: Index of the block to analyze
+        blocks: List of BlockInfo objects from extract_blocks_spans
+
+    Returns:
+        Dictionary with operators, raw stream, and metadata
+    """
+    if block_index < 0 or block_index >= len(blocks):
+        return {
+            'error': 'Invalid block index',
+            'operators': [],
+            'raw_stream': ''
+        }
+
+    target_block = blocks[block_index]
+
+    try:
+        doc = fitz.open(pdf_path)
+        page = doc[page_index]
+
+        # Clean and consolidate content streams
+        page.clean_contents()
+
+        # Get the page's content stream xref (clean_contents merges them into one)
+        xref = page.get_contents()[0]
+
+        # Extract raw stream data
+        stream_data = doc.xref_stream(xref)
+        try:
+            raw_stream = stream_data.decode('latin-1')
+        except UnicodeDecodeError:
+            raw_stream = stream_data.decode('utf-8', errors='ignore')
+
+        # Parse text objects from the stream
+        text_objects = _parse_text_objects(raw_stream)
+
+        # Find the text object that matches our target block
+        matching_object = _find_matching_text_object(text_objects, target_block)
+
+        doc.close()
+
+        if matching_object:
+            return {
+                'operators': matching_object['operators'],
+                'raw_stream': raw_stream,
+                'matched': True,
+                'block_text': target_block.text[:100]
+            }
+        else:
+            return {
+                'operators': [],
+                'raw_stream': raw_stream,
+                'matched': False,
+                'block_text': target_block.text[:100],
+                'message': 'Could not find matching text object in content stream'
+            }
+
+    except Exception as e:
+        return {
+            'error': str(e),
+            'operators': [],
+            'raw_stream': ''
+        }
+
+
+def _parse_text_objects(content_stream: str) -> List[Dict[str, Any]]:
+    """
+    Parse text objects (BT...ET blocks) from a content stream.
+
+    Args:
+        content_stream: Raw PDF content stream text
+
+    Returns:
+        List of text objects with their operators
+    """
+    text_objects = []
+
+    # Find all BT...ET blocks
+    bt_et_pattern = r'BT\s+(.*?)\s+ET'
+    matches = re.finditer(bt_et_pattern, content_stream, re.DOTALL)
+
+    for match in matches:
+        text_block = match.group(1)
+        operators = _parse_operators(text_block)
+        text_objects.append({
+            'operators': operators,
+            'text': _extract_text_from_operators(operators)
+        })
+
+    return text_objects
+
+
+def _parse_operators(text_block: str) -> List[Dict[str, str]]:
+    """
+    Parse individual operators from a text block.
+
+    Args:
+        text_block: Text between BT and ET
+
+    Returns:
+        List of operator dictionaries with type and value
+    """
+    operators = []
+
+    # Text matrix (Tm)
+    tm_pattern = r'([\d.\-\s]+)\s+Tm'
+    for match in re.finditer(tm_pattern, text_block):
+        operators.append({
+            'type': 'Tm',
+            'value': match.group(1).strip(),
+            'description': 'Text Matrix'
+        })
+
+    # Font (Tf)
+    tf_pattern = r'/(\S+)\s+([\d.]+)\s+Tf'
+    for match in re.finditer(tf_pattern, text_block):
+        operators.append({
+            'type': 'Tf',
+            'value': f'/{match.group(1)} {match.group(2)}',
+            'description': f'Font: {match.group(1)}, Size: {match.group(2)}'
+        })
+
+    # Text positioning (Td, TD)
+    td_pattern = r'([\d.\-]+)\s+([\d.\-]+)\s+T[dD]'
+    for match in re.finditer(td_pattern, text_block):
+        operators.append({
+            'type': 'Td',
+            'value': f'{match.group(1)} {match.group(2)}',
+            'description': f'Move text position ({match.group(1)}, {match.group(2)})'
+        })
+
+    # Text showing (Tj)
+    tj_pattern = r'\((.*?)\)\s*Tj'
+    for match in re.finditer(tj_pattern, text_block):
+        text = match.group(1)
+        operators.append({
+            'type': 'Tj',
+            'value': f'({text})',
+            'description': f'Show text: {text[:50]}'
+        })
+
+    # Text showing (TJ - array)
+    tj_array_pattern = r'\[(.*?)\]\s*TJ'
+    for match in re.finditer(tj_array_pattern, text_block, re.DOTALL):
+        array_content = match.group(1)
+        operators.append({
+            'type': 'TJ',
+            'value': f'[{array_content[:100]}]',
+            'description': 'Show text array'
+        })
+
+    # Text leading (TL)
+    tl_pattern = r'([\d.\-]+)\s+TL'
+    for match in re.finditer(tl_pattern, text_block):
+        operators.append({
+            'type': 'TL',
+            'value': match.group(1),
+            'description': f'Text leading: {match.group(1)}'
+        })
+
+    # Color operators (rg, RG, g, G)
+    color_pattern = r'([\d.\s]+)\s+(rg|RG|g|G)'
+    for match in re.finditer(color_pattern, text_block):
+        operators.append({
+            'type': match.group(2),
+            'value': match.group(1).strip(),
+            'description': f'Color: {match.group(1).strip()}'
+        })
+
+    return operators
+
+
+def _extract_text_from_operators(operators: List[Dict[str, str]]) -> str:
+    """
+    Extract visible text from an operator list.
+
+    Args:
+        operators: List of operator dictionaries
+
+    Returns:
+        Concatenated text content
+    """
+    text_parts = []
+
+    for op in operators:
+        if op['type'] in ['Tj', 'TJ']:
+            # Extract text from parentheses or array
+            value = op['value']
+            # Simple extraction - just get content in parentheses
+            matches = re.findall(r'\((.*?)\)', value)
+            text_parts.extend(matches)
+
+    return ' '.join(text_parts)
+
+
+def _find_matching_text_object(
+    text_objects: List[Dict[str, Any]],
+    target_block: Any
+) -> Optional[Dict[str, Any]]:
+    """
+    Find the text object that best matches the target block.
+
+    Args:
+        text_objects: List of parsed text objects
+        target_block: BlockInfo object to match
+
+    Returns:
+        Matching text object or None
+    """
+    target_text = target_block.text.strip()
+    if not target_text:
+        return None
+
+    best_match = None
+    best_score = 0.0
+
+    for text_obj in text_objects:
+        obj_text = text_obj['text'].strip()
+        if not obj_text:
+            continue
+
+        # Simple similarity score: if either text contains the other,
+        # score by the ratio of the shorter length to the longer
+        if target_text in obj_text or obj_text in target_text:
+            score = min(len(target_text), len(obj_text)) / max(len(target_text), len(obj_text))
+            if score > best_score:
+                best_score = score
+                best_match = text_obj
+
+    # Only return a match if the score is reasonable
+    if best_score > 0.3:
+        return best_match
+
+    return None
+
+
+def format_operators_markdown(result: Dict[str, Any]) -> str:
+    """
+    Format operators as readable Markdown.
+
+    Args:
+        result: Result dictionary from extract_content_stream_for_block
+
+    Returns:
+        Formatted Markdown string
+    """
+    if 'error' in result:
+        return f"## Error\n\n{result['error']}"
+
+    lines = [
+        "## Content Stream Operators",
+        "",
+        f"**Block Text**: {result.get('block_text', 'N/A')}",
+        ""
+    ]
+
+    if not result.get('matched'):
+        lines.extend([
+            "⚠️ **Warning**: Could not find exact matching text object in content stream.",
+            "",
+            result.get('message', ''),
+            ""
+        ])
+
+    operators = result.get('operators', [])
+    if operators:
+        lines.extend([
+            "### Operators Found",
+            ""
+        ])
+
+        for i, op in enumerate(operators, 1):
+            lines.append(f"**{i}. {op['type']}**")
+            lines.append(f"   - Value: `{op['value']}`")
+            lines.append(f"   - {op['description']}")
+            lines.append("")
+    else:
+        lines.append("No operators found.")
+
+    return "\n".join(lines)
+
+
+def format_raw_stream(raw_stream: str, max_lines: int = 100) -> str:
+    """
+    Format a raw content stream for display.
+
+    Args:
+        raw_stream: Raw PDF content stream text
+        max_lines: Maximum number of lines to display
+
+    Returns:
+        Formatted string
+    """
+    lines = raw_stream.split('\n')
+    total_lines = len(lines)
+    if total_lines > max_lines:
+        lines = lines[:max_lines]
+        lines.append(f"\n... (truncated, {total_lines - max_lines} more lines)")
+
+    return '\n'.join(lines)
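The BT…ET parsing above can be exercised standalone on a synthetic stream. A minimal sketch using the same two regexes as `_parse_text_objects` and `_parse_operators` (the stream string is hand-written for illustration; real streams come from `doc.xref_stream`):

```python
import re

# A minimal, hand-written content stream: one text object with font,
# position, and a single text-showing (Tj) operator.
stream = "BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\nET"

# Same BT...ET extraction used by _parse_text_objects
text_blocks = re.findall(r"BT\s+(.*?)\s+ET", stream, re.DOTALL)

# Same Tj extraction used by _parse_operators
shown = []
for block in text_blocks:
    shown.extend(re.findall(r"\((.*?)\)\s*Tj", block))

print(shown)  # -> ['Hello']
```

Note the lazy `(.*?)` with `re.DOTALL` is what lets a text object span multiple lines; without `DOTALL` the dot would stop at the first newline.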
screen_reader_sim.py ADDED
@@ -0,0 +1,398 @@
+"""
+Screen Reader Simulator Module
+
+Simulates how NVDA and JAWS would read a PDF page, supporting both
+tagged (structure tree) and untagged (visual order fallback) PDFs.
+"""
+
+from typing import Dict, List, Any, Optional, Tuple
+import pikepdf
+from structure_tree import extract_structure_tree, StructureNode
+
+
+def simulate_screen_reader(
+    pdf_path: str,
+    page_index: int,
+    blocks: List[Any],
+    reader_type: str = "NVDA",
+    detail_level: str = "default",
+    order_mode: str = "tblr"
+) -> Dict[str, Any]:
+    """
+    Simulate screen reader output for a PDF page.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_index: 0-based page index
+        blocks: List of BlockInfo objects from extract_blocks_spans
+        reader_type: "NVDA" or "JAWS"
+        detail_level: "minimal", "default", or "verbose"
+        order_mode: Reading order mode for untagged fallback ("raw", "tblr", "columns")
+
+    Returns:
+        Dictionary with transcript, analysis, and metadata
+    """
+    # Try the tagged approach first
+    root = extract_structure_tree(pdf_path)
+
+    if root:
+        # Use the structure tree
+        transcript, analysis = _simulate_tagged(
+            root, page_index, reader_type, detail_level
+        )
+        mode = "tagged"
+    else:
+        # Fall back to visual order
+        transcript, analysis = _simulate_untagged(
+            blocks, reader_type, detail_level, order_mode
+        )
+        mode = "untagged"
+
+    return {
+        'transcript': transcript,
+        'analysis': analysis,
+        'mode': mode,
+        'reader_type': reader_type,
+        'detail_level': detail_level
+    }
+
+
+def _simulate_tagged(
+    root: StructureNode,
+    page_index: int,
+    reader_type: str,
+    detail_level: str
+) -> Tuple[str, str]:
+    """
+    Simulate a screen reader for a tagged PDF using the structure tree.
+
+    Args:
+        root: Root StructureNode
+        page_index: Page to simulate (0-based)
+        reader_type: "NVDA" or "JAWS"
+        detail_level: Detail level
+
+    Returns:
+        Tuple of (transcript, analysis)
+    """
+    # Collect structure elements for this page
+    page_elements = []
+
+    def _collect_page_elements(node: StructureNode):
+        # Include the node if it's for this page or has no page ref (document-level)
+        if node.page_ref is None or node.page_ref == page_index:
+            if node.tag_type not in ['StructTreeRoot', 'MCID']:
+                page_elements.append(node)
+
+        for child in node.children:
+            _collect_page_elements(child)
+
+    _collect_page_elements(root)
+
+    # Generate transcript
+    transcript_lines = []
+    element_count = 0
+
+    for element in page_elements:
+        announcement = _format_element_announcement(
+            element, reader_type, detail_level
+        )
+        if announcement:
+            transcript_lines.append(announcement)
+            element_count += 1
+
+    transcript = '\n\n'.join(transcript_lines)
+
+    # Generate analysis
+    analysis_lines = [
+        "## Screen Reader Analysis (Tagged Mode)",
+        "",
+        "**Structure**: This page uses PDF tagging (accessible structure tree)",
+        f"**Elements Found**: {element_count}",
+        ""
+    ]
+
+    # Count element types
+    tag_counts = {}
+    for element in page_elements:
+        tag_counts[element.tag_type] = tag_counts.get(element.tag_type, 0) + 1
+
+    if tag_counts:
+        analysis_lines.extend([
+            "### Element Types",
+            ""
+        ])
+        for tag, count in sorted(tag_counts.items()):
+            analysis_lines.append(f"- **{tag}**: {count}")
+
+    # Check alt text coverage
+    elements_needing_alt = [e for e in page_elements if e.tag_type in ['Figure', 'Formula', 'Artifact']]
+    elements_with_alt = [e for e in elements_needing_alt if e.alt_text]
+
+    if elements_needing_alt:
+        coverage = len(elements_with_alt) / len(elements_needing_alt) * 100
+        analysis_lines.extend([
+            "",
+            "### Alt Text Coverage",
+            "",
+            f"**Elements needing alt text**: {len(elements_needing_alt)}",
+            f"**Elements with alt text**: {len(elements_with_alt)}",
+            f"**Coverage**: {coverage:.1f}%",
+            ""
+        ])
+
+        if coverage < 100:
+            analysis_lines.append("⚠️ Some elements are missing alt text")
+
+    analysis = '\n'.join(analysis_lines)
+
+    return transcript, analysis
+
+
+def _simulate_untagged(
+    blocks: List[Any],
+    reader_type: str,
+    detail_level: str,
+    order_mode: str
+) -> Tuple[str, str]:
+    """
+    Simulate a screen reader for an untagged PDF using visual order.
+
+    Args:
+        blocks: List of BlockInfo objects
+        reader_type: "NVDA" or "JAWS"
+        detail_level: Detail level
+        order_mode: Reading order mode
+
+    Returns:
+        Tuple of (transcript, analysis)
+    """
+    from app import order_blocks  # Imported here to avoid a circular import
+
+    # Order blocks according to mode
+    ordered_blocks = order_blocks(blocks, order_mode)
+
+    # Generate transcript
+    transcript_lines = []
+    text_block_count = 0
+    image_block_count = 0
+
+    for block in ordered_blocks:
+        if block.block_type == 0:  # Text block
+            # Infer heading from font size
+            is_heading = False
+            heading_level = None
+
+            if block.spans:
+                avg_size = sum(s.size for s in block.spans) / len(block.spans)
+                if avg_size > 18:
+                    is_heading = True
+                    heading_level = 1
+                elif avg_size > 14:
+                    is_heading = True
+                    heading_level = 2
+
+            # Format announcement
+            if is_heading and detail_level != "minimal":
+                if reader_type == "NVDA":
+                    transcript_lines.append(f"Heading level {heading_level}")
+                    transcript_lines.append(block.text.strip())
+                else:  # JAWS
+                    transcript_lines.append(f"Heading {heading_level}: {block.text.strip()}")
+            else:
+                transcript_lines.append(block.text.strip())
+
+            text_block_count += 1
+
+        elif block.block_type == 1:  # Image block
+            if detail_level != "minimal":
+                transcript_lines.append("[Image - no alt text available]")
+            image_block_count += 1
+
+    transcript = '\n\n'.join(transcript_lines)
+
+    # Generate analysis
+    analysis_lines = [
+        "## Screen Reader Analysis (Untagged Mode)",
+        "",
+        "⚠️ **No Structure**: This page does not use PDF tagging",
+        "",
+        "Screen readers will read text in visual order with limited context.",
+        "",
+        f"**Reading Order Mode**: {order_mode}",
+        f"**Text Blocks**: {text_block_count}",
+        f"**Images**: {image_block_count}",
+        "",
+        "### Limitations",
+        "",
+        "- No semantic information (headings, lists, tables)",
+        "- No alt text for images",
+        "- Reading order may not match intended flow",
+        "- Navigation by elements not possible",
+        "",
+        "**Recommendation**: Add PDF tagging for better accessibility"
+    ]
+
+    analysis = '\n'.join(analysis_lines)
+
+    return transcript, analysis
+
+
+def _format_element_announcement(
+    element: StructureNode,
+    reader_type: str,
+    detail_level: str
+) -> Optional[str]:
+    """
+    Format a structure element as a screen reader announcement.
+
+    Args:
+        element: StructureNode to announce
+        reader_type: "NVDA" or "JAWS"
+        detail_level: "minimal", "default", or "verbose"
+
+    Returns:
+        Formatted announcement string or None
+    """
+    tag = element.tag_type
+    lines = []
+
+    # Map PDF tag types to screen reader announcements
+    if tag.startswith('H'):
+        # Heading
+        level = tag[1:] if len(tag) > 1 else '1'
+        text = element.actual_text or "[Heading]"
+
+        if detail_level == "minimal":
+            return text
+
+        if reader_type == "NVDA":
+            lines.append(f"Heading level {level}")
+            lines.append(text)
+        else:  # JAWS
+            lines.append(f"Heading {level}: {text}")
+
+    elif tag == 'P':
+        # Paragraph
+        text = element.actual_text or "[Paragraph]"
+
+        if detail_level == "minimal":
+            return text
+
+        if detail_level == "verbose":
+            if reader_type == "NVDA":
+                lines.append("Paragraph")
+            lines.append(text)
+            if reader_type == "NVDA":
+                lines.append("Out of paragraph")
+        else:
+            lines.append(text)
+
+    elif tag == 'Figure':
+        # Figure/Image
+        alt_text = element.alt_text or "[Image - no alt text]"
+
+        if detail_level == "minimal":
+            return None
+
+        if reader_type == "NVDA":
+            lines.append("Graphic")
+            lines.append(alt_text)
+        else:  # JAWS
+            lines.append(f"Graphic: {alt_text}")
+
+    elif tag == 'Formula':
+        # Math formula
+        alt_text = element.alt_text or element.actual_text or "[Formula]"
+
+        if detail_level == "minimal":
+            return alt_text
+
+        if reader_type == "NVDA":
+            lines.append("Formula")
+            lines.append(alt_text)
+        else:  # JAWS
+            lines.append(f"Formula: {alt_text}")
+
+    elif tag in ['L', 'LI']:
+        # List/List item
+        text = element.actual_text or "[List item]"
+
+        if detail_level == "minimal":
+            return text
+
+        if tag == 'L' and detail_level == "verbose":
+            lines.append("List start")
+        else:
+            if reader_type == "NVDA":
+                lines.append("List item")
+                lines.append(text)
+            else:  # JAWS
+                lines.append(f"Bullet: {text}")
+
+    elif tag == 'Table':
+        # Table
+        if detail_level != "minimal":
+            if reader_type == "NVDA":
+                lines.append("Table")
+            else:  # JAWS
+                lines.append("Table start")
+
+    elif tag in ['TR', 'TD', 'TH']:
+        # Table row/cell
+        text = element.actual_text or ""
+        if text and detail_level != "minimal":
+            lines.append(text)
+
+    elif tag == 'Link':
+        # Link
+        text = element.actual_text or "[Link]"
+
+        if detail_level == "minimal":
+            return text
+
+        if reader_type == "NVDA":
+            lines.append("Link")
+            lines.append(text)
+        else:  # JAWS
+            lines.append(f"Link: {text}")
+
+    elif tag == 'Span':
+        # Inline text
+        text = element.actual_text or ""
+        if text:
+            return text
+
+    elif tag in ['Document', 'Part', 'Sect', 'Div', 'Art']:
+        # Container elements - usually not announced
+        return None
+
+    else:
+        # Unknown tag type
+        if element.actual_text:
+            return element.actual_text
+
+    if lines:
+        return '\n'.join(lines)
+
+    return None
+
+
+def format_transcript(result: Dict[str, Any]) -> str:
+    """
+    Format a screen reader transcript for display.
+
+    Args:
+        result: Result from simulate_screen_reader
+
+    Returns:
+        Formatted transcript string
+    """
+    header = f"# {result['reader_type']} Transcript ({result['detail_level']} detail)\n\n"
+
+    if result['mode'] == 'untagged':
+        header += "⚠️ Simulated from visual order (PDF not tagged)\n\n"
+
+    header += "---\n\n"
+
+    return header + result['transcript']
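The font-size heuristic that the untagged fallback uses to guess headings can be isolated for testing. A sketch under the same thresholds as `_simulate_untagged` (the helper name `infer_heading_level` is invented here; the real code works inline on block spans):

```python
def infer_heading_level(span_sizes):
    """Return 1, 2, or None from a block's span sizes in points.

    Mirrors the untagged fallback: average size > 18pt -> H1, > 14pt -> H2.
    """
    if not span_sizes:
        return None
    avg = sum(span_sizes) / len(span_sizes)
    if avg > 18:
        return 1
    if avg > 14:
        return 2
    return None

print(infer_heading_level([24.0]))        # -> 1
print(infer_heading_level([15.0, 16.0]))  # -> 2
print(infer_heading_level([11.0]))        # -> None
```

This is a coarse heuristic: a large drop cap or an oversized pull quote will be misread as a heading, which is exactly why the tagged path is preferred when a structure tree exists.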
structure_tree.py ADDED
@@ -0,0 +1,493 @@
+ """
2
+ Structure Tree Analysis Module
3
+
4
+ Provides functionality for extracting and analyzing PDF structure trees,
5
+ including paragraph detection and block-to-tag mapping.
6
+ """
7
+
8
+ from dataclasses import dataclass, field
9
+ from typing import List, Optional, Dict, Any, Tuple
10
+ import pikepdf
11
+ import statistics
12
+ from collections import Counter
13
+
14
+
15
+ @dataclass
16
+ class StructureNode:
17
+ """Represents a node in the PDF structure tree."""
18
+ tag_type: str # P, H1, Document, etc.
19
+ depth: int
20
+ mcid: Optional[int] = None
21
+ alt_text: Optional[str] = None
22
+ actual_text: Optional[str] = None
23
+ page_ref: Optional[int] = None
24
+ children: List['StructureNode'] = field(default_factory=list)
25
+
26
+ def to_dict(self) -> Dict[str, Any]:
27
+ """Convert to dictionary for JSON serialization."""
28
+ return {
29
+ 'tag_type': self.tag_type,
30
+ 'depth': self.depth,
31
+ 'mcid': self.mcid,
32
+ 'alt_text': self.alt_text,
33
+ 'actual_text': self.actual_text,
34
+ 'page_ref': self.page_ref,
35
+ 'children': [child.to_dict() for child in self.children]
36
+ }
37
+
38
+
39
+ def extract_structure_tree(pdf_path: str) -> Optional[StructureNode]:
40
+ """
41
+ Extract the complete structure tree from a PDF.
42
+
43
+ Args:
44
+ pdf_path: Path to the PDF file
45
+
46
+ Returns:
47
+ Root StructureNode or None if no structure tree exists
48
+ """
49
+ try:
50
+ with pikepdf.open(pdf_path) as pdf:
51
+ if '/StructTreeRoot' not in pdf.Root:
52
+ return None
53
+
54
+ struct_root = pdf.Root.StructTreeRoot
55
+
56
+ # Create root node
57
+ root_node = StructureNode(
58
+ tag_type="StructTreeRoot",
59
+ depth=0
60
+ )
61
+
62
+ # Recursively parse the tree
63
+ if '/K' in struct_root:
64
+ _parse_structure_element(struct_root.K, root_node, 1, pdf)
65
+
66
+ return root_node
67
+
68
+ except Exception as e:
69
+ print(f"Error extracting structure tree: {e}")
70
+ return None
71
+
72
+
73
+ def _parse_structure_element(element, parent_node: StructureNode, depth: int, pdf: pikepdf.Pdf, max_depth: int = 20):
74
+ """
75
+ Recursively parse a structure element and its children.
76
+
77
+ Args:
78
+ element: pikepdf object representing the element
79
+ parent_node: Parent StructureNode to attach children to
80
+ depth: Current depth in the tree
81
+ pdf: pikepdf.Pdf object for resolving references
82
+ max_depth: Maximum recursion depth to prevent infinite loops
83
+ """
84
+ if depth > max_depth:
85
+ return
86
+
87
+ # Handle arrays of elements
88
+ if isinstance(element, pikepdf.Array):
89
+ for item in element:
90
+ _parse_structure_element(item, parent_node, depth, pdf, max_depth)
91
+ return
92
+
93
+ # Handle MCID (Marked Content ID) - leaf node
94
+ if isinstance(element, int):
95
+ node = StructureNode(
96
+ tag_type="MCID",
97
+ depth=depth,
98
+ mcid=element
99
+ )
100
+ parent_node.children.append(node)
101
+ return
102
+
103
+ # Handle dictionary (structure element)
104
+ if isinstance(element, pikepdf.Dictionary):
105
+ # Extract tag type
106
+ tag_type = str(element.get('/S', 'Unknown'))
107
+ if tag_type.startswith('/'):
108
+ tag_type = tag_type[1:] # Remove leading slash
109
+
110
+ # Extract attributes
111
+ mcid = None
112
+ if '/MCID' in element:
113
+ mcid = int(element.MCID)
114
+
+        alt_text = None
+        if '/Alt' in element:
+            try:
+                alt_text = str(element.Alt)
+            except Exception:
+                pass
+
+        actual_text = None
+        if '/ActualText' in element:
+            try:
+                actual_text = str(element.ActualText)
+            except Exception:
+                pass
+
+        page_ref = None
+        if '/Pg' in element:
+            try:
+                # Find the page number
+                page_obj = element.Pg
+                for i, page in enumerate(pdf.pages):
+                    if page.obj == page_obj:
+                        page_ref = i
+                        break
+            except Exception:
+                pass
+
141
+ # Create node
142
+ node = StructureNode(
143
+ tag_type=tag_type,
144
+ depth=depth,
145
+ mcid=mcid,
146
+ alt_text=alt_text,
147
+ actual_text=actual_text,
148
+ page_ref=page_ref
149
+ )
150
+ parent_node.children.append(node)
151
+
152
+ # Recursively process children
153
+ if '/K' in element:
154
+ _parse_structure_element(element.K, node, depth + 1, pdf, max_depth)
155
+
156
+
157
+def format_tree_text(root: StructureNode, max_nodes: int = 500) -> str:
+    """
+    Format structure tree as indented text with box-drawing characters.
+
+    Args:
+        root: Root StructureNode
+        max_nodes: Maximum number of nodes to display
+
+    Returns:
+        Formatted text representation
+    """
+    lines = []
+    node_count = [0]  # Use a list so the nested function can mutate it
+
+    def _format_node(node: StructureNode, prefix: str = "", is_last: bool = True):
+        if node_count[0] >= max_nodes:
+            if node_count[0] == max_nodes:
+                lines.append(f"{prefix}... (truncated, tree too large)")
+                node_count[0] += 1
+            return
+
+        # Format node info
+        info = node.tag_type
+        if node.mcid is not None:
+            info += f" [MCID: {node.mcid}]"
+        if node.alt_text:
+            alt = node.alt_text if len(node.alt_text) <= 30 else node.alt_text[:30] + "..."
+            info += f" (Alt: {alt})"
+        if node.actual_text:
+            text = node.actual_text if len(node.actual_text) <= 30 else node.actual_text[:30] + "..."
+            info += f" (Text: {text})"
+        if node.page_ref is not None:
+            info += f" [Page {node.page_ref + 1}]"
+
+        # Add line with appropriate prefix
+        if node.depth == 0:
+            lines.append(info)
+        else:
+            connector = "└── " if is_last else "├── "
+            lines.append(f"{prefix}{connector}{info}")
+
+        node_count[0] += 1
+
+        # Process children
+        if node.children:
+            extension = "    " if is_last else "│   "
+            new_prefix = prefix + extension if node.depth > 0 else ""
+
+            for i, child in enumerate(node.children):
+                is_last_child = (i == len(node.children) - 1)
+                _format_node(child, new_prefix, is_last_child)
+
+    _format_node(root)
+    return "\n".join(lines)
+
+
+def get_tree_statistics(root: StructureNode) -> Dict[str, Any]:
+    """
+    Calculate statistics about the structure tree.
+
+    Args:
+        root: Root StructureNode
+
+    Returns:
+        Dictionary of statistics
+    """
+    node_count = 0
+    max_depth = 0
+    tag_counts = Counter()
+    pages_with_structure = set()
+    nodes_with_alt_text = 0
+    nodes_with_actual_text = 0
+    mcid_count = 0
+
+    def _traverse(node: StructureNode):
+        nonlocal node_count, max_depth, nodes_with_alt_text, nodes_with_actual_text, mcid_count
+
+        node_count += 1
+        max_depth = max(max_depth, node.depth)
+        tag_counts[node.tag_type] += 1
+
+        if node.page_ref is not None:
+            pages_with_structure.add(node.page_ref)
+        if node.alt_text:
+            nodes_with_alt_text += 1
+        if node.actual_text:
+            nodes_with_actual_text += 1
+        if node.mcid is not None:
+            mcid_count += 1
+
+        for child in node.children:
+            _traverse(child)
+
+    _traverse(root)
+
+    return {
+        'total_nodes': node_count,
+        'max_depth': max_depth,
+        'tag_type_counts': dict(tag_counts.most_common()),
+        'pages_with_structure': sorted(pages_with_structure),
+        'nodes_with_alt_text': nodes_with_alt_text,
+        'nodes_with_actual_text': nodes_with_actual_text,
+        'mcid_count': mcid_count
+    }
+
+
+def format_statistics_markdown(stats: Dict[str, Any]) -> str:
+    """Format tree statistics as Markdown."""
+    lines = [
+        "## Structure Tree Statistics",
+        "",
+        f"**Total Nodes**: {stats['total_nodes']}",
+        f"**Maximum Depth**: {stats['max_depth']}",
+        f"**Nodes with Alt Text**: {stats['nodes_with_alt_text']}",
+        f"**Nodes with Actual Text**: {stats['nodes_with_actual_text']}",
+        f"**MCID References**: {stats['mcid_count']}",
+        "",
+        "### Tag Type Distribution",
+        ""
+    ]
+
+    for tag, count in stats['tag_type_counts'].items():
+        lines.append(f"- **{tag}**: {count}")
+
+    lines.extend([
+        "",
+        f"**Pages with Structure**: {len(stats['pages_with_structure'])}"
+    ])
+
+    if stats['pages_with_structure']:
+        page_list = ", ".join(str(p + 1) for p in stats['pages_with_structure'][:20])
+        if len(stats['pages_with_structure']) > 20:
+            page_list += f" ... ({len(stats['pages_with_structure']) - 20} more)"
+        lines.append(f"({page_list})")
+
+    return "\n".join(lines)
+
+
+def extract_mcid_for_page(pdf_path: str, page_index: int) -> List[int]:
+    """
+    Extract all MCIDs (Marked Content IDs) from a page's content stream.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_index: 0-based page index
+
+    Returns:
+        List of MCIDs found in the page
+    """
+    import re
+
+    try:
+        with pikepdf.open(pdf_path) as pdf:
+            page = pdf.pages[page_index]
+
+            # Get the content stream
+            if '/Contents' not in page:
+                return []
+
+            contents = page.Contents
+            if isinstance(contents, pikepdf.Array):
+                # Multiple content streams: concatenate them
+                stream_data = b"".join(bytes(stream.read_bytes()) for stream in contents)
+            else:
+                stream_data = bytes(contents.read_bytes())
+
+            # Decode the content stream. Latin-1 maps every byte value, so this
+            # cannot fail, and PDF operator names are ASCII anyway.
+            content_text = stream_data.decode('latin-1')
+
+            # Extract MCIDs using regex
+            # Pattern: /MCID <number>, as in "/MCID 3 BDC" or "<< /MCID 3 >> BDC"
+            mcid_pattern = r'/MCID\s+(\d+)'
+            matches = re.findall(mcid_pattern, content_text)
+
+            return [int(m) for m in matches]
+
+    except Exception as e:
+        print(f"Error extracting MCIDs: {e}")
+        return []
+
+
+def map_blocks_to_tags(pdf_path: str, page_index: int, blocks) -> List[Dict[str, Any]]:
+    """
+    Map visual blocks to structure tree tags via MCIDs.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_index: 0-based page index
+        blocks: List of BlockInfo objects from extract_blocks_spans
+
+    Returns:
+        List of mappings with block index, tag info, MCID, alt text
+    """
+    # Extract structure tree
+    root = extract_structure_tree(pdf_path)
+    if not root:
+        return []
+
+    # Get MCIDs from page
+    page_mcids = extract_mcid_for_page(pdf_path, page_index)
+
+    # Build MCID to structure node mapping
+    mcid_to_node = {}
+
+    def _find_mcids(node: StructureNode):
+        if node.mcid is not None and (node.page_ref is None or node.page_ref == page_index):
+            mcid_to_node[node.mcid] = node
+        for child in node.children:
+            _find_mcids(child)
+
+    _find_mcids(root)
+
+    # Create mappings. Pairing the i-th MCID with the i-th block is a
+    # positional heuristic: it assumes blocks and marked-content sequences
+    # occur in the same order, which is common but not guaranteed.
+    mappings = []
+    for i, mcid in enumerate(page_mcids):
+        if i < len(blocks) and mcid in mcid_to_node:
+            node = mcid_to_node[mcid]
+            mappings.append({
+                'block_index': i,
+                'tag_type': node.tag_type,
+                'mcid': mcid,
+                'alt_text': node.alt_text or "",
+                'actual_text': node.actual_text or ""
+            })
+
+    return mappings
+
+
+def detect_visual_paragraphs(blocks, vertical_gap_threshold: float = 15.0) -> List[List[int]]:
+    """
+    Detect visual paragraphs based on spacing heuristics.
+
+    Args:
+        blocks: List of BlockInfo objects
+        vertical_gap_threshold: Minimum vertical gap to consider a paragraph break
+
+    Returns:
+        List of paragraph groups, where each group is a list of block indices
+    """
+    # Filter to non-empty text blocks
+    text_blocks = [(i, b) for i, b in enumerate(blocks) if b.block_type == 0 and b.text.strip()]
+    if not text_blocks:
+        return []
+
+    # Sort by vertical position (top to bottom)
+    text_blocks.sort(key=lambda x: x[1].bbox[1])
+
+    paragraphs = []
+    current_paragraph = [text_blocks[0][0]]
+    prev_bbox = text_blocks[0][1].bbox
+
+    for idx, block in text_blocks[1:]:
+        bbox = block.bbox
+
+        # Calculate vertical gap to the previous block
+        vertical_gap = bbox[1] - prev_bbox[3]
+
+        # Check if blocks are roughly aligned horizontally (same column)
+        horizontal_overlap = min(bbox[2], prev_bbox[2]) - max(bbox[0], prev_bbox[0])
+
+        if vertical_gap < vertical_gap_threshold and horizontal_overlap > 0:
+            # Same paragraph
+            current_paragraph.append(idx)
+        else:
+            # New paragraph
+            paragraphs.append(current_paragraph)
+            current_paragraph = [idx]
+
+        prev_bbox = bbox
+
+    # Add last paragraph
+    if current_paragraph:
+        paragraphs.append(current_paragraph)
+
+    return paragraphs
+
+
+def detect_semantic_paragraphs(pdf_path: str, page_index: int) -> List[StructureNode]:
+    """
+    Extract semantic paragraph tags (<P>) from the structure tree.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_index: 0-based page index
+
+    Returns:
+        List of StructureNode objects with tag_type='P' for the page
+    """
+    root = extract_structure_tree(pdf_path)
+    if not root:
+        return []
+
+    paragraphs = []
+
+    def _find_paragraphs(node: StructureNode):
+        if node.tag_type == 'P' and (node.page_ref is None or node.page_ref == page_index):
+            paragraphs.append(node)
+        for child in node.children:
+            _find_paragraphs(child)
+
+    _find_paragraphs(root)
+    return paragraphs
+
+
+def compare_paragraphs(visual_paragraphs: List[List[int]], semantic_paragraphs: List[StructureNode]) -> Dict[str, Any]:
+    """
+    Compare visual and semantic paragraph detection.
+
+    Args:
+        visual_paragraphs: List of visual paragraph groups (block indices)
+        semantic_paragraphs: List of semantic <P> tags
+
+    Returns:
+        Dictionary with comparison statistics
+    """
+    visual_count = len(visual_paragraphs)
+    semantic_count = len(semantic_paragraphs)
+
+    # Calculate match quality score (simple heuristic)
+    if visual_count == 0 and semantic_count == 0:
+        match_score = 1.0
+    elif visual_count == 0 or semantic_count == 0:
+        match_score = 0.0
+    else:
+        # Score based on count similarity
+        match_score = min(visual_count, semantic_count) / max(visual_count, semantic_count)
+
+    return {
+        'visual_count': visual_count,
+        'semantic_count': semantic_count,
+        'match_score': match_score,
+        'count_mismatch': abs(visual_count - semantic_count)
+    }
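The scoring branch in `compare_paragraphs` reduces to a small pure function; a minimal sketch (the `match_score` name is illustrative, not part of the module):

```python
# Standalone sketch of the count-similarity score used by compare_paragraphs.

def match_score(visual_count, semantic_count):
    if visual_count == 0 and semantic_count == 0:
        return 1.0   # nothing to detect on either side: trivially consistent
    if visual_count == 0 or semantic_count == 0:
        return 0.0   # one side found paragraphs, the other none
    return min(visual_count, semantic_count) / max(visual_count, semantic_count)

print(match_score(8, 8))  # 1.0 -> tagging matches the visual layout
print(match_score(8, 4))  # 0.5 -> half the visual paragraphs are tagged
print(match_score(8, 0))  # 0.0 -> untagged document
```

Note this compares only counts, not positions, so a document with eight visual paragraphs and eight misordered `<P>` tags still scores 1.0; it is a coarse tagging-coverage signal rather than an alignment check.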