advanced features
- CLAUDE.md +111 -1
- advanced_analysis.py +430 -0
- app.py +353 -1
- content_stream_parser.py +322 -0
- screen_reader_sim.py +398 -0
- structure_tree.py +493 -0
CLAUDE.md
CHANGED
@@ -107,6 +107,43 @@ The application has two main modes: **Single Page Analysis** and **Batch Analysis**

- `format_batch_results_table()`: Color-coded HTML table per page
- `format_batch_results_chart()`: Plotly bar chart of issue distribution

### Advanced Analysis Modules

The application includes specialized modules for advanced PDF accessibility analysis:

**advanced_analysis.py** - Coordinator module
- Provides facade functions with error handling
- `require_structure_tree` decorator: checks for tagged PDFs before execution
- `safe_execute` decorator: comprehensive error handling with user-friendly messages
- Exports high-level functions: `analyze_content_stream`, `analyze_screen_reader`, etc.

**content_stream_parser.py** - PDF operator extraction
- `extract_content_stream_for_block()`: Gets operators for a specific block
- `_parse_text_objects()`: Extracts BT...ET blocks from the content stream
- `_parse_operators()`: Regex-based parsing of Tm, Tf, Tj, TJ, Td, and color operators
- `_find_matching_text_object()`: Correlates text objects with BlockInfo via text matching
- Returns formatted markdown and raw stream text

**screen_reader_sim.py** - Accessibility simulation
- `simulate_screen_reader()`: Main simulation function
- `_simulate_tagged()`: Follows the structure tree for tagged PDFs
- `_simulate_untagged()`: Falls back to visual order for untagged PDFs
- `_format_element_announcement()`: Generates NVDA/JAWS-style announcements
- Supports heading levels, paragraphs, figures, formulas, lists, tables, links
- Infers headings from font size (>18pt = H1, >14pt = H2) for untagged PDFs

**structure_tree.py** - Structure tree analysis
- `StructureNode` dataclass: represents the PDF tag hierarchy
- `extract_structure_tree()`: Recursively parses StructTreeRoot with pikepdf
- `_parse_structure_element()`: Handles Dictionary, Array, and MCID elements
- `format_tree_text()`: Creates an indented text view with box-drawing characters
- `get_tree_statistics()`: Counts nodes, tags, alt text coverage
- `extract_mcid_for_page()`: Finds marked content IDs in the page content stream
- `map_blocks_to_tags()`: Correlates visual blocks with structure elements
- `detect_visual_paragraphs()`: Spacing-based paragraph detection
- `detect_semantic_paragraphs()`: Extracts <P> tags for a page
- `compare_paragraphs()`: Calculates match quality between visual and semantic paragraphs

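The BT...ET extraction listed above for content_stream_parser.py can be sketched roughly as follows. The function and variable names here are illustrative, not the module's actual API; the real parser also handles TJ arrays, positioning, and color operators.

```python
import re
from typing import Dict, List

# One PDF text object is everything between BT and ET
TEXT_OBJECT_RE = re.compile(rb"BT(.*?)ET", re.DOTALL)
# Literal string drawn by the Tj operator, e.g. (Hello) Tj
TJ_RE = re.compile(rb"\((.*?)\)\s*Tj")

def parse_text_objects(stream: bytes) -> List[Dict]:
    """Return one dict per BT...ET block with the strings it draws."""
    objects = []
    for match in TEXT_OBJECT_RE.finditer(stream):
        body = match.group(1)
        strings = [s.decode("latin-1") for s in TJ_RE.findall(body)]
        objects.append({"raw": body, "strings": strings})
    return objects

# Minimal synthetic content stream with two text objects
stream = b"BT /F1 12 Tf 72 700 Td (Hello) Tj ET BT (World) Tj ET"
objs = parse_text_objects(stream)
```

This is why the docs call the approach "approximate for complex PDFs": regexes cannot see through escaped parentheses, hex strings, or compressed object streams.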
### Key Data Structures

**Single Page Analysis**:

@@ -122,13 +159,24 @@

- `critical_pages`: Pages with 3+ issues
- `to_dict()`: Method to convert to JSON-serializable format

**Advanced Analysis**:
- `StructureNode`: Represents a node in the PDF structure tree with:
  - `tag_type`: Tag name (P, H1, Document, Figure, etc.)
  - `depth`: Nesting level in the tree
  - `mcid`: Marked Content ID (links to page content)
  - `alt_text`: Alternative text for accessibility
  - `actual_text`: Actual text content or replacement text
  - `page_ref`: 0-based page index
  - `children`: List of child StructureNode objects
  - `to_dict()`: Convert to JSON-serializable format

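The fields above map naturally onto a recursive dataclass; a minimal sketch of the shape (the shipped structure_tree.py may carry extra helpers) is:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class StructureNode:
    tag_type: str                      # e.g. "P", "H1", "Document", "Figure"
    depth: int = 0                     # nesting level in the tree
    mcid: Optional[int] = None         # Marked Content ID linking to page content
    alt_text: Optional[str] = None     # alternative text for accessibility
    actual_text: Optional[str] = None  # replacement text, if any
    page_ref: Optional[int] = None     # 0-based page index
    children: List["StructureNode"] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        """Recursively convert to a JSON-serializable dict."""
        return {
            "tag_type": self.tag_type,
            "depth": self.depth,
            "mcid": self.mcid,
            "alt_text": self.alt_text,
            "actual_text": self.actual_text,
            "page_ref": self.page_ref,
            "children": [c.to_dict() for c in self.children],
        }

root = StructureNode("Document")
root.children.append(StructureNode("H1", depth=1, page_ref=0))
d = root.to_dict()
```

Because `to_dict()` recurses over `children`, the whole tree serializes in one call, which is what the JSON export paths rely on.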
**UI State**:
- The app maintains state through Gradio components (pdf_path, page_count stored in hidden/non-interactive UI elements)
- Background color cache: `_bg_color_cache` dict keyed by (document_path, page_index)

### Gradio UI Flow

The UI is organized into three tabs: **Single Page Analysis**, **Batch Analysis**, and **Advanced Analysis**.

#### Single Page Tab
1. User uploads PDF → `_on_upload` → extracts path and page count

@@ -151,6 +199,51 @@ The UI is organized into two tabs: **Single Page Analysis** and **Batch Analysis**

- Color-coded HTML table of per-page results
- Full JSON report

#### Advanced Analysis Tab

Power-user features for deep PDF inspection and accessibility debugging. Each feature is in its own accordion:

1. **Content Stream Inspector**:
   - Extracts raw PDF content stream operators for a specific block
   - Shows low-level commands: text positioning (Tm, Td), fonts (Tf), text display (Tj, TJ)
   - Useful for debugging text extraction, font issues, and positioning problems
   - Provides both formatted view and raw stream
   - Uses regex parsing of content streams (approximate for complex PDFs)

2. **Screen Reader Simulator**:
   - Simulates NVDA or JAWS reading behavior for the current page
   - Two modes:
     - **Tagged PDFs**: Follows the structure tree, announces headings/paragraphs/figures with proper semantics
     - **Untagged PDFs**: Falls back to visual reading order, infers headings from font size
   - Three detail levels: minimal (text only), default (element announcements), verbose (full context)
   - Generates transcript + analysis with alt text coverage statistics
   - Reading order configurable for untagged fallback (raw/tblr/columns)

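The untagged-fallback heuristic mentioned above (>18pt = H1, >14pt = H2) reduces to a small threshold function. Names here are illustrative; the real logic lives inside screen_reader_sim.py:

```python
from typing import Optional

def infer_heading_level(font_size: float) -> Optional[str]:
    """Map a block's dominant font size to an inferred heading tag."""
    if font_size > 18:
        return "H1"
    if font_size > 14:
        return "H2"
    return None  # body text

def announce(text: str, font_size: float) -> str:
    """NVDA/JAWS-style announcement for one block in the untagged fallback."""
    level = infer_heading_level(font_size)
    if level:
        return f"heading level {level[1]}, {text}"
    return text
```

In a tagged PDF the level comes from the actual H1-H6 tag instead, which is why tagging is always preferred over this size-based guess.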
3. **Paragraph Detection**:
   - Compares visual paragraphs (detected by spacing) vs semantic <P> tags
   - Visual detection: groups blocks with vertical gap < threshold (default 15pt)
   - Semantic detection: extracts <P> tags from the structure tree
   - Generates a color-coded overlay (green = visual paragraphs)
   - Reports a match quality score and mismatches
   - Requires a tagged PDF for semantic comparison

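The spacing-based grouping can be sketched as follows: consecutive blocks whose vertical gap stays below the threshold join the same visual paragraph. Blocks here are plain (x0, y0, x1, y1) bboxes sorted top-to-bottom; the real `detect_visual_paragraphs()` works on BlockInfo objects, so treat this as a sketch of the idea only.

```python
def group_visual_paragraphs(bboxes, gap_threshold=15.0):
    """Group block indices into visual paragraphs by vertical spacing."""
    paragraphs = []
    current = []
    prev_bottom = None
    for i, (x0, y0, x1, y1) in enumerate(bboxes):
        # A gap at or above the threshold starts a new paragraph
        if prev_bottom is not None and (y0 - prev_bottom) >= gap_threshold:
            paragraphs.append(current)
            current = []
        current.append(i)
        prev_bottom = y1
    if current:
        paragraphs.append(current)
    return paragraphs

# Two tightly spaced lines, a 30pt gap, then one more line
boxes = [(72, 100, 500, 112), (72, 116, 500, 128), (72, 158, 500, 170)]
paras = group_visual_paragraphs(boxes)
```

The comparison step then simply counts these groups against the number of <P> tags found for the page.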
4. **Structure Tree Visualizer**:
   - Extracts the complete PDF tag hierarchy from StructTreeRoot
   - Three visualization formats:
     - **Tree Diagram**: Interactive Plotly sunburst chart
     - **Text View**: Indented text with box-drawing characters
     - **Statistics**: Node counts, tag distribution, alt text coverage
   - Shows tag types (H1-H6, P, Figure, Table, L, LI, etc.)
   - Displays alt text, actual text, page references, and MCID markers
   - Only works for tagged PDFs

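The box-drawing text view can be sketched like this. Nodes are (tag, children) tuples for brevity; the real `format_tree_text()` walks StructureNode objects and also prints alt text, MCIDs, and page references.

```python
def format_tree(node, prefix="", lines=None):
    """Render a (tag, children) tree as an indented box-drawing listing."""
    tag, children = node
    if lines is None:
        lines = [tag]  # root line carries no branch glyph
    for i, child in enumerate(children):
        last = i == len(children) - 1
        lines.append(prefix + ("└── " if last else "├── ") + child[0])
        # Children of a non-last sibling stay under a vertical rule
        format_tree(child, prefix + ("    " if last else "│   "), lines)
    return lines

tree = ("Document", [("H1", []), ("Sect", [("P", []), ("Figure", [])])])
out = format_tree(tree)
```

Each recursion level extends the prefix, so deeply nested tags stay visually aligned under their parents.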
5. **Block-to-Tag Mapping**:
   - Maps visual blocks to structure tree elements via MCID (Marked Content ID)
   - Shows which blocks have proper semantic tagging
   - DataFrame output with block index, tag type, MCID, alt text
   - Helps identify untagged content
   - Requires a tagged PDF with MCID references

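The correlation behind `map_blocks_to_tags()` amounts to a join on MCID: IDs found in the page's content stream are looked up in an mcid-to-tag index built from the structure tree. A sketch with synthetic data (the real function extracts both sides from the PDF):

```python
def map_blocks(block_mcids, mcid_to_tag):
    """block_mcids[i] is the MCID drawn by block i, or None if unmarked."""
    rows = []
    for i, mcid in enumerate(block_mcids):
        if mcid is None:
            tag = "(untagged)"  # block has no marked-content wrapper
        else:
            tag = mcid_to_tag.get(mcid, "(untagged)")
        rows.append({"block_index": i, "mcid": mcid, "tag_type": tag})
    return rows

mcid_to_tag = {0: "H1", 1: "P"}  # built from the structure tree
rows = map_blocks([0, 1, None], mcid_to_tag)
```

Rows whose tag comes back "(untagged)" are exactly the content a screen reader following the structure tree would skip.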
#### Help & Documentation
- All UI controls have `info` parameters with inline tooltips
- Expandable "📖 Understanding the Diagnostics" accordion with detailed explanations

@@ -233,6 +326,23 @@ Multi-page document analysis with aggregate statistics:

- Plotly charts via `gr.Plot()` for interactive visualizations
- All batch results have `.to_dict()` method for JSON export

### Advanced Analysis Error Handling
- **Graceful Degradation**: All advanced features check for requirements before execution
- **Structure Tree Required**: Features 4 and 5 require tagged PDFs (the screen reader and paragraph comparison degrade gracefully without one)
  - `@require_structure_tree` decorator checks for StructTreeRoot
  - Returns a user-friendly error message if it is not found
  - Explains what tagging is and why it's needed
- **Safe Execution**: All features wrapped in the `@safe_execute` decorator
  - Catches all exceptions with traceback
  - Returns formatted error messages instead of crashing
- **Content Stream Parsing**: Regex-based, may fail on complex/malformed PDFs
  - Returns "not matched" status if the text object is not found
  - Shows the raw stream even if parsing fails
- **MCID Extraction**: May fail if the content stream uses non-standard encoding
  - Returns an empty list on failure
  - Block-to-tag mapping shows a "No mappings found" message
- **Performance Limits**: Structure tree extraction has max_depth=20 to prevent infinite loops

## Testing

### Manual Testing Checklist
advanced_analysis.py
ADDED
@@ -0,0 +1,430 @@
"""
Advanced Analysis Coordinator Module

Provides high-level facade functions for advanced PDF accessibility features,
with error handling and graceful degradation.
"""

from typing import Dict, List, Any, Optional, Callable
from functools import wraps
import pikepdf
import traceback

# Import feature modules
from content_stream_parser import (
    extract_content_stream_for_block,
    format_operators_markdown,
    format_raw_stream
)
from screen_reader_sim import (
    simulate_screen_reader,
    format_transcript
)
from structure_tree import (
    extract_structure_tree,
    format_tree_text,
    get_tree_statistics,
    format_statistics_markdown,
    map_blocks_to_tags,
    detect_visual_paragraphs,
    detect_semantic_paragraphs,
    compare_paragraphs
)


def require_structure_tree(func: Callable) -> Callable:
    """
    Decorator to check for a structure tree before executing the function.

    Functions decorated with this return an error message if the PDF
    does not have a tagged structure tree.
    """
    @wraps(func)
    def wrapper(pdf_path: str, *args, **kwargs):
        try:
            with pikepdf.open(pdf_path) as pdf:
                if '/StructTreeRoot' not in pdf.Root:
                    return {
                        'error': True,
                        'message': '## No Structure Tree Found\n\n'
                                   'This PDF does not have a tagged structure tree. '
                                   'This feature requires a tagged PDF.\n\n'
                                   '**What this means**: The PDF was not created with '
                                   'accessibility tagging, so semantic structure information '
                                   '(headings, paragraphs, alt text) is not available.\n\n'
                                   '**Recommendation**: Use authoring tools that support '
                                   'PDF/UA tagging (Adobe Acrobat, MS Word with "Save as Tagged PDF").'
                    }
        except Exception as e:
            return {
                'error': True,
                'message': f'## Error\n\nCould not open PDF: {str(e)}'
            }

        return func(pdf_path, *args, **kwargs)

    return wrapper


def safe_execute(func: Callable) -> Callable:
    """
    Decorator for safe execution with comprehensive error handling.

    Catches all exceptions and returns user-friendly error messages.
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            error_trace = traceback.format_exc()
            return {
                'error': True,
                'message': f'## Error\n\n{str(e)}\n\n**Details**:\n```\n{error_trace}\n```'
            }

    return wrapper


# Feature 1: Content Stream Inspector

@safe_execute
def analyze_content_stream(
    pdf_path: str,
    page_index: int,
    block_index: int,
    blocks: List[Any]
) -> Dict[str, Any]:
    """
    Analyze content stream operators for a specific block.

    Args:
        pdf_path: Path to PDF file
        page_index: 0-based page index
        block_index: Index of block to analyze
        blocks: List of BlockInfo objects

    Returns:
        Dictionary with formatted operators and raw stream
    """
    result = extract_content_stream_for_block(pdf_path, page_index, block_index, blocks)

    if 'error' in result:
        return {
            'error': True,
            'message': f"## Error\n\n{result['error']}"
        }

    return {
        'error': False,
        'formatted': format_operators_markdown(result),
        'raw': format_raw_stream(result.get('raw_stream', '')),
        'matched': result.get('matched', False)
    }


# Feature 2: Screen Reader Simulator

@safe_execute
def analyze_screen_reader(
    pdf_path: str,
    page_index: int,
    blocks: List[Any],
    reader_type: str = "NVDA",
    detail_level: str = "default",
    order_mode: str = "tblr"
) -> Dict[str, Any]:
    """
    Simulate screen reader output for a page.

    Args:
        pdf_path: Path to PDF file
        page_index: 0-based page index
        blocks: List of BlockInfo objects
        reader_type: "NVDA" or "JAWS"
        detail_level: "minimal", "default", or "verbose"
        order_mode: Reading order for untagged fallback

    Returns:
        Dictionary with transcript and analysis
    """
    result = simulate_screen_reader(
        pdf_path, page_index, blocks, reader_type, detail_level, order_mode
    )

    return {
        'error': False,
        'transcript': format_transcript(result),
        'analysis': result['analysis'],
        'mode': result['mode']
    }


# Feature 3: Paragraph Detection

@safe_execute
def analyze_paragraphs(
    pdf_path: str,
    page_index: int,
    blocks: List[Any],
    vertical_gap_threshold: float = 15.0
) -> Dict[str, Any]:
    """
    Compare visual and semantic paragraph detection.

    Args:
        pdf_path: Path to PDF file
        page_index: 0-based page index
        blocks: List of BlockInfo objects
        vertical_gap_threshold: Spacing threshold for visual paragraphs

    Returns:
        Dictionary with comparison results
    """
    # Detect visual paragraphs
    visual_paragraphs = detect_visual_paragraphs(blocks, vertical_gap_threshold)

    # Detect semantic paragraphs
    semantic_paragraphs = detect_semantic_paragraphs(pdf_path, page_index)

    # Compare
    comparison = compare_paragraphs(visual_paragraphs, semantic_paragraphs)

    # Format mismatches
    mismatch_lines = [
        "## Paragraph Comparison",
        "",
        f"**Visual Paragraphs Detected**: {comparison['visual_count']}",
        f"**Semantic <P> Tags Found**: {comparison['semantic_count']}",
        f"**Match Quality Score**: {comparison['match_score']:.2%}",
        ""
    ]

    if comparison['count_mismatch'] == 0:
        mismatch_lines.append("✓ Count matches between visual and semantic paragraphs")
    else:
        mismatch_lines.append(f"⚠️ Count mismatch: {comparison['count_mismatch']} difference")

    if comparison['visual_count'] > comparison['semantic_count']:
        mismatch_lines.extend([
            "",
            "**Issue**: More visual paragraphs than semantic tags",
            "- Some paragraphs may be missing <P> tags",
            "- Screen readers may not announce paragraph boundaries properly"
        ])
    elif comparison['semantic_count'] > comparison['visual_count']:
        mismatch_lines.extend([
            "",
            "**Issue**: More semantic tags than visual paragraphs",
            "- Tags may not correspond to actual visual layout",
            "- May cause confusion for users comparing visual and audio presentation"
        ])

    # Note: semantic_paragraphs is a list, so test emptiness rather than == 0
    if not semantic_paragraphs and visual_paragraphs:
        mismatch_lines.extend([
            "",
            "❌ **No semantic tagging found**",
            "This page has no <P> tags. Screen readers will not announce paragraphs."
        ])

    return {
        'error': False,
        'visual_count': comparison['visual_count'],
        'semantic_count': comparison['semantic_count'],
        'match_score': comparison['match_score'],
        'mismatches': '\n'.join(mismatch_lines),
        'visual_paragraphs': visual_paragraphs,
        'semantic_paragraphs': semantic_paragraphs
    }


# Feature 4: Structure Tree Visualizer

@require_structure_tree
@safe_execute
def analyze_structure_tree(pdf_path: str) -> Dict[str, Any]:
    """
    Extract and visualize the PDF structure tree.

    Args:
        pdf_path: Path to PDF file

    Returns:
        Dictionary with tree visualization and statistics
    """
    root = extract_structure_tree(pdf_path)

    if not root:
        return {
            'error': True,
            'message': '## Error\n\nCould not extract structure tree'
        }

    # Generate text view
    text_view = format_tree_text(root, max_nodes=500)

    # Generate statistics
    stats = get_tree_statistics(root)
    stats_markdown = format_statistics_markdown(stats)

    # Generate plotly diagram
    plot_data = _create_tree_plot(root)

    return {
        'error': False,
        'text_view': text_view,
        'statistics': stats_markdown,
        'plot_data': plot_data,
        'stats': stats
    }


def _create_tree_plot(root):
    """
    Create Plotly sunburst diagram data from the structure tree.

    Args:
        root: Root StructureNode

    Returns:
        Plotly figure
    """
    import plotly.graph_objects as go

    labels = []
    parents = []
    values = []
    colors = []

    # Color map for common tag types
    color_map = {
        'Document': '#1f77b4',
        'Part': '#ff7f0e',
        'Sect': '#2ca02c',
        'H1': '#d62728',
        'H2': '#9467bd',
        'H3': '#8c564b',
        'H4': '#e377c2',
        'H5': '#7f7f7f',
        'H6': '#bcbd22',
        'P': '#17becf',
        'Figure': '#ff9896',
        'Table': '#c5b0d5',
        'L': '#c49c94',
        'LI': '#f7b6d2',
        'Link': '#c7c7c7',
    }

    def _traverse(node, parent_label=None):
        # Create unique label
        if node.depth == 0:
            label = node.tag_type
        else:
            label = f"{node.tag_type}_{len(labels)}"

        labels.append(label)
        parents.append(parent_label if parent_label else "")
        values.append(1)

        # Assign color
        base_tag = node.tag_type.split('_')[0]
        color = color_map.get(base_tag, '#d3d3d3')
        colors.append(color)

        # Process children
        for child in node.children:
            _traverse(child, label)

    _traverse(root)

    fig = go.Figure(go.Sunburst(
        labels=labels,
        parents=parents,
        values=values,
        marker=dict(colors=colors),
        # With every node valued 1, "remainder" keeps parent/child sums
        # consistent ("total" would require parents >= sum of children)
        branchvalues="remainder"
    ))

    fig.update_layout(
        title="PDF Structure Tree Hierarchy",
        height=600,
        margin=dict(t=50, l=0, r=0, b=0)
    )

    return fig


# Feature 5: Block-to-Tag Mapping

@require_structure_tree
@safe_execute
def analyze_block_tag_mapping(
    pdf_path: str,
    page_index: int,
    blocks: List[Any]
) -> Dict[str, Any]:
    """
    Map visual blocks to structure tree tags.

    Args:
        pdf_path: Path to PDF file
        page_index: 0-based page index
        blocks: List of BlockInfo objects

    Returns:
        Dictionary with mapping table
    """
    mappings = map_blocks_to_tags(pdf_path, page_index, blocks)

    if not mappings:
        return {
            'error': False,
            'mappings': [],
            'message': '## No Mappings Found\n\n'
                       'Could not find block-to-tag correlations for this page. '
                       'This may occur if:\n'
                       '- The page has no marked content IDs (MCIDs)\n'
                       '- The structure tree is not properly linked to content\n'
                       '- The page uses a non-standard tagging approach'
        }

    # Format as table data
    table_data = []
    for m in mappings:
        table_data.append([
            str(m['block_index']),
            m['tag_type'],
            str(m['mcid']),
            m['alt_text'][:50] if m['alt_text'] else ""
        ])

    return {
        'error': False,
        'mappings': table_data,
        'count': len(mappings),
        'message': f'## Block-to-Tag Mapping\n\nFound {len(mappings)} correlations'
    }


# Utility function for creating block dropdown choices

def create_block_choices(blocks: List[Any]) -> List[tuple]:
    """
    Create dropdown choices from blocks for the UI.

    Args:
        blocks: List of BlockInfo objects

    Returns:
        List of (label, value) tuples
    """
    choices = []
    for i, block in enumerate(blocks):
        text_preview = block.text[:50].replace('\n', ' ').strip()
        if len(block.text) > 50:
            text_preview += "..."

        label = f"Block {i}: {text_preview}" if text_preview else f"Block {i} [Image]"
        choices.append((label, i))

    return choices
app.py
CHANGED
|
@@ -13,6 +13,16 @@ import pymupdf as fitz # PyMuPDF
|
|
| 13 |
import pikepdf
|
| 14 |
from PIL import Image, ImageDraw, ImageFont
|
| 15 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
# -----------------------------
|
| 17 |
# Color Palettes for Adaptive Contrast
|
| 18 |
# -----------------------------
|
|
@@ -413,8 +423,73 @@ def render_page_with_overlay(
|
|
| 413 |
|
| 414 |
return img
|
| 415 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 416 |
# -----------------------------
|
| 417 |
-
# Heuristic
|
| 418 |
# -----------------------------
|
| 419 |
|
| 420 |
def diagnose_page(doc: fitz.Document, page_index: int, struct: Dict[str, Any]) -> Dict[str, Any]:
|
|
@@ -929,6 +1004,141 @@ Upload a PDF and inspect:
|
|
| 929 |
|
| 930 |
batch_json = gr.JSON(label="Full Batch Report", visible=False)
|
| 931 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 932 |
def _on_upload(f):
|
| 933 |
path, n, msg = load_pdf(f)
|
| 934 |
return path, n, msg, gr.update(maximum=n, value=1)
|
|
@@ -947,6 +1157,148 @@ Upload a PDF and inspect:
|
|
| 947 |
outputs=[batch_summary_md, batch_chart, batch_table, batch_json, batch_progress]
|
| 948 |
)
|
| 949 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
if __name__ == "__main__":
    demo.launch()

import pikepdf
from PIL import Image, ImageDraw, ImageFont

+# Advanced analysis modules
+from advanced_analysis import (
+    analyze_content_stream,
+    analyze_screen_reader,
+    analyze_paragraphs,
+    analyze_structure_tree,
+    analyze_block_tag_mapping,
+    create_block_choices
+)
+
# -----------------------------
# Color Palettes for Adaptive Contrast
# -----------------------------

    return img

+
+def render_paragraph_overlay(
+    pdf_path: str,
+    page_index: int,
+    dpi: int,
+    visual_paragraphs: List[List[int]],
+    semantic_paragraphs: List[Any]
+) -> Image.Image:
+    """
+    Render the page with color-coded paragraph visualizations.
+
+    Args:
+        pdf_path: Path to PDF file
+        page_index: 0-based page index
+        dpi: Rendering DPI
+        visual_paragraphs: List of visual paragraph groups (block indices)
+        semantic_paragraphs: List of semantic paragraph StructureNodes
+
+    Returns:
+        PIL Image with paragraph overlays
+    """
+    doc = fitz.open(pdf_path)
+    page = doc[page_index]
+
+    # Render the base image
+    pix = page.get_pixmap(dpi=dpi)
+    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
+    draw = ImageDraw.Draw(img, 'RGBA')
+
+    # Extract blocks for bounding boxes
+    blocks = extract_blocks_spans(pdf_path, page_index)
+
+    # Scale factor from PDF points to pixels
+    scale = dpi / 72.0
+
+    def _rect_i(bbox):
+        """Convert a PDF bbox to pixel coordinates."""
+        x0, y0, x1, y1 = bbox
+        return (int(x0 * scale), int(y0 * scale), int(x1 * scale), int(y1 * scale))
+
+    # Draw visual paragraphs (green = matched, red = unmatched).
+    # For simplicity, draw all visual paragraphs in green with transparency.
+    for para_blocks in visual_paragraphs:
+        # Calculate the bounding box for the entire paragraph
+        if not para_blocks:
+            continue
+
+        min_x0 = min(blocks[i].bbox[0] for i in para_blocks if i < len(blocks))
+        min_y0 = min(blocks[i].bbox[1] for i in para_blocks if i < len(blocks))
+        max_x1 = max(blocks[i].bbox[2] for i in para_blocks if i < len(blocks))
+        max_y1 = max(blocks[i].bbox[3] for i in para_blocks if i < len(blocks))
+
+        r = _rect_i((min_x0, min_y0, max_x1, max_y1))
+
+        # Green with transparency for visual paragraphs
+        draw.rectangle(r, outline=(0, 255, 0, 255), width=3, fill=(0, 255, 0, 30))
+
+    # Draw semantic paragraph indicators (blue borders).
+    # Note: semantic_paragraphs don't carry bboxes directly, so we only count them here;
+    # a more complete implementation would map MCIDs to blocks.
+
+    doc.close()
+    return img
+
+
# -----------------------------
+# Heuristic "problems" report
# -----------------------------

def diagnose_page(doc: fitz.Document, page_index: int, struct: Dict[str, Any]) -> Dict[str, Any]:
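The overlay math above reduces to two steps: union the member block boxes, then scale PDF points (72 per inch) to pixels by `dpi / 72`. A standalone sketch of just that arithmetic (`rect_to_pixels` and `union_bbox` are illustrative helper names, not part of the app):

```python
def rect_to_pixels(bbox, dpi):
    """Scale an (x0, y0, x1, y1) box in PDF points to integer pixel coordinates."""
    scale = dpi / 72.0
    x0, y0, x1, y1 = bbox
    return (int(x0 * scale), int(y0 * scale), int(x1 * scale), int(y1 * scale))


def union_bbox(bboxes):
    """Smallest box covering every box in the list."""
    return (
        min(b[0] for b in bboxes),
        min(b[1] for b in bboxes),
        max(b[2] for b in bboxes),
        max(b[3] for b in bboxes),
    )


# Two stacked text blocks forming one visual paragraph:
blocks = [(72, 72, 144, 90), (72, 92, 200, 110)]
para = union_bbox(blocks)           # → (72, 72, 200, 110)
pixels = rect_to_pixels(para, 144)  # at 144 DPI the scale is exactly 2.0 → (144, 144, 400, 220)
```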

        batch_json = gr.JSON(label="Full Batch Report", visible=False)

+    # Advanced Analysis Tab
+    with gr.Tab("Advanced Analysis"):
+        gr.Markdown("""
+        # Advanced PDF Accessibility Analysis
+
+        Power-user features for deep PDF inspection and accessibility debugging.
+        These tools help diagnose complex accessibility issues and understand internal PDF structure.
+        """)
+
+        with gr.Accordion("1. Content Stream Inspector", open=False):
+            gr.Markdown("""
+            **Purpose**: Inspect raw PDF content stream operators for a specific block
+
+            Shows the low-level PDF commands that render text and graphics. Useful for debugging
+            text extraction issues, font problems, and positioning.
+            """)
+
+            cs_block_dropdown = gr.Dropdown(
+                label="Select Block",
+                choices=[],
+                info="Choose a text or image block to inspect"
+            )
+            cs_inspect_btn = gr.Button("Extract Operators", variant="primary")
+
+            with gr.Tabs():
+                with gr.Tab("Formatted"):
+                    cs_operator_display = gr.Markdown()
+                with gr.Tab("Raw Stream"):
+                    cs_raw_stream = gr.Code(label="Raw PDF Content Stream")
+
+        with gr.Accordion("2. Screen Reader Simulator", open=False):
+            gr.Markdown("""
+            **Purpose**: Simulate how NVDA or JAWS would read this page
+
+            Generates a transcript showing what a screen reader user would hear, including
+            element announcements and reading order. Works with both tagged and untagged PDFs.
+            """)
+
+            with gr.Row():
+                sr_reader = gr.Radio(
+                    ["NVDA", "JAWS"],
+                    value="NVDA",
+                    label="Screen Reader",
+                    info="Choose which screen reader to simulate"
+                )
+                sr_detail = gr.Radio(
+                    ["minimal", "default", "verbose"],
+                    value="default",
+                    label="Detail Level",
+                    info="How much context information to include"
+                )
+                sr_order = gr.Radio(
+                    ["raw", "tblr", "columns"],
+                    value="tblr",
+                    label="Reading Order (untagged fallback)",
+                    info="Used only if the PDF has no structure tree"
+                )
+
+            sr_btn = gr.Button("Generate Transcript", variant="primary")
+
+            with gr.Tabs():
+                with gr.Tab("Transcript"):
+                    sr_transcript = gr.Textbox(
+                        lines=20,
+                        label="Screen Reader Output",
+                        interactive=False
+                    )
+                with gr.Tab("Analysis"):
+                    sr_analysis = gr.Markdown()
+
+        with gr.Accordion("3. Paragraph Detection", open=False):
+            gr.Markdown("""
+            **Purpose**: Compare visual paragraphs vs semantic paragraph tags
+
+            Identifies paragraphs based on spacing (visual) and compares them to <P> tags
+            in the structure tree (semantic). Mismatches can cause confusion for screen reader users.
+            """)
+
+            para_threshold = gr.Slider(
+                label="Vertical Gap Threshold (points)",
+                minimum=5,
+                maximum=30,
+                value=15,
+                step=1,
+                info="Minimum vertical spacing to consider a paragraph break"
+            )
+
+            para_btn = gr.Button("Analyze Paragraphs", variant="primary")
+            para_overlay = gr.Image(label="Paragraph Visualization", type="pil")
+
+            with gr.Row():
+                para_visual = gr.Number(label="Visual Paragraphs", interactive=False)
+                para_semantic = gr.Number(label="Semantic <P> Tags", interactive=False)
+                para_score = gr.Number(label="Match Quality", interactive=False)
+
+            para_mismatches = gr.Markdown()
+
+        with gr.Accordion("4. Structure Tree Visualizer", open=False):
+            gr.Markdown("""
+            **Purpose**: Display the complete PDF tag hierarchy
+
+            Shows the entire structure tree for tagged PDFs, including tag types, alt text,
+            and page references. Only works for PDFs with accessibility tagging.
+            """)
+
+            struct_btn = gr.Button("Extract Structure Tree", variant="primary")
+
+            with gr.Tabs():
+                with gr.Tab("Tree Diagram"):
+                    struct_plot = gr.Plot(label="Interactive Hierarchy")
+                with gr.Tab("Text View"):
+                    struct_text = gr.Textbox(
+                        lines=30,
+                        label="Structure Tree",
+                        interactive=False
+                    )
+                with gr.Tab("Statistics"):
+                    struct_stats = gr.Markdown()
+
+        with gr.Accordion("5. Block-to-Tag Mapping", open=False):
+            gr.Markdown("""
+            **Purpose**: Link visual blocks to structure tree elements
+
+            Maps each visual block to its corresponding tag in the structure tree via
+            MCID (Marked Content ID) references. Shows which content is properly tagged.
+            """)
+
+            map_btn = gr.Button("Map Blocks to Tags", variant="primary")
+            map_message = gr.Markdown()
+            map_table = gr.DataFrame(
+                headers=["Block #", "Tag Type", "MCID", "Alt Text"],
+                label="Block-to-Tag Correlations",
+                interactive=False
+            )
+
    def _on_upload(f):
        path, n, msg = load_pdf(f)
        return path, n, msg, gr.update(maximum=n, value=1)

        outputs=[batch_summary_md, batch_chart, batch_table, batch_json, batch_progress]
    )

+    # Advanced Analysis Callbacks
+
+    def update_block_dropdown(pdf_path_val, page_num_val):
+        """Update the block dropdown when the page changes."""
+        if not pdf_path_val:
+            return gr.update(choices=[], value=None)
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            choices = create_block_choices(blocks)
+            return gr.update(choices=choices, value=0 if choices else None)
+        except Exception:
+            return gr.update(choices=[], value=None)
+
+    def run_content_stream_inspector(pdf_path_val, page_num_val, block_idx):
+        """Run content stream analysis for the selected block."""
+        if not pdf_path_val or block_idx is None:
+            return "Please select a block", ""
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_content_stream(pdf_path_val, page_num_val - 1, block_idx, blocks)
+
+            if result.get('error'):
+                return result['message'], ""
+
+            return result['formatted'], result['raw']
+        except Exception as e:
+            return f"## Error\n\n{str(e)}", ""
+
+    def run_screen_reader_sim(pdf_path_val, page_num_val, reader, detail, order):
+        """Run the screen reader simulation."""
+        if not pdf_path_val:
+            return "Please upload a PDF first", ""
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_screen_reader(pdf_path_val, page_num_val - 1, blocks, reader, detail, order)
+
+            if result.get('error'):
+                return result.get('message', 'Error'), ""
+
+            return result['transcript'], result['analysis']
+        except Exception as e:
+            return f"## Error\n\n{str(e)}", ""
+
+    def run_paragraph_detection(pdf_path_val, page_num_val, dpi_val, threshold):
+        """Run paragraph detection and comparison."""
+        if not pdf_path_val:
+            return None, 0, 0, 0.0, "Please upload a PDF first"
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_paragraphs(pdf_path_val, page_num_val - 1, blocks, threshold)
+
+            if result.get('error'):
+                return None, 0, 0, 0.0, result.get('message', 'Error')
+
+            # Create the visualization overlay
+            overlay = render_paragraph_overlay(
+                pdf_path_val, page_num_val - 1, dpi_val,
+                result['visual_paragraphs'], result['semantic_paragraphs']
+            )
+
+            return (
+                overlay,
+                result['visual_count'],
+                result['semantic_count'],
+                result['match_score'],
+                result['mismatches']
+            )
+        except Exception as e:
+            return None, 0, 0, 0.0, f"## Error\n\n{str(e)}"
+
+    def run_structure_tree_extraction(pdf_path_val):
+        """Extract and visualize the structure tree."""
+        if not pdf_path_val:
+            return None, "Please upload a PDF first", ""
+
+        try:
+            result = analyze_structure_tree(pdf_path_val)
+
+            if result.get('error'):
+                return None, result['message'], ""
+
+            return result['plot_data'], result['text_view'], result['statistics']
+        except Exception as e:
+            return None, f"## Error\n\n{str(e)}", ""
+
+    def run_block_tag_mapping(pdf_path_val, page_num_val):
+        """Map blocks to structure tags."""
+        if not pdf_path_val:
+            return "Please upload a PDF first", []
+
+        try:
+            blocks = extract_blocks_spans(pdf_path_val, page_num_val - 1)
+            result = analyze_block_tag_mapping(pdf_path_val, page_num_val - 1, blocks)
+
+            if result.get('error'):
+                return result.get('message', 'Error'), []
+
+            return result['message'], result['mappings']
+        except Exception as e:
+            return f"## Error\n\n{str(e)}", []
+
+    # Wire up the Advanced Analysis callbacks
+    page_num.change(
+        update_block_dropdown,
+        inputs=[pdf_path, page_num],
+        outputs=[cs_block_dropdown]
+    )
+
+    cs_inspect_btn.click(
+        run_content_stream_inspector,
+        inputs=[pdf_path, page_num, cs_block_dropdown],
+        outputs=[cs_operator_display, cs_raw_stream]
+    )
+
+    sr_btn.click(
+        run_screen_reader_sim,
+        inputs=[pdf_path, page_num, sr_reader, sr_detail, sr_order],
+        outputs=[sr_transcript, sr_analysis]
+    )
+
+    para_btn.click(
+        run_paragraph_detection,
+        inputs=[pdf_path, page_num, dpi, para_threshold],
+        outputs=[para_overlay, para_visual, para_semantic, para_score, para_mismatches]
+    )
+
+    struct_btn.click(
+        run_structure_tree_extraction,
+        inputs=[pdf_path],
+        outputs=[struct_plot, struct_text, struct_stats]
+    )
+
+    map_btn.click(
+        run_block_tag_mapping,
+        inputs=[pdf_path, page_num],
+        outputs=[map_message, map_table]
+    )
+
if __name__ == "__main__":
    demo.launch()
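Every callback above converts the UI's 1-based page slider value to PyMuPDF's 0-based page index with `page_num_val - 1`. A standalone sketch of that contract, with clamping added for safety (`to_page_index` is an illustrative helper name, not part of the app):

```python
def to_page_index(page_num: int, page_count: int) -> int:
    """Convert a 1-based UI page number to a 0-based index, clamped into range."""
    return max(0, min(page_count - 1, page_num - 1))


# First and last pages of a 10-page document map to indices 0 and 9;
# out-of-range input is clamped rather than raising:
assert to_page_index(1, 10) == 0
assert to_page_index(10, 10) == 9
assert to_page_index(99, 10) == 9
```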
content_stream_parser.py ADDED
@@ -0,0 +1,322 @@
"""
Content Stream Parser Module

Provides functionality for extracting and analyzing PDF content stream operators,
correlating them with visual blocks.
"""

import re
from typing import Dict, List, Optional, Any, Tuple
import fitz  # PyMuPDF


def extract_content_stream_for_block(
    pdf_path: str,
    page_index: int,
    block_index: int,
    blocks: List[Any]
) -> Dict[str, Any]:
    """
    Extract content stream operators for a specific block.

    Args:
        pdf_path: Path to the PDF file
        page_index: 0-based page index
        block_index: Index of the block to analyze
        blocks: List of BlockInfo objects from extract_blocks_spans

    Returns:
        Dictionary with operators, raw stream, and metadata
    """
    if block_index < 0 or block_index >= len(blocks):
        return {
            'error': 'Invalid block index',
            'operators': [],
            'raw_stream': ''
        }

    target_block = blocks[block_index]

    try:
        doc = fitz.open(pdf_path)
        page = doc[page_index]

        # Clean and consolidate content streams
        page.clean_contents()

        # Get the page's content stream xref (after clean_contents there is one)
        xref = page.get_contents()[0]

        # Extract raw stream data
        stream_data = doc.xref_stream(xref)
        try:
            raw_stream = stream_data.decode('latin-1')
        except Exception:
            raw_stream = stream_data.decode('utf-8', errors='ignore')

        # Parse text objects from the stream
        text_objects = _parse_text_objects(raw_stream)

        # Find the text object that matches our target block
        matching_object = _find_matching_text_object(text_objects, target_block)

        doc.close()

        if matching_object:
            return {
                'operators': matching_object['operators'],
                'raw_stream': raw_stream,
                'matched': True,
                'block_text': target_block.text[:100]
            }
        else:
            return {
                'operators': [],
                'raw_stream': raw_stream,
                'matched': False,
                'block_text': target_block.text[:100],
                'message': 'Could not find matching text object in content stream'
            }

    except Exception as e:
        return {
            'error': str(e),
            'operators': [],
            'raw_stream': ''
        }


def _parse_text_objects(content_stream: str) -> List[Dict[str, Any]]:
    """
    Parse text objects (BT...ET blocks) from a content stream.

    Args:
        content_stream: Raw PDF content stream text

    Returns:
        List of text objects with their operators
    """
    text_objects = []

    # Find all BT...ET blocks
    bt_et_pattern = r'BT\s+(.*?)\s+ET'
    matches = re.finditer(bt_et_pattern, content_stream, re.DOTALL)

    for match in matches:
        text_block = match.group(1)
        operators = _parse_operators(text_block)
        text_objects.append({
            'operators': operators,
            'text': _extract_text_from_operators(operators)
        })

    return text_objects


def _parse_operators(text_block: str) -> List[Dict[str, str]]:
    """
    Parse individual operators from a text block.

    Args:
        text_block: Text between BT and ET

    Returns:
        List of operator dictionaries with type and value
    """
    operators = []

    # Text matrix (Tm)
    tm_pattern = r'([\d.\-\s]+)\s+Tm'
    for match in re.finditer(tm_pattern, text_block):
        operators.append({
            'type': 'Tm',
            'value': match.group(1).strip(),
            'description': 'Text Matrix'
        })

    # Font (Tf)
    tf_pattern = r'/(\S+)\s+([\d.]+)\s+Tf'
    for match in re.finditer(tf_pattern, text_block):
        operators.append({
            'type': 'Tf',
            'value': f'/{match.group(1)} {match.group(2)}',
            'description': f'Font: {match.group(1)}, Size: {match.group(2)}'
        })

    # Text positioning (Td, TD)
    td_pattern = r'([\d.\-]+)\s+([\d.\-]+)\s+T[dD]'
    for match in re.finditer(td_pattern, text_block):
        operators.append({
            'type': 'Td',
            'value': f'{match.group(1)} {match.group(2)}',
            'description': f'Move text position ({match.group(1)}, {match.group(2)})'
        })

    # Text showing (Tj)
    tj_pattern = r'\((.*?)\)\s*Tj'
    for match in re.finditer(tj_pattern, text_block):
        text = match.group(1)
        operators.append({
            'type': 'Tj',
            'value': f'({text})',
            'description': f'Show text: {text[:50]}'
        })

    # Text showing (TJ - array)
    tj_array_pattern = r'\[(.*?)\]\s*TJ'
    for match in re.finditer(tj_array_pattern, text_block, re.DOTALL):
        array_content = match.group(1)
        operators.append({
            'type': 'TJ',
            'value': f'[{array_content[:100]}]',
            'description': 'Show text array'
        })

    # Text leading (TL)
    tl_pattern = r'([\d.\-]+)\s+TL'
    for match in re.finditer(tl_pattern, text_block):
        operators.append({
            'type': 'TL',
            'value': match.group(1),
            'description': f'Text leading: {match.group(1)}'
        })

    # Color operators (rg, RG, g, G)
    color_pattern = r'([\d.\s]+)\s+(rg|RG|g|G)'
    for match in re.finditer(color_pattern, text_block):
        operators.append({
            'type': match.group(2),
            'value': match.group(1).strip(),
            'description': f'Color: {match.group(1).strip()}'
        })

    return operators


def _extract_text_from_operators(operators: List[Dict[str, str]]) -> str:
    """
    Extract visible text from an operator list.

    Args:
        operators: List of operator dictionaries

    Returns:
        Concatenated text content
    """
    text_parts = []

    for op in operators:
        if op['type'] in ['Tj', 'TJ']:
            # Extract text from parentheses or array.
            # Simple extraction - just take content in parentheses.
            value = op['value']
            matches = re.findall(r'\((.*?)\)', value)
            text_parts.extend(matches)

    return ' '.join(text_parts)


def _find_matching_text_object(
    text_objects: List[Dict[str, Any]],
    target_block: Any
) -> Optional[Dict[str, Any]]:
    """
    Find the text object that best matches the target block.

    Args:
        text_objects: List of parsed text objects
        target_block: BlockInfo object to match

    Returns:
        Matching text object or None
    """
    target_text = target_block.text.strip()
    if not target_text:
        return None

    best_match = None
    best_score = 0

    for text_obj in text_objects:
        obj_text = text_obj['text'].strip()
        if not obj_text:
            continue

        # Calculate a similarity score (simple substring matching):
        # check whether either text contains the other
        if target_text in obj_text or obj_text in target_text:
            score = min(len(target_text), len(obj_text)) / max(len(target_text), len(obj_text))
            if score > best_score:
                best_score = score
                best_match = text_obj

    # Only return the match if the score is reasonable
    if best_score > 0.3:
        return best_match

    return None

def format_operators_markdown(result: Dict[str, Any]) -> str:
    """
    Format operators as readable Markdown.

    Args:
        result: Result dictionary from extract_content_stream_for_block

    Returns:
        Formatted Markdown string
    """
    if 'error' in result:
        return f"## Error\n\n{result['error']}"

    lines = [
        "## Content Stream Operators",
        "",
        f"**Block Text**: {result.get('block_text', 'N/A')}",
        ""
    ]

    if not result.get('matched'):
        lines.extend([
            "⚠️ **Warning**: Could not find exact matching text object in content stream.",
            "",
            result.get('message', ''),
            ""
        ])

    operators = result.get('operators', [])
    if operators:
        lines.extend([
            "### Operators Found",
            ""
        ])

        for i, op in enumerate(operators, 1):
            lines.append(f"**{i}. {op['type']}**")
            lines.append(f"   - Value: `{op['value']}`")
            lines.append(f"   - {op['description']}")
            lines.append("")
    else:
        lines.append("No operators found.")

    return "\n".join(lines)


def format_raw_stream(raw_stream: str, max_lines: int = 100) -> str:
    """
    Format the raw content stream for display.

    Args:
        raw_stream: Raw PDF content stream text
        max_lines: Maximum number of lines to display

    Returns:
        Formatted string
    """
    lines = raw_stream.split('\n')
    # Count before truncating; this also avoids a backslash inside an
    # f-string expression, which is a SyntaxError before Python 3.12.
    total = len(lines)
    if total > max_lines:
        lines = lines[:max_lines]
        lines.append(f"\n... (truncated, {total - max_lines} more lines)")

    return '\n'.join(lines)
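The two heuristics in this new module, the operator regexes and the containment-based block matcher, can be sanity-checked in isolation. A standalone sketch (the sample content-stream body and the `containment_score` helper are illustrative, not part of the module):

```python
import re

# A minimal hand-written text-object body (the part between BT and ET):
body = "/F1 12 Tf 72 720 Td (Hello) Tj"

# The same patterns the parser uses for fonts and literal strings:
font = re.search(r'/(\S+)\s+([\d.]+)\s+Tf', body)   # font name + size
shown = re.findall(r'\((.*?)\)\s*Tj', body)         # strings shown with Tj

assert font.group(1) == "F1" and font.group(2) == "12"
assert shown == ["Hello"]


def containment_score(a: str, b: str) -> float:
    """Score two strings the way the block matcher does: only when one
    contains the other, with the length ratio as confidence (cutoff 0.3)."""
    a, b = a.strip(), b.strip()
    if not a or not b or (a not in b and b not in a):
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


assert containment_score("Hello world", "Hello world, again") > 0.3
assert containment_score("Hello", "Goodbye") == 0.0
```

Note the extraction is intentionally rough: escaped parentheses and hex strings (`<...>` with `Tj`) are not handled, which is one reason the matcher needs a fuzzy score rather than exact equality.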
screen_reader_sim.py ADDED
@@ -0,0 +1,398 @@
"""
Screen Reader Simulator Module

Simulates how NVDA and JAWS would read a PDF page, supporting both
tagged (structure tree) and untagged (visual order fallback) PDFs.
"""

from typing import Dict, List, Any, Optional, Tuple

from structure_tree import extract_structure_tree, StructureNode


def simulate_screen_reader(
    pdf_path: str,
    page_index: int,
    blocks: List[Any],
    reader_type: str = "NVDA",
    detail_level: str = "default",
    order_mode: str = "tblr"
) -> Dict[str, Any]:
    """
    Simulate screen reader output for a PDF page.

    Args:
        pdf_path: Path to PDF file
        page_index: 0-based page index
        blocks: List of BlockInfo objects from extract_blocks_spans
        reader_type: "NVDA" or "JAWS"
        detail_level: "minimal", "default", or "verbose"
        order_mode: Reading order mode for untagged fallback ("raw", "tblr", "columns")

    Returns:
        Dictionary with transcript, analysis, and metadata
    """
    # Try the tagged approach first
    root = extract_structure_tree(pdf_path)

    if root:
        # Use the structure tree
        transcript, analysis = _simulate_tagged(
            root, page_index, reader_type, detail_level
        )
        mode = "tagged"
    else:
        # Fall back to visual order
        transcript, analysis = _simulate_untagged(
            blocks, reader_type, detail_level, order_mode
        )
        mode = "untagged"

    return {
        'transcript': transcript,
        'analysis': analysis,
        'mode': mode,
        'reader_type': reader_type,
        'detail_level': detail_level
    }


def _simulate_tagged(
    root: StructureNode,
    page_index: int,
    reader_type: str,
    detail_level: str
) -> Tuple[str, str]:
    """
    Simulate a screen reader for a tagged PDF using the structure tree.

    Args:
        root: Root StructureNode
        page_index: Page to simulate (0-based)
        reader_type: "NVDA" or "JAWS"
        detail_level: Detail level

    Returns:
        Tuple of (transcript, analysis)
    """
    # Collect structure elements for this page
    page_elements = []

    def _collect_page_elements(node: StructureNode):
        # Include the node if it belongs to this page or has no page ref (document-level)
        if node.page_ref is None or node.page_ref == page_index:
            if node.tag_type not in ['StructTreeRoot', 'MCID']:
                page_elements.append(node)

        for child in node.children:
            _collect_page_elements(child)

    _collect_page_elements(root)

    # Generate the transcript
    transcript_lines = []
    element_count = 0

    for element in page_elements:
        announcement = _format_element_announcement(
            element, reader_type, detail_level
        )
        if announcement:
            transcript_lines.append(announcement)
            element_count += 1

    transcript = '\n\n'.join(transcript_lines)

    # Generate the analysis
    analysis_lines = [
        "## Screen Reader Analysis (Tagged Mode)",
        "",
        "**Structure**: This page uses PDF tagging (accessible structure tree)",
        f"**Elements Found**: {element_count}",
        ""
    ]

    # Count element types
    tag_counts = {}
    for element in page_elements:
        tag_counts[element.tag_type] = tag_counts.get(element.tag_type, 0) + 1

    if tag_counts:
        analysis_lines.extend([
            "### Element Types",
            ""
        ])
        for tag, count in sorted(tag_counts.items()):
            analysis_lines.append(f"- **{tag}**: {count}")

    # Check alt text coverage
    elements_needing_alt = [e for e in page_elements if e.tag_type in ['Figure', 'Formula', 'Artifact']]
    elements_with_alt = [e for e in elements_needing_alt if e.alt_text]

    if elements_needing_alt:
        coverage = len(elements_with_alt) / len(elements_needing_alt) * 100
        analysis_lines.extend([
            "",
            "### Alt Text Coverage",
            "",
            f"**Elements needing alt text**: {len(elements_needing_alt)}",
            f"**Elements with alt text**: {len(elements_with_alt)}",
            f"**Coverage**: {coverage:.1f}%",
            ""
        ])

        if coverage < 100:
            analysis_lines.append("⚠️ Some elements are missing alt text")

    analysis = '\n'.join(analysis_lines)

    return transcript, analysis


def _simulate_untagged(
    blocks: List[Any],
    reader_type: str,
    detail_level: str,
    order_mode: str
) -> Tuple[str, str]:
    """
    Simulate a screen reader for an untagged PDF using visual order.

    Args:
        blocks: List of BlockInfo objects
        reader_type: "NVDA" or "JAWS"
        detail_level: Detail level
        order_mode: Reading order mode

    Returns:
        Tuple of (transcript, analysis)
    """
    from app import order_blocks  # Imported here to avoid a circular import

    # Order blocks according to the selected mode
    ordered_blocks = order_blocks(blocks, order_mode)

    # Generate the transcript
    transcript_lines = []
    text_block_count = 0
    image_block_count = 0

    for block in ordered_blocks:
        if block.block_type == 0:  # Text block
            # Infer heading level from average font size
            is_heading = False
            heading_level = None

            if block.spans:
                avg_size = sum(s.size for s in block.spans) / len(block.spans)
                if avg_size > 18:
                    is_heading = True
                    heading_level = 1
                elif avg_size > 14:
                    is_heading = True
                    heading_level = 2

            # Format the announcement
            if is_heading and detail_level != "minimal":
                if reader_type == "NVDA":
                    transcript_lines.append(f"Heading level {heading_level}")
                    transcript_lines.append(block.text.strip())
                else:  # JAWS
                    transcript_lines.append(f"Heading {heading_level}: {block.text.strip()}")
            else:
                transcript_lines.append(block.text.strip())

            text_block_count += 1

        elif block.block_type == 1:  # Image block
            if detail_level != "minimal":
                transcript_lines.append("[Image - no alt text available]")
            image_block_count += 1

    transcript = '\n\n'.join(transcript_lines)

    # Generate the analysis
    analysis_lines = [
        "## Screen Reader Analysis (Untagged Mode)",
        "",
        "⚠️ **No Structure**: This page does not use PDF tagging",
        "",
        "Screen readers will read text in visual order with limited context.",
        "",
        f"**Reading Order Mode**: {order_mode}",
        f"**Text Blocks**: {text_block_count}",
        f"**Images**: {image_block_count}",
        "",
        "### Limitations",
        "",
        "- No semantic information (headings, lists, tables)",
        "- No alt text for images",
        "- Reading order may not match the intended flow",
        "- Navigation by element type is not possible",
        "",
        "**Recommendation**: Add PDF tagging for better accessibility"
    ]

    analysis = '\n'.join(analysis_lines)

    return transcript, analysis


def _format_element_announcement(
    element: StructureNode,
    reader_type: str,
    detail_level: str
) -> Optional[str]:
    """
    Format a structure element as a screen reader announcement.

    Args:
        element: StructureNode to announce
        reader_type: "NVDA" or "JAWS"
        detail_level: "minimal", "default", or "verbose"

    Returns:
        Formatted announcement string or None
    """
    tag = element.tag_type
    lines = []

    # Map PDF tag types to screen reader announcements
    if tag.startswith('H'):
        # Heading
        level = tag[1:] if len(tag) > 1 else '1'
        text = element.actual_text or "[Heading]"

        if detail_level == "minimal":
            return text

        if reader_type == "NVDA":
            lines.append(f"Heading level {level}")
            lines.append(text)
        else:  # JAWS
            lines.append(f"Heading {level}: {text}")

    elif tag == 'P':
        # Paragraph
        text = element.actual_text or "[Paragraph]"

        if detail_level == "minimal":
            return text

        if detail_level == "verbose":
            if reader_type == "NVDA":
                lines.append("Paragraph")
            lines.append(text)
            if reader_type == "NVDA":
                lines.append("Out of paragraph")
        else:
            lines.append(text)

    elif tag == 'Figure':
        # Figure/Image
        alt_text = element.alt_text or "[Image - no alt text]"

        if detail_level == "minimal":
            return None

        if reader_type == "NVDA":
            lines.append("Graphic")
            lines.append(alt_text)
        else:  # JAWS
            lines.append(f"Graphic: {alt_text}")

    elif tag == 'Formula':
        # Math formula
        alt_text = element.alt_text or element.actual_text or "[Formula]"

        if detail_level == "minimal":
            return alt_text

        if reader_type == "NVDA":
            lines.append("Formula")
            lines.append(alt_text)
        else:  # JAWS
            lines.append(f"Formula: {alt_text}")

    elif tag in ['L', 'LI']:
        # List / list item
        text = element.actual_text or "[List item]"

        if detail_level == "minimal":
            return text

        if tag == 'L' and detail_level == "verbose":
            lines.append("List start")
        else:
            if reader_type == "NVDA":
                lines.append("List item")
                lines.append(text)
            else:  # JAWS
                lines.append(f"Bullet: {text}")

    elif tag == 'Table':
        # Table
        if detail_level != "minimal":
            if reader_type == "NVDA":
                lines.append("Table")
            else:  # JAWS
                lines.append("Table start")

    elif tag in ['TR', 'TD', 'TH']:
        # Table row/cell
        text = element.actual_text or ""
        if text and detail_level != "minimal":
            lines.append(text)

    elif tag == 'Link':
        # Link
        text = element.actual_text or "[Link]"

        if detail_level == "minimal":
            return text

        if reader_type == "NVDA":
            lines.append("Link")
            lines.append(text)
        else:  # JAWS
            lines.append(f"Link: {text}")

    elif tag == 'Span':
        # Inline text
        text = element.actual_text or ""
        if text:
            return text

    elif tag in ['Document', 'Part', 'Sect', 'Div', 'Art']:
        # Container elements - usually not announced
        return None

    else:
        # Unknown tag type
        if element.actual_text:
            return element.actual_text

    if lines:
        return '\n'.join(lines)

    return None


def format_transcript(result: Dict[str, Any]) -> str:
    """
    Format a screen reader transcript for display.

    Args:
        result: Result from simulate_screen_reader

    Returns:
        Formatted transcript string
    """
    header = f"# {result['reader_type']} Transcript ({result['detail_level']} detail)\n\n"

    if result['mode'] == 'untagged':
        header += "⚠️ Simulated from visual order (PDF not tagged)\n\n"

    header += "---\n\n"

    return header + result['transcript']
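For untagged pages, heading detection above is a pure font-size heuristic: blocks averaging over 18pt read as level-1 headings, over 14pt as level 2. A minimal standalone sketch of those thresholds (`infer_heading_level` is a hypothetical helper; the module itself reads sizes from `BlockInfo.spans`):

```python
def infer_heading_level(span_sizes):
    """Return 1, 2, or None from a block's span font sizes (in points)."""
    if not span_sizes:
        return None
    avg_size = sum(span_sizes) / len(span_sizes)
    if avg_size > 18:
        return 1   # large text, treat as H1
    if avg_size > 14:
        return 2   # medium text, treat as H2
    return None    # body text

print(infer_heading_level([20.0, 22.0]))  # 1
print(infer_heading_level([15.0]))        # 2
print(infer_heading_level([11.0, 11.0]))  # None
```

Because the heuristic sees only font sizes, a bold 12pt subheading or a 20pt pull quote will be misclassified; that is the gap the tagged path closes.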
structure_tree.py
ADDED
@@ -0,0 +1,493 @@
"""
Structure Tree Analysis Module

Provides functionality for extracting and analyzing PDF structure trees,
including paragraph detection and block-to-tag mapping.
"""

import re
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Dict, Any

import pikepdf


@dataclass
class StructureNode:
    """Represents a node in the PDF structure tree."""
    tag_type: str  # P, H1, Document, etc.
    depth: int
    mcid: Optional[int] = None
    alt_text: Optional[str] = None
    actual_text: Optional[str] = None
    page_ref: Optional[int] = None
    children: List['StructureNode'] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        """Convert to a dictionary for JSON serialization."""
        return {
            'tag_type': self.tag_type,
            'depth': self.depth,
            'mcid': self.mcid,
            'alt_text': self.alt_text,
            'actual_text': self.actual_text,
            'page_ref': self.page_ref,
            'children': [child.to_dict() for child in self.children]
        }


def extract_structure_tree(pdf_path: str) -> Optional[StructureNode]:
    """
    Extract the complete structure tree from a PDF.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Root StructureNode or None if no structure tree exists
    """
    try:
        with pikepdf.open(pdf_path) as pdf:
            if '/StructTreeRoot' not in pdf.Root:
                return None

            struct_root = pdf.Root.StructTreeRoot

            # Create the root node
            root_node = StructureNode(
                tag_type="StructTreeRoot",
                depth=0
            )

            # Recursively parse the tree
            if '/K' in struct_root:
                _parse_structure_element(struct_root.K, root_node, 1, pdf)

            return root_node

    except Exception as e:
        print(f"Error extracting structure tree: {e}")
        return None


def _parse_structure_element(element, parent_node: StructureNode, depth: int, pdf: pikepdf.Pdf, max_depth: int = 20):
    """
    Recursively parse a structure element and its children.

    Args:
        element: pikepdf object representing the element
        parent_node: Parent StructureNode to attach children to
        depth: Current depth in the tree
        pdf: pikepdf.Pdf object for resolving references
        max_depth: Maximum recursion depth to prevent infinite loops
    """
    if depth > max_depth:
        return

    # Handle arrays of elements
    if isinstance(element, pikepdf.Array):
        for item in element:
            _parse_structure_element(item, parent_node, depth, pdf, max_depth)
        return

    # Handle MCID (Marked Content ID) - a leaf node
    if isinstance(element, int):
        node = StructureNode(
            tag_type="MCID",
            depth=depth,
            mcid=element
        )
        parent_node.children.append(node)
        return

    # Handle dictionary (structure element)
    if isinstance(element, pikepdf.Dictionary):
        # Extract the tag type
        tag_type = str(element.get('/S', 'Unknown'))
        if tag_type.startswith('/'):
            tag_type = tag_type[1:]  # Remove the leading slash

        # Extract attributes
        mcid = None
        if '/MCID' in element:
            mcid = int(element.MCID)

        alt_text = None
        if '/Alt' in element:
            try:
                alt_text = str(element.Alt)
            except Exception:
                pass

        actual_text = None
        if '/ActualText' in element:
            try:
                actual_text = str(element.ActualText)
            except Exception:
                pass

        page_ref = None
        if '/Pg' in element:
            try:
                # Resolve the page reference to a 0-based page number
                page_obj = element.Pg
                for i, page in enumerate(pdf.pages):
                    if page.obj == page_obj:
                        page_ref = i
                        break
            except Exception:
                pass

        # Create the node
        node = StructureNode(
            tag_type=tag_type,
            depth=depth,
            mcid=mcid,
            alt_text=alt_text,
            actual_text=actual_text,
            page_ref=page_ref
        )
        parent_node.children.append(node)

        # Recursively process children
        if '/K' in element:
            _parse_structure_element(element.K, node, depth + 1, pdf, max_depth)


def format_tree_text(root: StructureNode, max_nodes: int = 500) -> str:
    """
    Format the structure tree as indented text with box-drawing characters.

    Args:
        root: Root StructureNode
        max_nodes: Maximum number of nodes to display

    Returns:
        Formatted text representation
    """
    lines = []
    node_count = [0]  # Use a list so the nested function can mutate it

    def _format_node(node: StructureNode, prefix: str = "", is_last: bool = True):
        if node_count[0] >= max_nodes:
            if node_count[0] == max_nodes:
                lines.append(f"{prefix}... (truncated, tree too large)")
            node_count[0] += 1
            return

        # Format node info
        info = node.tag_type
        if node.mcid is not None:
            info += f" [MCID: {node.mcid}]"
        if node.alt_text:
            info += f" (Alt: {node.alt_text[:30]}...)" if len(node.alt_text) > 30 else f" (Alt: {node.alt_text})"
        if node.actual_text:
            info += f" (Text: {node.actual_text[:30]}...)" if len(node.actual_text) > 30 else f" (Text: {node.actual_text})"
        if node.page_ref is not None:
            info += f" [Page {node.page_ref + 1}]"

        # Add the line with the appropriate prefix
        if node.depth == 0:
            lines.append(info)
        else:
            connector = "└── " if is_last else "├── "
            lines.append(f"{prefix}{connector}{info}")

        node_count[0] += 1

        # Process children
        if node.children:
            extension = "    " if is_last else "│   "
            new_prefix = prefix + extension if node.depth > 0 else ""

            for i, child in enumerate(node.children):
                is_last_child = (i == len(node.children) - 1)
                _format_node(child, new_prefix, is_last_child)

    _format_node(root)
    return "\n".join(lines)


def get_tree_statistics(root: StructureNode) -> Dict[str, Any]:
    """
    Calculate statistics about the structure tree.

    Args:
        root: Root StructureNode

    Returns:
        Dictionary of statistics
    """
    node_count = 0
    max_depth = 0
    tag_counts = Counter()
    pages_with_structure = set()
    nodes_with_alt_text = 0
    nodes_with_actual_text = 0
    mcid_count = 0

    def _traverse(node: StructureNode):
        nonlocal node_count, max_depth, nodes_with_alt_text, nodes_with_actual_text, mcid_count

        node_count += 1
        max_depth = max(max_depth, node.depth)
        tag_counts[node.tag_type] += 1

        if node.page_ref is not None:
            pages_with_structure.add(node.page_ref)
        if node.alt_text:
            nodes_with_alt_text += 1
        if node.actual_text:
            nodes_with_actual_text += 1
        if node.mcid is not None:
            mcid_count += 1

        for child in node.children:
            _traverse(child)

    _traverse(root)

    return {
        'total_nodes': node_count,
        'max_depth': max_depth,
        'tag_type_counts': dict(tag_counts.most_common()),
        'pages_with_structure': sorted(pages_with_structure),
        'nodes_with_alt_text': nodes_with_alt_text,
        'nodes_with_actual_text': nodes_with_actual_text,
        'mcid_count': mcid_count
    }


def format_statistics_markdown(stats: Dict[str, Any]) -> str:
    """Format tree statistics as Markdown."""
    lines = [
        "## Structure Tree Statistics",
        "",
        f"**Total Nodes**: {stats['total_nodes']}",
        f"**Maximum Depth**: {stats['max_depth']}",
        f"**Nodes with Alt Text**: {stats['nodes_with_alt_text']}",
        f"**Nodes with Actual Text**: {stats['nodes_with_actual_text']}",
        f"**MCID References**: {stats['mcid_count']}",
        "",
        "### Tag Type Distribution",
        ""
    ]

    for tag, count in stats['tag_type_counts'].items():
        lines.append(f"- **{tag}**: {count}")

    lines.extend([
        "",
        f"**Pages with Structure**: {len(stats['pages_with_structure'])}"
    ])

    if stats['pages_with_structure']:
        page_list = ", ".join(str(p + 1) for p in stats['pages_with_structure'][:20])
        if len(stats['pages_with_structure']) > 20:
            page_list += f" ... ({len(stats['pages_with_structure']) - 20} more)"
        lines.append(f"({page_list})")

    return "\n".join(lines)


def extract_mcid_for_page(pdf_path: str, page_index: int) -> List[int]:
    """
    Extract all MCIDs (Marked Content IDs) from a page's content stream.

    Args:
        pdf_path: Path to PDF file
        page_index: 0-based page index

    Returns:
        List of MCIDs found in the page
    """
    try:
        with pikepdf.open(pdf_path) as pdf:
            page = pdf.pages[page_index]

            # Make sure the page has a content stream
            if '/Contents' not in page:
                return []

            # Extract the content stream bytes
            contents = page.Contents
            if isinstance(contents, pikepdf.Array):
                # Multiple content streams
                stream_data = b""
                for stream in contents:
                    stream_data += bytes(stream.read_bytes())
            else:
                stream_data = bytes(contents.read_bytes())

            # Decode the content stream
            try:
                content_text = stream_data.decode('latin-1')
            except Exception:
                content_text = stream_data.decode('utf-8', errors='ignore')

            # Extract MCIDs using a regex
            # Pattern: /MCID <number> BDC or /MCID <number> >> BDC
            mcid_pattern = r'/MCID\s+(\d+)'
            matches = re.findall(mcid_pattern, content_text)

            return [int(m) for m in matches]

    except Exception as e:
        print(f"Error extracting MCIDs: {e}")
        return []


def map_blocks_to_tags(pdf_path: str, page_index: int, blocks) -> List[Dict[str, Any]]:
    """
    Map visual blocks to structure tree tags via MCIDs.

    Args:
        pdf_path: Path to PDF file
        page_index: 0-based page index
        blocks: List of BlockInfo objects from extract_blocks_spans

    Returns:
        List of mappings with block index, tag info, MCID, and alt text
    """
    # Extract the structure tree
    root = extract_structure_tree(pdf_path)
    if not root:
        return []

    # Get MCIDs from the page
    page_mcids = extract_mcid_for_page(pdf_path, page_index)

    # Build an MCID-to-structure-node mapping
    mcid_to_node = {}

    def _find_mcids(node: StructureNode):
        if node.mcid is not None and (node.page_ref is None or node.page_ref == page_index):
            mcid_to_node[node.mcid] = node
        for child in node.children:
            _find_mcids(child)

    _find_mcids(root)

    # Create the mappings
    mappings = []
    for i, mcid in enumerate(page_mcids):
        if i < len(blocks) and mcid in mcid_to_node:
            node = mcid_to_node[mcid]
            mappings.append({
                'block_index': i,
                'tag_type': node.tag_type,
                'mcid': mcid,
                'alt_text': node.alt_text or "",
                'actual_text': node.actual_text or ""
            })

    return mappings


def detect_visual_paragraphs(blocks, vertical_gap_threshold: float = 15.0) -> List[List[int]]:
    """
    Detect visual paragraphs based on spacing heuristics.

    Args:
        blocks: List of BlockInfo objects
        vertical_gap_threshold: Minimum vertical gap to consider a paragraph break

    Returns:
        List of paragraph groups, where each group is a list of block indices
    """
    # Filter text blocks
    text_blocks = [(i, b) for i, b in enumerate(blocks) if b.block_type == 0 and b.text.strip()]
    if not text_blocks:
        return []

    # Sort by vertical position (top to bottom)
    text_blocks.sort(key=lambda x: x[1].bbox[1])

    paragraphs = []
    current_paragraph = [text_blocks[0][0]]
    prev_bbox = text_blocks[0][1].bbox

    for idx, block in text_blocks[1:]:
        bbox = block.bbox

        # Calculate the vertical gap to the previous block
        vertical_gap = bbox[1] - prev_bbox[3]

        # Check whether the blocks are roughly aligned horizontally (same column)
        horizontal_overlap = min(bbox[2], prev_bbox[2]) - max(bbox[0], prev_bbox[0])

        if vertical_gap < vertical_gap_threshold and horizontal_overlap > 0:
            # Same paragraph
            current_paragraph.append(idx)
        else:
            # New paragraph
            paragraphs.append(current_paragraph)
            current_paragraph = [idx]

        prev_bbox = bbox

    # Add the last paragraph
    if current_paragraph:
        paragraphs.append(current_paragraph)

    return paragraphs


def detect_semantic_paragraphs(pdf_path: str, page_index: int) -> List[StructureNode]:
    """
    Extract semantic paragraph tags (<P>) from the structure tree.

    Args:
        pdf_path: Path to PDF file
        page_index: 0-based page index

    Returns:
        List of StructureNode objects with tag_type='P' for the page
    """
    root = extract_structure_tree(pdf_path)
    if not root:
        return []

    paragraphs = []

    def _find_paragraphs(node: StructureNode):
        if node.tag_type == 'P' and (node.page_ref is None or node.page_ref == page_index):
            paragraphs.append(node)
        for child in node.children:
            _find_paragraphs(child)

    _find_paragraphs(root)
    return paragraphs


def compare_paragraphs(visual_paragraphs: List[List[int]], semantic_paragraphs: List[StructureNode]) -> Dict[str, Any]:
    """
    Compare visual and semantic paragraph detection.

    Args:
        visual_paragraphs: List of visual paragraph groups (block indices)
        semantic_paragraphs: List of semantic <P> tags

    Returns:
        Dictionary with comparison statistics
    """
    visual_count = len(visual_paragraphs)
    semantic_count = len(semantic_paragraphs)

    # Calculate a match quality score (simple heuristic)
    if visual_count == 0 and semantic_count == 0:
        match_score = 1.0
    elif visual_count == 0 or semantic_count == 0:
        match_score = 0.0
    else:
        # Score based on count similarity
        match_score = min(visual_count, semantic_count) / max(visual_count, semantic_count)

    return {
        'visual_count': visual_count,
        'semantic_count': semantic_count,
        'match_score': match_score,
        'count_mismatch': abs(visual_count - semantic_count)
    }