# PDF Debugging Workflow
This guide details how to use the PDF Inspector tool to diagnose and remediate common PDF accessibility issues.
## 1. Initial Compatibility Check
**Goal**: Determine if the document requires major remediation before detailed analysis.
1. **Upload the PDF**: Use the file uploader or select an example from the list.
2. **Run Single Page Analysis**: Click "Analyze".
3. **Check for Alerts**: Look for the "Accessibility Alert" box at the top of the summary.
    * **Untagged Document**: If you see this, the document lacks the "Structure Tree" required for screen readers.
        * *Remediation*: Open the source file (Word/PPT) and "Save as PDF" with tags enabled, or use Adobe Acrobat Pro's "Autotag" feature.
    * **Scanned Page**: If you see this, the page is an image with no selectable text.
        * *Remediation*: Perform Optical Character Recognition (OCR) using Adobe Acrobat or a similar tool.
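
To make the two alerts concrete, here is a minimal sketch of the kind of check that could sit behind them. It is not the tool's actual detection logic: the function names, the `min_chars` threshold, and the idea of passing in pre-computed text and image counts are all assumptions for illustration. The byte search is also only a heuristic, since a document catalog stored inside a compressed object stream would not be visible this way.

```python
def looks_tagged(pdf_bytes: bytes) -> bool:
    """Crude heuristic: is a /StructTreeRoot key visible in the raw file bytes?

    Can return False for a tagged PDF whose catalog sits in a compressed
    object stream, so treat a negative result as "needs a closer look".
    """
    return b"/StructTreeRoot" in pdf_bytes


def looks_scanned(text_char_count: int, image_count: int, min_chars: int = 20) -> bool:
    """Treat a page as scanned when it has images but almost no extractable text.

    text_char_count and image_count are assumed to come from whatever
    PDF library you use; min_chars is an arbitrary illustrative threshold.
    """
    return image_count > 0 and text_char_count < min_chars


if __name__ == "__main__":
    print(looks_tagged(b"<< /Type /Catalog /StructTreeRoot 12 0 R >>"))
    print(looks_scanned(text_char_count=0, image_count=1))
```

A page that trips `looks_scanned` needs OCR before any of the later steps in this guide will produce meaningful results.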
## 2. Detailed Single-Page Inspection
**Goal**: Verify reading order and content types on a specific page.
1. **Visual Inspection**: Look at the "Analysis Results" image.
    * **Red Boxes**: Indicate detected text blocks.
    * **Numbers**: Show the reading order.
2. **Verify Reading Order**:
    * Does the order (1, 2, 3...) follow the logical flow of the document?
    * *Issue*: If columns are read left-to-right across the page instead of down the column, the reading order is broken.
    * *Fix*: This usually requires manual retagging in Acrobat (Order panel).
3. **Check for Artifacts**:
    * Are headers/footers marked as text blocks? (They should generally be artifacts/ignored by screen readers).
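
The column problem described in step 2 can also be spotted programmatically. The sketch below assumes a simple two-column page where every block sits entirely in one column; full-width blocks such as titles would need to be filtered out first. The bounding-box format `(x0, y0, x1, y1)` and both function names are hypothetical, not part of the tool.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical format


def column_switches(blocks_in_order: List[BBox], page_width: float) -> int:
    """Count how often consecutive blocks jump between the left and right column."""
    midline = page_width / 2
    columns = [0 if (x0 + x1) / 2 < midline else 1 for x0, _, x1, _ in blocks_in_order]
    return sum(1 for a, b in zip(columns, columns[1:]) if a != b)


def reads_across_columns(blocks_in_order: List[BBox], page_width: float) -> bool:
    """A correct two-column order switches columns at most once (left, then right)."""
    return column_switches(blocks_in_order, page_width) > 1


if __name__ == "__main__":
    zigzag = [
        (50, 700, 280, 750), (320, 700, 550, 750),
        (50, 600, 280, 690), (320, 600, 550, 690),
    ]
    print(reads_across_columns(zigzag, page_width=600))
```

If the check fires on a page that visually has two columns, that page is a good candidate for manual review in the Order panel.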
## 3. Advanced Diagnostics
**Goal**: Deep dive into specific issues using the "Advanced Analysis" tab.
### Content Stream Inspector
* **Use when**: Text looks correct visually but copies weirdly or reads wrong (e.g., "fi" ligature issues).
* **Action**: Select a block and click "Extract Operators".
* **Look for**: `TJ` or `Tj` operators showing garbled characters or strange spacing adjustments.
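
As an illustration of what "strange spacing adjustments" means, the sketch below scans an already-decoded content stream for `TJ` arrays and reports unusually large numeric adjustments. It is a rough regex-based approach under stated assumptions, not a real content-stream parser: it will misread arrays whose strings contain a literal `]`, and the threshold of 200 (in thousandths of text space units) is an arbitrary choice. The function name is hypothetical.

```python
import re

# A TJ array: "[ ... ] TJ". Assumes no "]" appears inside the strings.
TJ_ARRAY = re.compile(r"\[([^\]]*)\]\s*TJ")
LITERAL_STRING = re.compile(r"\((?:\\.|[^\\()])*\)")
HEX_STRING = re.compile(r"<[0-9A-Fa-f\s]*>")
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")


def large_tj_adjustments(stream_text: str, threshold: float = 200.0) -> list:
    """Return the TJ spacing adjustments whose magnitude is at least `threshold`."""
    found = []
    for match in TJ_ARRAY.finditer(stream_text):
        # Remove the shown strings first so digits inside text are not counted.
        body = LITERAL_STRING.sub(" ", match.group(1))
        body = HEX_STRING.sub(" ", body)
        for token in NUMBER.findall(body):
            value = float(token)
            if abs(value) >= threshold:
                found.append(value)
    return found


if __name__ == "__main__":
    sample = "BT /F1 12 Tf [(Acces) -15 (sibility) -450 (rep) 20 (ort)] TJ ET"
    print(large_tj_adjustments(sample))
```

Large negative values often mean a visual space was produced by positioning rather than by a real space character, which is one reason copied text can lose its word breaks.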
### Screen Reader Simulator
* **Use when**: You want to "hear" what a user hears.
* **Action**: Select "NVDA" and click "Generate Transcript".
* **Check**:
    * Are headings announced as "Heading Level X"?
    * Is alt text read for images?
    * Is the reading order intelligible?
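
The checks above can be pictured with a toy transcript generator. This is a simplified sketch, not the simulator's implementation and not NVDA's exact wording: the element dictionaries with `tag`, `text`, and `alt` keys are a hypothetical input format chosen for the example.

```python
from typing import Dict, List, Tuple


def simulate_transcript(elements: List[Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Turn tagged elements into announcement lines and collect problems found."""
    lines: List[str] = []
    problems: List[str] = []
    for index, element in enumerate(elements):
        tag = element["tag"]
        text = element.get("text", "")
        if len(tag) == 2 and tag[0] == "H" and tag[1].isdigit():
            # H1..H6 are announced with their level.
            lines.append(f"Heading Level {tag[1]}: {text}")
        elif tag == "Figure":
            alt = element.get("alt")
            if alt:
                lines.append(f"Graphic: {alt}")
            else:
                lines.append("Graphic")
                problems.append(f"element {index}: Figure has no alt text")
        else:
            lines.append(text)
    return lines, problems


if __name__ == "__main__":
    page = [
        {"tag": "H1", "text": "Annual Report"},
        {"tag": "P", "text": "Intro."},
        {"tag": "Figure"},
    ]
    transcript, issues = simulate_transcript(page)
    print("\n".join(transcript))
    print(issues)
```

A heading that is only bold text would arrive here as a plain `P` element and be read with no "Heading Level" announcement, which is exactly the symptom to listen for.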
### Paragraph Detection
* **Use when**: Text seems run-on or broken into too many fragments.
* **Action**: Click "Analyze Paragraphs".
* **Check**:
    * **Visual vs. Semantic**: Large discrepancies between the visual and semantic paragraph results suggest the `<P>` tags don't match the visual layout, which can confuse users navigating by paragraph.
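
One way to think about the visual-versus-semantic comparison is as two counts and a ratio. The sketch below is an assumption about how such a comparison could work, not a description of the tool: it groups lines into visual paragraphs by vertical gap, with `gap_threshold` as an arbitrary value in points, and takes the `<P>` tag count as a given input.

```python
from typing import List


def count_visual_paragraphs(line_tops: List[float], gap_threshold: float = 18.0) -> int:
    """Count paragraphs from line positions listed top to bottom.

    line_tops are y-coordinates that increase down the page (an assumed
    convention); a gap larger than gap_threshold starts a new paragraph.
    """
    if not line_tops:
        return 0
    count = 1
    for previous, current in zip(line_tops, line_tops[1:]):
        if current - previous > gap_threshold:
            count += 1
    return count


def paragraph_mismatch(visual_count: int, semantic_count: int) -> float:
    """Relative difference between the two counts, from 0.0 (equal) to 1.0."""
    larger = max(visual_count, semantic_count)
    if larger == 0:
        return 0.0
    return abs(visual_count - semantic_count) / larger


if __name__ == "__main__":
    tops = [100, 114, 128, 160, 174, 220]
    visual = count_visual_paragraphs(tops)
    print(visual, paragraph_mismatch(visual, semantic_count=6))
```

A high mismatch with more `<P>` tags than visual paragraphs points to fragmented text; the reverse points to run-on paragraphs.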
### Structure Tree Visualizer
* **Use when**: The document is tagged, but navigation is broken.
* **Action**: Click "Extract Structure Tree".
* **Check**:
    * Hierarchy depth.
    * Correct nesting (e.g., `L` -> `LI` -> `LBody`).
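
Both checks can be expressed as a short walk over the tree. The sketch below assumes the structure tree has already been extracted into nested dictionaries with `tag` and `children` keys, which is a hypothetical format. The allowed-children table is deliberately simplified to the list tags and is not a full implementation of the PDF structure rules.

```python
from typing import Dict, List

# Simplified rules for list tags only; other tags are not checked.
ALLOWED_CHILDREN = {
    "L": {"LI", "Caption"},
    "LI": {"Lbl", "LBody"},
}


def tree_depth(node: Dict) -> int:
    """Depth of the tree, counting the given node as level 1."""
    children = node.get("children", [])
    return 1 + max((tree_depth(child) for child in children), default=0)


def nesting_problems(node: Dict, path: str = "") -> List[str]:
    """List every child that is not allowed under its list-related parent."""
    here = f"{path}/{node['tag']}"
    problems = []
    allowed = ALLOWED_CHILDREN.get(node["tag"])
    for child in node.get("children", []):
        if allowed is not None and child["tag"] not in allowed:
            problems.append(f"{here}: unexpected child {child['tag']}")
        problems.extend(nesting_problems(child, here))
    return problems


if __name__ == "__main__":
    broken = {"tag": "Document", "children": [{"tag": "L", "children": [{"tag": "P"}]}]}
    print(tree_depth(broken), nesting_problems(broken))
```

An unusually deep tree or a `P` sitting directly under `L` are both signs that list navigation will not work as expected.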
## 4. Batch Analysis for Large Documents
**Goal**: Identify problematic pages in a long report.
1. **Go to Batch Analysis Tab**.
2. **Run Batch**: Analyze 50-100 pages.
3. **Review the Report**:
    * **Issues Found**: Look for "Scanned Pages" or "Garbled Text".
    * **Page List**: Use the list of page numbers to target your remediation efforts.
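
If you export or copy the batch results, regrouping them by issue makes the page list easier to act on. The sketch below assumes the results are available as a mapping from page number to a list of issue labels; that shape and the function name are hypothetical, not the tool's report format.

```python
from collections import defaultdict
from typing import Dict, List


def pages_by_issue(page_results: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Invert {page: [issues]} into {issue: [pages]}, with pages in ascending order."""
    grouped = defaultdict(list)
    for page_number, issues in sorted(page_results.items()):
        for issue in issues:
            grouped[issue].append(page_number)
    return dict(grouped)


if __name__ == "__main__":
    results = {3: ["Scanned Page"], 1: [], 7: ["Garbled Text", "Scanned Page"]}
    for issue, pages in pages_by_issue(results).items():
        print(f"{issue}: pages {pages}")
```

Grouping this way lets you handle one kind of fix at a time, for example running OCR on all scanned pages in a single pass.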
## Summary Checklist
- [ ] Document is Tagged (`/StructTreeRoot` exists)
- [ ] Text is selectable (not an image/scan)
- [ ] Reading order is logical (columns handled correctly)
- [ ] Images have Alt Text (or are marked as artifacts)
- [ ] Headings use Heading tags (`<H1>`, `<H2>`), not just bold text.