Spaces:

Text-to-Document-Generation
/

Docgenie-API

Paused

File size: 97,784 Bytes

dc4e6da

# DocGenie Generation Pipeline & API Documentation

**Version:** 1.0  
**Last Updated:** February 7, 2026  
**Purpose:** Comprehensive reference for the DocGenie synthetic document generation system

---

## Table of Contents

1. [Overview](#overview)
2. [Pipeline Architecture](#pipeline-architecture)
3. [Pipeline Stages (01-19)](#pipeline-stages-01-19)
4. [API Implementation](#api-implementation)
5. [Core Models & Utilities](#core-models--utilities)
6. [Configuration & Constants](#configuration--constants)
7. [Usage Examples](#usage-examples)
8. [Error Handling & Debugging](#error-handling--debugging)

---

## Overview

DocGenie is a sophisticated 19-stage pipeline for generating synthetic document datasets with ground truth annotations. It supports multiple document understanding tasks:

- **Document Question Answering (QA)**
- **Key Information Extraction (KIE)**
- **Document Layout Analysis (DLA)**
- **Document Classification (CLS)**

### Key Features

- **LLM-Powered Generation**: Uses Claude/Gemini/Open-source models to generate diverse document content
- **Realistic Handwriting**: Diffusion model-based handwriting synthesis with author-specific styles
- **Visual Element Integration**: Stamps, logos, barcodes, charts, and photos
- **Multi-Task Support**: Task-specific ground truth formatting and validation
- **Quality Assurance**: Comprehensive validation, OCR verification, and error tracking
- **Modular Design**: Each pipeline stage is independently executable with clear inputs/outputs

### Technology Stack

- **LLM APIs**: Claude (Anthropic), Gemini, DeepSeek, Qwen
- **PDF Rendering**: Playwright (Chromium), PyMuPDF
- **OCR**: Microsoft Azure OCR
- **Handwriting**: Custom diffusion model
- **Image Processing**: PIL, OpenCV
- **API Framework**: FastAPI
- **Data Processing**: Pandas, NumPy

---

## Pipeline Architecture

### High-Level Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                        DOCGENIE GENERATION PIPELINE                      │
└─────────────────────────────────────────────────────────────────────────┘

┌──────────────────────┐
│  PHASE 1: SELECTION  │
└──────────────────────┘
    ↓
[01] Select Seeds ────────────► seeds.csv, clusters.csv
                                (Cluster-based diverse seed selection)

┌──────────────────────┐
│ PHASE 2: LLM GEN     │
└──────────────────────┘
    ↓
[02] Prompt LLM ──────────────► batch_results/ (JSON)
    │                           (Claude API batched calls)
    ↓
[03] Process Response ────────► raw_html/, raw_annotations/
                                (Extract HTML & GT from responses)

┌──────────────────────┐
│ PHASE 3: RENDERING   │
└──────────────────────┘
    ↓
[04] Render PDF Initial ──────► pdf_initial/, geometries/
    │                           (HTML→PDF with geometry extraction)
    ↓
[05] Extract BBoxes ──────────► pdf_word_bboxes/, pdf_char_bboxes/
    │                           (PyMuPDF text extraction)
    ↓
[06] Extract Layout ──────────► layout_element_definitions/
                                (DLA/KIE-specific annotations)

┌──────────────────────┐
│ PHASE 4: EXTRACTION  │
└──────────────────────┘
    ↓
[07] Extract Handwriting ─────► handwriting_definitions/
    │                           (Identify handwriting regions)
    ↓
[08] Extract Visual Elements ─► visual_element_definitions/
                                (Stamp/logo/barcode placeholders)

┌──────────────────────┐
│ PHASE 5: GENERATION  │
└──────────────────────┘
    ↓
[09] Create Handwriting ──────► handwriting_images/
    │                           (Diffusion model generation)
    ↓
[10] Create Visual Elements ──► visual_element_images/
                                (Generate/select stamps, logos, etc.)

┌──────────────────────┐
│ PHASE 6: COMPOSITION │
└──────────────────────┘
    ↓
[11] Render PDF (2nd Pass) ───► pdf_without_handwriting_placeholder/
    │                           (Remove handwriting placeholders)
    ↓
[12] Insert Handwriting ──────► pdf_with_handwriting/
    │                           (Overlay handwriting images)
    ↓
[13] Insert Visual Elements ──► pdf_final/
    │                           (Overlay stamps, logos, etc.)
    ↓
[14] Render Image ────────────► images/
                                (PDF→PNG conversion)

┌──────────────────────┐
│ PHASE 7: FINALIZATION│
└──────────────────────┘
    ↓
[15] Perform OCR ─────────────► final_word_bboxes/, final_segment_bboxes/
    │                           (Microsoft OCR)
    ↓
[16] Normalize BBoxes ────────► normalized_word_bboxes/, normalized_segment_bboxes/
                                (Pixel→[0,1] coordinates)

┌──────────────────────┐
│ PHASE 8: VALIDATION  │
└──────────────────────┘
    ↓
[17] GT Preparation ──────────► verified_gt/
    │                           (Fuzzy matching, BIO tagging)
    ↓
[18] Analyze ─────────────────► dataset_log.json
    │                           (Statistics, cost analysis)
    ↓
[19] Create Debug Data ───────► debug/ subdirectories
                                (Visualizations for inspection)
```

### Data Flow Between Stages

```
Seed Images ──┐
              ├──► [02] ──► HTML + GT ──► [04] ──► PDF + Geometries
Prompt Params ┘                                         │
                                                        ├──► [05] ──► BBoxes
                                                        │               │
                                                        │               ├──► [07] ──► HW Defs ──► [09] ──► HW Images ──┐
                                                        │               │                                              │
                                                        │               └──► [08] ──► VE Defs ──► [10] ──► VE Images ──┤
                                                        │                                                              │
                                                        └──► [11] ──► PDF (no HW) ──┬──► [12] ◄─────────────────────────┤
                                                                                    │           (Insert HW)            │
                                                                                    └──► [13] ◄────────────────────────┘
                                                                                             (Insert VE)
                                                                                                  ↓
                                                                                            [14] ──► Image
                                                                                                  ↓
                                                                                            [15] ──► OCR BBoxes
                                                                                                  ↓
                                                                                            [16] ──► Normalized
                                                                                                  ↓
                                                                                            [17] ──► Verified GT
```

---

## Pipeline Stages (01-19)

### Stage 01: Select Seeds

**File:** `pipeline_01_select_seeds.py`

**Purpose:** Select diverse seed images from base dataset using clustering algorithms to ensure variety in the generated documents.

**Key Functions:**
- `main()`: Orchestrates seed selection process
- `downscale_and_compress_seeds()`: Prepares seed images for efficient API transmission
- `plot_class_distribution()`: Visualizes class balance in selected seeds
- `visualize_cluster_histogram()`: Shows distribution across clusters

**Process:**
1. Load embeddings from base dataset
2. Perform clustering (KMeans or other algorithms)
3. Sample N seeds per cluster
4. Downscale and compress images (JPEG, max dimension)
5. Save seed manifest and cluster assignments

**Inputs:**
- `SynDatasetDefinition` configuration
- Base dataset name (e.g., `docvqa`, `cord`, `publaynet`)
- Clustering parameters from constants

**Outputs:**
```
seeds.csv                    # Selected seed document IDs per prompt call
clusters.csv                 # Cluster assignments for all documents
seeds/ (directory)           # Preprocessed seed images (JPEG, compressed)
```

**Configuration Parameters:**
- `EMBEDDING_MODEL`: Specifies which embedding model was used
- `IMAGE_MAX_DIMENSION`: Max width/height for compression
- `JPEG_QUALITY`: Compression quality (0-100)

**Example Usage:**
```python
from docgenie.generation import pipeline_01_select_seeds

pipeline_01_select_seeds.main(
    syndatadef_path="data/syn_dataset_definitions/docvqa_alpha=1.0.yaml",
    base_dataset="docvqa"
)
```

---

### Stage 02: Prompt LLM

**File:** `pipeline_02_prompt_llm.py`

**Purpose:** Send batched prompts to LLM APIs (Claude, Gemini, DeepSeek, Qwen) with seed images to generate document HTML and ground truth.

**Key Functions:**
- `main()`: Main orchestrator for LLM prompting
- `create_batched_messages()`: Constructs API-compatible message batches
- `track_batch_completion()`: Polls API for batch status
- Cost calculation utilities in `pipeline_01/cost.py`

**Process:**
1. Load seed images and encode as base64
2. Build prompts from template with parameter injection
3. Create batched API requests (Claude Batch API for cost efficiency)
4. Submit batches and track completion
5. Save results for processing in stage 03

**Inputs:**
- Prompt template from `data/prompt_templates/<template_name>/`
- Seed images from stage 01
- API credentials from environment variables
- `SynDatasetDefinition` parameters

**Outputs:**
```
prompt_batches/              # Batch metadata (batch IDs, status)
message_results/             # JSON response files per batch
logs/                        # Prompting logs and progress
```

**API Configuration:**
- **Claude:** Uses Batch API with prompt caching for cost efficiency
- **Batch Size:** Configurable via `BATCH_SIZE` constant
- **Polling Interval:** Configurable wait time between status checks
- **Model Selection:** Specified in `SynDatasetDefinition.llm_model`

**Cost Tracking:**
- Input/output token counts per request
- Cached token usage (for Claude)
- Total cost estimation per batch

**Example Configuration:**
```yaml
# In syn_dataset_definition YAML
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1  # Documents per prompt
language: "English"
doc_type: "business and administrative"
```

---

### Stage 03: Process Response

**File:** `pipeline_03_process_response.py`

**Purpose:** Extract and validate HTML documents and ground truth annotations from LLM responses.

**Key Functions:**
- `main()`: Main processor
- `extract_html_from_message()`: Regex-based HTML extraction from markdown code blocks
- `extract_gt_from_html()`: Parse JSON ground truth from `<script id="GT">` tags
- `validate_and_save_gt()`: Task-specific validation and formatting

**Process:**
1. Load message results from stage 02
2. Extract HTML content using regex patterns
3. Parse and validate ground truth JSON
4. Apply task-specific formatting (QA, KIE, DLA, CLS)
5. Save raw HTML and annotations separately

**Inputs:**
- Message results from `message_results/` (stage 02)
- `SynDatasetDefinition` task type and prompt format

**Outputs:**
```
raw_html/                    # HTML files (one per document)
raw_annotations/             # Ground truth JSON files
  qa/                        # QA format: {"question": "answer", ...}
  kie/                       # KIE format: {"entity": "value", ...}
  dla/                       # DLA format: {"element_id": "label", ...}
  cls/                       # CLS format: {"class": "category"}
logs/message_processing/     # Processing logs per message
```

**Ground Truth Formats:**

**QA Format:**
```json
{
  "What is the invoice number?": "INV-12345",
  "What is the total amount?": "$1,234.56"
}
```

**KIE Format:**
```json
{
  "company_name": "Acme Corp",
  "invoice_date": "2024-01-15",
  "total_amount": "1234.56"
}
```

**DLA Format:**
```json
{
  "layout_0": "title",
  "layout_1": "text",
  "layout_2": "table"
}
```

**Validation Checks:**
- Expected document count matches actual
- Valid JSON structure in GT
- Task-specific field presence
- HTML completeness (valid tags)

**Error Handling:**
- Malformed HTML → Logged, skipped
- Missing GT → Document flagged in logs
- Invalid JSON → Parsing error logged

---

### Stage 04: Render PDF and Extract Geometries

**File:** `pipeline_04_render_pdf_and_extract_geos.py`

**Purpose:** Convert HTML to PDF with automatic content-based page sizing and extract element geometries (positions, dimensions) for downstream processing.

**Key Functions:**
- `main()`: Async batch rendering orchestrator
- `render_pdf_with_playwright()`: Single PDF rendering using Playwright/Chromium
- `preprocess_css()`: Remove conflicting CSS `@page` rules
- `validate_pdf()`: Check page count and file integrity
- `extract_geometries()`: Parse element positions from injected JavaScript

**Process:**
1. Load raw HTML from stage 03
2. Inject geometry extraction JavaScript
3. Preprocess CSS (remove fixed page sizes)
4. Render with Playwright in headless Chromium
5. Measure content dimensions dynamically
6. Export PDF with calculated page size
7. Extract and save element geometries

**Inputs:**
- Raw HTML files from `raw_html/` (stage 03)

**Outputs:**
```
pdf_initial/                 # Initial PDFs
geometries/                  # Element geometry JSON files
  {doc_id}.json              # Contains positions, dimensions, CSS classes
render_html/                 # HTML with geometry extraction scripts
debug/pdf_with_geos/         # Debug PDFs with geometry overlays (optional)
logs/rendering/              # Rendering logs and errors
```

**Geometry JSON Structure:**
```json
{
  "page_width_mm": 210.0,
  "page_height_mm": 297.0,
  "elements": [
    {
      "id": "layout_0",
      "type": "div",
      "class": "title",
      "rect": {
        "x": 20.0,
        "y": 30.0,
        "width": 170.0,
        "height": 15.0
      },
      "text": "Invoice",
      "attributes": {
        "data-label": "title",
        "data-handwriting": null,
        "data-visual-element": null
      }
    }
  ]
}
```

**Key Features:**
- **Automatic Page Sizing:** No fixed dimensions; page size adapts to content
- **Concurrent Rendering:** Semaphore-controlled parallel processing
- **Geometry Extraction:** JavaScript injected to capture element positions
- **CSS Coordinate Conversion:** 96 DPI (CSS) → 72 DPI (PDF)
- **Retry Logic:** Up to 3 attempts with configurable timeout
- **Debug Visualizations:** Optional overlay of geometries on PDFs

**Configuration Constants:**
- `CHROMIUM_CONCURRENCY`: 10 (parallel render limit)
- `PER_PDF_RENDER_TIMEOUT`: 60 seconds
- `PER_PDF_RENDER_MAX_RETRIES`: 3 attempts
- `PDF_POINT_SCALING`: 72/96 for DPI conversion

**Error Handling:**
- Timeout → Retry with increased timeout
- Multi-page PDFs → Flagged and skipped
- Missing geometries → Error logged, document marked invalid

---

### Stage 05: Extract BBoxes from PDF

**File:** `pipeline_05_extract_bboxes_from_pdf.py`

**Purpose:** Extract word-level and character-level bounding boxes from PDFs using PyMuPDF for accurate text positioning.

**Key Functions:**
- `main()`: Main extraction orchestrator
- `extract_bboxes_from_pdf()`: PyMuPDF-based extraction
- `verify_char_to_word_mapping()`: Validate character-to-word relationships
- `create_bbox_debug_pdf()`: Generate debug visualizations

**Process:**
1. Load PDFs from stage 04
2. Extract words with PyMuPDF `get_text("words")`
3. Extract characters with `get_text("rawdict")`
4. Map characters to parent words
5. Validate mappings (required for handwriting splitting)
6. Save both word and character bboxes

**Inputs:**
- PDFs from `pdf_initial/` (stage 04)

**Outputs:**
```
pdf_word_bboxes/             # Word-level bounding boxes
  {doc_id}.json              # Format: [x0, y0, x1, y1, text]
pdf_char_bboxes/             # Character-level bounding boxes (if mappable)
  {doc_id}.json              # Format: [x0, y0, x1, y1, char, word_idx]
debug/pdf_bboxes/            # Debug PDFs with bbox overlays (optional)
logs/bbox_extraction/        # Extraction logs
```

**BBox JSON Format:**

**Word BBoxes:**
```json
[
  {"x0": 20.0, "y0": 30.0, "x1": 60.0, "y1": 45.0, "text": "Invoice"},
  {"x0": 65.0, "y0": 30.0, "x1": 95.0, "y1": 45.0, "text": "Date"}
]
```

**Char BBoxes (with word mapping):**
```json
[
  {"x0": 20.0, "y0": 30.0, "x1": 25.0, "y1": 45.0, "char": "I", "word_idx": 0},
  {"x0": 25.0, "y0": 30.0, "x1": 30.0, "y1": 45.0, "char": "n", "word_idx": 0}
]
```

**Key Features:**
- **Dual-Level Extraction:** Both word and character granularity
- **Mapping Validation:** Ensures characters can be mapped to words (critical for handwriting)
- **Debug Visualizations:** Color-coded bbox overlays on PDFs
- **PDF Coordinate System:** Uses 72 DPI (PDF points)

**Character-to-Word Mapping:**
Required for handwriting splitting in stage 07. If mapping fails (e.g., due to ligatures or complex fonts), char bboxes are not saved, and handwriting will use word-level fallback.

**Error Handling:**
- Empty PDFs → Skipped with warning
- Unmappable characters → Word-level fallback
- Invalid bboxes (negative dimensions) → Filtered out

---

### Stage 06: Extract Layout Element Definitions and Annotation GT

**File:** `pipeline_06_extract_layout_element_definitions_and_annotation_gt.py`

**Purpose:** Extract layout element definitions for Document Layout Analysis (DLA) and annotation-based ground truth for Key Information Extraction (KIE).

**Key Functions:**
- `main()`: Task router
- `handle_dla()`: Process DLA tasks
- `handle_kie()`: Process KIE tasks
- `parse_layout_elements()`: Extract layout elements from geometries
- `parse_kie_fields()`: Extract KIE annotated fields from geometries

**Process (DLA):**
1. Load geometries from stage 04
2. Filter elements with `data-label` attribute
3. Validate labels against task-specific valid labels
4. Extract bounding boxes and labels
5. Save layout element definitions

**Process (KIE):**
1. Load geometries from stage 04
2. Filter elements with `data-annotation` attribute
3. Extract entity names and text values
4. Validate against expected entity schema
5. Save raw annotations for stage 17

**Inputs:**
- Geometries from `geometries/` (stage 04)
- Valid labels from `SynDatasetDefinition.valid_labels`
- Task type from `SynDatasetDefinition.task`

**Outputs:**
```
layout_element_definitions/  # For DLA tasks
  {doc_id}.json              # [{id, label, rect}, ...]
raw_annotations/             # For KIE tasks
  kie_annotations/
    {doc_id}.json            # {entity_name: text_value, ...}
```

**Layout Element Definition Format (DLA):**
```json
[
  {
    "id": "layout_0",
    "label": "title",
    "rect": {"x": 20.0, "y": 30.0, "width": 170.0, "height": 15.0}
  },
  {
    "id": "layout_1",
    "label": "text",
    "rect": {"x": 20.0, "y": 50.0, "width": 170.0, "height": 50.0}
  }
]
```

**KIE Annotation Format:**
```json
{
  "company_name": "Acme Corporation",
  "invoice_number": "INV-12345",
  "invoice_date": "January 15, 2024",
  "total_amount": "$1,234.56"
}
```

**Validation Checks:**
- **Missing labels:** Elements without `data-label` → Warning
- **Invalid labels:** Labels not in valid_labels → Error
- **Zero dimensions:** Width or height = 0 → Error
- **Multiple labels (KIE only):** One element, multiple entities → Error
- **Missing text:** KIE elements without text → Warning

**Task-Specific Handling:**
- **DLA:** Converts geometries to `LayoutBBox` format for stage 17
- **KIE:** Extracts entity-value pairs for BIO tagging in stage 17
- **QA/CLS:** No processing in this stage (uses raw GT from stage 03)

---

### Stage 07: Extract Handwriting

**File:** `pipeline_07_extract_handwriting.py`

**Purpose:** Identify text regions marked for handwriting generation, map them to character/word bounding boxes, and prepare definitions for the diffusion model.

**Key Functions:**
- `main()`: Main extraction orchestrator
- `parse_handwriting_geometries()`: Extract elements with `data-handwriting` attribute
- `map_handwriting_to_bboxes()`: Match handwriting text to OCR bboxes spatially
- `split_long_words()`: Split words exceeding diffusion model's character limit
- `extract_handwriting_style()`: Parse author ID from CSS class

**Process:**
1. Load geometries from stage 04
2. Filter elements with `data-handwriting` attribute
3. Load word/character bboxes from stage 05
4. Spatially match handwriting regions to bboxes
5. Split long words for diffusion model constraints
6. Extract author ID for style consistency
7. Save handwriting definitions with bbox mappings

**Inputs:**
- Geometries from `geometries/` (stage 04)
- Word bboxes from `pdf_word_bboxes/` (stage 05)
- Character bboxes from `pdf_char_bboxes/` (stage 05, if available)

**Outputs:**
```
handwriting_definitions/
  {doc_id}.json              # Handwriting region definitions
logs/handwriting_extraction/ # Extraction logs and warnings
```

**Handwriting Definition Format:**
```json
[
  {
    "id": "hw0",
    "text": "John Smith",
    "author_id": "author1",
    "bboxes": [
      "20.0,30.0,60.0,45.0,John",
      "65.0,30.0,95.0,45.0,Smith"
    ],
    "rect": {
      "x": 20.0,
      "y": 30.0,
      "width": 75.0,
      "height": 15.0
    },
    "is_signature": false
  }
]
```

**Key Features:**
- **Author ID Extraction:** Parses CSS classes like `handwriting-author1` for style consistency
- **Spatial Matching:** Uses bbox overlap/proximity to map handwriting regions to text
- **Long Word Splitting:** Splits words > `MAX_HANDWRITING_CHARS` (default: 7) into multiple images
- **Signature Detection:** Identifies signature fields for special handling
- **Character-Level Fallback:** Uses word bboxes if char-level mapping unavailable

**Configuration Constants:**
- `MAX_HANDWRITING_CHARS`: 7 (max characters per diffusion generation)
- `SPATIAL_MATCH_THRESHOLD`: Pixel tolerance for bbox matching

**Mapping Logic:**
1. Find all words whose bboxes overlap/intersect with handwriting rect
2. Extract text from matched bboxes
3. Compare with expected handwriting text
4. If char bboxes available: split long words at char boundaries
5. If char bboxes unavailable: split at word boundaries

**Error Handling:**
- No bbox matches → Warning, handwriting region skipped
- Text mismatch → Logged, uses best match
- Missing char bboxes → Word-level fallback (may affect quality)

---

### Stage 08: Extract Visual Element Definitions

**File:** `pipeline_08_extract_visual_element_definitions.py`

**Purpose:** Extract placeholders for visual elements (stamps, logos, barcodes, charts, photos) from geometries for generation in stage 10.

**Key Functions:**
- `main()`: Main extraction orchestrator
- `parse_visual_element_geometries()`: Extract elements with `data-visual-element` attribute
- `parse_css_dimension()`: Parse width/height from CSS
- `parse_css_rotation()`: Extract rotation angle from CSS transform

**Process:**
1. Load geometries from stage 04
2. Filter elements with `data-visual-element` attribute
3. Parse element type (stamp, logo, barcode, etc.)
4. Extract dimensions and rotation from CSS
5. Parse content (e.g., stamp text, barcode value)
6. Validate element types and dimensions
7. Save visual element definitions

**Inputs:**
- Geometries from `geometries/` (stage 04)

**Outputs:**
```
visual_element_definitions/
  {doc_id}.json              # Visual element definitions
logs/visual_element_extraction/ # Extraction logs
```

**Visual Element Definition Format:**
```json
[
  {
    "id": "ve0",
    "type": "stamp",
    "type_unmapped": "stamp",
    "content": "CONFIDENTIAL",
    "rect": {
      "x": 150.0,
      "y": 250.0,
      "width": 60.0,
      "height": 30.0
    },
    "rotation": -15.0
  },
  {
    "id": "ve1",
    "type": "barcode",
    "type_unmapped": "barcode",
    "content": "1234567890",
    "rect": {
      "x": 20.0,
      "y": 270.0,
      "width": 100.0,
      "height": 20.0
    },
    "rotation": 0.0
  }
]
```

**Supported Visual Element Types:**
- `stamp`: Text-based stamps (e.g., "APPROVED", "CONFIDENTIAL")
- `logo`: Company/brand logos (selected from prefabs)
- `barcode`: Code128 barcodes
- `chart`: Charts and graphs (selected from prefabs)
- `photo`: Photographic images (selected from prefabs)

**Type Mapping:**
Maps LLM-generated type names to standard types using `VISUAL_ELEMENT_TYPE_MAPPING` constant.

**Key Features:**
- **Content Extraction:** Parses text content for stamps, values for barcodes
- **Rotation Support:** Extracts CSS `rotate()` transform angles
- **Dimension Parsing:** Handles CSS units (px, mm, %, etc.)
- **Type Validation:** Warns about unknown types, uses fallback mapping

**Validation Checks:**
- Unknown type → Logged, mapped to closest known type
- Zero dimensions → Error, element skipped
- Invalid rotation angle → Defaults to 0°
- Missing content (for stamps/barcodes) → Warning

**CSS Parsing Examples:**
```css
/* Rotation extraction */
transform: rotate(-15deg);        → -15.0
transform: rotate(0.5turn);       → 180.0

/* Dimension extraction */
width: 60mm;                      → 60.0 (mm)
height: 30px;                     → Converted to mm
```

---

### Stage 09: Create Handwriting Images

**File:** `pipeline_09_create_handwriting_images.py`

**Purpose:** Generate realistic handwriting images using a diffusion model, with per-author style consistency.

**Key Functions:**
- `main()`: Main generation orchestrator
- `generate_handwriting_diffusion()`: Call diffusion model API/local model
- `add_handwriting_blur()`: Optional post-processing for realism

**Process:**
1. Load handwriting definitions from stage 07
2. Group by author ID for style consistency
3. For each text segment:
   - Generate handwriting image with diffusion model
   - Apply optional blur for realism
   - Save image with metadata
4. Track author-to-writer style mapping
5. Save generation logs

**Inputs:**
- Handwriting definitions from `handwriting_definitions/` (stage 07)
- Diffusion model checkpoint (local or API)

**Outputs:**
```
handwriting_images/
  {doc_id}/
    hw0_0.png               # First bbox of handwriting region hw0
    hw0_1.png               # Second bbox (if multi-word)
    hw1_0.png
logs/handwriting_generation/ # Generation logs with author mappings
```

**Handwriting Image Specifications:**
- **Format:** PNG with transparency
- **Height:** 40 pixels (configurable via `HANDWRITING_HEIGHT_PX`)
- **Padding:** 0 pixels horizontal (configurable via `HANDWRITING_PADDING_PX`)
- **Background:** Transparent
- **Color:** Black text on transparent

**Diffusion Model Integration:**
Located in `handwriting_diffusion/` module:
- `generate_handwriting_diffusion_raw.py`: Core generation logic
- `text_encoder.py`: Text encoding for model input
- `tokenizer.py`: Character tokenization

**Key Features:**
- **Author-Style Consistency:** Same author ID → consistent handwriting style
- **Writer Style Mapping:** Maps author IDs to writer style IDs (e.g., "author1" → writer_42)
- **Batch Processing:** Generates multiple images efficiently
- **Optional Blur:** Post-processing for more realistic appearance
- **Quality Control:** Validates generated images (non-empty, correct dimensions)

**Configuration Constants:**
- `HANDWRITING_HEIGHT_PX`: 40 pixels
- `HANDWRITING_PADDING_PX`: 0 pixels
- `DIFFUSION_NUM_INFERENCE_STEPS`: 50 (generation quality/speed tradeoff)
- `HANDWRITING_BLUR_ENABLED`: Optional blur toggle
- `HANDWRITING_STYLES`: List of available writer styles

**Author-to-Writer Mapping Example:**
```json
{
  "author1": "writer_42",
  "author2": "writer_17",
  "author3": "writer_89"
}
```

**Error Handling:**
- Generation failure → Retry up to 3 times
- Empty image → Logged, document flagged
- Model timeout → Skipped, logged

---

### Stage 10: Create Visual Elements

**File:** `pipeline_10_create_visual_elements.py`

**Purpose:** Generate or select visual element images (stamps, logos, barcodes, charts, photos) based on definitions from stage 08.

**Key Functions:**
- `main()`: Main generation orchestrator
- `route_visual_element_generation()`: Router for element type
- `generate_stamp()`: Create stamp image with text
- `select_logo()`: Random selection from logo prefabs
- `generate_barcode()`: Create Code128 barcode
- `select_photo()`: Random selection from photo prefabs
- `select_chart()`: Random selection from chart prefabs

**Process:**
1. Load visual element definitions from stage 08
2. For each element:
   - Route to type-specific generator
   - Generate or select image
   - Resize to target dimensions
   - Save with transparency (if applicable)
3. Cache prefab directories for performance
4. Save generation logs

**Inputs:**
- Visual element definitions from `visual_element_definitions/` (stage 08)
- Prefab directories in `data/visual_element_prefabs/`:
  - `logos/`
  - `photos/`
  - `charts/`

**Outputs:**
```
visual_element_images/
  {doc_id}/
    ve0.png                 # Stamp image
    ve1.png                 # Logo image
    ve2.png                 # Barcode image
logs/visual_element_generation/ # Generation logs
```

**Type-Specific Generation:**

**Stamps:**
- Font: Configurable (default: Arial Bold)
- Color: Configurable (default: red/blue)
- Background: Transparent
- Border: Optional rounded rectangle
- Rotation: Applied during insertion (stage 13)

**Logos:**
- Source: Random selection from `data/visual_element_prefabs/logos/`
- Format: PNG with transparency preserved
- Caching: Directory contents cached after first scan

**Barcodes:**
- Type: Code128 (supports alphanumeric)
- Library: `python-barcode`
- Background: White
- Content validation: Numeric or alphanumeric

**Photos:**
- Source: Random selection from `data/visual_element_prefabs/photos/`
- Format: JPEG or PNG
- Aspect ratio: Preserved during resize

**Charts/Figures:**
- Source: Random selection from `data/visual_element_prefabs/charts/`
- Format: PNG with transparency
- Types: Bar charts, line graphs, pie charts, etc.

**Key Features:**
- **Type-Specific Logic:** Each element type has dedicated generation function
- **Prefab Caching:** Directory scans cached for performance
- **Transparent Backgrounds:** Stamps and some logos support transparency
- **Content Validation:** Barcodes validate numeric content
- **Aspect Ratio Preservation:** Images scaled without distortion

**Configuration Constants:**
- `STAMP_FONT_SIZE`: Calculated from target dimensions
- `STAMP_BORDER_WIDTH`: 2 pixels
- `BARCODE_DPI`: 300 for high quality
- `PREFAB_CACHE_SIZE`: In-memory cache limit

**Error Handling:**
- Missing prefab directory → Error, element skipped
- Empty prefab directory → Warning, fallback placeholder
- Invalid barcode content → Logged, uses fallback text
- Image generation failure → Placeholder created

---

### Stage 11: Render PDF Second Pass

**File:** `pipeline_11_render_pdf_second_pass.py`

**Purpose:** Re-render PDF without handwriting placeholders (replaced with blank spaces) to prepare for handwriting image insertion in stage 12.

**Key Functions:**
- `main()`: Main rendering orchestrator
- `render_pdf_playwright_async()`: Async Playwright rendering
- `remove_handwriting_placeholders_from_html()`: Strip handwriting elements

**Process:**
1. Load HTML from `render_html/` (stage 04)
2. Remove all elements with `data-handwriting` attribute
3. Re-render PDF using same dimensions from stage 04
4. Validate PDF output
5. Save PDFs without handwriting placeholders

**Inputs:**
- HTML from `render_html/` (stage 04)
- Page dimensions from stage 04 logs

**Outputs:**
```
pdf_without_handwriting_placeholder/
  {doc_id}.pdf              # PDFs with handwriting regions blank
logs/rendering_second_pass/ # Rendering logs
```

**Key Differences from Stage 04:**
- **Uses Pre-calculated Dimensions:** No content measurement needed
- **Handwriting Elements Removed:** Not just hidden (visibility: hidden), but removed from DOM
- **No Geometry Extraction:** Geometries already saved in stage 04
- **Faster Rendering:** No JavaScript injection for measurement

**HTML Preprocessing:**
```html
<!-- Before (Stage 04) -->
<div data-handwriting="author1" class="handwriting">John Smith</div>

<!-- After (Stage 11) -->
<!-- Element completely removed -->
```

**Why This Stage Exists:**
Handwriting placeholders in HTML use system fonts, which don't match the diffusion-generated handwriting. Removing them creates blank spaces where handwriting images will be inserted in stage 12.

**Rendering Configuration:**
- Same timeout and retry logic as stage 04
- Reuses Playwright browser context for efficiency
- No debug output (geometries not extracted)

**Error Handling:**
- Multi-page PDF → Flagged and skipped
- Rendering timeout → Retry with increased timeout
- HTML parsing error → Logged, document marked invalid

---

### Stage 12: Insert Handwriting Images

**File:** `pipeline_12_insert_handwriting_images.py`

**Purpose:** Overlay generated handwriting images onto PDF pages using PyMuPDF, with precise positioning and natural variation.

**Key Functions:**
- `main()`: Main insertion orchestrator
- `insert_handwriting_into_pdf()`: Per-document insertion
- `scale_image_with_aspect_ratio()`: Resize images while preserving aspect
- `group_bboxes_by_line()`: Group multi-word handwriting by line

**Process:**
1. Load PDFs from stage 11
2. Load handwriting images from stage 09
3. Load handwriting definitions (bbox mappings) from stage 07
4. For each handwriting region:
   - Group bboxes by line/block
   - Scale images to fit bboxes
   - Apply random offsets for natural variation
   - Insert images at calculated positions
5. Save PDFs with handwriting

**Inputs:**
- PDFs from `pdf_without_handwriting_placeholder/` (stage 11)
- Handwriting images from `handwriting_images/` (stage 09)
- Handwriting definitions from `handwriting_definitions/` (stage 07)

**Outputs:**
```
pdf_with_handwriting/
  {doc_id}.pdf              # PDFs with handwriting inserted
logs/handwriting_insertion/ # Insertion logs
debug/handwriting_insertion/ # Debug PDFs with bbox overlays (optional)
```

**Insertion Logic:**

**1. Image Scaling:**
- High-res scaling: 3x upsampling before insertion
- Aspect ratio preserved
- Left-aligned within bbox (respects layout rect)

**2. Positioning:**
- **X coordinate:** Left edge of bbox + random offset
- **Y coordinate:** Top edge of bbox + random offset
- **Multi-word lines:** Consistent Y offset for line

**3. Random Offsets:**
```python
x_offset = random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X)
y_offset = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y)
```

**4. Block/Line Grouping:**
For multi-word handwriting (e.g., "John Smith"):
- Group bboxes by Y coordinate (same line)
- Apply consistent Y offset to entire line
- Individual X offsets per word for natural spacing

**Key Features:**
- **High-Resolution Insertion:** 3x scaling for quality
- **Natural Variation:** Random offsets simulate handwriting imperfection
- **Line Consistency:** Multi-word lines maintain baseline
- **Aspect Ratio Preservation:** Images not distorted
- **Transparency Support:** PNG alpha channel preserved

**Configuration Constants:**
- `HANDWRITING_IMAGE_UPSCALE_FACTOR`: 3x
- `MAX_HANDWRITING_RAND_X`: ±2 pixels
- `MAX_HANDWRITING_RAND_Y`: ±1 pixel
- `HANDWRITING_LINE_Y_CONSISTENCY`: Same Y offset per line

**Coordinate System:**
- PDF uses 72 DPI (points)
- Bboxes from stage 07 are in PDF coordinates
- No conversion needed

**Error Handling:**
- Missing handwriting image → Warning, bbox skipped
- Image too large for bbox → Scaled down with warning
- PyMuPDF insertion failure → Logged, document flagged

---

### Stage 13: Insert Visual Elements

**File:** `pipeline_13_insert_visual_elements.py`

**Purpose:** Overlay visual element images (stamps, logos, barcodes, etc.) onto PDF pages with precise positioning and rotation.

**Key Functions:**
- `main()`: Main insertion orchestrator
- `insert_visual_elements_into_pdf()`: Per-document insertion
- `scale_image_with_aspect_ratio()`: Resize images (same as stage 12)

**Process:**
1. Load PDFs from stage 12
2. Load visual element images from stage 10
3. Load visual element definitions from stage 08
4. For each visual element:
   - Scale image to fit bbox
   - Calculate centered position
   - Apply rotation (if specified)
   - Insert image at calculated position
5. Save final PDFs
6. If no visual elements: copy PDF from stage 12

**Inputs:**
- PDFs from `pdf_with_handwriting/` (stage 12)
- Visual element images from `visual_element_images/` (stage 10)
- Visual element definitions from `visual_element_definitions/` (stage 08)

**Outputs:**
```
pdf_final/
  {doc_id}.pdf              # Final PDFs with all elements
logs/visual_element_insertion/ # Insertion logs
debug/visual_element_insertion/ # Debug PDFs (optional)
```

**Insertion Logic:**

**1. Image Scaling:**
- High-res scaling: 3x upsampling
- Aspect ratio preserved
- Centered within bbox (not left-aligned like handwriting)

**2. Positioning:**
- **X coordinate:** Center of bbox - half image width
- **Y coordinate:** Center of bbox - half image height

**3. Rotation:**
- Applied via PyMuPDF transformation matrix
- Rotation around image center
- Angle from visual element definition

**Key Differences from Stage 12 (Handwriting):**
- **Centered placement:** Visual elements centered in bbox
- **No random offsets:** Precise placement for logos/stamps
- **Rotation support:** Stamps often rotated for "APPROVED" effect
- **Fallback:** Copies PDF if no visual elements (ensures output exists)

**Rotation Transformation:**
```python
# PyMuPDF rotation matrix
rotation_matrix = fitz.Matrix(rotation_angle)
image_rect = image_rect * rotation_matrix
```

**Key Features:**
- **High-Resolution Insertion:** 3x scaling for quality
- **Centered Alignment:** Visual elements centered in bboxes
- **Rotation Support:** Arbitrary angles for stamps
- **Transparency Preservation:** PNG alpha channel maintained
- **Fallback Handling:** Copies PDF if no visual elements

**Configuration Constants:**
- `VISUAL_ELEMENT_UPSCALE_FACTOR`: 3x
- `ROTATION_PRECISION`: Angle precision in degrees

**Coordinate System:**
- Same as stage 12 (PDF 72 DPI)
- Rotation applied after positioning

**Error Handling:**
- Missing visual element image → Warning, element skipped
- Image too large for bbox → Scaled down with warning
- PyMuPDF insertion failure → Logged, document flagged
- No visual elements → PDF copied from stage 12

---

### Stage 14: Render Image

**File:** `pipeline_14_render_image.py`

**Purpose:** Convert final PDFs to high-quality PNG images for OCR and dataset distribution.

**Key Functions:**
- `main()`: Main conversion orchestrator
- `convert_pdf_to_image()`: PDF to PNG conversion

**Process:**
1. Load PDFs from stage 13
2. Convert each PDF to PNG using custom PDF-to-image module
3. Validate image dimensions
4. Save images

**Inputs:**
- PDFs from `pdf_final/` (stage 13)

**Outputs:**
```
images/
  {doc_id}.png              # Final document images
logs/image_rendering/       # Conversion logs
```

**Image Specifications:**
- **Format:** PNG
- **DPI:** Configurable (default: 200 DPI for quality OCR)
- **Color Mode:** RGB (24-bit)
- **Compression:** PNG lossless

**PDF-to-Image Module:**
Located in custom module (not standard library):
- Handles PDF rendering at specified DPI
- Single-page conversion only (multi-page PDFs skipped)
- Uses PDF coordinate system: 72 DPI internally

**Key Features:**
- **High DPI:** 200+ DPI for accurate OCR
- **Single Page Only:** Multi-page PDFs flagged as errors
- **Lossless Compression:** PNG preserves all details
- **Size Validation:** Checks image dimensions match PDF

**Configuration Constants:**
- `IMAGE_DPI`: 200 (OCR quality vs file size tradeoff)
- `IMAGE_MAX_DIMENSION`: Optional max width/height

**Coordinate System Conversion:**
```
PDF: 210mm × 297mm @ 72 DPI → 595 × 842 points
PNG: 210mm × 297mm @ 200 DPI → 1654 × 2339 pixels
```

**Why This Stage:**
- OCR performs better on high-DPI images
- Images are final output format for datasets
- PNG preserves quality better than JPEG for text

**Error Handling:**
- Multi-page PDF → Flagged and skipped
- Conversion failure → Logged, document marked invalid
- Empty image → Error, document flagged
- Dimension mismatch → Warning logged

---

### Stage 15: Perform OCR

**File:** `pipeline_15_perform_ocr.py`

**Purpose:** Perform Optical Character Recognition on final images to obtain accurate word and line-level bounding boxes, essential for documents with handwriting or visual elements.

**Key Functions:**
- `main()`: Main OCR orchestrator
- `call_microsoft_ocr()`: Microsoft Azure OCR API call
- `convert_ocr_to_bbox_format()`: Transform OCR results to internal format
- `aggregate_words_to_segments()`: Group words into lines

**Process:**
1. Determine which documents need OCR:
   - Has handwriting → Requires OCR
   - Has visual elements → Requires OCR
   - Neither → Copy PDF bboxes from stage 05
2. For documents requiring OCR:
   - Call Microsoft OCR service
   - Parse word-level bboxes
   - Aggregate into line-level segments
   - Convert coordinates to PDF space
3. Save final bboxes

**Inputs:**
- Images from `images/` (stage 14)
- Handwriting definitions from `handwriting_definitions/` (stage 07)
- Visual element definitions from `visual_element_definitions/` (stage 08)
- PDF bboxes from `pdf_word_bboxes/` (stage 05, for non-OCR documents)

**Outputs:**
```
final_word_bboxes/
  {doc_id}.json             # Word-level bounding boxes
final_segment_bboxes/
  {doc_id}.json             # Line-level bounding boxes
ocr_results_cache/
  {doc_id}.json             # Raw OCR API responses (cached)
logs/ocr/                   # OCR logs and errors
```

**OCR Decision Logic:**
```python
requires_ocr = (
    has_handwriting(doc_id) or 
    has_visual_elements(doc_id)
)

if requires_ocr:
    perform_microsoft_ocr()
else:
    copy_pdf_bboxes()  # Reuse stage 05 results
```

**Why Handwriting/Visual Elements Require OCR:**
- Handwriting images inserted in stage 12 → Not in PDF text layer
- Visual elements inserted in stage 13 → Not in PDF text layer
- PDF text extraction (stage 05) misses these elements
- OCR captures all visible text on rendered image

**Microsoft OCR API:**
- **Service:** Azure Computer Vision (Read API)
- **Input:** PNG image (from stage 14)
- **Output:** Word polygons, text, confidence scores
- **Caching:** Results cached to avoid re-processing

**OCR Response Format:**
```json
{
  "readResults": [{
    "page": 1,
    "words": [
      {
        "boundingBox": [20, 30, 60, 30, 60, 45, 20, 45],
        "text": "Invoice",
        "confidence": 0.98
      }
    ],
    "lines": [
      {
        "boundingBox": [20, 30, 95, 30, 95, 45, 20, 45],
        "text": "Invoice Date",
        "words": [...]
      }
    ]
  }]
}
```

**Coordinate Conversion:**
OCR returns pixel coordinates; converted to PDF points:
```python
pdf_x = ocr_x * (pdf_width / image_width)
pdf_y = ocr_y * (pdf_height / image_height)
```

**Segment Aggregation:**
Groups words into lines based on Y coordinate proximity.

**Key Features:**
- **Selective OCR:** Only runs OCR when necessary (cost/time savings)
- **Result Caching:** Avoids redundant API calls
- **Dual-Level Output:** Word and line (segment) bboxes
- **Coordinate Conversion:** OCR pixels → PDF points
- **Fallback:** Copies PDF bboxes for documents without handwriting/VEs

**Configuration:**
- OCR API key from environment variables
- Timeout: 30 seconds per request
- Retry: Up to 3 attempts

**Error Handling:**
- OCR API failure → Retry, fallback to PDF bboxes if all retries fail
- Empty OCR result → Warning, uses PDF bboxes
- Coordinate conversion error → Logged, bbox skipped

---

### Stage 16: Normalize BBoxes

**File:** `pipeline_16_normalize_bboxes.py`

**Purpose:** Convert bounding box coordinates from PDF points (absolute pixels) to normalized [0, 1] coordinates for model training and evaluation.

**Key Functions:**
- `main()`: Main normalization orchestrator
- `normalize_word_and_segment_bboxes()`: Normalize word/segment bboxes
- `normalize_layout_bboxes()`: Normalize layout element bboxes (DLA only)
- `normalize_coordinates()`: Core coordinate transformation

**Process:**
1. Load final bboxes from stage 15
2. Load image dimensions (for normalization denominators)
3. For each bbox:
   - Transform: `normalized_x = pixel_x / image_width`
   - Transform: `normalized_y = pixel_y / image_height`
   - Preserve text content
4. Save normalized bboxes
5. For DLA tasks: also normalize layout element bboxes

**Inputs:**
- Word bboxes from `final_word_bboxes/` (stage 15)
- Segment bboxes from `final_segment_bboxes/` (stage 15)
- Layout element definitions from `layout_element_definitions/` (stage 06, for DLA)
- Image dimensions from stage 14 logs

**Outputs:**
```
normalized_word_bboxes/
  {doc_id}.json             # Normalized word bboxes
normalized_segment_bboxes/
  {doc_id}.json             # Normalized segment bboxes
normalized_gt/              # For DLA tasks only
  {doc_id}.json             # Normalized layout elements
logs/normalization/         # Normalization logs
```

**Normalization Formula:**
```python
normalized_bbox = {
    "x0": bbox["x0"] / image_width,
    "y0": bbox["y0"] / image_height,
    "x1": bbox["x1"] / image_width,
    "y1": bbox["y1"] / image_height,
    "text": bbox["text"]  # Preserved
}
```

**Normalized BBox Format:**
```json
[
  {
    "x0": 0.095,    # Was: 20 pixels out of 210mm @ 200dpi
    "y0": 0.128,    # Was: 30 pixels
    "x1": 0.286,    # Was: 60 pixels
    "y1": 0.192,    # Was: 45 pixels
    "text": "Invoice"
  }
]
```

**DLA-Specific Normalization:**
For Document Layout Analysis tasks, layout element bboxes are also normalized:
```json
[
  {
    "label": "title",
    "bbox": [0.095, 0.128, 0.905, 0.192]  # [x0, y0, x1, y1]
  },
  {
    "label": "text",
    "bbox": [0.095, 0.213, 0.905, 0.534]
  }
]
```

**Why Normalization:**
- **Model Training:** Most models expect [0, 1] coordinates
- **Resolution Independence:** Works across different image sizes
- **Standard Format:** Matches common dataset formats (e.g., LayoutLM)

**Coordinate System Mapping:**
```
PDF Points (72 DPI):
  595 × 842 points (A4)
  ↓ (stage 14 conversion @ 200 DPI)
Image Pixels (200 DPI):
  1654 × 2339 pixels
  ↓ (stage 16 normalization)
Normalized [0, 1]:
  0.0-1.0 × 0.0-1.0
```

**Key Features:**
- **Preserves Text:** Text content unchanged during normalization
- **Task-Specific:** DLA tasks get additional layout bbox normalization
- **Validation:** Checks for out-of-bounds coordinates (clamps to [0, 1])
- **Precision:** Full float precision maintained

**Error Handling:**
- Out-of-bounds coordinates → Clamped to [0, 1] with warning
- Missing image dimensions → Error, document skipped
- Zero image dimensions → Error, document marked invalid

---

### Stage 17: GT Preparation & Verification

**File:** `pipeline_17_gt_preparation_verification.py`

**Purpose:** Validate and prepare final ground truth annotations with fuzzy text matching, task-specific formatting, and comprehensive validation.

**Key Functions:**
- `main()`: Main verification orchestrator
- `route_task()`: Route to task-specific handler
- `handle_qa()`: QA ground truth processing
- `handle_kie()`: KIE ground truth with BIO tagging
- `handle_dla()`: DLA ground truth processing
- `fuzzy_match_text()`: Levenshtein-based text matching

**Process:**
1. Load raw GT from stage 03/06
2. Load final word bboxes from stage 15
3. Route to task-specific handler
4. Perform fuzzy text matching (GT text → OCR text)
5. Map GT annotations to bbox indices
6. Apply task-specific formatting
7. Validate and save verified GT

**Inputs:**
- Raw GT from `raw_annotations/` (stage 03/06)
- Word bboxes from `final_word_bboxes/` (stage 15)
- Layout elements from `layout_element_definitions/` (stage 06, for DLA)
- Visual elements from `visual_element_definitions/` (stage 08, for DLA)

**Outputs:**
```
verified_gt/
  qa/
    {doc_id}.json           # QA format
  kie/
    {doc_id}.json           # KIE format with BIO tagging
  dla/
    {doc_id}.json           # DLA format with normalized bboxes
  cls/
    {doc_id}.json           # Classification format
logs/gt_verification/       # Verification logs with match statistics
```

---

#### **QA (Question Answering) Format**

**Process:**
1. Load QA pairs: `{"question": "answer", ...}`
2. For each answer:
   - Find words in bboxes matching answer text (fuzzy)
   - Record bbox indices of matching words
3. Save verified GT with bbox mappings

**Output Format:**
```json
{
  "questions": [
    {
      "question": "What is the invoice number?",
      "answer": "INV-12345",
      "answer_bbox_indices": [15, 16]  # Word indices in final_word_bboxes
    },
    {
      "question": "What is the total amount?",
      "answer": "$1,234.56",
      "answer_bbox_indices": [42]
    }
  ]
}
```

**Fuzzy Matching:**
Uses Levenshtein distance with 0.85 similarity cutoff:
```python
similarity = fuzz.ratio(gt_answer, ocr_text) / 100.0
if similarity >= 0.85:
    match_found = True
```

---

#### **KIE (Key Information Extraction) Format**

**Process:**
1. Load entity annotations: `{entity_name: text_value, ...}`
2. For each entity:
   - Find words matching entity value (fuzzy)
   - Generate BIO tags for all words
3. Save verified GT with BIO tagging

**Output Format:**
```json
{
  "entities": [
    {
      "entity": "company_name",
      "value": "Acme Corporation",
      "bbox_indices": [5, 6]
    },
    {
      "entity": "invoice_date",
      "value": "January 15, 2024",
      "bbox_indices": [20, 21, 22]
    }
  ],
  "word_labels": [
    "O", "O", "O", "O", "O",           # Words 0-4: Outside entities
    "B-company_name", "I-company_name", # Words 5-6: Company name
    "O", "O", ...,                      # Words 7-19: Outside
    "B-invoice_date", "I-invoice_date", "I-invoice_date"  # Words 20-22
  ]
}
```

**BIO Tagging:**
- `B-entity`: Beginning of entity
- `I-entity`: Inside entity (continuation)
- `O`: Outside any entity

---

#### **DLA (Document Layout Analysis) Format**

**Process:**
1. Load layout element definitions from stage 06
2. Load visual element definitions from stage 08
3. Validate labels (must be in `valid_labels`)
4. Check spatial constraints (no containment, minimal overlap)
5. Merge visual elements into layout annotations
6. Normalize bboxes to [0, 1]
7. Save verified GT

**Output Format:**
```json
{
  "layout_elements": [
    {
      "id": "layout_0",
      "label": "title",
      "bbox": [0.095, 0.128, 0.905, 0.192]  # Normalized [x0, y0, x1, y1]
    },
    {
      "id": "layout_1",
      "label": "text",
      "bbox": [0.095, 0.213, 0.905, 0.534]
    },
    {
      "id": "ve0",
      "label": "figure",  # Visual element mapped to layout label
      "bbox": [0.714, 0.895, 0.905, 0.980]
    }
  ]
}
```

**Visual Element Merging:**
Visual elements from stage 08 are converted to layout labels:
- `stamp` → `figure` (or custom mapping)
- `logo` → `figure`
- `chart` → `figure`
- `barcode` → `figure`
- `photo` → `figure`

**Spatial Validation:**
- **No containment:** One bbox fully inside another → Error
- **Minimal overlap:** Overlap area < 5% of smaller bbox → Warning
- **Valid labels:** All labels must be in `SynDatasetDefinition.valid_labels`

---

#### **CLS (Classification) Format**

**Process:**
1. Load classification label from raw GT
2. Validate label against expected classes
3. Save verified GT

**Output Format:**
```json
{
  "document_class": "invoice",
  "confidence": 1.0
}
```

---

**Fuzzy Matching Details:**

Uses Levenshtein distance (via `fuzz` library) to handle OCR discrepancies:
```python
# Example: GT "INV-12345" vs OCR "INV-I2345" (OCR error)
similarity = fuzz.ratio("INV-12345", "INV-I2345") / 100.0
# similarity = 0.89 (above 0.85 threshold) → Match!
```

**Match Statistics Logged:**
- Total GT annotations
- Successfully matched annotations
- Failed matches (similarity < 0.85)
- Average similarity score

**Key Features:**
- **Fuzzy Matching:** Handles OCR errors gracefully
- **Task-Specific Formatting:** QA, KIE, DLA, CLS all handled differently
- **BIO Tagging:** Automatic generation for KIE
- **Visual Element Integration:** DLA merges visual elements as layout annotations
- **Spatial Validation:** Detects overlapping/contained layout elements
- **Similarity Tracking:** Logs match quality for analysis

**Error Handling:**
- Match failure (similarity < 0.85) → Logged, annotation skipped
- Invalid labels (DLA) → Error, document marked invalid
- Spatial violations (DLA) → Warning, elements flagged
- Missing bboxes → Error, document marked invalid

---

### Stage 18: Analyze

**File:** `pipeline_18_analyze.py`

**Purpose:** Generate comprehensive statistics, cost analysis, and error categorization for the entire dataset generation process.

**Key Functions:**
- `main()`: Main analysis orchestrator
- `calculate_api_costs()`: Compute LLM API costs from token usage

**Process:**
1. Load all document logs from stages 01-17
2. Categorize documents: valid vs. invalid
3. Calculate error distributions
4. Compute API usage and costs
5. Generate statistics (handwriting, visual elements, annotations)
6. Save comprehensive dataset log

**Inputs:**
- All document logs from previous stages
- Batch results from stage 02 (for cost calculation)
- Message processing logs from stage 03
- Prompt usage statistics

**Outputs:**
```
dataset_log.json            # Comprehensive dataset statistics
logs/analysis/              # Analysis logs
```

**Dataset Log Structure:**
```json
{
  "metadata": {
    "syndatadef_name": "docvqa_alpha=1.0",
    "task": "qa",
    "total_documents_requested": 1000,
    "generation_date": "2026-02-07"
  },
  
  "prompting": {
    "total_prompts": 100,
    "total_batches": 10,
    "llm_model": "claude-sonnet-4-20250514"
  },
  
  "total_cost_summary": {
    "total_cost_usd": 123.45,
    "input_tokens": 1500000,
    "output_tokens": 800000,
    "cached_tokens": 500000,
    "cost_per_document": 0.12
  },
  
  "valid_samples_stats": {
    "total_valid": 847,
    "total_invalid": 153,
    "validity_rate": 0.847,
    "avg_handwriting_regions_per_doc": 2.3,
    "avg_visual_elements_per_doc": 1.5,
    "avg_annotations_per_doc": 8.7,
    "documents_with_handwriting": 654,
    "documents_with_visual_elements": 512
  },
  
  "valid_samples": [
    {
      "doc_id": "doc_0001",
      "seed_image": "docvqa_train_12345",
      "has_handwriting": true,
      "has_visual_elements": true,
      "num_annotations": 10,
      "num_words": 247,
      "image_size": [1654, 2339]
    }
  ],
  
  "valid_samples_by_category": {
    "invoice": 234,
    "receipt": 198,
    "form": 415
  },
  
  "errors": {
    "multipage_pdf": 12,
    "missing_ocr_result": 5,
    "failed_gt_verification": 38,
    "rendering_timeout": 8,
    "llm_parsing_error": 23,
    "bbox_extraction_failed": 4,
    "handwriting_generation_failed": 7,
    "visual_element_generation_failed": 3,
    "other": 53
  }
}
```

**Error Categories:**

| Error Category | Description |
|---------------|-------------|
| `multipage_pdf` | PDF rendered with multiple pages (invalid) |
| `missing_ocr_result` | OCR failed or returned empty result |
| `failed_gt_verification` | GT matching/validation failed in stage 17 |
| `rendering_timeout` | PDF rendering exceeded timeout |
| `llm_parsing_error` | Failed to extract HTML/GT from LLM response |
| `bbox_extraction_failed` | PyMuPDF failed to extract bboxes |
| `handwriting_generation_failed` | Diffusion model failed |
| `visual_element_generation_failed` | VE generation/selection failed |
| `other` | Miscellaneous errors |

**Cost Calculation:**

**Claude API Pricing (example):**
```python
costs = {
    "input": input_tokens * 0.003 / 1000,      # $3 per 1M tokens
    "output": output_tokens * 0.015 / 1000,    # $15 per 1M tokens
    "cached": cached_tokens * 0.0003 / 1000    # $0.30 per 1M tokens (prompt caching)
}
total_cost = costs["input"] + costs["output"] + costs["cached"]
```

**Statistics Computed:**
- **Validity Rate:** Percentage of documents passing all stages
- **Handwriting Stats:** Documents with handwriting, avg regions per doc
- **Visual Element Stats:** Documents with VEs, avg elements per doc
- **Annotation Stats:** Avg QA pairs/KIE entities/DLA elements per doc
- **Token Usage:** Input/output/cached token totals
- **Cost Metrics:** Total cost, cost per document, cost per valid document

**Key Features:**
- **Comprehensive Error Tracking:** All error categories logged
- **Cost Transparency:** Token-level cost breakdown
- **Quality Metrics:** Validity rate, avg annotations, etc.
- **Category Breakdown:** Valid samples grouped by document type
- **Per-Document Tracking:** Each valid document's metadata saved

**Usage:**
This log is essential for:
- Understanding dataset quality
- Optimizing pipeline (identify bottlenecks)
- Cost estimation for future runs
- Debugging (error distribution analysis)

---

### Stage 19: Create Debug Data

**File:** `pipeline_19_create_debug_data.py`

**Purpose:** Generate comprehensive debug visualizations for manual inspection and quality assurance.

**Key Functions:**
- `main()`: Main debug generator
- `visualize_visual_element_bboxes()`: Overlay VE bboxes on PDFs
- `visualize_final_bboxes_on_images()`: Overlay OCR bboxes on images
- `visualize_pdf_bboxes()`: Overlay PDF bboxes

**Process:**
1. Load all generated documents
2. Create debug subdirectories
3. For each document:
   - Generate PDF bbox overlays
   - Generate VE bbox overlays
   - Generate handwriting insertion region overlays
   - Generate final OCR bbox overlays on images
4. Copy raw HTML with debug.js script for browser inspection

**Inputs:**
- All intermediate and final outputs from stages 01-18

**Outputs:**
```
debug/
  pdf_bboxes/               # PDF word bboxes overlaid (stage 05)
    {doc_id}.pdf
  visual_element_bboxes/    # VE bboxes overlaid (stage 08)
    {doc_id}.pdf
  handwriting_insertion/    # Handwriting regions overlaid (stage 12)
    {doc_id}.pdf
  final_bboxes_on_images/   # OCR bboxes on final images (stage 15)
    {doc_id}.png
  html_with_debug/          # Raw HTML + debug.js
    {doc_id}.html
    debug.js                # Browser-based inspection script
```

**Debug Visualizations:**

**1. PDF BBoxes (Stage 05):**
- Red rectangles: Word bounding boxes from PyMuPDF
- Annotated with word text
- Purpose: Verify PDF text extraction quality

**2. Visual Element BBoxes (Stage 08):**
- Blue rectangles: Visual element placeholder regions
- Annotated with element type (stamp, logo, etc.)
- Purpose: Verify VE extraction and positioning

**3. Handwriting Insertion Regions (Stage 12):**
- Green rectangles: Handwriting bbox regions
- Annotated with handwriting text
- Purpose: Verify handwriting placement accuracy

**4. Final BBoxes on Images (Stage 15):**
- Orange rectangles: OCR word bounding boxes
- Overlaid on final rendered images
- Purpose: Verify OCR accuracy and coverage

**Debug JavaScript (debug.js):**
```javascript
// Browser-based inspection tool
// Features:
// - Highlight elements on hover
// - Show geometry data in console
// - Toggle element visibility
// - Measure element dimensions
```

**Key Features:**
- **Color-Coded Overlays:** Different colors for different stages
- **Text Annotations:** Bboxes labeled with content
- **Multi-Format:** Both PDF and PNG visualizations
- **Browser Inspection:** HTML with interactive debug script
- **Selective Generation:** Only enabled with `DEBUG_MODE=true` flag

**Configuration:**
- `DEBUG_MODE`: Enable/disable debug output (default: false)
- `DEBUG_BBOX_LINE_WIDTH`: Line thickness for overlays (default: 2)
- `DEBUG_BBOX_OPACITY`: Overlay transparency (default: 0.5)

**When to Use:**
- Visual quality assurance
- Debugging bbox extraction issues
- Verifying handwriting/VE insertion
- Identifying OCR problems
- Manual inspection of edge cases

**Performance Note:**
Debug generation adds ~20% processing time; disabled by default in production.

---

## API Implementation

The DocGenie API provides a FastAPI-based REST service for synchronous document generation, integrating pipeline stages 01-06.

### Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                         API ARCHITECTURE                         │
└─────────────────────────────────────────────────────────────────┘

Client Request
     │
     ↓
┌─────────────────────┐
│   FastAPI Server    │
│   (main.py)         │
└─────────────────────┘
     │
     ├──► Validate Request (schemas.py)
     │
     ├──► Download Seed Images (utils.py)
     │
     ├──► Build Prompt (utils.py)
     │
     ├──► Call Claude API (Synchronous)
     │    └──► Claude Sonnet 4.5
     │
     ├──► Extract HTML & GT (pipeline_03 functions)
     │
     ├──► Render PDF (pipeline_04 functions)
     │    └──► Playwright/Chromium
     │
     ├──► Extract BBoxes (pipeline_05 functions)
     │    └──► PyMuPDF
     │
     └──► Return Response
          └──► JSON with base64-encoded PDFs
```

---

### Endpoints

#### **GET /**
Health check endpoint.

**Response:**
```json
{
  "message": "DocGenie API is running",
  "version": "1.0",
  "status": "healthy"
}
```

---

#### **GET /health**
Detailed health status.

**Response:**
```json
{
  "status": "healthy",
  "api_version": "1.0",
  "playwright_available": true,
  "llm_model": "claude-sonnet-4-5-20250929"
}
```

---

#### **POST /generate**
Generate documents with ground truth annotations.

**Request Body** (`schemas.GenerateDocumentRequest`):
```json
{
  "seed_images": [
    "https://example.com/seed1.jpg",
    "https://example.com/seed2.jpg"
  ],
  "prompt_params": {
    "language": "English",
    "doc_type": "business and administrative",
    "gt_type": "Multiple questions and their answers",
    "gt_format": "{\"question\": \"answer\", ...}",
    "num_solutions": 2
  }
}
```

**Request Validation:**
- `seed_images`: 1-10 URLs (HTTPS only)
- `num_solutions`: 1-5 documents
- All prompt parameters required

**Response** (`schemas.GenerateDocumentResponse`):
```json
{
  "success": true,
  "message": "Successfully generated 2 documents",
  "documents": [
    {
      "document_id": "doc_20260207_001",
      "html": "<html>...</html>",
      "css": "body { ... }",
      "ground_truth": {
        "What is the invoice number?": "INV-12345"
      },
      "pdf_base64": "JVBERi0xLjQK...",
      "bboxes": [
        {
          "x0": 20.0,
          "y0": 30.0,
          "x1": 60.0,
          "y1": 45.0,
          "text": "Invoice"
        }
      ],
      "page_width_mm": 210.0,
      "page_height_mm": 297.0
    }
  ],
  "total_documents": 2
}
```

---

#### **POST /generate-files**
Generate documents and return as downloadable files.

**Request:** Same as `/generate`

**Response:** 
- **Content-Type:** `application/zip`
- **File:** ZIP archive containing:
  - `doc_001.pdf`
  - `doc_001_gt.json`
  - `doc_001_bboxes.json`
  - `doc_002.pdf`
  - ...

**File Structure:**
```
generated_documents.zip
├── doc_001.pdf
├── doc_001_gt.json
├── doc_001_bboxes.json
├── doc_001_metadata.json
├── doc_002.pdf
├── ...
```

---

### Request/Response Schemas

Defined in `api/schemas.py`:

#### **PromptParameters**
```python
class PromptParameters(BaseModel):
    language: str = "English"
    doc_type: str = "business and administrative"
    gt_type: str = "Multiple questions and their answers"
    gt_format: str = '{"question": "answer", ...}'
    num_solutions: int = Field(default=1, ge=1, le=5)
```

#### **GenerateDocumentRequest**
```python
class GenerateDocumentRequest(BaseModel):
    seed_images: List[HttpUrl] = Field(..., min_items=1, max_items=10)
    prompt_params: PromptParameters
```

#### **BoundingBox**
```python
class BoundingBox(BaseModel):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str
```

#### **DocumentResult**
```python
class DocumentResult(BaseModel):
    document_id: str
    html: str
    css: str
    ground_truth: Optional[dict]
    pdf_base64: str
    bboxes: List[BoundingBox]
    page_width_mm: float
    page_height_mm: float
```

#### **GenerateDocumentResponse**
```python
class GenerateDocumentResponse(BaseModel):
    success: bool
    message: str
    documents: List[DocumentResult]
    total_documents: int
```

---

### API Pipeline Flow

Detailed integration with pipeline stages:

```python
# Simplified API flow (from api/main.py)

@app.post("/generate", response_model=GenerateDocumentResponse)
async def generate_documents(request: GenerateDocumentRequest):
    # 1. Download and encode seed images
    seed_images_base64 = await download_and_encode_images(request.seed_images)
    
    # 2. Build prompt from template
    prompt = build_prompt_from_template(
        seed_images=seed_images_base64,
        params=request.prompt_params
    )
    
    # 3. Call Claude API (synchronous, not batched)
    llm_response = await call_claude_api(
        prompt=prompt,
        model="claude-sonnet-4-5-20250929"
    )
    
    # 4. Extract HTML and GT (from pipeline_03)
    documents_html = extract_html_from_message(llm_response)
    documents_gt = extract_gt_from_html(documents_html)
    
    # 5. Validate HTML (from utils.py)
    validate_html_structure(documents_html)
    
    # 6. Render PDFs (from pipeline_04)
    pdfs = await render_pdfs_with_playwright(documents_html)
    
    # 7. Validate PDFs (from utils.py)
    validate_pdf_pages(pdfs)
    
    # 8. Extract bboxes (from pipeline_05)
    bboxes = extract_bboxes_from_pdfs(pdfs)
    
    # 9. Validate bboxes (from utils.py)
    validate_bbox_completeness(bboxes)
    
    # 10. Encode PDFs to base64
    pdfs_base64 = encode_pdfs_to_base64(pdfs)
    
    # 11. Build response
    return GenerateDocumentResponse(
        success=True,
        documents=[...],
        total_documents=len(documents_html)
    )
```

---

### Integration with Pipeline Functions

**Reused from Pipeline:**
- `extract_html_from_message()` from `pipeline_03`
- `extract_gt_from_html()` from `pipeline_03`
- `render_pdf_with_playwright()` from `pipeline_04`
- `extract_bboxes_from_pdf()` from `pipeline_05`
- Various utilities from `docgenie.generation.utils`

**API-Specific Functions (api/utils.py):**
- `download_seed_images()`: Fetch images from URLs
- `encode_images_to_base64()`: Convert images for API transmission
- `build_prompt_from_template()`: Template-based prompt construction
- `call_claude_api_sync()`: Synchronous Claude API call (non-batched)
- `encode_pdf_to_base64()`: PDF encoding for response
- `validate_html_structure()`: HTML validation
- `validate_pdf_pages()`: PDF page count/size validation
- `validate_bbox_completeness()`: Ensure bboxes extracted

---

### Configuration

#### **Environment Variables (.env)**
```bash
ANTHROPIC_API_KEY=sk-ant-...            # Required for Claude API
LLM_MODEL=claude-sonnet-4-5-20250929   # Default model
API_PORT=8000                           # Server port
DEBUG_MODE=false                        # Enable debug logging
```

#### **Prompt Templates**
Located in `data/prompt_templates/`:

**Template Structure:**
```
data/prompt_templates/<template_name>/
├── system_prompt.txt                   # System message
├── user_prompt_template.txt            # User message template
└── example_output.html                 # Example for few-shot
```

**Placeholder Substitution:**
```python
# In user_prompt_template.txt
"""
Please generate {num_solutions} {doc_type} documents in {language}.

Ground truth format: {gt_format}
Ground truth type: {gt_type}
"""

# Substituted with request.prompt_params
```

#### **CORS Configuration**
```python
# In api/main.py
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],        # Open for development
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```

---

### Authentication & Security

**Current State:**
- **No built-in authentication:** API is open (for development)
- **API Key Management:** Claude API key in `.env`, not passed in requests
- **CORS:** Open (`allow_origins=["*"]`) for development

**Production Recommendations:**
- Add API key authentication (e.g., Bearer tokens)
- Restrict CORS origins to known frontends
- Rate limiting (e.g., Redis-based)
- Input sanitization for HTML injection prevention
- HTTPS only (terminate SSL at reverse proxy)

---

### Performance Considerations

**Async Rendering:**
- Playwright rendering uses async/await
- Concurrent requests supported by FastAPI
- Semaphore control prevents resource exhaustion

**Image Size Limits:**
- Seed images compressed before API transmission
- Max dimension: 1024px (configurable)
- JPEG quality: 85% (configurable)

**Timeouts:**
- PDF render timeout: 60 seconds per document
- Claude API timeout: 120 seconds
- Total request timeout: 300 seconds (5 minutes)

**Retry Logic:**
- PDF rendering: Up to 3 attempts
- Claude API: Up to 2 retries on network errors
- No retry on validation failures

**Concurrency:**
- FastAPI default: Multiple workers (configurable)
- Playwright: Semaphore-controlled (10 concurrent renders)

---

### Limitations

**Current Limitations:**
1. **Single-Page Only:** Multi-page PDFs flagged as errors
2. **Seed Image Limit:** Maximum 10 seed images per request
3. **Document Limit:** Maximum 5 document variations (`num_solutions`)
4. **No Handwriting/Visual Elements:** Stages 07-13 not integrated (API stops at stage 06)
5. **Synchronous LLM:** No batching (higher cost per document)
6. **No Dataset Export:** No `/export-dataset` endpoint for full pipeline runs

**Known Issues:**
- Large documents (>10 pages worth of content) may timeout
- Complex CSS (animations, 3D transforms) may not render correctly
- Some Unicode characters may not display in PDFs

---

### Future Integration Plan

From `api/PIPELINE_INTEGRATION.md`:

#### **Stage 3: Handwriting & Visual Elements (Stages 07-11)**

**New Request Parameters:**
```python
class PromptParameters(BaseModel):
    # ... existing ...
    enable_handwriting: bool = False
    handwriting_ratio: float = Field(default=0.2, ge=0.0, le=1.0)
    enable_visual_elements: bool = False
    visual_element_types: List[str] = ["stamp", "logo"]
```

**New Response Fields:**
```python
class DocumentResult(BaseModel):
    # ... existing ...
    handwriting_regions: Optional[List[HandwritingRegion]]
    visual_elements: Optional[List[VisualElement]]
```

**Impact:**
- Longer processing time (diffusion model: ~5s per handwriting region)
- Larger response size (additional images)

---

#### **Stage 4: Image Finalization & OCR (Stages 12-15)**

**New Response Fields:**
```python
class DocumentResult(BaseModel):
    # ... existing ...
    image_base64: str                    # Final rendered image (PNG)
    ocr_text: str                        # Full OCR text
    ocr_confidence: float                # Average OCR confidence
```

**Impact:**
- OCR API costs (~$1.50 per 1000 images)
- Additional 2-3 seconds per document (OCR latency)

---

#### **Stage 5: Dataset Packaging (Stages 16-19)**

**New Endpoint:**
```python
@app.post("/export-dataset")
async def export_dataset(request: ExportDatasetRequest):
    """
    Run full pipeline (stages 01-19) and return packaged dataset.
    """
    # Run pipeline with syndatadef
    # Return ZIP with images, GTs, statistics
```

**Request:**
```python
class ExportDatasetRequest(BaseModel):
    syndatadef_config: dict              # Full SynDatasetDefinition
    output_format: str = "huggingface"   # "huggingface", "coco", "custom"
```

**Response:**
- ZIP archive with full dataset
- Includes `dataset_log.json` from stage 18

---

### Example Usage

#### **Python Client (api/example_usage.py)**

```python
import requests
import base64
from pathlib import Path

# API endpoint
API_URL = "http://localhost:8000/generate"

# Prepare request
request_data = {
    "seed_images": [
        "https://example.com/invoice_seed.jpg",
        "https://example.com/receipt_seed.jpg"
    ],
    "prompt_params": {
        "language": "English",
        "doc_type": "invoices and receipts",
        "gt_type": "Multiple questions and their answers",
        "gt_format": '{"question": "answer", ...}',
        "num_solutions": 3
    }
}

# Call API
response = requests.post(API_URL, json=request_data)
result = response.json()

# Process results
if result["success"]:
    for doc in result["documents"]:
        doc_id = doc["document_id"]
        
        # Save PDF
        pdf_data = base64.b64decode(doc["pdf_base64"])
        Path(f"{doc_id}.pdf").write_bytes(pdf_data)
        
        # Save GT
        Path(f"{doc_id}_gt.json").write_text(
            json.dumps(doc["ground_truth"], indent=2)
        )
        
        # Save HTML
        Path(f"{doc_id}.html").write_text(doc["html"])
        
        print(f"Saved: {doc_id}")
        print(f"  - BBoxes: {len(doc['bboxes'])}")
        print(f"  - GT Annotations: {len(doc['ground_truth'])}")
```

#### **cURL Example**

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "seed_images": [
      "https://example.com/seed.jpg"
    ],
    "prompt_params": {
      "language": "English",
      "doc_type": "business documents",
      "gt_type": "Questions and answers",
      "gt_format": "{\"question\": \"answer\"}",
      "num_solutions": 1
    }
  }'
```

#### **JavaScript/TypeScript Client**

```typescript
const response = await fetch('http://localhost:8000/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    seed_images: ['https://example.com/seed.jpg'],
    prompt_params: {
      language: 'English',
      doc_type: 'invoices',
      gt_type: 'Questions and answers',
      gt_format: '{"question": "answer"}',
      num_solutions: 2
    }
  })
});

const result = await response.json();

// Decode PDF
const pdfBlob = new Blob(
  [Uint8Array.from(atob(result.documents[0].pdf_base64), c => c.charCodeAt(0))],
  { type: 'application/pdf' }
);

// Download PDF
const url = URL.createObjectURL(pdfBlob);
const a = document.createElement('a');
a.href = url;
a.download = `${result.documents[0].document_id}.pdf`;
a.click();
```

---

### Testing

**Test File:** `api/test_api.py`

```python
import pytest
from fastapi.testclient import TestClient
from api.main import app

client = TestClient(app)

def test_health_check():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

def test_generate_documents():
    request_data = {
        "seed_images": ["https://example.com/seed.jpg"],
        "prompt_params": {
            "language": "English",
            "doc_type": "invoices",
            "gt_type": "Questions",
            "gt_format": "{\"q\": \"a\"}",
            "num_solutions": 1
        }
    }
    response = client.post("/generate", json=request_data)
    assert response.status_code == 200
    result = response.json()
    assert result["success"] is True
    assert len(result["documents"]) > 0

def test_invalid_request():
    request_data = {
        "seed_images": [],  # Invalid: empty
        "prompt_params": {"num_solutions": 10}  # Invalid: > 5
    }
    response = client.post("/generate", json=request_data)
    assert response.status_code == 422  # Validation error
```

**Run Tests:**
```bash
cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie
pytest api/test_api.py -v
```

---

### Deployment

**Start Server:**
```bash
cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie/api
chmod +x start.sh
./start.sh
```

**start.sh Contents:**
```bash
#!/bin/bash
export $(cat .env | xargs)
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

**Production Deployment:**
```bash
# With Gunicorn (multi-worker)
gunicorn main:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 \
  --timeout 300
```

**Docker Deployment:**
```dockerfile
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers
RUN playwright install chromium
RUN playwright install-deps

COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

---

## Core Models & Utilities

### SynDatasetDefinition Model

**File:** `docgenie/generation/models/_syndatadef.py`

**Purpose:** Configuration object for synthetic dataset generation, loaded from YAML files.

**Key Attributes:**
```python
@dataclass
class SynDatasetDefinition:
    # Dataset metadata
    name: str                           # "docvqa_alpha=1.0"
    task: TaskType                      # TaskType.QA, .KIE, .DLA, .CLS
    base_dataset: str                   # "docvqa", "cord", etc.
    
    # LLM configuration
    llm_model: str                      # "claude-sonnet-4-20250514"
    prompt_template: str                # "DocGenie"
    num_solutions: int                  # Documents per prompt
    
    # Prompting parameters
    language: str                       # "English", "German", etc.
    doc_type: str                       # "business and administrative"
    gt_type: str                        # Task-specific GT description
    gt_format: str                      # Expected GT format
    
    # Dataset parameters
    num_samples: int                    # Total documents to generate
    alpha: float                        # Clustering diversity parameter
    
    # Seed selection
    seeds_per_cluster: int              # Seeds sampled per cluster
    clustering_method: str              # "kmeans", "agglomerative"
    
    # Task-specific
    valid_labels: Optional[List[str]]   # For DLA: ["title", "text", ...]
    
    # Output paths
    output_dir: Path                    # Root output directory
```

**Key Methods:**
```python
def get_file_structure(self) -> FileStructure:
    """Returns FileStructure manager for output directories."""

def build_prompt(self, seed_images: List[str]) -> str:
    """Builds prompt from template with parameter substitution."""

def iter_document_logs(self) -> Iterator[DocumentLog]:
    """Iterates over all document logs."""

def update_document_status(self, doc_id: str, status: Status):
    """Updates document status in log."""

@classmethod
def from_yaml(cls, yaml_path: Path) -> "SynDatasetDefinition":
    """Load configuration from YAML file."""
```

**Example YAML:**
```yaml
name: "docvqa_alpha=1.0"
task: "qa"
base_dataset: "docvqa"

llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1

language: "English"
doc_type: "business and administrative documents"
gt_type: "Multiple questions and their answers"
gt_format: '{"question": "answer", ...}'

num_samples: 1000
alpha: 1.0
seeds_per_cluster: 10
clustering_method: "kmeans"

output_dir: "data/datasets/synthesized_datasets/docvqa_alpha=1.0"
```

---

### PipelineParameters Model

**File:** `docgenie/generation/models/_pipeline.py`

**Purpose:** Runtime parameters for pipeline execution.

**Attributes:**
```python
@dataclass
class PipelineParameters:
    # Execution control
    start_stage: int = 1                # First stage to execute
    end_stage: int = 19                 # Last stage to execute
    skip_existing: bool = True          # Skip documents with existing outputs
    
    # Parallelization
    max_workers: int = 10               # Concurrent processing
    chromium_concurrency: int = 10      # Parallel PDF renders
    
    # Debug mode
    debug_mode: bool = False            # Enable debug visualizations
    
    # Retry configuration
    max_retries: int = 3                # Retry attempts
    retry_delay: float = 2.0            # Seconds between retries
    
    # Timeouts
    pdf_render_timeout: int = 60        # Seconds
    ocr_timeout: int = 30               # Seconds
```

---

### FileStructure Model

**File:** `docgenie/generation/models/_file.py`

**Purpose:** Manages directory structure for generated data.

**Key Properties:**
```python
@dataclass
class FileStructure:
    root: Path                          # Root output directory
    
    # Directory properties (all return Path)
    seeds_directory: Path
    prompt_batches_directory: Path
    message_results_directory: Path
    raw_html_directory: Path
    raw_annotations_directory: Path
    geometries_directory: Path
    pdf_initial_directory: Path
    render_html_directory: Path
    pdf_word_bboxes_directory: Path
    pdf_char_bboxes_directory: Path
    layout_element_definitions_directory: Path
    handwriting_definitions_directory: Path
    visual_element_definitions_directory: Path
    handwriting_images_directory: Path
    visual_element_images_directory: Path
    pdf_without_handwriting_placeholder_directory: Path
    pdf_with_handwriting_directory: Path
    pdf_final_directory: Path
    images_directory: Path
    final_word_bboxes_directory: Path
    final_segment_bboxes_directory: Path
    normalized_word_bboxes_directory: Path
    normalized_segment_bboxes_directory: Path
    verified_gt_directory: Path
    
    # Debug subdirectories
    debug_directory: Path
    debug_pdf_bboxes_directory: Path
    debug_visual_element_bboxes_directory: Path
    debug_handwriting_insertion_directory: Path
    debug_final_bboxes_on_images_directory: Path
    debug_html_with_debug_directory: Path
```

**Key Methods:**
```python
def create_all_directories(self):
    """Create all output directories."""

def get_document_path(self, doc_id: str, stage: str) -> Path:
    """Get path for specific document and stage."""
```

---

### DocumentLog Model

**File:** `docgenie/generation/models/_log.py`

**Purpose:** Document-level metadata and status tracking.

**Attributes:**
```python
@dataclass
class DocumentLog:
    doc_id: str                         # Unique document ID
    seed_image_id: str                  # Source seed image
    prompt_call_id: str                 # Prompt batch/call ID
    
    status: Status                      # VALID, INVALID, PROCESSING
    
    # Stage completion flags
    has_raw_html: bool = False
    has_raw_gt: bool = False
    has_pdf_initial: bool = False
    has_geometries: bool = False
    has_bboxes: bool = False
    has_handwriting: bool = False
    has_visual_elements: bool = False
    has_final_image: bool = False
    has_ocr_result: bool = False
    has_verified_gt: bool = False
    
    # Statistics
    num_words: int = 0
    num_annotations: int = 0
    num_handwriting_regions: int = 0
    num_visual_elements: int = 0
    
    # Error tracking
    error_stage: Optional[str] = None
    error_message: Optional[str] = None
    error_category: Optional[str] = None
    
    # Timestamps
    created_at: datetime
    updated_at: datetime
```

**Key Methods:**
```python
def mark_stage_complete(self, stage: str):
    """Mark pipeline stage as complete."""

def mark_error(self, stage: str, error_msg: str, category: str):
    """Record error and mark document as invalid."""

def is_valid(self) -> bool:
    """Check if document passed all stages."""
```

---

### BBox Model

**File:** `docgenie/generation/models/_bbox.py`

**Purpose:** Bounding box representation with text content.

**Attributes:**
```python
@dataclass
class BBox:
    rect: Rect                          # {x0, y0, x1, y1}
    text: str                           # Text content
    metadata: Optional[dict] = None     # Additional data
```

**Key Methods:**
```python
@property
def width(self) -> float:
    return self.rect["x1"] - self.rect["x0"]

@property
def height(self) -> float:
    return self.rect["y1"] - self.rect["y0"]

def normalize(self, image_width: float, image_height: float) -> "BBox":
    """Convert to normalized [0, 1] coordinates."""

def unnormalize(self, image_width: float, image_height: float) -> "BBox":
    """Convert from normalized to pixel coordinates."""

def to_dict(self) -> dict:
    """Serialize to dictionary."""

@classmethod
def from_dict(cls, data: dict) -> "BBox":
    """Deserialize from dictionary."""
```

---

### LayoutBBox Model

**File:** `docgenie/generation/models/_bbox.py`

**Purpose:** Layout element bounding box (for DLA tasks).

**Attributes:**
```python
@dataclass
class LayoutBBox:
    label: str                          # "title", "text", "table", etc.
    rect: Rect                          # {x0, y0, x1, y1}
    metadata: Optional[dict] = None
```

**Key Methods:**
```python
def normalize(self, image_width: float, image_height: float) -> "LayoutBBox":
    """Normalize coordinates."""

def contains(self, other: "LayoutBBox") -> bool:
    """Check if this bbox fully contains another."""

def overlaps(self, other: "LayoutBBox") -> bool:
    """Check if this bbox overlaps with another."""

def overlap_area(self, other: "LayoutBBox") -> float:
    """Calculate overlap area with another bbox."""
```

---

### Utility Modules

#### **BBox Utilities (utils/bboxes.py)**

```python
def load_bboxes_from_file(file_path: Path) -> List[BBox]:
    """Load bboxes from JSON file."""

def save_bboxes_to_file(bboxes: List[BBox], file_path: Path):
    """Save bboxes to JSON file."""

def visualize_bboxes_on_pdf(
    pdf_path: Path,
    bboxes: List[BBox],
    output_path: Path,
    color: str = "red"
):
    """Draw bbox overlays on PDF."""

def visualize_bboxes_on_image(
    image_path: Path,
    bboxes: List[BBox],
    output_path: Path,
    color: str = "orange"
):
    """Draw bbox overlays on image."""

def check_bbox_containment(bbox1: BBox, bbox2: BBox) -> bool:
    """Check if bbox1 contains bbox2."""
```

---

#### **Geometry Utilities (utils/geos.py)**

```python
def filter_layout_elements(geometries: dict) -> List[dict]:
    """Extract elements with data-label attribute."""

def filter_handwriting_elements(geometries: dict) -> List[dict]:
    """Extract elements with data-handwriting attribute."""

def filter_visual_elements(geometries: dict) -> List[dict]:
    """Extract elements with data-visual-element attribute."""

def filter_by_css_class(geometries: dict, class_name: str) -> List[dict]:
    """Extract elements with specific CSS class."""
```

---

#### **Serialization Utilities (utils/serialization.py)**

```python
def encode_image_to_base64(image_path: Path) -> str:
    """Encode image file to base64 string."""

def decode_base64_to_image(base64_str: str, output_path: Path):
    """Decode base64 string and save as image."""

def serialize_dataclass(obj: Any) -> dict:
    """Serialize dataclass to dictionary."""

def deserialize_dataclass(data: dict, cls: Type[T]) -> T:
    """Deserialize dictionary to dataclass."""
```

---

## Configuration & Constants

### Key Constants (generation/constants.py)

```python
# Document processing
HTML_PARSER = "html.parser"           # BeautifulSoup parser
PDF_POINT_SCALING = 72 / 96           # CSS DPI to PDF DPI

# Bounding boxes
BBOX_OVERLAP_THRESHOLD = 0.05         # 5% overlap tolerance
SPATIAL_MATCH_THRESHOLD = 10.0        # Pixel tolerance for matching

# PDF rendering
CHROMIUM_CONCURRENCY = 10             # Parallel renders
PER_PDF_RENDER_TIMEOUT = 60           # Seconds
PER_PDF_RENDER_MAX_RETRIES = 3        # Retry attempts

# Handwriting
MAX_HANDWRITING_CHARS = 7             # Max chars per diffusion generation
HANDWRITING_HEIGHT_PX = 40            # Image height
HANDWRITING_PADDING_PX = 0            # Horizontal padding
DIFFUSION_NUM_INFERENCE_STEPS = 50    # Generation quality
HANDWRITING_IMAGE_UPSCALE_FACTOR = 3  # Insertion scaling
MAX_HANDWRITING_RAND_X = 2            # Random X offset (pixels)
MAX_HANDWRITING_RAND_Y = 1            # Random Y offset (pixels)

# Visual elements
VISUAL_ELEMENT_UPSCALE_FACTOR = 3     # Insertion scaling
STAMP_BORDER_WIDTH = 2                # Stamp border thickness
BARCODE_DPI = 300                     # Barcode image quality

# OCR
IMAGE_DPI = 200                       # Final image DPI
OCR_CONFIDENCE_THRESHOLD = 0.8        # Min confidence

# Ground truth
FUZZY_MATCH_THRESHOLD = 0.85          # Levenshtein similarity cutoff

# Handwriting styles
HANDWRITING_STYLES = [
    "writer_0", "writer_1", "writer_2", ..., "writer_99"
]

# Visual element types
VISUAL_ELEMENT_TYPES = [
    "stamp", "logo", "barcode", "chart", "photo"
]

# Visual element type mapping (LLM output → standard)
VISUAL_ELEMENT_TYPE_MAPPING = {
    "stamp": "stamp",
    "company_stamp": "stamp",
    "approval_stamp": "stamp",
    "logo": "logo",
    "company_logo": "logo",
    "brand_logo": "logo",
    "barcode": "barcode",
    "code128": "barcode",
    "chart": "chart",
    "graph": "chart",
    "figure": "chart",
    "photo": "photo",
    "image": "photo",
    "picture": "photo"
}
```

---

### Environment Variables

**Required:**
```bash
ANTHROPIC_API_KEY=sk-ant-...          # Claude API key
MICROSOFT_AZURE_OCR_KEY=...           # Azure OCR key
MICROSOFT_AZURE_OCR_ENDPOINT=...      # Azure OCR endpoint
```

**Optional:**
```bash
LLM_MODEL=claude-sonnet-4-20250514    # Override model
DEBUG_MODE=false                      # Enable debug output
LOG_LEVEL=INFO                        # Logging verbosity
CHROMIUM_CONCURRENCY=10               # Override concurrency
```

---

## Error Handling & Debugging

### Error Categories

**Comprehensive error tracking in stage 18:**

| Category | Description | Resolution |
|----------|-------------|------------|
| `multipage_pdf` | PDF rendered with >1 page | Check HTML content size, CSS page breaks |
| `missing_ocr_result` | OCR API failed or empty | Check Azure credentials, retry |
| `failed_gt_verification` | GT text not found in OCR | Review fuzzy match threshold, inspect HTML |
| `rendering_timeout` | PDF render exceeded timeout | Increase timeout, simplify HTML |
| `llm_parsing_error` | Failed to extract HTML/GT | Review LLM response format, update regex |
| `bbox_extraction_failed` | PyMuPDF extraction error | Check PDF validity, inspect fonts |
| `handwriting_generation_failed` | Diffusion model error | Check model checkpoint, GPU availability |
| `visual_element_generation_failed` | VE creation error | Check prefab directories, image validity |

---

### Debug Visualizations (Stage 19)

**Generated debug outputs:**

1. **PDF BBoxes:** Verify PyMuPDF extraction quality
2. **Visual Element BBoxes:** Verify VE extraction and positioning
3. **Handwriting Insertion:** Verify handwriting placement
4. **Final BBoxes on Images:** Verify OCR accuracy

**Enable debug mode:**
```python
# In pipeline execution
pipeline_params = PipelineParameters(
    debug_mode=True,
    # ... other params
)
```

---

### Logging

**Log locations:**
```
data/datasets/synthesized_datasets/<dataset_name>/logs/
├── pipeline_01/
│   └── seed_selection.log
├── pipeline_02/
│   └── prompting.log
├── pipeline_03/
│   └── message_processing/
│       ├── batch_001.log
│       └── batch_002.log
├── ...
└── pipeline_19/
    └── debug_generation.log
```

**Log format:**
```
[2026-02-07 14:32:15] [INFO] [pipeline_04] Starting PDF rendering for doc_0001
[2026-02-07 14:32:18] [INFO] [pipeline_04] Successfully rendered doc_0001 (3.2s)
[2026-02-07 14:32:18] [ERROR] [pipeline_04] Rendering failed for doc_0002: Timeout
```

---

### Common Issues & Solutions

**Issue: Multi-page PDFs**
- **Cause:** HTML content exceeds page size
- **Solution:** Reduce content, check for long tables, use CSS `overflow: hidden`

**Issue: Handwriting not appearing**
- **Cause:** Character bboxes not available (stage 05)
- **Solution:** Use simpler fonts in HTML, ensure PyMuPDF can extract chars

**Issue: OCR missing text**
- **Cause:** Low image DPI, poor contrast
- **Solution:** Increase `IMAGE_DPI` (stage 14), adjust HTML styling

**Issue: GT verification failures**
- **Cause:** OCR discrepancies, fuzzy match threshold too high
- **Solution:** Lower `FUZZY_MATCH_THRESHOLD`, improve HTML text rendering

**Issue: Claude API timeout**
- **Cause:** Large seed images, complex prompts
- **Solution:** Compress seed images, simplify prompt template

---

## Usage Examples

### Full Pipeline Execution

```python
from docgenie.generation import pipeline_01_select_seeds
from docgenie.generation.models import SynDatasetDefinition, PipelineParameters

# Load configuration
syndatadef = SynDatasetDefinition.from_yaml(
    "data/syn_dataset_definitions/docvqa_alpha=1.0.yaml"
)

# Configure pipeline
params = PipelineParameters(
    start_stage=1,
    end_stage=19,
    skip_existing=True,
    debug_mode=False,
    max_workers=10
)

# Execute pipeline stages
for stage in range(params.start_stage, params.end_stage + 1):
    print(f"Executing stage {stage:02d}...")
    
    # Import and run stage module
    stage_module = importlib.import_module(
        f"docgenie.generation.pipeline_{stage:02d}_*"
    )
    stage_module.main(syndatadef, params)
    
    print(f"Stage {stage:02d} complete.")

print("Pipeline execution complete!")
```

---

### API Usage

**See API Implementation section for detailed examples.**

---

### Custom Dataset Definition

```yaml
# data/syn_dataset_definitions/custom_invoices.yaml

name: "custom_invoices_v1"
task: "kie"
base_dataset: "cord"

llm_model: "claude-sonnet-4-20250514"
prompt_template: "ClaudeRefined12"
num_solutions: 1

language: "English"
doc_type: "invoices and receipts"
gt_type: "Key-value pairs (entity extraction)"
gt_format: '{"entity_name": "entity_value", ...}'

num_samples: 500
alpha: 0.75
seeds_per_cluster: 5
clustering_method: "kmeans"

# KIE-specific
valid_labels: null  # Not used for KIE

output_dir: "data/datasets/synthesized_datasets/custom_invoices_v1"
```

---

## Conclusion

This documentation provides a comprehensive overview of the DocGenie generation pipeline and API. For additional support or questions, refer to:

- **Pipeline Integration Guide:** `api/PIPELINE_INTEGRATION.md`
- **API README:** `api/README.md`
- **Source Code:** `docgenie/generation/` and `api/`

**Key Takeaways:**

1. **19-Stage Pipeline:** Modular design from seed selection to GT verification
2. **Multi-Task Support:** QA, KIE, DLA, CLS with task-specific handling
3. **Realistic Documents:** LLM-generated content, diffusion handwriting, visual elements
4. **Quality Assurance:** Comprehensive validation, OCR verification, error tracking
5. **API Integration:** FastAPI service for synchronous document generation (stages 01-06)
6. **Extensibility:** Modular code, clear interfaces, easy to extend

**Pipeline Strengths:**

- Task-agnostic core with task-specific adapters
- Extensive logging and error tracking
- Parallel processing where applicable
- Debug visualizations at every stage
- Integration with state-of-the-art models

**Future Directions:**

- Full API integration (stages 07-19)
- Additional document types (forms, legal, medical)
- Multi-language support expansion
- Enhanced visual element generation (charts, diagrams)
- Real-time generation optimization