
DocGenie Generation Pipeline & API Documentation

Version: 1.0
Last Updated: February 7, 2026
Purpose: Comprehensive reference for the DocGenie synthetic document generation system


Table of Contents

  1. Overview
  2. Pipeline Architecture
  3. Pipeline Stages (01-19)
  4. API Implementation
  5. Core Models & Utilities
  6. Configuration & Constants
  7. Usage Examples
  8. Error Handling & Debugging

Overview

DocGenie is a sophisticated 19-stage pipeline for generating synthetic document datasets with ground truth annotations. It supports multiple document understanding tasks:

  • Document Question Answering (QA)
  • Key Information Extraction (KIE)
  • Document Layout Analysis (DLA)
  • Document Classification (CLS)

Key Features

  • LLM-Powered Generation: Uses Claude/Gemini/Open-source models to generate diverse document content
  • Realistic Handwriting: Diffusion model-based handwriting synthesis with author-specific styles
  • Visual Element Integration: Stamps, logos, barcodes, charts, and photos
  • Multi-Task Support: Task-specific ground truth formatting and validation
  • Quality Assurance: Comprehensive validation, OCR verification, and error tracking
  • Modular Design: Each pipeline stage is independently executable with clear inputs/outputs

Technology Stack

  • LLM APIs: Claude (Anthropic), Gemini, DeepSeek, Qwen
  • PDF Rendering: Playwright (Chromium), PyMuPDF
  • OCR: Microsoft Azure OCR
  • Handwriting: Custom diffusion model
  • Image Processing: PIL, OpenCV
  • API Framework: FastAPI
  • Data Processing: Pandas, NumPy

Pipeline Architecture

High-Level Flow

┌──────────────────────────────────────────────────────────────────────────┐
│                        DOCGENIE GENERATION PIPELINE                      │
└──────────────────────────────────────────────────────────────────────────┘

┌──────────────────────┐
│  PHASE 1: SELECTION  │
└──────────────────────┘
    ↓
[01] Select Seeds ────────────► seeds.csv, clusters.csv
                                (Cluster-based diverse seed selection)

┌──────────────────────┐
│ PHASE 2: LLM GEN     │
└──────────────────────┘
    ↓
[02] Prompt LLM ──────────────► batch_results/ (JSON)
    │                           (Claude API batched calls)
    ↓
[03] Process Response ────────► raw_html/, raw_annotations/
                                (Extract HTML & GT from responses)

┌──────────────────────┐
│ PHASE 3: RENDERING   │
└──────────────────────┘
    ↓
[04] Render PDF Initial ──────► pdf_initial/, geometries/
    │                           (HTML→PDF with geometry extraction)
    ↓
[05] Extract BBoxes ──────────► pdf_word_bboxes/, pdf_char_bboxes/
    │                           (PyMuPDF text extraction)
    ↓
[06] Extract Layout ──────────► layout_element_definitions/
                                (DLA/KIE-specific annotations)

┌──────────────────────┐
│ PHASE 4: EXTRACTION  │
└──────────────────────┘
    ↓
[07] Extract Handwriting ─────► handwriting_definitions/
    │                           (Identify handwriting regions)
    ↓
[08] Extract Visual Elements ─► visual_element_definitions/
                                (Stamp/logo/barcode placeholders)

┌──────────────────────┐
│ PHASE 5: GENERATION  │
└──────────────────────┘
    ↓
[09] Create Handwriting ──────► handwriting_images/
    │                           (Diffusion model generation)
    ↓
[10] Create Visual Elements ──► visual_element_images/
                                (Generate/select stamps, logos, etc.)

┌──────────────────────┐
│ PHASE 6: COMPOSITION │
└──────────────────────┘
    ↓
[11] Render PDF (2nd Pass) ───► pdf_without_handwriting_placeholder/
    │                           (Remove handwriting placeholders)
    ↓
[12] Insert Handwriting ──────► pdf_with_handwriting/
    │                           (Overlay handwriting images)
    ↓
[13] Insert Visual Elements ──► pdf_final/
    │                           (Overlay stamps, logos, etc.)
    ↓
[14] Render Image ────────────► images/
                                (PDF→PNG conversion)

┌──────────────────────┐
│ PHASE 7: FINALIZATION│
└──────────────────────┘
    ↓
[15] Perform OCR ─────────────► final_word_bboxes/, final_segment_bboxes/
    │                           (Microsoft OCR)
    ↓
[16] Normalize BBoxes ────────► normalized_word_bboxes/, normalized_segment_bboxes/
                                (Pixel→[0,1] coordinates)

┌──────────────────────┐
│ PHASE 8: VALIDATION  │
└──────────────────────┘
    ↓
[17] GT Preparation ──────────► verified_gt/
    │                           (Fuzzy matching, BIO tagging)
    ↓
[18] Analyze ─────────────────► dataset_log.json
    │                           (Statistics, cost analysis)
    ↓
[19] Create Debug Data ───────► debug/ subdirectories
                                (Visualizations for inspection)

Data Flow Between Stages

Seed Images ──┐
              ├──► [02] ──► HTML + GT ──► [04] ──► PDF + Geometries
Prompt Params ┘                                         │
                                                        ├──► [05] ──► BBoxes
                                                        │               │
                                                        │               ├──► [07] ──► HW Defs ──► [09] ──► HW Images ──┐
                                                        │               │                                              │
                                                        │               └──► [08] ──► VE Defs ──► [10] ──► VE Images ──┤
                                                        │                                                              │
                                                        └──► [11] ──► PDF (no HW) ──┬──► [12] ◄────────────────────────┤
                                                                                    │           (Insert HW)            │
                                                                                    └──► [13] ◄────────────────────────┘
                                                                                             (Insert VE)
                                                                                                  ↓
                                                                                            [14] ──► Image
                                                                                                  ↓
                                                                                            [15] ──► OCR BBoxes
                                                                                                  ↓
                                                                                            [16] ──► Normalized
                                                                                                  ↓
                                                                                            [17] ──► Verified GT

Pipeline Stages (01-19)

Stage 01: Select Seeds

File: pipeline_01_select_seeds.py

Purpose: Select diverse seed images from base dataset using clustering algorithms to ensure variety in the generated documents.

Key Functions:

  • main(): Orchestrates seed selection process
  • downscale_and_compress_seeds(): Prepares seed images for efficient API transmission
  • plot_class_distribution(): Visualizes class balance in selected seeds
  • visualize_cluster_histogram(): Shows distribution across clusters

Process:

  1. Load embeddings from base dataset
  2. Perform clustering (KMeans or other algorithms)
  3. Sample N seeds per cluster
  4. Downscale and compress images (JPEG, max dimension)
  5. Save seed manifest and cluster assignments
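
Step 3 above (sampling N seeds per cluster) can be sketched as follows. `select_seeds_per_cluster` is a hypothetical helper written for illustration, not the pipeline's actual function:

```python
import random
from collections import defaultdict

def select_seeds_per_cluster(cluster_ids, seeds_per_cluster, seed=0):
    """Sample up to `seeds_per_cluster` document indices from each cluster.

    `cluster_ids[i]` is the cluster assignment of document i."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        members[cluster].append(idx)
    selected = []
    for cluster in sorted(members):
        take = min(seeds_per_cluster, len(members[cluster]))
        selected.extend(rng.sample(members[cluster], take))
    return selected
```

Clusters smaller than N contribute all of their members, so the final seed set stays diverse without oversampling small clusters.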

Inputs:

  • SynDatasetDefinition configuration
  • Base dataset name (e.g., docvqa, cord, publaynet)
  • Clustering parameters from constants

Outputs:

seeds.csv                    # Selected seed document IDs per prompt call
clusters.csv                 # Cluster assignments for all documents
seeds/ (directory)           # Preprocessed seed images (JPEG, compressed)

Configuration Parameters:

  • EMBEDDING_MODEL: Specifies which embedding model was used
  • IMAGE_MAX_DIMENSION: Max width/height for compression
  • JPEG_QUALITY: Compression quality (0-100)

Example Usage:

from docgenie.generation import pipeline_01_select_seeds

pipeline_01_select_seeds.main(
    syndatadef_path="data/syn_dataset_definitions/docvqa_alpha=1.0.yaml",
    base_dataset="docvqa"
)

Stage 02: Prompt LLM

File: pipeline_02_prompt_llm.py

Purpose: Send batched prompts to LLM APIs (Claude, Gemini, DeepSeek, Qwen) with seed images to generate document HTML and ground truth.

Key Functions:

  • main(): Main orchestrator for LLM prompting
  • create_batched_messages(): Constructs API-compatible message batches
  • track_batch_completion(): Polls API for batch status
  • Cost calculation utilities in pipeline_01/cost.py

Process:

  1. Load seed images and encode as base64
  2. Build prompts from template with parameter injection
  3. Create batched API requests (Claude Batch API for cost efficiency)
  4. Submit batches and track completion
  5. Save results for processing in stage 03
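
Steps 1-2 can be sketched with a minimal message builder following the Anthropic Messages API content-block shape; `build_image_message` is an illustrative name, not the pipeline's actual helper:

```python
import base64

def build_image_message(prompt_text, image_bytes, media_type="image/jpeg"):
    """Assemble one user message containing an inline base64 seed image
    followed by the rendered prompt text (sketch)."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": prompt_text},
        ],
    }
```

A batch request would then wrap many such messages, one per prompt call, before submission to the Batch API.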

Inputs:

  • Prompt template from data/prompt_templates/<template_name>/
  • Seed images from stage 01
  • API credentials from environment variables
  • SynDatasetDefinition parameters

Outputs:

prompt_batches/              # Batch metadata (batch IDs, status)
message_results/             # JSON response files per batch
logs/                        # Prompting logs and progress

API Configuration:

  • Claude: Uses Batch API with prompt caching for cost efficiency
  • Batch Size: Configurable via BATCH_SIZE constant
  • Polling Interval: Configurable wait time between status checks
  • Model Selection: Specified in SynDatasetDefinition.llm_model

Cost Tracking:

  • Input/output token counts per request
  • Cached token usage (for Claude)
  • Total cost estimation per batch

Example Configuration:

# In syn_dataset_definition YAML
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1  # Documents per prompt
language: "English"
doc_type: "business and administrative"

Stage 03: Process Response

File: pipeline_03_process_response.py

Purpose: Extract and validate HTML documents and ground truth annotations from LLM responses.

Key Functions:

  • main(): Main processor
  • extract_html_from_message(): Regex-based HTML extraction from markdown code blocks
  • extract_gt_from_html(): Parse JSON ground truth from <script id="GT"> tags
  • validate_and_save_gt(): Task-specific validation and formatting

Process:

  1. Load message results from stage 02
  2. Extract HTML content using regex patterns
  3. Parse and validate ground truth JSON
  4. Apply task-specific formatting (QA, KIE, DLA, CLS)
  5. Save raw HTML and annotations separately
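
Step 3 can be sketched as below. The exact pattern used by `extract_gt_from_html()` is an assumption; this version only illustrates pulling JSON out of a `<script id="GT">` tag:

```python
import json
import re

# Matches the ground-truth script tag and captures its JSON body.
GT_SCRIPT = re.compile(r'<script[^>]*id="GT"[^>]*>(.*?)</script>', re.DOTALL)

def extract_gt_from_html(html: str):
    """Return the parsed GT dict, or None when no GT script is present."""
    match = GT_SCRIPT.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))
```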

Inputs:

  • Message results from message_results/ (stage 02)
  • SynDatasetDefinition task type and prompt format

Outputs:

raw_html/                    # HTML files (one per document)
raw_annotations/             # Ground truth JSON files
  qa/                        # QA format: {"question": "answer", ...}
  kie/                       # KIE format: {"entity": "value", ...}
  dla/                       # DLA format: {"element_id": "label", ...}
  cls/                       # CLS format: {"class": "category"}
logs/message_processing/     # Processing logs per message

Ground Truth Formats:

QA Format:

{
  "What is the invoice number?": "INV-12345",
  "What is the total amount?": "$1,234.56"
}

KIE Format:

{
  "company_name": "Acme Corp",
  "invoice_date": "2024-01-15",
  "total_amount": "1234.56"
}

DLA Format:

{
  "layout_0": "title",
  "layout_1": "text",
  "layout_2": "table"
}

Validation Checks:

  • Expected document count matches actual
  • Valid JSON structure in GT
  • Task-specific field presence
  • HTML completeness (valid tags)

Error Handling:

  • Malformed HTML → Logged, skipped
  • Missing GT → Document flagged in logs
  • Invalid JSON → Parsing error logged

Stage 04: Render PDF and Extract Geometries

File: pipeline_04_render_pdf_and_extract_geos.py

Purpose: Convert HTML to PDF with automatic content-based page sizing and extract element geometries (positions, dimensions) for downstream processing.

Key Functions:

  • main(): Async batch rendering orchestrator
  • render_pdf_with_playwright(): Single PDF rendering using Playwright/Chromium
  • preprocess_css(): Remove conflicting CSS @page rules
  • validate_pdf(): Check page count and file integrity
  • extract_geometries(): Parse element positions from injected JavaScript

Process:

  1. Load raw HTML from stage 03
  2. Inject geometry extraction JavaScript
  3. Preprocess CSS (remove fixed page sizes)
  4. Render with Playwright in headless Chromium
  5. Measure content dimensions dynamically
  6. Export PDF with calculated page size
  7. Extract and save element geometries

Inputs:

  • Raw HTML files from raw_html/ (stage 03)

Outputs:

pdf_initial/                 # Initial PDFs
geometries/                  # Element geometry JSON files
  {doc_id}.json              # Contains positions, dimensions, CSS classes
render_html/                 # HTML with geometry extraction scripts
debug/pdf_with_geos/         # Debug PDFs with geometry overlays (optional)
logs/rendering/              # Rendering logs and errors

Geometry JSON Structure:

{
  "page_width_mm": 210.0,
  "page_height_mm": 297.0,
  "elements": [
    {
      "id": "layout_0",
      "type": "div",
      "class": "title",
      "rect": {
        "x": 20.0,
        "y": 30.0,
        "width": 170.0,
        "height": 15.0
      },
      "text": "Invoice",
      "attributes": {
        "data-label": "title",
        "data-handwriting": null,
        "data-visual-element": null
      }
    }
  ]
}

Key Features:

  • Automatic Page Sizing: No fixed dimensions; page size adapts to content
  • Concurrent Rendering: Semaphore-controlled parallel processing
  • Geometry Extraction: JavaScript injected to capture element positions
  • CSS Coordinate Conversion: 96 DPI (CSS) → 72 DPI (PDF)
  • Retry Logic: Up to 3 attempts with configurable timeout
  • Debug Visualizations: Optional overlay of geometries on PDFs

Configuration Constants:

  • CHROMIUM_CONCURRENCY: 10 (parallel render limit)
  • PER_PDF_RENDER_TIMEOUT: 60 seconds
  • PER_PDF_RENDER_MAX_RETRIES: 3 attempts
  • PDF_POINT_SCALING: 72/96 for DPI conversion
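
The coordinate conversion behind PDF_POINT_SCALING is a single multiplication, since CSS lays out at 96 DPI while PDF points are 72 DPI:

```python
# CSS pixels (96 dpi) -> PDF points (72 dpi)
PDF_POINT_SCALING = 72 / 96

def css_px_to_pdf_points(value_px: float) -> float:
    """Convert a CSS pixel measurement to PDF points."""
    return value_px * PDF_POINT_SCALING
```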

Error Handling:

  • Timeout → Retry with increased timeout
  • Multi-page PDFs → Flagged and skipped
  • Missing geometries → Error logged, document marked invalid

Stage 05: Extract BBoxes from PDF

File: pipeline_05_extract_bboxes_from_pdf.py

Purpose: Extract word-level and character-level bounding boxes from PDFs using PyMuPDF for accurate text positioning.

Key Functions:

  • main(): Main extraction orchestrator
  • extract_bboxes_from_pdf(): PyMuPDF-based extraction
  • verify_char_to_word_mapping(): Validate character-to-word relationships
  • create_bbox_debug_pdf(): Generate debug visualizations

Process:

  1. Load PDFs from stage 04
  2. Extract words with PyMuPDF get_text("words")
  3. Extract characters with get_text("rawdict")
  4. Map characters to parent words
  5. Validate mappings (required for handwriting splitting)
  6. Save both word and character bboxes

Inputs:

  • PDFs from pdf_initial/ (stage 04)

Outputs:

pdf_word_bboxes/             # Word-level bounding boxes
  {doc_id}.json              # Format: [x0, y0, x1, y1, text]
pdf_char_bboxes/             # Character-level bounding boxes (if mappable)
  {doc_id}.json              # Format: [x0, y0, x1, y1, char, word_idx]
debug/pdf_bboxes/            # Debug PDFs with bbox overlays (optional)
logs/bbox_extraction/        # Extraction logs

BBox JSON Format:

Word BBoxes:

[
  {"x0": 20.0, "y0": 30.0, "x1": 60.0, "y1": 45.0, "text": "Invoice"},
  {"x0": 65.0, "y0": 30.0, "x1": 95.0, "y1": 45.0, "text": "Date"}
]

Char BBoxes (with word mapping):

[
  {"x0": 20.0, "y0": 30.0, "x1": 25.0, "y1": 45.0, "char": "I", "word_idx": 0},
  {"x0": 25.0, "y0": 30.0, "x1": 30.0, "y1": 45.0, "char": "n", "word_idx": 0}
]
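
Assuming PyMuPDF's `page.get_text("words")` tuple layout `(x0, y0, x1, y1, text, block_no, line_no, word_no)`, the word records above can be produced with a small converter (a sketch, not the pipeline's code):

```python
def words_to_bbox_records(words):
    """Convert PyMuPDF word tuples into word-bbox JSON records.

    Boxes with non-positive width or height are filtered out, mirroring
    the invalid-bbox handling described for this stage."""
    return [
        {"x0": w[0], "y0": w[1], "x1": w[2], "y1": w[3], "text": w[4]}
        for w in words
        if w[2] > w[0] and w[3] > w[1]
    ]
```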

Key Features:

  • Dual-Level Extraction: Both word and character granularity
  • Mapping Validation: Ensures characters can be mapped to words (critical for handwriting)
  • Debug Visualizations: Color-coded bbox overlays on PDFs
  • PDF Coordinate System: Uses 72 DPI (PDF points)

Character-to-Word Mapping: Required for handwriting splitting in stage 07. If mapping fails (e.g., due to ligatures or complex fonts), char bboxes are not saved, and handwriting will use word-level fallback.

Error Handling:

  • Empty PDFs → Skipped with warning
  • Unmappable characters → Word-level fallback
  • Invalid bboxes (negative dimensions) → Filtered out

Stage 06: Extract Layout Element Definitions and Annotation GT

File: pipeline_06_extract_layout_element_definitions_and_annotation_gt.py

Purpose: Extract layout element definitions for Document Layout Analysis (DLA) and annotation-based ground truth for Key Information Extraction (KIE).

Key Functions:

  • main(): Task router
  • handle_dla(): Process DLA tasks
  • handle_kie(): Process KIE tasks
  • parse_layout_elements(): Extract layout elements from geometries
  • parse_kie_fields(): Extract KIE annotated fields from geometries

Process (DLA):

  1. Load geometries from stage 04
  2. Filter elements with data-label attribute
  3. Validate labels against task-specific valid labels
  4. Extract bounding boxes and labels
  5. Save layout element definitions

Process (KIE):

  1. Load geometries from stage 04
  2. Filter elements with data-annotation attribute
  3. Extract entity names and text values
  4. Validate against expected entity schema
  5. Save raw annotations for stage 17

Inputs:

  • Geometries from geometries/ (stage 04)
  • Valid labels from SynDatasetDefinition.valid_labels
  • Task type from SynDatasetDefinition.task

Outputs:

layout_element_definitions/  # For DLA tasks
  {doc_id}.json              # [{id, label, rect}, ...]
raw_annotations/             # For KIE tasks
  kie_annotations/
    {doc_id}.json            # {entity_name: text_value, ...}

Layout Element Definition Format (DLA):

[
  {
    "id": "layout_0",
    "label": "title",
    "rect": {"x": 20.0, "y": 30.0, "width": 170.0, "height": 15.0}
  },
  {
    "id": "layout_1",
    "label": "text",
    "rect": {"x": 20.0, "y": 50.0, "width": 170.0, "height": 50.0}
  }
]
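
The DLA path (filter elements by `data-label`, validate against the allowed labels, reject zero-sized boxes) can be sketched as follows; the function body is illustrative, not the actual `parse_layout_elements()` implementation:

```python
def parse_layout_elements(geometry, valid_labels):
    """Extract DLA layout definitions from a stage-04 geometry dict.

    Returns (elements, errors): validated {id, label, rect} records plus
    human-readable error strings for rejected elements."""
    elements, errors = [], []
    for el in geometry["elements"]:
        label = el.get("attributes", {}).get("data-label")
        if label is None:
            continue  # unlabeled elements are skipped with a warning upstream
        if label not in valid_labels:
            errors.append(f"invalid label {label!r} on {el['id']}")
            continue
        rect = el["rect"]
        if rect["width"] <= 0 or rect["height"] <= 0:
            errors.append(f"zero-sized element {el['id']}")
            continue
        elements.append({"id": el["id"], "label": label, "rect": rect})
    return elements, errors
```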

KIE Annotation Format:

{
  "company_name": "Acme Corporation",
  "invoice_number": "INV-12345",
  "invoice_date": "January 15, 2024",
  "total_amount": "$1,234.56"
}

Validation Checks:

  • Missing labels: Elements without data-label → Warning
  • Invalid labels: Labels not in valid_labels → Error
  • Zero dimensions: Width or height = 0 → Error
  • Multiple labels (KIE only): One element, multiple entities → Error
  • Missing text: KIE elements without text → Warning

Task-Specific Handling:

  • DLA: Converts geometries to LayoutBBox format for stage 17
  • KIE: Extracts entity-value pairs for BIO tagging in stage 17
  • QA/CLS: No processing in this stage (uses raw GT from stage 03)

Stage 07: Extract Handwriting

File: pipeline_07_extract_handwriting.py

Purpose: Identify text regions marked for handwriting generation, map them to character/word bounding boxes, and prepare definitions for the diffusion model.

Key Functions:

  • main(): Main extraction orchestrator
  • parse_handwriting_geometries(): Extract elements with data-handwriting attribute
  • map_handwriting_to_bboxes(): Match handwriting text to OCR bboxes spatially
  • split_long_words(): Split words exceeding diffusion model's character limit
  • extract_handwriting_style(): Parse author ID from CSS class

Process:

  1. Load geometries from stage 04
  2. Filter elements with data-handwriting attribute
  3. Load word/character bboxes from stage 05
  4. Spatially match handwriting regions to bboxes
  5. Split long words for diffusion model constraints
  6. Extract author ID for style consistency
  7. Save handwriting definitions with bbox mappings

Inputs:

  • Geometries from geometries/ (stage 04)
  • Word bboxes from pdf_word_bboxes/ (stage 05)
  • Character bboxes from pdf_char_bboxes/ (stage 05, if available)

Outputs:

handwriting_definitions/
  {doc_id}.json              # Handwriting region definitions
logs/handwriting_extraction/ # Extraction logs and warnings

Handwriting Definition Format:

[
  {
    "id": "hw0",
    "text": "John Smith",
    "author_id": "author1",
    "bboxes": [
      "20.0,30.0,60.0,45.0,John",
      "65.0,30.0,95.0,45.0,Smith"
    ],
    "rect": {
      "x": 20.0,
      "y": 30.0,
      "width": 75.0,
      "height": 15.0
    },
    "is_signature": false
  }
]

Key Features:

  • Author ID Extraction: Parses CSS classes like handwriting-author1 for style consistency
  • Spatial Matching: Uses bbox overlap/proximity to map handwriting regions to text
  • Long Word Splitting: Splits words > MAX_HANDWRITING_CHARS (default: 7) into multiple images
  • Signature Detection: Identifies signature fields for special handling
  • Character-Level Fallback: Uses word bboxes if char-level mapping unavailable

Configuration Constants:

  • MAX_HANDWRITING_CHARS: 7 (max characters per diffusion generation)
  • SPATIAL_MATCH_THRESHOLD: Pixel tolerance for bbox matching

Mapping Logic:

  1. Find all words whose bboxes overlap/intersect with handwriting rect
  2. Extract text from matched bboxes
  3. Compare with expected handwriting text
  4. If char bboxes available: split long words at char boundaries
  5. If char bboxes unavailable: split at word boundaries
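
The long-word splitting against MAX_HANDWRITING_CHARS (steps 4-5) reduces to fixed-size chunking; a minimal sketch:

```python
MAX_HANDWRITING_CHARS = 7  # max characters per diffusion generation

def split_long_word(word: str, max_chars: int = MAX_HANDWRITING_CHARS):
    """Split a word into chunks of at most `max_chars` characters so each
    chunk fits one diffusion-model generation."""
    return [word[i:i + max_chars] for i in range(0, len(word), max_chars)]
```

With character bboxes available, each chunk keeps a precise bbox; without them, splitting stops at word boundaries instead.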

Error Handling:

  • No bbox matches → Warning, handwriting region skipped
  • Text mismatch → Logged, uses best match
  • Missing char bboxes → Word-level fallback (may affect quality)

Stage 08: Extract Visual Element Definitions

File: pipeline_08_extract_visual_element_definitions.py

Purpose: Extract placeholders for visual elements (stamps, logos, barcodes, charts, photos) from geometries for generation in stage 10.

Key Functions:

  • main(): Main extraction orchestrator
  • parse_visual_element_geometries(): Extract elements with data-visual-element attribute
  • parse_css_dimension(): Parse width/height from CSS
  • parse_css_rotation(): Extract rotation angle from CSS transform

Process:

  1. Load geometries from stage 04
  2. Filter elements with data-visual-element attribute
  3. Parse element type (stamp, logo, barcode, etc.)
  4. Extract dimensions and rotation from CSS
  5. Parse content (e.g., stamp text, barcode value)
  6. Validate element types and dimensions
  7. Save visual element definitions

Inputs:

  • Geometries from geometries/ (stage 04)

Outputs:

visual_element_definitions/
  {doc_id}.json              # Visual element definitions
logs/visual_element_extraction/ # Extraction logs

Visual Element Definition Format:

[
  {
    "id": "ve0",
    "type": "stamp",
    "type_unmapped": "stamp",
    "content": "CONFIDENTIAL",
    "rect": {
      "x": 150.0,
      "y": 250.0,
      "width": 60.0,
      "height": 30.0
    },
    "rotation": -15.0
  },
  {
    "id": "ve1",
    "type": "barcode",
    "type_unmapped": "barcode",
    "content": "1234567890",
    "rect": {
      "x": 20.0,
      "y": 270.0,
      "width": 100.0,
      "height": 20.0
    },
    "rotation": 0.0
  }
]

Supported Visual Element Types:

  • stamp: Text-based stamps (e.g., "APPROVED", "CONFIDENTIAL")
  • logo: Company/brand logos (selected from prefabs)
  • barcode: Code128 barcodes
  • chart: Charts and graphs (selected from prefabs)
  • photo: Photographic images (selected from prefabs)

Type Mapping: Maps LLM-generated type names to standard types using VISUAL_ELEMENT_TYPE_MAPPING constant.

Key Features:

  • Content Extraction: Parses text content for stamps, values for barcodes
  • Rotation Support: Extracts CSS rotate() transform angles
  • Dimension Parsing: Handles CSS units (px, mm, %, etc.)
  • Type Validation: Warns about unknown types, uses fallback mapping

Validation Checks:

  • Unknown type → Logged, mapped to closest known type
  • Zero dimensions → Error, element skipped
  • Invalid rotation angle → Defaults to 0°
  • Missing content (for stamps/barcodes) → Warning

CSS Parsing Examples:

/* Rotation extraction */
transform: rotate(-15deg);        → -15.0
transform: rotate(0.5turn);       → 180.0

/* Dimension extraction */
width: 60mm;                      → 60.0 (mm)
height: 30px;                     → Converted to mm
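
The rotation examples above can be reproduced with a small parser. This sketch handles only `deg` and `turn` units and is not the actual `parse_css_rotation()` implementation:

```python
import re

def parse_css_rotation(transform: str) -> float:
    """Extract the angle (in degrees) from a CSS rotate() transform.

    Unrecognized or missing rotations fall back to 0.0, matching the
    stage's default behavior."""
    match = re.search(r"rotate\(\s*(-?[\d.]+)(deg|turn)\s*\)", transform)
    if match is None:
        return 0.0
    value, unit = float(match.group(1)), match.group(2)
    return value * 360.0 if unit == "turn" else value
```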

Stage 09: Create Handwriting Images

File: pipeline_09_create_handwriting_images.py

Purpose: Generate realistic handwriting images using a diffusion model, with per-author style consistency.

Key Functions:

  • main(): Main generation orchestrator
  • generate_handwriting_diffusion(): Call diffusion model API/local model
  • add_handwriting_blur(): Optional post-processing for realism

Process:

  1. Load handwriting definitions from stage 07
  2. Group by author ID for style consistency
  3. For each text segment:
    • Generate handwriting image with diffusion model
    • Apply optional blur for realism
    • Save image with metadata
  4. Track author-to-writer style mapping
  5. Save generation logs

Inputs:

  • Handwriting definitions from handwriting_definitions/ (stage 07)
  • Diffusion model checkpoint (local or API)

Outputs:

handwriting_images/
  {doc_id}/
    hw0_0.png               # First bbox of handwriting region hw0
    hw0_1.png               # Second bbox (if multi-word)
    hw1_0.png
logs/handwriting_generation/ # Generation logs with author mappings

Handwriting Image Specifications:

  • Format: PNG with transparency
  • Height: 40 pixels (configurable via HANDWRITING_HEIGHT_PX)
  • Padding: 0 pixels horizontal (configurable via HANDWRITING_PADDING_PX)
  • Background: Transparent
  • Color: Black text on transparent

Diffusion Model Integration: Located in handwriting_diffusion/ module:

  • generate_handwriting_diffusion_raw.py: Core generation logic
  • text_encoder.py: Text encoding for model input
  • tokenizer.py: Character tokenization

Key Features:

  • Author-Style Consistency: Same author ID → consistent handwriting style
  • Writer Style Mapping: Maps author IDs to writer style IDs (e.g., "author1" → writer_42)
  • Batch Processing: Generates multiple images efficiently
  • Optional Blur: Post-processing for more realistic appearance
  • Quality Control: Validates generated images (non-empty, correct dimensions)

Configuration Constants:

  • HANDWRITING_HEIGHT_PX: 40 pixels
  • HANDWRITING_PADDING_PX: 0 pixels
  • DIFFUSION_NUM_INFERENCE_STEPS: 50 (generation quality/speed tradeoff)
  • HANDWRITING_BLUR_ENABLED: Optional blur toggle
  • HANDWRITING_STYLES: List of available writer styles

Author-to-Writer Mapping Example:

{
  "author1": "writer_42",
  "author2": "writer_17",
  "author3": "writer_89"
}
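
One way to produce such a mapping deterministically is to seed a random assignment over the available writer styles; this is a sketch, and the pipeline's actual assignment logic may differ:

```python
import random

def assign_writer_styles(author_ids, writer_styles, seed=0):
    """Map each distinct author ID to a writer style.

    Sorting the author set and seeding the RNG makes the mapping
    reproducible across runs, so one author always gets one style."""
    rng = random.Random(seed)
    return {author: rng.choice(writer_styles) for author in sorted(set(author_ids))}
```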

Error Handling:

  • Generation failure → Retry up to 3 times
  • Empty image → Logged, document flagged
  • Model timeout → Skipped, logged

Stage 10: Create Visual Elements

File: pipeline_10_create_visual_elements.py

Purpose: Generate or select visual element images (stamps, logos, barcodes, charts, photos) based on definitions from stage 08.

Key Functions:

  • main(): Main generation orchestrator
  • route_visual_element_generation(): Router for element type
  • generate_stamp(): Create stamp image with text
  • select_logo(): Random selection from logo prefabs
  • generate_barcode(): Create Code128 barcode
  • select_photo(): Random selection from photo prefabs
  • select_chart(): Random selection from chart prefabs

Process:

  1. Load visual element definitions from stage 08
  2. For each element:
    • Route to type-specific generator
    • Generate or select image
    • Resize to target dimensions
    • Save with transparency (if applicable)
  3. Cache prefab directories for performance
  4. Save generation logs

Inputs:

  • Visual element definitions from visual_element_definitions/ (stage 08)
  • Prefab directories in data/visual_element_prefabs/:
    • logos/
    • photos/
    • charts/

Outputs:

visual_element_images/
  {doc_id}/
    ve0.png                 # Stamp image
    ve1.png                 # Logo image
    ve2.png                 # Barcode image
logs/visual_element_generation/ # Generation logs

Type-Specific Generation:

Stamps:

  • Font: Configurable (default: Arial Bold)
  • Color: Configurable (default: red/blue)
  • Background: Transparent
  • Border: Optional rounded rectangle
  • Rotation: Applied during insertion (stage 13)

Logos:

  • Source: Random selection from data/visual_element_prefabs/logos/
  • Format: PNG with transparency preserved
  • Caching: Directory contents cached after first scan

Barcodes:

  • Type: Code128 (supports alphanumeric)
  • Library: python-barcode
  • Background: White
  • Content validation: Numeric or alphanumeric

Photos:

  • Source: Random selection from data/visual_element_prefabs/photos/
  • Format: JPEG or PNG
  • Aspect ratio: Preserved during resize

Charts/Figures:

  • Source: Random selection from data/visual_element_prefabs/charts/
  • Format: PNG with transparency
  • Types: Bar charts, line graphs, pie charts, etc.

Key Features:

  • Type-Specific Logic: Each element type has dedicated generation function
  • Prefab Caching: Directory scans cached for performance
  • Transparent Backgrounds: Stamps and some logos support transparency
  • Content Validation: Barcodes validate numeric content
  • Aspect Ratio Preservation: Images scaled without distortion
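
The prefab caching mentioned above can be sketched with `functools.lru_cache`, so each prefab directory is scanned only once per process (illustrative, assuming PNG prefabs):

```python
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def list_prefabs(prefab_dir: str):
    """Scan a prefab directory for PNG files once; later calls with the
    same path hit the in-memory cache instead of the filesystem."""
    return tuple(sorted(p.name for p in Path(prefab_dir).glob("*.png")))
```

Returning a tuple (hashable, immutable) keeps the cached value safe from accidental mutation by callers.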

Configuration Constants:

  • STAMP_FONT_SIZE: Calculated from target dimensions
  • STAMP_BORDER_WIDTH: 2 pixels
  • BARCODE_DPI: 300 for high quality
  • PREFAB_CACHE_SIZE: In-memory cache limit

Error Handling:

  • Missing prefab directory → Error, element skipped
  • Empty prefab directory → Warning, fallback placeholder
  • Invalid barcode content → Logged, uses fallback text
  • Image generation failure → Placeholder created

Stage 11: Render PDF Second Pass

File: pipeline_11_render_pdf_second_pass.py

Purpose: Re-render PDF without handwriting placeholders (replaced with blank spaces) to prepare for handwriting image insertion in stage 12.

Key Functions:

  • main(): Main rendering orchestrator
  • render_pdf_playwright_async(): Async Playwright rendering
  • remove_handwriting_placeholders_from_html(): Strip handwriting elements

Process:

  1. Load HTML from render_html/ (stage 04)
  2. Remove all elements with data-handwriting attribute
  3. Re-render PDF using same dimensions from stage 04
  4. Validate PDF output
  5. Save PDFs without handwriting placeholders

Inputs:

  • HTML from render_html/ (stage 04)
  • Page dimensions from stage 04 logs

Outputs:

pdf_without_handwriting_placeholder/
  {doc_id}.pdf              # PDFs with handwriting regions blank
logs/rendering_second_pass/ # Rendering logs

Key Differences from Stage 04:

  • Uses Pre-calculated Dimensions: No content measurement needed
  • Handwriting Elements Removed: Not just hidden (visibility: hidden), but removed from DOM
  • No Geometry Extraction: Geometries already saved in stage 04
  • Faster Rendering: No JavaScript injection for measurement

HTML Preprocessing:

<!-- Before (Stage 04) -->
<div data-handwriting="author1" class="handwriting">John Smith</div>

<!-- After (Stage 11) -->
<!-- Element completely removed -->

Why This Stage Exists: Handwriting placeholders in HTML use system fonts, which don't match the diffusion-generated handwriting. Removing them creates blank spaces where handwriting images will be inserted in stage 12.
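
The stripping step can be sketched with a regex (a simplified illustration; the function name is hypothetical, and the real remove_handwriting_placeholders_from_html may operate on a parsed DOM instead, which is more robust than a regex):

```python
import re

def remove_handwriting_placeholders(html: str) -> str:
    """Remove non-nested elements carrying a data-handwriting attribute.

    Sketch only: sufficient for flat placeholder <div>s like the example
    above; nested elements would need a DOM-based pass.
    """
    # [^>]* keeps the attribute search inside a single opening tag.
    pattern = r'<(\w+)[^>]*\bdata-handwriting=[^>]*>.*?</\1>'
    return re.sub(pattern, "", html, flags=re.DOTALL)
```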

Rendering Configuration:

  • Same timeout and retry logic as stage 04
  • Reuses Playwright browser context for efficiency
  • No debug output (geometries not extracted)

Error Handling:

  • Multi-page PDF → Flagged and skipped
  • Rendering timeout → Retry with increased timeout
  • HTML parsing error → Logged, document marked invalid

Stage 12: Insert Handwriting Images

File: pipeline_12_insert_handwriting_images.py

Purpose: Overlay generated handwriting images onto PDF pages using PyMuPDF, with precise positioning and natural variation.

Key Functions:

  • main(): Main insertion orchestrator
  • insert_handwriting_into_pdf(): Per-document insertion
  • scale_image_with_aspect_ratio(): Resize images while preserving aspect ratio
  • group_bboxes_by_line(): Group multi-word handwriting by line

Process:

  1. Load PDFs from stage 11
  2. Load handwriting images from stage 09
  3. Load handwriting definitions (bbox mappings) from stage 07
  4. For each handwriting region:
    • Group bboxes by line/block
    • Scale images to fit bboxes
    • Apply random offsets for natural variation
    • Insert images at calculated positions
  5. Save PDFs with handwriting

Inputs:

  • PDFs from pdf_without_handwriting_placeholder/ (stage 11)
  • Handwriting images from handwriting_images/ (stage 09)
  • Handwriting definitions from handwriting_definitions/ (stage 07)

Outputs:

pdf_with_handwriting/
  {doc_id}.pdf              # PDFs with handwriting inserted
logs/handwriting_insertion/ # Insertion logs
debug/handwriting_insertion/ # Debug PDFs with bbox overlays (optional)

Insertion Logic:

1. Image Scaling:

  • High-res scaling: 3x upsampling before insertion
  • Aspect ratio preserved
  • Left-aligned within bbox (respects layout rect)

2. Positioning:

  • X coordinate: Left edge of bbox + random offset
  • Y coordinate: Top edge of bbox + random offset
  • Multi-word lines: Consistent Y offset for line

3. Random Offsets:

x_offset = random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X)
y_offset = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y)

4. Block/Line Grouping: For multi-word handwriting (e.g., "John Smith"):

  • Group bboxes by Y coordinate (same line)
  • Apply consistent Y offset to entire line
  • Individual X offsets per word for natural spacing
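
A minimal sketch of the grouping and offset logic (illustrative; the real group_bboxes_by_line may use different tolerances, and the constants mirror the configuration values listed in this stage):

```python
import random

# Per the configuration constants for this stage (±2 / ±1).
MAX_HANDWRITING_RAND_X = 2.0
MAX_HANDWRITING_RAND_Y = 1.0

def group_bboxes_by_line(bboxes, y_tol=2.0):
    """Group word bboxes whose top edges lie within y_tol of each other."""
    lines = []
    for box in sorted(bboxes, key=lambda b: (b["y0"], b["x0"])):
        for line in lines:
            if abs(line[0]["y0"] - box["y0"]) <= y_tol:
                line.append(box)
                break
        else:
            lines.append([box])
    return lines

def line_offsets(line):
    """One shared Y offset per line, an individual X offset per word."""
    y_off = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y)
    return [
        (random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X), y_off)
        for _ in line
    ]
```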

Key Features:

  • High-Resolution Insertion: 3x scaling for quality
  • Natural Variation: Random offsets simulate handwriting imperfection
  • Line Consistency: Multi-word lines maintain baseline
  • Aspect Ratio Preservation: Images not distorted
  • Transparency Support: PNG alpha channel preserved

Configuration Constants:

  • HANDWRITING_IMAGE_UPSCALE_FACTOR: 3x
  • MAX_HANDWRITING_RAND_X: ±2 pixels
  • MAX_HANDWRITING_RAND_Y: ±1 pixel
  • HANDWRITING_LINE_Y_CONSISTENCY: Same Y offset per line

Coordinate System:

  • PDF uses 72 DPI (points)
  • Bboxes from stage 07 are in PDF coordinates
  • No conversion needed

Error Handling:

  • Missing handwriting image → Warning, bbox skipped
  • Image too large for bbox → Scaled down with warning
  • PyMuPDF insertion failure → Logged, document flagged

Stage 13: Insert Visual Elements

File: pipeline_13_insert_visual_elements.py

Purpose: Overlay visual element images (stamps, logos, barcodes, etc.) onto PDF pages with precise positioning and rotation.

Key Functions:

  • main(): Main insertion orchestrator
  • insert_visual_elements_into_pdf(): Per-document insertion
  • scale_image_with_aspect_ratio(): Resize images (same as stage 12)

Process:

  1. Load PDFs from stage 12
  2. Load visual element images from stage 10
  3. Load visual element definitions from stage 08
  4. For each visual element:
    • Scale image to fit bbox
    • Calculate centered position
    • Apply rotation (if specified)
    • Insert image at calculated position
  5. Save final PDFs
  6. If no visual elements: copy PDF from stage 12

Inputs:

  • PDFs from pdf_with_handwriting/ (stage 12)
  • Visual element images from visual_element_images/ (stage 10)
  • Visual element definitions from visual_element_definitions/ (stage 08)

Outputs:

pdf_final/
  {doc_id}.pdf              # Final PDFs with all elements
logs/visual_element_insertion/ # Insertion logs
debug/visual_element_insertion/ # Debug PDFs (optional)

Insertion Logic:

1. Image Scaling:

  • High-res scaling: 3x upsampling
  • Aspect ratio preserved
  • Centered within bbox (not left-aligned like handwriting)

2. Positioning:

  • X coordinate: Center of bbox - half image width
  • Y coordinate: Center of bbox - half image height

3. Rotation:

  • Applied via PyMuPDF transformation matrix
  • Rotation around image center
  • Angle from visual element definition
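
The centering arithmetic is straightforward (a sketch; the helper name is illustrative, not the pipeline's actual function):

```python
def centered_insert_point(bbox, img_w, img_h):
    """Top-left insertion point that centers an image of size (img_w, img_h)
    inside a bbox given as [x0, y0, x1, y1] in PDF points."""
    cx = (bbox[0] + bbox[2]) / 2.0
    cy = (bbox[1] + bbox[3]) / 2.0
    return cx - img_w / 2.0, cy - img_h / 2.0
```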

Key Differences from Stage 12 (Handwriting):

  • Centered placement: Visual elements centered in bbox
  • No random offsets: Precise placement for logos/stamps
  • Rotation support: Stamps often rotated for "APPROVED" effect
  • Fallback: Copies PDF if no visual elements (ensures output exists)

Rotation Transformation:

# PyMuPDF rotation matrix
rotation_matrix = fitz.Matrix(rotation_angle)
image_rect = image_rect * rotation_matrix

Key Features:

  • High-Resolution Insertion: 3x scaling for quality
  • Centered Alignment: Visual elements centered in bboxes
  • Rotation Support: Arbitrary angles for stamps
  • Transparency Preservation: PNG alpha channel maintained
  • Fallback Handling: Copies PDF if no visual elements

Configuration Constants:

  • VISUAL_ELEMENT_UPSCALE_FACTOR: 3x
  • ROTATION_PRECISION: Angle precision in degrees

Coordinate System:

  • Same as stage 12 (PDF 72 DPI)
  • Rotation applied after positioning

Error Handling:

  • Missing visual element image → Warning, element skipped
  • Image too large for bbox → Scaled down with warning
  • PyMuPDF insertion failure → Logged, document flagged
  • No visual elements → PDF copied from stage 12

Stage 14: Render Image

File: pipeline_14_render_image.py

Purpose: Convert final PDFs to high-quality PNG images for OCR and dataset distribution.

Key Functions:

  • main(): Main conversion orchestrator
  • convert_pdf_to_image(): PDF to PNG conversion

Process:

  1. Load PDFs from stage 13
  2. Convert each PDF to PNG using custom PDF-to-image module
  3. Validate image dimensions
  4. Save images

Inputs:

  • PDFs from pdf_final/ (stage 13)

Outputs:

images/
  {doc_id}.png              # Final document images
logs/image_rendering/       # Conversion logs

Image Specifications:

  • Format: PNG
  • DPI: Configurable (default: 200 DPI for quality OCR)
  • Color Mode: RGB (24-bit)
  • Compression: PNG lossless

PDF-to-Image Module: Conversion is handled by a custom module (not a standard library):

  • Handles PDF rendering at specified DPI
  • Single-page conversion only (multi-page PDFs skipped)
  • Uses PDF coordinate system: 72 DPI internally

Key Features:

  • High DPI: 200+ DPI for accurate OCR
  • Single Page Only: Multi-page PDFs flagged as errors
  • Lossless Compression: PNG preserves all details
  • Size Validation: Checks image dimensions match PDF

Configuration Constants:

  • IMAGE_DPI: 200 (OCR quality vs file size tradeoff)
  • IMAGE_MAX_DIMENSION: Optional max width/height

Coordinate System Conversion:

PDF: 210mm × 297mm @ 72 DPI → 595 × 842 points
PNG: 210mm × 297mm @ 200 DPI → 1654 × 2339 pixels
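
The mapping can be checked with a small helper (illustrative only):

```python
MM_PER_INCH = 25.4

def mm_to_pixels(mm: float, dpi: int) -> int:
    """Convert a physical dimension in millimetres to pixels at a given DPI."""
    return round(mm / MM_PER_INCH * dpi)
```

For A4 (210mm × 297mm) this yields 595 × 842 at 72 DPI and 1654 × 2339 at 200 DPI, matching the figures above.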

Why This Stage:

  • OCR performs better on high-DPI images
  • Images are final output format for datasets
  • PNG preserves quality better than JPEG for text

Error Handling:

  • Multi-page PDF → Flagged and skipped
  • Conversion failure → Logged, document marked invalid
  • Empty image → Error, document flagged
  • Dimension mismatch → Warning logged

Stage 15: Perform OCR

File: pipeline_15_perform_ocr.py

Purpose: Perform Optical Character Recognition on the final images to obtain accurate word- and line-level bounding boxes, which is essential for documents with handwriting or visual elements.

Key Functions:

  • main(): Main OCR orchestrator
  • call_microsoft_ocr(): Microsoft Azure OCR API call
  • convert_ocr_to_bbox_format(): Transform OCR results to internal format
  • aggregate_words_to_segments(): Group words into lines

Process:

  1. Determine which documents need OCR:
    • Has handwriting → Requires OCR
    • Has visual elements → Requires OCR
    • Neither → Copy PDF bboxes from stage 05
  2. For documents requiring OCR:
    • Call Microsoft OCR service
    • Parse word-level bboxes
    • Aggregate into line-level segments
    • Convert coordinates to PDF space
  3. Save final bboxes

Inputs:

  • Images from images/ (stage 14)
  • Handwriting definitions from handwriting_definitions/ (stage 07)
  • Visual element definitions from visual_element_definitions/ (stage 08)
  • PDF bboxes from pdf_word_bboxes/ (stage 05, for non-OCR documents)

Outputs:

final_word_bboxes/
  {doc_id}.json             # Word-level bounding boxes
final_segment_bboxes/
  {doc_id}.json             # Line-level bounding boxes
ocr_results_cache/
  {doc_id}.json             # Raw OCR API responses (cached)
logs/ocr/                   # OCR logs and errors

OCR Decision Logic:

requires_ocr = (
    has_handwriting(doc_id) or 
    has_visual_elements(doc_id)
)

if requires_ocr:
    perform_microsoft_ocr()
else:
    copy_pdf_bboxes()  # Reuse stage 05 results

Why Handwriting/Visual Elements Require OCR:

  • Handwriting images inserted in stage 12 → Not in PDF text layer
  • Visual elements inserted in stage 13 → Not in PDF text layer
  • PDF text extraction (stage 05) misses these elements
  • OCR captures all visible text on rendered image

Microsoft OCR API:

  • Service: Azure Computer Vision (Read API)
  • Input: PNG image (from stage 14)
  • Output: Word polygons, text, confidence scores
  • Caching: Results cached to avoid re-processing

OCR Response Format:

{
  "readResults": [{
    "page": 1,
    "words": [
      {
        "boundingBox": [20, 30, 60, 30, 60, 45, 20, 45],
        "text": "Invoice",
        "confidence": 0.98
      }
    ],
    "lines": [
      {
        "boundingBox": [20, 30, 95, 30, 95, 45, 20, 45],
        "text": "Invoice Date",
        "words": [...]
      }
    ]
  }]
}

Coordinate Conversion: OCR returns pixel coordinates; converted to PDF points:

pdf_x = ocr_x * (pdf_width / image_width)
pdf_y = ocr_y * (pdf_height / image_height)

Segment Aggregation: Groups words into lines based on Y coordinate proximity.
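
A sketch of that aggregation (illustrative; the real aggregate_words_to_segments may use adaptive tolerances):

```python
def aggregate_words_to_segments(words, y_tol=5.0):
    """Merge word bboxes into line-level segments by Y proximity.

    Assumes words is a list of {"x0", "y0", "x1", "y1", "text"} dicts.
    """
    segments = []
    for w in sorted(words, key=lambda w: (w["y0"], w["x0"])):
        for seg in segments:
            if abs(seg["y0"] - w["y0"]) <= y_tol:
                # Grow the segment's bbox and append the word's text.
                seg["x0"] = min(seg["x0"], w["x0"])
                seg["y0"] = min(seg["y0"], w["y0"])
                seg["x1"] = max(seg["x1"], w["x1"])
                seg["y1"] = max(seg["y1"], w["y1"])
                seg["text"] += " " + w["text"]
                break
        else:
            segments.append(dict(w))
    return segments
```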

Key Features:

  • Selective OCR: Only runs OCR when necessary (cost/time savings)
  • Result Caching: Avoids redundant API calls
  • Dual-Level Output: Word and line (segment) bboxes
  • Coordinate Conversion: OCR pixels → PDF points
  • Fallback: Copies PDF bboxes for documents without handwriting/VEs

Configuration:

  • OCR API key from environment variables
  • Timeout: 30 seconds per request
  • Retry: Up to 3 attempts

Error Handling:

  • OCR API failure → Retry, fallback to PDF bboxes if all retries fail
  • Empty OCR result → Warning, uses PDF bboxes
  • Coordinate conversion error → Logged, bbox skipped

Stage 16: Normalize BBoxes

File: pipeline_16_normalize_bboxes.py

Purpose: Convert bounding box coordinates from absolute coordinates (pixels at the rendered image's resolution) to normalized [0, 1] coordinates for model training and evaluation.

Key Functions:

  • main(): Main normalization orchestrator
  • normalize_word_and_segment_bboxes(): Normalize word/segment bboxes
  • normalize_layout_bboxes(): Normalize layout element bboxes (DLA only)
  • normalize_coordinates(): Core coordinate transformation

Process:

  1. Load final bboxes from stage 15
  2. Load image dimensions (for normalization denominators)
  3. For each bbox:
    • Transform: normalized_x = pixel_x / image_width
    • Transform: normalized_y = pixel_y / image_height
    • Preserve text content
  4. Save normalized bboxes
  5. For DLA tasks: also normalize layout element bboxes

Inputs:

  • Word bboxes from final_word_bboxes/ (stage 15)
  • Segment bboxes from final_segment_bboxes/ (stage 15)
  • Layout element definitions from layout_element_definitions/ (stage 06, for DLA)
  • Image dimensions from stage 14 logs

Outputs:

normalized_word_bboxes/
  {doc_id}.json             # Normalized word bboxes
normalized_segment_bboxes/
  {doc_id}.json             # Normalized segment bboxes
normalized_gt/              # For DLA tasks only
  {doc_id}.json             # Normalized layout elements
logs/normalization/         # Normalization logs

Normalization Formula:

normalized_bbox = {
    "x0": bbox["x0"] / image_width,
    "y0": bbox["y0"] / image_height,
    "x1": bbox["x1"] / image_width,
    "y1": bbox["y1"] / image_height,
    "text": bbox["text"]  # Preserved
}

Normalized BBox Format:

[
  {
    "x0": 0.095,    # Was: 20 pixels out of 210mm @ 200dpi
    "y0": 0.128,    # Was: 30 pixels
    "x1": 0.286,    # Was: 60 pixels
    "y1": 0.192,    # Was: 45 pixels
    "text": "Invoice"
  }
]

DLA-Specific Normalization: For Document Layout Analysis tasks, layout element bboxes are also normalized:

[
  {
    "label": "title",
    "bbox": [0.095, 0.128, 0.905, 0.192]  # [x0, y0, x1, y1]
  },
  {
    "label": "text",
    "bbox": [0.095, 0.213, 0.905, 0.534]
  }
]

Why Normalization:

  • Model Training: Most models expect [0, 1] coordinates
  • Resolution Independence: Works across different image sizes
  • Standard Format: Matches common dataset formats (e.g., LayoutLM)

Coordinate System Mapping:

PDF Points (72 DPI):
  595 × 842 points (A4)
  ↓ (stage 14 conversion @ 200 DPI)
Image Pixels (200 DPI):
  1654 × 2339 pixels
  ↓ (stage 16 normalization)
Normalized [0, 1]:
  0.0-1.0 × 0.0-1.0

Key Features:

  • Preserves Text: Text content unchanged during normalization
  • Task-Specific: DLA tasks get additional layout bbox normalization
  • Validation: Checks for out-of-bounds coordinates (clamps to [0, 1])
  • Precision: Full float precision maintained
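
The transform plus clamping can be sketched as (illustrative; the helper name is not the pipeline's actual function):

```python
def normalize_bbox(bbox, image_width, image_height):
    """Normalize a pixel-space bbox to [0, 1], clamping out-of-bounds values."""
    if image_width <= 0 or image_height <= 0:
        # Degenerate dimensions invalidate the document.
        raise ValueError("image dimensions must be positive")
    clamp = lambda v: min(max(v, 0.0), 1.0)
    return {
        "x0": clamp(bbox["x0"] / image_width),
        "y0": clamp(bbox["y0"] / image_height),
        "x1": clamp(bbox["x1"] / image_width),
        "y1": clamp(bbox["y1"] / image_height),
        "text": bbox["text"],  # text content preserved unchanged
    }
```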

Error Handling:

  • Out-of-bounds coordinates → Clamped to [0, 1] with warning
  • Missing image dimensions → Error, document skipped
  • Zero image dimensions → Error, document marked invalid

Stage 17: GT Preparation & Verification

File: pipeline_17_gt_preparation_verification.py

Purpose: Validate and prepare final ground truth annotations with fuzzy text matching, task-specific formatting, and comprehensive validation.

Key Functions:

  • main(): Main verification orchestrator
  • route_task(): Route to task-specific handler
  • handle_qa(): QA ground truth processing
  • handle_kie(): KIE ground truth with BIO tagging
  • handle_dla(): DLA ground truth processing
  • fuzzy_match_text(): Levenshtein-based text matching

Process:

  1. Load raw GT from stage 03/06
  2. Load final word bboxes from stage 15
  3. Route to task-specific handler
  4. Perform fuzzy text matching (GT text → OCR text)
  5. Map GT annotations to bbox indices
  6. Apply task-specific formatting
  7. Validate and save verified GT

Inputs:

  • Raw GT from raw_annotations/ (stage 03/06)
  • Word bboxes from final_word_bboxes/ (stage 15)
  • Layout elements from layout_element_definitions/ (stage 06, for DLA)
  • Visual elements from visual_element_definitions/ (stage 08, for DLA)

Outputs:

verified_gt/
  qa/
    {doc_id}.json           # QA format
  kie/
    {doc_id}.json           # KIE format with BIO tagging
  dla/
    {doc_id}.json           # DLA format with normalized bboxes
  cls/
    {doc_id}.json           # Classification format
logs/gt_verification/       # Verification logs with match statistics

QA (Question Answering) Format

Process:

  1. Load QA pairs: {"question": "answer", ...}
  2. For each answer:
    • Find words in bboxes matching answer text (fuzzy)
    • Record bbox indices of matching words
  3. Save verified GT with bbox mappings

Output Format:

{
  "questions": [
    {
      "question": "What is the invoice number?",
      "answer": "INV-12345",
      "answer_bbox_indices": [15, 16]  # Word indices in final_word_bboxes
    },
    {
      "question": "What is the total amount?",
      "answer": "$1,234.56",
      "answer_bbox_indices": [42]
    }
  ]
}

Fuzzy Matching: Uses Levenshtein distance with 0.85 similarity cutoff:

similarity = fuzz.ratio(gt_answer, ocr_text) / 100.0
if similarity >= 0.85:
    match_found = True
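
A dependency-free stand-in using stdlib difflib (its ratios differ slightly from the Levenshtein-based fuzz.ratio, but the cutoff logic is the same):

```python
from difflib import SequenceMatcher

def fuzzy_match_text(gt_text: str, ocr_text: str, cutoff: float = 0.85) -> bool:
    """Return True when the two strings are at least `cutoff` similar."""
    similarity = SequenceMatcher(None, gt_text, ocr_text).ratio()
    return similarity >= cutoff
```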

KIE (Key Information Extraction) Format

Process:

  1. Load entity annotations: {entity_name: text_value, ...}
  2. For each entity:
    • Find words matching entity value (fuzzy)
    • Generate BIO tags for all words
  3. Save verified GT with BIO tagging

Output Format:

{
  "entities": [
    {
      "entity": "company_name",
      "value": "Acme Corporation",
      "bbox_indices": [5, 6]
    },
    {
      "entity": "invoice_date",
      "value": "January 15, 2024",
      "bbox_indices": [20, 21, 22]
    }
  ],
  "word_labels": [
    "O", "O", "O", "O", "O",           # Words 0-4: Outside entities
    "B-company_name", "I-company_name", # Words 5-6: Company name
    "O", "O", ...,                      # Words 7-19: Outside
    "B-invoice_date", "I-invoice_date", "I-invoice_date"  # Words 20-22
  ]
}

BIO Tagging:

  • B-entity: Beginning of entity
  • I-entity: Inside entity (continuation)
  • O: Outside any entity
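
Tag generation from the bbox indices can be sketched as follows (assumes each entity's bbox_indices are sorted and contiguous, as in the format above; the helper name is illustrative):

```python
def generate_bio_labels(num_words, entities):
    """Produce one BIO label per word from entity bbox indices."""
    labels = ["O"] * num_words
    for ent in entities:
        for pos, idx in enumerate(ent["bbox_indices"]):
            # First word of the span gets B-, continuations get I-.
            labels[idx] = ("B-" if pos == 0 else "I-") + ent["entity"]
    return labels
```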

DLA (Document Layout Analysis) Format

Process:

  1. Load layout element definitions from stage 06
  2. Load visual element definitions from stage 08
  3. Validate labels (must be in valid_labels)
  4. Check spatial constraints (no containment, minimal overlap)
  5. Merge visual elements into layout annotations
  6. Normalize bboxes to [0, 1]
  7. Save verified GT

Output Format:

{
  "layout_elements": [
    {
      "id": "layout_0",
      "label": "title",
      "bbox": [0.095, 0.128, 0.905, 0.192]  # Normalized [x0, y0, x1, y1]
    },
    {
      "id": "layout_1",
      "label": "text",
      "bbox": [0.095, 0.213, 0.905, 0.534]
    },
    {
      "id": "ve0",
      "label": "figure",  # Visual element mapped to layout label
      "bbox": [0.714, 0.895, 0.905, 0.980]
    }
  ]
}

Visual Element Merging: Visual elements from stage 08 are converted to layout labels:

  • stamp → figure (or custom mapping)
  • logo → figure
  • chart → figure
  • barcode → figure
  • photo → figure

Spatial Validation:

  • No containment: One bbox fully inside another → Error
  • Minimal overlap: Overlap area must stay below 5% of the smaller bbox → Warning if exceeded
  • Valid labels: All labels must be in SynDatasetDefinition.valid_labels
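
The containment and overlap checks can be sketched as follows (bboxes as [x0, y0, x1, y1]; a simplified reading of the constraints above, with the 5% threshold applied to the smaller bbox):

```python
def spatial_violation(a, b):
    """Return 'containment', 'overlap', or None for a pair of bboxes."""
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    smaller = min(area(a), area(b))
    if inter > 0 and inter == smaller:
        return "containment"  # one bbox lies fully inside the other -> error
    if inter >= 0.05 * smaller:
        return "overlap"      # overlap reaches 5% of the smaller bbox -> warning
    return None
```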

CLS (Classification) Format

Process:

  1. Load classification label from raw GT
  2. Validate label against expected classes
  3. Save verified GT

Output Format:

{
  "document_class": "invoice",
  "confidence": 1.0
}

Fuzzy Matching Details:

Uses Levenshtein distance (via fuzz library) to handle OCR discrepancies:

# Example: GT "INV-12345" vs OCR "INV-I2345" (OCR error)
similarity = fuzz.ratio("INV-12345", "INV-I2345") / 100.0
# similarity = 0.89 (above 0.85 threshold) → Match!

Match Statistics Logged:

  • Total GT annotations
  • Successfully matched annotations
  • Failed matches (similarity < 0.85)
  • Average similarity score

Key Features:

  • Fuzzy Matching: Handles OCR errors gracefully
  • Task-Specific Formatting: QA, KIE, DLA, CLS all handled differently
  • BIO Tagging: Automatic generation for KIE
  • Visual Element Integration: DLA merges visual elements as layout annotations
  • Spatial Validation: Detects overlapping/contained layout elements
  • Similarity Tracking: Logs match quality for analysis

Error Handling:

  • Match failure (similarity < 0.85) → Logged, annotation skipped
  • Invalid labels (DLA) → Error, document marked invalid
  • Spatial violations (DLA) → Warning, elements flagged
  • Missing bboxes → Error, document marked invalid

Stage 18: Analyze

File: pipeline_18_analyze.py

Purpose: Generate comprehensive statistics, cost analysis, and error categorization for the entire dataset generation process.

Key Functions:

  • main(): Main analysis orchestrator
  • calculate_api_costs(): Compute LLM API costs from token usage

Process:

  1. Load all document logs from stages 01-17
  2. Categorize documents: valid vs. invalid
  3. Calculate error distributions
  4. Compute API usage and costs
  5. Generate statistics (handwriting, visual elements, annotations)
  6. Save comprehensive dataset log

Inputs:

  • All document logs from previous stages
  • Batch results from stage 02 (for cost calculation)
  • Message processing logs from stage 03
  • Prompt usage statistics

Outputs:

dataset_log.json            # Comprehensive dataset statistics
logs/analysis/              # Analysis logs

Dataset Log Structure:

{
  "metadata": {
    "syndatadef_name": "docvqa_alpha=1.0",
    "task": "qa",
    "total_documents_requested": 1000,
    "generation_date": "2026-02-07"
  },
  
  "prompting": {
    "total_prompts": 100,
    "total_batches": 10,
    "llm_model": "claude-sonnet-4-20250514"
  },
  
  "total_cost_summary": {
    "total_cost_usd": 123.45,
    "input_tokens": 1500000,
    "output_tokens": 800000,
    "cached_tokens": 500000,
    "cost_per_document": 0.12
  },
  
  "valid_samples_stats": {
    "total_valid": 847,
    "total_invalid": 153,
    "validity_rate": 0.847,
    "avg_handwriting_regions_per_doc": 2.3,
    "avg_visual_elements_per_doc": 1.5,
    "avg_annotations_per_doc": 8.7,
    "documents_with_handwriting": 654,
    "documents_with_visual_elements": 512
  },
  
  "valid_samples": [
    {
      "doc_id": "doc_0001",
      "seed_image": "docvqa_train_12345",
      "has_handwriting": true,
      "has_visual_elements": true,
      "num_annotations": 10,
      "num_words": 247,
      "image_size": [1654, 2339]
    }
  ],
  
  "valid_samples_by_category": {
    "invoice": 234,
    "receipt": 198,
    "form": 415
  },
  
  "errors": {
    "multipage_pdf": 12,
    "missing_ocr_result": 5,
    "failed_gt_verification": 38,
    "rendering_timeout": 8,
    "llm_parsing_error": 23,
    "bbox_extraction_failed": 4,
    "handwriting_generation_failed": 7,
    "visual_element_generation_failed": 3,
    "other": 53
  }
}

Error Categories:

Error Category                     Description
multipage_pdf                      PDF rendered with multiple pages (invalid)
missing_ocr_result                 OCR failed or returned empty result
failed_gt_verification             GT matching/validation failed in stage 17
rendering_timeout                  PDF rendering exceeded timeout
llm_parsing_error                  Failed to extract HTML/GT from LLM response
bbox_extraction_failed             PyMuPDF failed to extract bboxes
handwriting_generation_failed      Diffusion model failed
visual_element_generation_failed   VE generation/selection failed
other                              Miscellaneous errors

Cost Calculation:

Claude API Pricing (example):

costs = {
    "input": input_tokens * 0.003 / 1000,      # $3 per 1M tokens
    "output": output_tokens * 0.015 / 1000,    # $15 per 1M tokens
    "cached": cached_tokens * 0.0003 / 1000    # $0.30 per 1M tokens (prompt caching)
}
total_cost = costs["input"] + costs["output"] + costs["cached"]

Statistics Computed:

  • Validity Rate: Percentage of documents passing all stages
  • Handwriting Stats: Documents with handwriting, avg regions per doc
  • Visual Element Stats: Documents with VEs, avg elements per doc
  • Annotation Stats: Avg QA pairs/KIE entities/DLA elements per doc
  • Token Usage: Input/output/cached token totals
  • Cost Metrics: Total cost, cost per document, cost per valid document

Key Features:

  • Comprehensive Error Tracking: All error categories logged
  • Cost Transparency: Token-level cost breakdown
  • Quality Metrics: Validity rate, avg annotations, etc.
  • Category Breakdown: Valid samples grouped by document type
  • Per-Document Tracking: Each valid document's metadata saved

Usage: This log is essential for:

  • Understanding dataset quality
  • Optimizing pipeline (identify bottlenecks)
  • Cost estimation for future runs
  • Debugging (error distribution analysis)

Stage 19: Create Debug Data

File: pipeline_19_create_debug_data.py

Purpose: Generate comprehensive debug visualizations for manual inspection and quality assurance.

Key Functions:

  • main(): Main debug generator
  • visualize_visual_element_bboxes(): Overlay VE bboxes on PDFs
  • visualize_final_bboxes_on_images(): Overlay OCR bboxes on images
  • visualize_pdf_bboxes(): Overlay PDF bboxes

Process:

  1. Load all generated documents
  2. Create debug subdirectories
  3. For each document:
    • Generate PDF bbox overlays
    • Generate VE bbox overlays
    • Generate handwriting insertion region overlays
    • Generate final OCR bbox overlays on images
  4. Copy raw HTML with debug.js script for browser inspection

Inputs:

  • All intermediate and final outputs from stages 01-18

Outputs:

debug/
  pdf_bboxes/               # PDF word bboxes overlaid (stage 05)
    {doc_id}.pdf
  visual_element_bboxes/    # VE bboxes overlaid (stage 08)
    {doc_id}.pdf
  handwriting_insertion/    # Handwriting regions overlaid (stage 12)
    {doc_id}.pdf
  final_bboxes_on_images/   # OCR bboxes on final images (stage 15)
    {doc_id}.png
  html_with_debug/          # Raw HTML + debug.js
    {doc_id}.html
    debug.js                # Browser-based inspection script

Debug Visualizations:

1. PDF BBoxes (Stage 05):

  • Red rectangles: Word bounding boxes from PyMuPDF
  • Annotated with word text
  • Purpose: Verify PDF text extraction quality

2. Visual Element BBoxes (Stage 08):

  • Blue rectangles: Visual element placeholder regions
  • Annotated with element type (stamp, logo, etc.)
  • Purpose: Verify VE extraction and positioning

3. Handwriting Insertion Regions (Stage 12):

  • Green rectangles: Handwriting bbox regions
  • Annotated with handwriting text
  • Purpose: Verify handwriting placement accuracy

4. Final BBoxes on Images (Stage 15):

  • Orange rectangles: OCR word bounding boxes
  • Overlaid on final rendered images
  • Purpose: Verify OCR accuracy and coverage

Debug JavaScript (debug.js):

// Browser-based inspection tool
// Features:
// - Highlight elements on hover
// - Show geometry data in console
// - Toggle element visibility
// - Measure element dimensions

Key Features:

  • Color-Coded Overlays: Different colors for different stages
  • Text Annotations: Bboxes labeled with content
  • Multi-Format: Both PDF and PNG visualizations
  • Browser Inspection: HTML with interactive debug script
  • Selective Generation: Only enabled with DEBUG_MODE=true flag

Configuration:

  • DEBUG_MODE: Enable/disable debug output (default: false)
  • DEBUG_BBOX_LINE_WIDTH: Line thickness for overlays (default: 2)
  • DEBUG_BBOX_OPACITY: Overlay transparency (default: 0.5)

When to Use:

  • Visual quality assurance
  • Debugging bbox extraction issues
  • Verifying handwriting/VE insertion
  • Identifying OCR problems
  • Manual inspection of edge cases

Performance Note: Debug generation adds ~20% processing time; disabled by default in production.


API Implementation

The DocGenie API provides a FastAPI-based REST service for synchronous document generation, integrating pipeline stages 01-06.

Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                         API ARCHITECTURE                         │
└──────────────────────────────────────────────────────────────────┘

Client Request
     │
     ↓
┌─────────────────────┐
│   FastAPI Server    │
│   (main.py)         │
└─────────────────────┘
     │
     ├──► Validate Request (schemas.py)
     │
     ├──► Download Seed Images (utils.py)
     │
     ├──► Build Prompt (utils.py)
     │
     ├──► Call Claude API (Synchronous)
     │    └──► Claude Sonnet 4.5
     │
     ├──► Extract HTML & GT (pipeline_03 functions)
     │
     ├──► Render PDF (pipeline_04 functions)
     │    └──► Playwright/Chromium
     │
     ├──► Extract BBoxes (pipeline_05 functions)
     │    └──► PyMuPDF
     │
     └──► Return Response
          └──► JSON with base64-encoded PDFs

Endpoints

GET /

Health check endpoint.

Response:

{
  "message": "DocGenie API is running",
  "version": "1.0",
  "status": "healthy"
}

GET /health

Detailed health status.

Response:

{
  "status": "healthy",
  "api_version": "1.0",
  "playwright_available": true,
  "llm_model": "claude-sonnet-4-5-20250929"
}

POST /generate

Generate documents with ground truth annotations.

Request Body (schemas.GenerateDocumentRequest):

{
  "seed_images": [
    "https://example.com/seed1.jpg",
    "https://example.com/seed2.jpg"
  ],
  "prompt_params": {
    "language": "English",
    "doc_type": "business and administrative",
    "gt_type": "Multiple questions and their answers",
    "gt_format": "{\"question\": \"answer\", ...}",
    "num_solutions": 2
  }
}

Request Validation:

  • seed_images: 1-10 URLs (HTTPS only)
  • num_solutions: 1-5 documents
  • All prompt parameters required

Response (schemas.GenerateDocumentResponse):

{
  "success": true,
  "message": "Successfully generated 2 documents",
  "documents": [
    {
      "document_id": "doc_20260207_001",
      "html": "<html>...</html>",
      "css": "body { ... }",
      "ground_truth": {
        "What is the invoice number?": "INV-12345"
      },
      "pdf_base64": "JVBERi0xLjQK...",
      "bboxes": [
        {
          "x0": 20.0,
          "y0": 30.0,
          "x1": 60.0,
          "y1": 45.0,
          "text": "Invoice"
        }
      ],
      "page_width_mm": 210.0,
      "page_height_mm": 297.0
    }
  ],
  "total_documents": 2
}
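
A minimal client sketch using only the standard library (the URL is a placeholder for wherever the API is deployed; helper names are illustrative):

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:8000/generate"  # placeholder deployment URL

payload = {
    "seed_images": ["https://example.com/seed1.jpg"],
    "prompt_params": {
        "language": "English",
        "doc_type": "business and administrative",
        "gt_type": "Multiple questions and their answers",
        "gt_format": '{"question": "answer", ...}',
        "num_solutions": 1,
    },
}

def build_generate_request(url: str = API_URL) -> urllib.request.Request:
    """Build the POST /generate request; pass it to urllib.request.urlopen."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def save_pdfs(response_json: dict, prefix: str = "doc") -> None:
    """Decode pdf_base64 fields from a /generate response to .pdf files."""
    for i, doc in enumerate(response_json["documents"], start=1):
        with open(f"{prefix}_{i:03d}.pdf", "wb") as fh:
            fh.write(base64.b64decode(doc["pdf_base64"]))
```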

POST /generate-files

Generate documents and return as downloadable files.

Request: Same as /generate

Response:

  • Content-Type: application/zip
  • File: ZIP archive containing:
    • doc_001.pdf
    • doc_001_gt.json
    • doc_001_bboxes.json
    • doc_002.pdf
    • ...

File Structure:

generated_documents.zip
├── doc_001.pdf
├── doc_001_gt.json
├── doc_001_bboxes.json
├── doc_001_metadata.json
├── doc_002.pdf
└── ...

Request/Response Schemas

Defined in api/schemas.py:

PromptParameters

class PromptParameters(BaseModel):
    language: str = "English"
    doc_type: str = "business and administrative"
    gt_type: str = "Multiple questions and their answers"
    gt_format: str = '{"question": "answer", ...}'
    num_solutions: int = Field(default=1, ge=1, le=5)

GenerateDocumentRequest

class GenerateDocumentRequest(BaseModel):
    seed_images: List[HttpUrl] = Field(..., min_items=1, max_items=10)
    prompt_params: PromptParameters

BoundingBox

class BoundingBox(BaseModel):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str

DocumentResult

class DocumentResult(BaseModel):
    document_id: str
    html: str
    css: str
    ground_truth: Optional[dict]
    pdf_base64: str
    bboxes: List[BoundingBox]
    page_width_mm: float
    page_height_mm: float

GenerateDocumentResponse

class GenerateDocumentResponse(BaseModel):
    success: bool
    message: str
    documents: List[DocumentResult]
    total_documents: int

API Pipeline Flow

Detailed integration with pipeline stages:

# Simplified API flow (from api/main.py)

@app.post("/generate", response_model=GenerateDocumentResponse)
async def generate_documents(request: GenerateDocumentRequest):
    # 1. Download and encode seed images
    seed_images_base64 = await download_and_encode_images(request.seed_images)
    
    # 2. Build prompt from template
    prompt = build_prompt_from_template(
        seed_images=seed_images_base64,
        params=request.prompt_params
    )
    
    # 3. Call Claude API (synchronous, not batched)
    llm_response = await call_claude_api(
        prompt=prompt,
        model="claude-sonnet-4-5-20250929"
    )
    
    # 4. Extract HTML and GT (from pipeline_03)
    documents_html = extract_html_from_message(llm_response)
    documents_gt = extract_gt_from_html(documents_html)
    
    # 5. Validate HTML (from utils.py)
    validate_html_structure(documents_html)
    
    # 6. Render PDFs (from pipeline_04)
    pdfs = await render_pdfs_with_playwright(documents_html)
    
    # 7. Validate PDFs (from utils.py)
    validate_pdf_pages(pdfs)
    
    # 8. Extract bboxes (from pipeline_05)
    bboxes = extract_bboxes_from_pdfs(pdfs)
    
    # 9. Validate bboxes (from utils.py)
    validate_bbox_completeness(bboxes)
    
    # 10. Encode PDFs to base64
    pdfs_base64 = encode_pdfs_to_base64(pdfs)
    
    # 11. Build response
    return GenerateDocumentResponse(
        success=True,
        documents=[...],
        total_documents=len(documents_html)
    )

Integration with Pipeline Functions

Reused from Pipeline:

  • extract_html_from_message() from pipeline_03
  • extract_gt_from_html() from pipeline_03
  • render_pdf_with_playwright() from pipeline_04
  • extract_bboxes_from_pdf() from pipeline_05
  • Various utilities from docgenie.generation.utils

API-Specific Functions (api/utils.py):

  • download_seed_images(): Fetch images from URLs
  • encode_images_to_base64(): Convert images for API transmission
  • build_prompt_from_template(): Template-based prompt construction
  • call_claude_api_sync(): Synchronous Claude API call (non-batched)
  • encode_pdf_to_base64(): PDF encoding for response
  • validate_html_structure(): HTML validation
  • validate_pdf_pages(): PDF page count/size validation
  • validate_bbox_completeness(): Ensure bboxes extracted
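A minimal stdlib-only sketch of the first two helpers (illustrative; the actual implementations in api/utils.py may use a different HTTP client and add size or content-type checks):

```python
import base64
import urllib.request


def download_seed_images(urls, timeout=30.0):
    """Fetch each seed image URL and return the raw bytes."""
    images = []
    for url in urls:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            images.append(resp.read())
    return images


def encode_images_to_base64(images):
    """Encode raw image bytes for embedding in the Claude API request."""
    return [base64.b64encode(img).decode("ascii") for img in images]
```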

Configuration

Environment Variables (.env)

ANTHROPIC_API_KEY=sk-ant-...            # Required for Claude API
LLM_MODEL=claude-sonnet-4-5-20250929   # Default model
API_PORT=8000                           # Server port
DEBUG_MODE=false                        # Enable debug logging

Prompt Templates

Located in data/prompt_templates/:

Template Structure:

data/prompt_templates/<template_name>/
β”œβ”€β”€ system_prompt.txt                   # System message
β”œβ”€β”€ user_prompt_template.txt            # User message template
└── example_output.html                 # Example for few-shot

Placeholder Substitution:

# In user_prompt_template.txt
"""
Please generate {num_solutions} {doc_type} documents in {language}.

Ground truth format: {gt_format}
Ground truth type: {gt_type}
"""

# Substituted with request.prompt_params
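Conceptually, the substitution is a plain str.format call over the template (a sketch; the real build_prompt_from_template also injects the system prompt and few-shot example):

```python
USER_PROMPT_TEMPLATE = (
    "Please generate {num_solutions} {doc_type} documents in {language}.\n"
    "\n"
    "Ground truth format: {gt_format}\n"
    "Ground truth type: {gt_type}\n"
)


def fill_template(template: str, params: dict) -> str:
    """Substitute prompt parameter values into the template placeholders."""
    return template.format(**params)
```

Braces inside the gt_format value are safe: str.format does not re-scan substituted values, only the template itself.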

CORS Configuration

# In api/main.py
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],        # Open for development
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Authentication & Security

Current State:

  • No built-in authentication: API is open (for development)
  • API Key Management: Claude API key in .env, not passed in requests
  • CORS: Open (allow_origins=["*"]) for development

Production Recommendations:

  • Add API key authentication (e.g., Bearer tokens)
  • Restrict CORS origins to known frontends
  • Rate limiting (e.g., Redis-based)
  • Input sanitization for HTML injection prevention
  • HTTPS only (terminate SSL at reverse proxy)

Performance Considerations

Async Rendering:

  • Playwright rendering uses async/await
  • Concurrent requests supported by FastAPI
  • Semaphore control prevents resource exhaustion

Image Size Limits:

  • Seed images compressed before API transmission
  • Max dimension: 1024px (configurable)
  • JPEG quality: 85% (configurable)
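The compression step can be sketched with Pillow; compress_seed_image is illustrative, with defaults matching the 1024 px / 85% limits above:

```python
import io

from PIL import Image


def compress_seed_image(data: bytes, max_dim: int = 1024, quality: int = 85) -> bytes:
    """Downscale to max_dim on the longest side and re-encode as JPEG."""
    img = Image.open(io.BytesIO(data)).convert("RGB")
    scale = max_dim / max(img.size)
    if scale < 1.0:  # never upscale small seeds
        img = img.resize(
            (max(1, int(img.width * scale)), max(1, int(img.height * scale))),
            Image.LANCZOS,
        )
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```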

Timeouts:

  • PDF render timeout: 60 seconds per document
  • Claude API timeout: 120 seconds
  • Total request timeout: 300 seconds (5 minutes)

Retry Logic:

  • PDF rendering: Up to 3 attempts
  • Claude API: Up to 2 retries on network errors
  • No retry on validation failures
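The retry policy can be expressed as a small wrapper (a sketch; the exception types treated as retriable here are assumptions):

```python
import time


def with_retries(fn, max_attempts=3, delay=2.0,
                 retriable=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient errors with a fixed delay.
    Other exceptions propagate immediately, so validation failures
    are never retried."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise
            time.sleep(delay)
```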

Concurrency:

  • FastAPI default: Multiple workers (configurable)
  • Playwright: Semaphore-controlled (10 concurrent renders)
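The semaphore pattern looks roughly like this (render_one's body is a placeholder for the actual Playwright call):

```python
import asyncio

CHROMIUM_CONCURRENCY = 10  # documented default

_render_semaphore = asyncio.Semaphore(CHROMIUM_CONCURRENCY)


async def render_one(html: str) -> bytes:
    """At most CHROMIUM_CONCURRENCY renders hold the semaphore at once."""
    async with _render_semaphore:
        await asyncio.sleep(0)  # placeholder for the Playwright render
        return html.encode()


async def render_all(documents_html):
    return await asyncio.gather(*(render_one(h) for h in documents_html))
```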

Limitations

Current Limitations:

  1. Single-Page Only: Multi-page PDFs flagged as errors
  2. Seed Image Limit: Maximum 10 seed images per request
  3. Document Limit: Maximum 5 document variations (num_solutions)
  4. No Handwriting/Visual Elements: Stages 07-13 not integrated (API stops at stage 06)
  5. Synchronous LLM: No batching (higher cost per document)
  6. No Dataset Export: No /export-dataset endpoint for full pipeline runs

Known Issues:

  • Large documents (>10 pages worth of content) may timeout
  • Complex CSS (animations, 3D transforms) may not render correctly
  • Some Unicode characters may not display in PDFs

Future Integration Plan

From api/PIPELINE_INTEGRATION.md:

Stage 3: Handwriting & Visual Elements (Stages 07-11)

New Request Parameters:

class PromptParameters(BaseModel):
    # ... existing ...
    enable_handwriting: bool = False
    handwriting_ratio: float = Field(default=0.2, ge=0.0, le=1.0)
    enable_visual_elements: bool = False
    visual_element_types: List[str] = ["stamp", "logo"]

New Response Fields:

class DocumentResult(BaseModel):
    # ... existing ...
    handwriting_regions: Optional[List[HandwritingRegion]]
    visual_elements: Optional[List[VisualElement]]

Impact:

  • Longer processing time (diffusion model: ~5s per handwriting region)
  • Larger response size (additional images)

Stage 4: Image Finalization & OCR (Stages 12-15)

New Response Fields:

class DocumentResult(BaseModel):
    # ... existing ...
    image_base64: str                    # Final rendered image (PNG)
    ocr_text: str                        # Full OCR text
    ocr_confidence: float                # Average OCR confidence

Impact:

  • OCR API costs (~$1.50 per 1000 images)
  • Additional 2-3 seconds per document (OCR latency)

Stage 5: Dataset Packaging (Stages 16-19)

New Endpoint:

@app.post("/export-dataset")
async def export_dataset(request: ExportDatasetRequest):
    """
    Run full pipeline (stages 01-19) and return packaged dataset.
    """
    # Run pipeline with syndatadef
    # Return ZIP with images, GTs, statistics

Request:

class ExportDatasetRequest(BaseModel):
    syndatadef_config: dict              # Full SynDatasetDefinition
    output_format: str = "huggingface"   # "huggingface", "coco", "custom"

Response:

  • ZIP archive with full dataset
  • Includes dataset_log.json from stage 18

Example Usage

Python Client (api/example_usage.py)


import base64
import json
from pathlib import Path

import requests

# API endpoint
API_URL = "http://localhost:8000/generate"

# Prepare request
request_data = {
    "seed_images": [
        "https://example.com/invoice_seed.jpg",
        "https://example.com/receipt_seed.jpg"
    ],
    "prompt_params": {
        "language": "English",
        "doc_type": "invoices and receipts",
        "gt_type": "Multiple questions and their answers",
        "gt_format": '{"question": "answer", ...}',
        "num_solutions": 3
    }
}

# Call API
response = requests.post(API_URL, json=request_data)
result = response.json()

# Process results
if result["success"]:
    for doc in result["documents"]:
        doc_id = doc["document_id"]
        
        # Save PDF
        pdf_data = base64.b64decode(doc["pdf_base64"])
        Path(f"{doc_id}.pdf").write_bytes(pdf_data)
        
        # Save GT
        Path(f"{doc_id}_gt.json").write_text(
            json.dumps(doc["ground_truth"], indent=2)
        )
        
        # Save HTML
        Path(f"{doc_id}.html").write_text(doc["html"])
        
        print(f"Saved: {doc_id}")
        print(f"  - BBoxes: {len(doc['bboxes'])}")
        print(f"  - GT Annotations: {len(doc['ground_truth'])}")

cURL Example

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "seed_images": [
      "https://example.com/seed.jpg"
    ],
    "prompt_params": {
      "language": "English",
      "doc_type": "business documents",
      "gt_type": "Questions and answers",
      "gt_format": "{\"question\": \"answer\"}",
      "num_solutions": 1
    }
  }'

JavaScript/TypeScript Client

const response = await fetch('http://localhost:8000/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    seed_images: ['https://example.com/seed.jpg'],
    prompt_params: {
      language: 'English',
      doc_type: 'invoices',
      gt_type: 'Questions and answers',
      gt_format: '{"question": "answer"}',
      num_solutions: 2
    }
  })
});

const result = await response.json();

// Decode PDF
const pdfBlob = new Blob(
  [Uint8Array.from(atob(result.documents[0].pdf_base64), c => c.charCodeAt(0))],
  { type: 'application/pdf' }
);

// Download PDF
const url = URL.createObjectURL(pdfBlob);
const a = document.createElement('a');
a.href = url;
a.download = `${result.documents[0].document_id}.pdf`;
a.click();

Testing

Test File: api/test_api.py

import pytest
from fastapi.testclient import TestClient
from api.main import app

client = TestClient(app)

def test_health_check():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

def test_generate_documents():
    request_data = {
        "seed_images": ["https://example.com/seed.jpg"],
        "prompt_params": {
            "language": "English",
            "doc_type": "invoices",
            "gt_type": "Questions",
            "gt_format": "{\"q\": \"a\"}",
            "num_solutions": 1
        }
    }
    response = client.post("/generate", json=request_data)
    assert response.status_code == 200
    result = response.json()
    assert result["success"] is True
    assert len(result["documents"]) > 0

def test_invalid_request():
    request_data = {
        "seed_images": [],  # Invalid: empty
        "prompt_params": {"num_solutions": 10}  # Invalid: > 5
    }
    response = client.post("/generate", json=request_data)
    assert response.status_code == 422  # Validation error

Run Tests:

cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie
pytest api/test_api.py -v

Deployment

Start Server:

cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie/api
chmod +x start.sh
./start.sh

start.sh Contents:

#!/bin/bash
export $(cat .env | xargs)
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Production Deployment:

# With Gunicorn (multi-worker)
gunicorn main:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 \
  --timeout 300

Docker Deployment:

FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Chromium and its system dependencies for Playwright
RUN playwright install --with-deps chromium

COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Core Models & Utilities

SynDatasetDefinition Model

File: docgenie/generation/models/_syndatadef.py

Purpose: Configuration object for synthetic dataset generation, loaded from YAML files.

Key Attributes:

@dataclass
class SynDatasetDefinition:
    # Dataset metadata
    name: str                           # "docvqa_alpha=1.0"
    task: TaskType                      # TaskType.QA, .KIE, .DLA, .CLS
    base_dataset: str                   # "docvqa", "cord", etc.
    
    # LLM configuration
    llm_model: str                      # "claude-sonnet-4-20250514"
    prompt_template: str                # "DocGenie"
    num_solutions: int                  # Documents per prompt
    
    # Prompting parameters
    language: str                       # "English", "German", etc.
    doc_type: str                       # "business and administrative"
    gt_type: str                        # Task-specific GT description
    gt_format: str                      # Expected GT format
    
    # Dataset parameters
    num_samples: int                    # Total documents to generate
    alpha: float                        # Clustering diversity parameter
    
    # Seed selection
    seeds_per_cluster: int              # Seeds sampled per cluster
    clustering_method: str              # "kmeans", "agglomerative"
    
    # Task-specific
    valid_labels: Optional[List[str]]   # For DLA: ["title", "text", ...]
    
    # Output paths
    output_dir: Path                    # Root output directory

Key Methods:

def get_file_structure(self) -> FileStructure:
    """Returns FileStructure manager for output directories."""

def build_prompt(self, seed_images: List[str]) -> str:
    """Builds prompt from template with parameter substitution."""

def iter_document_logs(self) -> Iterator[DocumentLog]:
    """Iterates over all document logs."""

def update_document_status(self, doc_id: str, status: Status):
    """Updates document status in log."""

@classmethod
def from_yaml(cls, yaml_path: Path) -> "SynDatasetDefinition":
    """Load configuration from YAML file."""

Example YAML:

name: "docvqa_alpha=1.0"
task: "qa"
base_dataset: "docvqa"

llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1

language: "English"
doc_type: "business and administrative documents"
gt_type: "Multiple questions and their answers"
gt_format: '{"question": "answer", ...}'

num_samples: 1000
alpha: 1.0
seeds_per_cluster: 10
clustering_method: "kmeans"

output_dir: "data/datasets/synthesized_datasets/docvqa_alpha=1.0"
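Loading the raw YAML mapping is the first step of from_yaml() (a sketch; the real classmethod also validates fields and resolves output_dir to a Path):

```python
from pathlib import Path

import yaml  # PyYAML


def load_syndatadef_dict(yaml_path: Path) -> dict:
    """Read a dataset definition YAML file into a plain dict."""
    with open(yaml_path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)
```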

PipelineParameters Model

File: docgenie/generation/models/_pipeline.py

Purpose: Runtime parameters for pipeline execution.

Attributes:

@dataclass
class PipelineParameters:
    # Execution control
    start_stage: int = 1                # First stage to execute
    end_stage: int = 19                 # Last stage to execute
    skip_existing: bool = True          # Skip documents with existing outputs
    
    # Parallelization
    max_workers: int = 10               # Concurrent processing
    chromium_concurrency: int = 10      # Parallel PDF renders
    
    # Debug mode
    debug_mode: bool = False            # Enable debug visualizations
    
    # Retry configuration
    max_retries: int = 3                # Retry attempts
    retry_delay: float = 2.0            # Seconds between retries
    
    # Timeouts
    pdf_render_timeout: int = 60        # Seconds
    ocr_timeout: int = 30               # Seconds

FileStructure Model

File: docgenie/generation/models/_file.py

Purpose: Manages directory structure for generated data.

Key Properties:

@dataclass
class FileStructure:
    root: Path                          # Root output directory
    
    # Directory properties (all return Path)
    seeds_directory: Path
    prompt_batches_directory: Path
    message_results_directory: Path
    raw_html_directory: Path
    raw_annotations_directory: Path
    geometries_directory: Path
    pdf_initial_directory: Path
    render_html_directory: Path
    pdf_word_bboxes_directory: Path
    pdf_char_bboxes_directory: Path
    layout_element_definitions_directory: Path
    handwriting_definitions_directory: Path
    visual_element_definitions_directory: Path
    handwriting_images_directory: Path
    visual_element_images_directory: Path
    pdf_without_handwriting_placeholder_directory: Path
    pdf_with_handwriting_directory: Path
    pdf_final_directory: Path
    images_directory: Path
    final_word_bboxes_directory: Path
    final_segment_bboxes_directory: Path
    normalized_word_bboxes_directory: Path
    normalized_segment_bboxes_directory: Path
    verified_gt_directory: Path
    
    # Debug subdirectories
    debug_directory: Path
    debug_pdf_bboxes_directory: Path
    debug_visual_element_bboxes_directory: Path
    debug_handwriting_insertion_directory: Path
    debug_final_bboxes_on_images_directory: Path
    debug_html_with_debug_directory: Path

Key Methods:

def create_all_directories(self):
    """Create all output directories."""

def get_document_path(self, doc_id: str, stage: str) -> Path:
    """Get path for specific document and stage."""

DocumentLog Model

File: docgenie/generation/models/_log.py

Purpose: Document-level metadata and status tracking.

Attributes:

@dataclass
class DocumentLog:
    doc_id: str                         # Unique document ID
    seed_image_id: str                  # Source seed image
    prompt_call_id: str                 # Prompt batch/call ID
    
    status: Status                      # VALID, INVALID, PROCESSING
    
    # Stage completion flags
    has_raw_html: bool = False
    has_raw_gt: bool = False
    has_pdf_initial: bool = False
    has_geometries: bool = False
    has_bboxes: bool = False
    has_handwriting: bool = False
    has_visual_elements: bool = False
    has_final_image: bool = False
    has_ocr_result: bool = False
    has_verified_gt: bool = False
    
    # Statistics
    num_words: int = 0
    num_annotations: int = 0
    num_handwriting_regions: int = 0
    num_visual_elements: int = 0
    
    # Error tracking
    error_stage: Optional[str] = None
    error_message: Optional[str] = None
    error_category: Optional[str] = None
    
    # Timestamps
    created_at: datetime
    updated_at: datetime

Key Methods:

def mark_stage_complete(self, stage: str):
    """Mark pipeline stage as complete."""

def mark_error(self, stage: str, error_msg: str, category: str):
    """Record error and mark document as invalid."""

def is_valid(self) -> bool:
    """Check if document passed all stages."""

BBox Model

File: docgenie/generation/models/_bbox.py

Purpose: Bounding box representation with text content.

Attributes:

@dataclass
class BBox:
    rect: Rect                          # {x0, y0, x1, y1}
    text: str                           # Text content
    metadata: Optional[dict] = None     # Additional data

Key Methods:

@property
def width(self) -> float:
    return self.rect["x1"] - self.rect["x0"]

@property
def height(self) -> float:
    return self.rect["y1"] - self.rect["y0"]

def normalize(self, image_width: float, image_height: float) -> "BBox":
    """Convert to normalized [0, 1] coordinates."""

def unnormalize(self, image_width: float, image_height: float) -> "BBox":
    """Convert from normalized to pixel coordinates."""

def to_dict(self) -> dict:
    """Serialize to dictionary."""

@classmethod
def from_dict(cls, data: dict) -> "BBox":
    """Deserialize from dictionary."""

LayoutBBox Model

File: docgenie/generation/models/_bbox.py

Purpose: Layout element bounding box (for DLA tasks).

Attributes:

@dataclass
class LayoutBBox:
    label: str                          # "title", "text", "table", etc.
    rect: Rect                          # {x0, y0, x1, y1}
    metadata: Optional[dict] = None

Key Methods:

def normalize(self, image_width: float, image_height: float) -> "LayoutBBox":
    """Normalize coordinates."""

def contains(self, other: "LayoutBBox") -> bool:
    """Check if this bbox fully contains another."""

def overlaps(self, other: "LayoutBBox") -> bool:
    """Check if this bbox overlaps with another."""

def overlap_area(self, other: "LayoutBBox") -> float:
    """Calculate overlap area with another bbox."""

Utility Modules

BBox Utilities (utils/bboxes.py)

def load_bboxes_from_file(file_path: Path) -> List[BBox]:
    """Load bboxes from JSON file."""

def save_bboxes_to_file(bboxes: List[BBox], file_path: Path):
    """Save bboxes to JSON file."""

def visualize_bboxes_on_pdf(
    pdf_path: Path,
    bboxes: List[BBox],
    output_path: Path,
    color: str = "red"
):
    """Draw bbox overlays on PDF."""

def visualize_bboxes_on_image(
    image_path: Path,
    bboxes: List[BBox],
    output_path: Path,
    color: str = "orange"
):
    """Draw bbox overlays on image."""

def check_bbox_containment(bbox1: BBox, bbox2: BBox) -> bool:
    """Check if bbox1 contains bbox2."""

Geometry Utilities (utils/geos.py)

def filter_layout_elements(geometries: dict) -> List[dict]:
    """Extract elements with data-label attribute."""

def filter_handwriting_elements(geometries: dict) -> List[dict]:
    """Extract elements with data-handwriting attribute."""

def filter_visual_elements(geometries: dict) -> List[dict]:
    """Extract elements with data-visual-element attribute."""

def filter_by_css_class(geometries: dict, class_name: str) -> List[dict]:
    """Extract elements with specific CSS class."""

Serialization Utilities (utils/serialization.py)

def encode_image_to_base64(image_path: Path) -> str:
    """Encode image file to base64 string."""

def decode_base64_to_image(base64_str: str, output_path: Path):
    """Decode base64 string and save as image."""

def serialize_dataclass(obj: Any) -> dict:
    """Serialize dataclass to dictionary."""

def deserialize_dataclass(data: dict, cls: Type[T]) -> T:
    """Deserialize dictionary to dataclass."""

Configuration & Constants

Key Constants (generation/constants.py)

# Document processing
HTML_PARSER = "html.parser"           # BeautifulSoup parser
PDF_POINT_SCALING = 72 / 96           # CSS pixels (96/inch) to PDF points (72/inch)

# Bounding boxes
BBOX_OVERLAP_THRESHOLD = 0.05         # 5% overlap tolerance
SPATIAL_MATCH_THRESHOLD = 10.0        # Pixel tolerance for matching

# PDF rendering
CHROMIUM_CONCURRENCY = 10             # Parallel renders
PER_PDF_RENDER_TIMEOUT = 60           # Seconds
PER_PDF_RENDER_MAX_RETRIES = 3        # Retry attempts

# Handwriting
MAX_HANDWRITING_CHARS = 7             # Max chars per diffusion generation
HANDWRITING_HEIGHT_PX = 40            # Image height
HANDWRITING_PADDING_PX = 0            # Horizontal padding
DIFFUSION_NUM_INFERENCE_STEPS = 50    # Generation quality
HANDWRITING_IMAGE_UPSCALE_FACTOR = 3  # Insertion scaling
MAX_HANDWRITING_RAND_X = 2            # Random X offset (pixels)
MAX_HANDWRITING_RAND_Y = 1            # Random Y offset (pixels)

# Visual elements
VISUAL_ELEMENT_UPSCALE_FACTOR = 3     # Insertion scaling
STAMP_BORDER_WIDTH = 2                # Stamp border thickness
BARCODE_DPI = 300                     # Barcode image quality

# OCR
IMAGE_DPI = 200                       # Final image DPI
OCR_CONFIDENCE_THRESHOLD = 0.8        # Min confidence

# Ground truth
FUZZY_MATCH_THRESHOLD = 0.85          # Levenshtein similarity cutoff

# Handwriting styles
HANDWRITING_STYLES = [f"writer_{i}" for i in range(100)]  # "writer_0" through "writer_99"

# Visual element types
VISUAL_ELEMENT_TYPES = [
    "stamp", "logo", "barcode", "chart", "photo"
]

# Visual element type mapping (LLM output β†’ standard)
VISUAL_ELEMENT_TYPE_MAPPING = {
    "stamp": "stamp",
    "company_stamp": "stamp",
    "approval_stamp": "stamp",
    "logo": "logo",
    "company_logo": "logo",
    "brand_logo": "logo",
    "barcode": "barcode",
    "code128": "barcode",
    "chart": "chart",
    "graph": "chart",
    "figure": "chart",
    "photo": "photo",
    "image": "photo",
    "picture": "photo"
}
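For example, GT verification compares each annotation against the OCR text using FUZZY_MATCH_THRESHOLD. A sketch using difflib.SequenceMatcher as a stand-in for the pipeline's Levenshtein-based similarity:

```python
from difflib import SequenceMatcher

FUZZY_MATCH_THRESHOLD = 0.85  # documented cutoff


def fuzzy_match(gt_text: str, ocr_text: str) -> bool:
    """True when the two strings are similar enough to count as a match."""
    ratio = SequenceMatcher(None, gt_text.lower(), ocr_text.lower()).ratio()
    return ratio >= FUZZY_MATCH_THRESHOLD
```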

Environment Variables

Required:

ANTHROPIC_API_KEY=sk-ant-...          # Claude API key
MICROSOFT_AZURE_OCR_KEY=...           # Azure OCR key
MICROSOFT_AZURE_OCR_ENDPOINT=...      # Azure OCR endpoint

Optional:

LLM_MODEL=claude-sonnet-4-20250514    # Override model
DEBUG_MODE=false                      # Enable debug output
LOG_LEVEL=INFO                        # Logging verbosity
CHROMIUM_CONCURRENCY=10               # Override concurrency

Error Handling & Debugging

Error Categories

Comprehensive error tracking in stage 18:

| Category | Description | Resolution |
|---|---|---|
| multipage_pdf | PDF rendered with >1 page | Check HTML content size, CSS page breaks |
| missing_ocr_result | OCR API failed or returned empty | Check Azure credentials, retry |
| failed_gt_verification | GT text not found in OCR output | Review fuzzy match threshold, inspect HTML |
| rendering_timeout | PDF render exceeded timeout | Increase timeout, simplify HTML |
| llm_parsing_error | Failed to extract HTML/GT | Review LLM response format, update regex |
| bbox_extraction_failed | PyMuPDF extraction error | Check PDF validity, inspect fonts |
| handwriting_generation_failed | Diffusion model error | Check model checkpoint, GPU availability |
| visual_element_generation_failed | VE creation error | Check prefab directories, image validity |

Debug Visualizations (Stage 19)

Generated debug outputs:

  1. PDF BBoxes: Verify PyMuPDF extraction quality
  2. Visual Element BBoxes: Verify VE extraction and positioning
  3. Handwriting Insertion: Verify handwriting placement
  4. Final BBoxes on Images: Verify OCR accuracy

Enable debug mode:

# In pipeline execution
pipeline_params = PipelineParameters(
    debug_mode=True,
    # ... other params
)

Logging

Log locations:

data/datasets/synthesized_datasets/<dataset_name>/logs/
β”œβ”€β”€ pipeline_01/
β”‚   └── seed_selection.log
β”œβ”€β”€ pipeline_02/
β”‚   └── prompting.log
β”œβ”€β”€ pipeline_03/
β”‚   └── message_processing/
β”‚       β”œβ”€β”€ batch_001.log
β”‚       └── batch_002.log
β”œβ”€β”€ ...
└── pipeline_19/
    └── debug_generation.log

Log format:

[2026-02-07 14:32:15] [INFO] [pipeline_04] Starting PDF rendering for doc_0001
[2026-02-07 14:32:18] [INFO] [pipeline_04] Successfully rendered doc_0001 (3.2s)
[2026-02-07 14:32:18] [ERROR] [pipeline_04] Rendering failed for doc_0002: Timeout

Common Issues & Solutions

Issue: Multi-page PDFs

  • Cause: HTML content exceeds page size
  • Solution: Reduce content, check for long tables, use CSS overflow: hidden

Issue: Handwriting not appearing

  • Cause: Character bboxes not available (stage 05)
  • Solution: Use simpler fonts in HTML, ensure PyMuPDF can extract chars

Issue: OCR missing text

  • Cause: Low image DPI, poor contrast
  • Solution: Increase IMAGE_DPI (stage 14), adjust HTML styling

Issue: GT verification failures

  • Cause: OCR discrepancies, fuzzy match threshold too high
  • Solution: Lower FUZZY_MATCH_THRESHOLD, improve HTML text rendering

Issue: Claude API timeout

  • Cause: Large seed images, complex prompts
  • Solution: Compress seed images, simplify prompt template

Usage Examples

Full Pipeline Execution

from docgenie.generation import pipeline_01_select_seeds
from docgenie.generation.models import SynDatasetDefinition, PipelineParameters

# Load configuration
syndatadef = SynDatasetDefinition.from_yaml(
    "data/syn_dataset_definitions/docvqa_alpha=1.0.yaml"
)

# Configure pipeline
params = PipelineParameters(
    start_stage=1,
    end_stage=19,
    skip_existing=True,
    debug_mode=False,
    max_workers=10
)

# Execute pipeline stages
# Stage modules are named pipeline_NN_<description>, so the full module
# name must be resolved first: importlib.import_module() does not accept
# wildcards.
import importlib
import pkgutil

import docgenie.generation

for stage in range(params.start_stage, params.end_stage + 1):
    print(f"Executing stage {stage:02d}...")

    prefix = f"pipeline_{stage:02d}_"
    module_name = next(
        name
        for _, name, _ in pkgutil.iter_modules(docgenie.generation.__path__)
        if name.startswith(prefix)
    )
    stage_module = importlib.import_module(f"docgenie.generation.{module_name}")
    stage_module.main(syndatadef, params)

    print(f"Stage {stage:02d} complete.")

print("Pipeline execution complete!")

API Usage

See API Implementation section for detailed examples.


Custom Dataset Definition

# data/syn_dataset_definitions/custom_invoices.yaml

name: "custom_invoices_v1"
task: "kie"
base_dataset: "cord"

llm_model: "claude-sonnet-4-20250514"
prompt_template: "ClaudeRefined12"
num_solutions: 1

language: "English"
doc_type: "invoices and receipts"
gt_type: "Key-value pairs (entity extraction)"
gt_format: '{"entity_name": "entity_value", ...}'

num_samples: 500
alpha: 0.75
seeds_per_cluster: 5
clustering_method: "kmeans"

# KIE-specific
valid_labels: null  # Not used for KIE

output_dir: "data/datasets/synthesized_datasets/custom_invoices_v1"

Conclusion

This documentation provides a comprehensive overview of the DocGenie generation pipeline and API. For additional support or questions, refer to:

  • Pipeline Integration Guide: api/PIPELINE_INTEGRATION.md
  • API README: api/README.md
  • Source Code: docgenie/generation/ and api/

Key Takeaways:

  1. 19-Stage Pipeline: Modular design from seed selection to GT verification
  2. Multi-Task Support: QA, KIE, DLA, CLS with task-specific handling
  3. Realistic Documents: LLM-generated content, diffusion handwriting, visual elements
  4. Quality Assurance: Comprehensive validation, OCR verification, error tracking
  5. API Integration: FastAPI service for synchronous document generation (stages 01-06)
  6. Extensibility: Modular code, clear interfaces, easy to extend

Pipeline Strengths:

  • Task-agnostic core with task-specific adapters
  • Extensive logging and error tracking
  • Parallel processing where applicable
  • Debug visualizations at every stage
  • Integration with state-of-the-art models

Future Directions:

  • Full API integration (stages 07-19)
  • Additional document types (forms, legal, medical)
  • Multi-language support expansion
  • Enhanced visual element generation (charts, diagrams)
  • Real-time generation optimization