
DocGenie Generation Pipeline & API Documentation

Version: 1.0
Last Updated: February 7, 2026
Purpose: Comprehensive reference for the DocGenie synthetic document generation system


Table of Contents

  1. Overview
  2. Pipeline Architecture
  3. Pipeline Stages (01-19)
  4. API Implementation
  5. Core Models & Utilities
  6. Configuration & Constants
  7. Usage Examples
  8. Error Handling & Debugging

Overview

DocGenie is a sophisticated 19-stage pipeline for generating synthetic document datasets with ground truth annotations. It supports multiple document understanding tasks:

  • Document Question Answering (QA)
  • Key Information Extraction (KIE)
  • Document Layout Analysis (DLA)
  • Document Classification (CLS)

Key Features

  • LLM-Powered Generation: Uses Claude/Gemini/Open-source models to generate diverse document content
  • Realistic Handwriting: Diffusion model-based handwriting synthesis with author-specific styles
  • Visual Element Integration: Stamps, logos, barcodes, charts, and photos
  • Multi-Task Support: Task-specific ground truth formatting and validation
  • Quality Assurance: Comprehensive validation, OCR verification, and error tracking
  • Modular Design: Each pipeline stage is independently executable with clear inputs/outputs

Technology Stack

  • LLM APIs: Claude (Anthropic), Gemini, DeepSeek, Qwen
  • PDF Rendering: Playwright (Chromium), PyMuPDF
  • OCR: Microsoft Azure OCR
  • Handwriting: Custom diffusion model
  • Image Processing: PIL, OpenCV
  • API Framework: FastAPI
  • Data Processing: Pandas, NumPy

Pipeline Architecture

High-Level Flow

┌──────────────────────────────────────────────────────────────────────────┐
│                        DOCGENIE GENERATION PIPELINE                      │
└──────────────────────────────────────────────────────────────────────────┘

┌──────────────────────┐
│  PHASE 1: SELECTION  │
└──────────────────────┘
    ↓
[01] Select Seeds ────────────► seeds.csv, clusters.csv
                                (Cluster-based diverse seed selection)

┌──────────────────────┐
│ PHASE 2: LLM GEN     │
└──────────────────────┘
    ↓
[02] Prompt LLM ──────────────► batch_results/ (JSON)
    │                           (Claude API batched calls)
    ↓
[03] Process Response ────────► raw_html/, raw_annotations/
                                (Extract HTML & GT from responses)

┌──────────────────────┐
│ PHASE 3: RENDERING   │
└──────────────────────┘
    ↓
[04] Render PDF Initial ──────► pdf_initial/, geometries/
    │                           (HTML→PDF with geometry extraction)
    ↓
[05] Extract BBoxes ──────────► pdf_word_bboxes/, pdf_char_bboxes/
    │                           (PyMuPDF text extraction)
    ↓
[06] Extract Layout ──────────► layout_element_definitions/
                                (DLA/KIE-specific annotations)

┌──────────────────────┐
│ PHASE 4: EXTRACTION  │
└──────────────────────┘
    ↓
[07] Extract Handwriting ─────► handwriting_definitions/
    │                           (Identify handwriting regions)
    ↓
[08] Extract Visual Elements ─► visual_element_definitions/
                                (Stamp/logo/barcode placeholders)

┌──────────────────────┐
│ PHASE 5: GENERATION  │
└──────────────────────┘
    ↓
[09] Create Handwriting ──────► handwriting_images/
    │                           (Diffusion model generation)
    ↓
[10] Create Visual Elements ──► visual_element_images/
                                (Generate/select stamps, logos, etc.)

┌──────────────────────┐
│ PHASE 6: COMPOSITION │
└──────────────────────┘
    ↓
[11] Render PDF (2nd Pass) ───► pdf_without_handwriting_placeholder/
    │                           (Remove handwriting placeholders)
    ↓
[12] Insert Handwriting ──────► pdf_with_handwriting/
    │                           (Overlay handwriting images)
    ↓
[13] Insert Visual Elements ──► pdf_final/
    │                           (Overlay stamps, logos, etc.)
    ↓
[14] Render Image ────────────► images/
                                (PDF→PNG conversion)

┌──────────────────────┐
│ PHASE 7: FINALIZATION│
└──────────────────────┘
    ↓
[15] Perform OCR ─────────────► final_word_bboxes/, final_segment_bboxes/
    │                           (Microsoft OCR)
    ↓
[16] Normalize BBoxes ────────► normalized_word_bboxes/, normalized_segment_bboxes/
                                (Pixel→[0,1] coordinates)

┌──────────────────────┐
│ PHASE 8: VALIDATION  │
└──────────────────────┘
    ↓
[17] GT Preparation ──────────► verified_gt/
    │                           (Fuzzy matching, BIO tagging)
    ↓
[18] Analyze ─────────────────► dataset_log.json
    │                           (Statistics, cost analysis)
    ↓
[19] Create Debug Data ───────► debug/ subdirectories
                                (Visualizations for inspection)

Data Flow Between Stages

Seed Images ──┐
              ├──► [02] ──► HTML + GT ──► [04] ──► PDF + Geometries
Prompt Params ┘                                         │
                                                        ├──► [05] ──► BBoxes
                                                        │               │
                                                        │               ├──► [07] ──► HW Defs ──► [09] ──► HW Images ──┐
                                                        │               │                                              │
                                                        │               └──► [08] ──► VE Defs ──► [10] ──► VE Images ──┤
                                                        │                                                              │
                                                        └──► [11] ──► PDF (no HW) ──┬──► [12] ◄────────────────────────┤
                                                                                    │           (Insert HW)            │
                                                                                    └──► [13] ◄────────────────────────┘
                                                                                             (Insert VE)
                                                                                                  ↓
                                                                                            [14] ──► Image
                                                                                                  ↓
                                                                                            [15] ──► OCR BBoxes
                                                                                                  ↓
                                                                                            [16] ──► Normalized
                                                                                                  ↓
                                                                                            [17] ──► Verified GT

Pipeline Stages (01-19)

Stage 01: Select Seeds

File: pipeline_01_select_seeds.py

Purpose: Select diverse seed images from base dataset using clustering algorithms to ensure variety in the generated documents.

Key Functions:

  • main(): Orchestrates seed selection process
  • downscale_and_compress_seeds(): Prepares seed images for efficient API transmission
  • plot_class_distribution(): Visualizes class balance in selected seeds
  • visualize_cluster_histogram(): Shows distribution across clusters

Process:

  1. Load embeddings from base dataset
  2. Perform clustering (KMeans or other algorithms)
  3. Sample N seeds per cluster
  4. Downscale and compress images (JPEG, max dimension)
  5. Save seed manifest and cluster assignments
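
Step 3 above (sampling N seeds per cluster) can be sketched as follows. `select_seeds_per_cluster` is a hypothetical helper written for illustration, not the pipeline's actual function:

```python
import random
from collections import defaultdict

def select_seeds_per_cluster(cluster_ids, seeds_per_cluster, seed=0):
    """Sample up to `seeds_per_cluster` document indices from each cluster.

    `cluster_ids[i]` is the cluster assignment of document i."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        members[cluster].append(idx)
    selected = []
    for cluster in sorted(members):
        take = min(seeds_per_cluster, len(members[cluster]))
        selected.extend(rng.sample(members[cluster], take))
    return selected
```

Clusters smaller than N contribute all of their members, so the final seed set stays diverse without oversampling small clusters.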

Inputs:

  • SynDatasetDefinition configuration
  • Base dataset name (e.g., docvqa, cord, publaynet)
  • Clustering parameters from constants

Outputs:

seeds.csv                    # Selected seed document IDs per prompt call
clusters.csv                 # Cluster assignments for all documents
seeds/ (directory)           # Preprocessed seed images (JPEG, compressed)

Configuration Parameters:

  • EMBEDDING_MODEL: Specifies which embedding model was used
  • IMAGE_MAX_DIMENSION: Max width/height for compression
  • JPEG_QUALITY: Compression quality (0-100)

Example Usage:

from docgenie.generation import pipeline_01_select_seeds

pipeline_01_select_seeds.main(
    syndatadef_path="data/syn_dataset_definitions/docvqa_alpha=1.0.yaml",
    base_dataset="docvqa"
)

Stage 02: Prompt LLM

File: pipeline_02_prompt_llm.py

Purpose: Send batched prompts to LLM APIs (Claude, Gemini, DeepSeek, Qwen) with seed images to generate document HTML and ground truth.

Key Functions:

  • main(): Main orchestrator for LLM prompting
  • create_batched_messages(): Constructs API-compatible message batches
  • track_batch_completion(): Polls API for batch status
  • Cost calculation utilities in pipeline_01/cost.py

Process:

  1. Load seed images and encode as base64
  2. Build prompts from template with parameter injection
  3. Create batched API requests (Claude Batch API for cost efficiency)
  4. Submit batches and track completion
  5. Save results for processing in stage 03
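
Steps 1-2 can be sketched with a minimal message builder following the Anthropic Messages API content-block shape; `build_image_message` is an illustrative name, not the pipeline's actual helper:

```python
import base64

def build_image_message(prompt_text, image_bytes, media_type="image/jpeg"):
    """Assemble one user message containing an inline base64 seed image
    followed by the rendered prompt text (sketch)."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": prompt_text},
        ],
    }
```

A batch request would then wrap many such messages, one per prompt call, before submission to the Batch API.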

Inputs:

  • Prompt template from data/prompt_templates/<template_name>/
  • Seed images from stage 01
  • API credentials from environment variables
  • SynDatasetDefinition parameters

Outputs:

prompt_batches/              # Batch metadata (batch IDs, status)
message_results/             # JSON response files per batch
logs/                        # Prompting logs and progress

API Configuration:

  • Claude: Uses Batch API with prompt caching for cost efficiency
  • Batch Size: Configurable via BATCH_SIZE constant
  • Polling Interval: Configurable wait time between status checks
  • Model Selection: Specified in SynDatasetDefinition.llm_model

Cost Tracking:

  • Input/output token counts per request
  • Cached token usage (for Claude)
  • Total cost estimation per batch

Example Configuration:

# In syn_dataset_definition YAML
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1  # Documents per prompt
language: "English"
doc_type: "business and administrative"

Stage 03: Process Response

File: pipeline_03_process_response.py

Purpose: Extract and validate HTML documents and ground truth annotations from LLM responses.

Key Functions:

  • main(): Main processor
  • extract_html_from_message(): Regex-based HTML extraction from markdown code blocks
  • extract_gt_from_html(): Parse JSON ground truth from <script id="GT"> tags
  • validate_and_save_gt(): Task-specific validation and formatting

Process:

  1. Load message results from stage 02
  2. Extract HTML content using regex patterns
  3. Parse and validate ground truth JSON
  4. Apply task-specific formatting (QA, KIE, DLA, CLS)
  5. Save raw HTML and annotations separately
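
Step 3 can be sketched as below. The exact pattern used by `extract_gt_from_html()` is an assumption; this version only illustrates pulling JSON out of a `<script id="GT">` tag:

```python
import json
import re

# Matches the ground-truth script tag and captures its JSON body.
GT_SCRIPT = re.compile(r'<script[^>]*id="GT"[^>]*>(.*?)</script>', re.DOTALL)

def extract_gt_from_html(html: str):
    """Return the parsed GT dict, or None when no GT script is present."""
    match = GT_SCRIPT.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))
```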

Inputs:

  • Message results from message_results/ (stage 02)
  • SynDatasetDefinition task type and prompt format

Outputs:

raw_html/                    # HTML files (one per document)
raw_annotations/             # Ground truth JSON files
  qa/                        # QA format: {"question": "answer", ...}
  kie/                       # KIE format: {"entity": "value", ...}
  dla/                       # DLA format: {"element_id": "label", ...}
  cls/                       # CLS format: {"class": "category"}
logs/message_processing/     # Processing logs per message

Ground Truth Formats:

QA Format:

{
  "What is the invoice number?": "INV-12345",
  "What is the total amount?": "$1,234.56"
}

KIE Format:

{
  "company_name": "Acme Corp",
  "invoice_date": "2024-01-15",
  "total_amount": "1234.56"
}

DLA Format:

{
  "layout_0": "title",
  "layout_1": "text",
  "layout_2": "table"
}

Validation Checks:

  • Expected document count matches actual
  • Valid JSON structure in GT
  • Task-specific field presence
  • HTML completeness (valid tags)

Error Handling:

  • Malformed HTML → Logged, skipped
  • Missing GT → Document flagged in logs
  • Invalid JSON → Parsing error logged

Stage 04: Render PDF and Extract Geometries

File: pipeline_04_render_pdf_and_extract_geos.py

Purpose: Convert HTML to PDF with automatic content-based page sizing and extract element geometries (positions, dimensions) for downstream processing.

Key Functions:

  • main(): Async batch rendering orchestrator
  • render_pdf_with_playwright(): Single PDF rendering using Playwright/Chromium
  • preprocess_css(): Remove conflicting CSS @page rules
  • validate_pdf(): Check page count and file integrity
  • extract_geometries(): Parse element positions from injected JavaScript

Process:

  1. Load raw HTML from stage 03
  2. Inject geometry extraction JavaScript
  3. Preprocess CSS (remove fixed page sizes)
  4. Render with Playwright in headless Chromium
  5. Measure content dimensions dynamically
  6. Export PDF with calculated page size
  7. Extract and save element geometries

Inputs:

  • Raw HTML files from raw_html/ (stage 03)

Outputs:

pdf_initial/                 # Initial PDFs
geometries/                  # Element geometry JSON files
  {doc_id}.json              # Contains positions, dimensions, CSS classes
render_html/                 # HTML with geometry extraction scripts
debug/pdf_with_geos/         # Debug PDFs with geometry overlays (optional)
logs/rendering/              # Rendering logs and errors

Geometry JSON Structure:

{
  "page_width_mm": 210.0,
  "page_height_mm": 297.0,
  "elements": [
    {
      "id": "layout_0",
      "type": "div",
      "class": "title",
      "rect": {
        "x": 20.0,
        "y": 30.0,
        "width": 170.0,
        "height": 15.0
      },
      "text": "Invoice",
      "attributes": {
        "data-label": "title",
        "data-handwriting": null,
        "data-visual-element": null
      }
    }
  ]
}

Key Features:

  • Automatic Page Sizing: No fixed dimensions; page size adapts to content
  • Concurrent Rendering: Semaphore-controlled parallel processing
  • Geometry Extraction: JavaScript injected to capture element positions
  • CSS Coordinate Conversion: 96 DPI (CSS) → 72 DPI (PDF)
  • Retry Logic: Up to 3 attempts with configurable timeout
  • Debug Visualizations: Optional overlay of geometries on PDFs

Configuration Constants:

  • CHROMIUM_CONCURRENCY: 10 (parallel render limit)
  • PER_PDF_RENDER_TIMEOUT: 60 seconds
  • PER_PDF_RENDER_MAX_RETRIES: 3 attempts
  • PDF_POINT_SCALING: 72/96 for DPI conversion
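
The coordinate conversion behind PDF_POINT_SCALING is a single multiplication, since CSS lays out at 96 DPI while PDF points are 72 DPI:

```python
# CSS pixels (96 dpi) -> PDF points (72 dpi)
PDF_POINT_SCALING = 72 / 96

def css_px_to_pdf_points(value_px: float) -> float:
    """Convert a CSS pixel measurement to PDF points."""
    return value_px * PDF_POINT_SCALING
```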

Error Handling:

  • Timeout → Retry with increased timeout
  • Multi-page PDFs → Flagged and skipped
  • Missing geometries → Error logged, document marked invalid

Stage 05: Extract BBoxes from PDF

File: pipeline_05_extract_bboxes_from_pdf.py

Purpose: Extract word-level and character-level bounding boxes from PDFs using PyMuPDF for accurate text positioning.

Key Functions:

  • main(): Main extraction orchestrator
  • extract_bboxes_from_pdf(): PyMuPDF-based extraction
  • verify_char_to_word_mapping(): Validate character-to-word relationships
  • create_bbox_debug_pdf(): Generate debug visualizations

Process:

  1. Load PDFs from stage 04
  2. Extract words with PyMuPDF get_text("words")
  3. Extract characters with get_text("rawdict")
  4. Map characters to parent words
  5. Validate mappings (required for handwriting splitting)
  6. Save both word and character bboxes

Inputs:

  • PDFs from pdf_initial/ (stage 04)

Outputs:

pdf_word_bboxes/             # Word-level bounding boxes
  {doc_id}.json              # Format: [x0, y0, x1, y1, text]
pdf_char_bboxes/             # Character-level bounding boxes (if mappable)
  {doc_id}.json              # Format: [x0, y0, x1, y1, char, word_idx]
debug/pdf_bboxes/            # Debug PDFs with bbox overlays (optional)
logs/bbox_extraction/        # Extraction logs

BBox JSON Format:

Word BBoxes:

[
  {"x0": 20.0, "y0": 30.0, "x1": 60.0, "y1": 45.0, "text": "Invoice"},
  {"x0": 65.0, "y0": 30.0, "x1": 95.0, "y1": 45.0, "text": "Date"}
]

Char BBoxes (with word mapping):

[
  {"x0": 20.0, "y0": 30.0, "x1": 25.0, "y1": 45.0, "char": "I", "word_idx": 0},
  {"x0": 25.0, "y0": 30.0, "x1": 30.0, "y1": 45.0, "char": "n", "word_idx": 0}
]
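
Assuming PyMuPDF's `page.get_text("words")` tuple layout `(x0, y0, x1, y1, text, block_no, line_no, word_no)`, the word records above can be produced with a small converter (a sketch, not the pipeline's code):

```python
def words_to_bbox_records(words):
    """Convert PyMuPDF word tuples into word-bbox JSON records.

    Boxes with non-positive width or height are filtered out, mirroring
    the invalid-bbox handling described for this stage."""
    return [
        {"x0": w[0], "y0": w[1], "x1": w[2], "y1": w[3], "text": w[4]}
        for w in words
        if w[2] > w[0] and w[3] > w[1]
    ]
```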

Key Features:

  • Dual-Level Extraction: Both word and character granularity
  • Mapping Validation: Ensures characters can be mapped to words (critical for handwriting)
  • Debug Visualizations: Color-coded bbox overlays on PDFs
  • PDF Coordinate System: Uses 72 DPI (PDF points)

Character-to-Word Mapping: Required for handwriting splitting in stage 07. If mapping fails (e.g., due to ligatures or complex fonts), char bboxes are not saved, and handwriting will use word-level fallback.

Error Handling:

  • Empty PDFs → Skipped with warning
  • Unmappable characters → Word-level fallback
  • Invalid bboxes (negative dimensions) → Filtered out

Stage 06: Extract Layout Element Definitions and Annotation GT

File: pipeline_06_extract_layout_element_definitions_and_annotation_gt.py

Purpose: Extract layout element definitions for Document Layout Analysis (DLA) and annotation-based ground truth for Key Information Extraction (KIE).

Key Functions:

  • main(): Task router
  • handle_dla(): Process DLA tasks
  • handle_kie(): Process KIE tasks
  • parse_layout_elements(): Extract layout elements from geometries
  • parse_kie_fields(): Extract KIE annotated fields from geometries

Process (DLA):

  1. Load geometries from stage 04
  2. Filter elements with data-label attribute
  3. Validate labels against task-specific valid labels
  4. Extract bounding boxes and labels
  5. Save layout element definitions

Process (KIE):

  1. Load geometries from stage 04
  2. Filter elements with data-annotation attribute
  3. Extract entity names and text values
  4. Validate against expected entity schema
  5. Save raw annotations for stage 17

Inputs:

  • Geometries from geometries/ (stage 04)
  • Valid labels from SynDatasetDefinition.valid_labels
  • Task type from SynDatasetDefinition.task

Outputs:

layout_element_definitions/  # For DLA tasks
  {doc_id}.json              # [{id, label, rect}, ...]
raw_annotations/             # For KIE tasks
  kie_annotations/
    {doc_id}.json            # {entity_name: text_value, ...}

Layout Element Definition Format (DLA):

[
  {
    "id": "layout_0",
    "label": "title",
    "rect": {"x": 20.0, "y": 30.0, "width": 170.0, "height": 15.0}
  },
  {
    "id": "layout_1",
    "label": "text",
    "rect": {"x": 20.0, "y": 50.0, "width": 170.0, "height": 50.0}
  }
]
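
The DLA path (filter elements by `data-label`, validate against the allowed labels, reject zero-sized boxes) can be sketched as follows; the function body is illustrative, not the actual `parse_layout_elements()` implementation:

```python
def parse_layout_elements(geometry, valid_labels):
    """Extract DLA layout definitions from a stage-04 geometry dict.

    Returns (elements, errors): validated {id, label, rect} records plus
    human-readable error strings for rejected elements."""
    elements, errors = [], []
    for el in geometry["elements"]:
        label = el.get("attributes", {}).get("data-label")
        if label is None:
            continue  # unlabeled elements are skipped with a warning upstream
        if label not in valid_labels:
            errors.append(f"invalid label {label!r} on {el['id']}")
            continue
        rect = el["rect"]
        if rect["width"] <= 0 or rect["height"] <= 0:
            errors.append(f"zero-sized element {el['id']}")
            continue
        elements.append({"id": el["id"], "label": label, "rect": rect})
    return elements, errors
```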

KIE Annotation Format:

{
  "company_name": "Acme Corporation",
  "invoice_number": "INV-12345",
  "invoice_date": "January 15, 2024",
  "total_amount": "$1,234.56"
}

Validation Checks:

  • Missing labels: Elements without data-label → Warning
  • Invalid labels: Labels not in valid_labels → Error
  • Zero dimensions: Width or height = 0 → Error
  • Multiple labels (KIE only): One element, multiple entities → Error
  • Missing text: KIE elements without text → Warning

Task-Specific Handling:

  • DLA: Converts geometries to LayoutBBox format for stage 17
  • KIE: Extracts entity-value pairs for BIO tagging in stage 17
  • QA/CLS: No processing in this stage (uses raw GT from stage 03)

Stage 07: Extract Handwriting

File: pipeline_07_extract_handwriting.py

Purpose: Identify text regions marked for handwriting generation, map them to character/word bounding boxes, and prepare definitions for the diffusion model.

Key Functions:

  • main(): Main extraction orchestrator
  • parse_handwriting_geometries(): Extract elements with data-handwriting attribute
  • map_handwriting_to_bboxes(): Match handwriting text to OCR bboxes spatially
  • split_long_words(): Split words exceeding diffusion model's character limit
  • extract_handwriting_style(): Parse author ID from CSS class

Process:

  1. Load geometries from stage 04
  2. Filter elements with data-handwriting attribute
  3. Load word/character bboxes from stage 05
  4. Spatially match handwriting regions to bboxes
  5. Split long words for diffusion model constraints
  6. Extract author ID for style consistency
  7. Save handwriting definitions with bbox mappings

Inputs:

  • Geometries from geometries/ (stage 04)
  • Word bboxes from pdf_word_bboxes/ (stage 05)
  • Character bboxes from pdf_char_bboxes/ (stage 05, if available)

Outputs:

handwriting_definitions/
  {doc_id}.json              # Handwriting region definitions
logs/handwriting_extraction/ # Extraction logs and warnings

Handwriting Definition Format:

[
  {
    "id": "hw0",
    "text": "John Smith",
    "author_id": "author1",
    "bboxes": [
      "20.0,30.0,60.0,45.0,John",
      "65.0,30.0,95.0,45.0,Smith"
    ],
    "rect": {
      "x": 20.0,
      "y": 30.0,
      "width": 75.0,
      "height": 15.0
    },
    "is_signature": false
  }
]

Key Features:

  • Author ID Extraction: Parses CSS classes like handwriting-author1 for style consistency
  • Spatial Matching: Uses bbox overlap/proximity to map handwriting regions to text
  • Long Word Splitting: Splits words > MAX_HANDWRITING_CHARS (default: 7) into multiple images
  • Signature Detection: Identifies signature fields for special handling
  • Character-Level Fallback: Uses word bboxes if char-level mapping unavailable

Configuration Constants:

  • MAX_HANDWRITING_CHARS: 7 (max characters per diffusion generation)
  • SPATIAL_MATCH_THRESHOLD: Pixel tolerance for bbox matching

Mapping Logic:

  1. Find all words whose bboxes overlap/intersect with handwriting rect
  2. Extract text from matched bboxes
  3. Compare with expected handwriting text
  4. If char bboxes available: split long words at char boundaries
  5. If char bboxes unavailable: split at word boundaries
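
The long-word splitting against MAX_HANDWRITING_CHARS (steps 4-5) reduces to fixed-size chunking; a minimal sketch:

```python
MAX_HANDWRITING_CHARS = 7  # max characters per diffusion generation

def split_long_word(word: str, max_chars: int = MAX_HANDWRITING_CHARS):
    """Split a word into chunks of at most `max_chars` characters so each
    chunk fits one diffusion-model generation."""
    return [word[i:i + max_chars] for i in range(0, len(word), max_chars)]
```

With character bboxes available, each chunk keeps a precise bbox; without them, splitting stops at word boundaries instead.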

Error Handling:

  • No bbox matches → Warning, handwriting region skipped
  • Text mismatch → Logged, uses best match
  • Missing char bboxes → Word-level fallback (may affect quality)

Stage 08: Extract Visual Element Definitions

File: pipeline_08_extract_visual_element_definitions.py

Purpose: Extract placeholders for visual elements (stamps, logos, barcodes, charts, photos) from geometries for generation in stage 10.

Key Functions:

  • main(): Main extraction orchestrator
  • parse_visual_element_geometries(): Extract elements with data-visual-element attribute
  • parse_css_dimension(): Parse width/height from CSS
  • parse_css_rotation(): Extract rotation angle from CSS transform

Process:

  1. Load geometries from stage 04
  2. Filter elements with data-visual-element attribute
  3. Parse element type (stamp, logo, barcode, etc.)
  4. Extract dimensions and rotation from CSS
  5. Parse content (e.g., stamp text, barcode value)
  6. Validate element types and dimensions
  7. Save visual element definitions

Inputs:

  • Geometries from geometries/ (stage 04)

Outputs:

visual_element_definitions/
  {doc_id}.json              # Visual element definitions
logs/visual_element_extraction/ # Extraction logs

Visual Element Definition Format:

[
  {
    "id": "ve0",
    "type": "stamp",
    "type_unmapped": "stamp",
    "content": "CONFIDENTIAL",
    "rect": {
      "x": 150.0,
      "y": 250.0,
      "width": 60.0,
      "height": 30.0
    },
    "rotation": -15.0
  },
  {
    "id": "ve1",
    "type": "barcode",
    "type_unmapped": "barcode",
    "content": "1234567890",
    "rect": {
      "x": 20.0,
      "y": 270.0,
      "width": 100.0,
      "height": 20.0
    },
    "rotation": 0.0
  }
]

Supported Visual Element Types:

  • stamp: Text-based stamps (e.g., "APPROVED", "CONFIDENTIAL")
  • logo: Company/brand logos (selected from prefabs)
  • barcode: Code128 barcodes
  • chart: Charts and graphs (selected from prefabs)
  • photo: Photographic images (selected from prefabs)

Type Mapping: Maps LLM-generated type names to standard types using VISUAL_ELEMENT_TYPE_MAPPING constant.

Key Features:

  • Content Extraction: Parses text content for stamps, values for barcodes
  • Rotation Support: Extracts CSS rotate() transform angles
  • Dimension Parsing: Handles CSS units (px, mm, %, etc.)
  • Type Validation: Warns about unknown types, uses fallback mapping

Validation Checks:

  • Unknown type → Logged, mapped to closest known type
  • Zero dimensions → Error, element skipped
  • Invalid rotation angle → Defaults to 0°
  • Missing content (for stamps/barcodes) → Warning

CSS Parsing Examples:

/* Rotation extraction */
transform: rotate(-15deg);        → -15.0
transform: rotate(0.5turn);       → 180.0

/* Dimension extraction */
width: 60mm;                      → 60.0 (mm)
height: 30px;                     → Converted to mm
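
The rotation examples above can be reproduced with a small parser. This sketch handles only `deg` and `turn` units and is not the actual `parse_css_rotation()` implementation:

```python
import re

def parse_css_rotation(transform: str) -> float:
    """Extract the angle (in degrees) from a CSS rotate() transform.

    Unrecognized or missing rotations fall back to 0.0, matching the
    stage's default behavior."""
    match = re.search(r"rotate\(\s*(-?[\d.]+)(deg|turn)\s*\)", transform)
    if match is None:
        return 0.0
    value, unit = float(match.group(1)), match.group(2)
    return value * 360.0 if unit == "turn" else value
```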

Stage 09: Create Handwriting Images

File: pipeline_09_create_handwriting_images.py

Purpose: Generate realistic handwriting images using a diffusion model, with per-author style consistency.

Key Functions:

  • main(): Main generation orchestrator
  • generate_handwriting_diffusion(): Call diffusion model API/local model
  • add_handwriting_blur(): Optional post-processing for realism

Process:

  1. Load handwriting definitions from stage 07
  2. Group by author ID for style consistency
  3. For each text segment:
    • Generate handwriting image with diffusion model
    • Apply optional blur for realism
    • Save image with metadata
  4. Track author-to-writer style mapping
  5. Save generation logs

Inputs:

  • Handwriting definitions from handwriting_definitions/ (stage 07)
  • Diffusion model checkpoint (local or API)

Outputs:

handwriting_images/
  {doc_id}/
    hw0_0.png               # First bbox of handwriting region hw0
    hw0_1.png               # Second bbox (if multi-word)
    hw1_0.png
logs/handwriting_generation/ # Generation logs with author mappings

Handwriting Image Specifications:

  • Format: PNG with transparency
  • Height: 40 pixels (configurable via HANDWRITING_HEIGHT_PX)
  • Padding: 0 pixels horizontal (configurable via HANDWRITING_PADDING_PX)
  • Background: Transparent
  • Color: Black text on transparent

Diffusion Model Integration: Located in handwriting_diffusion/ module:

  • generate_handwriting_diffusion_raw.py: Core generation logic
  • text_encoder.py: Text encoding for model input
  • tokenizer.py: Character tokenization

Key Features:

  • Author-Style Consistency: Same author ID → consistent handwriting style
  • Writer Style Mapping: Maps author IDs to writer style IDs (e.g., "author1" → writer_42)
  • Batch Processing: Generates multiple images efficiently
  • Optional Blur: Post-processing for more realistic appearance
  • Quality Control: Validates generated images (non-empty, correct dimensions)

Configuration Constants:

  • HANDWRITING_HEIGHT_PX: 40 pixels
  • HANDWRITING_PADDING_PX: 0 pixels
  • DIFFUSION_NUM_INFERENCE_STEPS: 50 (generation quality/speed tradeoff)
  • HANDWRITING_BLUR_ENABLED: Optional blur toggle
  • HANDWRITING_STYLES: List of available writer styles

Author-to-Writer Mapping Example:

{
  "author1": "writer_42",
  "author2": "writer_17",
  "author3": "writer_89"
}
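
One way to produce such a mapping deterministically is to seed a random assignment over the available writer styles; this is a sketch, and the pipeline's actual assignment logic may differ:

```python
import random

def assign_writer_styles(author_ids, writer_styles, seed=0):
    """Map each distinct author ID to a writer style.

    Sorting the author set and seeding the RNG makes the mapping
    reproducible across runs, so one author always gets one style."""
    rng = random.Random(seed)
    return {author: rng.choice(writer_styles) for author in sorted(set(author_ids))}
```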

Error Handling:

  • Generation failure → Retry up to 3 times
  • Empty image → Logged, document flagged
  • Model timeout → Skipped, logged

Stage 10: Create Visual Elements

File: pipeline_10_create_visual_elements.py

Purpose: Generate or select visual element images (stamps, logos, barcodes, charts, photos) based on definitions from stage 08.

Key Functions:

  • main(): Main generation orchestrator
  • route_visual_element_generation(): Router for element type
  • generate_stamp(): Create stamp image with text
  • select_logo(): Random selection from logo prefabs
  • generate_barcode(): Create Code128 barcode
  • select_photo(): Random selection from photo prefabs
  • select_chart(): Random selection from chart prefabs

Process:

  1. Load visual element definitions from stage 08
  2. For each element:
    • Route to type-specific generator
    • Generate or select image
    • Resize to target dimensions
    • Save with transparency (if applicable)
  3. Cache prefab directories for performance
  4. Save generation logs

Inputs:

  • Visual element definitions from visual_element_definitions/ (stage 08)
  • Prefab directories in data/visual_element_prefabs/:
    • logos/
    • photos/
    • charts/

Outputs:

visual_element_images/
  {doc_id}/
    ve0.png                 # Stamp image
    ve1.png                 # Logo image
    ve2.png                 # Barcode image
logs/visual_element_generation/ # Generation logs

Type-Specific Generation:

Stamps:

  • Font: Configurable (default: Arial Bold)
  • Color: Configurable (default: red/blue)
  • Background: Transparent
  • Border: Optional rounded rectangle
  • Rotation: Applied during insertion (stage 13)

Logos:

  • Source: Random selection from data/visual_element_prefabs/logos/
  • Format: PNG with transparency preserved
  • Caching: Directory contents cached after first scan

Barcodes:

  • Type: Code128 (supports alphanumeric)
  • Library: python-barcode
  • Background: White
  • Content validation: Numeric or alphanumeric

Photos:

  • Source: Random selection from data/visual_element_prefabs/photos/
  • Format: JPEG or PNG
  • Aspect ratio: Preserved during resize

Charts/Figures:

  • Source: Random selection from data/visual_element_prefabs/charts/
  • Format: PNG with transparency
  • Types: Bar charts, line graphs, pie charts, etc.

Key Features:

  • Type-Specific Logic: Each element type has dedicated generation function
  • Prefab Caching: Directory scans cached for performance
  • Transparent Backgrounds: Stamps and some logos support transparency
  • Content Validation: Barcodes validate numeric content
  • Aspect Ratio Preservation: Images scaled without distortion
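
The prefab caching mentioned above can be sketched with `functools.lru_cache`, so each prefab directory is scanned only once per process (illustrative, assuming PNG prefabs):

```python
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def list_prefabs(prefab_dir: str):
    """Scan a prefab directory for PNG files once; later calls with the
    same path hit the in-memory cache instead of the filesystem."""
    return tuple(sorted(p.name for p in Path(prefab_dir).glob("*.png")))
```

Returning a tuple (hashable, immutable) keeps the cached value safe from accidental mutation by callers.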

Configuration Constants:

  • STAMP_FONT_SIZE: Calculated from target dimensions
  • STAMP_BORDER_WIDTH: 2 pixels
  • BARCODE_DPI: 300 for high quality
  • PREFAB_CACHE_SIZE: In-memory cache limit

Error Handling:

  • Missing prefab directory → Error, element skipped
  • Empty prefab directory → Warning, fallback placeholder
  • Invalid barcode content → Logged, uses fallback text
  • Image generation failure → Placeholder created

Stage 11: Render PDF Second Pass

File: pipeline_11_render_pdf_second_pass.py

Purpose: Re-render PDF without handwriting placeholders (replaced with blank spaces) to prepare for handwriting image insertion in stage 12.

Key Functions:

  • main(): Main rendering orchestrator
  • render_pdf_playwright_async(): Async Playwright rendering
  • remove_handwriting_placeholders_from_html(): Strip handwriting elements

Process:

  1. Load HTML from render_html/ (stage 04)
  2. Remove all elements with data-handwriting attribute
  3. Re-render PDF using same dimensions from stage 04
  4. Validate PDF output
  5. Save PDFs without handwriting placeholders

Inputs:

  • HTML from render_html/ (stage 04)
  • Page dimensions from stage 04 logs

Outputs:

pdf_without_handwriting_placeholder/
  {doc_id}.pdf              # PDFs with handwriting regions blank
logs/rendering_second_pass/ # Rendering logs

Key Differences from Stage 04:

  • Uses Pre-calculated Dimensions: No content measurement needed
  • Handwriting Elements Removed: Not just hidden (visibility: hidden), but removed from DOM
  • No Geometry Extraction: Geometries already saved in stage 04
  • Faster Rendering: No JavaScript injection for measurement

HTML Preprocessing:

<!-- Before (Stage 04) -->
<div data-handwriting="author1" class="handwriting">John Smith</div>

<!-- After (Stage 11) -->
<!-- Element completely removed -->

Why This Stage Exists: Handwriting placeholders in HTML use system fonts, which don't match the diffusion-generated handwriting. Removing them creates blank spaces where handwriting images will be inserted in stage 12.
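
The stripping step can be sketched with a regex (a simplified illustration; the function name is hypothetical, and the real remove_handwriting_placeholders_from_html may operate on a parsed DOM instead, which is more robust than a regex):

```python
import re

def remove_handwriting_placeholders(html: str) -> str:
    """Remove non-nested elements carrying a data-handwriting attribute.

    Sketch only: sufficient for flat placeholder <div>s like the example
    above; nested elements would need a DOM-based pass.
    """
    # [^>]* keeps the attribute search inside a single opening tag.
    pattern = r'<(\w+)[^>]*\bdata-handwriting=[^>]*>.*?</\1>'
    return re.sub(pattern, "", html, flags=re.DOTALL)
```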

Rendering Configuration:

  • Same timeout and retry logic as stage 04
  • Reuses Playwright browser context for efficiency
  • No debug output (geometries not extracted)

Error Handling:

  • Multi-page PDF → Flagged and skipped
  • Rendering timeout → Retry with increased timeout
  • HTML parsing error → Logged, document marked invalid

Stage 12: Insert Handwriting Images

File: pipeline_12_insert_handwriting_images.py

Purpose: Overlay generated handwriting images onto PDF pages using PyMuPDF, with precise positioning and natural variation.

Key Functions:

  • main(): Main insertion orchestrator
  • insert_handwriting_into_pdf(): Per-document insertion
  • scale_image_with_aspect_ratio(): Resize images while preserving aspect ratio
  • group_bboxes_by_line(): Group multi-word handwriting by line

Process:

  1. Load PDFs from stage 11
  2. Load handwriting images from stage 09
  3. Load handwriting definitions (bbox mappings) from stage 07
  4. For each handwriting region:
    • Group bboxes by line/block
    • Scale images to fit bboxes
    • Apply random offsets for natural variation
    • Insert images at calculated positions
  5. Save PDFs with handwriting

Inputs:

  • PDFs from pdf_without_handwriting_placeholder/ (stage 11)
  • Handwriting images from handwriting_images/ (stage 09)
  • Handwriting definitions from handwriting_definitions/ (stage 07)

Outputs:

pdf_with_handwriting/
  {doc_id}.pdf              # PDFs with handwriting inserted
logs/handwriting_insertion/ # Insertion logs
debug/handwriting_insertion/ # Debug PDFs with bbox overlays (optional)

Insertion Logic:

1. Image Scaling:

  • High-res scaling: 3x upsampling before insertion
  • Aspect ratio preserved
  • Left-aligned within bbox (respects layout rect)

2. Positioning:

  • X coordinate: Left edge of bbox + random offset
  • Y coordinate: Top edge of bbox + random offset
  • Multi-word lines: Consistent Y offset for line

3. Random Offsets:

x_offset = random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X)
y_offset = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y)

4. Block/Line Grouping: For multi-word handwriting (e.g., "John Smith"):

  • Group bboxes by Y coordinate (same line)
  • Apply consistent Y offset to entire line
  • Individual X offsets per word for natural spacing
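
A minimal sketch of the grouping and offset logic (illustrative; the real group_bboxes_by_line may use different tolerances, and the constants mirror the configuration values listed in this stage):

```python
import random

# Per the configuration constants for this stage (±2 / ±1).
MAX_HANDWRITING_RAND_X = 2.0
MAX_HANDWRITING_RAND_Y = 1.0

def group_bboxes_by_line(bboxes, y_tol=2.0):
    """Group word bboxes whose top edges lie within y_tol of each other."""
    lines = []
    for box in sorted(bboxes, key=lambda b: (b["y0"], b["x0"])):
        for line in lines:
            if abs(line[0]["y0"] - box["y0"]) <= y_tol:
                line.append(box)
                break
        else:
            lines.append([box])
    return lines

def line_offsets(line):
    """One shared Y offset per line, an individual X offset per word."""
    y_off = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y)
    return [
        (random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X), y_off)
        for _ in line
    ]
```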

Key Features:

  • High-Resolution Insertion: 3x scaling for quality
  • Natural Variation: Random offsets simulate handwriting imperfection
  • Line Consistency: Multi-word lines maintain baseline
  • Aspect Ratio Preservation: Images not distorted
  • Transparency Support: PNG alpha channel preserved

Configuration Constants:

  • HANDWRITING_IMAGE_UPSCALE_FACTOR: 3x
  • MAX_HANDWRITING_RAND_X: ±2 pixels
  • MAX_HANDWRITING_RAND_Y: ±1 pixel
  • HANDWRITING_LINE_Y_CONSISTENCY: Same Y offset per line

Coordinate System:

  • PDF uses 72 DPI (points)
  • Bboxes from stage 07 are in PDF coordinates
  • No conversion needed

Error Handling:

  • Missing handwriting image → Warning, bbox skipped
  • Image too large for bbox → Scaled down with warning
  • PyMuPDF insertion failure → Logged, document flagged

Stage 13: Insert Visual Elements

File: pipeline_13_insert_visual_elements.py

Purpose: Overlay visual element images (stamps, logos, barcodes, etc.) onto PDF pages with precise positioning and rotation.

Key Functions:

  • main(): Main insertion orchestrator
  • insert_visual_elements_into_pdf(): Per-document insertion
  • scale_image_with_aspect_ratio(): Resize images (same as stage 12)

Process:

  1. Load PDFs from stage 12
  2. Load visual element images from stage 10
  3. Load visual element definitions from stage 08
  4. For each visual element:
    • Scale image to fit bbox
    • Calculate centered position
    • Apply rotation (if specified)
    • Insert image at calculated position
  5. Save final PDFs
  6. If no visual elements: copy PDF from stage 12

Inputs:

  • PDFs from pdf_with_handwriting/ (stage 12)
  • Visual element images from visual_element_images/ (stage 10)
  • Visual element definitions from visual_element_definitions/ (stage 08)

Outputs:

pdf_final/
  {doc_id}.pdf              # Final PDFs with all elements
logs/visual_element_insertion/ # Insertion logs
debug/visual_element_insertion/ # Debug PDFs (optional)

Insertion Logic:

1. Image Scaling:

  • High-res scaling: 3x upsampling
  • Aspect ratio preserved
  • Centered within bbox (not left-aligned like handwriting)

2. Positioning:

  • X coordinate: Center of bbox - half image width
  • Y coordinate: Center of bbox - half image height

3. Rotation:

  • Applied via PyMuPDF transformation matrix
  • Rotation around image center
  • Angle from visual element definition
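
The centering arithmetic is straightforward (a sketch; the helper name is illustrative, not the pipeline's actual function):

```python
def centered_insert_point(bbox, img_w, img_h):
    """Top-left insertion point that centers an image of size (img_w, img_h)
    inside a bbox given as [x0, y0, x1, y1] in PDF points."""
    cx = (bbox[0] + bbox[2]) / 2.0
    cy = (bbox[1] + bbox[3]) / 2.0
    return cx - img_w / 2.0, cy - img_h / 2.0
```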

Key Differences from Stage 12 (Handwriting):

  • Centered placement: Visual elements centered in bbox
  • No random offsets: Precise placement for logos/stamps
  • Rotation support: Stamps often rotated for "APPROVED" effect
  • Fallback: Copies PDF if no visual elements (ensures output exists)

Rotation Transformation:

# PyMuPDF rotation matrix
rotation_matrix = fitz.Matrix(rotation_angle)
image_rect = image_rect * rotation_matrix

Key Features:

  • High-Resolution Insertion: 3x scaling for quality
  • Centered Alignment: Visual elements centered in bboxes
  • Rotation Support: Arbitrary angles for stamps
  • Transparency Preservation: PNG alpha channel maintained
  • Fallback Handling: Copies PDF if no visual elements

Configuration Constants:

  • VISUAL_ELEMENT_UPSCALE_FACTOR: 3x
  • ROTATION_PRECISION: Angle precision in degrees

Coordinate System:

  • Same as stage 12 (PDF 72 DPI)
  • Rotation applied after positioning

Error Handling:

  • Missing visual element image → Warning, element skipped
  • Image too large for bbox → Scaled down with warning
  • PyMuPDF insertion failure → Logged, document flagged
  • No visual elements → PDF copied from stage 12

Stage 14: Render Image

File: pipeline_14_render_image.py

Purpose: Convert final PDFs to high-quality PNG images for OCR and dataset distribution.

Key Functions:

  • main(): Main conversion orchestrator
  • convert_pdf_to_image(): PDF to PNG conversion

Process:

  1. Load PDFs from stage 13
  2. Convert each PDF to PNG using custom PDF-to-image module
  3. Validate image dimensions
  4. Save images

Inputs:

  • PDFs from pdf_final/ (stage 13)

Outputs:

images/
  {doc_id}.png              # Final document images
logs/image_rendering/       # Conversion logs

Image Specifications:

  • Format: PNG
  • DPI: Configurable (default: 200 DPI for quality OCR)
  • Color Mode: RGB (24-bit)
  • Compression: PNG lossless

PDF-to-Image Module: Conversion is handled by a custom module (not a standard library):

  • Handles PDF rendering at specified DPI
  • Single-page conversion only (multi-page PDFs skipped)
  • Uses PDF coordinate system: 72 DPI internally

Key Features:

  • High DPI: 200+ DPI for accurate OCR
  • Single Page Only: Multi-page PDFs flagged as errors
  • Lossless Compression: PNG preserves all details
  • Size Validation: Checks image dimensions match PDF

Configuration Constants:

  • IMAGE_DPI: 200 (OCR quality vs file size tradeoff)
  • IMAGE_MAX_DIMENSION: Optional max width/height

Coordinate System Conversion:

PDF: 210mm × 297mm @ 72 DPI → 595 × 842 points
PNG: 210mm × 297mm @ 200 DPI → 1654 × 2339 pixels
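
The mapping can be checked with a small helper (illustrative only):

```python
MM_PER_INCH = 25.4

def mm_to_pixels(mm: float, dpi: int) -> int:
    """Convert a physical dimension in millimetres to pixels at a given DPI."""
    return round(mm / MM_PER_INCH * dpi)
```

For A4 (210mm × 297mm) this yields 595 × 842 at 72 DPI and 1654 × 2339 at 200 DPI, matching the figures above.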

Why This Stage:

  • OCR performs better on high-DPI images
  • Images are final output format for datasets
  • PNG preserves quality better than JPEG for text

Error Handling:

  • Multi-page PDF → Flagged and skipped
  • Conversion failure → Logged, document marked invalid
  • Empty image → Error, document flagged
  • Dimension mismatch → Warning logged

Stage 15: Perform OCR

File: pipeline_15_perform_ocr.py

Purpose: Perform Optical Character Recognition on the final images to obtain accurate word- and line-level bounding boxes, which is essential for documents with handwriting or visual elements.

Key Functions:

  • main(): Main OCR orchestrator
  • call_microsoft_ocr(): Microsoft Azure OCR API call
  • convert_ocr_to_bbox_format(): Transform OCR results to internal format
  • aggregate_words_to_segments(): Group words into lines

Process:

  1. Determine which documents need OCR:
    • Has handwriting → Requires OCR
    • Has visual elements → Requires OCR
    • Neither → Copy PDF bboxes from stage 05
  2. For documents requiring OCR:
    • Call Microsoft OCR service
    • Parse word-level bboxes
    • Aggregate into line-level segments
    • Convert coordinates to PDF space
  3. Save final bboxes

Inputs:

  • Images from images/ (stage 14)
  • Handwriting definitions from handwriting_definitions/ (stage 07)
  • Visual element definitions from visual_element_definitions/ (stage 08)
  • PDF bboxes from pdf_word_bboxes/ (stage 05, for non-OCR documents)

Outputs:

final_word_bboxes/
  {doc_id}.json             # Word-level bounding boxes
final_segment_bboxes/
  {doc_id}.json             # Line-level bounding boxes
ocr_results_cache/
  {doc_id}.json             # Raw OCR API responses (cached)
logs/ocr/                   # OCR logs and errors

OCR Decision Logic:

requires_ocr = (
    has_handwriting(doc_id) or 
    has_visual_elements(doc_id)
)

if requires_ocr:
    perform_microsoft_ocr()
else:
    copy_pdf_bboxes()  # Reuse stage 05 results

Why Handwriting/Visual Elements Require OCR:

  • Handwriting images inserted in stage 12 → Not in PDF text layer
  • Visual elements inserted in stage 13 → Not in PDF text layer
  • PDF text extraction (stage 05) misses these elements
  • OCR captures all visible text on rendered image

Microsoft OCR API:

  • Service: Azure Computer Vision (Read API)
  • Input: PNG image (from stage 14)
  • Output: Word polygons, text, confidence scores
  • Caching: Results cached to avoid re-processing

OCR Response Format:

{
  "readResults": [{
    "page": 1,
    "words": [
      {
        "boundingBox": [20, 30, 60, 30, 60, 45, 20, 45],
        "text": "Invoice",
        "confidence": 0.98
      }
    ],
    "lines": [
      {
        "boundingBox": [20, 30, 95, 30, 95, 45, 20, 45],
        "text": "Invoice Date",
        "words": [...]
      }
    ]
  }]
}

Coordinate Conversion: OCR returns pixel coordinates; converted to PDF points:

pdf_x = ocr_x * (pdf_width / image_width)
pdf_y = ocr_y * (pdf_height / image_height)

Segment Aggregation: Groups words into lines based on Y coordinate proximity.
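
A sketch of that aggregation (illustrative; the real aggregate_words_to_segments may use adaptive tolerances):

```python
def aggregate_words_to_segments(words, y_tol=5.0):
    """Merge word bboxes into line-level segments by Y proximity.

    Assumes words is a list of {"x0", "y0", "x1", "y1", "text"} dicts.
    """
    segments = []
    for w in sorted(words, key=lambda w: (w["y0"], w["x0"])):
        for seg in segments:
            if abs(seg["y0"] - w["y0"]) <= y_tol:
                # Grow the segment's bbox and append the word's text.
                seg["x0"] = min(seg["x0"], w["x0"])
                seg["y0"] = min(seg["y0"], w["y0"])
                seg["x1"] = max(seg["x1"], w["x1"])
                seg["y1"] = max(seg["y1"], w["y1"])
                seg["text"] += " " + w["text"]
                break
        else:
            segments.append(dict(w))
    return segments
```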

Key Features:

  • Selective OCR: Only runs OCR when necessary (cost/time savings)
  • Result Caching: Avoids redundant API calls
  • Dual-Level Output: Word and line (segment) bboxes
  • Coordinate Conversion: OCR pixels → PDF points
  • Fallback: Copies PDF bboxes for documents without handwriting/VEs

Configuration:

  • OCR API key from environment variables
  • Timeout: 30 seconds per request
  • Retry: Up to 3 attempts

Error Handling:

  • OCR API failure → Retry, fallback to PDF bboxes if all retries fail
  • Empty OCR result → Warning, uses PDF bboxes
  • Coordinate conversion error → Logged, bbox skipped

Stage 16: Normalize BBoxes

File: pipeline_16_normalize_bboxes.py

Purpose: Convert bounding box coordinates from absolute coordinates (pixels at the rendered image's resolution) to normalized [0, 1] coordinates for model training and evaluation.

Key Functions:

  • main(): Main normalization orchestrator
  • normalize_word_and_segment_bboxes(): Normalize word/segment bboxes
  • normalize_layout_bboxes(): Normalize layout element bboxes (DLA only)
  • normalize_coordinates(): Core coordinate transformation

Process:

  1. Load final bboxes from stage 15
  2. Load image dimensions (for normalization denominators)
  3. For each bbox:
    • Transform: normalized_x = pixel_x / image_width
    • Transform: normalized_y = pixel_y / image_height
    • Preserve text content
  4. Save normalized bboxes
  5. For DLA tasks: also normalize layout element bboxes

Inputs:

  • Word bboxes from final_word_bboxes/ (stage 15)
  • Segment bboxes from final_segment_bboxes/ (stage 15)
  • Layout element definitions from layout_element_definitions/ (stage 06, for DLA)
  • Image dimensions from stage 14 logs

Outputs:

normalized_word_bboxes/
  {doc_id}.json             # Normalized word bboxes
normalized_segment_bboxes/
  {doc_id}.json             # Normalized segment bboxes
normalized_gt/              # For DLA tasks only
  {doc_id}.json             # Normalized layout elements
logs/normalization/         # Normalization logs

Normalization Formula:

normalized_bbox = {
    "x0": bbox["x0"] / image_width,
    "y0": bbox["y0"] / image_height,
    "x1": bbox["x1"] / image_width,
    "y1": bbox["y1"] / image_height,
    "text": bbox["text"]  # Preserved
}

Normalized BBox Format:

[
  {
    "x0": 0.095,    # Was: 20 pixels out of 210mm @ 200dpi
    "y0": 0.128,    # Was: 30 pixels
    "x1": 0.286,    # Was: 60 pixels
    "y1": 0.192,    # Was: 45 pixels
    "text": "Invoice"
  }
]

DLA-Specific Normalization: For Document Layout Analysis tasks, layout element bboxes are also normalized:

[
  {
    "label": "title",
    "bbox": [0.095, 0.128, 0.905, 0.192]  # [x0, y0, x1, y1]
  },
  {
    "label": "text",
    "bbox": [0.095, 0.213, 0.905, 0.534]
  }
]

Why Normalization:

  • Model Training: Most models expect [0, 1] coordinates
  • Resolution Independence: Works across different image sizes
  • Standard Format: Matches common dataset formats (e.g., LayoutLM)

Coordinate System Mapping:

PDF Points (72 DPI):
  595 × 842 points (A4)
  ↓ (stage 14 conversion @ 200 DPI)
Image Pixels (200 DPI):
  1654 × 2339 pixels
  ↓ (stage 16 normalization)
Normalized [0, 1]:
  0.0-1.0 × 0.0-1.0

Key Features:

  • Preserves Text: Text content unchanged during normalization
  • Task-Specific: DLA tasks get additional layout bbox normalization
  • Validation: Checks for out-of-bounds coordinates (clamps to [0, 1])
  • Precision: Full float precision maintained
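
The transform plus clamping can be sketched as (illustrative; the helper name is not the pipeline's actual function):

```python
def normalize_bbox(bbox, image_width, image_height):
    """Normalize a pixel-space bbox to [0, 1], clamping out-of-bounds values."""
    if image_width <= 0 or image_height <= 0:
        # Degenerate dimensions invalidate the document.
        raise ValueError("image dimensions must be positive")
    clamp = lambda v: min(max(v, 0.0), 1.0)
    return {
        "x0": clamp(bbox["x0"] / image_width),
        "y0": clamp(bbox["y0"] / image_height),
        "x1": clamp(bbox["x1"] / image_width),
        "y1": clamp(bbox["y1"] / image_height),
        "text": bbox["text"],  # text content preserved unchanged
    }
```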

Error Handling:

  • Out-of-bounds coordinates → Clamped to [0, 1] with warning
  • Missing image dimensions → Error, document skipped
  • Zero image dimensions → Error, document marked invalid

Stage 17: GT Preparation & Verification

File: pipeline_17_gt_preparation_verification.py

Purpose: Validate and prepare final ground truth annotations with fuzzy text matching, task-specific formatting, and comprehensive validation.

Key Functions:

  • main(): Main verification orchestrator
  • route_task(): Route to task-specific handler
  • handle_qa(): QA ground truth processing
  • handle_kie(): KIE ground truth with BIO tagging
  • handle_dla(): DLA ground truth processing
  • fuzzy_match_text(): Levenshtein-based text matching

Process:

  1. Load raw GT from stage 03/06
  2. Load final word bboxes from stage 15
  3. Route to task-specific handler
  4. Perform fuzzy text matching (GT text → OCR text)
  5. Map GT annotations to bbox indices
  6. Apply task-specific formatting
  7. Validate and save verified GT

Inputs:

  • Raw GT from raw_annotations/ (stage 03/06)
  • Word bboxes from final_word_bboxes/ (stage 15)
  • Layout elements from layout_element_definitions/ (stage 06, for DLA)
  • Visual elements from visual_element_definitions/ (stage 08, for DLA)

Outputs:

verified_gt/
  qa/
    {doc_id}.json           # QA format
  kie/
    {doc_id}.json           # KIE format with BIO tagging
  dla/
    {doc_id}.json           # DLA format with normalized bboxes
  cls/
    {doc_id}.json           # Classification format
logs/gt_verification/       # Verification logs with match statistics

QA (Question Answering) Format

Process:

  1. Load QA pairs: {"question": "answer", ...}
  2. For each answer:
    • Find words in bboxes matching answer text (fuzzy)
    • Record bbox indices of matching words
  3. Save verified GT with bbox mappings

Output Format:

{
  "questions": [
    {
      "question": "What is the invoice number?",
      "answer": "INV-12345",
      "answer_bbox_indices": [15, 16]  # Word indices in final_word_bboxes
    },
    {
      "question": "What is the total amount?",
      "answer": "$1,234.56",
      "answer_bbox_indices": [42]
    }
  ]
}

Fuzzy Matching: Uses Levenshtein distance with 0.85 similarity cutoff:

similarity = fuzz.ratio(gt_answer, ocr_text) / 100.0
if similarity >= 0.85:
    match_found = True
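
A dependency-free stand-in using stdlib difflib (its ratios differ slightly from the Levenshtein-based fuzz.ratio, but the cutoff logic is the same):

```python
from difflib import SequenceMatcher

def fuzzy_match_text(gt_text: str, ocr_text: str, cutoff: float = 0.85) -> bool:
    """Return True when the two strings are at least `cutoff` similar."""
    similarity = SequenceMatcher(None, gt_text, ocr_text).ratio()
    return similarity >= cutoff
```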

KIE (Key Information Extraction) Format

Process:

  1. Load entity annotations: {entity_name: text_value, ...}
  2. For each entity:
    • Find words matching entity value (fuzzy)
    • Generate BIO tags for all words
  3. Save verified GT with BIO tagging

Output Format:

{
  "entities": [
    {
      "entity": "company_name",
      "value": "Acme Corporation",
      "bbox_indices": [5, 6]
    },
    {
      "entity": "invoice_date",
      "value": "January 15, 2024",
      "bbox_indices": [20, 21, 22]
    }
  ],
  "word_labels": [
    "O", "O", "O", "O", "O",           # Words 0-4: Outside entities
    "B-company_name", "I-company_name", # Words 5-6: Company name
    "O", "O", ...,                      # Words 7-19: Outside
    "B-invoice_date", "I-invoice_date", "I-invoice_date"  # Words 20-22
  ]
}

BIO Tagging:

  • B-entity: Beginning of entity
  • I-entity: Inside entity (continuation)
  • O: Outside any entity
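
Tag generation from the bbox indices can be sketched as follows (assumes each entity's bbox_indices are sorted and contiguous, as in the format above; the helper name is illustrative):

```python
def generate_bio_labels(num_words, entities):
    """Produce one BIO label per word from entity bbox indices."""
    labels = ["O"] * num_words
    for ent in entities:
        for pos, idx in enumerate(ent["bbox_indices"]):
            # First word of the span gets B-, continuations get I-.
            labels[idx] = ("B-" if pos == 0 else "I-") + ent["entity"]
    return labels
```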

DLA (Document Layout Analysis) Format

Process:

  1. Load layout element definitions from stage 06
  2. Load visual element definitions from stage 08
  3. Validate labels (must be in valid_labels)
  4. Check spatial constraints (no containment, minimal overlap)
  5. Merge visual elements into layout annotations
  6. Normalize bboxes to [0, 1]
  7. Save verified GT

Output Format:

{
  "layout_elements": [
    {
      "id": "layout_0",
      "label": "title",
      "bbox": [0.095, 0.128, 0.905, 0.192]  # Normalized [x0, y0, x1, y1]
    },
    {
      "id": "layout_1",
      "label": "text",
      "bbox": [0.095, 0.213, 0.905, 0.534]
    },
    {
      "id": "ve0",
      "label": "figure",  # Visual element mapped to layout label
      "bbox": [0.714, 0.895, 0.905, 0.980]
    }
  ]
}

Visual Element Merging: Visual elements from stage 08 are converted to layout labels:

  • stamp → figure (or custom mapping)
  • logo → figure
  • chart → figure
  • barcode → figure
  • photo → figure

Spatial Validation:

  • No containment: One bbox fully inside another → Error
  • Minimal overlap: Overlap area must stay below 5% of the smaller bbox → Warning if exceeded
  • Valid labels: All labels must be in SynDatasetDefinition.valid_labels
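
The containment and overlap checks can be sketched as follows (bboxes as [x0, y0, x1, y1]; a simplified reading of the constraints above, with the 5% threshold applied to the smaller bbox):

```python
def spatial_violation(a, b):
    """Return 'containment', 'overlap', or None for a pair of bboxes."""
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    smaller = min(area(a), area(b))
    if inter > 0 and inter == smaller:
        return "containment"  # one bbox lies fully inside the other -> error
    if inter >= 0.05 * smaller:
        return "overlap"      # overlap reaches 5% of the smaller bbox -> warning
    return None
```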

CLS (Classification) Format

Process:

  1. Load classification label from raw GT
  2. Validate label against expected classes
  3. Save verified GT

Output Format:

{
  "document_class": "invoice",
  "confidence": 1.0
}

Fuzzy Matching Details:

Uses Levenshtein distance (via fuzz library) to handle OCR discrepancies:

# Example: GT "INV-12345" vs OCR "INV-I2345" (OCR error)
similarity = fuzz.ratio("INV-12345", "INV-I2345") / 100.0
# similarity = 0.89 (above 0.85 threshold) → Match!

Match Statistics Logged:

  • Total GT annotations
  • Successfully matched annotations
  • Failed matches (similarity < 0.85)
  • Average similarity score

Key Features:

  • Fuzzy Matching: Handles OCR errors gracefully
  • Task-Specific Formatting: QA, KIE, DLA, CLS all handled differently
  • BIO Tagging: Automatic generation for KIE
  • Visual Element Integration: DLA merges visual elements as layout annotations
  • Spatial Validation: Detects overlapping/contained layout elements
  • Similarity Tracking: Logs match quality for analysis

Error Handling:

  • Match failure (similarity < 0.85) → Logged, annotation skipped
  • Invalid labels (DLA) → Error, document marked invalid
  • Spatial violations (DLA) → Warning, elements flagged
  • Missing bboxes → Error, document marked invalid

Stage 18: Analyze

File: pipeline_18_analyze.py

Purpose: Generate comprehensive statistics, cost analysis, and error categorization for the entire dataset generation process.

Key Functions:

  • main(): Main analysis orchestrator
  • calculate_api_costs(): Compute LLM API costs from token usage

Process:

  1. Load all document logs from stages 01-17
  2. Categorize documents: valid vs. invalid
  3. Calculate error distributions
  4. Compute API usage and costs
  5. Generate statistics (handwriting, visual elements, annotations)
  6. Save comprehensive dataset log

Inputs:

  • All document logs from previous stages
  • Batch results from stage 02 (for cost calculation)
  • Message processing logs from stage 03
  • Prompt usage statistics

Outputs:

dataset_log.json            # Comprehensive dataset statistics
logs/analysis/              # Analysis logs

Dataset Log Structure:

{
  "metadata": {
    "syndatadef_name": "docvqa_alpha=1.0",
    "task": "qa",
    "total_documents_requested": 1000,
    "generation_date": "2026-02-07"
  },
  
  "prompting": {
    "total_prompts": 100,
    "total_batches": 10,
    "llm_model": "claude-sonnet-4-20250514"
  },
  
  "total_cost_summary": {
    "total_cost_usd": 123.45,
    "input_tokens": 1500000,
    "output_tokens": 800000,
    "cached_tokens": 500000,
    "cost_per_document": 0.12
  },
  
  "valid_samples_stats": {
    "total_valid": 847,
    "total_invalid": 153,
    "validity_rate": 0.847,
    "avg_handwriting_regions_per_doc": 2.3,
    "avg_visual_elements_per_doc": 1.5,
    "avg_annotations_per_doc": 8.7,
    "documents_with_handwriting": 654,
    "documents_with_visual_elements": 512
  },
  
  "valid_samples": [
    {
      "doc_id": "doc_0001",
      "seed_image": "docvqa_train_12345",
      "has_handwriting": true,
      "has_visual_elements": true,
      "num_annotations": 10,
      "num_words": 247,
      "image_size": [1654, 2339]
    }
  ],
  
  "valid_samples_by_category": {
    "invoice": 234,
    "receipt": 198,
    "form": 415
  },
  
  "errors": {
    "multipage_pdf": 12,
    "missing_ocr_result": 5,
    "failed_gt_verification": 38,
    "rendering_timeout": 8,
    "llm_parsing_error": 23,
    "bbox_extraction_failed": 4,
    "handwriting_generation_failed": 7,
    "visual_element_generation_failed": 3,
    "other": 53
  }
}

Error Categories:

Error Category                     Description
multipage_pdf                      PDF rendered with multiple pages (invalid)
missing_ocr_result                 OCR failed or returned empty result
failed_gt_verification             GT matching/validation failed in stage 17
rendering_timeout                  PDF rendering exceeded timeout
llm_parsing_error                  Failed to extract HTML/GT from LLM response
bbox_extraction_failed             PyMuPDF failed to extract bboxes
handwriting_generation_failed      Diffusion model failed
visual_element_generation_failed   VE generation/selection failed
other                              Miscellaneous errors

Cost Calculation:

Claude API Pricing (example):

costs = {
    "input": input_tokens * 0.003 / 1000,      # $3 per 1M tokens
    "output": output_tokens * 0.015 / 1000,    # $15 per 1M tokens
    "cached": cached_tokens * 0.0003 / 1000    # $0.30 per 1M tokens (prompt caching)
}
total_cost = costs["input"] + costs["output"] + costs["cached"]

Statistics Computed:

  • Validity Rate: Percentage of documents passing all stages
  • Handwriting Stats: Documents with handwriting, avg regions per doc
  • Visual Element Stats: Documents with VEs, avg elements per doc
  • Annotation Stats: Avg QA pairs/KIE entities/DLA elements per doc
  • Token Usage: Input/output/cached token totals
  • Cost Metrics: Total cost, cost per document, cost per valid document

Key Features:

  • Comprehensive Error Tracking: All error categories logged
  • Cost Transparency: Token-level cost breakdown
  • Quality Metrics: Validity rate, avg annotations, etc.
  • Category Breakdown: Valid samples grouped by document type
  • Per-Document Tracking: Each valid document's metadata saved

Usage: This log is essential for:

  • Understanding dataset quality
  • Optimizing pipeline (identify bottlenecks)
  • Cost estimation for future runs
  • Debugging (error distribution analysis)

Stage 19: Create Debug Data

File: pipeline_19_create_debug_data.py

Purpose: Generate comprehensive debug visualizations for manual inspection and quality assurance.

Key Functions:

  • main(): Main debug generator
  • visualize_visual_element_bboxes(): Overlay VE bboxes on PDFs
  • visualize_final_bboxes_on_images(): Overlay OCR bboxes on images
  • visualize_pdf_bboxes(): Overlay PDF bboxes

Process:

  1. Load all generated documents
  2. Create debug subdirectories
  3. For each document:
    • Generate PDF bbox overlays
    • Generate VE bbox overlays
    • Generate handwriting insertion region overlays
    • Generate final OCR bbox overlays on images
  4. Copy raw HTML with debug.js script for browser inspection

Inputs:

  • All intermediate and final outputs from stages 01-18

Outputs:

debug/
  pdf_bboxes/               # PDF word bboxes overlaid (stage 05)
    {doc_id}.pdf
  visual_element_bboxes/    # VE bboxes overlaid (stage 08)
    {doc_id}.pdf
  handwriting_insertion/    # Handwriting regions overlaid (stage 12)
    {doc_id}.pdf
  final_bboxes_on_images/   # OCR bboxes on final images (stage 15)
    {doc_id}.png
  html_with_debug/          # Raw HTML + debug.js
    {doc_id}.html
    debug.js                # Browser-based inspection script

Debug Visualizations:

1. PDF BBoxes (Stage 05):

  • Red rectangles: Word bounding boxes from PyMuPDF
  • Annotated with word text
  • Purpose: Verify PDF text extraction quality

2. Visual Element BBoxes (Stage 08):

  • Blue rectangles: Visual element placeholder regions
  • Annotated with element type (stamp, logo, etc.)
  • Purpose: Verify VE extraction and positioning

3. Handwriting Insertion Regions (Stage 12):

  • Green rectangles: Handwriting bbox regions
  • Annotated with handwriting text
  • Purpose: Verify handwriting placement accuracy

4. Final BBoxes on Images (Stage 15):

  • Orange rectangles: OCR word bounding boxes
  • Overlaid on final rendered images
  • Purpose: Verify OCR accuracy and coverage

Debug JavaScript (debug.js):

// Browser-based inspection tool
// Features:
// - Highlight elements on hover
// - Show geometry data in console
// - Toggle element visibility
// - Measure element dimensions

Key Features:

  • Color-Coded Overlays: Different colors for different stages
  • Text Annotations: Bboxes labeled with content
  • Multi-Format: Both PDF and PNG visualizations
  • Browser Inspection: HTML with interactive debug script
  • Selective Generation: Only enabled with DEBUG_MODE=true flag

Configuration:

  • DEBUG_MODE: Enable/disable debug output (default: false)
  • DEBUG_BBOX_LINE_WIDTH: Line thickness for overlays (default: 2)
  • DEBUG_BBOX_OPACITY: Overlay transparency (default: 0.5)

When to Use:

  • Visual quality assurance
  • Debugging bbox extraction issues
  • Verifying handwriting/VE insertion
  • Identifying OCR problems
  • Manual inspection of edge cases

Performance Note: Debug generation adds ~20% processing time; disabled by default in production.


API Implementation

The DocGenie API provides a FastAPI-based REST service for synchronous document generation, integrating pipeline stages 01-06.

Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                         API ARCHITECTURE                         │
└──────────────────────────────────────────────────────────────────┘

Client Request
     │
     ↓
┌─────────────────────┐
│   FastAPI Server    │
│   (main.py)         │
└─────────────────────┘
     │
     ├──► Validate Request (schemas.py)
     │
     ├──► Download Seed Images (utils.py)
     │
     ├──► Build Prompt (utils.py)
     │
     ├──► Call Claude API (Synchronous)
     │    └──► Claude Sonnet 4.5
     │
     ├──► Extract HTML & GT (pipeline_03 functions)
     │
     ├──► Render PDF (pipeline_04 functions)
     │    └──► Playwright/Chromium
     │
     ├──► Extract BBoxes (pipeline_05 functions)
     │    └──► PyMuPDF
     │
     └──► Return Response
          └──► JSON with base64-encoded PDFs

Endpoints

GET /

Health check endpoint.

Response:

{
  "message": "DocGenie API is running",
  "version": "1.0",
  "status": "healthy"
}

GET /health

Detailed health status.

Response:

{
  "status": "healthy",
  "api_version": "1.0",
  "playwright_available": true,
  "llm_model": "claude-sonnet-4-5-20250929"
}

POST /generate

Generate documents with ground truth annotations.

Request Body (schemas.GenerateDocumentRequest):

{
  "seed_images": [
    "https://example.com/seed1.jpg",
    "https://example.com/seed2.jpg"
  ],
  "prompt_params": {
    "language": "English",
    "doc_type": "business and administrative",
    "gt_type": "Multiple questions and their answers",
    "gt_format": "{\"question\": \"answer\", ...}",
    "num_solutions": 2
  }
}

Request Validation:

  • seed_images: 1-10 URLs (HTTPS only)
  • num_solutions: 1-5 documents
  • All prompt parameters required

Response (schemas.GenerateDocumentResponse):

{
  "success": true,
  "message": "Successfully generated 2 documents",
  "documents": [
    {
      "document_id": "doc_20260207_001",
      "html": "<html>...</html>",
      "css": "body { ... }",
      "ground_truth": {
        "What is the invoice number?": "INV-12345"
      },
      "pdf_base64": "JVBERi0xLjQK...",
      "bboxes": [
        {
          "x0": 20.0,
          "y0": 30.0,
          "x1": 60.0,
          "y1": 45.0,
          "text": "Invoice"
        }
      ],
      "page_width_mm": 210.0,
      "page_height_mm": 297.0
    }
  ],
  "total_documents": 2
}
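
A minimal client sketch using only the standard library (the URL is a placeholder for wherever the API is deployed; helper names are illustrative):

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:8000/generate"  # placeholder deployment URL

payload = {
    "seed_images": ["https://example.com/seed1.jpg"],
    "prompt_params": {
        "language": "English",
        "doc_type": "business and administrative",
        "gt_type": "Multiple questions and their answers",
        "gt_format": '{"question": "answer", ...}',
        "num_solutions": 1,
    },
}

def build_generate_request(url: str = API_URL) -> urllib.request.Request:
    """Build the POST /generate request; pass it to urllib.request.urlopen."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def save_pdfs(response_json: dict, prefix: str = "doc") -> None:
    """Decode pdf_base64 fields from a /generate response to .pdf files."""
    for i, doc in enumerate(response_json["documents"], start=1):
        with open(f"{prefix}_{i:03d}.pdf", "wb") as fh:
            fh.write(base64.b64decode(doc["pdf_base64"]))
```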

POST /generate-files

Generate documents and return as downloadable files.

Request: Same as /generate

Response:

  • Content-Type: application/zip
  • File: ZIP archive containing:
    • doc_001.pdf
    • doc_001_gt.json
    • doc_001_bboxes.json
    • doc_002.pdf
    • ...

File Structure:

generated_documents.zip
├── doc_001.pdf
├── doc_001_gt.json
├── doc_001_bboxes.json
├── doc_001_metadata.json
├── doc_002.pdf
└── ...

Request/Response Schemas

Defined in api/schemas.py:

PromptParameters

class PromptParameters(BaseModel):
    language: str = "English"
    doc_type: str = "business and administrative"
    gt_type: str = "Multiple questions and their answers"
    gt_format: str = '{"question": "answer", ...}'
    num_solutions: int = Field(default=1, ge=1, le=5)

GenerateDocumentRequest

class GenerateDocumentRequest(BaseModel):
    seed_images: List[HttpUrl] = Field(..., min_items=1, max_items=10)
    prompt_params: PromptParameters

BoundingBox

class BoundingBox(BaseModel):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str

DocumentResult

class DocumentResult(BaseModel):
    document_id: str
    html: str
    css: str
    ground_truth: Optional[dict]
    pdf_base64: str
    bboxes: List[BoundingBox]
    page_width_mm: float
    page_height_mm: float

GenerateDocumentResponse

class GenerateDocumentResponse(BaseModel):
    success: bool
    message: str
    documents: List[DocumentResult]
    total_documents: int

API Pipeline Flow

Detailed integration with pipeline stages:

# Simplified API flow (from api/main.py)

@app.post("/generate", response_model=GenerateDocumentResponse)
async def generate_documents(request: GenerateDocumentRequest):
    # 1. Download and encode seed images
    seed_images_base64 = await download_and_encode_images(request.seed_images)
    
    # 2. Build prompt from template
    prompt = build_prompt_from_template(
        seed_images=seed_images_base64,
        params=request.prompt_params
    )
    
    # 3. Call Claude API (synchronous, not batched)
    llm_response = await call_claude_api(
        prompt=prompt,
        model="claude-sonnet-4-5-20250929"
    )
    
    # 4. Extract HTML and GT (from pipeline_03)
    documents_html = extract_html_from_message(llm_response)
    documents_gt = extract_gt_from_html(documents_html)
    
    # 5. Validate HTML (from utils.py)
    validate_html_structure(documents_html)
    
    # 6. Render PDFs (from pipeline_04)
    pdfs = await render_pdfs_with_playwright(documents_html)
    
    # 7. Validate PDFs (from utils.py)
    validate_pdf_pages(pdfs)
    
    # 8. Extract bboxes (from pipeline_05)
    bboxes = extract_bboxes_from_pdfs(pdfs)
    
    # 9. Validate bboxes (from utils.py)
    validate_bbox_completeness(bboxes)
    
    # 10. Encode PDFs to base64
    pdfs_base64 = encode_pdfs_to_base64(pdfs)
    
    # 11. Build response
    return GenerateDocumentResponse(
        success=True,
        documents=[...],
        total_documents=len(documents_html)
    )

Integration with Pipeline Functions

Reused from Pipeline:

  • extract_html_from_message() from pipeline_03
  • extract_gt_from_html() from pipeline_03
  • render_pdf_with_playwright() from pipeline_04
  • extract_bboxes_from_pdf() from pipeline_05
  • Various utilities from docgenie.generation.utils

API-Specific Functions (api/utils.py):

  • download_seed_images(): Fetch images from URLs
  • encode_images_to_base64(): Convert images for API transmission
  • build_prompt_from_template(): Template-based prompt construction
  • call_claude_api_sync(): Synchronous Claude API call (non-batched)
  • encode_pdf_to_base64(): PDF encoding for response
  • validate_html_structure(): HTML validation
  • validate_pdf_pages(): PDF page count/size validation
  • validate_bbox_completeness(): Ensure bboxes extracted
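A minimal stdlib-only sketch of the first two helpers (illustrative; the actual implementations in api/utils.py may use a different HTTP client and add size or content-type checks):

```python
import base64
import urllib.request


def download_seed_images(urls, timeout=30.0):
    """Fetch each seed image URL and return the raw bytes."""
    images = []
    for url in urls:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            images.append(resp.read())
    return images


def encode_images_to_base64(images):
    """Encode raw image bytes for embedding in the Claude API request."""
    return [base64.b64encode(img).decode("ascii") for img in images]
```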

Configuration

Environment Variables (.env)

ANTHROPIC_API_KEY=sk-ant-...            # Required for Claude API
LLM_MODEL=claude-sonnet-4-5-20250929   # Default model
API_PORT=8000                           # Server port
DEBUG_MODE=false                        # Enable debug logging

Prompt Templates

Located in data/prompt_templates/:

Template Structure:

data/prompt_templates/<template_name>/
β”œβ”€β”€ system_prompt.txt                   # System message
β”œβ”€β”€ user_prompt_template.txt            # User message template
└── example_output.html                 # Example for few-shot

Placeholder Substitution:

# In user_prompt_template.txt
"""
Please generate {num_solutions} {doc_type} documents in {language}.

Ground truth format: {gt_format}
Ground truth type: {gt_type}
"""

# Substituted with request.prompt_params
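Conceptually, the substitution is a plain str.format call over the template (a sketch; the real build_prompt_from_template also injects the system prompt and few-shot example):

```python
USER_PROMPT_TEMPLATE = (
    "Please generate {num_solutions} {doc_type} documents in {language}.\n"
    "\n"
    "Ground truth format: {gt_format}\n"
    "Ground truth type: {gt_type}\n"
)


def fill_template(template: str, params: dict) -> str:
    """Substitute prompt parameter values into the template placeholders."""
    return template.format(**params)
```

Braces inside the gt_format value are safe: str.format does not re-scan substituted values, only the template itself.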

CORS Configuration

# In api/main.py
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],        # Open for development
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Authentication & Security

Current State:

  • No built-in authentication: API is open (for development)
  • API Key Management: Claude API key in .env, not passed in requests
  • CORS: Open (allow_origins=["*"]) for development

Production Recommendations:

  • Add API key authentication (e.g., Bearer tokens)
  • Restrict CORS origins to known frontends
  • Rate limiting (e.g., Redis-based)
  • Input sanitization for HTML injection prevention
  • HTTPS only (terminate SSL at reverse proxy)

Performance Considerations

Async Rendering:

  • Playwright rendering uses async/await
  • Concurrent requests supported by FastAPI
  • Semaphore control prevents resource exhaustion

Image Size Limits:

  • Seed images compressed before API transmission
  • Max dimension: 1024px (configurable)
  • JPEG quality: 85% (configurable)
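The compression step can be sketched with Pillow; compress_seed_image is illustrative, with defaults matching the 1024 px / 85% limits above:

```python
import io

from PIL import Image


def compress_seed_image(data: bytes, max_dim: int = 1024, quality: int = 85) -> bytes:
    """Downscale to max_dim on the longest side and re-encode as JPEG."""
    img = Image.open(io.BytesIO(data)).convert("RGB")
    scale = max_dim / max(img.size)
    if scale < 1.0:  # never upscale small seeds
        img = img.resize(
            (max(1, int(img.width * scale)), max(1, int(img.height * scale))),
            Image.LANCZOS,
        )
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```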

Timeouts:

  • PDF render timeout: 60 seconds per document
  • Claude API timeout: 120 seconds
  • Total request timeout: 300 seconds (5 minutes)

Retry Logic:

  • PDF rendering: Up to 3 attempts
  • Claude API: Up to 2 retries on network errors
  • No retry on validation failures
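The retry policy can be expressed as a small wrapper (a sketch; the exception types treated as retriable here are assumptions):

```python
import time


def with_retries(fn, max_attempts=3, delay=2.0,
                 retriable=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient errors with a fixed delay.
    Other exceptions propagate immediately, so validation failures
    are never retried."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise
            time.sleep(delay)
```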

Concurrency:

  • FastAPI default: Multiple workers (configurable)
  • Playwright: Semaphore-controlled (10 concurrent renders)
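The semaphore pattern looks roughly like this (render_one's body is a placeholder for the actual Playwright call):

```python
import asyncio

CHROMIUM_CONCURRENCY = 10  # documented default

_render_semaphore = asyncio.Semaphore(CHROMIUM_CONCURRENCY)


async def render_one(html: str) -> bytes:
    """At most CHROMIUM_CONCURRENCY renders hold the semaphore at once."""
    async with _render_semaphore:
        await asyncio.sleep(0)  # placeholder for the Playwright render
        return html.encode()


async def render_all(documents_html):
    return await asyncio.gather(*(render_one(h) for h in documents_html))
```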

Limitations

Current Limitations:

  1. Single-Page Only: Multi-page PDFs flagged as errors
  2. Seed Image Limit: Maximum 10 seed images per request
  3. Document Limit: Maximum 5 document variations (num_solutions)
  4. No Handwriting/Visual Elements: Stages 07-13 not integrated (API stops at stage 06)
  5. Synchronous LLM: No batching (higher cost per document)
  6. No Dataset Export: No /export-dataset endpoint for full pipeline runs

Known Issues:

  • Large documents (>10 pages worth of content) may timeout
  • Complex CSS (animations, 3D transforms) may not render correctly
  • Some Unicode characters may not display in PDFs

Future Integration Plan

From api/PIPELINE_INTEGRATION.md:

Stage 3: Handwriting & Visual Elements (Stages 07-11)

New Request Parameters:

class PromptParameters(BaseModel):
    # ... existing ...
    enable_handwriting: bool = False
    handwriting_ratio: float = Field(default=0.2, ge=0.0, le=1.0)
    enable_visual_elements: bool = False
    visual_element_types: List[str] = ["stamp", "logo"]

New Response Fields:

class DocumentResult(BaseModel):
    # ... existing ...
    handwriting_regions: Optional[List[HandwritingRegion]]
    visual_elements: Optional[List[VisualElement]]

Impact:

  • Longer processing time (diffusion model: ~5s per handwriting region)
  • Larger response size (additional images)

Stage 4: Image Finalization & OCR (Stages 12-15)

New Response Fields:

class DocumentResult(BaseModel):
    # ... existing ...
    image_base64: str                    # Final rendered image (PNG)
    ocr_text: str                        # Full OCR text
    ocr_confidence: float                # Average OCR confidence

Impact:

  • OCR API costs (~$1.50 per 1000 images)
  • Additional 2-3 seconds per document (OCR latency)

Stage 5: Dataset Packaging (Stages 16-19)

New Endpoint:

@app.post("/export-dataset")
async def export_dataset(request: ExportDatasetRequest):
    """
    Run full pipeline (stages 01-19) and return packaged dataset.
    """
    # Run pipeline with syndatadef
    # Return ZIP with images, GTs, statistics

Request:

class ExportDatasetRequest(BaseModel):
    syndatadef_config: dict              # Full SynDatasetDefinition
    output_format: str = "huggingface"   # "huggingface", "coco", "custom"

Response:

  • ZIP archive with full dataset
  • Includes dataset_log.json from stage 18

Example Usage

Python Client (api/example_usage.py)


import base64
import json
from pathlib import Path

import requests

# API endpoint
API_URL = "http://localhost:8000/generate"

# Prepare request
request_data = {
    "seed_images": [
        "https://example.com/invoice_seed.jpg",
        "https://example.com/receipt_seed.jpg"
    ],
    "prompt_params": {
        "language": "English",
        "doc_type": "invoices and receipts",
        "gt_type": "Multiple questions and their answers",
        "gt_format": '{"question": "answer", ...}',
        "num_solutions": 3
    }
}

# Call API
response = requests.post(API_URL, json=request_data)
result = response.json()

# Process results
if result["success"]:
    for doc in result["documents"]:
        doc_id = doc["document_id"]
        
        # Save PDF
        pdf_data = base64.b64decode(doc["pdf_base64"])
        Path(f"{doc_id}.pdf").write_bytes(pdf_data)
        
        # Save GT
        Path(f"{doc_id}_gt.json").write_text(
            json.dumps(doc["ground_truth"], indent=2)
        )
        
        # Save HTML
        Path(f"{doc_id}.html").write_text(doc["html"])
        
        print(f"Saved: {doc_id}")
        print(f"  - BBoxes: {len(doc['bboxes'])}")
        print(f"  - GT Annotations: {len(doc['ground_truth'])}")

cURL Example

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "seed_images": [
      "https://example.com/seed.jpg"
    ],
    "prompt_params": {
      "language": "English",
      "doc_type": "business documents",
      "gt_type": "Questions and answers",
      "gt_format": "{\"question\": \"answer\"}",
      "num_solutions": 1
    }
  }'

JavaScript/TypeScript Client

const response = await fetch('http://localhost:8000/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    seed_images: ['https://example.com/seed.jpg'],
    prompt_params: {
      language: 'English',
      doc_type: 'invoices',
      gt_type: 'Questions and answers',
      gt_format: '{"question": "answer"}',
      num_solutions: 2
    }
  })
});

const result = await response.json();

// Decode PDF
const pdfBlob = new Blob(
  [Uint8Array.from(atob(result.documents[0].pdf_base64), c => c.charCodeAt(0))],
  { type: 'application/pdf' }
);

// Download PDF
const url = URL.createObjectURL(pdfBlob);
const a = document.createElement('a');
a.href = url;
a.download = `${result.documents[0].document_id}.pdf`;
a.click();

Testing

Test File: api/test_api.py

import pytest
from fastapi.testclient import TestClient
from api.main import app

client = TestClient(app)

def test_health_check():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

def test_generate_documents():
    request_data = {
        "seed_images": ["https://example.com/seed.jpg"],
        "prompt_params": {
            "language": "English",
            "doc_type": "invoices",
            "gt_type": "Questions",
            "gt_format": "{\"q\": \"a\"}",
            "num_solutions": 1
        }
    }
    response = client.post("/generate", json=request_data)
    assert response.status_code == 200
    result = response.json()
    assert result["success"] is True
    assert len(result["documents"]) > 0

def test_invalid_request():
    request_data = {
        "seed_images": [],  # Invalid: empty
        "prompt_params": {"num_solutions": 10}  # Invalid: > 5
    }
    response = client.post("/generate", json=request_data)
    assert response.status_code == 422  # Validation error

Run Tests:

cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie
pytest api/test_api.py -v

Deployment

Start Server:

cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie/api
chmod +x start.sh
./start.sh

start.sh Contents:

#!/bin/bash
export $(cat .env | xargs)
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Production Deployment:

# With Gunicorn (multi-worker)
gunicorn main:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 \
  --timeout 300

Docker Deployment:

FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Chromium and its system dependencies for Playwright
RUN playwright install --with-deps chromium

COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Core Models & Utilities

SynDatasetDefinition Model

File: docgenie/generation/models/_syndatadef.py

Purpose: Configuration object for synthetic dataset generation, loaded from YAML files.

Key Attributes:

@dataclass
class SynDatasetDefinition:
    # Dataset metadata
    name: str                           # "docvqa_alpha=1.0"
    task: TaskType                      # TaskType.QA, .KIE, .DLA, .CLS
    base_dataset: str                   # "docvqa", "cord", etc.
    
    # LLM configuration
    llm_model: str                      # "claude-sonnet-4-20250514"
    prompt_template: str                # "DocGenie"
    num_solutions: int                  # Documents per prompt
    
    # Prompting parameters
    language: str                       # "English", "German", etc.
    doc_type: str                       # "business and administrative"
    gt_type: str                        # Task-specific GT description
    gt_format: str                      # Expected GT format
    
    # Dataset parameters
    num_samples: int                    # Total documents to generate
    alpha: float                        # Clustering diversity parameter
    
    # Seed selection
    seeds_per_cluster: int              # Seeds sampled per cluster
    clustering_method: str              # "kmeans", "agglomerative"
    
    # Task-specific
    valid_labels: Optional[List[str]]   # For DLA: ["title", "text", ...]
    
    # Output paths
    output_dir: Path                    # Root output directory

Key Methods:

def get_file_structure(self) -> FileStructure:
    """Returns FileStructure manager for output directories."""

def build_prompt(self, seed_images: List[str]) -> str:
    """Builds prompt from template with parameter substitution."""

def iter_document_logs(self) -> Iterator[DocumentLog]:
    """Iterates over all document logs."""

def update_document_status(self, doc_id: str, status: Status):
    """Updates document status in log."""

@classmethod
def from_yaml(cls, yaml_path: Path) -> "SynDatasetDefinition":
    """Load configuration from YAML file."""

Example YAML:

name: "docvqa_alpha=1.0"
task: "qa"
base_dataset: "docvqa"

llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1

language: "English"
doc_type: "business and administrative documents"
gt_type: "Multiple questions and their answers"
gt_format: '{"question": "answer", ...}'

num_samples: 1000
alpha: 1.0
seeds_per_cluster: 10
clustering_method: "kmeans"

output_dir: "data/datasets/synthesized_datasets/docvqa_alpha=1.0"
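Loading the raw YAML mapping is the first step of from_yaml() (a sketch; the real classmethod also validates fields and resolves output_dir to a Path):

```python
from pathlib import Path

import yaml  # PyYAML


def load_syndatadef_dict(yaml_path: Path) -> dict:
    """Read a dataset definition YAML file into a plain dict."""
    with open(yaml_path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)
```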

PipelineParameters Model

File: docgenie/generation/models/_pipeline.py

Purpose: Runtime parameters for pipeline execution.

Attributes:

@dataclass
class PipelineParameters:
    # Execution control
    start_stage: int = 1                # First stage to execute
    end_stage: int = 19                 # Last stage to execute
    skip_existing: bool = True          # Skip documents with existing outputs
    
    # Parallelization
    max_workers: int = 10               # Concurrent processing
    chromium_concurrency: int = 10      # Parallel PDF renders
    
    # Debug mode
    debug_mode: bool = False            # Enable debug visualizations
    
    # Retry configuration
    max_retries: int = 3                # Retry attempts
    retry_delay: float = 2.0            # Seconds between retries
    
    # Timeouts
    pdf_render_timeout: int = 60        # Seconds
    ocr_timeout: int = 30               # Seconds

FileStructure Model

File: docgenie/generation/models/_file.py

Purpose: Manages directory structure for generated data.

Key Properties:

@dataclass
class FileStructure:
    root: Path                          # Root output directory
    
    # Directory properties (all return Path)
    seeds_directory: Path
    prompt_batches_directory: Path
    message_results_directory: Path
    raw_html_directory: Path
    raw_annotations_directory: Path
    geometries_directory: Path
    pdf_initial_directory: Path
    render_html_directory: Path
    pdf_word_bboxes_directory: Path
    pdf_char_bboxes_directory: Path
    layout_element_definitions_directory: Path
    handwriting_definitions_directory: Path
    visual_element_definitions_directory: Path
    handwriting_images_directory: Path
    visual_element_images_directory: Path
    pdf_without_handwriting_placeholder_directory: Path
    pdf_with_handwriting_directory: Path
    pdf_final_directory: Path
    images_directory: Path
    final_word_bboxes_directory: Path
    final_segment_bboxes_directory: Path
    normalized_word_bboxes_directory: Path
    normalized_segment_bboxes_directory: Path
    verified_gt_directory: Path
    
    # Debug subdirectories
    debug_directory: Path
    debug_pdf_bboxes_directory: Path
    debug_visual_element_bboxes_directory: Path
    debug_handwriting_insertion_directory: Path
    debug_final_bboxes_on_images_directory: Path
    debug_html_with_debug_directory: Path

Key Methods:

def create_all_directories(self):
    """Create all output directories."""

def get_document_path(self, doc_id: str, stage: str) -> Path:
    """Get path for specific document and stage."""

DocumentLog Model

File: docgenie/generation/models/_log.py

Purpose: Document-level metadata and status tracking.

Attributes:

@dataclass
class DocumentLog:
    doc_id: str                         # Unique document ID
    seed_image_id: str                  # Source seed image
    prompt_call_id: str                 # Prompt batch/call ID
    
    status: Status                      # VALID, INVALID, PROCESSING
    
    # Stage completion flags
    has_raw_html: bool = False
    has_raw_gt: bool = False
    has_pdf_initial: bool = False
    has_geometries: bool = False
    has_bboxes: bool = False
    has_handwriting: bool = False
    has_visual_elements: bool = False
    has_final_image: bool = False
    has_ocr_result: bool = False
    has_verified_gt: bool = False
    
    # Statistics
    num_words: int = 0
    num_annotations: int = 0
    num_handwriting_regions: int = 0
    num_visual_elements: int = 0
    
    # Error tracking
    error_stage: Optional[str] = None
    error_message: Optional[str] = None
    error_category: Optional[str] = None
    
    # Timestamps
    created_at: datetime
    updated_at: datetime

Key Methods:

def mark_stage_complete(self, stage: str):
    """Mark pipeline stage as complete."""

def mark_error(self, stage: str, error_msg: str, category: str):
    """Record error and mark document as invalid."""

def is_valid(self) -> bool:
    """Check if document passed all stages."""

BBox Model

File: docgenie/generation/models/_bbox.py

Purpose: Bounding box representation with text content.

Attributes:

@dataclass
class BBox:
    rect: Rect                          # {x0, y0, x1, y1}
    text: str                           # Text content
    metadata: Optional[dict] = None     # Additional data

Key Methods:

@property
def width(self) -> float:
    return self.rect["x1"] - self.rect["x0"]

@property
def height(self) -> float:
    return self.rect["y1"] - self.rect["y0"]

def normalize(self, image_width: float, image_height: float) -> "BBox":
    """Convert to normalized [0, 1] coordinates."""

def unnormalize(self, image_width: float, image_height: float) -> "BBox":
    """Convert from normalized to pixel coordinates."""

def to_dict(self) -> dict:
    """Serialize to dictionary."""

@classmethod
def from_dict(cls, data: dict) -> "BBox":
    """Deserialize from dictionary."""

LayoutBBox Model

File: docgenie/generation/models/_bbox.py

Purpose: Layout element bounding box (for DLA tasks).

Attributes:

@dataclass
class LayoutBBox:
    label: str                          # "title", "text", "table", etc.
    rect: Rect                          # {x0, y0, x1, y1}
    metadata: Optional[dict] = None

Key Methods:

def normalize(self, image_width: float, image_height: float) -> "LayoutBBox":
    """Normalize coordinates."""

def contains(self, other: "LayoutBBox") -> bool:
    """Check if this bbox fully contains another."""

def overlaps(self, other: "LayoutBBox") -> bool:
    """Check if this bbox overlaps with another."""

def overlap_area(self, other: "LayoutBBox") -> float:
    """Calculate overlap area with another bbox."""

Utility Modules

BBox Utilities (utils/bboxes.py)

def load_bboxes_from_file(file_path: Path) -> List[BBox]:
    """Load bboxes from JSON file."""

def save_bboxes_to_file(bboxes: List[BBox], file_path: Path):
    """Save bboxes to JSON file."""

def visualize_bboxes_on_pdf(
    pdf_path: Path,
    bboxes: List[BBox],
    output_path: Path,
    color: str = "red"
):
    """Draw bbox overlays on PDF."""

def visualize_bboxes_on_image(
    image_path: Path,
    bboxes: List[BBox],
    output_path: Path,
    color: str = "orange"
):
    """Draw bbox overlays on image."""

def check_bbox_containment(bbox1: BBox, bbox2: BBox) -> bool:
    """Check if bbox1 contains bbox2."""

Geometry Utilities (utils/geos.py)

def filter_layout_elements(geometries: dict) -> List[dict]:
    """Extract elements with data-label attribute."""

def filter_handwriting_elements(geometries: dict) -> List[dict]:
    """Extract elements with data-handwriting attribute."""

def filter_visual_elements(geometries: dict) -> List[dict]:
    """Extract elements with data-visual-element attribute."""

def filter_by_css_class(geometries: dict, class_name: str) -> List[dict]:
    """Extract elements with specific CSS class."""

Serialization Utilities (utils/serialization.py)

def encode_image_to_base64(image_path: Path) -> str:
    """Encode image file to base64 string."""

def decode_base64_to_image(base64_str: str, output_path: Path):
    """Decode base64 string and save as image."""

def serialize_dataclass(obj: Any) -> dict:
    """Serialize dataclass to dictionary."""

def deserialize_dataclass(data: dict, cls: Type[T]) -> T:
    """Deserialize dictionary to dataclass."""

Configuration & Constants

Key Constants (generation/constants.py)

# Document processing
HTML_PARSER = "html.parser"           # BeautifulSoup parser
PDF_POINT_SCALING = 72 / 96           # CSS pixels (96/inch) to PDF points (72/inch)

# Bounding boxes
BBOX_OVERLAP_THRESHOLD = 0.05         # 5% overlap tolerance
SPATIAL_MATCH_THRESHOLD = 10.0        # Pixel tolerance for matching

# PDF rendering
CHROMIUM_CONCURRENCY = 10             # Parallel renders
PER_PDF_RENDER_TIMEOUT = 60           # Seconds
PER_PDF_RENDER_MAX_RETRIES = 3        # Retry attempts

# Handwriting
MAX_HANDWRITING_CHARS = 7             # Max chars per diffusion generation
HANDWRITING_HEIGHT_PX = 40            # Image height
HANDWRITING_PADDING_PX = 0            # Horizontal padding
DIFFUSION_NUM_INFERENCE_STEPS = 50    # Generation quality
HANDWRITING_IMAGE_UPSCALE_FACTOR = 3  # Insertion scaling
MAX_HANDWRITING_RAND_X = 2            # Random X offset (pixels)
MAX_HANDWRITING_RAND_Y = 1            # Random Y offset (pixels)

# Visual elements
VISUAL_ELEMENT_UPSCALE_FACTOR = 3     # Insertion scaling
STAMP_BORDER_WIDTH = 2                # Stamp border thickness
BARCODE_DPI = 300                     # Barcode image quality

# OCR
IMAGE_DPI = 200                       # Final image DPI
OCR_CONFIDENCE_THRESHOLD = 0.8        # Min confidence

# Ground truth
FUZZY_MATCH_THRESHOLD = 0.85          # Levenshtein similarity cutoff

# Handwriting styles
HANDWRITING_STYLES = [f"writer_{i}" for i in range(100)]  # "writer_0" through "writer_99"

# Visual element types
VISUAL_ELEMENT_TYPES = [
    "stamp", "logo", "barcode", "chart", "photo"
]

# Visual element type mapping (LLM output β†’ standard)
VISUAL_ELEMENT_TYPE_MAPPING = {
    "stamp": "stamp",
    "company_stamp": "stamp",
    "approval_stamp": "stamp",
    "logo": "logo",
    "company_logo": "logo",
    "brand_logo": "logo",
    "barcode": "barcode",
    "code128": "barcode",
    "chart": "chart",
    "graph": "chart",
    "figure": "chart",
    "photo": "photo",
    "image": "photo",
    "picture": "photo"
}
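For example, GT verification compares each annotation against the OCR text using FUZZY_MATCH_THRESHOLD. A sketch using difflib.SequenceMatcher as a stand-in for the pipeline's Levenshtein-based similarity:

```python
from difflib import SequenceMatcher

FUZZY_MATCH_THRESHOLD = 0.85  # documented cutoff


def fuzzy_match(gt_text: str, ocr_text: str) -> bool:
    """True when the two strings are similar enough to count as a match."""
    ratio = SequenceMatcher(None, gt_text.lower(), ocr_text.lower()).ratio()
    return ratio >= FUZZY_MATCH_THRESHOLD
```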

Environment Variables

Required:

ANTHROPIC_API_KEY=sk-ant-...          # Claude API key
MICROSOFT_AZURE_OCR_KEY=...           # Azure OCR key
MICROSOFT_AZURE_OCR_ENDPOINT=...      # Azure OCR endpoint

Optional:

LLM_MODEL=claude-sonnet-4-20250514    # Override model
DEBUG_MODE=false                      # Enable debug output
LOG_LEVEL=INFO                        # Logging verbosity
CHROMIUM_CONCURRENCY=10               # Override concurrency

Error Handling & Debugging

Error Categories

Comprehensive error tracking in stage 18:

| Category | Description | Resolution |
|---|---|---|
| multipage_pdf | PDF rendered with >1 page | Check HTML content size, CSS page breaks |
| missing_ocr_result | OCR API failed or returned empty | Check Azure credentials, retry |
| failed_gt_verification | GT text not found in OCR output | Review fuzzy match threshold, inspect HTML |
| rendering_timeout | PDF render exceeded timeout | Increase timeout, simplify HTML |
| llm_parsing_error | Failed to extract HTML/GT | Review LLM response format, update regex |
| bbox_extraction_failed | PyMuPDF extraction error | Check PDF validity, inspect fonts |
| handwriting_generation_failed | Diffusion model error | Check model checkpoint, GPU availability |
| visual_element_generation_failed | VE creation error | Check prefab directories, image validity |

Debug Visualizations (Stage 19)

Generated debug outputs:

  1. PDF BBoxes: Verify PyMuPDF extraction quality
  2. Visual Element BBoxes: Verify VE extraction and positioning
  3. Handwriting Insertion: Verify handwriting placement
  4. Final BBoxes on Images: Verify OCR accuracy

Enable debug mode:

# In pipeline execution
pipeline_params = PipelineParameters(
    debug_mode=True,
    # ... other params
)

Logging

Log locations:

data/datasets/synthesized_datasets/<dataset_name>/logs/
β”œβ”€β”€ pipeline_01/
β”‚   └── seed_selection.log
β”œβ”€β”€ pipeline_02/
β”‚   └── prompting.log
β”œβ”€β”€ pipeline_03/
β”‚   └── message_processing/
β”‚       β”œβ”€β”€ batch_001.log
β”‚       └── batch_002.log
β”œβ”€β”€ ...
└── pipeline_19/
    └── debug_generation.log

Log format:

[2026-02-07 14:32:15] [INFO] [pipeline_04] Starting PDF rendering for doc_0001
[2026-02-07 14:32:18] [INFO] [pipeline_04] Successfully rendered doc_0001 (3.2s)
[2026-02-07 14:32:18] [ERROR] [pipeline_04] Rendering failed for doc_0002: Timeout

Common Issues & Solutions

Issue: Multi-page PDFs

  • Cause: HTML content exceeds page size
  • Solution: Reduce content, check for long tables, use CSS overflow: hidden

Issue: Handwriting not appearing

  • Cause: Character bboxes not available (stage 05)
  • Solution: Use simpler fonts in HTML, ensure PyMuPDF can extract chars

Issue: OCR missing text

  • Cause: Low image DPI, poor contrast
  • Solution: Increase IMAGE_DPI (stage 14), adjust HTML styling

Issue: GT verification failures

  • Cause: OCR discrepancies, fuzzy match threshold too high
  • Solution: Lower FUZZY_MATCH_THRESHOLD, improve HTML text rendering

Issue: Claude API timeout

  • Cause: Large seed images, complex prompts
  • Solution: Compress seed images, simplify prompt template

Usage Examples

Full Pipeline Execution

from docgenie.generation import pipeline_01_select_seeds
from docgenie.generation.models import SynDatasetDefinition, PipelineParameters

# Load configuration
syndatadef = SynDatasetDefinition.from_yaml(
    "data/syn_dataset_definitions/docvqa_alpha=1.0.yaml"
)

# Configure pipeline
params = PipelineParameters(
    start_stage=1,
    end_stage=19,
    skip_existing=True,
    debug_mode=False,
    max_workers=10
)

# Execute pipeline stages
# Stage modules are named pipeline_NN_<description>, so the full module
# name must be resolved first: importlib.import_module() does not accept
# wildcards.
import importlib
import pkgutil

import docgenie.generation

for stage in range(params.start_stage, params.end_stage + 1):
    print(f"Executing stage {stage:02d}...")

    prefix = f"pipeline_{stage:02d}_"
    module_name = next(
        name
        for _, name, _ in pkgutil.iter_modules(docgenie.generation.__path__)
        if name.startswith(prefix)
    )
    stage_module = importlib.import_module(f"docgenie.generation.{module_name}")
    stage_module.main(syndatadef, params)

    print(f"Stage {stage:02d} complete.")

print("Pipeline execution complete!")

API Usage

See API Implementation section for detailed examples.


Custom Dataset Definition

# data/syn_dataset_definitions/custom_invoices.yaml

name: "custom_invoices_v1"
task: "kie"
base_dataset: "cord"

llm_model: "claude-sonnet-4-20250514"
prompt_template: "ClaudeRefined12"
num_solutions: 1

language: "English"
doc_type: "invoices and receipts"
gt_type: "Key-value pairs (entity extraction)"
gt_format: '{"entity_name": "entity_value", ...}'

num_samples: 500
alpha: 0.75
seeds_per_cluster: 5
clustering_method: "kmeans"

# KIE-specific
valid_labels: null  # Not used for KIE

output_dir: "data/datasets/synthesized_datasets/custom_invoices_v1"

Conclusion

This documentation provides a comprehensive overview of the DocGenie generation pipeline and API. For additional support or questions, refer to:

  • Pipeline Integration Guide: api/PIPELINE_INTEGRATION.md
  • API README: api/README.md
  • Source Code: docgenie/generation/ and api/

Key Takeaways:

  1. 19-Stage Pipeline: Modular design from seed selection to GT verification
  2. Multi-Task Support: QA, KIE, DLA, CLS with task-specific handling
  3. Realistic Documents: LLM-generated content, diffusion handwriting, visual elements
  4. Quality Assurance: Comprehensive validation, OCR verification, error tracking
  5. API Integration: FastAPI service for synchronous document generation (stages 01-06)
  6. Extensibility: Modular code, clear interfaces, easy to extend

Pipeline Strengths:

  • Task-agnostic core with task-specific adapters
  • Extensive logging and error tracking
  • Parallel processing where applicable
  • Debug visualizations at every stage
  • Integration with state-of-the-art models

Future Directions:

  • Full API integration (stages 07-19)
  • Additional document types (forms, legal, medical)
  • Multi-language support expansion
  • Enhanced visual element generation (charts, diagrams)
  • Real-time generation optimization