DocGenie Generation Pipeline & API Documentation
Version: 1.0
Last Updated: February 7, 2026
Purpose: Comprehensive reference for the DocGenie synthetic document generation system
Table of Contents
- Overview
- Pipeline Architecture
- Pipeline Stages (01-19)
- API Implementation
- Core Models & Utilities
- Configuration & Constants
- Usage Examples
- Error Handling & Debugging
Overview
DocGenie is a sophisticated 19-stage pipeline for generating synthetic document datasets with ground truth annotations. It supports multiple document understanding tasks:
- Document Question Answering (QA)
- Key Information Extraction (KIE)
- Document Layout Analysis (DLA)
- Document Classification (CLS)
Key Features
- LLM-Powered Generation: Uses Claude/Gemini/Open-source models to generate diverse document content
- Realistic Handwriting: Diffusion model-based handwriting synthesis with author-specific styles
- Visual Element Integration: Stamps, logos, barcodes, charts, and photos
- Multi-Task Support: Task-specific ground truth formatting and validation
- Quality Assurance: Comprehensive validation, OCR verification, and error tracking
- Modular Design: Each pipeline stage is independently executable with clear inputs/outputs
Technology Stack
- LLM APIs: Claude (Anthropic), Gemini, DeepSeek, Qwen
- PDF Rendering: Playwright (Chromium), PyMuPDF
- OCR: Microsoft Azure OCR
- Handwriting: Custom diffusion model
- Image Processing: PIL, OpenCV
- API Framework: FastAPI
- Data Processing: Pandas, NumPy
Pipeline Architecture
High-Level Flow
┌─────────────────────────────────────────────────────────────────────────┐
│                      DOCGENIE GENERATION PIPELINE                       │
└─────────────────────────────────────────────────────────────────────────┘

┌──────────────────────┐
│ PHASE 1: SELECTION   │
└──────────────────────┘
        │
[01] Select Seeds ────────────▶ seeds.csv, clusters.csv
     (Cluster-based diverse seed selection)

┌──────────────────────┐
│ PHASE 2: LLM GEN     │
└──────────────────────┘
        │
[02] Prompt LLM ──────────────▶ batch_results/ (JSON)
  │  (Claude API batched calls)
  │
[03] Process Response ────────▶ raw_html/, raw_annotations/
     (Extract HTML & GT from responses)

┌──────────────────────┐
│ PHASE 3: RENDERING   │
└──────────────────────┘
        │
[04] Render PDF Initial ──────▶ pdf_initial/, geometries/
  │  (HTML→PDF with geometry extraction)
  │
[05] Extract BBoxes ──────────▶ pdf_word_bboxes/, pdf_char_bboxes/
  │  (PyMuPDF text extraction)
  │
[06] Extract Layout ──────────▶ layout_element_definitions/
     (DLA/KIE-specific annotations)

┌──────────────────────┐
│ PHASE 4: EXTRACTION  │
└──────────────────────┘
        │
[07] Extract Handwriting ─────▶ handwriting_definitions/
  │  (Identify handwriting regions)
  │
[08] Extract Visual Elements ─▶ visual_element_definitions/
     (Stamp/logo/barcode placeholders)

┌──────────────────────┐
│ PHASE 5: GENERATION  │
└──────────────────────┘
        │
[09] Create Handwriting ──────▶ handwriting_images/
  │  (Diffusion model generation)
  │
[10] Create Visual Elements ──▶ visual_element_images/
     (Generate/select stamps, logos, etc.)

┌──────────────────────┐
│ PHASE 6: COMPOSITION │
└──────────────────────┘
        │
[11] Render PDF (2nd Pass) ───▶ pdf_without_handwriting_placeholder/
  │  (Remove handwriting placeholders)
  │
[12] Insert Handwriting ──────▶ pdf_with_handwriting/
  │  (Overlay handwriting images)
  │
[13] Insert Visual Elements ──▶ pdf_final/
  │  (Overlay stamps, logos, etc.)
  │
[14] Render Image ────────────▶ images/
     (PDF→PNG conversion)

┌──────────────────────┐
│ PHASE 7: FINALIZATION│
└──────────────────────┘
        │
[15] Perform OCR ─────────────▶ final_word_bboxes/, final_segment_bboxes/
  │  (Microsoft OCR)
  │
[16] Normalize BBoxes ────────▶ normalized_word_bboxes/, normalized_segment_bboxes/
     (Pixel→[0,1] coordinates)

┌──────────────────────┐
│ PHASE 8: VALIDATION  │
└──────────────────────┘
        │
[17] GT Preparation ──────────▶ verified_gt/
  │  (Fuzzy matching, BIO tagging)
  │
[18] Analyze ─────────────────▶ dataset_log.json
  │  (Statistics, cost analysis)
  │
[19] Create Debug Data ───────▶ debug/ subdirectories
     (Visualizations for inspection)
Data Flow Between Stages
Seed Images ───┐
               ├──▶ [02] ──▶ HTML + GT ──▶ [04] ──▶ PDF + Geometries
Prompt Params ─┘                                        │
        ┌───────────────────────────────────────────────┘
        ├──▶ [05] ──▶ BBoxes
        │       └──▶ [07] ──▶ HW Defs ──▶ [09] ──▶ HW Images
        ├──▶ [08] ──▶ VE Defs ──▶ [10] ──▶ VE Images
        └──▶ [11] ──▶ PDF (no HW)
                        │
                      [12] (Insert HW) ◀── HW Images
                        │
                      [13] (Insert VE) ◀── VE Images
                        │
                      [14] ──▶ Image
                        │
                      [15] ──▶ OCR BBoxes
                        │
                      [16] ──▶ Normalized
                        │
                      [17] ──▶ Verified GT
Pipeline Stages (01-19)
Stage 01: Select Seeds
File: pipeline_01_select_seeds.py
Purpose: Select diverse seed images from base dataset using clustering algorithms to ensure variety in the generated documents.
Key Functions:
- main(): Orchestrates the seed selection process
- downscale_and_compress_seeds(): Prepares seed images for efficient API transmission
- plot_class_distribution(): Visualizes class balance in selected seeds
- visualize_cluster_histogram(): Shows distribution across clusters
Process:
- Load embeddings from base dataset
- Perform clustering (KMeans or other algorithms)
- Sample N seeds per cluster
- Downscale and compress images (JPEG, max dimension)
- Save seed manifest and cluster assignments
Inputs:
- SynDatasetDefinition configuration
- Base dataset name (e.g., docvqa, cord, publaynet)
- Clustering parameters from constants
Outputs:
seeds.csv # Selected seed document IDs per prompt call
clusters.csv # Cluster assignments for all documents
seeds/ (directory) # Preprocessed seed images (JPEG, compressed)
Configuration Parameters:
- EMBEDDING_MODEL: Specifies which embedding model was used
- IMAGE_MAX_DIMENSION: Max width/height for compression
- JPEG_QUALITY: Compression quality (0-100)
Example Usage:
from docgenie.generation import pipeline_01_select_seeds
pipeline_01_select_seeds.main(
    syndatadef_path="data/syn_dataset_definitions/docvqa_alpha=1.0.yaml",
    base_dataset="docvqa"
)
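For illustration, the clustering and per-cluster sampling step might look like the following sketch, assuming precomputed embeddings; the function and column names here are illustrative, not the module's actual API:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def select_seeds(embeddings: np.ndarray, doc_ids: list[str],
                 n_clusters: int = 20, seeds_per_cluster: int = 5,
                 seed: int = 0) -> pd.DataFrame:
    # Cluster the embedding space, then sample uniformly within each cluster
    # so the selected seeds cover the base dataset's diversity.
    clusters = KMeans(n_clusters=n_clusters, random_state=seed,
                      n_init="auto").fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    rows = []
    for c in range(n_clusters):
        members = np.flatnonzero(clusters == c)
        picks = rng.choice(members, size=min(seeds_per_cluster, len(members)),
                           replace=False)
        rows.extend({"doc_id": doc_ids[i], "cluster": c} for i in picks)
    return pd.DataFrame(rows)  # columns mirror seeds.csv / clusters.csv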
Stage 02: Prompt LLM
File: pipeline_02_prompt_llm.py
Purpose: Send batched prompts to LLM APIs (Claude, Gemini, DeepSeek, Qwen) with seed images to generate document HTML and ground truth.
Key Functions:
- main(): Main orchestrator for LLM prompting
- create_batched_messages(): Constructs API-compatible message batches
- track_batch_completion(): Polls API for batch status
- Cost calculation utilities in pipeline_01/cost.py
Process:
- Load seed images and encode as base64
- Build prompts from template with parameter injection
- Create batched API requests (Claude Batch API for cost efficiency)
- Submit batches and track completion
- Save results for processing in stage 03
Inputs:
- Prompt template from data/prompt_templates/<template_name>/
- Seed images from stage 01
- API credentials from environment variables
- SynDatasetDefinition parameters
Outputs:
prompt_batches/ # Batch metadata (batch IDs, status)
message_results/ # JSON response files per batch
logs/ # Prompting logs and progress
API Configuration:
- Claude: Uses Batch API with prompt caching for cost efficiency
- Batch Size: Configurable via BATCH_SIZE constant
- Polling Interval: Configurable wait time between status checks
- Model Selection: Specified in SynDatasetDefinition.llm_model
Cost Tracking:
- Input/output token counts per request
- Cached token usage (for Claude)
- Total cost estimation per batch
Example Configuration:
# In syn_dataset_definition YAML
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1 # Documents per prompt
language: "English"
doc_type: "business and administrative"
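A minimal sketch of a batched call and polling loop follows, using the Anthropic Message Batches API; prompt construction and seed-image encoding are simplified, and the custom_id scheme is illustrative:
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
prompts = ["Generate one HTML business document with embedded GT."]  # placeholder

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"prompt-{i}",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 8192,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for i, text in enumerate(prompts)
    ]
)

# Poll until the batch finishes, then iterate over per-request results.
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)
for entry in client.messages.batches.results(batch.id):
    print(entry.custom_id, entry.result.type)  # e.g. "prompt-0 succeeded"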
Stage 03: Process Response
File: pipeline_03_process_response.py
Purpose: Extract and validate HTML documents and ground truth annotations from LLM responses.
Key Functions:
- main(): Main processor
- extract_html_from_message(): Regex-based HTML extraction from markdown code blocks
- extract_gt_from_html(): Parse JSON ground truth from <script id="GT"> tags
- validate_and_save_gt(): Task-specific validation and formatting
Process:
- Load message results from stage 02
- Extract HTML content using regex patterns
- Parse and validate ground truth JSON
- Apply task-specific formatting (QA, KIE, DLA, CLS)
- Save raw HTML and annotations separately
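A minimal sketch of the extraction step, assuming each document arrives in an html-fenced markdown block with ground truth embedded in a <script id="GT"> tag (the regexes are illustrative):
import json
import re

HTML_BLOCK = re.compile(r"```html\s*(.*?)```", re.DOTALL)
GT_SCRIPT = re.compile(r'<script[^>]*id="GT"[^>]*>(.*?)</script>', re.DOTALL)

def extract_documents(message_text: str) -> list[tuple[str, dict | None]]:
    # Pull each HTML document out of the LLM response, then parse the
    # ground-truth JSON embedded in its <script id="GT"> tag.
    docs = []
    for html in HTML_BLOCK.findall(message_text):
        match = GT_SCRIPT.search(html)
        gt = json.loads(match.group(1)) if match else None
        docs.append((html.strip(), gt))
    return docs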
Inputs:
- Message results from message_results/ (stage 02)
- SynDatasetDefinition task type and prompt format
Outputs:
raw_html/ # HTML files (one per document)
raw_annotations/ # Ground truth JSON files
qa/ # QA format: {"question": "answer", ...}
kie/ # KIE format: {"entity": "value", ...}
dla/ # DLA format: {"element_id": "label", ...}
cls/ # CLS format: {"class": "category"}
logs/message_processing/ # Processing logs per message
Ground Truth Formats:
QA Format:
{
"What is the invoice number?": "INV-12345",
"What is the total amount?": "$1,234.56"
}
KIE Format:
{
"company_name": "Acme Corp",
"invoice_date": "2024-01-15",
"total_amount": "1234.56"
}
DLA Format:
{
"layout_0": "title",
"layout_1": "text",
"layout_2": "table"
}
Validation Checks:
- Expected document count matches actual
- Valid JSON structure in GT
- Task-specific field presence
- HTML completeness (valid tags)
Error Handling:
- Malformed HTML → Logged, skipped
- Missing GT → Document flagged in logs
- Invalid JSON → Parsing error logged
Stage 04: Render PDF and Extract Geometries
File: pipeline_04_render_pdf_and_extract_geos.py
Purpose: Convert HTML to PDF with automatic content-based page sizing and extract element geometries (positions, dimensions) for downstream processing.
Key Functions:
- main(): Async batch rendering orchestrator
- render_pdf_with_playwright(): Single PDF rendering using Playwright/Chromium
- preprocess_css(): Remove conflicting CSS @page rules
- validate_pdf(): Check page count and file integrity
- extract_geometries(): Parse element positions from injected JavaScript
Process:
- Load raw HTML from stage 03
- Inject geometry extraction JavaScript
- Preprocess CSS (remove fixed page sizes)
- Render with Playwright in headless Chromium
- Measure content dimensions dynamically
- Export PDF with calculated page size
- Extract and save element geometries
Inputs:
- Raw HTML files from raw_html/ (stage 03)
Outputs:
pdf_initial/ # Initial PDFs
geometries/ # Element geometry JSON files
{doc_id}.json # Contains positions, dimensions, CSS classes
render_html/ # HTML with geometry extraction scripts
debug/pdf_with_geos/ # Debug PDFs with geometry overlays (optional)
logs/rendering/ # Rendering logs and errors
Geometry JSON Structure:
{
"page_width_mm": 210.0,
"page_height_mm": 297.0,
"elements": [
{
"id": "layout_0",
"type": "div",
"class": "title",
"rect": {
"x": 20.0,
"y": 30.0,
"width": 170.0,
"height": 15.0
},
"text": "Invoice",
"attributes": {
"data-label": "title",
"data-handwriting": null,
"data-visual-element": null
}
}
]
}
Key Features:
- Automatic Page Sizing: No fixed dimensions; page size adapts to content
- Concurrent Rendering: Semaphore-controlled parallel processing
- Geometry Extraction: JavaScript injected to capture element positions
- CSS Coordinate Conversion: 96 DPI (CSS) → 72 DPI (PDF)
- Retry Logic: Up to 3 attempts with configurable timeout
- Debug Visualizations: Optional overlay of geometries on PDFs
Configuration Constants:
- CHROMIUM_CONCURRENCY: 10 (parallel render limit)
- PER_PDF_RENDER_TIMEOUT: 60 seconds
- PER_PDF_RENDER_MAX_RETRIES: 3 attempts
- PDF_POINT_SCALING: 72/96 for DPI conversion
Error Handling:
- Timeout → Retry with increased timeout
- Multi-page PDFs → Flagged and skipped
- Missing geometries → Error logged, document marked invalid
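A minimal sketch of the content-sized rendering with Playwright; the real stage additionally injects the geometry-extraction JavaScript, preprocesses CSS, and applies the retry logic above:
import asyncio
from playwright.async_api import async_playwright

async def render_pdf(html: str, out_path: str) -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.set_content(html, wait_until="networkidle")
        # Measure the rendered content so the page size adapts to it.
        width = await page.evaluate("document.documentElement.scrollWidth")
        height = await page.evaluate("document.documentElement.scrollHeight")
        await page.pdf(path=out_path, width=f"{width}px",
                       height=f"{height}px", print_background=True)
        await browser.close()

asyncio.run(render_pdf("<h1>Invoice</h1>", "doc.pdf"))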
Stage 05: Extract BBoxes from PDF
File: pipeline_05_extract_bboxes_from_pdf.py
Purpose: Extract word-level and character-level bounding boxes from PDFs using PyMuPDF for accurate text positioning.
Key Functions:
- main(): Main extraction orchestrator
- extract_bboxes_from_pdf(): PyMuPDF-based extraction
- verify_char_to_word_mapping(): Validate character-to-word relationships
- create_bbox_debug_pdf(): Generate debug visualizations
Process:
- Load PDFs from stage 04
- Extract words with PyMuPDF get_text("words")
- Extract characters with get_text("rawdict")
- Map characters to parent words
- Validate mappings (required for handwriting splitting)
- Save both word and character bboxes
Inputs:
- PDFs from pdf_initial/ (stage 04)
Outputs:
pdf_word_bboxes/ # Word-level bounding boxes
{doc_id}.json # Format: [x0, y0, x1, y1, text]
pdf_char_bboxes/ # Character-level bounding boxes (if mappable)
{doc_id}.json # Format: [x0, y0, x1, y1, char, word_idx]
debug/pdf_bboxes/ # Debug PDFs with bbox overlays (optional)
logs/bbox_extraction/ # Extraction logs
BBox JSON Format:
Word BBoxes:
[
{"x0": 20.0, "y0": 30.0, "x1": 60.0, "y1": 45.0, "text": "Invoice"},
{"x0": 65.0, "y0": 30.0, "x1": 95.0, "y1": 45.0, "text": "Date"}
]
Char BBoxes (with word mapping):
[
{"x0": 20.0, "y0": 30.0, "x1": 25.0, "y1": 45.0, "char": "I", "word_idx": 0},
{"x0": 25.0, "y0": 30.0, "x1": 30.0, "y1": 45.0, "char": "n", "word_idx": 0}
]
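A minimal sketch of the word-level extraction with PyMuPDF; get_text("words") yields (x0, y0, x1, y1, text, block_no, line_no, word_no) tuples in PDF points:
import fitz  # PyMuPDF

def extract_word_bboxes(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    page = doc[0]  # pipeline documents are single-page
    return [
        {"x0": x0, "y0": y0, "x1": x1, "y1": y1, "text": text}
        for x0, y0, x1, y1, text, *_ in page.get_text("words")
    ]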
Key Features:
- Dual-Level Extraction: Both word and character granularity
- Mapping Validation: Ensures characters can be mapped to words (critical for handwriting)
- Debug Visualizations: Color-coded bbox overlays on PDFs
- PDF Coordinate System: Uses 72 DPI (PDF points)
Character-to-Word Mapping: Required for handwriting splitting in stage 07. If mapping fails (e.g., due to ligatures or complex fonts), char bboxes are not saved, and handwriting will use word-level fallback.
Error Handling:
- Empty PDFs → Skipped with warning
- Unmappable characters → Word-level fallback
- Invalid bboxes (negative dimensions) → Filtered out
Stage 06: Extract Layout Element Definitions and Annotation GT
File: pipeline_06_extract_layout_element_definitions_and_annotation_gt.py
Purpose: Extract layout element definitions for Document Layout Analysis (DLA) and annotation-based ground truth for Key Information Extraction (KIE).
Key Functions:
- main(): Task router
- handle_dla(): Process DLA tasks
- handle_kie(): Process KIE tasks
- parse_layout_elements(): Extract layout elements from geometries
- parse_kie_fields(): Extract KIE annotated fields from geometries
Process (DLA):
- Load geometries from stage 04
- Filter elements with data-label attribute
- Validate labels against task-specific valid labels
- Extract bounding boxes and labels
- Save layout element definitions
Process (KIE):
- Load geometries from stage 04
- Filter elements with data-annotation attribute
- Extract entity names and text values
- Validate against expected entity schema
- Save raw annotations for stage 17
Inputs:
- Geometries from geometries/ (stage 04)
- Valid labels from SynDatasetDefinition.valid_labels
- Task type from SynDatasetDefinition.task
Outputs:
layout_element_definitions/ # For DLA tasks
{doc_id}.json # [{id, label, rect}, ...]
raw_annotations/ # For KIE tasks
kie_annotations/
{doc_id}.json # {entity_name: text_value, ...}
Layout Element Definition Format (DLA):
[
{
"id": "layout_0",
"label": "title",
"rect": {"x": 20.0, "y": 30.0, "width": 170.0, "height": 15.0}
},
{
"id": "layout_1",
"label": "text",
"rect": {"x": 20.0, "y": 50.0, "width": 170.0, "height": 50.0}
}
]
KIE Annotation Format:
{
"company_name": "Acme Corporation",
"invoice_number": "INV-12345",
"invoice_date": "January 15, 2024",
"total_amount": "$1,234.56"
}
Validation Checks:
- Missing labels: Elements without data-label → Warning
- Invalid labels: Labels not in valid_labels → Error
- Zero dimensions: Width or height = 0 → Error
- Multiple labels (KIE only): One element, multiple entities → Error
- Missing text: KIE elements without text → Warning
Task-Specific Handling:
- DLA: Converts geometries to LayoutBBox format for stage 17
- KIE: Extracts entity-value pairs for BIO tagging in stage 17
- QA/CLS: No processing in this stage (uses raw GT from stage 03)
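A minimal sketch of the DLA path, assuming the geometry JSON layout shown in stage 04; the helper name is illustrative:
import json

def parse_layout_elements(geometry_path: str, valid_labels: set[str]) -> list[dict]:
    with open(geometry_path) as f:
        geo = json.load(f)
    elements = []
    for el in geo["elements"]:
        label = el["attributes"].get("data-label")
        if label is None:
            continue  # element not annotated for layout analysis
        if label not in valid_labels:
            raise ValueError(f"invalid label {label!r} on {el['id']}")
        elements.append({"id": el["id"], "label": label, "rect": el["rect"]})
    return elements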
Stage 07: Extract Handwriting
File: pipeline_07_extract_handwriting.py
Purpose: Identify text regions marked for handwriting generation, map them to character/word bounding boxes, and prepare definitions for the diffusion model.
Key Functions:
- main(): Main extraction orchestrator
- parse_handwriting_geometries(): Extract elements with data-handwriting attribute
- map_handwriting_to_bboxes(): Match handwriting text to OCR bboxes spatially
- split_long_words(): Split words exceeding diffusion model's character limit
- extract_handwriting_style(): Parse author ID from CSS class
Process:
- Load geometries from stage 04
- Filter elements with data-handwriting attribute
- Load word/character bboxes from stage 05
- Spatially match handwriting regions to bboxes
- Split long words for diffusion model constraints
- Extract author ID for style consistency
- Save handwriting definitions with bbox mappings
Inputs:
- Geometries from geometries/ (stage 04)
- Word bboxes from pdf_word_bboxes/ (stage 05)
- Character bboxes from pdf_char_bboxes/ (stage 05, if available)
Outputs:
handwriting_definitions/
{doc_id}.json # Handwriting region definitions
logs/handwriting_extraction/ # Extraction logs and warnings
Handwriting Definition Format:
[
{
"id": "hw0",
"text": "John Smith",
"author_id": "author1",
"bboxes": [
"20.0,30.0,60.0,45.0,John",
"65.0,30.0,95.0,45.0,Smith"
],
"rect": {
"x": 20.0,
"y": 30.0,
"width": 75.0,
"height": 15.0
},
"is_signature": false
}
]
Key Features:
- Author ID Extraction: Parses CSS classes like handwriting-author1 for style consistency
- Spatial Matching: Uses bbox overlap/proximity to map handwriting regions to text
- Long Word Splitting: Splits words > MAX_HANDWRITING_CHARS (default: 7) into multiple images
- Signature Detection: Identifies signature fields for special handling
- Character-Level Fallback: Uses word bboxes if char-level mapping unavailable
Configuration Constants:
- MAX_HANDWRITING_CHARS: 7 (max characters per diffusion generation)
- SPATIAL_MATCH_THRESHOLD: Pixel tolerance for bbox matching
Mapping Logic:
- Find all words whose bboxes overlap/intersect with handwriting rect
- Extract text from matched bboxes
- Compare with expected handwriting text
- If char bboxes available: split long words at char boundaries
- If char bboxes unavailable: split at word boundaries
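A minimal sketch of the long-word splitting step, assuming char bboxes carry a word_idx as produced by stage 05:
MAX_HANDWRITING_CHARS = 7

def split_long_word(char_bboxes: list[dict]) -> list[list[dict]]:
    # Chunk one word's character bboxes so each piece fits the diffusion
    # model's character limit; each chunk becomes a separate image.
    return [char_bboxes[i:i + MAX_HANDWRITING_CHARS]
            for i in range(0, len(char_bboxes), MAX_HANDWRITING_CHARS)]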
Error Handling:
- No bbox matches → Warning, handwriting region skipped
- Text mismatch → Logged, uses best match
- Missing char bboxes → Word-level fallback (may affect quality)
Stage 08: Extract Visual Element Definitions
File: pipeline_08_extract_visual_element_definitions.py
Purpose: Extract placeholders for visual elements (stamps, logos, barcodes, charts, photos) from geometries for generation in stage 10.
Key Functions:
- main(): Main extraction orchestrator
- parse_visual_element_geometries(): Extract elements with data-visual-element attribute
- parse_css_dimension(): Parse width/height from CSS
- parse_css_rotation(): Extract rotation angle from CSS transform
Process:
- Load geometries from stage 04
- Filter elements with data-visual-element attribute
- Parse element type (stamp, logo, barcode, etc.)
- Extract dimensions and rotation from CSS
- Parse content (e.g., stamp text, barcode value)
- Validate element types and dimensions
- Save visual element definitions
Inputs:
- Geometries from geometries/ (stage 04)
Outputs:
visual_element_definitions/
{doc_id}.json # Visual element definitions
logs/visual_element_extraction/ # Extraction logs
Visual Element Definition Format:
[
{
"id": "ve0",
"type": "stamp",
"type_unmapped": "stamp",
"content": "CONFIDENTIAL",
"rect": {
"x": 150.0,
"y": 250.0,
"width": 60.0,
"height": 30.0
},
"rotation": -15.0
},
{
"id": "ve1",
"type": "barcode",
"type_unmapped": "barcode",
"content": "1234567890",
"rect": {
"x": 20.0,
"y": 270.0,
"width": 100.0,
"height": 20.0
},
"rotation": 0.0
}
]
Supported Visual Element Types:
- stamp: Text-based stamps (e.g., "APPROVED", "CONFIDENTIAL")
- logo: Company/brand logos (selected from prefabs)
- barcode: Code128 barcodes
- chart: Charts and graphs (selected from prefabs)
- photo: Photographic images (selected from prefabs)
Type Mapping:
Maps LLM-generated type names to standard types using the VISUAL_ELEMENT_TYPE_MAPPING constant.
Key Features:
- Content Extraction: Parses text content for stamps, values for barcodes
- Rotation Support: Extracts CSS rotate() transform angles
- Dimension Parsing: Handles CSS units (px, mm, %, etc.)
- Type Validation: Warns about unknown types, uses fallback mapping
Validation Checks:
- Unknown type → Logged, mapped to closest known type
- Zero dimensions → Error, element skipped
- Invalid rotation angle → Defaults to 0°
- Missing content (for stamps/barcodes) → Warning
CSS Parsing Examples:
/* Rotation extraction */
transform: rotate(-15deg);   →  -15.0
transform: rotate(0.5turn);  →  180.0

/* Dimension extraction */
width: 60mm;   →  60.0 (mm)
height: 30px;  →  Converted to mm
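A minimal sketch of the rotation parsing covering the deg/turn cases above; the regex and the rad branch are illustrative:
import math
import re

ROTATE = re.compile(r"rotate\(\s*(-?[\d.]+)\s*(deg|turn|rad)\s*\)")

def parse_css_rotation(transform: str) -> float:
    match = ROTATE.search(transform or "")
    if not match:
        return 0.0  # no rotation specified -> default angle
    value, unit = float(match.group(1)), match.group(2)
    if unit == "turn":
        return value * 360.0
    if unit == "rad":
        return math.degrees(value)
    return value  # already in degrees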
Stage 09: Create Handwriting Images
File: pipeline_09_create_handwriting_images.py
Purpose: Generate realistic handwriting images using a diffusion model, with per-author style consistency.
Key Functions:
- main(): Main generation orchestrator
- generate_handwriting_diffusion(): Call diffusion model API/local model
- add_handwriting_blur(): Optional post-processing for realism
Process:
- Load handwriting definitions from stage 07
- Group by author ID for style consistency
- For each text segment:
  - Generate handwriting image with diffusion model
  - Apply optional blur for realism
  - Save image with metadata
- Track author-to-writer style mapping
- Save generation logs
Inputs:
- Handwriting definitions from handwriting_definitions/ (stage 07)
- Diffusion model checkpoint (local or API)
Outputs:
handwriting_images/
{doc_id}/
hw0_0.png # First bbox of handwriting region hw0
hw0_1.png # Second bbox (if multi-word)
hw1_0.png
logs/handwriting_generation/ # Generation logs with author mappings
Handwriting Image Specifications:
- Format: PNG with transparency
- Height: 40 pixels (configurable via HANDWRITING_HEIGHT_PX)
- Padding: 0 pixels horizontal (configurable via HANDWRITING_PADDING_PX)
- Background: Transparent
- Color: Black text on transparent
Diffusion Model Integration:
Located in handwriting_diffusion/ module:
- generate_handwriting_diffusion_raw.py: Core generation logic
- text_encoder.py: Text encoding for model input
- tokenizer.py: Character tokenization
Key Features:
- Author-Style Consistency: Same author ID → consistent handwriting style
- Writer Style Mapping: Maps author IDs to writer style IDs (e.g., "author1" → writer_42)
- Batch Processing: Generates multiple images efficiently
- Optional Blur: Post-processing for more realistic appearance
- Quality Control: Validates generated images (non-empty, correct dimensions)
Configuration Constants:
- HANDWRITING_HEIGHT_PX: 40 pixels
- HANDWRITING_PADDING_PX: 0 pixels
- DIFFUSION_NUM_INFERENCE_STEPS: 50 (generation quality/speed tradeoff)
- HANDWRITING_BLUR_ENABLED: Optional blur toggle
- HANDWRITING_STYLES: List of available writer styles
Author-to-Writer Mapping Example:
{
"author1": "writer_42",
"author2": "writer_17",
"author3": "writer_89"
}
Error Handling:
- Generation failure → Retry up to 3 times
- Empty image → Logged, document flagged
- Model timeout → Skipped, logged
Stage 10: Create Visual Elements
File: pipeline_10_create_visual_elements.py
Purpose: Generate or select visual element images (stamps, logos, barcodes, charts, photos) based on definitions from stage 08.
Key Functions:
- main(): Main generation orchestrator
- route_visual_element_generation(): Router for element type
- generate_stamp(): Create stamp image with text
- select_logo(): Random selection from logo prefabs
- generate_barcode(): Create Code128 barcode
- select_photo(): Random selection from photo prefabs
- select_chart(): Random selection from chart prefabs
Process:
- Load visual element definitions from stage 08
- For each element:
  - Route to type-specific generator
  - Generate or select image
  - Resize to target dimensions
  - Save with transparency (if applicable)
- Cache prefab directories for performance
- Save generation logs
Inputs:
- Visual element definitions from visual_element_definitions/ (stage 08)
- Prefab directories in data/visual_element_prefabs/: logos/, photos/, charts/
Outputs:
visual_element_images/
{doc_id}/
ve0.png # Stamp image
ve1.png # Logo image
ve2.png # Barcode image
logs/visual_element_generation/ # Generation logs
Type-Specific Generation:
Stamps:
- Font: Configurable (default: Arial Bold)
- Color: Configurable (default: red/blue)
- Background: Transparent
- Border: Optional rounded rectangle
- Rotation: Applied during insertion (stage 13)
Logos:
- Source: Random selection from data/visual_element_prefabs/logos/
- Format: PNG with transparency preserved
- Caching: Directory contents cached after first scan
Barcodes:
- Type: Code128 (supports alphanumeric)
- Library: python-barcode
- Background: White
- Content validation: Numeric or alphanumeric
Photos:
- Source: Random selection from data/visual_element_prefabs/photos/
- Format: JPEG or PNG
- Aspect ratio: Preserved during resize
Charts/Figures:
- Source: Random selection from data/visual_element_prefabs/charts/
- Format: PNG with transparency
- Types: Bar charts, line graphs, pie charts, etc.
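For the barcode path, a minimal sketch with the python-barcode library the stage uses; output naming is illustrative:
import barcode
from barcode.writer import ImageWriter

def generate_barcode(content: str, out_stem: str) -> str:
    # Code128 supports alphanumeric content; ImageWriter renders to PNG.
    code = barcode.get("code128", content, writer=ImageWriter())
    return code.save(out_stem)  # writes <out_stem>.png and returns the path

generate_barcode("1234567890", "ve1")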
Key Features:
- Type-Specific Logic: Each element type has dedicated generation function
- Prefab Caching: Directory scans cached for performance
- Transparent Backgrounds: Stamps and some logos support transparency
- Content Validation: Barcodes validate numeric content
- Aspect Ratio Preservation: Images scaled without distortion
Configuration Constants:
- STAMP_FONT_SIZE: Calculated from target dimensions
- STAMP_BORDER_WIDTH: 2 pixels
- BARCODE_DPI: 300 for high quality
- PREFAB_CACHE_SIZE: In-memory cache limit
Error Handling:
- Missing prefab directory → Error, element skipped
- Empty prefab directory → Warning, fallback placeholder
- Invalid barcode content → Logged, uses fallback text
- Image generation failure → Placeholder created
Stage 11: Render PDF Second Pass
File: pipeline_11_render_pdf_second_pass.py
Purpose: Re-render PDF without handwriting placeholders (replaced with blank spaces) to prepare for handwriting image insertion in stage 12.
Key Functions:
- main(): Main rendering orchestrator
- render_pdf_playwright_async(): Async Playwright rendering
- remove_handwriting_placeholders_from_html(): Strip handwriting elements
Process:
- Load HTML from render_html/ (stage 04)
- Remove all elements with data-handwriting attribute
- Re-render PDF using same dimensions from stage 04
- Validate PDF output
- Save PDFs without handwriting placeholders
Inputs:
- HTML from render_html/ (stage 04)
- Page dimensions from stage 04 logs
Outputs:
pdf_without_handwriting_placeholder/
{doc_id}.pdf # PDFs with handwriting regions blank
logs/rendering_second_pass/ # Rendering logs
Key Differences from Stage 04:
- Uses Pre-calculated Dimensions: No content measurement needed
- Handwriting Elements Removed: Not just hidden (visibility: hidden), but removed from DOM
- No Geometry Extraction: Geometries already saved in stage 04
- Faster Rendering: No JavaScript injection for measurement
HTML Preprocessing:
<!-- Before (Stage 04) -->
<div data-handwriting="author1" class="handwriting">John Smith</div>
<!-- After (Stage 11) -->
<!-- Element completely removed -->
Why This Stage Exists: Handwriting placeholders in HTML use system fonts, which don't match the diffusion-generated handwriting. Removing them creates blank spaces where handwriting images will be inserted in stage 12.
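A minimal sketch of the placeholder removal with BeautifulSoup; the real stage operates on the render_html output from stage 04:
from bs4 import BeautifulSoup

def remove_handwriting_placeholders(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for el in soup.find_all(attrs={"data-handwriting": True}):
        el.decompose()  # remove from the DOM entirely, not just hide
    return str(soup)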
Rendering Configuration:
- Same timeout and retry logic as stage 04
- Reuses Playwright browser context for efficiency
- No debug output (geometries not extracted)
Error Handling:
- Multi-page PDF → Flagged and skipped
- Rendering timeout → Retry with increased timeout
- HTML parsing error → Logged, document marked invalid
Stage 12: Insert Handwriting Images
File: pipeline_12_insert_handwriting_images.py
Purpose: Overlay generated handwriting images onto PDF pages using PyMuPDF, with precise positioning and natural variation.
Key Functions:
- main(): Main insertion orchestrator
- insert_handwriting_into_pdf(): Per-document insertion
- scale_image_with_aspect_ratio(): Resize images while preserving aspect
- group_bboxes_by_line(): Group multi-word handwriting by line
Process:
- Load PDFs from stage 11
- Load handwriting images from stage 09
- Load handwriting definitions (bbox mappings) from stage 07
- For each handwriting region:
  - Group bboxes by line/block
  - Scale images to fit bboxes
  - Apply random offsets for natural variation
  - Insert images at calculated positions
- Save PDFs with handwriting
Inputs:
- PDFs from pdf_without_handwriting_placeholder/ (stage 11)
- Handwriting images from handwriting_images/ (stage 09)
- Handwriting definitions from handwriting_definitions/ (stage 07)
Outputs:
pdf_with_handwriting/
{doc_id}.pdf # PDFs with handwriting inserted
logs/handwriting_insertion/ # Insertion logs
debug/handwriting_insertion/ # Debug PDFs with bbox overlays (optional)
Insertion Logic:
1. Image Scaling:
- High-res scaling: 3x upsampling before insertion
- Aspect ratio preserved
- Left-aligned within bbox (respects layout rect)
2. Positioning:
- X coordinate: Left edge of bbox + random offset
- Y coordinate: Top edge of bbox + random offset
- Multi-word lines: Consistent Y offset for line
3. Random Offsets:
x_offset = random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X)
y_offset = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y)
4. Block/Line Grouping: For multi-word handwriting (e.g., "John Smith"):
- Group bboxes by Y coordinate (same line)
- Apply consistent Y offset to entire line
- Individual X offsets per word for natural spacing
Key Features:
- High-Resolution Insertion: 3x scaling for quality
- Natural Variation: Random offsets simulate handwriting imperfection
- Line Consistency: Multi-word lines maintain baseline
- Aspect Ratio Preservation: Images not distorted
- Transparency Support: PNG alpha channel preserved
Configuration Constants:
- HANDWRITING_IMAGE_UPSCALE_FACTOR: 3x
- MAX_HANDWRITING_RAND_X: ±2 pixels
- MAX_HANDWRITING_RAND_Y: ±1 pixel
- HANDWRITING_LINE_Y_CONSISTENCY: Same Y offset per line
Coordinate System:
- PDF uses 72 DPI (points)
- Bboxes from stage 07 are in PDF coordinates
- No conversion needed
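A minimal sketch of a single insertion with PyMuPDF, assuming bboxes in PDF points; the offset constants mirror the values above:
import random
import fitz  # PyMuPDF

MAX_HANDWRITING_RAND_X = 2.0
MAX_HANDWRITING_RAND_Y = 1.0

def insert_handwriting(pdf_path: str, image_path: str,
                       bbox: tuple[float, float, float, float],
                       out_path: str) -> None:
    x0, y0, x1, y1 = bbox
    # Jitter the target rect slightly to simulate natural hand placement.
    dx = random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X)
    dy = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y)
    doc = fitz.open(pdf_path)
    rect = fitz.Rect(x0 + dx, y0 + dy, x1 + dx, y1 + dy)
    # keep_proportion preserves the handwriting image's aspect ratio.
    doc[0].insert_image(rect, filename=image_path, keep_proportion=True)
    doc.save(out_path)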
Error Handling:
- Missing handwriting image → Warning, bbox skipped
- Image too large for bbox → Scaled down with warning
- PyMuPDF insertion failure → Logged, document flagged
Stage 13: Insert Visual Elements
File: pipeline_13_insert_visual_elements.py
Purpose: Overlay visual element images (stamps, logos, barcodes, etc.) onto PDF pages with precise positioning and rotation.
Key Functions:
- main(): Main insertion orchestrator
- insert_visual_elements_into_pdf(): Per-document insertion
- scale_image_with_aspect_ratio(): Resize images (same as stage 12)
Process:
- Load PDFs from stage 12
- Load visual element images from stage 10
- Load visual element definitions from stage 08
- For each visual element:
  - Scale image to fit bbox
  - Calculate centered position
  - Apply rotation (if specified)
  - Insert image at calculated position
- Save final PDFs
- If no visual elements: copy PDF from stage 12
Inputs:
- PDFs from pdf_with_handwriting/ (stage 12)
- Visual element images from visual_element_images/ (stage 10)
- Visual element definitions from visual_element_definitions/ (stage 08)
Outputs:
pdf_final/
{doc_id}.pdf # Final PDFs with all elements
logs/visual_element_insertion/ # Insertion logs
debug/visual_element_insertion/ # Debug PDFs (optional)
Insertion Logic:
1. Image Scaling:
- High-res scaling: 3x upsampling
- Aspect ratio preserved
- Centered within bbox (not left-aligned like handwriting)
2. Positioning:
- X coordinate: Center of bbox - half image width
- Y coordinate: Center of bbox - half image height
3. Rotation:
- Applied via PyMuPDF transformation matrix
- Rotation around image center
- Angle from visual element definition
Key Differences from Stage 12 (Handwriting):
- Centered placement: Visual elements centered in bbox
- No random offsets: Precise placement for logos/stamps
- Rotation support: Stamps often rotated for "APPROVED" effect
- Fallback: Copies PDF if no visual elements (ensures output exists)
Rotation Transformation:
# PyMuPDF rotation matrix
rotation_matrix = fitz.Matrix(rotation_angle)
image_rect = image_rect * rotation_matrix
Key Features:
- High-Resolution Insertion: 3x scaling for quality
- Centered Alignment: Visual elements centered in bboxes
- Rotation Support: Arbitrary angles for stamps
- Transparency Preservation: PNG alpha channel maintained
- Fallback Handling: Copies PDF if no visual elements
Configuration Constants:
- VISUAL_ELEMENT_UPSCALE_FACTOR: 3x
- ROTATION_PRECISION: Angle precision in degrees
Coordinate System:
- Same as stage 12 (PDF 72 DPI)
- Rotation applied after positioning
Error Handling:
- Missing visual element image → Warning, element skipped
- Image too large for bbox → Scaled down with warning
- PyMuPDF insertion failure → Logged, document flagged
- No visual elements → PDF copied from stage 12
Stage 14: Render Image
File: pipeline_14_render_image.py
Purpose: Convert final PDFs to high-quality PNG images for OCR and dataset distribution.
Key Functions:
- main(): Main conversion orchestrator
- convert_pdf_to_image(): PDF to PNG conversion
Process:
- Load PDFs from stage 13
- Convert each PDF to PNG using custom PDF-to-image module
- Validate image dimensions
- Save images
Inputs:
- PDFs from pdf_final/ (stage 13)
Outputs:
images/
{doc_id}.png # Final document images
logs/image_rendering/ # Conversion logs
Image Specifications:
- Format: PNG
- DPI: Configurable (default: 200 DPI for quality OCR)
- Color Mode: RGB (24-bit)
- Compression: PNG lossless
PDF-to-Image Module: Located in custom module (not standard library):
- Handles PDF rendering at specified DPI
- Single-page conversion only (multi-page PDFs skipped)
- Uses PDF coordinate system: 72 DPI internally
Key Features:
- High DPI: 200+ DPI for accurate OCR
- Single Page Only: Multi-page PDFs flagged as errors
- Lossless Compression: PNG preserves all details
- Size Validation: Checks image dimensions match PDF
Configuration Constants:
- IMAGE_DPI: 200 (OCR quality vs file size tradeoff)
- IMAGE_MAX_DIMENSION: Optional max width/height
Coordinate System Conversion:
PDF: 210mm × 297mm @ 72 DPI  → 595 × 842 points
PNG: 210mm × 297mm @ 200 DPI → 1654 × 2339 pixels
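One way to implement the conversion with PyMuPDF; the project uses its own PDF-to-image module, so this is a sketch, not the actual code:
import fitz  # PyMuPDF

IMAGE_DPI = 200

def convert_pdf_to_image(pdf_path: str, out_path: str) -> None:
    doc = fitz.open(pdf_path)
    if doc.page_count != 1:
        raise ValueError("multi-page PDFs are flagged and skipped")
    # get_pixmap(dpi=...) renders at the requested resolution; save() writes PNG.
    doc[0].get_pixmap(dpi=IMAGE_DPI).save(out_path)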
Why This Stage:
- OCR performs better on high-DPI images
- Images are final output format for datasets
- PNG preserves quality better than JPEG for text
Error Handling:
- Multi-page PDF → Flagged and skipped
- Conversion failure → Logged, document marked invalid
- Empty image → Error, document flagged
- Dimension mismatch → Warning logged
Stage 15: Perform OCR
File: pipeline_15_perform_ocr.py
Purpose: Perform Optical Character Recognition on final images to obtain accurate word and line-level bounding boxes, essential for documents with handwriting or visual elements.
Key Functions:
- main(): Main OCR orchestrator
- call_microsoft_ocr(): Microsoft Azure OCR API call
- convert_ocr_to_bbox_format(): Transform OCR results to internal format
- aggregate_words_to_segments(): Group words into lines
Process:
- Determine which documents need OCR:
  - Has handwriting → Requires OCR
  - Has visual elements → Requires OCR
  - Neither → Copy PDF bboxes from stage 05
- For documents requiring OCR:
  - Call Microsoft OCR service
  - Parse word-level bboxes
  - Aggregate into line-level segments
  - Convert coordinates to PDF space
- Save final bboxes
Inputs:
- Images from images/ (stage 14)
- Handwriting definitions from handwriting_definitions/ (stage 07)
- Visual element definitions from visual_element_definitions/ (stage 08)
- PDF bboxes from pdf_word_bboxes/ (stage 05, for non-OCR documents)
Outputs:
final_word_bboxes/
{doc_id}.json # Word-level bounding boxes
final_segment_bboxes/
{doc_id}.json # Line-level bounding boxes
ocr_results_cache/
{doc_id}.json # Raw OCR API responses (cached)
logs/ocr/ # OCR logs and errors
OCR Decision Logic:
requires_ocr = (
    has_handwriting(doc_id) or
    has_visual_elements(doc_id)
)
if requires_ocr:
    perform_microsoft_ocr()
else:
    copy_pdf_bboxes()  # Reuse stage 05 results
Why Handwriting/Visual Elements Require OCR:
- Handwriting images inserted in stage 12 → Not in PDF text layer
- Visual elements inserted in stage 13 → Not in PDF text layer
- PDF text extraction (stage 05) misses these elements
- OCR captures all visible text on rendered image
Microsoft OCR API:
- Service: Azure Computer Vision (Read API)
- Input: PNG image (from stage 14)
- Output: Word polygons, text, confidence scores
- Caching: Results cached to avoid re-processing
OCR Response Format:
{
"readResults": [{
"page": 1,
"words": [
{
"boundingBox": [20, 30, 60, 30, 60, 45, 20, 45],
"text": "Invoice",
"confidence": 0.98
}
],
"lines": [
{
"boundingBox": [20, 30, 95, 30, 95, 45, 20, 45],
"text": "Invoice Date",
"words": [...]
}
]
}]
}
Coordinate Conversion: OCR returns pixel coordinates; converted to PDF points:
pdf_x = ocr_x * (pdf_width / image_width)
pdf_y = ocr_y * (pdf_height / image_height)
Segment Aggregation: Groups words into lines based on Y coordinate proximity.
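A minimal sketch of that aggregation; the tolerance value is illustrative:
def aggregate_words_to_segments(words: list[dict], y_tol: float = 3.0) -> list[list[dict]]:
    # Sort top-to-bottom, left-to-right, then start a new segment whenever
    # the vertical gap to the previous word exceeds the tolerance.
    segments: list[list[dict]] = []
    for word in sorted(words, key=lambda w: (w["y0"], w["x0"])):
        if segments and abs(word["y0"] - segments[-1][-1]["y0"]) <= y_tol:
            segments[-1].append(word)
        else:
            segments.append([word])
    return segments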
Key Features:
- Selective OCR: Only runs OCR when necessary (cost/time savings)
- Result Caching: Avoids redundant API calls
- Dual-Level Output: Word and line (segment) bboxes
- Coordinate Conversion: OCR pixels → PDF points
- Fallback: Copies PDF bboxes for documents without handwriting/VEs
Configuration:
- OCR API key from environment variables
- Timeout: 30 seconds per request
- Retry: Up to 3 attempts
Error Handling:
- OCR API failure → Retry, fallback to PDF bboxes if all retries fail
- Empty OCR result → Warning, uses PDF bboxes
- Coordinate conversion error → Logged, bbox skipped
Stage 16: Normalize BBoxes
File: pipeline_16_normalize_bboxes.py
Purpose: Convert bounding box coordinates from PDF points (absolute pixels) to normalized [0, 1] coordinates for model training and evaluation.
Key Functions:
- main(): Main normalization orchestrator
- normalize_word_and_segment_bboxes(): Normalize word/segment bboxes
- normalize_layout_bboxes(): Normalize layout element bboxes (DLA only)
- normalize_coordinates(): Core coordinate transformation
Process:
- Load final bboxes from stage 15
- Load image dimensions (for normalization denominators)
- For each bbox:
  - Transform: normalized_x = pixel_x / image_width
  - Transform: normalized_y = pixel_y / image_height
  - Preserve text content
- Save normalized bboxes
- For DLA tasks: also normalize layout element bboxes
Inputs:
- Word bboxes from final_word_bboxes/ (stage 15)
- Segment bboxes from final_segment_bboxes/ (stage 15)
- Layout element definitions from layout_element_definitions/ (stage 06, for DLA)
- Image dimensions from stage 14 logs
Outputs:
normalized_word_bboxes/
{doc_id}.json # Normalized word bboxes
normalized_segment_bboxes/
{doc_id}.json # Normalized segment bboxes
normalized_gt/ # For DLA tasks only
{doc_id}.json # Normalized layout elements
logs/normalization/ # Normalization logs
Normalization Formula:
normalized_bbox = {
    "x0": bbox["x0"] / image_width,
    "y0": bbox["y0"] / image_height,
    "x1": bbox["x1"] / image_width,
    "y1": bbox["y1"] / image_height,
    "text": bbox["text"]  # Preserved
}
Normalized BBox Format:
[
{
"x0": 0.095, # Was: 20 pixels out of 210mm @ 200dpi
"y0": 0.128, # Was: 30 pixels
"x1": 0.286, # Was: 60 pixels
"y1": 0.192, # Was: 45 pixels
"text": "Invoice"
}
]
DLA-Specific Normalization: For Document Layout Analysis tasks, layout element bboxes are also normalized:
[
{
"label": "title",
"bbox": [0.095, 0.128, 0.905, 0.192] # [x0, y0, x1, y1]
},
{
"label": "text",
"bbox": [0.095, 0.213, 0.905, 0.534]
}
]
Why Normalization:
- Model Training: Most models expect [0, 1] coordinates
- Resolution Independence: Works across different image sizes
- Standard Format: Matches common dataset formats (e.g., LayoutLM)
Coordinate System Mapping:
PDF Points (72 DPI):
    595 × 842 points (A4)
        ↓ (stage 14 conversion @ 200 DPI)
Image Pixels (200 DPI):
    1654 × 2339 pixels
        ↓ (stage 16 normalization)
Normalized [0, 1]:
    0.0-1.0 × 0.0-1.0
Key Features:
- Preserves Text: Text content unchanged during normalization
- Task-Specific: DLA tasks get additional layout bbox normalization
- Validation: Checks for out-of-bounds coordinates (clamps to [0, 1])
- Precision: Full float precision maintained
Error Handling:
- Out-of-bounds coordinates → Clamped to [0, 1] with warning
- Missing image dimensions → Error, document skipped
- Zero image dimensions → Error, document marked invalid
Stage 17: GT Preparation & Verification
File: pipeline_17_gt_preparation_verification.py
Purpose: Validate and prepare final ground truth annotations with fuzzy text matching, task-specific formatting, and comprehensive validation.
Key Functions:
- main(): Main verification orchestrator
- route_task(): Route to task-specific handler
- handle_qa(): QA ground truth processing
- handle_kie(): KIE ground truth with BIO tagging
- handle_dla(): DLA ground truth processing
- fuzzy_match_text(): Levenshtein-based text matching
Process:
- Load raw GT from stage 03/06
- Load final word bboxes from stage 15
- Route to task-specific handler
- Perform fuzzy text matching (GT text ↔ OCR text)
- Map GT annotations to bbox indices
- Apply task-specific formatting
- Validate and save verified GT
Inputs:
- Raw GT from raw_annotations/ (stage 03/06)
- Word bboxes from final_word_bboxes/ (stage 15)
- Layout elements from layout_element_definitions/ (stage 06, for DLA)
- Visual elements from visual_element_definitions/ (stage 08, for DLA)
Outputs:
verified_gt/
qa/
{doc_id}.json # QA format
kie/
{doc_id}.json # KIE format with BIO tagging
dla/
{doc_id}.json # DLA format with normalized bboxes
cls/
{doc_id}.json # Classification format
logs/gt_verification/ # Verification logs with match statistics
QA (Question Answering) Format
Process:
- Load QA pairs: {"question": "answer", ...}
- For each answer:
  - Find words in bboxes matching answer text (fuzzy)
  - Record bbox indices of matching words
- Save verified GT with bbox mappings
Output Format:
{
"questions": [
{
"question": "What is the invoice number?",
"answer": "INV-12345",
"answer_bbox_indices": [15, 16] # Word indices in final_word_bboxes
},
{
"question": "What is the total amount?",
"answer": "$1,234.56",
"answer_bbox_indices": [42]
}
]
}
Fuzzy Matching: Uses Levenshtein distance with 0.85 similarity cutoff:
similarity = fuzz.ratio(gt_answer, ocr_text) / 100.0
if similarity >= 0.85:
    match_found = True
KIE (Key Information Extraction) Format
Process:
- Load entity annotations: {entity_name: text_value, ...}
- For each entity:
  - Find words matching entity value (fuzzy)
  - Generate BIO tags for all words
- Save verified GT with BIO tagging
Output Format:
{
"entities": [
{
"entity": "company_name",
"value": "Acme Corporation",
"bbox_indices": [5, 6]
},
{
"entity": "invoice_date",
"value": "January 15, 2024",
"bbox_indices": [20, 21, 22]
}
],
"word_labels": [
"O", "O", "O", "O", "O", # Words 0-4: Outside entities
"B-company_name", "I-company_name", # Words 5-6: Company name
"O", "O", ..., # Words 7-19: Outside
"B-invoice_date", "I-invoice_date", "I-invoice_date" # Words 20-22
]
}
BIO Tagging:
- B-entity: Beginning of entity
- I-entity: Inside entity (continuation)
- O: Outside any entity
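A minimal sketch of the label generation, assuming each entity's bbox_indices are ordered word positions:
def bio_tags(num_words: int, entities: list[dict]) -> list[str]:
    labels = ["O"] * num_words
    for ent in entities:
        for pos, idx in enumerate(ent["bbox_indices"]):
            prefix = "B" if pos == 0 else "I"  # first word opens the entity
            labels[idx] = f"{prefix}-{ent['entity']}"
    return labels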
DLA (Document Layout Analysis) Format
Process:
- Load layout element definitions from stage 06
- Load visual element definitions from stage 08
- Validate labels (must be in valid_labels)
- Check spatial constraints (no containment, minimal overlap)
- Merge visual elements into layout annotations
- Normalize bboxes to [0, 1]
- Save verified GT
Output Format:
{
"layout_elements": [
{
"id": "layout_0",
"label": "title",
"bbox": [0.095, 0.128, 0.905, 0.192] # Normalized [x0, y0, x1, y1]
},
{
"id": "layout_1",
"label": "text",
"bbox": [0.095, 0.213, 0.905, 0.534]
},
{
"id": "ve0",
"label": "figure", # Visual element mapped to layout label
"bbox": [0.714, 0.895, 0.905, 0.980]
}
]
}
Visual Element Merging: Visual elements from stage 08 are converted to layout labels:
- stamp → figure (or custom mapping)
- logo → figure
- chart → figure
- barcode → figure
- photo → figure
Spatial Validation:
- No containment: One bbox fully inside another → Error
- Minimal overlap: Overlap area < 5% of smaller bbox → Warning
- Valid labels: All labels must be in SynDatasetDefinition.valid_labels
CLS (Classification) Format
Process:
- Load classification label from raw GT
- Validate label against expected classes
- Save verified GT
Output Format:
{
"document_class": "invoice",
"confidence": 1.0
}
Fuzzy Matching Details:
Uses Levenshtein distance (via fuzz library) to handle OCR discrepancies:
# Example: GT "INV-12345" vs OCR "INV-I2345" (OCR error)
similarity = fuzz.ratio("INV-12345", "INV-I2345") / 100.0
# similarity = 0.89 (above 0.85 threshold) → Match!
Match Statistics Logged:
- Total GT annotations
- Successfully matched annotations
- Failed matches (similarity < 0.85)
- Average similarity score
Key Features:
- Fuzzy Matching: Handles OCR errors gracefully
- Task-Specific Formatting: QA, KIE, DLA, CLS all handled differently
- BIO Tagging: Automatic generation for KIE
- Visual Element Integration: DLA merges visual elements as layout annotations
- Spatial Validation: Detects overlapping/contained layout elements
- Similarity Tracking: Logs match quality for analysis
Error Handling:
- Match failure (similarity < 0.85) → Logged, annotation skipped
- Invalid labels (DLA) → Error, document marked invalid
- Spatial violations (DLA) → Warning, elements flagged
- Missing bboxes → Error, document marked invalid
Stage 18: Analyze
File: pipeline_18_analyze.py
Purpose: Generate comprehensive statistics, cost analysis, and error categorization for the entire dataset generation process.
Key Functions:
- main(): Main analysis orchestrator
- calculate_api_costs(): Compute LLM API costs from token usage
Process:
- Load all document logs from stages 01-17
- Categorize documents: valid vs. invalid
- Calculate error distributions
- Compute API usage and costs
- Generate statistics (handwriting, visual elements, annotations)
- Save comprehensive dataset log
Inputs:
- All document logs from previous stages
- Batch results from stage 02 (for cost calculation)
- Message processing logs from stage 03
- Prompt usage statistics
Outputs:
dataset_log.json # Comprehensive dataset statistics
logs/analysis/ # Analysis logs
Dataset Log Structure:
{
"metadata": {
"syndatadef_name": "docvqa_alpha=1.0",
"task": "qa",
"total_documents_requested": 1000,
"generation_date": "2026-02-07"
},
"prompting": {
"total_prompts": 100,
"total_batches": 10,
"llm_model": "claude-sonnet-4-20250514"
},
"total_cost_summary": {
"total_cost_usd": 123.45,
"input_tokens": 1500000,
"output_tokens": 800000,
"cached_tokens": 500000,
"cost_per_document": 0.12
},
"valid_samples_stats": {
"total_valid": 847,
"total_invalid": 153,
"validity_rate": 0.847,
"avg_handwriting_regions_per_doc": 2.3,
"avg_visual_elements_per_doc": 1.5,
"avg_annotations_per_doc": 8.7,
"documents_with_handwriting": 654,
"documents_with_visual_elements": 512
},
"valid_samples": [
{
"doc_id": "doc_0001",
"seed_image": "docvqa_train_12345",
"has_handwriting": true,
"has_visual_elements": true,
"num_annotations": 10,
"num_words": 247,
"image_size": [1654, 2339]
}
],
"valid_samples_by_category": {
"invoice": 234,
"receipt": 198,
"form": 415
},
"errors": {
"multipage_pdf": 12,
"missing_ocr_result": 5,
"failed_gt_verification": 38,
"rendering_timeout": 8,
"llm_parsing_error": 23,
"bbox_extraction_failed": 4,
"handwriting_generation_failed": 7,
"visual_element_generation_failed": 3,
"other": 53
}
}
Error Categories:
| Error Category | Description |
|---|---|
| multipage_pdf | PDF rendered with multiple pages (invalid) |
| missing_ocr_result | OCR failed or returned empty result |
| failed_gt_verification | GT matching/validation failed in stage 17 |
| rendering_timeout | PDF rendering exceeded timeout |
| llm_parsing_error | Failed to extract HTML/GT from LLM response |
| bbox_extraction_failed | PyMuPDF failed to extract bboxes |
| handwriting_generation_failed | Diffusion model failed |
| visual_element_generation_failed | VE generation/selection failed |
| other | Miscellaneous errors |
Cost Calculation:
Claude API Pricing (example):
costs = {
    "input": input_tokens * 0.003 / 1000,    # $3 per 1M tokens
    "output": output_tokens * 0.015 / 1000,  # $15 per 1M tokens
    "cached": cached_tokens * 0.0003 / 1000  # $0.30 per 1M tokens (prompt caching)
}
total_cost = costs["input"] + costs["output"] + costs["cached"]
Statistics Computed:
- Validity Rate: Percentage of documents passing all stages
- Handwriting Stats: Documents with handwriting, avg regions per doc
- Visual Element Stats: Documents with VEs, avg elements per doc
- Annotation Stats: Avg QA pairs/KIE entities/DLA elements per doc
- Token Usage: Input/output/cached token totals
- Cost Metrics: Total cost, cost per document, cost per valid document
Key Features:
- Comprehensive Error Tracking: All error categories logged
- Cost Transparency: Token-level cost breakdown
- Quality Metrics: Validity rate, avg annotations, etc.
- Category Breakdown: Valid samples grouped by document type
- Per-Document Tracking: Each valid document's metadata saved
Usage: This log is essential for:
- Understanding dataset quality
- Optimizing pipeline (identify bottlenecks)
- Cost estimation for future runs
- Debugging (error distribution analysis)
Stage 19: Create Debug Data
File: pipeline_19_create_debug_data.py
Purpose: Generate comprehensive debug visualizations for manual inspection and quality assurance.
Key Functions:
- main(): Main debug generator
- visualize_visual_element_bboxes(): Overlay VE bboxes on PDFs
- visualize_final_bboxes_on_images(): Overlay OCR bboxes on images
- visualize_pdf_bboxes(): Overlay PDF bboxes
Process:
- Load all generated documents
- Create debug subdirectories
- For each document:
  - Generate PDF bbox overlays
  - Generate VE bbox overlays
  - Generate handwriting insertion region overlays
  - Generate final OCR bbox overlays on images
- Copy raw HTML with debug.js script for browser inspection
Inputs:
- All intermediate and final outputs from stages 01-18
Outputs:
debug/
pdf_bboxes/ # PDF word bboxes overlaid (stage 05)
{doc_id}.pdf
visual_element_bboxes/ # VE bboxes overlaid (stage 08)
{doc_id}.pdf
handwriting_insertion/ # Handwriting regions overlaid (stage 12)
{doc_id}.pdf
final_bboxes_on_images/ # OCR bboxes on final images (stage 15)
{doc_id}.png
html_with_debug/ # Raw HTML + debug.js
{doc_id}.html
debug.js # Browser-based inspection script
Debug Visualizations:
1. PDF BBoxes (Stage 05):
- Red rectangles: Word bounding boxes from PyMuPDF
- Annotated with word text
- Purpose: Verify PDF text extraction quality
2. Visual Element BBoxes (Stage 08):
- Blue rectangles: Visual element placeholder regions
- Annotated with element type (stamp, logo, etc.)
- Purpose: Verify VE extraction and positioning
3. Handwriting Insertion Regions (Stage 12):
- Green rectangles: Handwriting bbox regions
- Annotated with handwriting text
- Purpose: Verify handwriting placement accuracy
4. Final BBoxes on Images (Stage 15):
- Orange rectangles: OCR word bounding boxes
- Overlaid on final rendered images
- Purpose: Verify OCR accuracy and coverage
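A minimal sketch of one such overlay with PyMuPDF, following the color scheme above (red for stage 05 word boxes):
import fitz  # PyMuPDF

def visualize_pdf_bboxes(pdf_path: str, bboxes: list[dict], out_path: str) -> None:
    doc = fitz.open(pdf_path)
    page = doc[0]
    for bb in bboxes:
        rect = fitz.Rect(bb["x0"], bb["y0"], bb["x1"], bb["y1"])
        page.draw_rect(rect, color=(1, 0, 0), width=2)  # red outline
        page.insert_text(rect.tl, bb["text"], fontsize=4,
                         color=(1, 0, 0))               # annotate with word text
    doc.save(out_path)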
Debug JavaScript (debug.js):
// Browser-based inspection tool
// Features:
// - Highlight elements on hover
// - Show geometry data in console
// - Toggle element visibility
// - Measure element dimensions
Key Features:
- Color-Coded Overlays: Different colors for different stages
- Text Annotations: Bboxes labeled with content
- Multi-Format: Both PDF and PNG visualizations
- Browser Inspection: HTML with interactive debug script
- Selective Generation: Only enabled with DEBUG_MODE=true flag
Configuration:
- DEBUG_MODE: Enable/disable debug output (default: false)
- DEBUG_BBOX_LINE_WIDTH: Line thickness for overlays (default: 2)
- DEBUG_BBOX_OPACITY: Overlay transparency (default: 0.5)
When to Use:
- Visual quality assurance
- Debugging bbox extraction issues
- Verifying handwriting/VE insertion
- Identifying OCR problems
- Manual inspection of edge cases
Performance Note: Debug generation adds ~20% processing time; disabled by default in production.
API Implementation
The DocGenie API provides a FastAPI-based REST service for synchronous document generation, integrating pipeline stages 01-06.
Architecture Overview
┌───────────────────────────────────────────────────────────────────┐
│                          API ARCHITECTURE                         │
└───────────────────────────────────────────────────────────────────┘

Client Request
      │
      ▼
┌─────────────────────┐
│   FastAPI Server    │
│      (main.py)      │
└─────────────────────┘
      │
      ├──▶ Validate Request (schemas.py)
      │
      ├──▶ Download Seed Images (utils.py)
      │
      ├──▶ Build Prompt (utils.py)
      │
      ├──▶ Call Claude API (Synchronous)
      │        └──▶ Claude Sonnet 4.5
      │
      ├──▶ Extract HTML & GT (pipeline_03 functions)
      │
      ├──▶ Render PDF (pipeline_04 functions)
      │        └──▶ Playwright/Chromium
      │
      ├──▶ Extract BBoxes (pipeline_05 functions)
      │        └──▶ PyMuPDF
      │
      └──▶ Return Response
               └──▶ JSON with base64-encoded PDFs
Endpoints
GET /
Health check endpoint.
Response:
{
"message": "DocGenie API is running",
"version": "1.0",
"status": "healthy"
}
GET /health
Detailed health status.
Response:
{
"status": "healthy",
"api_version": "1.0",
"playwright_available": true,
"llm_model": "claude-sonnet-4-5-20250929"
}
POST /generate
Generate documents with ground truth annotations.
Request Body (schemas.GenerateDocumentRequest):
{
"seed_images": [
"https://example.com/seed1.jpg",
"https://example.com/seed2.jpg"
],
"prompt_params": {
"language": "English",
"doc_type": "business and administrative",
"gt_type": "Multiple questions and their answers",
"gt_format": "{\"question\": \"answer\", ...}",
"num_solutions": 2
}
}
Request Validation:
- seed_images: 1-10 URLs (HTTPS only)
- num_solutions: 1-5 documents
- All prompt parameters required
Response (schemas.GenerateDocumentResponse):
{
"success": true,
"message": "Successfully generated 2 documents",
"documents": [
{
"document_id": "doc_20260207_001",
"html": "<html>...</html>",
"css": "body { ... }",
"ground_truth": {
"What is the invoice number?": "INV-12345"
},
"pdf_base64": "JVBERi0xLjQK...",
"bboxes": [
{
"x0": 20.0,
"y0": 30.0,
"x1": 60.0,
"y1": 45.0,
"text": "Invoice"
}
],
"page_width_mm": 210.0,
"page_height_mm": 297.0
}
],
"total_documents": 2
}
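A minimal client example with requests, assuming the server runs on localhost:8000; the seed URL is a placeholder:
import base64
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={
        "seed_images": ["https://example.com/seed1.jpg"],
        "prompt_params": {
            "language": "English",
            "doc_type": "business and administrative",
            "gt_type": "Multiple questions and their answers",
            "gt_format": '{"question": "answer", ...}',
            "num_solutions": 1,
        },
    },
    timeout=300,
)
doc = resp.json()["documents"][0]
with open(f"{doc['document_id']}.pdf", "wb") as f:
    f.write(base64.b64decode(doc["pdf_base64"]))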
POST /generate-files
Generate documents and return as downloadable files.
Request: Same as /generate
Response:
- Content-Type: application/zip
- File: ZIP archive containing doc_001.pdf, doc_001_gt.json, doc_001_bboxes.json, doc_002.pdf, ...
File Structure:
generated_documents.zip
├── doc_001.pdf
├── doc_001_gt.json
├── doc_001_bboxes.json
├── doc_001_metadata.json
├── doc_002.pdf
└── ...
Request/Response Schemas
Defined in api/schemas.py:
PromptParameters
class PromptParameters(BaseModel):
    language: str = "English"
    doc_type: str = "business and administrative"
    gt_type: str = "Multiple questions and their answers"
    gt_format: str = '{"question": "answer", ...}'
    num_solutions: int = Field(default=1, ge=1, le=5)
GenerateDocumentRequest
class GenerateDocumentRequest(BaseModel):
    seed_images: List[HttpUrl] = Field(..., min_items=1, max_items=10)
    prompt_params: PromptParameters
BoundingBox
class BoundingBox(BaseModel):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str
DocumentResult
class DocumentResult(BaseModel):
    document_id: str
    html: str
    css: str
    ground_truth: Optional[dict]
    pdf_base64: str
    bboxes: List[BoundingBox]
    page_width_mm: float
    page_height_mm: float
GenerateDocumentResponse
class GenerateDocumentResponse(BaseModel):
    success: bool
    message: str
    documents: List[DocumentResult]
    total_documents: int
API Pipeline Flow
Detailed integration with pipeline stages:
# Simplified API flow (from api/main.py)
@app.post("/generate", response_model=GenerateDocumentResponse)
async def generate_documents(request: GenerateDocumentRequest):
    # 1. Download and encode seed images
    seed_images_base64 = await download_and_encode_images(request.seed_images)

    # 2. Build prompt from template
    prompt = build_prompt_from_template(
        seed_images=seed_images_base64,
        params=request.prompt_params
    )

    # 3. Call Claude API (synchronous, not batched)
    llm_response = await call_claude_api(
        prompt=prompt,
        model="claude-sonnet-4-5-20250929"
    )

    # 4. Extract HTML and GT (from pipeline_03)
    documents_html = extract_html_from_message(llm_response)
    documents_gt = extract_gt_from_html(documents_html)

    # 5. Validate HTML (from utils.py)
    validate_html_structure(documents_html)

    # 6. Render PDFs (from pipeline_04)
    pdfs = await render_pdfs_with_playwright(documents_html)

    # 7. Validate PDFs (from utils.py)
    validate_pdf_pages(pdfs)

    # 8. Extract bboxes (from pipeline_05)
    bboxes = extract_bboxes_from_pdfs(pdfs)

    # 9. Validate bboxes (from utils.py)
    validate_bbox_completeness(bboxes)

    # 10. Encode PDFs to base64
    pdfs_base64 = encode_pdfs_to_base64(pdfs)

    # 11. Build response
    return GenerateDocumentResponse(
        success=True,
        documents=[...],
        total_documents=len(documents_html)
    )
Integration with Pipeline Functions
Reused from Pipeline:
- extract_html_from_message() from pipeline_03
- extract_gt_from_html() from pipeline_03
- render_pdf_with_playwright() from pipeline_04
- extract_bboxes_from_pdf() from pipeline_05
- Various utilities from docgenie.generation.utils
API-Specific Functions (api/utils.py):
- download_seed_images(): Fetch images from URLs
- encode_images_to_base64(): Convert images for API transmission
- build_prompt_from_template(): Template-based prompt construction
- call_claude_api_sync(): Synchronous Claude API call (non-batched)
- encode_pdf_to_base64(): PDF encoding for response
- validate_html_structure(): HTML validation
- validate_pdf_pages(): PDF page count/size validation
- validate_bbox_completeness(): Ensure bboxes extracted
Configuration
Environment Variables (.env)
ANTHROPIC_API_KEY=sk-ant-... # Required for Claude API
LLM_MODEL=claude-sonnet-4-5-20250929 # Default model
API_PORT=8000 # Server port
DEBUG_MODE=false # Enable debug logging
Prompt Templates
Located in data/prompt_templates/:
Template Structure:
data/prompt_templates/<template_name>/
├── system_prompt.txt          # System message
├── user_prompt_template.txt   # User message template
└── example_output.html        # Example for few-shot
Placeholder Substitution:
# In user_prompt_template.txt
"""
Please generate {num_solutions} {doc_type} documents in {language}.
Ground truth format: {gt_format}
Ground truth type: {gt_type}
"""
# Substituted with request.prompt_params
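A minimal sketch of the substitution step (the helper name follows build_prompt_from_template from api/utils.py; the actual implementation may differ, e.g. in how it escapes literal braces inside gt_format):
from pathlib import Path

def build_prompt_from_template(template_dir: Path, params: dict) -> str:
    # Fill {language}, {doc_type}, {gt_type}, {gt_format}, {num_solutions}
    # from the request's prompt parameters
    template = (template_dir / "user_prompt_template.txt").read_text()
    return template.format(**params)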
CORS Configuration
# In api/main.py
app.add_middleware(
CORSMiddleware,
allow_origins=["*"], # Open for development
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
Authentication & Security
Current State:
- No built-in authentication: API is open (for development)
- API Key Management: Claude API key in .env, not passed in requests
- CORS: Open (allow_origins=["*"]) for development
Production Recommendations:
- Add API key authentication (e.g., Bearer tokens)
- Restrict CORS origins to known frontends
- Rate limiting (e.g., Redis-based)
- Input sanitization for HTML injection prevention
- HTTPS only (terminate SSL at reverse proxy)
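For the first recommendation, a minimal sketch of a Bearer-token check implemented as a FastAPI dependency (not part of the current codebase; DOCGENIE_API_KEY is a hypothetical variable):
import os

from fastapi import Depends, HTTPException, Security
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

bearer_scheme = HTTPBearer()

def verify_api_key(credentials: HTTPAuthorizationCredentials = Security(bearer_scheme)):
    # DOCGENIE_API_KEY is hypothetical; compare against your secret store in practice
    if credentials.credentials != os.environ.get("DOCGENIE_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")

# Attached to the existing endpoint in api/main.py
@app.post("/generate", dependencies=[Depends(verify_api_key)])
async def generate_documents(request: GenerateDocumentRequest):
    ...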
Performance Considerations
Async Rendering:
- Playwright rendering uses async/await
- Concurrent requests supported by FastAPI
- Semaphore control prevents resource exhaustion
Image Size Limits:
- Seed images compressed before API transmission
- Max dimension: 1024px (configurable)
- JPEG quality: 85% (configurable)
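A sketch of the compression step using PIL (the function name is hypothetical; the 1024px and 85% values match the configurable defaults above):
import base64
import io

from PIL import Image

def compress_seed_image(data: bytes, max_dim: int = 1024, quality: int = 85) -> str:
    img = Image.open(io.BytesIO(data)).convert("RGB")
    img.thumbnail((max_dim, max_dim))  # downscale in place, preserving aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("ascii")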
Timeouts:
- PDF render timeout: 60 seconds per document
- Claude API timeout: 120 seconds
- Total request timeout: 300 seconds (5 minutes)
Retry Logic:
- PDF rendering: Up to 3 attempts
- Claude API: Up to 2 retries on network errors
- No retry on validation failures
Concurrency:
- FastAPI default: Multiple workers (configurable)
- Playwright: Semaphore-controlled (10 concurrent renders)
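The semaphore pattern, sketched (render_pdf_with_playwright stands in for the pipeline_04 renderer; the exact wiring is assumed):
import asyncio

CHROMIUM_SEMAPHORE = asyncio.Semaphore(10)  # matches CHROMIUM_CONCURRENCY

async def render_pdf_bounded(html: str) -> bytes:
    # At most 10 Chromium renders run concurrently, regardless of request volume
    async with CHROMIUM_SEMAPHORE:
        return await render_pdf_with_playwright(html)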
Limitations
Current Limitations:
- Single-Page Only: Multi-page PDFs flagged as errors
- Seed Image Limit: Maximum 10 seed images per request
- Document Limit: Maximum 5 document variations (num_solutions)
- No Handwriting/Visual Elements: Stages 07-13 not integrated (API stops at stage 06)
- Synchronous LLM: No batching (higher cost per document)
- No Dataset Export: No /export-dataset endpoint for full pipeline runs
Known Issues:
- Large documents (>10 pages worth of content) may timeout
- Complex CSS (animations, 3D transforms) may not render correctly
- Some Unicode characters may not display in PDFs
Future Integration Plan
From api/PIPELINE_INTEGRATION.md:
Stage 3: Handwriting & Visual Elements (Stages 07-11)
New Request Parameters:
class PromptParameters(BaseModel):
# ... existing ...
enable_handwriting: bool = False
handwriting_ratio: float = Field(default=0.2, ge=0.0, le=1.0)
enable_visual_elements: bool = False
visual_element_types: List[str] = ["stamp", "logo"]
New Response Fields:
class DocumentResult(BaseModel):
# ... existing ...
handwriting_regions: Optional[List[HandwritingRegion]]
visual_elements: Optional[List[VisualElement]]
Impact:
- Longer processing time (diffusion model: ~5s per handwriting region)
- Larger response size (additional images)
Stage 4: Image Finalization & OCR (Stages 12-15)
New Response Fields:
class DocumentResult(BaseModel):
# ... existing ...
image_base64: str # Final rendered image (PNG)
ocr_text: str # Full OCR text
ocr_confidence: float # Average OCR confidence
Impact:
- OCR API costs (~$1.50 per 1000 images)
- Additional 2-3 seconds per document (OCR latency)
Stage 5: Dataset Packaging (Stages 16-19)
New Endpoint:
@app.post("/export-dataset")
async def export_dataset(request: ExportDatasetRequest):
"""
Run full pipeline (stages 01-19) and return packaged dataset.
"""
# Run pipeline with syndatadef
# Return ZIP with images, GTs, statistics
Request:
class ExportDatasetRequest(BaseModel):
syndatadef_config: dict # Full SynDatasetDefinition
output_format: str = "huggingface" # "huggingface", "coco", "custom"
Response:
- ZIP archive with full dataset
- Includes dataset_log.json from stage 18
Example Usage
Python Client (api/example_usage.py)
import requests
import base64
import json
from pathlib import Path
# API endpoint
API_URL = "http://localhost:8000/generate"
# Prepare request
request_data = {
"seed_images": [
"https://example.com/invoice_seed.jpg",
"https://example.com/receipt_seed.jpg"
],
"prompt_params": {
"language": "English",
"doc_type": "invoices and receipts",
"gt_type": "Multiple questions and their answers",
"gt_format": '{"question": "answer", ...}',
"num_solutions": 3
}
}
# Call API
response = requests.post(API_URL, json=request_data)
result = response.json()
# Process results
if result["success"]:
for doc in result["documents"]:
doc_id = doc["document_id"]
# Save PDF
pdf_data = base64.b64decode(doc["pdf_base64"])
Path(f"{doc_id}.pdf").write_bytes(pdf_data)
# Save GT
Path(f"{doc_id}_gt.json").write_text(
json.dumps(doc["ground_truth"], indent=2)
)
# Save HTML
Path(f"{doc_id}.html").write_text(doc["html"])
print(f"Saved: {doc_id}")
print(f" - BBoxes: {len(doc['bboxes'])}")
print(f" - GT Annotations: {len(doc['ground_truth'])}")
cURL Example
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"seed_images": [
"https://example.com/seed.jpg"
],
"prompt_params": {
"language": "English",
"doc_type": "business documents",
"gt_type": "Questions and answers",
"gt_format": "{\"question\": \"answer\"}",
"num_solutions": 1
}
}'
JavaScript/TypeScript Client
const response = await fetch('http://localhost:8000/generate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
seed_images: ['https://example.com/seed.jpg'],
prompt_params: {
language: 'English',
doc_type: 'invoices',
gt_type: 'Questions and answers',
gt_format: '{"question": "answer"}',
num_solutions: 2
}
})
});
const result = await response.json();
// Decode PDF
const pdfBlob = new Blob(
[Uint8Array.from(atob(result.documents[0].pdf_base64), c => c.charCodeAt(0))],
{ type: 'application/pdf' }
);
// Download PDF
const url = URL.createObjectURL(pdfBlob);
const a = document.createElement('a');
a.href = url;
a.download = `${result.documents[0].document_id}.pdf`;
a.click();
Testing
Test File: api/test_api.py
import pytest
from fastapi.testclient import TestClient
from api.main import app
client = TestClient(app)
def test_health_check():
response = client.get("/health")
assert response.status_code == 200
assert response.json()["status"] == "healthy"
def test_generate_documents():
request_data = {
"seed_images": ["https://example.com/seed.jpg"],
"prompt_params": {
"language": "English",
"doc_type": "invoices",
"gt_type": "Questions",
"gt_format": "{\"q\": \"a\"}",
"num_solutions": 1
}
}
response = client.post("/generate", json=request_data)
assert response.status_code == 200
result = response.json()
assert result["success"] is True
assert len(result["documents"]) > 0
def test_invalid_request():
request_data = {
"seed_images": [], # Invalid: empty
"prompt_params": {"num_solutions": 10} # Invalid: > 5
}
response = client.post("/generate", json=request_data)
assert response.status_code == 422 # Validation error
Run Tests:
cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie
pytest api/test_api.py -v
Deployment
Start Server:
cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie/api
chmod +x start.sh
./start.sh
start.sh Contents:
#!/bin/bash
export $(cat .env | xargs)
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Production Deployment:
# With Gunicorn (multi-worker)
gunicorn main:app \
--workers 4 \
--worker-class uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:8000 \
--timeout 300
Docker Deployment:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Install Playwright browsers
RUN playwright install chromium
RUN playwright install-deps
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Core Models & Utilities
SynDatasetDefinition Model
File: docgenie/generation/models/_syndatadef.py
Purpose: Configuration object for synthetic dataset generation, loaded from YAML files.
Key Attributes:
@dataclass
class SynDatasetDefinition:
# Dataset metadata
name: str # "docvqa_alpha=1.0"
task: TaskType # TaskType.QA, .KIE, .DLA, .CLS
base_dataset: str # "docvqa", "cord", etc.
# LLM configuration
llm_model: str # "claude-sonnet-4-20250514"
prompt_template: str # "DocGenie"
num_solutions: int # Documents per prompt
# Prompting parameters
language: str # "English", "German", etc.
doc_type: str # "business and administrative"
gt_type: str # Task-specific GT description
gt_format: str # Expected GT format
# Dataset parameters
num_samples: int # Total documents to generate
alpha: float # Clustering diversity parameter
# Seed selection
seeds_per_cluster: int # Seeds sampled per cluster
clustering_method: str # "kmeans", "agglomerative"
# Task-specific
valid_labels: Optional[List[str]] # For DLA: ["title", "text", ...]
# Output paths
output_dir: Path # Root output directory
Key Methods:
def get_file_structure(self) -> FileStructure:
"""Returns FileStructure manager for output directories."""
def build_prompt(self, seed_images: List[str]) -> str:
"""Builds prompt from template with parameter substitution."""
def iter_document_logs(self) -> Iterator[DocumentLog]:
"""Iterates over all document logs."""
def update_document_status(self, doc_id: str, status: Status):
"""Updates document status in log."""
@classmethod
def from_yaml(cls, yaml_path: Path) -> "SynDatasetDefinition":
"""Load configuration from YAML file."""
Example YAML:
name: "docvqa_alpha=1.0"
task: "qa"
base_dataset: "docvqa"
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1
language: "English"
doc_type: "business and administrative documents"
gt_type: "Multiple questions and their answers"
gt_format: '{"question": "answer", ...}'
num_samples: 1000
alpha: 1.0
seeds_per_cluster: 10
clustering_method: "kmeans"
output_dir: "data/datasets/synthesized_datasets/docvqa_alpha=1.0"
PipelineParameters Model
File: docgenie/generation/models/_pipeline.py
Purpose: Runtime parameters for pipeline execution.
Attributes:
@dataclass
class PipelineParameters:
# Execution control
start_stage: int = 1 # First stage to execute
end_stage: int = 19 # Last stage to execute
skip_existing: bool = True # Skip documents with existing outputs
# Parallelization
max_workers: int = 10 # Concurrent processing
chromium_concurrency: int = 10 # Parallel PDF renders
# Debug mode
debug_mode: bool = False # Enable debug visualizations
# Retry configuration
max_retries: int = 3 # Retry attempts
retry_delay: float = 2.0 # Seconds between retries
# Timeouts
pdf_render_timeout: int = 60 # Seconds
ocr_timeout: int = 30 # Seconds
FileStructure Model
File: docgenie/generation/models/_file.py
Purpose: Manages directory structure for generated data.
Key Properties:
@dataclass
class FileStructure:
root: Path # Root output directory
# Directory properties (all return Path)
seeds_directory: Path
prompt_batches_directory: Path
message_results_directory: Path
raw_html_directory: Path
raw_annotations_directory: Path
geometries_directory: Path
pdf_initial_directory: Path
render_html_directory: Path
pdf_word_bboxes_directory: Path
pdf_char_bboxes_directory: Path
layout_element_definitions_directory: Path
handwriting_definitions_directory: Path
visual_element_definitions_directory: Path
handwriting_images_directory: Path
visual_element_images_directory: Path
pdf_without_handwriting_placeholder_directory: Path
pdf_with_handwriting_directory: Path
pdf_final_directory: Path
images_directory: Path
final_word_bboxes_directory: Path
final_segment_bboxes_directory: Path
normalized_word_bboxes_directory: Path
normalized_segment_bboxes_directory: Path
verified_gt_directory: Path
# Debug subdirectories
debug_directory: Path
debug_pdf_bboxes_directory: Path
debug_visual_element_bboxes_directory: Path
debug_handwriting_insertion_directory: Path
debug_final_bboxes_on_images_directory: Path
debug_html_with_debug_directory: Path
Key Methods:
def create_all_directories(self):
"""Create all output directories."""
def get_document_path(self, doc_id: str, stage: str) -> Path:
"""Get path for specific document and stage."""
DocumentLog Model
File: docgenie/generation/models/_log.py
Purpose: Document-level metadata and status tracking.
Attributes:
@dataclass
class DocumentLog:
doc_id: str # Unique document ID
seed_image_id: str # Source seed image
prompt_call_id: str # Prompt batch/call ID
status: Status # VALID, INVALID, PROCESSING
# Stage completion flags
has_raw_html: bool = False
has_raw_gt: bool = False
has_pdf_initial: bool = False
has_geometries: bool = False
has_bboxes: bool = False
has_handwriting: bool = False
has_visual_elements: bool = False
has_final_image: bool = False
has_ocr_result: bool = False
has_verified_gt: bool = False
# Statistics
num_words: int = 0
num_annotations: int = 0
num_handwriting_regions: int = 0
num_visual_elements: int = 0
# Error tracking
error_stage: Optional[str] = None
error_message: Optional[str] = None
error_category: Optional[str] = None
# Timestamps
created_at: datetime
updated_at: datetime
Key Methods:
def mark_stage_complete(self, stage: str):
"""Mark pipeline stage as complete."""
def mark_error(self, stage: str, error_msg: str, category: str):
"""Record error and mark document as invalid."""
def is_valid(self) -> bool:
"""Check if document passed all stages."""
BBox Model
File: docgenie/generation/models/_bbox.py
Purpose: Bounding box representation with text content.
Attributes:
@dataclass
class BBox:
rect: Rect # {x0, y0, x1, y1}
text: str # Text content
metadata: Optional[dict] = None # Additional data
Key Methods:
@property
def width(self) -> float:
return self.rect["x1"] - self.rect["x0"]
@property
def height(self) -> float:
return self.rect["y1"] - self.rect["y0"]
def normalize(self, image_width: float, image_height: float) -> "BBox":
"""Convert to normalized [0, 1] coordinates."""
def unnormalize(self, image_width: float, image_height: float) -> "BBox":
"""Convert from normalized to pixel coordinates."""
def to_dict(self) -> dict:
"""Serialize to dictionary."""
@classmethod
def from_dict(cls, data: dict) -> "BBox":
"""Deserialize from dictionary."""
LayoutBBox Model
File: docgenie/generation/models/_bbox.py
Purpose: Layout element bounding box (for DLA tasks).
Attributes:
@dataclass
class LayoutBBox:
label: str # "title", "text", "table", etc.
rect: Rect # {x0, y0, x1, y1}
metadata: Optional[dict] = None
Key Methods:
def normalize(self, image_width: float, image_height: float) -> "LayoutBBox":
"""Normalize coordinates."""
def contains(self, other: "LayoutBBox") -> bool:
"""Check if this bbox fully contains another."""
def overlaps(self, other: "LayoutBBox") -> bool:
"""Check if this bbox overlaps with another."""
def overlap_area(self, other: "LayoutBBox") -> float:
"""Calculate overlap area with another bbox."""
Utility Modules
BBox Utilities (utils/bboxes.py)
def load_bboxes_from_file(file_path: Path) -> List[BBox]:
"""Load bboxes from JSON file."""
def save_bboxes_to_file(bboxes: List[BBox], file_path: Path):
"""Save bboxes to JSON file."""
def visualize_bboxes_on_pdf(
pdf_path: Path,
bboxes: List[BBox],
output_path: Path,
color: str = "red"
):
"""Draw bbox overlays on PDF."""
def visualize_bboxes_on_image(
image_path: Path,
bboxes: List[BBox],
output_path: Path,
color: str = "orange"
):
"""Draw bbox overlays on image."""
def check_bbox_containment(bbox1: BBox, bbox2: BBox) -> bool:
"""Check if bbox1 contains bbox2."""
Geometry Utilities (utils/geos.py)
def filter_layout_elements(geometries: dict) -> List[dict]:
"""Extract elements with data-label attribute."""
def filter_handwriting_elements(geometries: dict) -> List[dict]:
"""Extract elements with data-handwriting attribute."""
def filter_visual_elements(geometries: dict) -> List[dict]:
"""Extract elements with data-visual-element attribute."""
def filter_by_css_class(geometries: dict, class_name: str) -> List[dict]:
"""Extract elements with specific CSS class."""
Serialization Utilities (utils/serialization.py)
def encode_image_to_base64(image_path: Path) -> str:
"""Encode image file to base64 string."""
def decode_base64_to_image(base64_str: str, output_path: Path):
"""Decode base64 string and save as image."""
def serialize_dataclass(obj: Any) -> dict:
"""Serialize dataclass to dictionary."""
def deserialize_dataclass(data: dict, cls: Type[T]) -> T:
"""Deserialize dictionary to dataclass."""
Configuration & Constants
Key Constants (generation/constants.py)
# Document processing
HTML_PARSER = "html.parser" # BeautifulSoup parser
PDF_POINT_SCALING = 72 / 96 # PDF points per CSS pixel (72 pt/in vs. 96 px/in)
# Bounding boxes
BBOX_OVERLAP_THRESHOLD = 0.05 # 5% overlap tolerance
SPATIAL_MATCH_THRESHOLD = 10.0 # Pixel tolerance for matching
# PDF rendering
CHROMIUM_CONCURRENCY = 10 # Parallel renders
PER_PDF_RENDER_TIMEOUT = 60 # Seconds
PER_PDF_RENDER_MAX_RETRIES = 3 # Retry attempts
# Handwriting
MAX_HANDWRITING_CHARS = 7 # Max chars per diffusion generation
HANDWRITING_HEIGHT_PX = 40 # Image height
HANDWRITING_PADDING_PX = 0 # Horizontal padding
DIFFUSION_NUM_INFERENCE_STEPS = 50 # Generation quality
HANDWRITING_IMAGE_UPSCALE_FACTOR = 3 # Insertion scaling
MAX_HANDWRITING_RAND_X = 2 # Random X offset (pixels)
MAX_HANDWRITING_RAND_Y = 1 # Random Y offset (pixels)
# Visual elements
VISUAL_ELEMENT_UPSCALE_FACTOR = 3 # Insertion scaling
STAMP_BORDER_WIDTH = 2 # Stamp border thickness
BARCODE_DPI = 300 # Barcode image quality
# OCR
IMAGE_DPI = 200 # Final image DPI
OCR_CONFIDENCE_THRESHOLD = 0.8 # Min confidence
# Ground truth
FUZZY_MATCH_THRESHOLD = 0.85 # Levenshtein similarity cutoff
# Handwriting styles (writer_0 through writer_99)
HANDWRITING_STYLES = [f"writer_{i}" for i in range(100)]
# Visual element types
VISUAL_ELEMENT_TYPES = [
"stamp", "logo", "barcode", "chart", "photo"
]
# Visual element type mapping (LLM output → standard)
VISUAL_ELEMENT_TYPE_MAPPING = {
"stamp": "stamp",
"company_stamp": "stamp",
"approval_stamp": "stamp",
"logo": "logo",
"company_logo": "logo",
"brand_logo": "logo",
"barcode": "barcode",
"code128": "barcode",
"chart": "chart",
"graph": "chart",
"figure": "chart",
"photo": "photo",
"image": "photo",
"picture": "photo"
}
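Two of these constants in action (using the definitions above; CSS lays out at 96 px/inch while PDF uses 72 pt/inch, hence the 72/96 factor):
width_px = 96.0
width_pt = width_px * PDF_POINT_SCALING  # 96 * (72 / 96) = 72.0 points

# Normalizing an LLM-emitted element type via the mapping
element_type = VISUAL_ELEMENT_TYPE_MAPPING.get("company_logo")  # -> "logo"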
Environment Variables
Required:
ANTHROPIC_API_KEY=sk-ant-... # Claude API key
MICROSOFT_AZURE_OCR_KEY=... # Azure OCR key
MICROSOFT_AZURE_OCR_ENDPOINT=... # Azure OCR endpoint
Optional:
LLM_MODEL=claude-sonnet-4-20250514 # Override model
DEBUG_MODE=false # Enable debug output
LOG_LEVEL=INFO # Logging verbosity
CHROMIUM_CONCURRENCY=10 # Override concurrency
Error Handling & Debugging
Error Categories
Comprehensive error tracking in stage 18:
| Category | Description | Resolution |
|---|---|---|
| multipage_pdf | PDF rendered with >1 page | Check HTML content size, CSS page breaks |
| missing_ocr_result | OCR API failed or empty | Check Azure credentials, retry |
| failed_gt_verification | GT text not found in OCR | Review fuzzy match threshold, inspect HTML |
| rendering_timeout | PDF render exceeded timeout | Increase timeout, simplify HTML |
| llm_parsing_error | Failed to extract HTML/GT | Review LLM response format, update regex |
| bbox_extraction_failed | PyMuPDF extraction error | Check PDF validity, inspect fonts |
| handwriting_generation_failed | Diffusion model error | Check model checkpoint, GPU availability |
| visual_element_generation_failed | VE creation error | Check prefab directories, image validity |
Debug Visualizations (Stage 19)
Generated debug outputs:
- PDF BBoxes: Verify PyMuPDF extraction quality
- Visual Element BBoxes: Verify VE extraction and positioning
- Handwriting Insertion: Verify handwriting placement
- Final BBoxes on Images: Verify OCR accuracy
Enable debug mode:
# In pipeline execution
pipeline_params = PipelineParameters(
debug_mode=True,
# ... other params
)
Logging
Log locations:
data/datasets/synthesized_datasets/<dataset_name>/logs/
├── pipeline_01/
│   └── seed_selection.log
├── pipeline_02/
│   └── prompting.log
├── pipeline_03/
│   └── message_processing/
│       ├── batch_001.log
│       └── batch_002.log
├── ...
└── pipeline_19/
    └── debug_generation.log
Log format:
[2026-02-07 14:32:15] [INFO] [pipeline_04] Starting PDF rendering for doc_0001
[2026-02-07 14:32:18] [INFO] [pipeline_04] Successfully rendered doc_0001 (3.2s)
[2026-02-07 14:32:18] [ERROR] [pipeline_04] Rendering failed for doc_0002: Timeout
Common Issues & Solutions
Issue: Multi-page PDFs
- Cause: HTML content exceeds page size
- Solution: Reduce content, check for long tables, use CSS overflow: hidden
Issue: Handwriting not appearing
- Cause: Character bboxes not available (stage 05)
- Solution: Use simpler fonts in HTML, ensure PyMuPDF can extract chars
Issue: OCR missing text
- Cause: Low image DPI, poor contrast
- Solution: Increase IMAGE_DPI (stage 14), adjust HTML styling
Issue: GT verification failures
- Cause: OCR discrepancies, fuzzy match threshold too high
- Solution: Lower FUZZY_MATCH_THRESHOLD, improve HTML text rendering
Issue: Claude API timeout
- Cause: Large seed images, complex prompts
- Solution: Compress seed images, simplify prompt template
Usage Examples
Full Pipeline Execution
import importlib
import pkgutil

import docgenie.generation as generation_pkg
from docgenie.generation.models import SynDatasetDefinition, PipelineParameters
# Load configuration
syndatadef = SynDatasetDefinition.from_yaml(
"data/syn_dataset_definitions/docvqa_alpha=1.0.yaml"
)
# Configure pipeline
params = PipelineParameters(
start_stage=1,
end_stage=19,
skip_existing=True,
debug_mode=False,
max_workers=10
)
# Execute pipeline stages
for stage in range(params.start_stage, params.end_stage + 1):
print(f"Executing stage {stage:02d}...")
    # Import and run the stage module; the numeric prefix is resolved
    # via pkgutil here (lookup approach assumed, since module names
    # carry descriptive suffixes like pipeline_01_select_seeds)
    prefix = f"pipeline_{stage:02d}_"
    module_name = next(
        name for _, name, _ in pkgutil.iter_modules(generation_pkg.__path__)
        if name.startswith(prefix)
    )
    stage_module = importlib.import_module(f"docgenie.generation.{module_name}")
stage_module.main(syndatadef, params)
print(f"Stage {stage:02d} complete.")
print("Pipeline execution complete!")
API Usage
See API Implementation section for detailed examples.
Custom Dataset Definition
# data/syn_dataset_definitions/custom_invoices.yaml
name: "custom_invoices_v1"
task: "kie"
base_dataset: "cord"
llm_model: "claude-sonnet-4-20250514"
prompt_template: "ClaudeRefined12"
num_solutions: 1
language: "English"
doc_type: "invoices and receipts"
gt_type: "Key-value pairs (entity extraction)"
gt_format: '{"entity_name": "entity_value", ...}'
num_samples: 500
alpha: 0.75
seeds_per_cluster: 5
clustering_method: "kmeans"
# KIE-specific
valid_labels: null # Not used for KIE
output_dir: "data/datasets/synthesized_datasets/custom_invoices_v1"
Conclusion
This documentation provides a comprehensive overview of the DocGenie generation pipeline and API. For additional support or questions, refer to:
- Pipeline Integration Guide: api/PIPELINE_INTEGRATION.md
- API README: api/README.md
- Source Code: docgenie/generation/ and api/
Key Takeaways:
- 19-Stage Pipeline: Modular design from seed selection to GT verification
- Multi-Task Support: QA, KIE, DLA, CLS with task-specific handling
- Realistic Documents: LLM-generated content, diffusion handwriting, visual elements
- Quality Assurance: Comprehensive validation, OCR verification, error tracking
- API Integration: FastAPI service for synchronous document generation (stages 01-06)
- Extensibility: Modular code, clear interfaces, easy to extend
Pipeline Strengths:
- Task-agnostic core with task-specific adapters
- Extensive logging and error tracking
- Parallel processing where applicable
- Debug visualizations at every stage
- Integration with state-of-the-art models
Future Directions:
- Full API integration (stages 07-19)
- Additional document types (forms, legal, medical)
- Multi-language support expansion
- Enhanced visual element generation (charts, diagrams)
- Real-time generation optimization