# DocGenie Generation Pipeline & API Documentation

**Version:** 1.0
**Last Updated:** February 7, 2026
**Purpose:** Comprehensive reference for the DocGenie synthetic document generation system

---

## Table of Contents

1. [Overview](#overview)
2. [Pipeline Architecture](#pipeline-architecture)
3. [Pipeline Stages (01-19)](#pipeline-stages-01-19)
4. [API Implementation](#api-implementation)
5. [Core Models & Utilities](#core-models--utilities)
6. [Configuration & Constants](#configuration--constants)
7. [Usage Examples](#usage-examples)
8. [Error Handling & Debugging](#error-handling--debugging)

---

## Overview

DocGenie is a 19-stage pipeline for generating synthetic document datasets with ground-truth annotations. It supports multiple document understanding tasks:

- **Document Question Answering (QA)**
- **Key Information Extraction (KIE)**
- **Document Layout Analysis (DLA)**
- **Document Classification (CLS)**

### Key Features

- **LLM-Powered Generation**: Uses Claude, Gemini, or open-source models to generate diverse document content
- **Realistic Handwriting**: Diffusion-model-based handwriting synthesis with author-specific styles
- **Visual Element Integration**: Stamps, logos, barcodes, charts, and photos
- **Multi-Task Support**: Task-specific ground-truth formatting and validation
- **Quality Assurance**: Comprehensive validation, OCR verification, and error tracking
- **Modular Design**: Each pipeline stage is independently executable with clear inputs and outputs

### Technology Stack

- **LLM APIs**: Claude (Anthropic), Gemini, DeepSeek, Qwen
- **PDF Rendering**: Playwright (Chromium), PyMuPDF
- **OCR**: Microsoft Azure OCR
- **Handwriting**: Custom diffusion model
- **Image Processing**: PIL, OpenCV
- **API Framework**: FastAPI
- **Data Processing**: Pandas, NumPy

---

## Pipeline Architecture

### High-Level Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                      DOCGENIE GENERATION PIPELINE                       │
└─────────────────────────────────────────────────────────────────────────┘

┌──────────────────────┐
│ PHASE 1: SELECTION   │
└──────────────────────┘
  ↓
[01] Select Seeds ────────────► seeds.csv, clusters.csv
     (Cluster-based diverse seed selection)

┌──────────────────────┐
│ PHASE 2: LLM GEN     │
└──────────────────────┘
  ↓
[02] Prompt LLM ──────────────► batch_results/ (JSON)
  │  (Claude API batched calls)
  ↓
[03] Process Response ────────► raw_html/, raw_annotations/
     (Extract HTML & GT from responses)

┌──────────────────────┐
│ PHASE 3: RENDERING   │
└──────────────────────┘
  ↓
[04] Render PDF Initial ──────► pdf_initial/, geometries/
  │  (HTML→PDF with geometry extraction)
  ↓
[05] Extract BBoxes ──────────► pdf_word_bboxes/, pdf_char_bboxes/
  │  (PyMuPDF text extraction)
  ↓
[06] Extract Layout ──────────► layout_element_definitions/
     (DLA/KIE-specific annotations)

┌──────────────────────┐
│ PHASE 4: EXTRACTION  │
└──────────────────────┘
  ↓
[07] Extract Handwriting ─────► handwriting_definitions/
  │  (Identify handwriting regions)
  ↓
[08] Extract Visual Elements ─► visual_element_definitions/
     (Stamp/logo/barcode placeholders)

┌──────────────────────┐
│ PHASE 5: GENERATION  │
└──────────────────────┘
  ↓
[09] Create Handwriting ──────► handwriting_images/
  │  (Diffusion model generation)
  ↓
[10] Create Visual Elements ──► visual_element_images/
     (Generate/select stamps, logos, etc.)

┌──────────────────────┐
│ PHASE 6: COMPOSITION │
└──────────────────────┘
  ↓
[11] Render PDF (2nd Pass) ───► pdf_without_handwriting_placeholder/
  │  (Remove handwriting placeholders)
  ↓
[12] Insert Handwriting ──────► pdf_with_handwriting/
  │  (Overlay handwriting images)
  ↓
[13] Insert Visual Elements ──► pdf_final/
  │  (Overlay stamps, logos, etc.)
  ↓
[14] Render Image ────────────► images/
     (PDF→PNG conversion)

┌──────────────────────┐
│ PHASE 7: FINALIZATION│
└──────────────────────┘
  ↓
[15] Perform OCR ─────────────► final_word_bboxes/, final_segment_bboxes/
  │  (Microsoft OCR)
  ↓
[16] Normalize BBoxes ────────► normalized_word_bboxes/, normalized_segment_bboxes/
     (Pixel→[0,1] coordinates)

┌──────────────────────┐
│ PHASE 8: VALIDATION  │
└──────────────────────┘
  ↓
[17] GT Preparation ──────────► verified_gt/
  │  (Fuzzy matching, BIO tagging)
  ↓
[18] Analyze ─────────────────► dataset_log.json
  │  (Statistics, cost analysis)
  ↓
[19] Create Debug Data ───────► debug/ subdirectories
     (Visualizations for inspection)
```

### Data Flow Between Stages

```
Seed Images ──┐
              ├──► [02] ──► HTML + GT ──► [04] ──► PDF + Geometries
Prompt Params ┘                                      │
                                                     ├──► [05] ──► BBoxes
                                                     │
                                                     ├──► [07] ──► HW Defs ──► [09] ──► HW Images ──┐
                                                     │                                              │
                                                     ├──► [08] ──► VE Defs ──► [10] ──► VE Images ──┤
                                                     │                                              │
                                                     └──► [11] ──► PDF (no HW) ──┬──► [12] ◄────────┤
                                                                                 │   (Insert HW)    │
                                                                                 └──► [13] ◄────────┘
                                                                                      (Insert VE)
                                                                                        ↓
                                                                                 [14] ──► Image
                                                                                        ↓
                                                                                 [15] ──► OCR BBoxes
                                                                                        ↓
                                                                                 [16] ──► Normalized
                                                                                        ↓
                                                                                 [17] ──► Verified GT
```

---

## Pipeline Stages (01-19)

### Stage 01: Select Seeds

**File:** `pipeline_01_select_seeds.py`

**Purpose:** Select diverse seed images from the base dataset using clustering algorithms to ensure variety in the generated documents.

**Key Functions:**

- `main()`: Orchestrates the seed selection process
- `downscale_and_compress_seeds()`: Prepares seed images for efficient API transmission
- `plot_class_distribution()`: Visualizes class balance in the selected seeds
- `visualize_cluster_histogram()`: Shows the distribution across clusters

**Process:**

1. Load embeddings from the base dataset
2. Perform clustering (KMeans or other algorithms)
3. Sample N seeds per cluster
4. Downscale and compress images (JPEG, max dimension)
5. Save the seed manifest and cluster assignments

**Inputs:**

- `SynDatasetDefinition` configuration
- Base dataset name (e.g., `docvqa`, `cord`, `publaynet`)
- Clustering parameters from constants

**Outputs:**

```
seeds.csv      # Selected seed document IDs per prompt call
clusters.csv   # Cluster assignments for all documents
seeds/         # Directory of preprocessed seed images (JPEG, compressed)
```

**Configuration Parameters:**

- `EMBEDDING_MODEL`: Specifies which embedding model was used
- `IMAGE_MAX_DIMENSION`: Maximum width/height for compression
- `JPEG_QUALITY`: Compression quality (0-100)

**Example Usage:**

```python
from docgenie.generation import pipeline_01_select_seeds

pipeline_01_select_seeds.main(
    syndatadef_path="data/syn_dataset_definitions/docvqa_alpha=1.0.yaml",
    base_dataset="docvqa",
)
```

---

### Stage 02: Prompt LLM

**File:** `pipeline_02_prompt_llm.py`

**Purpose:** Send batched prompts with seed images to LLM APIs (Claude, Gemini, DeepSeek, Qwen) to generate document HTML and ground truth.

**Key Functions:**

- `main()`: Main orchestrator for LLM prompting
- `create_batched_messages()`: Constructs API-compatible message batches
- `track_batch_completion()`: Polls the API for batch status
- Cost calculation utilities in `pipeline_01/cost.py`

**Process:**

1. Load seed images and encode them as base64
2. Build prompts from the template with parameter injection
3. Create batched API requests (Claude Batch API for cost efficiency)
4. Submit batches and track completion
5. Save results for processing in stage 03

**Inputs:**

- Prompt template from `data/prompt_templates//`
- Seed images from stage 01
- API credentials from environment variables
- `SynDatasetDefinition` parameters

**Outputs:**

```
prompt_batches/    # Batch metadata (batch IDs, status)
message_results/   # JSON response files per batch
logs/              # Prompting logs and progress
```

**API Configuration:**

- **Claude:** Uses the Batch API with prompt caching for cost efficiency
- **Batch Size:** Configurable via the `BATCH_SIZE` constant
- **Polling Interval:** Configurable wait time between status checks
- **Model Selection:** Specified in `SynDatasetDefinition.llm_model`

**Cost Tracking:**

- Input/output token counts per request
- Cached token usage (for Claude)
- Total cost estimation per batch

**Example Configuration:**

```yaml
# In syn_dataset_definition YAML
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1  # Documents per prompt
language: "English"
doc_type: "business and administrative"
```

---

### Stage 03: Process Response

**File:** `pipeline_03_process_response.py`

**Purpose:** Extract and validate HTML documents and ground-truth annotations from LLM responses.

**Key Functions:**

- `main()`: Main processor
- `extract_html_from_message()`: Regex-based HTML extraction from markdown code blocks
- `extract_gt_from_html()`: Parse JSON ground truth from `
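The per-cluster sampling step of Stage 01 ("Sample N seeds per cluster") can be sketched in plain Python. This is a minimal illustration, not the actual `pipeline_01_select_seeds.py` implementation; the function name `sample_seeds_per_cluster` and the dict-of-labels input format are assumptions for the example.

```python
import random
from collections import defaultdict

def sample_seeds_per_cluster(cluster_labels, n_per_cluster, seed=0):
    """Hypothetical sketch: group document IDs by cluster label and
    sample up to n_per_cluster IDs from each cluster, so every region
    of the embedding space contributes seeds."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    clusters = defaultdict(list)
    for doc_id, label in cluster_labels.items():
        clusters[label].append(doc_id)
    selected = []
    for label in sorted(clusters):          # deterministic cluster order
        members = sorted(clusters[label])
        rng.shuffle(members)
        selected.extend(members[:n_per_cluster])
    return selected

labels = {"doc_a": 0, "doc_b": 0, "doc_c": 1, "doc_d": 1, "doc_e": 2}
print(sample_seeds_per_cluster(labels, n_per_cluster=1))  # one seed per cluster
```

Sampling per cluster rather than uniformly at random is what gives the "cluster-based diverse seed selection" property: rare document layouts still appear among the seeds.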
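Stage 01's `IMAGE_MAX_DIMENSION` compression step boils down to aspect-ratio-preserving scaling. The helper below, `downscale_size`, is a hypothetical sketch of that arithmetic (the real `downscale_and_compress_seeds()` would also re-encode the pixels as JPEG at `JPEG_QUALITY`):

```python
def downscale_size(width, height, max_dimension):
    """Return (new_width, new_height) such that the longer side equals
    max_dimension, preserving aspect ratio. Images already within the
    limit are left untouched (no upscaling)."""
    longest = max(width, height)
    if longest <= max_dimension:
        return width, height
    scale = max_dimension / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

# An A4 page scanned at ~300 DPI, shrunk for API transmission:
print(downscale_size(2480, 3508, 1024))
```

Capping the longer side keeps the base64 payload sent in Stage 02 small and predictable regardless of the scan resolution of the source dataset.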
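For Stage 02, one batched request entry might be assembled as below. The dict shape (a `custom_id` plus the `params` of a normal Messages request, with the seed image as a base64 content block) follows the Anthropic Message Batches API as documented publicly; the helper name `build_batch_request` and the defaults are this example's own assumptions, not DocGenie's `create_batched_messages()`.

```python
import base64

def build_batch_request(custom_id, prompt_text, jpeg_bytes,
                        model="claude-sonnet-4-20250514", max_tokens=8192):
    """Hypothetical sketch: build one entry for a Message Batches request,
    inlining the seed image as a base64 image content block."""
    image_block = {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/jpeg",
            "data": base64.standard_b64encode(jpeg_bytes).decode("ascii"),
        },
    }
    return {
        "custom_id": custom_id,  # lets stage 03 match results back to documents
        "params": {
            "model": model,
            "max_tokens": max_tokens,
            "messages": [
                {"role": "user",
                 "content": [image_block, {"type": "text", "text": prompt_text}]},
            ],
        },
    }

# Tiny stand-in payload; real code would read the JPEG produced by stage 01.
req = build_batch_request("doc-0001", "Generate an HTML document ...", b"\xff\xd8\xff")
print(req["custom_id"])
```

A list of such entries would then be submitted in one batch call, which is what makes per-document cost tracking via `custom_id` straightforward.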
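The cost-tracking bullet points of Stage 02 (input/output tokens plus cached tokens) reduce to simple arithmetic. This sketch assumes a usage dict shaped like API token-usage reports; the function name `estimate_cost` and all prices are hypothetical, passed in explicitly because per-token rates vary by model and change over time:

```python
def estimate_cost(usage, price_per_mtok):
    """Hypothetical sketch: estimate USD cost from token counts, given
    prices in USD per million tokens. Cache reads are billed at their
    own (much lower) rate when one is provided."""
    cost = (
        usage.get("input_tokens", 0) * price_per_mtok["input"]
        + usage.get("output_tokens", 0) * price_per_mtok["output"]
        + usage.get("cache_read_input_tokens", 0)
          * price_per_mtok.get("cache_read", price_per_mtok["input"])
    ) / 1_000_000
    return round(cost, 6)

usage = {"input_tokens": 120_000, "output_tokens": 30_000,
         "cache_read_input_tokens": 500_000}
prices = {"input": 3.0, "output": 15.0, "cache_read": 0.3}  # illustrative rates only
print(estimate_cost(usage, prices))
```

Summing these per-request estimates per batch yields the "total cost estimation per batch" that Stage 18 later aggregates into `dataset_log.json`.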
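Stage 03's regex-based HTML extraction can be illustrated as follows. This is a sketch of the idea behind `extract_html_from_message()`, not its actual code; the fence string is built with `"`" * 3` purely so the example nests cleanly inside this document.

```python
import re

FENCE = "`" * 3  # a literal markdown code fence
HTML_BLOCK_RE = re.compile(FENCE + r"html\s*\n(.*?)\n" + FENCE, re.DOTALL)

def extract_html_from_message(message_text):
    """Return the body of the first fenced html code block in an LLM
    response, or None if the response contains no such block."""
    m = HTML_BLOCK_RE.search(message_text)
    return m.group(1) if m else None

# Simulated LLM response wrapping the document in a fenced block:
response = (
    "Here is the generated document:\n\n"
    + FENCE + "html\n"
    + "<!DOCTYPE html>\n<html><body><h1>Invoice</h1></body></html>\n"
    + FENCE + "\n"
)
html = extract_html_from_message(response)
```

The non-greedy `(.*?)` with `re.DOTALL` stops at the first closing fence, so explanatory prose the model appends after the block is not swallowed into the HTML.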
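Stage 16's "Pixel→[0,1] coordinates" conversion is a straightforward divide-by-page-size, sketched below. The function name `normalize_bbox` and the clamping behavior are assumptions for illustration; the pipeline's actual normalization code may differ in detail.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Hypothetical sketch: convert a pixel-space (x0, y0, x1, y1) box to
    [0, 1] coordinates, clamping so OCR boxes that slightly overhang the
    page edge remain valid."""
    x0, y0, x1, y1 = bbox

    def clamp(v):
        return min(max(v, 0.0), 1.0)

    return (
        clamp(x0 / page_width),
        clamp(y0 / page_height),
        clamp(x1 / page_width),
        clamp(y1 / page_height),
    )

# A word box on a 1240x1754 px page render:
print(normalize_bbox((124, 248, 620, 310), page_width=1240, page_height=1754))
```

Normalized coordinates make `normalized_word_bboxes/` independent of render resolution, so the same ground truth remains valid if the PNGs in `images/` are regenerated at a different DPI.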