# DocGenie Generation Pipeline & API Documentation

**Version:** 1.0
**Last Updated:** February 7, 2026
**Purpose:** Comprehensive reference for the DocGenie synthetic document generation system

---

## Table of Contents

1. [Overview](#overview)
2. [Pipeline Architecture](#pipeline-architecture)
3. [Pipeline Stages (01-19)](#pipeline-stages-01-19)
4. [API Implementation](#api-implementation)
5. [Core Models & Utilities](#core-models--utilities)
6. [Configuration & Constants](#configuration--constants)
7. [Usage Examples](#usage-examples)
8. [Error Handling & Debugging](#error-handling--debugging)

---

## Overview

DocGenie is a 19-stage pipeline for generating synthetic document datasets with ground-truth annotations. It supports multiple document understanding tasks:

- **Document Question Answering (QA)**
- **Key Information Extraction (KIE)**
- **Document Layout Analysis (DLA)**
- **Document Classification (CLS)**

### Key Features

- **LLM-Powered Generation**: Uses Claude, Gemini, or open-source models to generate diverse document content
- **Realistic Handwriting**: Diffusion model-based handwriting synthesis with author-specific styles
- **Visual Element Integration**: Stamps, logos, barcodes, charts, and photos
- **Multi-Task Support**: Task-specific ground truth formatting and validation
- **Quality Assurance**: Comprehensive validation, OCR verification, and error tracking
- **Modular Design**: Each pipeline stage is independently executable with clear inputs/outputs

### Technology Stack

- **LLM APIs**: Claude (Anthropic), Gemini, DeepSeek, Qwen
- **PDF Rendering**: Playwright (Chromium), PyMuPDF
- **OCR**: Microsoft Azure OCR
- **Handwriting**: Custom diffusion model
- **Image Processing**: PIL, OpenCV
- **API Framework**: FastAPI
- **Data Processing**: Pandas, NumPy

---

## Pipeline Architecture

### High-Level Flow
```
┌───────────────────────────────────────┐
│    DOCGENIE GENERATION PIPELINE       │
└───────────────────────────────────────┘

┌──────────────────────┐
│  PHASE 1: SELECTION  │
└──────────────────────┘
        │
[01] Select Seeds ────────────▶ seeds.csv, clusters.csv
       (Cluster-based diverse seed selection)

┌──────────────────────┐
│  PHASE 2: LLM GEN    │
└──────────────────────┘
        │
[02] Prompt LLM ──────────────▶ batch_results/ (JSON)
 │     (Claude API batched calls)
 │
[03] Process Response ────────▶ raw_html/, raw_annotations/
       (Extract HTML & GT from responses)

┌──────────────────────┐
│  PHASE 3: RENDERING  │
└──────────────────────┘
        │
[04] Render PDF Initial ──────▶ pdf_initial/, geometries/
 │     (HTML→PDF with geometry extraction)
 │
[05] Extract BBoxes ──────────▶ pdf_word_bboxes/, pdf_char_bboxes/
 │     (PyMuPDF text extraction)
 │
[06] Extract Layout ──────────▶ layout_element_definitions/
       (DLA/KIE-specific annotations)

┌──────────────────────┐
│  PHASE 4: EXTRACTION │
└──────────────────────┘
        │
[07] Extract Handwriting ─────▶ handwriting_definitions/
 │     (Identify handwriting regions)
 │
[08] Extract Visual Elements ─▶ visual_element_definitions/
       (Stamp/logo/barcode placeholders)

┌──────────────────────┐
│  PHASE 5: GENERATION │
└──────────────────────┘
        │
[09] Create Handwriting ──────▶ handwriting_images/
 │     (Diffusion model generation)
 │
[10] Create Visual Elements ──▶ visual_element_images/
       (Generate/select stamps, logos, etc.)

┌──────────────────────┐
│ PHASE 6: COMPOSITION │
└──────────────────────┘
        │
[11] Render PDF (2nd Pass) ───▶ pdf_without_handwriting_placeholder/
 │     (Remove handwriting placeholders)
 │
[12] Insert Handwriting ──────▶ pdf_with_handwriting/
 │     (Overlay handwriting images)
 │
[13] Insert Visual Elements ──▶ pdf_final/
 │     (Overlay stamps, logos, etc.)
 │
[14] Render Image ────────────▶ images/
       (PDF→PNG conversion)

┌───────────────────────┐
│ PHASE 7: FINALIZATION │
└───────────────────────┘
        │
[15] Perform OCR ─────────────▶ final_word_bboxes/, final_segment_bboxes/
 │     (Microsoft OCR)
 │
[16] Normalize BBoxes ────────▶ normalized_word_bboxes/, normalized_segment_bboxes/
       (Pixel→[0,1] coordinates)

┌───────────────────────┐
│ PHASE 8: VALIDATION   │
└───────────────────────┘
        │
[17] GT Preparation ──────────▶ verified_gt/
 │     (Fuzzy matching, BIO tagging)
 │
[18] Analyze ─────────────────▶ dataset_log.json
 │     (Statistics, cost analysis)
 │
[19] Create Debug Data ───────▶ debug/ subdirectories
       (Visualizations for inspection)
```

### Data Flow Between Stages

```
Seed Images ──┐
              ├──▶ [02] ──▶ HTML + GT ──▶ [04] ──▶ PDF + Geometries
Prompt Params ┘                                          │
                 ┌───────────────────────────────────────┤
                 ▼                                       ▼
          [05] ──▶ BBoxes                        [11] ──▶ PDF (no HW)
                 │                                       │
                 ├──▶ [07] ──▶ HW Defs ──▶ [09] ──▶ HW Images
                 │                                       │
                 │                                [12] Insert HW
                 │                                       │
                 └──▶ [08] ──▶ VE Defs ──▶ [10] ──▶ VE Images
                                                         │
                                                  [13] Insert VE
                                                         │
                                                  [14] ──▶ Image
                                                         │
                                                  [15] ──▶ OCR BBoxes
                                                         │
                                                  [16] ──▶ Normalized
                                                         │
                                                  [17] ──▶ Verified GT
```

---

## Pipeline Stages (01-19)

### Stage 01: Select Seeds

**File:** `pipeline_01_select_seeds.py`

**Purpose:** Select diverse seed images from the base dataset using clustering, so the generated documents cover a wide variety of layouts and content.

**Key Functions:**
- `main()`: Orchestrates the seed selection process
- `downscale_and_compress_seeds()`: Prepares seed images for efficient API transmission
- `plot_class_distribution()`: Visualizes class balance in the selected seeds
- `visualize_cluster_histogram()`: Shows the distribution across clusters

**Process:**
1. Load embeddings from the base dataset
2. Perform clustering (KMeans or another algorithm)
3. Sample N seeds per cluster
4. Downscale and compress images (JPEG, max dimension)
5. Save the seed manifest and cluster assignments
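
The per-cluster sampling in step 3 can be sketched as follows; `sample_seeds_per_cluster` and its arguments are illustrative names, not the module's actual API.

```python
import random
from collections import defaultdict

def sample_seeds_per_cluster(cluster_assignments, n_per_cluster, seed=0):
    """Group document IDs by cluster and sample up to n_per_cluster from each.

    cluster_assignments: {doc_id: cluster_id}. Sorting makes the result
    reproducible for a fixed random seed.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for doc_id, cluster_id in cluster_assignments.items():
        clusters[cluster_id].append(doc_id)

    selected = []
    for cluster_id in sorted(clusters):
        members = sorted(clusters[cluster_id])
        k = min(n_per_cluster, len(members))
        selected.extend(rng.sample(members, k))
    return selected
```
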

**Inputs:**
- `SynDatasetDefinition` configuration
- Base dataset name (e.g., `docvqa`, `cord`, `publaynet`)
- Clustering parameters from constants

**Outputs:**
```
seeds.csv            # Selected seed document IDs per prompt call
clusters.csv         # Cluster assignments for all documents
seeds/ (directory)   # Preprocessed seed images (JPEG, compressed)
```

**Configuration Parameters:**
- `EMBEDDING_MODEL`: Specifies which embedding model was used
- `IMAGE_MAX_DIMENSION`: Max width/height for compression
- `JPEG_QUALITY`: Compression quality (0-100)

**Example Usage:**
```python
from docgenie.generation import pipeline_01_select_seeds

pipeline_01_select_seeds.main(
    syndatadef_path="data/syn_dataset_definitions/docvqa_alpha=1.0.yaml",
    base_dataset="docvqa"
)
```

---

### Stage 02: Prompt LLM

**File:** `pipeline_02_prompt_llm.py`

**Purpose:** Send batched prompts to LLM APIs (Claude, Gemini, DeepSeek, Qwen) with seed images to generate document HTML and ground truth.

**Key Functions:**
- `main()`: Main orchestrator for LLM prompting
- `create_batched_messages()`: Constructs API-compatible message batches
- `track_batch_completion()`: Polls the API for batch status
- Cost calculation utilities in `pipeline_01/cost.py`

**Process:**
1. Load seed images and encode them as base64
2. Build prompts from the template with parameter injection
3. Create batched API requests (Claude Batch API for cost efficiency)
4. Submit batches and track completion
5. Save results for processing in stage 03
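
Steps 1-3 boil down to assembling one request entry per seed image. A minimal sketch, with an Anthropic-style nesting of fields; the exact field names of the module's batch entries (and of the Batch API schema) are assumptions here:

```python
import base64

def build_message(prompt_text, image_bytes, custom_id):
    """Assemble one batched request entry: base64-encode the seed image
    and pair it with the instantiated prompt text. Field names follow an
    Anthropic-like shape but are illustrative only."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "custom_id": custom_id,  # lets results be matched back to documents
        "params": {
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64",
                                "media_type": "image/jpeg",
                                "data": encoded}},
                    {"type": "text", "text": prompt_text},
                ],
            }],
        },
    }
```
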

**Inputs:**
- Prompt template from `data/prompt_templates/<template_name>/`
- Seed images from stage 01
- API credentials from environment variables
- `SynDatasetDefinition` parameters

**Outputs:**
```
prompt_batches/    # Batch metadata (batch IDs, status)
message_results/   # JSON response files per batch
logs/              # Prompting logs and progress
```

**API Configuration:**
- **Claude:** Uses the Batch API with prompt caching for cost efficiency
- **Batch Size:** Configurable via the `BATCH_SIZE` constant
- **Polling Interval:** Configurable wait time between status checks
- **Model Selection:** Specified in `SynDatasetDefinition.llm_model`

**Cost Tracking:**
- Input/output token counts per request
- Cached token usage (for Claude)
- Total cost estimation per batch
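
The cost estimate combines the three token counters above. A sketch of the arithmetic; the per-million-token prices are placeholders, not published rates:

```python
def estimate_batch_cost(input_tokens, output_tokens, cached_tokens=0,
                        price_in=3.0, price_out=15.0, price_cached=0.3):
    """Estimate USD cost for one batch. Prices are per million tokens and
    purely illustrative; cached input tokens are billed at a reduced rate."""
    uncached = max(input_tokens - cached_tokens, 0)
    return (uncached * price_in
            + cached_tokens * price_cached
            + output_tokens * price_out) / 1_000_000
```
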

**Example Configuration:**
```yaml
# In syn_dataset_definition YAML
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1  # Documents per prompt
language: "English"
doc_type: "business and administrative"
```

---

### Stage 03: Process Response

**File:** `pipeline_03_process_response.py`

**Purpose:** Extract and validate HTML documents and ground truth annotations from LLM responses.

**Key Functions:**
- `main()`: Main processor
- `extract_html_from_message()`: Regex-based HTML extraction from markdown code blocks
- `extract_gt_from_html()`: Parse JSON ground truth from `<script id="GT">` tags
- `validate_and_save_gt()`: Task-specific validation and formatting

**Process:**
1. Load message results from stage 02
2. Extract HTML content using regex patterns
3. Parse and validate ground truth JSON
4. Apply task-specific formatting (QA, KIE, DLA, CLS)
5. Save raw HTML and annotations separately

**Inputs:**
- Message results from `message_results/` (stage 02)
- `SynDatasetDefinition` task type and prompt format

**Outputs:**
```
raw_html/                  # HTML files (one per document)
raw_annotations/           # Ground truth JSON files
  qa/                      # QA format: {"question": "answer", ...}
  kie/                     # KIE format: {"entity": "value", ...}
  dla/                     # DLA format: {"element_id": "label", ...}
  cls/                     # CLS format: {"class": "category"}
logs/message_processing/   # Processing logs per message
```

**Ground Truth Formats:**

**QA Format:**
```json
{
  "What is the invoice number?": "INV-12345",
  "What is the total amount?": "$1,234.56"
}
```

**KIE Format:**
```json
{
  "company_name": "Acme Corp",
  "invoice_date": "2024-01-15",
  "total_amount": "1234.56"
}
```

**DLA Format:**
```json
{
  "layout_0": "title",
  "layout_1": "text",
  "layout_2": "table"
}
```

**Validation Checks:**
- Expected document count matches actual
- Valid JSON structure in GT
- Task-specific field presence
- HTML completeness (valid tags)

**Error Handling:**
- Malformed HTML → Logged, skipped
- Missing GT → Document flagged in logs
- Invalid JSON → Parsing error logged

---

### Stage 04: Render PDF and Extract Geometries

**File:** `pipeline_04_render_pdf_and_extract_geos.py`

**Purpose:** Convert HTML to PDF with automatic content-based page sizing and extract element geometries (positions, dimensions) for downstream processing.

**Key Functions:**
- `main()`: Async batch rendering orchestrator
- `render_pdf_with_playwright()`: Single PDF rendering using Playwright/Chromium
- `preprocess_css()`: Remove conflicting CSS `@page` rules
- `validate_pdf()`: Check page count and file integrity
- `extract_geometries()`: Parse element positions from injected JavaScript

**Process:**
1. Load raw HTML from stage 03
2. Inject geometry extraction JavaScript
3. Preprocess CSS (remove fixed page sizes)
4. Render with Playwright in headless Chromium
5. Measure content dimensions dynamically
6. Export the PDF with the calculated page size
7. Extract and save element geometries
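
Step 3 (dropping fixed page sizes so the measured content size wins) can be sketched as a single substitution. This is a simplified stand-in for `preprocess_css`, assuming `@page` rules contain no nested braces:

```python
import re

def preprocess_css(html):
    """Strip CSS @page rules so the dynamically measured page size is not
    overridden by a fixed size declared in the document's own stylesheet.
    Assumes @page blocks contain no nested braces."""
    return re.sub(r"@page\s*[^{]*\{[^}]*\}", "", html)
```
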

**Inputs:**
- Raw HTML files from `raw_html/` (stage 03)

**Outputs:**
```
pdf_initial/           # Initial PDFs
geometries/            # Element geometry JSON files
  {doc_id}.json        # Contains positions, dimensions, CSS classes
render_html/           # HTML with geometry extraction scripts
debug/pdf_with_geos/   # Debug PDFs with geometry overlays (optional)
logs/rendering/        # Rendering logs and errors
```

**Geometry JSON Structure:**
```json
{
  "page_width_mm": 210.0,
  "page_height_mm": 297.0,
  "elements": [
    {
      "id": "layout_0",
      "type": "div",
      "class": "title",
      "rect": {
        "x": 20.0,
        "y": 30.0,
        "width": 170.0,
        "height": 15.0
      },
      "text": "Invoice",
      "attributes": {
        "data-label": "title",
        "data-handwriting": null,
        "data-visual-element": null
      }
    }
  ]
}
```

**Key Features:**
- **Automatic Page Sizing:** No fixed dimensions; the page size adapts to the content
- **Concurrent Rendering:** Semaphore-controlled parallel processing
- **Geometry Extraction:** JavaScript injected to capture element positions
- **CSS Coordinate Conversion:** 96 DPI (CSS) → 72 DPI (PDF)
- **Retry Logic:** Up to 3 attempts with a configurable timeout
- **Debug Visualizations:** Optional overlay of geometries on PDFs

**Configuration Constants:**
- `CHROMIUM_CONCURRENCY`: 10 (parallel render limit)
- `PER_PDF_RENDER_TIMEOUT`: 60 seconds
- `PER_PDF_RENDER_MAX_RETRIES`: 3 attempts
- `PDF_POINT_SCALING`: 72/96 for DPI conversion
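
The coordinate conversion behind `PDF_POINT_SCALING` is a fixed ratio: CSS measures in 96-DPI pixels, PDF in 72-DPI points. A minimal sketch (helper names are illustrative):

```python
CSS_DPI = 96   # CSS reference pixel density
PDF_DPI = 72   # PDF points per inch
PDF_POINT_SCALING = PDF_DPI / CSS_DPI  # 0.75

def css_px_to_pdf_points(value_px):
    """Convert a CSS pixel measurement (96 DPI) into PDF points (72 DPI)."""
    return value_px * PDF_POINT_SCALING

def css_rect_to_pdf(rect):
    """Convert a whole {x, y, width, height} rect from CSS px to points."""
    return {key: css_px_to_pdf_points(value) for key, value in rect.items()}
```
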

**Error Handling:**
- Timeout → Retry with an increased timeout
- Multi-page PDFs → Flagged and skipped
- Missing geometries → Error logged, document marked invalid

---

### Stage 05: Extract BBoxes from PDF

**File:** `pipeline_05_extract_bboxes_from_pdf.py`

**Purpose:** Extract word-level and character-level bounding boxes from PDFs using PyMuPDF for accurate text positioning.

**Key Functions:**
- `main()`: Main extraction orchestrator
- `extract_bboxes_from_pdf()`: PyMuPDF-based extraction
- `verify_char_to_word_mapping()`: Validate character-to-word relationships
- `create_bbox_debug_pdf()`: Generate debug visualizations

**Process:**
1. Load PDFs from stage 04
2. Extract words with PyMuPDF `get_text("words")`
3. Extract characters with `get_text("rawdict")`
4. Map characters to parent words
5. Validate mappings (required for handwriting splitting)
6. Save both word and character bboxes

**Inputs:**
- PDFs from `pdf_initial/` (stage 04)

**Outputs:**
```
pdf_word_bboxes/        # Word-level bounding boxes
  {doc_id}.json         # Format: [x0, y0, x1, y1, text]
pdf_char_bboxes/        # Character-level bounding boxes (if mappable)
  {doc_id}.json         # Format: [x0, y0, x1, y1, char, word_idx]
debug/pdf_bboxes/       # Debug PDFs with bbox overlays (optional)
logs/bbox_extraction/   # Extraction logs
```

**BBox JSON Format:**

**Word BBoxes:**
```json
[
  {"x0": 20.0, "y0": 30.0, "x1": 60.0, "y1": 45.0, "text": "Invoice"},
  {"x0": 65.0, "y0": 30.0, "x1": 95.0, "y1": 45.0, "text": "Date"}
]
```

**Char BBoxes (with word mapping):**
```json
[
  {"x0": 20.0, "y0": 30.0, "x1": 25.0, "y1": 45.0, "char": "I", "word_idx": 0},
  {"x0": 25.0, "y0": 30.0, "x1": 30.0, "y1": 45.0, "char": "n", "word_idx": 0}
]
```

**Key Features:**
- **Dual-Level Extraction:** Both word and character granularity
- **Mapping Validation:** Ensures characters can be mapped to words (critical for handwriting)
- **Debug Visualizations:** Color-coded bbox overlays on PDFs
- **PDF Coordinate System:** Uses 72 DPI (PDF points)

**Character-to-Word Mapping:**
Required for handwriting splitting in stage 07. If mapping fails (e.g., due to ligatures or complex fonts), char bboxes are not saved, and handwriting falls back to word-level splitting.
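
The mapping check in step 5 amounts to verifying that concatenating each word's characters (by `word_idx`) reproduces the word text. A simplified stand-in for `verify_char_to_word_mapping`, operating on the JSON shapes shown above:

```python
def verify_char_to_word_mapping(words, chars):
    """Return True if, for every word, joining the characters that claim it
    as their parent (via word_idx) reproduces the word's text. A False
    result triggers the word-level fallback described above."""
    for idx, word in enumerate(words):
        joined = "".join(c["char"] for c in chars if c["word_idx"] == idx)
        if joined != word["text"]:
            return False
    return True
```
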

**Error Handling:**
- Empty PDFs → Skipped with a warning
- Unmappable characters → Word-level fallback
- Invalid bboxes (negative dimensions) → Filtered out

---

### Stage 06: Extract Layout Element Definitions and Annotation GT

**File:** `pipeline_06_extract_layout_element_definitions_and_annotation_gt.py`

**Purpose:** Extract layout element definitions for Document Layout Analysis (DLA) and annotation-based ground truth for Key Information Extraction (KIE).

**Key Functions:**
- `main()`: Task router
- `handle_dla()`: Process DLA tasks
- `handle_kie()`: Process KIE tasks
- `parse_layout_elements()`: Extract layout elements from geometries
- `parse_kie_fields()`: Extract KIE annotated fields from geometries

**Process (DLA):**
1. Load geometries from stage 04
2. Filter elements with a `data-label` attribute
3. Validate labels against the task-specific valid labels
4. Extract bounding boxes and labels
5. Save layout element definitions
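
Steps 2-4 of the DLA path can be sketched against the geometry JSON from stage 04. This is a simplified stand-in for `parse_layout_elements` (the real function's signature and error reporting may differ):

```python
def parse_layout_elements(geometries, valid_labels):
    """Keep geometry elements that carry a data-label attribute, validating
    each label against the task's allowed set. Invalid labels are collected
    as (element_id, label) errors instead of raising."""
    elements, errors = [], []
    for el in geometries["elements"]:
        label = el.get("attributes", {}).get("data-label")
        if label is None:
            continue  # unlabeled elements are not layout elements
        if label not in valid_labels:
            errors.append((el["id"], label))
            continue
        elements.append({"id": el["id"], "label": label, "rect": el["rect"]})
    return elements, errors
```
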

**Process (KIE):**
1. Load geometries from stage 04
2. Filter elements with a `data-annotation` attribute
3. Extract entity names and text values
4. Validate against the expected entity schema
5. Save raw annotations for stage 17

**Inputs:**
- Geometries from `geometries/` (stage 04)
- Valid labels from `SynDatasetDefinition.valid_labels`
- Task type from `SynDatasetDefinition.task`

**Outputs:**
```
layout_element_definitions/   # For DLA tasks
  {doc_id}.json               # [{id, label, rect}, ...]
raw_annotations/              # For KIE tasks
  kie_annotations/
    {doc_id}.json             # {entity_name: text_value, ...}
```

**Layout Element Definition Format (DLA):**
```json
[
  {
    "id": "layout_0",
    "label": "title",
    "rect": {"x": 20.0, "y": 30.0, "width": 170.0, "height": 15.0}
  },
  {
    "id": "layout_1",
    "label": "text",
    "rect": {"x": 20.0, "y": 50.0, "width": 170.0, "height": 50.0}
  }
]
```

**KIE Annotation Format:**
```json
{
  "company_name": "Acme Corporation",
  "invoice_number": "INV-12345",
  "invoice_date": "January 15, 2024",
  "total_amount": "$1,234.56"
}
```

**Validation Checks:**
- **Missing labels:** Elements without `data-label` → Warning
- **Invalid labels:** Labels not in `valid_labels` → Error
- **Zero dimensions:** Width or height = 0 → Error
- **Multiple labels (KIE only):** One element mapped to multiple entities → Error
- **Missing text:** KIE elements without text → Warning

**Task-Specific Handling:**
- **DLA:** Converts geometries to `LayoutBBox` format for stage 17
- **KIE:** Extracts entity-value pairs for BIO tagging in stage 17
- **QA/CLS:** No processing in this stage (uses raw GT from stage 03)

---

### Stage 07: Extract Handwriting

**File:** `pipeline_07_extract_handwriting.py`

**Purpose:** Identify text regions marked for handwriting generation, map them to character/word bounding boxes, and prepare definitions for the diffusion model.

**Key Functions:**
- `main()`: Main extraction orchestrator
- `parse_handwriting_geometries()`: Extract elements with the `data-handwriting` attribute
- `map_handwriting_to_bboxes()`: Match handwriting text to the PDF bboxes spatially
- `split_long_words()`: Split words exceeding the diffusion model's character limit
- `extract_handwriting_style()`: Parse the author ID from the CSS class

**Process:**
1. Load geometries from stage 04
2. Filter elements with a `data-handwriting` attribute
3. Load word/character bboxes from stage 05
4. Spatially match handwriting regions to bboxes
5. Split long words to satisfy diffusion model constraints
6. Extract the author ID for style consistency
7. Save handwriting definitions with bbox mappings

**Inputs:**
- Geometries from `geometries/` (stage 04)
- Word bboxes from `pdf_word_bboxes/` (stage 05)
- Character bboxes from `pdf_char_bboxes/` (stage 05, if available)

**Outputs:**
```
handwriting_definitions/
  {doc_id}.json                # Handwriting region definitions
logs/handwriting_extraction/   # Extraction logs and warnings
```

**Handwriting Definition Format:**
```json
[
  {
    "id": "hw0",
    "text": "John Smith",
    "author_id": "author1",
    "bboxes": [
      "20.0,30.0,60.0,45.0,John",
      "65.0,30.0,95.0,45.0,Smith"
    ],
    "rect": {
      "x": 20.0,
      "y": 30.0,
      "width": 75.0,
      "height": 15.0
    },
    "is_signature": false
  }
]
```

**Key Features:**
- **Author ID Extraction:** Parses CSS classes like `handwriting-author1` for style consistency
- **Spatial Matching:** Uses bbox overlap/proximity to map handwriting regions to text
- **Long Word Splitting:** Splits words longer than `MAX_HANDWRITING_CHARS` (default: 7) into multiple images
- **Signature Detection:** Identifies signature fields for special handling
- **Character-Level Fallback:** Uses word bboxes if char-level mapping is unavailable

**Configuration Constants:**
- `MAX_HANDWRITING_CHARS`: 7 (max characters per diffusion generation)
- `SPATIAL_MATCH_THRESHOLD`: Pixel tolerance for bbox matching

**Mapping Logic:**
1. Find all words whose bboxes overlap/intersect with the handwriting rect
2. Extract text from the matched bboxes
3. Compare with the expected handwriting text
4. If char bboxes are available: split long words at character boundaries
5. If char bboxes are unavailable: split at word boundaries
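
The text side of the splitting logic can be sketched as follows. This is a simplified stand-in for `split_long_words` that chunks at fixed character offsets; the real implementation additionally uses the char bboxes to decide where each chunk sits on the page:

```python
def split_long_words(text, max_chars=7):
    """Split whitespace-separated words into chunks of at most max_chars
    characters, one chunk per diffusion model generation
    (max_chars mirrors MAX_HANDWRITING_CHARS)."""
    pieces = []
    for word in text.split():
        if len(word) <= max_chars:
            pieces.append(word)
        else:
            pieces.extend(word[i:i + max_chars]
                          for i in range(0, len(word), max_chars))
    return pieces
```
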

**Error Handling:**
- No bbox matches → Warning, handwriting region skipped
- Text mismatch → Logged, best match used
- Missing char bboxes → Word-level fallback (may affect quality)

---

### Stage 08: Extract Visual Element Definitions

**File:** `pipeline_08_extract_visual_element_definitions.py`

**Purpose:** Extract placeholders for visual elements (stamps, logos, barcodes, charts, photos) from geometries for generation in stage 10.

**Key Functions:**
- `main()`: Main extraction orchestrator
- `parse_visual_element_geometries()`: Extract elements with the `data-visual-element` attribute
- `parse_css_dimension()`: Parse width/height from CSS
- `parse_css_rotation()`: Extract the rotation angle from the CSS transform

**Process:**
1. Load geometries from stage 04
2. Filter elements with a `data-visual-element` attribute
3. Parse the element type (stamp, logo, barcode, etc.)
4. Extract dimensions and rotation from CSS
5. Parse content (e.g., stamp text, barcode value)
6. Validate element types and dimensions
7. Save visual element definitions

**Inputs:**
- Geometries from `geometries/` (stage 04)

**Outputs:**
```
visual_element_definitions/
  {doc_id}.json                   # Visual element definitions
logs/visual_element_extraction/   # Extraction logs
```

**Visual Element Definition Format:**
```json
[
  {
    "id": "ve0",
    "type": "stamp",
    "type_unmapped": "stamp",
    "content": "CONFIDENTIAL",
    "rect": {
      "x": 150.0,
      "y": 250.0,
      "width": 60.0,
      "height": 30.0
    },
    "rotation": -15.0
  },
  {
    "id": "ve1",
    "type": "barcode",
    "type_unmapped": "barcode",
    "content": "1234567890",
    "rect": {
      "x": 20.0,
      "y": 270.0,
      "width": 100.0,
      "height": 20.0
    },
    "rotation": 0.0
  }
]
```

**Supported Visual Element Types:**
- `stamp`: Text-based stamps (e.g., "APPROVED", "CONFIDENTIAL")
- `logo`: Company/brand logos (selected from prefabs)
- `barcode`: Code128 barcodes
- `chart`: Charts and graphs (selected from prefabs)
- `photo`: Photographic images (selected from prefabs)

**Type Mapping:**
Maps LLM-generated type names to standard types using the `VISUAL_ELEMENT_TYPE_MAPPING` constant.

**Key Features:**
- **Content Extraction:** Parses text content for stamps and values for barcodes
- **Rotation Support:** Extracts CSS `rotate()` transform angles
- **Dimension Parsing:** Handles CSS units (px, mm, %, etc.)
- **Type Validation:** Warns about unknown types and applies a fallback mapping

**Validation Checks:**
- Unknown type → Logged, mapped to the closest known type
- Zero dimensions → Error, element skipped
- Invalid rotation angle → Defaults to 0°
- Missing content (for stamps/barcodes) → Warning

**CSS Parsing Examples:**
```css
/* Rotation extraction */
transform: rotate(-15deg);   /* → -15.0 */
transform: rotate(0.5turn);  /* → 180.0 */

/* Dimension extraction */
width: 60mm;    /* → 60.0 (mm) */
height: 30px;   /* → converted to mm */
```
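
The rotation examples above can be handled by one regex. A simplified stand-in for `parse_css_rotation` that supports the `deg` and `turn` units shown (the real parser may accept more units, such as `rad`):

```python
import re

def parse_css_rotation(transform):
    """Extract a rotation angle in degrees from a CSS transform value.
    Supports deg and turn; anything unparseable defaults to 0.0."""
    match = re.search(r"rotate\(\s*(-?\d*\.?\d+)\s*(deg|turn)\s*\)", transform)
    if not match:
        return 0.0
    value, unit = float(match.group(1)), match.group(2)
    return value * 360.0 if unit == "turn" else value
```
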

---

### Stage 09: Create Handwriting Images

**File:** `pipeline_09_create_handwriting_images.py`

**Purpose:** Generate realistic handwriting images using a diffusion model, with per-author style consistency.

**Key Functions:**
- `main()`: Main generation orchestrator
- `generate_handwriting_diffusion()`: Call the diffusion model (API or local)
- `add_handwriting_blur()`: Optional post-processing for realism

**Process:**
1. Load handwriting definitions from stage 07
2. Group by author ID for style consistency
3. For each text segment:
   - Generate a handwriting image with the diffusion model
   - Apply optional blur for realism
   - Save the image with metadata
4. Track the author-to-writer style mapping
5. Save generation logs

**Inputs:**
- Handwriting definitions from `handwriting_definitions/` (stage 07)
- Diffusion model checkpoint (local or API)

**Outputs:**
```
handwriting_images/
  {doc_id}/
    hw0_0.png                  # First bbox of handwriting region hw0
    hw0_1.png                  # Second bbox (if multi-word)
    hw1_0.png
logs/handwriting_generation/   # Generation logs with author mappings
```

**Handwriting Image Specifications:**
- **Format:** PNG with transparency
- **Height:** 40 pixels (configurable via `HANDWRITING_HEIGHT_PX`)
- **Padding:** 0 pixels horizontal (configurable via `HANDWRITING_PADDING_PX`)
- **Background:** Transparent
- **Color:** Black text on a transparent background

**Diffusion Model Integration:**
Located in the `handwriting_diffusion/` module:
- `generate_handwriting_diffusion_raw.py`: Core generation logic
- `text_encoder.py`: Text encoding for model input
- `tokenizer.py`: Character tokenization

**Key Features:**
- **Author-Style Consistency:** Same author ID → consistent handwriting style
- **Writer Style Mapping:** Maps author IDs to writer style IDs (e.g., "author1" → writer_42)
- **Batch Processing:** Generates multiple images efficiently
- **Optional Blur:** Post-processing for a more realistic appearance
- **Quality Control:** Validates generated images (non-empty, correct dimensions)

**Configuration Constants:**
- `HANDWRITING_HEIGHT_PX`: 40 pixels
- `HANDWRITING_PADDING_PX`: 0 pixels
- `DIFFUSION_NUM_INFERENCE_STEPS`: 50 (generation quality/speed tradeoff)
- `HANDWRITING_BLUR_ENABLED`: Optional blur toggle
- `HANDWRITING_STYLES`: List of available writer styles

**Author-to-Writer Mapping Example:**
```json
{
  "author1": "writer_42",
  "author2": "writer_17",
  "author3": "writer_89"
}
```
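
A mapping like the one above must be stable within a document so the same author always renders in the same style. One way to achieve that stability is a hash-based assignment; this is a hypothetical scheme for illustration, not necessarily how the pipeline assigns writers:

```python
import hashlib

def map_authors_to_writers(author_ids, writer_styles):
    """Assign each distinct author ID a writer style deterministically by
    hashing the author ID, so repeated runs produce the same mapping.
    (Illustrative scheme; the pipeline's actual assignment may differ.)"""
    mapping = {}
    for author in sorted(set(author_ids)):
        digest = int(hashlib.sha256(author.encode()).hexdigest(), 16)
        mapping[author] = writer_styles[digest % len(writer_styles)]
    return mapping
```
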

**Error Handling:**
- Generation failure → Retry up to 3 times
- Empty image → Logged, document flagged
- Model timeout → Skipped, logged

---

### Stage 10: Create Visual Elements

**File:** `pipeline_10_create_visual_elements.py`

**Purpose:** Generate or select visual element images (stamps, logos, barcodes, charts, photos) based on the definitions from stage 08.

**Key Functions:**
- `main()`: Main generation orchestrator
- `route_visual_element_generation()`: Router for element type
- `generate_stamp()`: Create a stamp image with text
- `select_logo()`: Random selection from logo prefabs
- `generate_barcode()`: Create a Code128 barcode
- `select_photo()`: Random selection from photo prefabs
- `select_chart()`: Random selection from chart prefabs

**Process:**
1. Load visual element definitions from stage 08
2. For each element:
   - Route to the type-specific generator
   - Generate or select an image
   - Resize to target dimensions
   - Save with transparency (if applicable)
3. Cache prefab directories for performance
4. Save generation logs

**Inputs:**
- Visual element definitions from `visual_element_definitions/` (stage 08)
- Prefab directories in `data/visual_element_prefabs/`:
  - `logos/`
  - `photos/`
  - `charts/`

**Outputs:**
```
visual_element_images/
  {doc_id}/
    ve0.png                       # Stamp image
    ve1.png                       # Logo image
    ve2.png                       # Barcode image
logs/visual_element_generation/   # Generation logs
```

**Type-Specific Generation:**

**Stamps:**
- Font: Configurable (default: Arial Bold)
- Color: Configurable (default: red/blue)
- Background: Transparent
- Border: Optional rounded rectangle
- Rotation: Applied during insertion (stage 13)

**Logos:**
- Source: Random selection from `data/visual_element_prefabs/logos/`
- Format: PNG with transparency preserved
- Caching: Directory contents cached after the first scan

**Barcodes:**
- Type: Code128 (supports alphanumeric content)
- Library: `python-barcode`
- Background: White
- Content validation: Numeric or alphanumeric

**Photos:**
- Source: Random selection from `data/visual_element_prefabs/photos/`
- Format: JPEG or PNG
- Aspect ratio: Preserved during resize

**Charts/Figures:**
- Source: Random selection from `data/visual_element_prefabs/charts/`
- Format: PNG with transparency
- Types: Bar charts, line graphs, pie charts, etc.

**Key Features:**
- **Type-Specific Logic:** Each element type has a dedicated generation function
- **Prefab Caching:** Directory scans are cached for performance
- **Transparent Backgrounds:** Stamps and some logos support transparency
- **Content Validation:** Barcodes validate that their content is numeric or alphanumeric
- **Aspect Ratio Preservation:** Images are scaled without distortion

**Configuration Constants:**
- `STAMP_FONT_SIZE`: Calculated from target dimensions
- `STAMP_BORDER_WIDTH`: 2 pixels
- `BARCODE_DPI`: 300 for high quality
- `PREFAB_CACHE_SIZE`: In-memory cache limit

**Error Handling:**
- Missing prefab directory → Error, element skipped
- Empty prefab directory → Warning, fallback placeholder used
- Invalid barcode content → Logged, fallback text used
- Image generation failure → Placeholder created

---

### Stage 11: Render PDF Second Pass

**File:** `pipeline_11_render_pdf_second_pass.py`

**Purpose:** Re-render the PDF without handwriting placeholders (replaced with blank space) to prepare for handwriting image insertion in stage 12.

**Key Functions:**
- `main()`: Main rendering orchestrator
- `render_pdf_playwright_async()`: Async Playwright rendering
- `remove_handwriting_placeholders_from_html()`: Strip handwriting elements

**Process:**
1. Load HTML from `render_html/` (stage 04)
2. Remove all elements with a `data-handwriting` attribute
3. Re-render the PDF using the same dimensions as stage 04
4. Validate the PDF output
5. Save PDFs without handwriting placeholders
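
Step 2 can be sketched as a regex substitution. This is a simplified stand-in for `remove_handwriting_placeholders_from_html`; it assumes the placeholders are simple, non-nested `<div>`/`<span>` elements (a real implementation would more likely use an HTML parser):

```python
import re

def remove_handwriting_placeholders_from_html(html):
    """Drop every element that carries a data-handwriting attribute.
    Assumes non-nested <div>/<span> placeholders, as in the example below."""
    pattern = r"<(div|span)\b[^>]*\bdata-handwriting\b[^>]*>.*?</\1>"
    return re.sub(pattern, "", html, flags=re.DOTALL)
```
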
|
|
| **Inputs:** |
| - HTML from `render_html/` (stage 04) |
| - Page dimensions from stage 04 logs |
|
|
| **Outputs:** |
| ``` |
| pdf_without_handwriting_placeholder/ |
| {doc_id}.pdf # PDFs with handwriting regions blank |
| logs/rendering_second_pass/ # Rendering logs |
| ``` |
|
|
| **Key Differences from Stage 04:** |
| - **Uses Pre-calculated Dimensions:** No content measurement needed |
| - **Handwriting Elements Removed:** Not just hidden (visibility: hidden), but removed from DOM |
| - **No Geometry Extraction:** Geometries already saved in stage 04 |
| - **Faster Rendering:** No JavaScript injection for measurement |
|
|
| **HTML Preprocessing:** |
| ```html |
| <!-- Before (Stage 04) --> |
| <div data-handwriting="author1" class="handwriting">John Smith</div> |
| |
| <!-- After (Stage 11) --> |
| <!-- Element completely removed --> |
| ``` |
|
|
| **Why This Stage Exists:** |
| Handwriting placeholders in HTML use system fonts, which don't match the diffusion-generated handwriting. Removing them creates blank spaces where handwriting images will be inserted in stage 12. |
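| 
| 
| A minimal sketch of the stripping step, assuming placeholders are single, non-nested elements carrying the `data-handwriting` attribute (the real `remove_handwriting_placeholders_from_html()` may well use a proper HTML parser rather than this naive regex):
| 
| ```python
| import re
| 
| # Naive pattern: an opening tag with a data-handwriting attribute,
| # its (non-nested) content, and the matching closing tag.
| _PLACEHOLDER_RE = re.compile(r"<[^>]*\bdata-handwriting\b[^>]*>.*?</[^>]+>", re.S)
| 
| def remove_handwriting_placeholders(html: str) -> str:
|     """Remove handwriting placeholder elements so their regions render blank."""
|     return _PLACEHOLDER_RE.sub("", html)
| ```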
|
|
| **Rendering Configuration:** |
| - Same timeout and retry logic as stage 04 |
| - Reuses Playwright browser context for efficiency |
| - No debug output (geometries not extracted) |
|
|
| **Error Handling:** |
| - Multi-page PDF → Flagged and skipped
| - Rendering timeout → Retry with increased timeout
| - HTML parsing error → Logged, document marked invalid
|
|
| --- |
|
|
| ### Stage 12: Insert Handwriting Images |
|
|
| **File:** `pipeline_12_insert_handwriting_images.py` |
|
|
| **Purpose:** Overlay generated handwriting images onto PDF pages using PyMuPDF, with precise positioning and natural variation. |
|
|
| **Key Functions:** |
| - `main()`: Main insertion orchestrator |
| - `insert_handwriting_into_pdf()`: Per-document insertion |
| - `scale_image_with_aspect_ratio()`: Resize images while preserving aspect |
| - `group_bboxes_by_line()`: Group multi-word handwriting by line |
|
|
| **Process:** |
| 1. Load PDFs from stage 11 |
| 2. Load handwriting images from stage 09 |
| 3. Load handwriting definitions (bbox mappings) from stage 07 |
| 4. For each handwriting region: |
| - Group bboxes by line/block |
| - Scale images to fit bboxes |
| - Apply random offsets for natural variation |
| - Insert images at calculated positions |
| 5. Save PDFs with handwriting |
|
|
| **Inputs:** |
| - PDFs from `pdf_without_handwriting_placeholder/` (stage 11) |
| - Handwriting images from `handwriting_images/` (stage 09) |
| - Handwriting definitions from `handwriting_definitions/` (stage 07) |
|
|
| **Outputs:** |
| ``` |
| pdf_with_handwriting/ |
| {doc_id}.pdf # PDFs with handwriting inserted |
| logs/handwriting_insertion/ # Insertion logs |
| debug/handwriting_insertion/ # Debug PDFs with bbox overlays (optional) |
| ``` |
|
|
| **Insertion Logic:** |
|
|
| **1. Image Scaling:** |
| - High-res scaling: 3x upsampling before insertion |
| - Aspect ratio preserved |
| - Left-aligned within bbox (respects layout rect) |
|
|
| **2. Positioning:** |
| - **X coordinate:** Left edge of bbox + random offset |
| - **Y coordinate:** Top edge of bbox + random offset |
| - **Multi-word lines:** Consistent Y offset for line |
|
|
| **3. Random Offsets:** |
| ```python |
| x_offset = random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X) |
| y_offset = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y) |
| ``` |
|
|
| **4. Block/Line Grouping:** |
| For multi-word handwriting (e.g., "John Smith"): |
| - Group bboxes by Y coordinate (same line) |
| - Apply consistent Y offset to entire line |
| - Individual X offsets per word for natural spacing |
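| 
| 
| A sketch of the offset logic implied by these rules (function and argument names are illustrative; the pipeline's actual bounds are `MAX_HANDWRITING_RAND_X` and `MAX_HANDWRITING_RAND_Y`):
| 
| ```python
| import random
| 
| def apply_line_offsets(lines, max_x=2.0, max_y=1.0, seed=None):
|     """lines: list of lines, each a list of word bboxes with 'x0'/'y0' keys.
| 
|     Returns one (x, y) insertion origin per word: a single y offset is drawn
|     per line (keeps the baseline consistent), while each word gets its own
|     x offset for natural spacing."""
|     rng = random.Random(seed)
|     placed = []
|     for line in lines:
|         dy = rng.uniform(-max_y, max_y)      # consistent y offset per line
|         for bb in line:
|             dx = rng.uniform(-max_x, max_x)  # per-word x jitter
|             placed.append((bb["x0"] + dx, bb["y0"] + dy))
|     return placed
| ```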
|
|
| **Key Features:** |
| - **High-Resolution Insertion:** 3x scaling for quality |
| - **Natural Variation:** Random offsets simulate handwriting imperfection |
| - **Line Consistency:** Multi-word lines maintain baseline |
| - **Aspect Ratio Preservation:** Images not distorted |
| - **Transparency Support:** PNG alpha channel preserved |
|
|
| **Configuration Constants:** |
| - `HANDWRITING_IMAGE_UPSCALE_FACTOR`: 3x |
| - `MAX_HANDWRITING_RAND_X`: ±2 pixels
| - `MAX_HANDWRITING_RAND_Y`: ±1 pixel
| - `HANDWRITING_LINE_Y_CONSISTENCY`: Same Y offset per line |
|
|
| **Coordinate System:** |
| - PDF uses 72 DPI (points) |
| - Bboxes from stage 07 are in PDF coordinates |
| - No conversion needed |
|
|
| **Error Handling:** |
| - Missing handwriting image → Warning, bbox skipped
| - Image too large for bbox → Scaled down with warning
| - PyMuPDF insertion failure → Logged, document flagged
|
|
| --- |
|
|
| ### Stage 13: Insert Visual Elements |
|
|
| **File:** `pipeline_13_insert_visual_elements.py` |
|
|
| **Purpose:** Overlay visual element images (stamps, logos, barcodes, etc.) onto PDF pages with precise positioning and rotation. |
|
|
| **Key Functions:** |
| - `main()`: Main insertion orchestrator |
| - `insert_visual_elements_into_pdf()`: Per-document insertion |
| - `scale_image_with_aspect_ratio()`: Resize images (same as stage 12) |
|
|
| **Process:** |
| 1. Load PDFs from stage 12 |
| 2. Load visual element images from stage 10 |
| 3. Load visual element definitions from stage 08 |
| 4. For each visual element: |
| - Scale image to fit bbox |
| - Calculate centered position |
| - Apply rotation (if specified) |
| - Insert image at calculated position |
| 5. Save final PDFs |
| 6. If no visual elements: copy PDF from stage 12 |
|
|
| **Inputs:** |
| - PDFs from `pdf_with_handwriting/` (stage 12) |
| - Visual element images from `visual_element_images/` (stage 10) |
| - Visual element definitions from `visual_element_definitions/` (stage 08) |
|
|
| **Outputs:** |
| ``` |
| pdf_final/ |
| {doc_id}.pdf # Final PDFs with all elements |
| logs/visual_element_insertion/ # Insertion logs |
| debug/visual_element_insertion/ # Debug PDFs (optional) |
| ``` |
|
|
| **Insertion Logic:** |
|
|
| **1. Image Scaling:** |
| - High-res scaling: 3x upsampling |
| - Aspect ratio preserved |
| - Centered within bbox (not left-aligned like handwriting) |
|
|
| **2. Positioning:** |
| - **X coordinate:** Center of bbox - half image width |
| - **Y coordinate:** Center of bbox - half image height |
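| 
| 
| The centering rule reduces to pure bbox arithmetic; a sketch (function name assumed; the pipeline combines this with `scale_image_with_aspect_ratio()`):
| 
| ```python
| def centered_rect(bbox, img_w, img_h):
|     """Scale an img_w x img_h image to fit bbox=(x0, y0, x1, y1) while
|     preserving aspect ratio, then center the result inside the bbox."""
|     scale = min((bbox[2] - bbox[0]) / img_w, (bbox[3] - bbox[1]) / img_h)
|     w, h = img_w * scale, img_h * scale
|     cx, cy = (bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2
|     return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
| ```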
|
|
| **3. Rotation:** |
| - Applied via PyMuPDF transformation matrix |
| - Rotation around image center |
| - Angle from visual element definition |
|
|
| **Key Differences from Stage 12 (Handwriting):** |
| - **Centered placement:** Visual elements centered in bbox |
| - **No random offsets:** Precise placement for logos/stamps |
| - **Rotation support:** Stamps often rotated for "APPROVED" effect |
| - **Fallback:** Copies PDF if no visual elements (ensures output exists) |
|
|
| **Rotation Transformation:** |
| ```python |
| # PyMuPDF rotation matrix: start from identity zoom, then rotate
| rotation_matrix = fitz.Matrix(1, 1).prerotate(rotation_angle)
| image_rect = image_rect * rotation_matrix  # transform the target rect
| ``` |
|
|
| **Key Features:** |
| - **High-Resolution Insertion:** 3x scaling for quality |
| - **Centered Alignment:** Visual elements centered in bboxes |
| - **Rotation Support:** Arbitrary angles for stamps |
| - **Transparency Preservation:** PNG alpha channel maintained |
| - **Fallback Handling:** Copies PDF if no visual elements |
|
|
| **Configuration Constants:** |
| - `VISUAL_ELEMENT_UPSCALE_FACTOR`: 3x |
| - `ROTATION_PRECISION`: Angle precision in degrees |
|
|
| **Coordinate System:** |
| - Same as stage 12 (PDF 72 DPI) |
| - Rotation applied after positioning |
|
|
| **Error Handling:** |
| - Missing visual element image → Warning, element skipped
| - Image too large for bbox → Scaled down with warning
| - PyMuPDF insertion failure → Logged, document flagged
| - No visual elements → PDF copied from stage 12
|
|
| --- |
|
|
| ### Stage 14: Render Image |
|
|
| **File:** `pipeline_14_render_image.py` |
|
|
| **Purpose:** Convert final PDFs to high-quality PNG images for OCR and dataset distribution. |
|
|
| **Key Functions:** |
| - `main()`: Main conversion orchestrator |
| - `convert_pdf_to_image()`: PDF to PNG conversion |
|
|
| **Process:** |
| 1. Load PDFs from stage 13 |
| 2. Convert each PDF to PNG using custom PDF-to-image module |
| 3. Validate image dimensions |
| 4. Save images |
|
|
| **Inputs:** |
| - PDFs from `pdf_final/` (stage 13) |
|
|
| **Outputs:** |
| ``` |
| images/ |
| {doc_id}.png # Final document images |
| logs/image_rendering/ # Conversion logs |
| ``` |
|
|
| **Image Specifications:** |
| - **Format:** PNG |
| - **DPI:** Configurable (default: 200 DPI for quality OCR) |
| - **Color Mode:** RGB (24-bit) |
| - **Compression:** PNG lossless |
|
|
| **PDF-to-Image Module:** |
| Located in custom module (not standard library): |
| - Handles PDF rendering at specified DPI |
| - Single-page conversion only (multi-page PDFs skipped) |
| - Uses PDF coordinate system: 72 DPI internally |
|
|
| **Key Features:** |
| - **High DPI:** 200+ DPI for accurate OCR |
| - **Single Page Only:** Multi-page PDFs flagged as errors |
| - **Lossless Compression:** PNG preserves all details |
| - **Size Validation:** Checks image dimensions match PDF |
|
|
| **Configuration Constants:** |
| - `IMAGE_DPI`: 200 (OCR quality vs file size tradeoff) |
| - `IMAGE_MAX_DIMENSION`: Optional max width/height |
|
|
| **Coordinate System Conversion:** |
| ``` |
| PDF: 210mm × 297mm @ 72 DPI  → 595 × 842 points
| PNG: 210mm × 297mm @ 200 DPI → 1654 × 2339 pixels
| ``` |
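| 
| 
| These figures follow from two small conversions (helper names are illustrative):
| 
| ```python
| def mm_to_points(mm: float) -> int:
|     """Millimetres to PDF points: 72 points per inch, 25.4 mm per inch."""
|     return round(mm / 25.4 * 72)
| 
| def mm_to_pixels(mm: float, dpi: int) -> int:
|     """Millimetres to pixels at the given rendering DPI."""
|     return round(mm / 25.4 * dpi)
| ```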
|
|
| **Why This Stage:** |
| - OCR performs better on high-DPI images |
| - Images are final output format for datasets |
| - PNG preserves quality better than JPEG for text |
|
|
| **Error Handling:** |
| - Multi-page PDF → Flagged and skipped
| - Conversion failure → Logged, document marked invalid
| - Empty image → Error, document flagged
| - Dimension mismatch → Warning logged
|
|
| --- |
|
|
| ### Stage 15: Perform OCR |
|
|
| **File:** `pipeline_15_perform_ocr.py` |
|
|
| **Purpose:** Perform Optical Character Recognition on final images to obtain accurate word and line-level bounding boxes, essential for documents with handwriting or visual elements. |
|
|
| **Key Functions:** |
| - `main()`: Main OCR orchestrator |
| - `call_microsoft_ocr()`: Microsoft Azure OCR API call |
| - `convert_ocr_to_bbox_format()`: Transform OCR results to internal format |
| - `aggregate_words_to_segments()`: Group words into lines |
|
|
| **Process:** |
| 1. Determine which documents need OCR:
|    - Has handwriting → Requires OCR
|    - Has visual elements → Requires OCR
|    - Neither → Copy PDF bboxes from stage 05
| 2. For documents requiring OCR: |
| - Call Microsoft OCR service |
| - Parse word-level bboxes |
| - Aggregate into line-level segments |
| - Convert coordinates to PDF space |
| 3. Save final bboxes |
|
|
| **Inputs:** |
| - Images from `images/` (stage 14) |
| - Handwriting definitions from `handwriting_definitions/` (stage 07) |
| - Visual element definitions from `visual_element_definitions/` (stage 08) |
| - PDF bboxes from `pdf_word_bboxes/` (stage 05, for non-OCR documents) |
|
|
| **Outputs:** |
| ``` |
| final_word_bboxes/ |
| {doc_id}.json # Word-level bounding boxes |
| final_segment_bboxes/ |
| {doc_id}.json # Line-level bounding boxes |
| ocr_results_cache/ |
| {doc_id}.json # Raw OCR API responses (cached) |
| logs/ocr/ # OCR logs and errors |
| ``` |
|
|
| **OCR Decision Logic:** |
| ```python |
| requires_ocr = ( |
| has_handwriting(doc_id) or |
| has_visual_elements(doc_id) |
| ) |
| |
| if requires_ocr: |
| perform_microsoft_ocr() |
| else: |
| copy_pdf_bboxes() # Reuse stage 05 results |
| ``` |
|
|
| **Why Handwriting/Visual Elements Require OCR:** |
| - Handwriting images inserted in stage 12 → Not in PDF text layer
| - Visual elements inserted in stage 13 → Not in PDF text layer
| - PDF text extraction (stage 05) misses these elements
| - OCR captures all visible text on rendered image
|
|
| **Microsoft OCR API:** |
| - **Service:** Azure Computer Vision (Read API) |
| - **Input:** PNG image (from stage 14) |
| - **Output:** Word polygons, text, confidence scores |
| - **Caching:** Results cached to avoid re-processing |
|
|
| **OCR Response Format:** |
| ```json |
| { |
| "readResults": [{ |
| "page": 1, |
| "words": [ |
| { |
| "boundingBox": [20, 30, 60, 30, 60, 45, 20, 45], |
| "text": "Invoice", |
| "confidence": 0.98 |
| } |
| ], |
| "lines": [ |
| { |
| "boundingBox": [20, 30, 95, 30, 95, 45, 20, 45], |
| "text": "Invoice Date", |
| "words": [...] |
| } |
| ] |
| }] |
| } |
| ``` |
|
|
| **Coordinate Conversion:** |
| OCR returns pixel coordinates; converted to PDF points: |
| ```python |
| pdf_x = ocr_x * (pdf_width / image_width) |
| pdf_y = ocr_y * (pdf_height / image_height) |
| ``` |
|
|
| **Segment Aggregation:** |
| Groups words into lines based on Y coordinate proximity. |
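| 
| 
| A greedy sketch of this aggregation, assuming word dicts with `x0`/`y0`/`x1`/`y1`/`text` keys and an illustrative y tolerance (the production `aggregate_words_to_segments()` may differ in details):
| 
| ```python
| def aggregate_words_to_segments(words, y_tol=5.0):
|     """Sort words by (y0, x0); a word joins the current segment when its
|     y0 lies within y_tol of the segment's y0, else it starts a new one."""
|     segments = []
|     for w in sorted(words, key=lambda w: (w["y0"], w["x0"])):
|         if segments and abs(segments[-1]["y0"] - w["y0"]) <= y_tol:
|             seg = segments[-1]
|             seg["text"] += " " + w["text"]
|             seg["x0"] = min(seg["x0"], w["x0"])
|             seg["x1"] = max(seg["x1"], w["x1"])
|             seg["y1"] = max(seg["y1"], w["y1"])
|         else:
|             segments.append(dict(w))
|     return segments
| ```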
|
|
| **Key Features:** |
| - **Selective OCR:** Only runs OCR when necessary (cost/time savings) |
| - **Result Caching:** Avoids redundant API calls |
| - **Dual-Level Output:** Word and line (segment) bboxes |
| - **Coordinate Conversion:** OCR pixels → PDF points
| - **Fallback:** Copies PDF bboxes for documents without handwriting/VEs |
|
|
| **Configuration:** |
| - OCR API key from environment variables |
| - Timeout: 30 seconds per request |
| - Retry: Up to 3 attempts |
|
|
| **Error Handling:** |
| - OCR API failure → Retry, fallback to PDF bboxes if all retries fail
| - Empty OCR result → Warning, uses PDF bboxes
| - Coordinate conversion error → Logged, bbox skipped
|
|
| --- |
|
|
| ### Stage 16: Normalize BBoxes |
|
|
| **File:** `pipeline_16_normalize_bboxes.py` |
|
|
| **Purpose:** Convert bounding box coordinates from PDF points (absolute pixels) to normalized [0, 1] coordinates for model training and evaluation. |
|
|
| **Key Functions:** |
| - `main()`: Main normalization orchestrator |
| - `normalize_word_and_segment_bboxes()`: Normalize word/segment bboxes |
| - `normalize_layout_bboxes()`: Normalize layout element bboxes (DLA only) |
| - `normalize_coordinates()`: Core coordinate transformation |
|
|
| **Process:** |
| 1. Load final bboxes from stage 15 |
| 2. Load image dimensions (for normalization denominators) |
| 3. For each bbox: |
| - Transform: `normalized_x = pixel_x / image_width` |
| - Transform: `normalized_y = pixel_y / image_height` |
| - Preserve text content |
| 4. Save normalized bboxes |
| 5. For DLA tasks: also normalize layout element bboxes |
|
|
| **Inputs:** |
| - Word bboxes from `final_word_bboxes/` (stage 15) |
| - Segment bboxes from `final_segment_bboxes/` (stage 15) |
| - Layout element definitions from `layout_element_definitions/` (stage 06, for DLA) |
| - Image dimensions from stage 14 logs |
|
|
| **Outputs:** |
| ``` |
| normalized_word_bboxes/ |
| {doc_id}.json # Normalized word bboxes |
| normalized_segment_bboxes/ |
| {doc_id}.json # Normalized segment bboxes |
| normalized_gt/ # For DLA tasks only |
| {doc_id}.json # Normalized layout elements |
| logs/normalization/ # Normalization logs |
| ``` |
|
|
| **Normalization Formula:** |
| ```python |
| normalized_bbox = { |
| "x0": bbox["x0"] / image_width, |
| "y0": bbox["y0"] / image_height, |
| "x1": bbox["x1"] / image_width, |
| "y1": bbox["y1"] / image_height, |
| "text": bbox["text"] # Preserved |
| } |
| ``` |
|
|
| **Normalized BBox Format:** |
| ```json |
| [ |
| { |
| "x0": 0.095, # Was: 20 pixels out of 210mm @ 200dpi |
| "y0": 0.128, # Was: 30 pixels |
| "x1": 0.286, # Was: 60 pixels |
| "y1": 0.192, # Was: 45 pixels |
| "text": "Invoice" |
| } |
| ] |
| ``` |
|
|
| **DLA-Specific Normalization:** |
| For Document Layout Analysis tasks, layout element bboxes are also normalized: |
| ```json |
| [ |
| { |
| "label": "title", |
| "bbox": [0.095, 0.128, 0.905, 0.192] # [x0, y0, x1, y1] |
| }, |
| { |
| "label": "text", |
| "bbox": [0.095, 0.213, 0.905, 0.534] |
| } |
| ] |
| ``` |
|
|
| **Why Normalization:** |
| - **Model Training:** Most models expect [0, 1] coordinates |
| - **Resolution Independence:** Works across different image sizes |
| - **Standard Format:** Matches common dataset formats (e.g., LayoutLM) |
|
|
| **Coordinate System Mapping:** |
| ``` |
| PDF Points (72 DPI):
|   595 × 842 points (A4)
|     ↓ (stage 14 conversion @ 200 DPI)
| Image Pixels (200 DPI):
|   1654 × 2339 pixels
|     ↓ (stage 16 normalization)
| Normalized [0, 1]:
|   0.0-1.0 × 0.0-1.0
| ``` |
|
|
| **Key Features:** |
| - **Preserves Text:** Text content unchanged during normalization |
| - **Task-Specific:** DLA tasks get additional layout bbox normalization |
| - **Validation:** Checks for out-of-bounds coordinates (clamps to [0, 1]) |
| - **Precision:** Full float precision maintained |
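| 
| 
| The transform plus clamping can be sketched per bbox (function name assumed):
| 
| ```python
| def normalize_bbox(bbox, image_width, image_height):
|     """Pixel coordinates -> [0, 1], clamping out-of-bounds values."""
|     clamp = lambda v: min(1.0, max(0.0, v))
|     return {
|         "x0": clamp(bbox["x0"] / image_width),
|         "y0": clamp(bbox["y0"] / image_height),
|         "x1": clamp(bbox["x1"] / image_width),
|         "y1": clamp(bbox["y1"] / image_height),
|         "text": bbox["text"],  # text content preserved unchanged
|     }
| ```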
|
|
| **Error Handling:** |
| - Out-of-bounds coordinates → Clamped to [0, 1] with warning
| - Missing image dimensions → Error, document skipped
| - Zero image dimensions → Error, document marked invalid
|
|
| --- |
|
|
| ### Stage 17: GT Preparation & Verification |
|
|
| **File:** `pipeline_17_gt_preparation_verification.py` |
|
|
| **Purpose:** Validate and prepare final ground truth annotations with fuzzy text matching, task-specific formatting, and comprehensive validation. |
|
|
| **Key Functions:** |
| - `main()`: Main verification orchestrator |
| - `route_task()`: Route to task-specific handler |
| - `handle_qa()`: QA ground truth processing |
| - `handle_kie()`: KIE ground truth with BIO tagging |
| - `handle_dla()`: DLA ground truth processing |
| - `fuzzy_match_text()`: Levenshtein-based text matching |
|
|
| **Process:** |
| 1. Load raw GT from stage 03/06 |
| 2. Load final word bboxes from stage 15 |
| 3. Route to task-specific handler |
| 4. Perform fuzzy text matching (GT text → OCR text)
| 5. Map GT annotations to bbox indices |
| 6. Apply task-specific formatting |
| 7. Validate and save verified GT |
|
|
| **Inputs:** |
| - Raw GT from `raw_annotations/` (stage 03/06) |
| - Word bboxes from `final_word_bboxes/` (stage 15) |
| - Layout elements from `layout_element_definitions/` (stage 06, for DLA) |
| - Visual elements from `visual_element_definitions/` (stage 08, for DLA) |
|
|
| **Outputs:** |
| ``` |
| verified_gt/ |
| qa/ |
| {doc_id}.json # QA format |
| kie/ |
| {doc_id}.json # KIE format with BIO tagging |
| dla/ |
| {doc_id}.json # DLA format with normalized bboxes |
| cls/ |
| {doc_id}.json # Classification format |
| logs/gt_verification/ # Verification logs with match statistics |
| ``` |
|
|
| --- |
|
|
| #### **QA (Question Answering) Format** |
|
|
| **Process:** |
| 1. Load QA pairs: `{"question": "answer", ...}` |
| 2. For each answer: |
| - Find words in bboxes matching answer text (fuzzy) |
| - Record bbox indices of matching words |
| 3. Save verified GT with bbox mappings |
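| 
| 
| A sketch of the answer-to-bbox search, using the stdlib `SequenceMatcher` as a stand-in for the pipeline's `fuzz.ratio` (window sizing and names are assumptions):
| 
| ```python
| from difflib import SequenceMatcher
| 
| def find_answer_indices(answer, words, threshold=0.85):
|     """Slide a small window over OCR words and return the indices of the
|     best fuzzy match for the answer; empty list when below threshold."""
|     n = max(1, len(answer.split()))
|     best_score, best_indices = 0.0, []
|     for i in range(len(words)):
|         for j in range(i + 1, min(i + n + 2, len(words) + 1)):
|             candidate = " ".join(w["text"] for w in words[i:j])
|             score = SequenceMatcher(None, answer.lower(), candidate.lower()).ratio()
|             if score > best_score:
|                 best_score, best_indices = score, list(range(i, j))
|     return best_indices if best_score >= threshold else []
| ```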
|
|
| **Output Format:** |
| ```json |
| { |
| "questions": [ |
| { |
| "question": "What is the invoice number?", |
| "answer": "INV-12345", |
| "answer_bbox_indices": [15, 16] # Word indices in final_word_bboxes |
| }, |
| { |
| "question": "What is the total amount?", |
| "answer": "$1,234.56", |
| "answer_bbox_indices": [42] |
| } |
| ] |
| } |
| ``` |
|
|
| **Fuzzy Matching:** |
| Uses Levenshtein distance with 0.85 similarity cutoff: |
| ```python |
| similarity = fuzz.ratio(gt_answer, ocr_text) / 100.0 |
| if similarity >= 0.85: |
| match_found = True |
| ``` |
|
|
| --- |
|
|
| #### **KIE (Key Information Extraction) Format** |
|
|
| **Process:** |
| 1. Load entity annotations: `{entity_name: text_value, ...}` |
| 2. For each entity: |
| - Find words matching entity value (fuzzy) |
| - Generate BIO tags for all words |
| 3. Save verified GT with BIO tagging |
|
|
| **Output Format:** |
| ```json |
| { |
| "entities": [ |
| { |
| "entity": "company_name", |
| "value": "Acme Corporation", |
| "bbox_indices": [5, 6] |
| }, |
| { |
| "entity": "invoice_date", |
| "value": "January 15, 2024", |
| "bbox_indices": [20, 21, 22] |
| } |
| ], |
| "word_labels": [ |
| "O", "O", "O", "O", "O", # Words 0-4: Outside entities |
| "B-company_name", "I-company_name", # Words 5-6: Company name |
| "O", "O", ..., # Words 7-19: Outside |
| "B-invoice_date", "I-invoice_date", "I-invoice_date" # Words 20-22 |
| ] |
| } |
| ``` |
|
|
| **BIO Tagging:** |
| - `B-entity`: Beginning of entity |
| - `I-entity`: Inside entity (continuation) |
| - `O`: Outside any entity |
|
|
| --- |
|
|
| #### **DLA (Document Layout Analysis) Format** |
|
|
| **Process:** |
| 1. Load layout element definitions from stage 06 |
| 2. Load visual element definitions from stage 08 |
| 3. Validate labels (must be in `valid_labels`) |
| 4. Check spatial constraints (no containment, minimal overlap) |
| 5. Merge visual elements into layout annotations |
| 6. Normalize bboxes to [0, 1] |
| 7. Save verified GT |
|
|
| **Output Format:** |
| ```json |
| { |
| "layout_elements": [ |
| { |
| "id": "layout_0", |
| "label": "title", |
| "bbox": [0.095, 0.128, 0.905, 0.192] # Normalized [x0, y0, x1, y1] |
| }, |
| { |
| "id": "layout_1", |
| "label": "text", |
| "bbox": [0.095, 0.213, 0.905, 0.534] |
| }, |
| { |
| "id": "ve0", |
| "label": "figure", # Visual element mapped to layout label |
| "bbox": [0.714, 0.895, 0.905, 0.980] |
| } |
| ] |
| } |
| ``` |
|
|
| **Visual Element Merging:** |
| Visual elements from stage 08 are converted to layout labels: |
| - `stamp` → `figure` (or custom mapping)
| - `logo` → `figure`
| - `chart` → `figure`
| - `barcode` → `figure`
| - `photo` → `figure`
|
|
| **Spatial Validation:** |
| - **No containment:** One bbox fully inside another → Error
| - **Minimal overlap:** Overlap between bboxes must stay below 5% of the smaller bbox's area; violations → Warning
| - **Valid labels:** All labels must be in `SynDatasetDefinition.valid_labels` |
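| 
| 
| The two geometric checks can be sketched as follows (names assumed; bboxes are `(x0, y0, x1, y1)` tuples):
| 
| ```python
| def contains(outer, inner):
|     """True when inner lies fully inside outer."""
|     return (outer[0] <= inner[0] and outer[1] <= inner[1]
|             and outer[2] >= inner[2] and outer[3] >= inner[3])
| 
| def overlap_ratio(a, b):
|     """Intersection area as a fraction of the smaller bbox's area."""
|     ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
|     iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
|     smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
|     return (ix * iy) / smaller if smaller > 0 else 0.0
| ```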
|
|
| --- |
|
|
| #### **CLS (Classification) Format** |
|
|
| **Process:** |
| 1. Load classification label from raw GT |
| 2. Validate label against expected classes |
| 3. Save verified GT |
|
|
| **Output Format:** |
| ```json |
| { |
| "document_class": "invoice", |
| "confidence": 1.0 |
| } |
| ``` |
|
|
| --- |
|
|
| **Fuzzy Matching Details:** |
|
|
| Uses Levenshtein distance (via the `fuzz` module of a fuzzy-matching library such as RapidFuzz) to handle OCR discrepancies:
| ```python |
| # Example: GT "INV-12345" vs OCR "INV-I2345" (OCR error) |
| similarity = fuzz.ratio("INV-12345", "INV-I2345") / 100.0 |
| # similarity = 0.89 (above 0.85 threshold) → Match!
| ``` |
|
|
| **Match Statistics Logged:** |
| - Total GT annotations |
| - Successfully matched annotations |
| - Failed matches (similarity < 0.85) |
| - Average similarity score |
|
|
| **Key Features:** |
| - **Fuzzy Matching:** Handles OCR errors gracefully |
| - **Task-Specific Formatting:** QA, KIE, DLA, CLS all handled differently |
| - **BIO Tagging:** Automatic generation for KIE |
| - **Visual Element Integration:** DLA merges visual elements as layout annotations |
| - **Spatial Validation:** Detects overlapping/contained layout elements |
| - **Similarity Tracking:** Logs match quality for analysis |
|
|
| **Error Handling:** |
| - Match failure (similarity < 0.85) → Logged, annotation skipped
| - Invalid labels (DLA) → Error, document marked invalid
| - Spatial violations (DLA) → Warning, elements flagged
| - Missing bboxes → Error, document marked invalid
|
|
| --- |
|
|
| ### Stage 18: Analyze |
|
|
| **File:** `pipeline_18_analyze.py` |
|
|
| **Purpose:** Generate comprehensive statistics, cost analysis, and error categorization for the entire dataset generation process. |
|
|
| **Key Functions:** |
| - `main()`: Main analysis orchestrator |
| - `calculate_api_costs()`: Compute LLM API costs from token usage |
|
|
| **Process:** |
| 1. Load all document logs from stages 01-17 |
| 2. Categorize documents: valid vs. invalid |
| 3. Calculate error distributions |
| 4. Compute API usage and costs |
| 5. Generate statistics (handwriting, visual elements, annotations) |
| 6. Save comprehensive dataset log |
|
|
| **Inputs:** |
| - All document logs from previous stages |
| - Batch results from stage 02 (for cost calculation) |
| - Message processing logs from stage 03 |
| - Prompt usage statistics |
|
|
| **Outputs:** |
| ``` |
| dataset_log.json # Comprehensive dataset statistics |
| logs/analysis/ # Analysis logs |
| ``` |
|
|
| **Dataset Log Structure:** |
| ```json |
| { |
| "metadata": { |
| "syndatadef_name": "docvqa_alpha=1.0", |
| "task": "qa", |
| "total_documents_requested": 1000, |
| "generation_date": "2026-02-07" |
| }, |
| |
| "prompting": { |
| "total_prompts": 100, |
| "total_batches": 10, |
| "llm_model": "claude-sonnet-4-20250514" |
| }, |
| |
| "total_cost_summary": { |
| "total_cost_usd": 123.45, |
| "input_tokens": 1500000, |
| "output_tokens": 800000, |
| "cached_tokens": 500000, |
| "cost_per_document": 0.12 |
| }, |
| |
| "valid_samples_stats": { |
| "total_valid": 847, |
| "total_invalid": 153, |
| "validity_rate": 0.847, |
| "avg_handwriting_regions_per_doc": 2.3, |
| "avg_visual_elements_per_doc": 1.5, |
| "avg_annotations_per_doc": 8.7, |
| "documents_with_handwriting": 654, |
| "documents_with_visual_elements": 512 |
| }, |
| |
| "valid_samples": [ |
| { |
| "doc_id": "doc_0001", |
| "seed_image": "docvqa_train_12345", |
| "has_handwriting": true, |
| "has_visual_elements": true, |
| "num_annotations": 10, |
| "num_words": 247, |
| "image_size": [1654, 2339] |
| } |
| ], |
| |
| "valid_samples_by_category": { |
| "invoice": 234, |
| "receipt": 198, |
| "form": 415 |
| }, |
| |
| "errors": { |
| "multipage_pdf": 12, |
| "missing_ocr_result": 5, |
| "failed_gt_verification": 38, |
| "rendering_timeout": 8, |
| "llm_parsing_error": 23, |
| "bbox_extraction_failed": 4, |
| "handwriting_generation_failed": 7, |
| "visual_element_generation_failed": 3, |
| "other": 53 |
| } |
| } |
| ``` |
|
|
| **Error Categories:** |
|
|
| | Error Category | Description | |
| |---------------|-------------| |
| | `multipage_pdf` | PDF rendered with multiple pages (invalid) | |
| | `missing_ocr_result` | OCR failed or returned empty result | |
| | `failed_gt_verification` | GT matching/validation failed in stage 17 | |
| | `rendering_timeout` | PDF rendering exceeded timeout | |
| | `llm_parsing_error` | Failed to extract HTML/GT from LLM response | |
| | `bbox_extraction_failed` | PyMuPDF failed to extract bboxes | |
| | `handwriting_generation_failed` | Diffusion model failed | |
| | `visual_element_generation_failed` | VE generation/selection failed | |
| | `other` | Miscellaneous errors | |
|
|
| **Cost Calculation:** |
|
|
| **Claude API Pricing (example):** |
| ```python |
| costs = { |
| "input": input_tokens * 0.003 / 1000, # $3 per 1M tokens |
| "output": output_tokens * 0.015 / 1000, # $15 per 1M tokens |
| "cached": cached_tokens * 0.0003 / 1000 # $0.30 per 1M tokens (prompt caching) |
| } |
| total_cost = costs["input"] + costs["output"] + costs["cached"] |
| ``` |
|
|
| **Statistics Computed:** |
| - **Validity Rate:** Percentage of documents passing all stages |
| - **Handwriting Stats:** Documents with handwriting, avg regions per doc |
| - **Visual Element Stats:** Documents with VEs, avg elements per doc |
| - **Annotation Stats:** Avg QA pairs/KIE entities/DLA elements per doc |
| - **Token Usage:** Input/output/cached token totals |
| - **Cost Metrics:** Total cost, cost per document, cost per valid document |
|
|
| **Key Features:** |
| - **Comprehensive Error Tracking:** All error categories logged |
| - **Cost Transparency:** Token-level cost breakdown |
| - **Quality Metrics:** Validity rate, avg annotations, etc. |
| - **Category Breakdown:** Valid samples grouped by document type |
| - **Per-Document Tracking:** Each valid document's metadata saved |
|
|
| **Usage:** |
| This log is essential for: |
| - Understanding dataset quality |
| - Optimizing pipeline (identify bottlenecks) |
| - Cost estimation for future runs |
| - Debugging (error distribution analysis) |
|
|
| --- |
|
|
| ### Stage 19: Create Debug Data |
|
|
| **File:** `pipeline_19_create_debug_data.py` |
|
|
| **Purpose:** Generate comprehensive debug visualizations for manual inspection and quality assurance. |
|
|
| **Key Functions:** |
| - `main()`: Main debug generator |
| - `visualize_visual_element_bboxes()`: Overlay VE bboxes on PDFs |
| - `visualize_final_bboxes_on_images()`: Overlay OCR bboxes on images |
| - `visualize_pdf_bboxes()`: Overlay PDF bboxes |
|
|
| **Process:** |
| 1. Load all generated documents |
| 2. Create debug subdirectories |
| 3. For each document: |
| - Generate PDF bbox overlays |
| - Generate VE bbox overlays |
| - Generate handwriting insertion region overlays |
| - Generate final OCR bbox overlays on images |
| 4. Copy raw HTML with debug.js script for browser inspection |
|
|
| **Inputs:** |
| - All intermediate and final outputs from stages 01-18 |
|
|
| **Outputs:** |
| ``` |
| debug/ |
| pdf_bboxes/ # PDF word bboxes overlaid (stage 05) |
| {doc_id}.pdf |
| visual_element_bboxes/ # VE bboxes overlaid (stage 08) |
| {doc_id}.pdf |
| handwriting_insertion/ # Handwriting regions overlaid (stage 12) |
| {doc_id}.pdf |
| final_bboxes_on_images/ # OCR bboxes on final images (stage 15) |
| {doc_id}.png |
| html_with_debug/ # Raw HTML + debug.js |
| {doc_id}.html |
| debug.js # Browser-based inspection script |
| ``` |
|
|
| **Debug Visualizations:** |
|
|
| **1. PDF BBoxes (Stage 05):** |
| - Red rectangles: Word bounding boxes from PyMuPDF |
| - Annotated with word text |
| - Purpose: Verify PDF text extraction quality |
|
|
| **2. Visual Element BBoxes (Stage 08):** |
| - Blue rectangles: Visual element placeholder regions |
| - Annotated with element type (stamp, logo, etc.) |
| - Purpose: Verify VE extraction and positioning |
|
|
| **3. Handwriting Insertion Regions (Stage 12):** |
| - Green rectangles: Handwriting bbox regions |
| - Annotated with handwriting text |
| - Purpose: Verify handwriting placement accuracy |
|
|
| **4. Final BBoxes on Images (Stage 15):** |
| - Orange rectangles: OCR word bounding boxes |
| - Overlaid on final rendered images |
| - Purpose: Verify OCR accuracy and coverage |
|
|
| **Debug JavaScript (debug.js):** |
| ```javascript |
| // Browser-based inspection tool |
| // Features: |
| // - Highlight elements on hover |
| // - Show geometry data in console |
| // - Toggle element visibility |
| // - Measure element dimensions |
| ``` |
|
|
| **Key Features:** |
| - **Color-Coded Overlays:** Different colors for different stages |
| - **Text Annotations:** Bboxes labeled with content |
| - **Multi-Format:** Both PDF and PNG visualizations |
| - **Browser Inspection:** HTML with interactive debug script |
| - **Selective Generation:** Only enabled with `DEBUG_MODE=true` flag |
|
|
| **Configuration:** |
| - `DEBUG_MODE`: Enable/disable debug output (default: false) |
| - `DEBUG_BBOX_LINE_WIDTH`: Line thickness for overlays (default: 2) |
| - `DEBUG_BBOX_OPACITY`: Overlay transparency (default: 0.5) |
|
|
| **When to Use:** |
| - Visual quality assurance |
| - Debugging bbox extraction issues |
| - Verifying handwriting/VE insertion |
| - Identifying OCR problems |
| - Manual inspection of edge cases |
|
|
| **Performance Note:** |
| Debug generation adds ~20% processing time; disabled by default in production. |
|
|
| --- |
|
|
| ## API Implementation |
|
|
| The DocGenie API provides a FastAPI-based REST service for synchronous document generation, integrating pipeline stages 01-06. |
|
|
| ### Architecture Overview |
|
|
| ``` |
| ┌──────────────────────────────────────────────────────────────────┐
| │                         API ARCHITECTURE                         │
| └──────────────────────────────────────────────────────────────────┘
| 
| Client Request
|       │
|       ▼
| ┌─────────────────────┐
| │   FastAPI Server    │
| │      (main.py)      │
| └─────────────────────┘
|       │
|       ├──► Validate Request (schemas.py)
|       │
|       ├──► Download Seed Images (utils.py)
|       │
|       ├──► Build Prompt (utils.py)
|       │
|       ├──► Call Claude API (Synchronous)
|       │      └──► Claude Sonnet 4.5
|       │
|       ├──► Extract HTML & GT (pipeline_03 functions)
|       │
|       ├──► Render PDF (pipeline_04 functions)
|       │      └──► Playwright/Chromium
|       │
|       ├──► Extract BBoxes (pipeline_05 functions)
|       │      └──► PyMuPDF
|       │
|       └──► Return Response
|              └──► JSON with base64-encoded PDFs
| ``` |
|
|
| --- |
|
|
| ### Endpoints |
|
|
| #### **GET /** |
| Health check endpoint. |
|
|
| **Response:** |
| ```json |
| { |
| "message": "DocGenie API is running", |
| "version": "1.0", |
| "status": "healthy" |
| } |
| ``` |
|
|
| --- |
|
|
| #### **GET /health** |
| Detailed health status. |
|
|
| **Response:** |
| ```json |
| { |
| "status": "healthy", |
| "api_version": "1.0", |
| "playwright_available": true, |
| "llm_model": "claude-sonnet-4-5-20250929" |
| } |
| ``` |
|
|
| --- |
|
|
| #### **POST /generate** |
| Generate documents with ground truth annotations. |
|
|
| **Request Body** (`schemas.GenerateDocumentRequest`): |
| ```json |
| { |
| "seed_images": [ |
| "https://example.com/seed1.jpg", |
| "https://example.com/seed2.jpg" |
| ], |
| "prompt_params": { |
| "language": "English", |
| "doc_type": "business and administrative", |
| "gt_type": "Multiple questions and their answers", |
| "gt_format": "{\"question\": \"answer\", ...}", |
| "num_solutions": 2 |
| } |
| } |
| ``` |
|
|
| **Request Validation:** |
| - `seed_images`: 1-10 URLs (HTTPS only) |
| - `num_solutions`: 1-5 documents |
| - All prompt parameters required |
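|
|
| As a rough sketch, the same rules can be checked with the standard library alone. The service itself enforces them through Pydantic in `api/schemas.py`; the helper name below is illustrative, not part of the codebase: |
|
| ```python |
| from urllib.parse import urlparse |
|
| def validate_seed_images(urls): |
|     """Mirror the documented request rules: 1-10 seed image URLs, HTTPS only.""" |
|     if not 1 <= len(urls) <= 10: |
|         raise ValueError(f"expected 1-10 seed images, got {len(urls)}") |
|     for url in urls: |
|         if urlparse(url).scheme != "https": |
|             raise ValueError(f"seed image URL must use HTTPS: {url}") |
|     return urls |
| ``` |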
|
|
| **Response** (`schemas.GenerateDocumentResponse`): |
| ```json |
| { |
| "success": true, |
| "message": "Successfully generated 2 documents", |
| "documents": [ |
| { |
| "document_id": "doc_20260207_001", |
| "html": "<html>...</html>", |
| "css": "body { ... }", |
| "ground_truth": { |
| "What is the invoice number?": "INV-12345" |
| }, |
| "pdf_base64": "JVBERi0xLjQK...", |
| "bboxes": [ |
| { |
| "x0": 20.0, |
| "y0": 30.0, |
| "x1": 60.0, |
| "y1": 45.0, |
| "text": "Invoice" |
| } |
| ], |
| "page_width_mm": 210.0, |
| "page_height_mm": 297.0 |
| } |
| ], |
| "total_documents": 2 |
| } |
| ``` |
|
|
| --- |
|
|
| #### **POST /generate-files** |
| Generate documents and return as downloadable files. |
|
|
| **Request:** Same as `/generate` |
|
|
| **Response:** |
| - **Content-Type:** `application/zip` |
| - **File:** ZIP archive containing: |
| - `doc_001.pdf` |
| - `doc_001_gt.json` |
| - `doc_001_bboxes.json` |
| - `doc_002.pdf` |
| - ... |
|
|
| **File Structure:** |
| ``` |
| generated_documents.zip |
| ├── doc_001.pdf |
| ├── doc_001_gt.json |
| ├── doc_001_bboxes.json |
| ├── doc_001_metadata.json |
| ├── doc_002.pdf |
| └── ... |
| ``` |
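|
|
| On the client side, the archive can be inspected in memory before anything is written to disk. A small stdlib sketch, assuming `zip_bytes` holds the raw response body (e.g. `response.content` when using `requests`): |
|
| ```python |
| import io |
| import zipfile |
|
| def list_zip_entries(zip_bytes: bytes) -> list: |
|     # Open the /generate-files archive from memory and list its members |
|     with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf: |
|         return zf.namelist() |
| ``` |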
|
|
| --- |
|
|
| ### Request/Response Schemas |
|
|
| Defined in `api/schemas.py`: |
|
|
| #### **PromptParameters** |
| ```python |
| class PromptParameters(BaseModel): |
| language: str = "English" |
| doc_type: str = "business and administrative" |
| gt_type: str = "Multiple questions and their answers" |
| gt_format: str = '{"question": "answer", ...}' |
| num_solutions: int = Field(default=1, ge=1, le=5) |
| ``` |
|
|
| #### **GenerateDocumentRequest** |
| ```python |
| class GenerateDocumentRequest(BaseModel): |
| seed_images: List[HttpUrl] = Field(..., min_items=1, max_items=10) |
| prompt_params: PromptParameters |
| ``` |
|
|
| #### **BoundingBox** |
| ```python |
| class BoundingBox(BaseModel): |
| x0: float |
| y0: float |
| x1: float |
| y1: float |
| text: str |
| ``` |
|
|
| #### **DocumentResult** |
| ```python |
| class DocumentResult(BaseModel): |
| document_id: str |
| html: str |
| css: str |
| ground_truth: Optional[dict] |
| pdf_base64: str |
| bboxes: List[BoundingBox] |
| page_width_mm: float |
| page_height_mm: float |
| ``` |
|
|
| #### **GenerateDocumentResponse** |
| ```python |
| class GenerateDocumentResponse(BaseModel): |
| success: bool |
| message: str |
| documents: List[DocumentResult] |
| total_documents: int |
| ``` |
|
|
| --- |
|
|
| ### API Pipeline Flow |
|
|
| Detailed integration with pipeline stages: |
|
|
| ```python |
| # Simplified API flow (from api/main.py) |
| |
| @app.post("/generate", response_model=GenerateDocumentResponse) |
| async def generate_documents(request: GenerateDocumentRequest): |
| # 1. Download and encode seed images |
| seed_images_base64 = await download_and_encode_images(request.seed_images) |
| |
| # 2. Build prompt from template |
| prompt = build_prompt_from_template( |
| seed_images=seed_images_base64, |
| params=request.prompt_params |
| ) |
| |
| # 3. Call Claude API (synchronous, not batched) |
| llm_response = await call_claude_api( |
| prompt=prompt, |
| model="claude-sonnet-4-5-20250929" |
| ) |
| |
| # 4. Extract HTML and GT (from pipeline_03) |
| documents_html = extract_html_from_message(llm_response) |
| documents_gt = extract_gt_from_html(documents_html) |
| |
| # 5. Validate HTML (from utils.py) |
| validate_html_structure(documents_html) |
| |
| # 6. Render PDFs (from pipeline_04) |
| pdfs = await render_pdfs_with_playwright(documents_html) |
| |
| # 7. Validate PDFs (from utils.py) |
| validate_pdf_pages(pdfs) |
| |
| # 8. Extract bboxes (from pipeline_05) |
| bboxes = extract_bboxes_from_pdfs(pdfs) |
| |
| # 9. Validate bboxes (from utils.py) |
| validate_bbox_completeness(bboxes) |
| |
| # 10. Encode PDFs to base64 |
| pdfs_base64 = encode_pdfs_to_base64(pdfs) |
| |
| # 11. Build response |
| return GenerateDocumentResponse( |
| success=True, |
| documents=[...], |
| total_documents=len(documents_html) |
| ) |
| ``` |
|
|
| --- |
|
|
| ### Integration with Pipeline Functions |
|
|
| **Reused from Pipeline:** |
| - `extract_html_from_message()` from `pipeline_03` |
| - `extract_gt_from_html()` from `pipeline_03` |
| - `render_pdf_with_playwright()` from `pipeline_04` |
| - `extract_bboxes_from_pdf()` from `pipeline_05` |
| - Various utilities from `docgenie.generation.utils` |
|
|
| **API-Specific Functions (api/utils.py):** |
| - `download_seed_images()`: Fetch images from URLs |
| - `encode_images_to_base64()`: Convert images for API transmission |
| - `build_prompt_from_template()`: Template-based prompt construction |
| - `call_claude_api_sync()`: Synchronous Claude API call (non-batched) |
| - `encode_pdf_to_base64()`: PDF encoding for response |
| - `validate_html_structure()`: HTML validation |
| - `validate_pdf_pages()`: PDF page count/size validation |
| - `validate_bbox_completeness()`: Ensure bboxes extracted |
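|
|
| The base64 round-trip behind `encode_pdf_to_base64()` is straightforward; a minimal sketch (the decode helper is added here for illustration). Note that `"JVBERi0xLjQK"` in the response example above is exactly the base64 encoding of the `%PDF-1.4` header: |
|
| ```python |
| import base64 |
|
| def encode_pdf_to_base64(pdf_bytes: bytes) -> str: |
|     # PDFs are binary, so they travel inside the JSON response as base64 text |
|     return base64.b64encode(pdf_bytes).decode("ascii") |
|
| def decode_pdf_from_base64(pdf_base64: str) -> bytes: |
|     # Client-side inverse: recover the raw PDF bytes |
|     return base64.b64decode(pdf_base64) |
| ``` |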
|
|
| --- |
|
|
| ### Configuration |
|
|
| #### **Environment Variables (.env)** |
| ```bash |
| ANTHROPIC_API_KEY=sk-ant-... # Required for Claude API |
| LLM_MODEL=claude-sonnet-4-5-20250929 # Default model |
| API_PORT=8000 # Server port |
| DEBUG_MODE=false # Enable debug logging |
| ``` |
|
|
| #### **Prompt Templates** |
| Located in `data/prompt_templates/`: |
|
|
| **Template Structure:** |
| ``` |
| data/prompt_templates/<template_name>/ |
| ├── system_prompt.txt           # System message |
| ├── user_prompt_template.txt    # User message template |
| └── example_output.html         # Example for few-shot |
| ``` |
|
|
| **Placeholder Substitution:** |
| ```python |
| # In user_prompt_template.txt |
| """ |
| Please generate {num_solutions} {doc_type} documents in {language}. |
| |
| Ground truth format: {gt_format} |
| Ground truth type: {gt_type} |
| """ |
| |
| # Substituted with request.prompt_params |
| ``` |
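|
|
| The substitution step amounts to filling `{placeholder}` slots with the request's `prompt_params` values. A minimal sketch (the real implementation in `api/utils.py` may differ): |
|
| ```python |
| def build_prompt_from_template(template: str, params: dict) -> str: |
|     # str.format fills each {placeholder}; it raises KeyError if a |
|     # required placeholder has no corresponding parameter |
|     return template.format(**params) |
|
| template = "Please generate {num_solutions} {doc_type} documents in {language}." |
| prompt = build_prompt_from_template( |
|     template, |
|     {"num_solutions": 2, "doc_type": "invoices", "language": "English"}, |
| ) |
| ``` |
|
| Note that any literal braces inside a template (e.g. an inline JSON example) would need escaping as `{{`/`}}` for `str.format` to pass them through unchanged. |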
|
|
| #### **CORS Configuration** |
| ```python |
| # In api/main.py |
| app.add_middleware( |
| CORSMiddleware, |
| allow_origins=["*"], # Open for development |
| allow_credentials=True, |
| allow_methods=["*"], |
| allow_headers=["*"], |
| ) |
| ``` |
|
|
| --- |
|
|
| ### Authentication & Security |
|
|
| **Current State:** |
| - **No built-in authentication:** API is open (for development) |
| - **API Key Management:** Claude API key in `.env`, not passed in requests |
| - **CORS:** Open (`allow_origins=["*"]`) for development |
|
|
| **Production Recommendations:** |
| - Add API key authentication (e.g., Bearer tokens) |
| - Restrict CORS origins to known frontends |
| - Rate limiting (e.g., Redis-based) |
| - Input sanitization for HTML injection prevention |
| - HTTPS only (terminate SSL at reverse proxy) |
|
|
| --- |
|
|
| ### Performance Considerations |
|
|
| **Async Rendering:** |
| - Playwright rendering uses async/await |
| - Concurrent requests supported by FastAPI |
| - Semaphore control prevents resource exhaustion |
|
|
| **Image Size Limits:** |
| - Seed images compressed before API transmission |
| - Max dimension: 1024px (configurable) |
| - JPEG quality: 85% (configurable) |
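|
|
| The resize math behind the max-dimension limit can be sketched as follows (the actual code likely delegates to PIL, e.g. `Image.thumbnail`; this only shows the dimension calculation): |
|
| ```python |
| def fit_within(width: int, height: int, max_dim: int = 1024) -> tuple: |
|     # Downscale so the longer side is at most max_dim, preserving the |
|     # aspect ratio; images already within the limit are left untouched |
|     scale = min(1.0, max_dim / max(width, height)) |
|     return round(width * scale), round(height * scale) |
| ``` |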
|
|
| **Timeouts:** |
| - PDF render timeout: 60 seconds per document |
| - Claude API timeout: 120 seconds |
| - Total request timeout: 300 seconds (5 minutes) |
|
|
| **Retry Logic:** |
| - PDF rendering: Up to 3 attempts |
| - Claude API: Up to 2 retries on network errors |
| - No retry on validation failures |
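|
|
| That retry policy can be sketched as a small helper: transient network errors are retried, anything else propagates immediately. This is an illustrative shape, not the API's actual implementation: |
|
| ```python |
| import time |
|
| def with_retries(fn, max_attempts=3, delay=2.0, |
|                  retriable=(ConnectionError, TimeoutError)): |
|     # Retry only the listed transient errors; validation errors and |
|     # other exceptions propagate on the first occurrence |
|     for attempt in range(1, max_attempts + 1): |
|         try: |
|             return fn() |
|         except retriable: |
|             if attempt == max_attempts: |
|                 raise |
|             time.sleep(delay) |
| ``` |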
|
|
| **Concurrency:** |
| - FastAPI default: Multiple workers (configurable) |
| - Playwright: Semaphore-controlled (10 concurrent renders) |
|
|
| --- |
|
|
| ### Limitations |
|
|
| **Current Limitations:** |
| 1. **Single-Page Only:** Multi-page PDFs flagged as errors |
| 2. **Seed Image Limit:** Maximum 10 seed images per request |
| 3. **Document Limit:** Maximum 5 document variations (`num_solutions`) |
| 4. **No Handwriting/Visual Elements:** Stages 07-13 not integrated (API stops at stage 06) |
| 5. **Synchronous LLM:** No batching (higher cost per document) |
| 6. **No Dataset Export:** No `/export-dataset` endpoint for full pipeline runs |
|
|
| **Known Issues:** |
| - Large documents (>10 pages' worth of content) may time out |
| - Complex CSS (animations, 3D transforms) may not render correctly |
| - Some Unicode characters may not display in PDFs |
|
|
| --- |
|
|
| ### Future Integration Plan |
|
|
| From `api/PIPELINE_INTEGRATION.md`: |
|
|
| #### **Stage 3: Handwriting & Visual Elements (Stages 07-11)** |
|
|
| **New Request Parameters:** |
| ```python |
| class PromptParameters(BaseModel): |
| # ... existing ... |
| enable_handwriting: bool = False |
| handwriting_ratio: float = Field(default=0.2, ge=0.0, le=1.0) |
| enable_visual_elements: bool = False |
| visual_element_types: List[str] = ["stamp", "logo"] |
| ``` |
|
|
| **New Response Fields:** |
| ```python |
| class DocumentResult(BaseModel): |
| # ... existing ... |
| handwriting_regions: Optional[List[HandwritingRegion]] |
| visual_elements: Optional[List[VisualElement]] |
| ``` |
|
|
| **Impact:** |
| - Longer processing time (diffusion model: ~5s per handwriting region) |
| - Larger response size (additional images) |
|
|
| --- |
|
|
| #### **Stage 4: Image Finalization & OCR (Stages 12-15)** |
|
|
| **New Response Fields:** |
| ```python |
| class DocumentResult(BaseModel): |
| # ... existing ... |
| image_base64: str # Final rendered image (PNG) |
| ocr_text: str # Full OCR text |
| ocr_confidence: float # Average OCR confidence |
| ``` |
|
|
| **Impact:** |
| - OCR API costs (~$1.50 per 1000 images) |
| - Additional 2-3 seconds per document (OCR latency) |
|
|
| --- |
|
|
| #### **Stage 5: Dataset Packaging (Stages 16-19)** |
|
|
| **New Endpoint:** |
| ```python |
| @app.post("/export-dataset") |
| async def export_dataset(request: ExportDatasetRequest): |
| """ |
| Run full pipeline (stages 01-19) and return packaged dataset. |
| """ |
| # Run pipeline with syndatadef |
| # Return ZIP with images, GTs, statistics |
| ``` |
|
|
| **Request:** |
| ```python |
| class ExportDatasetRequest(BaseModel): |
| syndatadef_config: dict # Full SynDatasetDefinition |
| output_format: str = "huggingface" # "huggingface", "coco", "custom" |
| ``` |
|
|
| **Response:** |
| - ZIP archive with full dataset |
| - Includes `dataset_log.json` from stage 18 |
|
|
| --- |
|
|
| ### Example Usage |
|
|
| #### **Python Client (api/example_usage.py)** |
| |
| ```python |
| import base64 |
| import json |
| from pathlib import Path |
|
| import requests |
| |
| # API endpoint |
| API_URL = "http://localhost:8000/generate" |
| |
| # Prepare request |
| request_data = { |
| "seed_images": [ |
| "https://example.com/invoice_seed.jpg", |
| "https://example.com/receipt_seed.jpg" |
| ], |
| "prompt_params": { |
| "language": "English", |
| "doc_type": "invoices and receipts", |
| "gt_type": "Multiple questions and their answers", |
| "gt_format": '{"question": "answer", ...}', |
| "num_solutions": 3 |
| } |
| } |
| |
| # Call API |
| response = requests.post(API_URL, json=request_data) |
| result = response.json() |
| |
| # Process results |
| if result["success"]: |
| for doc in result["documents"]: |
| doc_id = doc["document_id"] |
| |
| # Save PDF |
| pdf_data = base64.b64decode(doc["pdf_base64"]) |
| Path(f"{doc_id}.pdf").write_bytes(pdf_data) |
| |
| # Save GT |
| Path(f"{doc_id}_gt.json").write_text( |
| json.dumps(doc["ground_truth"], indent=2) |
| ) |
| |
| # Save HTML |
| Path(f"{doc_id}.html").write_text(doc["html"]) |
| |
| print(f"Saved: {doc_id}") |
| print(f" - BBoxes: {len(doc['bboxes'])}") |
| print(f" - GT Annotations: {len(doc['ground_truth'])}") |
| ``` |
| |
| #### **cURL Example** |
|
|
| ```bash |
| curl -X POST http://localhost:8000/generate \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "seed_images": [ |
| "https://example.com/seed.jpg" |
| ], |
| "prompt_params": { |
| "language": "English", |
| "doc_type": "business documents", |
| "gt_type": "Questions and answers", |
| "gt_format": "{\"question\": \"answer\"}", |
| "num_solutions": 1 |
| } |
| }' |
| ``` |
|
|
| #### **JavaScript/TypeScript Client** |
|
|
| ```typescript |
| const response = await fetch('http://localhost:8000/generate', { |
| method: 'POST', |
| headers: { 'Content-Type': 'application/json' }, |
| body: JSON.stringify({ |
| seed_images: ['https://example.com/seed.jpg'], |
| prompt_params: { |
| language: 'English', |
| doc_type: 'invoices', |
| gt_type: 'Questions and answers', |
| gt_format: '{"question": "answer"}', |
| num_solutions: 2 |
| } |
| }) |
| }); |
| |
| const result = await response.json(); |
| |
| // Decode PDF |
| const pdfBlob = new Blob( |
| [Uint8Array.from(atob(result.documents[0].pdf_base64), c => c.charCodeAt(0))], |
| { type: 'application/pdf' } |
| ); |
| |
| // Download PDF |
| const url = URL.createObjectURL(pdfBlob); |
| const a = document.createElement('a'); |
| a.href = url; |
| a.download = `${result.documents[0].document_id}.pdf`; |
| a.click(); |
| ``` |
|
|
| --- |
|
|
| ### Testing |
|
|
| **Test File:** `api/test_api.py` |
|
|
| ```python |
| import pytest |
| from fastapi.testclient import TestClient |
| from api.main import app |
| |
| client = TestClient(app) |
| |
| def test_health_check(): |
| response = client.get("/health") |
| assert response.status_code == 200 |
| assert response.json()["status"] == "healthy" |
| |
| def test_generate_documents(): |
| request_data = { |
| "seed_images": ["https://example.com/seed.jpg"], |
| "prompt_params": { |
| "language": "English", |
| "doc_type": "invoices", |
| "gt_type": "Questions", |
| "gt_format": "{\"q\": \"a\"}", |
| "num_solutions": 1 |
| } |
| } |
| response = client.post("/generate", json=request_data) |
| assert response.status_code == 200 |
| result = response.json() |
| assert result["success"] is True |
| assert len(result["documents"]) > 0 |
| |
| def test_invalid_request(): |
| request_data = { |
| "seed_images": [], # Invalid: empty |
| "prompt_params": {"num_solutions": 10} # Invalid: > 5 |
| } |
| response = client.post("/generate", json=request_data) |
| assert response.status_code == 422 # Validation error |
| ``` |
|
|
| **Run Tests:** |
| ```bash |
| cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie |
| pytest api/test_api.py -v |
| ``` |
|
|
| --- |
|
|
| ### Deployment |
|
|
| **Start Server:** |
| ```bash |
| cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie/api |
| chmod +x start.sh |
| ./start.sh |
| ``` |
|
|
| **start.sh Contents:** |
| ```bash |
| #!/bin/bash |
| export $(cat .env | xargs) |
| uvicorn main:app --host 0.0.0.0 --port 8000 --reload |
| ``` |
|
|
| **Production Deployment:** |
| ```bash |
| # With Gunicorn (multi-worker) |
| gunicorn main:app \ |
| --workers 4 \ |
| --worker-class uvicorn.workers.UvicornWorker \ |
| --bind 0.0.0.0:8000 \ |
| --timeout 300 |
| ``` |
|
|
| **Docker Deployment:** |
| ```dockerfile |
| FROM python:3.10-slim |
| |
| WORKDIR /app |
| COPY requirements.txt . |
| RUN pip install --no-cache-dir -r requirements.txt |
| |
| # Install Playwright browsers |
| RUN playwright install chromium |
| RUN playwright install-deps |
| |
| COPY . . |
| |
| CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] |
| ``` |
|
|
| --- |
|
|
| ## Core Models & Utilities |
|
|
| ### SynDatasetDefinition Model |
|
|
| **File:** `docgenie/generation/models/_syndatadef.py` |
|
|
| **Purpose:** Configuration object for synthetic dataset generation, loaded from YAML files. |
|
|
| **Key Attributes:** |
| ```python |
| @dataclass |
| class SynDatasetDefinition: |
| # Dataset metadata |
| name: str # "docvqa_alpha=1.0" |
| task: TaskType # TaskType.QA, .KIE, .DLA, .CLS |
| base_dataset: str # "docvqa", "cord", etc. |
| |
| # LLM configuration |
| llm_model: str # "claude-sonnet-4-20250514" |
| prompt_template: str # "DocGenie" |
| num_solutions: int # Documents per prompt |
| |
| # Prompting parameters |
| language: str # "English", "German", etc. |
| doc_type: str # "business and administrative" |
| gt_type: str # Task-specific GT description |
| gt_format: str # Expected GT format |
| |
| # Dataset parameters |
| num_samples: int # Total documents to generate |
| alpha: float # Clustering diversity parameter |
| |
| # Seed selection |
| seeds_per_cluster: int # Seeds sampled per cluster |
| clustering_method: str # "kmeans", "agglomerative" |
| |
| # Task-specific |
| valid_labels: Optional[List[str]] # For DLA: ["title", "text", ...] |
| |
| # Output paths |
| output_dir: Path # Root output directory |
| ``` |
|
|
| **Key Methods:** |
| ```python |
| def get_file_structure(self) -> FileStructure: |
| """Returns FileStructure manager for output directories.""" |
| |
| def build_prompt(self, seed_images: List[str]) -> str: |
| """Builds prompt from template with parameter substitution.""" |
| |
| def iter_document_logs(self) -> Iterator[DocumentLog]: |
| """Iterates over all document logs.""" |
| |
| def update_document_status(self, doc_id: str, status: Status): |
| """Updates document status in log.""" |
| |
| @classmethod |
| def from_yaml(cls, yaml_path: Path) -> "SynDatasetDefinition": |
| """Load configuration from YAML file.""" |
| ``` |
|
|
| **Example YAML:** |
| ```yaml |
| name: "docvqa_alpha=1.0" |
| task: "qa" |
| base_dataset: "docvqa" |
| |
| llm_model: "claude-sonnet-4-20250514" |
| prompt_template: "DocGenie" |
| num_solutions: 1 |
| |
| language: "English" |
| doc_type: "business and administrative documents" |
| gt_type: "Multiple questions and their answers" |
| gt_format: '{"question": "answer", ...}' |
| |
| num_samples: 1000 |
| alpha: 1.0 |
| seeds_per_cluster: 10 |
| clustering_method: "kmeans" |
| |
| output_dir: "data/datasets/synthesized_datasets/docvqa_alpha=1.0" |
| ``` |
|
|
| --- |
|
|
| ### PipelineParameters Model |
|
|
| **File:** `docgenie/generation/models/_pipeline.py` |
|
|
| **Purpose:** Runtime parameters for pipeline execution. |
|
|
| **Attributes:** |
| ```python |
| @dataclass |
| class PipelineParameters: |
| # Execution control |
| start_stage: int = 1 # First stage to execute |
| end_stage: int = 19 # Last stage to execute |
| skip_existing: bool = True # Skip documents with existing outputs |
| |
| # Parallelization |
| max_workers: int = 10 # Concurrent processing |
| chromium_concurrency: int = 10 # Parallel PDF renders |
| |
| # Debug mode |
| debug_mode: bool = False # Enable debug visualizations |
| |
| # Retry configuration |
| max_retries: int = 3 # Retry attempts |
| retry_delay: float = 2.0 # Seconds between retries |
| |
| # Timeouts |
| pdf_render_timeout: int = 60 # Seconds |
| ocr_timeout: int = 30 # Seconds |
| ``` |
|
|
| --- |
|
|
| ### FileStructure Model |
|
|
| **File:** `docgenie/generation/models/_file.py` |
|
|
| **Purpose:** Manages directory structure for generated data. |
|
|
| **Key Properties:** |
| ```python |
| @dataclass |
| class FileStructure: |
| root: Path # Root output directory |
| |
| # Directory properties (all return Path) |
| seeds_directory: Path |
| prompt_batches_directory: Path |
| message_results_directory: Path |
| raw_html_directory: Path |
| raw_annotations_directory: Path |
| geometries_directory: Path |
| pdf_initial_directory: Path |
| render_html_directory: Path |
| pdf_word_bboxes_directory: Path |
| pdf_char_bboxes_directory: Path |
| layout_element_definitions_directory: Path |
| handwriting_definitions_directory: Path |
| visual_element_definitions_directory: Path |
| handwriting_images_directory: Path |
| visual_element_images_directory: Path |
| pdf_without_handwriting_placeholder_directory: Path |
| pdf_with_handwriting_directory: Path |
| pdf_final_directory: Path |
| images_directory: Path |
| final_word_bboxes_directory: Path |
| final_segment_bboxes_directory: Path |
| normalized_word_bboxes_directory: Path |
| normalized_segment_bboxes_directory: Path |
| verified_gt_directory: Path |
| |
| # Debug subdirectories |
| debug_directory: Path |
| debug_pdf_bboxes_directory: Path |
| debug_visual_element_bboxes_directory: Path |
| debug_handwriting_insertion_directory: Path |
| debug_final_bboxes_on_images_directory: Path |
| debug_html_with_debug_directory: Path |
| ``` |
|
|
| **Key Methods:** |
| ```python |
| def create_all_directories(self): |
| """Create all output directories.""" |
| |
| def get_document_path(self, doc_id: str, stage: str) -> Path: |
| """Get path for specific document and stage.""" |
| ``` |
|
|
| --- |
|
|
| ### DocumentLog Model |
|
|
| **File:** `docgenie/generation/models/_log.py` |
|
|
| **Purpose:** Document-level metadata and status tracking. |
|
|
| **Attributes:** |
| ```python |
| @dataclass |
| class DocumentLog: |
| doc_id: str # Unique document ID |
| seed_image_id: str # Source seed image |
| prompt_call_id: str # Prompt batch/call ID |
| |
| status: Status # VALID, INVALID, PROCESSING |
| |
| # Stage completion flags |
| has_raw_html: bool = False |
| has_raw_gt: bool = False |
| has_pdf_initial: bool = False |
| has_geometries: bool = False |
| has_bboxes: bool = False |
| has_handwriting: bool = False |
| has_visual_elements: bool = False |
| has_final_image: bool = False |
| has_ocr_result: bool = False |
| has_verified_gt: bool = False |
| |
| # Statistics |
| num_words: int = 0 |
| num_annotations: int = 0 |
| num_handwriting_regions: int = 0 |
| num_visual_elements: int = 0 |
| |
| # Error tracking |
| error_stage: Optional[str] = None |
| error_message: Optional[str] = None |
| error_category: Optional[str] = None |
| |
| # Timestamps |
| created_at: datetime |
| updated_at: datetime |
| ``` |
|
|
| **Key Methods:** |
| ```python |
| def mark_stage_complete(self, stage: str): |
| """Mark pipeline stage as complete.""" |
| |
| def mark_error(self, stage: str, error_msg: str, category: str): |
| """Record error and mark document as invalid.""" |
| |
| def is_valid(self) -> bool: |
| """Check if document passed all stages.""" |
| ``` |
|
|
| --- |
|
|
| ### BBox Model |
|
|
| **File:** `docgenie/generation/models/_bbox.py` |
|
|
| **Purpose:** Bounding box representation with text content. |
|
|
| **Attributes:** |
| ```python |
| @dataclass |
| class BBox: |
| rect: Rect # {x0, y0, x1, y1} |
| text: str # Text content |
| metadata: Optional[dict] = None # Additional data |
| ``` |
|
|
| **Key Methods:** |
| ```python |
| @property |
| def width(self) -> float: |
| return self.rect["x1"] - self.rect["x0"] |
| |
| @property |
| def height(self) -> float: |
| return self.rect["y1"] - self.rect["y0"] |
| |
| def normalize(self, image_width: float, image_height: float) -> "BBox": |
| """Convert to normalized [0, 1] coordinates.""" |
| |
| def unnormalize(self, image_width: float, image_height: float) -> "BBox": |
| """Convert from normalized to pixel coordinates.""" |
| |
| def to_dict(self) -> dict: |
| """Serialize to dictionary.""" |
| |
| @classmethod |
| def from_dict(cls, data: dict) -> "BBox": |
| """Deserialize from dictionary.""" |
| ``` |
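|
|
| The `normalize()` property pair above can be illustrated with a trimmed-down, self-contained sketch (the real `BBox` keeps `metadata` and more methods; this only shows the coordinate mapping): |
|
| ```python |
| from dataclasses import dataclass |
|
| @dataclass |
| class BBox: |
|     rect: dict  # {"x0", "y0", "x1", "y1"} |
|     text: str |
|
|     def normalize(self, image_width: float, image_height: float) -> "BBox": |
|         # Map pixel coordinates into [0, 1] relative to the image size |
|         r = self.rect |
|         return BBox( |
|             rect={ |
|                 "x0": r["x0"] / image_width, |
|                 "y0": r["y0"] / image_height, |
|                 "x1": r["x1"] / image_width, |
|                 "y1": r["y1"] / image_height, |
|             }, |
|             text=self.text, |
|         ) |
| ``` |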
|
|
| --- |
|
|
| ### LayoutBBox Model |
|
|
| **File:** `docgenie/generation/models/_bbox.py` |
|
|
| **Purpose:** Layout element bounding box (for DLA tasks). |
|
|
| **Attributes:** |
| ```python |
| @dataclass |
| class LayoutBBox: |
| label: str # "title", "text", "table", etc. |
| rect: Rect # {x0, y0, x1, y1} |
| metadata: Optional[dict] = None |
| ``` |
|
|
| **Key Methods:** |
| ```python |
| def normalize(self, image_width: float, image_height: float) -> "LayoutBBox": |
| """Normalize coordinates.""" |
| |
| def contains(self, other: "LayoutBBox") -> bool: |
| """Check if this bbox fully contains another.""" |
| |
| def overlaps(self, other: "LayoutBBox") -> bool: |
| """Check if this bbox overlaps with another.""" |
| |
| def overlap_area(self, other: "LayoutBBox") -> float: |
| """Calculate overlap area with another bbox.""" |
| ``` |
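|
|
| The geometry behind `contains`/`overlaps`/`overlap_area` is standard axis-aligned rectangle math; a sketch on plain rect dicts (the real methods operate on `LayoutBBox` instances): |
|
| ```python |
| def overlap_area(a: dict, b: dict) -> float: |
|     # a, b: {"x0", "y0", "x1", "y1"}; area of the intersection |
|     # rectangle, 0.0 when the boxes are disjoint |
|     w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"]) |
|     h = min(a["y1"], b["y1"]) - max(a["y0"], b["y0"]) |
|     return max(0.0, w) * max(0.0, h) |
|
| def overlaps(a: dict, b: dict) -> bool: |
|     return overlap_area(a, b) > 0.0 |
|
| def contains(a: dict, b: dict) -> bool: |
|     # True when a fully encloses b |
|     return (a["x0"] <= b["x0"] and a["y0"] <= b["y0"] |
|             and a["x1"] >= b["x1"] and a["y1"] >= b["y1"]) |
| ``` |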
|
|
| --- |
|
|
| ### Utility Modules |
|
|
| #### **BBox Utilities (utils/bboxes.py)** |
|
|
| ```python |
| def load_bboxes_from_file(file_path: Path) -> List[BBox]: |
| """Load bboxes from JSON file.""" |
| |
| def save_bboxes_to_file(bboxes: List[BBox], file_path: Path): |
| """Save bboxes to JSON file.""" |
| |
| def visualize_bboxes_on_pdf( |
| pdf_path: Path, |
| bboxes: List[BBox], |
| output_path: Path, |
| color: str = "red" |
| ): |
| """Draw bbox overlays on PDF.""" |
| |
| def visualize_bboxes_on_image( |
| image_path: Path, |
| bboxes: List[BBox], |
| output_path: Path, |
| color: str = "orange" |
| ): |
| """Draw bbox overlays on image.""" |
| |
| def check_bbox_containment(bbox1: BBox, bbox2: BBox) -> bool: |
| """Check if bbox1 contains bbox2.""" |
| ``` |
|
|
| --- |
|
|
| #### **Geometry Utilities (utils/geos.py)** |
|
|
| ```python |
| def filter_layout_elements(geometries: dict) -> List[dict]: |
| """Extract elements with data-label attribute.""" |
| |
| def filter_handwriting_elements(geometries: dict) -> List[dict]: |
| """Extract elements with data-handwriting attribute.""" |
| |
| def filter_visual_elements(geometries: dict) -> List[dict]: |
| """Extract elements with data-visual-element attribute.""" |
| |
| def filter_by_css_class(geometries: dict, class_name: str) -> List[dict]: |
| """Extract elements with specific CSS class.""" |
| ``` |
|
|
| --- |
|
|
| #### **Serialization Utilities (utils/serialization.py)** |
|
|
| ```python |
| def encode_image_to_base64(image_path: Path) -> str: |
| """Encode image file to base64 string.""" |
| |
| def decode_base64_to_image(base64_str: str, output_path: Path): |
| """Decode base64 string and save as image.""" |
| |
| def serialize_dataclass(obj: Any) -> dict: |
| """Serialize dataclass to dictionary.""" |
| |
| def deserialize_dataclass(data: dict, cls: Type[T]) -> T: |
| """Deserialize dictionary to dataclass.""" |
| ``` |
|
|
| --- |
|
|
| ## Configuration & Constants |
|
|
| ### Key Constants (generation/constants.py) |
|
|
| ```python |
| # Document processing |
| HTML_PARSER = "html.parser" # BeautifulSoup parser |
| PDF_POINT_SCALING = 72 / 96 # CSS DPI to PDF DPI |
| |
| # Bounding boxes |
| BBOX_OVERLAP_THRESHOLD = 0.05 # 5% overlap tolerance |
| SPATIAL_MATCH_THRESHOLD = 10.0 # Pixel tolerance for matching |
| |
| # PDF rendering |
| CHROMIUM_CONCURRENCY = 10 # Parallel renders |
| PER_PDF_RENDER_TIMEOUT = 60 # Seconds |
| PER_PDF_RENDER_MAX_RETRIES = 3 # Retry attempts |
| |
| # Handwriting |
| MAX_HANDWRITING_CHARS = 7 # Max chars per diffusion generation |
| HANDWRITING_HEIGHT_PX = 40 # Image height |
| HANDWRITING_PADDING_PX = 0 # Horizontal padding |
| DIFFUSION_NUM_INFERENCE_STEPS = 50 # Generation quality |
| HANDWRITING_IMAGE_UPSCALE_FACTOR = 3 # Insertion scaling |
| MAX_HANDWRITING_RAND_X = 2 # Random X offset (pixels) |
| MAX_HANDWRITING_RAND_Y = 1 # Random Y offset (pixels) |
| |
| # Visual elements |
| VISUAL_ELEMENT_UPSCALE_FACTOR = 3 # Insertion scaling |
| STAMP_BORDER_WIDTH = 2 # Stamp border thickness |
| BARCODE_DPI = 300 # Barcode image quality |
| |
| # OCR |
| IMAGE_DPI = 200 # Final image DPI |
| OCR_CONFIDENCE_THRESHOLD = 0.8 # Min confidence |
| |
| # Ground truth |
| FUZZY_MATCH_THRESHOLD = 0.85 # Levenshtein similarity cutoff |
| |
| # Handwriting styles |
| HANDWRITING_STYLES = [ |
| "writer_0", "writer_1", "writer_2", ..., "writer_99" |
| ] |
| |
| # Visual element types |
| VISUAL_ELEMENT_TYPES = [ |
| "stamp", "logo", "barcode", "chart", "photo" |
| ] |
| |
| # Visual element type mapping (LLM output → standard) |
| VISUAL_ELEMENT_TYPE_MAPPING = { |
| "stamp": "stamp", |
| "company_stamp": "stamp", |
| "approval_stamp": "stamp", |
| "logo": "logo", |
| "company_logo": "logo", |
| "brand_logo": "logo", |
| "barcode": "barcode", |
| "code128": "barcode", |
| "chart": "chart", |
| "graph": "chart", |
| "figure": "chart", |
| "photo": "photo", |
| "image": "photo", |
| "picture": "photo" |
| } |
| ``` |
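|
|
| Applying `VISUAL_ELEMENT_TYPE_MAPPING` is a dictionary lookup after normalizing the LLM's label; the fallback default below is an assumption for illustration, not documented behavior: |
|
| ```python |
| # Subset of the mapping from constants.py shown above |
| VISUAL_ELEMENT_TYPE_MAPPING = { |
|     "stamp": "stamp", "company_stamp": "stamp", "approval_stamp": "stamp", |
|     "logo": "logo", "company_logo": "logo", "brand_logo": "logo", |
|     "barcode": "barcode", "code128": "barcode", |
|     "chart": "chart", "graph": "chart", "figure": "chart", |
|     "photo": "photo", "image": "photo", "picture": "photo", |
| } |
|
| def canonical_ve_type(llm_type: str, default: str = "photo") -> str: |
|     # Normalize case/whitespace before lookup; unknown labels fall back |
|     # to `default` (an assumption made for this sketch) |
|     return VISUAL_ELEMENT_TYPE_MAPPING.get(llm_type.strip().lower(), default) |
| ``` |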
|
|
| --- |
|
|
| ### Environment Variables |
|
|
| **Required:** |
| ```bash |
| ANTHROPIC_API_KEY=sk-ant-... # Claude API key |
| MICROSOFT_AZURE_OCR_KEY=... # Azure OCR key |
| MICROSOFT_AZURE_OCR_ENDPOINT=... # Azure OCR endpoint |
| ``` |
|
|
| **Optional:** |
| ```bash |
| LLM_MODEL=claude-sonnet-4-20250514 # Override model |
| DEBUG_MODE=false # Enable debug output |
| LOG_LEVEL=INFO # Logging verbosity |
| CHROMIUM_CONCURRENCY=10 # Override concurrency |
| ``` |
|
|
| --- |
|
|
| ## Error Handling & Debugging |
|
|
| ### Error Categories |
|
|
| **Comprehensive error tracking in stage 18:** |
|
|
| | Category | Description | Resolution | |
| |----------|-------------|------------| |
| | `multipage_pdf` | PDF rendered with >1 page | Check HTML content size, CSS page breaks | |
| | `missing_ocr_result` | OCR API failed or empty | Check Azure credentials, retry | |
| | `failed_gt_verification` | GT text not found in OCR | Review fuzzy match threshold, inspect HTML | |
| | `rendering_timeout` | PDF render exceeded timeout | Increase timeout, simplify HTML | |
| | `llm_parsing_error` | Failed to extract HTML/GT | Review LLM response format, update regex | |
| | `bbox_extraction_failed` | PyMuPDF extraction error | Check PDF validity, inspect fonts | |
| | `handwriting_generation_failed` | Diffusion model error | Check model checkpoint, GPU availability | |
| | `visual_element_generation_failed` | VE creation error | Check prefab directories, image validity | |
|
|
| --- |
|
|
| ### Debug Visualizations (Stage 19) |
|
|
| **Generated debug outputs:** |
|
|
| 1. **PDF BBoxes:** Verify PyMuPDF extraction quality |
| 2. **Visual Element BBoxes:** Verify VE extraction and positioning |
| 3. **Handwriting Insertion:** Verify handwriting placement |
| 4. **Final BBoxes on Images:** Verify OCR accuracy |
|
|
| **Enable debug mode:** |
| ```python |
| # In pipeline execution |
| pipeline_params = PipelineParameters( |
| debug_mode=True, |
| # ... other params |
| ) |
| ``` |
|
|
| --- |
|
|
| ### Logging |
|
|
| **Log locations:** |
| ``` |
| data/datasets/synthesized_datasets/<dataset_name>/logs/ |
| ├── pipeline_01/ |
| │   └── seed_selection.log |
| ├── pipeline_02/ |
| │   └── prompting.log |
| ├── pipeline_03/ |
| │   └── message_processing/ |
| │       ├── batch_001.log |
| │       └── batch_002.log |
| ├── ... |
| └── pipeline_19/ |
|     └── debug_generation.log |
| ``` |
|
|
| **Log format:** |
| ``` |
| [2026-02-07 14:32:15] [INFO] [pipeline_04] Starting PDF rendering for doc_0001 |
| [2026-02-07 14:32:18] [INFO] [pipeline_04] Successfully rendered doc_0001 (3.2s) |
| [2026-02-07 14:32:18] [ERROR] [pipeline_04] Rendering failed for doc_0002: Timeout |
| ``` |
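|
|
| That line shape maps directly onto a `logging.Formatter`; a sketch of how a stage logger could be configured (the pipeline's actual logging setup may differ): |
|
| ```python |
| import logging |
|
| # Formatter reproducing the documented shape: |
| # [2026-02-07 14:32:15] [INFO] [pipeline_04] message |
| STAGE_LOG_FORMATTER = logging.Formatter( |
|     "[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s", |
|     datefmt="%Y-%m-%d %H:%M:%S", |
| ) |
|
| def make_stage_logger(stage: str) -> logging.Logger: |
|     # One logger per pipeline stage, named e.g. "pipeline_04" |
|     logger = logging.getLogger(stage) |
|     if not logger.handlers: |
|         handler = logging.StreamHandler() |
|         handler.setFormatter(STAGE_LOG_FORMATTER) |
|         logger.addHandler(handler) |
|     logger.setLevel(logging.INFO) |
|     return logger |
| ``` |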
|
|
| --- |
|
|
| ### Common Issues & Solutions |
|
|
| **Issue: Multi-page PDFs** |
| - **Cause:** HTML content exceeds page size |
| - **Solution:** Reduce content, check for long tables, use CSS `overflow: hidden` |
|
|
| **Issue: Handwriting not appearing** |
| - **Cause:** Character bboxes not available (stage 05) |
| - **Solution:** Use simpler fonts in HTML, ensure PyMuPDF can extract chars |
|
|
| **Issue: OCR missing text** |
| - **Cause:** Low image DPI, poor contrast |
| - **Solution:** Increase `IMAGE_DPI` (stage 14), adjust HTML styling |
|
|
| **Issue: GT verification failures** |
| - **Cause:** OCR discrepancies, fuzzy match threshold too high |
| - **Solution:** Lower `FUZZY_MATCH_THRESHOLD`, improve HTML text rendering |
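|
|
| To build intuition for the threshold, GT/OCR similarity can be scored with the standard library's `difflib`; DocGenie's actual matcher may differ, so the threshold value and scoring below are illustrative: |
|
|
| ```python |
| from difflib import SequenceMatcher |
| |
| FUZZY_MATCH_THRESHOLD = 0.9  # illustrative value |
| |
| def gt_matches_ocr(gt_text: str, ocr_text: str) -> bool: |
|     """Return True if the GT string is similar enough to the OCR string.""" |
|     ratio = SequenceMatcher(None, gt_text.lower(), ocr_text.lower()).ratio() |
|     return ratio >= FUZZY_MATCH_THRESHOLD |
| |
| print(gt_matches_ocr("Invoice #12345", "Invoice #12345"))  # True |
| # Common OCR confusions ('1' read as 'l') drop the ratio to ~0.86: |
| print(gt_matches_ocr("Invoice #12345", "lnvoice #l2345"))  # False at 0.9 |
| ``` |
|
|
| Lowering the threshold tolerates more OCR noise, at the cost of accepting looser matches. |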
|
|
| **Issue: Claude API timeout** |
| - **Cause:** Large seed images, complex prompts |
| - **Solution:** Compress seed images, simplify prompt template |
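|
|
| Seed images can be downscaled and re-encoded with Pillow before being attached to a request; the size and quality targets below are illustrative assumptions, not tuned DocGenie defaults: |
|
|
| ```python |
| from io import BytesIO |
| |
| from PIL import Image |
| |
| def compress_seed_image(image, max_side=1536, quality=80): |
|     """Downscale so the longer side is <= max_side, then re-encode as JPEG.""" |
|     scale = max_side / max(image.size) |
|     if scale < 1.0: |
|         image = image.resize( |
|             (int(image.width * scale), int(image.height * scale)), |
|             Image.LANCZOS, |
|         ) |
|     buf = BytesIO() |
|     image.convert("RGB").save(buf, format="JPEG", quality=quality) |
|     return buf.getvalue() |
| ``` |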
|
|
| --- |
|
|
| ## Usage Examples |
|
|
| ### Full Pipeline Execution |
|
|
| ```python |
| import importlib |
| import pkgutil |
| |
| import docgenie.generation |
| from docgenie.generation.models import SynDatasetDefinition, PipelineParameters |
| |
| # Load configuration |
| syndatadef = SynDatasetDefinition.from_yaml( |
|     "data/syn_dataset_definitions/docvqa_alpha=1.0.yaml" |
| ) |
| |
| # Configure pipeline |
| params = PipelineParameters( |
|     start_stage=1, |
|     end_stage=19, |
|     skip_existing=True, |
|     debug_mode=False, |
|     max_workers=10, |
| ) |
| |
| # Stage modules carry descriptive suffixes (e.g. pipeline_01_select_seeds), |
| # so resolve each stage number to its full module name by prefix -- |
| # importlib.import_module does not accept wildcards. |
| stage_modules = { |
|     name.split("_")[1]: name |
|     for _, name, _ in pkgutil.iter_modules(docgenie.generation.__path__) |
|     if name.startswith("pipeline_") |
| } |
| |
| # Execute pipeline stages |
| for stage in range(params.start_stage, params.end_stage + 1): |
|     print(f"Executing stage {stage:02d}...") |
| |
|     stage_module = importlib.import_module( |
|         f"docgenie.generation.{stage_modules[f'{stage:02d}']}" |
|     ) |
|     stage_module.main(syndatadef, params) |
| |
|     print(f"Stage {stage:02d} complete.") |
| |
| print("Pipeline execution complete!") |
| ``` |
|
|
| --- |
|
|
| ### API Usage |
|
|
| See the [API Implementation](#api-implementation) section for detailed examples. |
|
|
| --- |
|
|
| ### Custom Dataset Definition |
|
|
| ```yaml |
| # data/syn_dataset_definitions/custom_invoices.yaml |
| |
| name: "custom_invoices_v1" |
| task: "kie" |
| base_dataset: "cord" |
| |
| llm_model: "claude-sonnet-4-20250514" |
| prompt_template: "ClaudeRefined12" |
| num_solutions: 1 |
| |
| language: "English" |
| doc_type: "invoices and receipts" |
| gt_type: "Key-value pairs (entity extraction)" |
| gt_format: '{"entity_name": "entity_value", ...}' |
| |
| num_samples: 500 |
| alpha: 0.75 |
| seeds_per_cluster: 5 |
| clustering_method: "kmeans" |
| |
| # KIE-specific |
| valid_labels: null # Not used for KIE |
| |
| output_dir: "data/datasets/synthesized_datasets/custom_invoices_v1" |
| ``` |
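|
|
| Before running the pipeline, a definition like this can be sanity-checked with a plain YAML load (a sketch using PyYAML; the required-field list is an illustrative subset, not the full schema): |
|
|
| ```python |
| import yaml |
| |
| # Illustrative subset of fields every definition should carry |
| REQUIRED_FIELDS = {"name", "task", "llm_model", "num_samples", "output_dir"} |
| |
| def load_definition(path: str) -> dict: |
|     """Load a dataset definition YAML and check a few required fields.""" |
|     with open(path) as f: |
|         definition = yaml.safe_load(f) |
|     missing = REQUIRED_FIELDS - definition.keys() |
|     if missing: |
|         raise ValueError(f"Definition missing fields: {sorted(missing)}") |
|     return definition |
| ``` |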
|
|
| --- |
|
|
| ## Conclusion |
|
|
| This documentation provides a comprehensive overview of the DocGenie generation pipeline and API. For additional support or questions, refer to: |
|
|
| - **Pipeline Integration Guide:** `api/PIPELINE_INTEGRATION.md` |
| - **API README:** `api/README.md` |
| - **Source Code:** `docgenie/generation/` and `api/` |
|
|
| **Key Takeaways:** |
|
|
| 1. **19-Stage Pipeline:** Modular design from seed selection to GT verification |
| 2. **Multi-Task Support:** QA, KIE, DLA, CLS with task-specific handling |
| 3. **Realistic Documents:** LLM-generated content, diffusion handwriting, visual elements |
| 4. **Quality Assurance:** Comprehensive validation, OCR verification, error tracking |
| 5. **API Integration:** FastAPI service for synchronous document generation (stages 01-06) |
| 6. **Extensibility:** Modular code, clear interfaces, easy to extend |
|
|
| **Pipeline Strengths:** |
|
|
| - Task-agnostic core with task-specific adapters |
| - Extensive logging and error tracking |
| - Parallel processing where applicable |
| - Debug visualizations at every stage |
| - Integration with state-of-the-art models |
|
|
| **Future Directions:** |
|
|
| - Full API integration (stages 07-19) |
| - Additional document types (forms, legal, medical) |
| - Multi-language support expansion |
| - Enhanced visual element generation (charts, diagrams) |
| - Real-time generation optimization |
|
|