# DocGenie Generation Pipeline & API Documentation

**Version:** 1.0
**Last Updated:** February 7, 2026
**Purpose:** Comprehensive reference for the DocGenie synthetic document generation system

---

## Table of Contents

1. [Overview](#overview)
2. [Pipeline Architecture](#pipeline-architecture)
3. [Pipeline Stages (01-19)](#pipeline-stages-01-19)
4. [API Implementation](#api-implementation)
5. [Core Models & Utilities](#core-models--utilities)
6. [Configuration & Constants](#configuration--constants)
7. [Usage Examples](#usage-examples)
8. [Error Handling & Debugging](#error-handling--debugging)

---

## Overview

DocGenie is a 19-stage pipeline for generating synthetic document datasets with ground-truth annotations. It supports multiple document understanding tasks:

- **Document Question Answering (QA)**
- **Key Information Extraction (KIE)**
- **Document Layout Analysis (DLA)**
- **Document Classification (CLS)**

### Key Features

- **LLM-Powered Generation**: Uses Claude, Gemini, or open-source models to generate diverse document content
- **Realistic Handwriting**: Diffusion model-based handwriting synthesis with author-specific styles
- **Visual Element Integration**: Stamps, logos, barcodes, charts, and photos
- **Multi-Task Support**: Task-specific ground truth formatting and validation
- **Quality Assurance**: Comprehensive validation, OCR verification, and error tracking
- **Modular Design**: Each pipeline stage is independently executable with clear inputs/outputs

### Technology Stack

- **LLM APIs**: Claude (Anthropic), Gemini, DeepSeek, Qwen
- **PDF Rendering**: Playwright (Chromium), PyMuPDF
- **OCR**: Microsoft Azure OCR
- **Handwriting**: Custom diffusion model
- **Image Processing**: PIL, OpenCV
- **API Framework**: FastAPI
- **Data Processing**: Pandas, NumPy

---

## Pipeline Architecture

### High-Level Flow
```
┌───────────────────────────────────────┐
│    DOCGENIE GENERATION PIPELINE       │
└───────────────────────────────────────┘

┌──────────────────────┐
│  PHASE 1: SELECTION  │
└──────────────────────┘
        │
[01] Select Seeds ────────────▶ seeds.csv, clusters.csv
       (Cluster-based diverse seed selection)

┌──────────────────────┐
│  PHASE 2: LLM GEN    │
└──────────────────────┘
        │
[02] Prompt LLM ──────────────▶ batch_results/ (JSON)
 │     (Claude API batched calls)
 │
[03] Process Response ────────▶ raw_html/, raw_annotations/
       (Extract HTML & GT from responses)

┌──────────────────────┐
│  PHASE 3: RENDERING  │
└──────────────────────┘
        │
[04] Render PDF Initial ──────▶ pdf_initial/, geometries/
 │     (HTML→PDF with geometry extraction)
 │
[05] Extract BBoxes ──────────▶ pdf_word_bboxes/, pdf_char_bboxes/
 │     (PyMuPDF text extraction)
 │
[06] Extract Layout ──────────▶ layout_element_definitions/
       (DLA/KIE-specific annotations)

┌──────────────────────┐
│  PHASE 4: EXTRACTION │
└──────────────────────┘
        │
[07] Extract Handwriting ─────▶ handwriting_definitions/
 │     (Identify handwriting regions)
 │
[08] Extract Visual Elements ─▶ visual_element_definitions/
       (Stamp/logo/barcode placeholders)

┌──────────────────────┐
│  PHASE 5: GENERATION │
└──────────────────────┘
        │
[09] Create Handwriting ──────▶ handwriting_images/
 │     (Diffusion model generation)
 │
[10] Create Visual Elements ──▶ visual_element_images/
       (Generate/select stamps, logos, etc.)

┌──────────────────────┐
│ PHASE 6: COMPOSITION │
└──────────────────────┘
        │
[11] Render PDF (2nd Pass) ───▶ pdf_without_handwriting_placeholder/
 │     (Remove handwriting placeholders)
 │
[12] Insert Handwriting ──────▶ pdf_with_handwriting/
 │     (Overlay handwriting images)
 │
[13] Insert Visual Elements ──▶ pdf_final/
 │     (Overlay stamps, logos, etc.)
 │
[14] Render Image ────────────▶ images/
       (PDF→PNG conversion)

┌───────────────────────┐
│ PHASE 7: FINALIZATION │
└───────────────────────┘
        │
[15] Perform OCR ─────────────▶ final_word_bboxes/, final_segment_bboxes/
 │     (Microsoft OCR)
 │
[16] Normalize BBoxes ────────▶ normalized_word_bboxes/, normalized_segment_bboxes/
       (Pixel→[0,1] coordinates)

┌───────────────────────┐
│ PHASE 8: VALIDATION   │
└───────────────────────┘
        │
[17] GT Preparation ──────────▶ verified_gt/
 │     (Fuzzy matching, BIO tagging)
 │
[18] Analyze ─────────────────▶ dataset_log.json
 │     (Statistics, cost analysis)
 │
[19] Create Debug Data ───────▶ debug/ subdirectories
       (Visualizations for inspection)
```

### Data Flow Between Stages

```
Seed Images ──┐
              ├──▶ [02] ──▶ HTML + GT ──▶ [04] ──▶ PDF + Geometries
Prompt Params ┘                                          │
                 ┌───────────────────────────────────────┤
                 ▼                                       ▼
          [05] ──▶ BBoxes                        [11] ──▶ PDF (no HW)
                 │                                       │
                 ├──▶ [07] ──▶ HW Defs ──▶ [09] ──▶ HW Images
                 │                                       │
                 │                                [12] Insert HW
                 │                                       │
                 └──▶ [08] ──▶ VE Defs ──▶ [10] ──▶ VE Images
                                                         │
                                                  [13] Insert VE
                                                         │
                                                  [14] ──▶ Image
                                                         │
                                                  [15] ──▶ OCR BBoxes
                                                         │
                                                  [16] ──▶ Normalized
                                                         │
                                                  [17] ──▶ Verified GT
```

---

## Pipeline Stages (01-19)

### Stage 01: Select Seeds

**File:** `pipeline_01_select_seeds.py`

**Purpose:** Select diverse seed images from the base dataset using clustering, so the generated documents cover a wide variety of layouts and content.

**Key Functions:**
- `main()`: Orchestrates the seed selection process
- `downscale_and_compress_seeds()`: Prepares seed images for efficient API transmission
- `plot_class_distribution()`: Visualizes class balance in the selected seeds
- `visualize_cluster_histogram()`: Shows the distribution across clusters

**Process:**
1. Load embeddings from the base dataset
2. Perform clustering (KMeans or another algorithm)
3. Sample N seeds per cluster
4. Downscale and compress images (JPEG, max dimension)
5. Save the seed manifest and cluster assignments
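
The per-cluster sampling in step 3 can be sketched as follows; `sample_seeds_per_cluster` and its arguments are illustrative names, not the module's actual API.

```python
import random
from collections import defaultdict

def sample_seeds_per_cluster(cluster_assignments, n_per_cluster, seed=0):
    """Group document IDs by cluster and sample up to n_per_cluster from each.

    cluster_assignments: {doc_id: cluster_id}. Sorting makes the result
    reproducible for a fixed random seed.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for doc_id, cluster_id in cluster_assignments.items():
        clusters[cluster_id].append(doc_id)

    selected = []
    for cluster_id in sorted(clusters):
        members = sorted(clusters[cluster_id])
        k = min(n_per_cluster, len(members))
        selected.extend(rng.sample(members, k))
    return selected
```
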

**Inputs:**
- `SynDatasetDefinition` configuration
- Base dataset name (e.g., `docvqa`, `cord`, `publaynet`)
- Clustering parameters from constants

**Outputs:**
```
seeds.csv            # Selected seed document IDs per prompt call
clusters.csv         # Cluster assignments for all documents
seeds/ (directory)   # Preprocessed seed images (JPEG, compressed)
```

**Configuration Parameters:**
- `EMBEDDING_MODEL`: Specifies which embedding model was used
- `IMAGE_MAX_DIMENSION`: Max width/height for compression
- `JPEG_QUALITY`: Compression quality (0-100)

**Example Usage:**
```python
from docgenie.generation import pipeline_01_select_seeds

pipeline_01_select_seeds.main(
    syndatadef_path="data/syn_dataset_definitions/docvqa_alpha=1.0.yaml",
    base_dataset="docvqa"
)
```

---

### Stage 02: Prompt LLM

**File:** `pipeline_02_prompt_llm.py`

**Purpose:** Send batched prompts to LLM APIs (Claude, Gemini, DeepSeek, Qwen) with seed images to generate document HTML and ground truth.

**Key Functions:**
- `main()`: Main orchestrator for LLM prompting
- `create_batched_messages()`: Constructs API-compatible message batches
- `track_batch_completion()`: Polls the API for batch status
- Cost calculation utilities in `pipeline_01/cost.py`

**Process:**
1. Load seed images and encode them as base64
2. Build prompts from the template with parameter injection
3. Create batched API requests (Claude Batch API for cost efficiency)
4. Submit batches and track completion
5. Save results for processing in stage 03
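
Steps 1-3 boil down to assembling one request entry per seed image. A minimal sketch, with an Anthropic-style nesting of fields; the exact field names of the module's batch entries (and of the Batch API schema) are assumptions here:

```python
import base64

def build_message(prompt_text, image_bytes, custom_id):
    """Assemble one batched request entry: base64-encode the seed image
    and pair it with the instantiated prompt text. Field names follow an
    Anthropic-like shape but are illustrative only."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "custom_id": custom_id,  # lets results be matched back to documents
        "params": {
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64",
                                "media_type": "image/jpeg",
                                "data": encoded}},
                    {"type": "text", "text": prompt_text},
                ],
            }],
        },
    }
```
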

**Inputs:**
- Prompt template from `data/prompt_templates/<template_name>/`
- Seed images from stage 01
- API credentials from environment variables
- `SynDatasetDefinition` parameters

**Outputs:**
```
prompt_batches/    # Batch metadata (batch IDs, status)
message_results/   # JSON response files per batch
logs/              # Prompting logs and progress
```

**API Configuration:**
- **Claude:** Uses the Batch API with prompt caching for cost efficiency
- **Batch Size:** Configurable via the `BATCH_SIZE` constant
- **Polling Interval:** Configurable wait time between status checks
- **Model Selection:** Specified in `SynDatasetDefinition.llm_model`

**Cost Tracking:**
- Input/output token counts per request
- Cached token usage (for Claude)
- Total cost estimation per batch
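
The cost estimate combines the three token counters above. A sketch of the arithmetic; the per-million-token prices are placeholders, not published rates:

```python
def estimate_batch_cost(input_tokens, output_tokens, cached_tokens=0,
                        price_in=3.0, price_out=15.0, price_cached=0.3):
    """Estimate USD cost for one batch. Prices are per million tokens and
    purely illustrative; cached input tokens are billed at a reduced rate."""
    uncached = max(input_tokens - cached_tokens, 0)
    return (uncached * price_in
            + cached_tokens * price_cached
            + output_tokens * price_out) / 1_000_000
```
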

**Example Configuration:**
```yaml
# In syn_dataset_definition YAML
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1  # Documents per prompt
language: "English"
doc_type: "business and administrative"
```

---

### Stage 03: Process Response

**File:** `pipeline_03_process_response.py`

**Purpose:** Extract and validate HTML documents and ground truth annotations from LLM responses.

**Key Functions:**
- `main()`: Main processor
- `extract_html_from_message()`: Regex-based HTML extraction from markdown code blocks
- `extract_gt_from_html()`: Parse JSON ground truth from `<script id="GT">` tags
- `validate_and_save_gt()`: Task-specific validation and formatting

**Process:**
1. Load message results from stage 02
2. Extract HTML content using regex patterns
3. Parse and validate ground truth JSON
4. Apply task-specific formatting (QA, KIE, DLA, CLS)
5. Save raw HTML and annotations separately

**Inputs:**
- Message results from `message_results/` (stage 02)
- `SynDatasetDefinition` task type and prompt format

**Outputs:**
```
raw_html/                  # HTML files (one per document)
raw_annotations/           # Ground truth JSON files
  qa/                      # QA format: {"question": "answer", ...}
  kie/                     # KIE format: {"entity": "value", ...}
  dla/                     # DLA format: {"element_id": "label", ...}
  cls/                     # CLS format: {"class": "category"}
logs/message_processing/   # Processing logs per message
```

**Ground Truth Formats:**

**QA Format:**
```json
{
  "What is the invoice number?": "INV-12345",
  "What is the total amount?": "$1,234.56"
}
```

**KIE Format:**
```json
{
  "company_name": "Acme Corp",
  "invoice_date": "2024-01-15",
  "total_amount": "1234.56"
}
```

**DLA Format:**
```json
{
  "layout_0": "title",
  "layout_1": "text",
  "layout_2": "table"
}
```

**Validation Checks:**
- Expected document count matches actual
- Valid JSON structure in GT
- Task-specific field presence
- HTML completeness (valid tags)

**Error Handling:**
- Malformed HTML → Logged, skipped
- Missing GT → Document flagged in logs
- Invalid JSON → Parsing error logged

---

### Stage 04: Render PDF and Extract Geometries

**File:** `pipeline_04_render_pdf_and_extract_geos.py`

**Purpose:** Convert HTML to PDF with automatic content-based page sizing and extract element geometries (positions, dimensions) for downstream processing.

**Key Functions:**
- `main()`: Async batch rendering orchestrator
- `render_pdf_with_playwright()`: Single PDF rendering using Playwright/Chromium
- `preprocess_css()`: Remove conflicting CSS `@page` rules
- `validate_pdf()`: Check page count and file integrity
- `extract_geometries()`: Parse element positions from injected JavaScript

**Process:**
1. Load raw HTML from stage 03
2. Inject geometry extraction JavaScript
3. Preprocess CSS (remove fixed page sizes)
4. Render with Playwright in headless Chromium
5. Measure content dimensions dynamically
6. Export the PDF with the calculated page size
7. Extract and save element geometries
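
Step 3 (dropping fixed page sizes so the measured content size wins) can be sketched as a single substitution. This is a simplified stand-in for `preprocess_css`, assuming `@page` rules contain no nested braces:

```python
import re

def preprocess_css(html):
    """Strip CSS @page rules so the dynamically measured page size is not
    overridden by a fixed size declared in the document's own stylesheet.
    Assumes @page blocks contain no nested braces."""
    return re.sub(r"@page\s*[^{]*\{[^}]*\}", "", html)
```
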

**Inputs:**
- Raw HTML files from `raw_html/` (stage 03)

**Outputs:**
```
pdf_initial/           # Initial PDFs
geometries/            # Element geometry JSON files
  {doc_id}.json        # Contains positions, dimensions, CSS classes
render_html/           # HTML with geometry extraction scripts
debug/pdf_with_geos/   # Debug PDFs with geometry overlays (optional)
logs/rendering/        # Rendering logs and errors
```

**Geometry JSON Structure:**
```json
{
  "page_width_mm": 210.0,
  "page_height_mm": 297.0,
  "elements": [
    {
      "id": "layout_0",
      "type": "div",
      "class": "title",
      "rect": {
        "x": 20.0,
        "y": 30.0,
        "width": 170.0,
        "height": 15.0
      },
      "text": "Invoice",
      "attributes": {
        "data-label": "title",
        "data-handwriting": null,
        "data-visual-element": null
      }
    }
  ]
}
```

**Key Features:**
- **Automatic Page Sizing:** No fixed dimensions; the page size adapts to the content
- **Concurrent Rendering:** Semaphore-controlled parallel processing
- **Geometry Extraction:** JavaScript injected to capture element positions
- **CSS Coordinate Conversion:** 96 DPI (CSS) → 72 DPI (PDF)
- **Retry Logic:** Up to 3 attempts with a configurable timeout
- **Debug Visualizations:** Optional overlay of geometries on PDFs

**Configuration Constants:**
- `CHROMIUM_CONCURRENCY`: 10 (parallel render limit)
- `PER_PDF_RENDER_TIMEOUT`: 60 seconds
- `PER_PDF_RENDER_MAX_RETRIES`: 3 attempts
- `PDF_POINT_SCALING`: 72/96 for DPI conversion
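
The coordinate conversion behind `PDF_POINT_SCALING` is a fixed ratio: CSS measures in 96-DPI pixels, PDF in 72-DPI points. A minimal sketch (helper names are illustrative):

```python
CSS_DPI = 96   # CSS reference pixel density
PDF_DPI = 72   # PDF points per inch
PDF_POINT_SCALING = PDF_DPI / CSS_DPI  # 0.75

def css_px_to_pdf_points(value_px):
    """Convert a CSS pixel measurement (96 DPI) into PDF points (72 DPI)."""
    return value_px * PDF_POINT_SCALING

def css_rect_to_pdf(rect):
    """Convert a whole {x, y, width, height} rect from CSS px to points."""
    return {key: css_px_to_pdf_points(value) for key, value in rect.items()}
```
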

**Error Handling:**
- Timeout → Retry with an increased timeout
- Multi-page PDFs → Flagged and skipped
- Missing geometries → Error logged, document marked invalid

---

### Stage 05: Extract BBoxes from PDF

**File:** `pipeline_05_extract_bboxes_from_pdf.py`

**Purpose:** Extract word-level and character-level bounding boxes from PDFs using PyMuPDF for accurate text positioning.

**Key Functions:**
- `main()`: Main extraction orchestrator
- `extract_bboxes_from_pdf()`: PyMuPDF-based extraction
- `verify_char_to_word_mapping()`: Validate character-to-word relationships
- `create_bbox_debug_pdf()`: Generate debug visualizations

**Process:**
1. Load PDFs from stage 04
2. Extract words with PyMuPDF `get_text("words")`
3. Extract characters with `get_text("rawdict")`
4. Map characters to parent words
5. Validate mappings (required for handwriting splitting)
6. Save both word and character bboxes

**Inputs:**
- PDFs from `pdf_initial/` (stage 04)

**Outputs:**
```
pdf_word_bboxes/        # Word-level bounding boxes
  {doc_id}.json         # Format: [x0, y0, x1, y1, text]
pdf_char_bboxes/        # Character-level bounding boxes (if mappable)
  {doc_id}.json         # Format: [x0, y0, x1, y1, char, word_idx]
debug/pdf_bboxes/       # Debug PDFs with bbox overlays (optional)
logs/bbox_extraction/   # Extraction logs
```

**BBox JSON Format:**

**Word BBoxes:**
```json
[
  {"x0": 20.0, "y0": 30.0, "x1": 60.0, "y1": 45.0, "text": "Invoice"},
  {"x0": 65.0, "y0": 30.0, "x1": 95.0, "y1": 45.0, "text": "Date"}
]
```

**Char BBoxes (with word mapping):**
```json
[
  {"x0": 20.0, "y0": 30.0, "x1": 25.0, "y1": 45.0, "char": "I", "word_idx": 0},
  {"x0": 25.0, "y0": 30.0, "x1": 30.0, "y1": 45.0, "char": "n", "word_idx": 0}
]
```

**Key Features:**
- **Dual-Level Extraction:** Both word and character granularity
- **Mapping Validation:** Ensures characters can be mapped to words (critical for handwriting)
- **Debug Visualizations:** Color-coded bbox overlays on PDFs
- **PDF Coordinate System:** Uses 72 DPI (PDF points)

**Character-to-Word Mapping:**
Required for handwriting splitting in stage 07. If mapping fails (e.g., due to ligatures or complex fonts), char bboxes are not saved, and handwriting falls back to word-level splitting.
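
The mapping check in step 5 amounts to verifying that concatenating each word's characters (by `word_idx`) reproduces the word text. A simplified stand-in for `verify_char_to_word_mapping`, operating on the JSON shapes shown above:

```python
def verify_char_to_word_mapping(words, chars):
    """Return True if, for every word, joining the characters that claim it
    as their parent (via word_idx) reproduces the word's text. A False
    result triggers the word-level fallback described above."""
    for idx, word in enumerate(words):
        joined = "".join(c["char"] for c in chars if c["word_idx"] == idx)
        if joined != word["text"]:
            return False
    return True
```
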

**Error Handling:**
- Empty PDFs → Skipped with a warning
- Unmappable characters → Word-level fallback
- Invalid bboxes (negative dimensions) → Filtered out

---

### Stage 06: Extract Layout Element Definitions and Annotation GT

**File:** `pipeline_06_extract_layout_element_definitions_and_annotation_gt.py`

**Purpose:** Extract layout element definitions for Document Layout Analysis (DLA) and annotation-based ground truth for Key Information Extraction (KIE).

**Key Functions:**
- `main()`: Task router
- `handle_dla()`: Process DLA tasks
- `handle_kie()`: Process KIE tasks
- `parse_layout_elements()`: Extract layout elements from geometries
- `parse_kie_fields()`: Extract KIE annotated fields from geometries

**Process (DLA):**
1. Load geometries from stage 04
2. Filter elements with a `data-label` attribute
3. Validate labels against the task-specific valid labels
4. Extract bounding boxes and labels
5. Save layout element definitions
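
Steps 2-4 of the DLA path can be sketched against the geometry JSON from stage 04. This is a simplified stand-in for `parse_layout_elements` (the real function's signature and error reporting may differ):

```python
def parse_layout_elements(geometries, valid_labels):
    """Keep geometry elements that carry a data-label attribute, validating
    each label against the task's allowed set. Invalid labels are collected
    as (element_id, label) errors instead of raising."""
    elements, errors = [], []
    for el in geometries["elements"]:
        label = el.get("attributes", {}).get("data-label")
        if label is None:
            continue  # unlabeled elements are not layout elements
        if label not in valid_labels:
            errors.append((el["id"], label))
            continue
        elements.append({"id": el["id"], "label": label, "rect": el["rect"]})
    return elements, errors
```
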

**Process (KIE):**
1. Load geometries from stage 04
2. Filter elements with a `data-annotation` attribute
3. Extract entity names and text values
4. Validate against the expected entity schema
5. Save raw annotations for stage 17

**Inputs:**
- Geometries from `geometries/` (stage 04)
- Valid labels from `SynDatasetDefinition.valid_labels`
- Task type from `SynDatasetDefinition.task`

**Outputs:**
```
layout_element_definitions/   # For DLA tasks
  {doc_id}.json               # [{id, label, rect}, ...]
raw_annotations/              # For KIE tasks
  kie_annotations/
    {doc_id}.json             # {entity_name: text_value, ...}
```

**Layout Element Definition Format (DLA):**
```json
[
  {
    "id": "layout_0",
    "label": "title",
    "rect": {"x": 20.0, "y": 30.0, "width": 170.0, "height": 15.0}
  },
  {
    "id": "layout_1",
    "label": "text",
    "rect": {"x": 20.0, "y": 50.0, "width": 170.0, "height": 50.0}
  }
]
```

**KIE Annotation Format:**
```json
{
  "company_name": "Acme Corporation",
  "invoice_number": "INV-12345",
  "invoice_date": "January 15, 2024",
  "total_amount": "$1,234.56"
}
```

**Validation Checks:**
- **Missing labels:** Elements without `data-label` → Warning
- **Invalid labels:** Labels not in `valid_labels` → Error
- **Zero dimensions:** Width or height = 0 → Error
- **Multiple labels (KIE only):** One element mapped to multiple entities → Error
- **Missing text:** KIE elements without text → Warning

**Task-Specific Handling:**
- **DLA:** Converts geometries to `LayoutBBox` format for stage 17
- **KIE:** Extracts entity-value pairs for BIO tagging in stage 17
- **QA/CLS:** No processing in this stage (uses raw GT from stage 03)

---

### Stage 07: Extract Handwriting

**File:** `pipeline_07_extract_handwriting.py`

**Purpose:** Identify text regions marked for handwriting generation, map them to character/word bounding boxes, and prepare definitions for the diffusion model.

**Key Functions:**
- `main()`: Main extraction orchestrator
- `parse_handwriting_geometries()`: Extract elements with the `data-handwriting` attribute
- `map_handwriting_to_bboxes()`: Match handwriting text to the PDF bboxes spatially
- `split_long_words()`: Split words exceeding the diffusion model's character limit
- `extract_handwriting_style()`: Parse the author ID from the CSS class

**Process:**
1. Load geometries from stage 04
2. Filter elements with a `data-handwriting` attribute
3. Load word/character bboxes from stage 05
4. Spatially match handwriting regions to bboxes
5. Split long words to satisfy diffusion model constraints
6. Extract the author ID for style consistency
7. Save handwriting definitions with bbox mappings

**Inputs:**
- Geometries from `geometries/` (stage 04)
- Word bboxes from `pdf_word_bboxes/` (stage 05)
- Character bboxes from `pdf_char_bboxes/` (stage 05, if available)

**Outputs:**
```
handwriting_definitions/
  {doc_id}.json                # Handwriting region definitions
logs/handwriting_extraction/   # Extraction logs and warnings
```

**Handwriting Definition Format:**
```json
[
  {
    "id": "hw0",
    "text": "John Smith",
    "author_id": "author1",
    "bboxes": [
      "20.0,30.0,60.0,45.0,John",
      "65.0,30.0,95.0,45.0,Smith"
    ],
    "rect": {
      "x": 20.0,
      "y": 30.0,
      "width": 75.0,
      "height": 15.0
    },
    "is_signature": false
  }
]
```

**Key Features:**
- **Author ID Extraction:** Parses CSS classes like `handwriting-author1` for style consistency
- **Spatial Matching:** Uses bbox overlap/proximity to map handwriting regions to text
- **Long Word Splitting:** Splits words longer than `MAX_HANDWRITING_CHARS` (default: 7) into multiple images
- **Signature Detection:** Identifies signature fields for special handling
- **Character-Level Fallback:** Uses word bboxes if char-level mapping is unavailable

**Configuration Constants:**
- `MAX_HANDWRITING_CHARS`: 7 (max characters per diffusion generation)
- `SPATIAL_MATCH_THRESHOLD`: Pixel tolerance for bbox matching

**Mapping Logic:**
1. Find all words whose bboxes overlap/intersect with the handwriting rect
2. Extract text from the matched bboxes
3. Compare with the expected handwriting text
4. If char bboxes are available: split long words at character boundaries
5. If char bboxes are unavailable: split at word boundaries
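
The text side of the splitting logic can be sketched as follows. This is a simplified stand-in for `split_long_words` that chunks at fixed character offsets; the real implementation additionally uses the char bboxes to decide where each chunk sits on the page:

```python
def split_long_words(text, max_chars=7):
    """Split whitespace-separated words into chunks of at most max_chars
    characters, one chunk per diffusion model generation
    (max_chars mirrors MAX_HANDWRITING_CHARS)."""
    pieces = []
    for word in text.split():
        if len(word) <= max_chars:
            pieces.append(word)
        else:
            pieces.extend(word[i:i + max_chars]
                          for i in range(0, len(word), max_chars))
    return pieces
```
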

**Error Handling:**
- No bbox matches → Warning, handwriting region skipped
- Text mismatch → Logged, best match used
- Missing char bboxes → Word-level fallback (may affect quality)

---

### Stage 08: Extract Visual Element Definitions

**File:** `pipeline_08_extract_visual_element_definitions.py`

**Purpose:** Extract placeholders for visual elements (stamps, logos, barcodes, charts, photos) from geometries for generation in stage 10.

**Key Functions:**
- `main()`: Main extraction orchestrator
- `parse_visual_element_geometries()`: Extract elements with the `data-visual-element` attribute
- `parse_css_dimension()`: Parse width/height from CSS
- `parse_css_rotation()`: Extract the rotation angle from the CSS transform

**Process:**
1. Load geometries from stage 04
2. Filter elements with a `data-visual-element` attribute
3. Parse the element type (stamp, logo, barcode, etc.)
4. Extract dimensions and rotation from CSS
5. Parse content (e.g., stamp text, barcode value)
6. Validate element types and dimensions
7. Save visual element definitions

**Inputs:**
- Geometries from `geometries/` (stage 04)

**Outputs:**
```
visual_element_definitions/
  {doc_id}.json                   # Visual element definitions
logs/visual_element_extraction/   # Extraction logs
```

**Visual Element Definition Format:**
```json
[
  {
    "id": "ve0",
    "type": "stamp",
    "type_unmapped": "stamp",
    "content": "CONFIDENTIAL",
    "rect": {
      "x": 150.0,
      "y": 250.0,
      "width": 60.0,
      "height": 30.0
    },
    "rotation": -15.0
  },
  {
    "id": "ve1",
    "type": "barcode",
    "type_unmapped": "barcode",
    "content": "1234567890",
    "rect": {
      "x": 20.0,
      "y": 270.0,
      "width": 100.0,
      "height": 20.0
    },
    "rotation": 0.0
  }
]
```

**Supported Visual Element Types:**
- `stamp`: Text-based stamps (e.g., "APPROVED", "CONFIDENTIAL")
- `logo`: Company/brand logos (selected from prefabs)
- `barcode`: Code128 barcodes
- `chart`: Charts and graphs (selected from prefabs)
- `photo`: Photographic images (selected from prefabs)

**Type Mapping:**
Maps LLM-generated type names to standard types using the `VISUAL_ELEMENT_TYPE_MAPPING` constant.

**Key Features:**
- **Content Extraction:** Parses text content for stamps and values for barcodes
- **Rotation Support:** Extracts CSS `rotate()` transform angles
- **Dimension Parsing:** Handles CSS units (px, mm, %, etc.)
- **Type Validation:** Warns about unknown types and applies a fallback mapping

**Validation Checks:**
- Unknown type → Logged, mapped to the closest known type
- Zero dimensions → Error, element skipped
- Invalid rotation angle → Defaults to 0°
- Missing content (for stamps/barcodes) → Warning

**CSS Parsing Examples:**
```css
/* Rotation extraction */
transform: rotate(-15deg);   /* → -15.0 */
transform: rotate(0.5turn);  /* → 180.0 */

/* Dimension extraction */
width: 60mm;    /* → 60.0 (mm) */
height: 30px;   /* → converted to mm */
```
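
The rotation examples above can be handled by one regex. A simplified stand-in for `parse_css_rotation` that supports the `deg` and `turn` units shown (the real parser may accept more units, such as `rad`):

```python
import re

def parse_css_rotation(transform):
    """Extract a rotation angle in degrees from a CSS transform value.
    Supports deg and turn; anything unparseable defaults to 0.0."""
    match = re.search(r"rotate\(\s*(-?\d*\.?\d+)\s*(deg|turn)\s*\)", transform)
    if not match:
        return 0.0
    value, unit = float(match.group(1)), match.group(2)
    return value * 360.0 if unit == "turn" else value
```
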

---

### Stage 09: Create Handwriting Images

**File:** `pipeline_09_create_handwriting_images.py`

**Purpose:** Generate realistic handwriting images using a diffusion model, with per-author style consistency.

**Key Functions:**
- `main()`: Main generation orchestrator
- `generate_handwriting_diffusion()`: Call the diffusion model (API or local)
- `add_handwriting_blur()`: Optional post-processing for realism

**Process:**
1. Load handwriting definitions from stage 07
2. Group by author ID for style consistency
3. For each text segment:
   - Generate a handwriting image with the diffusion model
   - Apply optional blur for realism
   - Save the image with metadata
4. Track the author-to-writer style mapping
5. Save generation logs

**Inputs:**
- Handwriting definitions from `handwriting_definitions/` (stage 07)
- Diffusion model checkpoint (local or API)

**Outputs:**
```
handwriting_images/
  {doc_id}/
    hw0_0.png                  # First bbox of handwriting region hw0
    hw0_1.png                  # Second bbox (if multi-word)
    hw1_0.png
logs/handwriting_generation/   # Generation logs with author mappings
```

**Handwriting Image Specifications:**
- **Format:** PNG with transparency
- **Height:** 40 pixels (configurable via `HANDWRITING_HEIGHT_PX`)
- **Padding:** 0 pixels horizontal (configurable via `HANDWRITING_PADDING_PX`)
- **Background:** Transparent
- **Color:** Black text on a transparent background

**Diffusion Model Integration:**
Located in the `handwriting_diffusion/` module:
- `generate_handwriting_diffusion_raw.py`: Core generation logic
- `text_encoder.py`: Text encoding for model input
- `tokenizer.py`: Character tokenization

**Key Features:**
- **Author-Style Consistency:** Same author ID → consistent handwriting style
- **Writer Style Mapping:** Maps author IDs to writer style IDs (e.g., "author1" → writer_42)
- **Batch Processing:** Generates multiple images efficiently
- **Optional Blur:** Post-processing for a more realistic appearance
- **Quality Control:** Validates generated images (non-empty, correct dimensions)

**Configuration Constants:**
- `HANDWRITING_HEIGHT_PX`: 40 pixels
- `HANDWRITING_PADDING_PX`: 0 pixels
- `DIFFUSION_NUM_INFERENCE_STEPS`: 50 (generation quality/speed tradeoff)
- `HANDWRITING_BLUR_ENABLED`: Optional blur toggle
- `HANDWRITING_STYLES`: List of available writer styles

**Author-to-Writer Mapping Example:**
```json
{
  "author1": "writer_42",
  "author2": "writer_17",
  "author3": "writer_89"
}
```
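
A mapping like the one above must be stable within a document so the same author always renders in the same style. One way to achieve that stability is a hash-based assignment; this is a hypothetical scheme for illustration, not necessarily how the pipeline assigns writers:

```python
import hashlib

def map_authors_to_writers(author_ids, writer_styles):
    """Assign each distinct author ID a writer style deterministically by
    hashing the author ID, so repeated runs produce the same mapping.
    (Illustrative scheme; the pipeline's actual assignment may differ.)"""
    mapping = {}
    for author in sorted(set(author_ids)):
        digest = int(hashlib.sha256(author.encode()).hexdigest(), 16)
        mapping[author] = writer_styles[digest % len(writer_styles)]
    return mapping
```
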

**Error Handling:**
- Generation failure → Retry up to 3 times
- Empty image → Logged, document flagged
- Model timeout → Skipped, logged

---

### Stage 10: Create Visual Elements

**File:** `pipeline_10_create_visual_elements.py`

**Purpose:** Generate or select visual element images (stamps, logos, barcodes, charts, photos) based on the definitions from stage 08.

**Key Functions:**
- `main()`: Main generation orchestrator
- `route_visual_element_generation()`: Router for element type
- `generate_stamp()`: Create a stamp image with text
- `select_logo()`: Random selection from logo prefabs
- `generate_barcode()`: Create a Code128 barcode
- `select_photo()`: Random selection from photo prefabs
- `select_chart()`: Random selection from chart prefabs

**Process:**
1. Load visual element definitions from stage 08
2. For each element:
   - Route to the type-specific generator
   - Generate or select an image
   - Resize to target dimensions
   - Save with transparency (if applicable)
3. Cache prefab directories for performance
4. Save generation logs

**Inputs:**
- Visual element definitions from `visual_element_definitions/` (stage 08)
- Prefab directories in `data/visual_element_prefabs/`:
  - `logos/`
  - `photos/`
  - `charts/`

**Outputs:**
```
visual_element_images/
  {doc_id}/
    ve0.png                       # Stamp image
    ve1.png                       # Logo image
    ve2.png                       # Barcode image
logs/visual_element_generation/   # Generation logs
```

**Type-Specific Generation:**

**Stamps:**
- Font: Configurable (default: Arial Bold)
- Color: Configurable (default: red/blue)
- Background: Transparent
- Border: Optional rounded rectangle
- Rotation: Applied during insertion (stage 13)

**Logos:**
- Source: Random selection from `data/visual_element_prefabs/logos/`
- Format: PNG with transparency preserved
- Caching: Directory contents cached after the first scan

**Barcodes:**
- Type: Code128 (supports alphanumeric content)
- Library: `python-barcode`
- Background: White
- Content validation: Numeric or alphanumeric

**Photos:**
- Source: Random selection from `data/visual_element_prefabs/photos/`
- Format: JPEG or PNG
- Aspect ratio: Preserved during resize

**Charts/Figures:**
- Source: Random selection from `data/visual_element_prefabs/charts/`
- Format: PNG with transparency
- Types: Bar charts, line graphs, pie charts, etc.

**Key Features:**
- **Type-Specific Logic:** Each element type has a dedicated generation function
- **Prefab Caching:** Directory scans are cached for performance
- **Transparent Backgrounds:** Stamps and some logos support transparency
- **Content Validation:** Barcodes validate that their content is numeric or alphanumeric
- **Aspect Ratio Preservation:** Images are scaled without distortion

**Configuration Constants:**
- `STAMP_FONT_SIZE`: Calculated from target dimensions
- `STAMP_BORDER_WIDTH`: 2 pixels
- `BARCODE_DPI`: 300 for high quality
- `PREFAB_CACHE_SIZE`: In-memory cache limit

**Error Handling:**
- Missing prefab directory → Error, element skipped
- Empty prefab directory → Warning, fallback placeholder used
- Invalid barcode content → Logged, fallback text used
- Image generation failure → Placeholder created

---

### Stage 11: Render PDF Second Pass

**File:** `pipeline_11_render_pdf_second_pass.py`

**Purpose:** Re-render the PDF without handwriting placeholders (replaced with blank space) to prepare for handwriting image insertion in stage 12.

**Key Functions:**
- `main()`: Main rendering orchestrator
- `render_pdf_playwright_async()`: Async Playwright rendering
- `remove_handwriting_placeholders_from_html()`: Strip handwriting elements

**Process:**
1. Load HTML from `render_html/` (stage 04)
2. Remove all elements with a `data-handwriting` attribute
3. Re-render the PDF using the same dimensions as stage 04
4. Validate the PDF output
5. Save PDFs without handwriting placeholders
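
Step 2 can be sketched as a regex substitution. This is a simplified stand-in for `remove_handwriting_placeholders_from_html`; it assumes the placeholders are simple, non-nested `<div>`/`<span>` elements (a real implementation would more likely use an HTML parser):

```python
import re

def remove_handwriting_placeholders_from_html(html):
    """Drop every element that carries a data-handwriting attribute.
    Assumes non-nested <div>/<span> placeholders, as in the example below."""
    pattern = r"<(div|span)\b[^>]*\bdata-handwriting\b[^>]*>.*?</\1>"
    return re.sub(pattern, "", html, flags=re.DOTALL)
```
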
|
|
| **Inputs:** |
| - HTML from `render_html/` (stage 04) |
| - Page dimensions from stage 04 logs |
|
|
| **Outputs:** |
| ``` |
| pdf_without_handwriting_placeholder/ |
| {doc_id}.pdf # PDFs with handwriting regions blank |
| logs/rendering_second_pass/ # Rendering logs |
| ``` |
|
|
| **Key Differences from Stage 04:** |
| - **Uses Pre-calculated Dimensions:** No content measurement needed |
| - **Handwriting Elements Removed:** Not just hidden (visibility: hidden), but removed from DOM |
| - **No Geometry Extraction:** Geometries already saved in stage 04 |
| - **Faster Rendering:** No JavaScript injection for measurement |
|
|
| **HTML Preprocessing:** |
| ```html |
| <!-- Before (Stage 04) --> |
| <div data-handwriting="author1" class="handwriting">John Smith</div> |
| |
| <!-- After (Stage 11) --> |
| <!-- Element completely removed --> |
| ``` |
|
|
| **Why This Stage Exists:** |
| Handwriting placeholders in HTML use system fonts, which don't match the diffusion-generated handwriting. Removing them creates blank spaces where handwriting images will be inserted in stage 12. |
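| 
| 
| A minimal sketch of the stripping step, assuming placeholders are single, non-nested elements carrying the `data-handwriting` attribute (the real `remove_handwriting_placeholders_from_html()` may well use a proper HTML parser rather than this naive regex):
| 
| ```python
| import re
| 
| # Naive pattern: an opening tag with a data-handwriting attribute,
| # its (non-nested) content, and the matching closing tag.
| _PLACEHOLDER_RE = re.compile(r"<[^>]*\bdata-handwriting\b[^>]*>.*?</[^>]+>", re.S)
| 
| def remove_handwriting_placeholders(html: str) -> str:
|     """Remove handwriting placeholder elements so their regions render blank."""
|     return _PLACEHOLDER_RE.sub("", html)
| ```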
|
|
| **Rendering Configuration:** |
| - Same timeout and retry logic as stage 04 |
| - Reuses Playwright browser context for efficiency |
| - No debug output (geometries not extracted) |
|
|
| **Error Handling:** |
| - Multi-page PDF → Flagged and skipped
| - Rendering timeout → Retry with increased timeout
| - HTML parsing error → Logged, document marked invalid
|
|
| --- |
|
|
| ### Stage 12: Insert Handwriting Images |
|
|
| **File:** `pipeline_12_insert_handwriting_images.py` |
|
|
| **Purpose:** Overlay generated handwriting images onto PDF pages using PyMuPDF, with precise positioning and natural variation. |
|
|
| **Key Functions:** |
| - `main()`: Main insertion orchestrator |
| - `insert_handwriting_into_pdf()`: Per-document insertion |
| - `scale_image_with_aspect_ratio()`: Resize images while preserving aspect |
| - `group_bboxes_by_line()`: Group multi-word handwriting by line |
|
|
| **Process:** |
| 1. Load PDFs from stage 11 |
| 2. Load handwriting images from stage 09 |
| 3. Load handwriting definitions (bbox mappings) from stage 07 |
| 4. For each handwriting region: |
| - Group bboxes by line/block |
| - Scale images to fit bboxes |
| - Apply random offsets for natural variation |
| - Insert images at calculated positions |
| 5. Save PDFs with handwriting |
|
|
| **Inputs:** |
| - PDFs from `pdf_without_handwriting_placeholder/` (stage 11) |
| - Handwriting images from `handwriting_images/` (stage 09) |
| - Handwriting definitions from `handwriting_definitions/` (stage 07) |
|
|
| **Outputs:** |
| ``` |
| pdf_with_handwriting/ |
| {doc_id}.pdf # PDFs with handwriting inserted |
| logs/handwriting_insertion/ # Insertion logs |
| debug/handwriting_insertion/ # Debug PDFs with bbox overlays (optional) |
| ``` |
|
|
| **Insertion Logic:** |
|
|
| **1. Image Scaling:** |
| - High-res scaling: 3x upsampling before insertion |
| - Aspect ratio preserved |
| - Left-aligned within bbox (respects layout rect) |
|
|
| **2. Positioning:** |
| - **X coordinate:** Left edge of bbox + random offset |
| - **Y coordinate:** Top edge of bbox + random offset |
| - **Multi-word lines:** Consistent Y offset for line |
|
|
| **3. Random Offsets:** |
| ```python |
| x_offset = random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X) |
| y_offset = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y) |
| ``` |
|
|
| **4. Block/Line Grouping:** |
| For multi-word handwriting (e.g., "John Smith"): |
| - Group bboxes by Y coordinate (same line) |
| - Apply consistent Y offset to entire line |
| - Individual X offsets per word for natural spacing |
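| 
| 
| A sketch of the offset logic implied by these rules (function and argument names are illustrative; the pipeline's actual bounds are `MAX_HANDWRITING_RAND_X` and `MAX_HANDWRITING_RAND_Y`):
| 
| ```python
| import random
| 
| def apply_line_offsets(lines, max_x=2.0, max_y=1.0, seed=None):
|     """lines: list of lines, each a list of word bboxes with 'x0'/'y0' keys.
| 
|     Returns one (x, y) insertion origin per word: a single y offset is drawn
|     per line (keeps the baseline consistent), while each word gets its own
|     x offset for natural spacing."""
|     rng = random.Random(seed)
|     placed = []
|     for line in lines:
|         dy = rng.uniform(-max_y, max_y)      # consistent y offset per line
|         for bb in line:
|             dx = rng.uniform(-max_x, max_x)  # per-word x jitter
|             placed.append((bb["x0"] + dx, bb["y0"] + dy))
|     return placed
| ```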
|
|
| **Key Features:** |
| - **High-Resolution Insertion:** 3x scaling for quality |
| - **Natural Variation:** Random offsets simulate handwriting imperfection |
| - **Line Consistency:** Multi-word lines maintain baseline |
| - **Aspect Ratio Preservation:** Images not distorted |
| - **Transparency Support:** PNG alpha channel preserved |
|
|
| **Configuration Constants:** |
| - `HANDWRITING_IMAGE_UPSCALE_FACTOR`: 3x |
| - `MAX_HANDWRITING_RAND_X`: ±2 pixels
| - `MAX_HANDWRITING_RAND_Y`: ±1 pixel
| - `HANDWRITING_LINE_Y_CONSISTENCY`: Same Y offset per line |
|
|
| **Coordinate System:** |
| - PDF uses 72 DPI (points) |
| - Bboxes from stage 07 are in PDF coordinates |
| - No conversion needed |
|
|
| **Error Handling:** |
| - Missing handwriting image → Warning, bbox skipped
| - Image too large for bbox → Scaled down with warning
| - PyMuPDF insertion failure → Logged, document flagged
|
|
| --- |
|
|
| ### Stage 13: Insert Visual Elements |
|
|
| **File:** `pipeline_13_insert_visual_elements.py` |
|
|
| **Purpose:** Overlay visual element images (stamps, logos, barcodes, etc.) onto PDF pages with precise positioning and rotation. |
|
|
| **Key Functions:** |
| - `main()`: Main insertion orchestrator |
| - `insert_visual_elements_into_pdf()`: Per-document insertion |
| - `scale_image_with_aspect_ratio()`: Resize images (same as stage 12) |
|
|
| **Process:** |
| 1. Load PDFs from stage 12 |
| 2. Load visual element images from stage 10 |
| 3. Load visual element definitions from stage 08 |
| 4. For each visual element: |
| - Scale image to fit bbox |
| - Calculate centered position |
| - Apply rotation (if specified) |
| - Insert image at calculated position |
| 5. Save final PDFs |
| 6. If no visual elements: copy PDF from stage 12 |
|
|
| **Inputs:** |
| - PDFs from `pdf_with_handwriting/` (stage 12) |
| - Visual element images from `visual_element_images/` (stage 10) |
| - Visual element definitions from `visual_element_definitions/` (stage 08) |
|
|
| **Outputs:** |
| ``` |
| pdf_final/ |
| {doc_id}.pdf # Final PDFs with all elements |
| logs/visual_element_insertion/ # Insertion logs |
| debug/visual_element_insertion/ # Debug PDFs (optional) |
| ``` |
|
|
| **Insertion Logic:** |
|
|
| **1. Image Scaling:** |
| - High-res scaling: 3x upsampling |
| - Aspect ratio preserved |
| - Centered within bbox (not left-aligned like handwriting) |
|
|
| **2. Positioning:** |
| - **X coordinate:** Center of bbox - half image width |
| - **Y coordinate:** Center of bbox - half image height |
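| 
| 
| The centering rule reduces to pure bbox arithmetic; a sketch (function name assumed; the pipeline combines this with `scale_image_with_aspect_ratio()`):
| 
| ```python
| def centered_rect(bbox, img_w, img_h):
|     """Scale an img_w x img_h image to fit bbox=(x0, y0, x1, y1) while
|     preserving aspect ratio, then center the result inside the bbox."""
|     scale = min((bbox[2] - bbox[0]) / img_w, (bbox[3] - bbox[1]) / img_h)
|     w, h = img_w * scale, img_h * scale
|     cx, cy = (bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2
|     return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
| ```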
|
|
| **3. Rotation:** |
| - Applied via PyMuPDF transformation matrix |
| - Rotation around image center |
| - Angle from visual element definition |
|
|
| **Key Differences from Stage 12 (Handwriting):** |
| - **Centered placement:** Visual elements centered in bbox |
| - **No random offsets:** Precise placement for logos/stamps |
| - **Rotation support:** Stamps often rotated for "APPROVED" effect |
| - **Fallback:** Copies PDF if no visual elements (ensures output exists) |
|
|
| **Rotation Transformation:** |
| ```python |
| # PyMuPDF rotation matrix: start from identity zoom, then rotate
| rotation_matrix = fitz.Matrix(1, 1).prerotate(rotation_angle)
| image_rect = image_rect * rotation_matrix  # transform the target rect
| ``` |
|
|
| **Key Features:** |
| - **High-Resolution Insertion:** 3x scaling for quality |
| - **Centered Alignment:** Visual elements centered in bboxes |
| - **Rotation Support:** Arbitrary angles for stamps |
| - **Transparency Preservation:** PNG alpha channel maintained |
| - **Fallback Handling:** Copies PDF if no visual elements |
|
|
| **Configuration Constants:** |
| - `VISUAL_ELEMENT_UPSCALE_FACTOR`: 3x |
| - `ROTATION_PRECISION`: Angle precision in degrees |
|
|
| **Coordinate System:** |
| - Same as stage 12 (PDF 72 DPI) |
| - Rotation applied after positioning |
|
|
| **Error Handling:** |
| - Missing visual element image → Warning, element skipped
| - Image too large for bbox → Scaled down with warning
| - PyMuPDF insertion failure → Logged, document flagged
| - No visual elements → PDF copied from stage 12
|
|
| --- |
|
|
| ### Stage 14: Render Image |
|
|
| **File:** `pipeline_14_render_image.py` |
|
|
| **Purpose:** Convert final PDFs to high-quality PNG images for OCR and dataset distribution. |
|
|
| **Key Functions:** |
| - `main()`: Main conversion orchestrator |
| - `convert_pdf_to_image()`: PDF to PNG conversion |
|
|
| **Process:** |
| 1. Load PDFs from stage 13 |
| 2. Convert each PDF to PNG using custom PDF-to-image module |
| 3. Validate image dimensions |
| 4. Save images |
|
|
| **Inputs:** |
| - PDFs from `pdf_final/` (stage 13) |
|
|
| **Outputs:** |
| ``` |
| images/ |
| {doc_id}.png # Final document images |
| logs/image_rendering/ # Conversion logs |
| ``` |
|
|
| **Image Specifications:** |
| - **Format:** PNG |
| - **DPI:** Configurable (default: 200 DPI for quality OCR) |
| - **Color Mode:** RGB (24-bit) |
| - **Compression:** PNG lossless |
|
|
| **PDF-to-Image Module:** |
| Located in custom module (not standard library): |
| - Handles PDF rendering at specified DPI |
| - Single-page conversion only (multi-page PDFs skipped) |
| - Uses PDF coordinate system: 72 DPI internally |
|
|
| **Key Features:** |
| - **High DPI:** 200+ DPI for accurate OCR |
| - **Single Page Only:** Multi-page PDFs flagged as errors |
| - **Lossless Compression:** PNG preserves all details |
| - **Size Validation:** Checks image dimensions match PDF |
|
|
| **Configuration Constants:** |
| - `IMAGE_DPI`: 200 (OCR quality vs file size tradeoff) |
| - `IMAGE_MAX_DIMENSION`: Optional max width/height |
|
|
| **Coordinate System Conversion:** |
| ``` |
| PDF: 210mm × 297mm @ 72 DPI  → 595 × 842 points
| PNG: 210mm × 297mm @ 200 DPI → 1654 × 2339 pixels
| ``` |
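| 
| 
| These figures follow from two small conversions (helper names are illustrative):
| 
| ```python
| def mm_to_points(mm: float) -> int:
|     """Millimetres to PDF points: 72 points per inch, 25.4 mm per inch."""
|     return round(mm / 25.4 * 72)
| 
| def mm_to_pixels(mm: float, dpi: int) -> int:
|     """Millimetres to pixels at the given rendering DPI."""
|     return round(mm / 25.4 * dpi)
| ```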
|
|
| **Why This Stage:** |
| - OCR performs better on high-DPI images |
| - Images are final output format for datasets |
| - PNG preserves quality better than JPEG for text |
|
|
| **Error Handling:** |
| - Multi-page PDF → Flagged and skipped
| - Conversion failure → Logged, document marked invalid
| - Empty image → Error, document flagged
| - Dimension mismatch → Warning logged
|
|
| --- |
|
|
| ### Stage 15: Perform OCR |
|
|
| **File:** `pipeline_15_perform_ocr.py` |
|
|
| **Purpose:** Perform Optical Character Recognition on final images to obtain accurate word and line-level bounding boxes, essential for documents with handwriting or visual elements. |
|
|
| **Key Functions:** |
| - `main()`: Main OCR orchestrator |
| - `call_microsoft_ocr()`: Microsoft Azure OCR API call |
| - `convert_ocr_to_bbox_format()`: Transform OCR results to internal format |
| - `aggregate_words_to_segments()`: Group words into lines |
|
|
| **Process:** |
| 1. Determine which documents need OCR:
|    - Has handwriting → Requires OCR
|    - Has visual elements → Requires OCR
|    - Neither → Copy PDF bboxes from stage 05
| 2. For documents requiring OCR: |
| - Call Microsoft OCR service |
| - Parse word-level bboxes |
| - Aggregate into line-level segments |
| - Convert coordinates to PDF space |
| 3. Save final bboxes |
|
|
| **Inputs:** |
| - Images from `images/` (stage 14) |
| - Handwriting definitions from `handwriting_definitions/` (stage 07) |
| - Visual element definitions from `visual_element_definitions/` (stage 08) |
| - PDF bboxes from `pdf_word_bboxes/` (stage 05, for non-OCR documents) |
|
|
| **Outputs:** |
| ``` |
| final_word_bboxes/ |
| {doc_id}.json # Word-level bounding boxes |
| final_segment_bboxes/ |
| {doc_id}.json # Line-level bounding boxes |
| ocr_results_cache/ |
| {doc_id}.json # Raw OCR API responses (cached) |
| logs/ocr/ # OCR logs and errors |
| ``` |
|
|
| **OCR Decision Logic:** |
| ```python |
| requires_ocr = ( |
| has_handwriting(doc_id) or |
| has_visual_elements(doc_id) |
| ) |
| |
| if requires_ocr: |
| perform_microsoft_ocr() |
| else: |
| copy_pdf_bboxes() # Reuse stage 05 results |
| ``` |
|
|
| **Why Handwriting/Visual Elements Require OCR:** |
| - Handwriting images inserted in stage 12 → Not in PDF text layer
| - Visual elements inserted in stage 13 → Not in PDF text layer
| - PDF text extraction (stage 05) misses these elements
| - OCR captures all visible text on rendered image
|
|
| **Microsoft OCR API:** |
| - **Service:** Azure Computer Vision (Read API) |
| - **Input:** PNG image (from stage 14) |
| - **Output:** Word polygons, text, confidence scores |
| - **Caching:** Results cached to avoid re-processing |
|
|
| **OCR Response Format:** |
| ```json |
| { |
| "readResults": [{ |
| "page": 1, |
| "words": [ |
| { |
| "boundingBox": [20, 30, 60, 30, 60, 45, 20, 45], |
| "text": "Invoice", |
| "confidence": 0.98 |
| } |
| ], |
| "lines": [ |
| { |
| "boundingBox": [20, 30, 95, 30, 95, 45, 20, 45], |
| "text": "Invoice Date", |
| "words": [...] |
| } |
| ] |
| }] |
| } |
| ``` |
|
|
| **Coordinate Conversion:** |
| OCR returns pixel coordinates; converted to PDF points: |
| ```python |
| pdf_x = ocr_x * (pdf_width / image_width) |
| pdf_y = ocr_y * (pdf_height / image_height) |
| ``` |
|
|
| **Segment Aggregation:** |
| Groups words into lines based on Y coordinate proximity. |
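| 
| 
| A greedy sketch of this aggregation, assuming word dicts with `x0`/`y0`/`x1`/`y1`/`text` keys and an illustrative y tolerance (the production `aggregate_words_to_segments()` may differ in details):
| 
| ```python
| def aggregate_words_to_segments(words, y_tol=5.0):
|     """Sort words by (y0, x0); a word joins the current segment when its
|     y0 lies within y_tol of the segment's y0, else it starts a new one."""
|     segments = []
|     for w in sorted(words, key=lambda w: (w["y0"], w["x0"])):
|         if segments and abs(segments[-1]["y0"] - w["y0"]) <= y_tol:
|             seg = segments[-1]
|             seg["text"] += " " + w["text"]
|             seg["x0"] = min(seg["x0"], w["x0"])
|             seg["x1"] = max(seg["x1"], w["x1"])
|             seg["y1"] = max(seg["y1"], w["y1"])
|         else:
|             segments.append(dict(w))
|     return segments
| ```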
|
|
| **Key Features:** |
| - **Selective OCR:** Only runs OCR when necessary (cost/time savings) |
| - **Result Caching:** Avoids redundant API calls |
| - **Dual-Level Output:** Word and line (segment) bboxes |
| - **Coordinate Conversion:** OCR pixels → PDF points
| - **Fallback:** Copies PDF bboxes for documents without handwriting/VEs |
|
|
| **Configuration:** |
| - OCR API key from environment variables |
| - Timeout: 30 seconds per request |
| - Retry: Up to 3 attempts |
|
|
| **Error Handling:** |
| - OCR API failure → Retry, fallback to PDF bboxes if all retries fail
| - Empty OCR result → Warning, uses PDF bboxes
| - Coordinate conversion error → Logged, bbox skipped
|
|
| --- |
|
|
| ### Stage 16: Normalize BBoxes |
|
|
| **File:** `pipeline_16_normalize_bboxes.py` |
|
|
| **Purpose:** Convert bounding box coordinates from PDF points (absolute pixels) to normalized [0, 1] coordinates for model training and evaluation. |
|
|
| **Key Functions:** |
| - `main()`: Main normalization orchestrator |
| - `normalize_word_and_segment_bboxes()`: Normalize word/segment bboxes |
| - `normalize_layout_bboxes()`: Normalize layout element bboxes (DLA only) |
| - `normalize_coordinates()`: Core coordinate transformation |
|
|
| **Process:** |
| 1. Load final bboxes from stage 15 |
| 2. Load image dimensions (for normalization denominators) |
| 3. For each bbox: |
| - Transform: `normalized_x = pixel_x / image_width` |
| - Transform: `normalized_y = pixel_y / image_height` |
| - Preserve text content |
| 4. Save normalized bboxes |
| 5. For DLA tasks: also normalize layout element bboxes |
|
|
| **Inputs:** |
| - Word bboxes from `final_word_bboxes/` (stage 15) |
| - Segment bboxes from `final_segment_bboxes/` (stage 15) |
| - Layout element definitions from `layout_element_definitions/` (stage 06, for DLA) |
| - Image dimensions from stage 14 logs |
|
|
| **Outputs:** |
| ``` |
| normalized_word_bboxes/ |
| {doc_id}.json # Normalized word bboxes |
| normalized_segment_bboxes/ |
| {doc_id}.json # Normalized segment bboxes |
| normalized_gt/ # For DLA tasks only |
| {doc_id}.json # Normalized layout elements |
| logs/normalization/ # Normalization logs |
| ``` |
|
|
| **Normalization Formula:** |
| ```python |
| normalized_bbox = { |
| "x0": bbox["x0"] / image_width, |
| "y0": bbox["y0"] / image_height, |
| "x1": bbox["x1"] / image_width, |
| "y1": bbox["y1"] / image_height, |
| "text": bbox["text"] # Preserved |
| } |
| ``` |
|
|
| **Normalized BBox Format:** |
| ```json |
| [ |
| { |
| "x0": 0.095, # Was: 20 pixels out of 210mm @ 200dpi |
| "y0": 0.128, # Was: 30 pixels |
| "x1": 0.286, # Was: 60 pixels |
| "y1": 0.192, # Was: 45 pixels |
| "text": "Invoice" |
| } |
| ] |
| ``` |
|
|
| **DLA-Specific Normalization:** |
| For Document Layout Analysis tasks, layout element bboxes are also normalized: |
| ```json |
| [ |
| { |
| "label": "title", |
| "bbox": [0.095, 0.128, 0.905, 0.192] # [x0, y0, x1, y1] |
| }, |
| { |
| "label": "text", |
| "bbox": [0.095, 0.213, 0.905, 0.534] |
| } |
| ] |
| ``` |
|
|
| **Why Normalization:** |
| - **Model Training:** Most models expect [0, 1] coordinates |
| - **Resolution Independence:** Works across different image sizes |
| - **Standard Format:** Matches common dataset formats (e.g., LayoutLM) |
|
|
| **Coordinate System Mapping:** |
| ``` |
| PDF Points (72 DPI):
|   595 × 842 points (A4)
|     ↓ (stage 14 conversion @ 200 DPI)
| Image Pixels (200 DPI):
|   1654 × 2339 pixels
|     ↓ (stage 16 normalization)
| Normalized [0, 1]:
|   0.0-1.0 × 0.0-1.0
| ``` |
|
|
| **Key Features:** |
| - **Preserves Text:** Text content unchanged during normalization |
| - **Task-Specific:** DLA tasks get additional layout bbox normalization |
| - **Validation:** Checks for out-of-bounds coordinates (clamps to [0, 1]) |
| - **Precision:** Full float precision maintained |
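| 
| 
| The transform plus clamping can be sketched per bbox (function name assumed):
| 
| ```python
| def normalize_bbox(bbox, image_width, image_height):
|     """Pixel coordinates -> [0, 1], clamping out-of-bounds values."""
|     clamp = lambda v: min(1.0, max(0.0, v))
|     return {
|         "x0": clamp(bbox["x0"] / image_width),
|         "y0": clamp(bbox["y0"] / image_height),
|         "x1": clamp(bbox["x1"] / image_width),
|         "y1": clamp(bbox["y1"] / image_height),
|         "text": bbox["text"],  # text content preserved unchanged
|     }
| ```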
|
|
| **Error Handling:** |
| - Out-of-bounds coordinates → Clamped to [0, 1] with warning
| - Missing image dimensions → Error, document skipped
| - Zero image dimensions → Error, document marked invalid
|
|
| --- |
|
|
| ### Stage 17: GT Preparation & Verification |
|
|
| **File:** `pipeline_17_gt_preparation_verification.py` |
|
|
| **Purpose:** Validate and prepare final ground truth annotations with fuzzy text matching, task-specific formatting, and comprehensive validation. |
|
|
| **Key Functions:** |
| - `main()`: Main verification orchestrator |
| - `route_task()`: Route to task-specific handler |
| - `handle_qa()`: QA ground truth processing |
| - `handle_kie()`: KIE ground truth with BIO tagging |
| - `handle_dla()`: DLA ground truth processing |
| - `fuzzy_match_text()`: Levenshtein-based text matching |
|
|
| **Process:** |
| 1. Load raw GT from stage 03/06 |
| 2. Load final word bboxes from stage 15 |
| 3. Route to task-specific handler |
| 4. Perform fuzzy text matching (GT text → OCR text)
| 5. Map GT annotations to bbox indices |
| 6. Apply task-specific formatting |
| 7. Validate and save verified GT |
|
|
| **Inputs:** |
| - Raw GT from `raw_annotations/` (stage 03/06) |
| - Word bboxes from `final_word_bboxes/` (stage 15) |
| - Layout elements from `layout_element_definitions/` (stage 06, for DLA) |
| - Visual elements from `visual_element_definitions/` (stage 08, for DLA) |
|
|
| **Outputs:** |
| ``` |
| verified_gt/ |
| qa/ |
| {doc_id}.json # QA format |
| kie/ |
| {doc_id}.json # KIE format with BIO tagging |
| dla/ |
| {doc_id}.json # DLA format with normalized bboxes |
| cls/ |
| {doc_id}.json # Classification format |
| logs/gt_verification/ # Verification logs with match statistics |
| ``` |
|
|
| --- |
|
|
| #### **QA (Question Answering) Format** |
|
|
| **Process:** |
| 1. Load QA pairs: `{"question": "answer", ...}` |
| 2. For each answer: |
| - Find words in bboxes matching answer text (fuzzy) |
| - Record bbox indices of matching words |
| 3. Save verified GT with bbox mappings |
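| 
| 
| A sketch of the answer-to-bbox search, using the stdlib `SequenceMatcher` as a stand-in for the pipeline's `fuzz.ratio` (window sizing and names are assumptions):
| 
| ```python
| from difflib import SequenceMatcher
| 
| def find_answer_indices(answer, words, threshold=0.85):
|     """Slide a small window over OCR words and return the indices of the
|     best fuzzy match for the answer; empty list when below threshold."""
|     n = max(1, len(answer.split()))
|     best_score, best_indices = 0.0, []
|     for i in range(len(words)):
|         for j in range(i + 1, min(i + n + 2, len(words) + 1)):
|             candidate = " ".join(w["text"] for w in words[i:j])
|             score = SequenceMatcher(None, answer.lower(), candidate.lower()).ratio()
|             if score > best_score:
|                 best_score, best_indices = score, list(range(i, j))
|     return best_indices if best_score >= threshold else []
| ```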
|
|
| **Output Format:** |
| ```json |
| { |
| "questions": [ |
| { |
| "question": "What is the invoice number?", |
| "answer": "INV-12345", |
| "answer_bbox_indices": [15, 16] # Word indices in final_word_bboxes |
| }, |
| { |
| "question": "What is the total amount?", |
| "answer": "$1,234.56", |
| "answer_bbox_indices": [42] |
| } |
| ] |
| } |
| ``` |
|
|
| **Fuzzy Matching:** |
| Uses Levenshtein distance with 0.85 similarity cutoff: |
| ```python |
| similarity = fuzz.ratio(gt_answer, ocr_text) / 100.0 |
| if similarity >= 0.85: |
| match_found = True |
| ``` |
|
|
| --- |
|
|
| #### **KIE (Key Information Extraction) Format** |
|
|
| **Process:** |
| 1. Load entity annotations: `{entity_name: text_value, ...}` |
| 2. For each entity: |
| - Find words matching entity value (fuzzy) |
| - Generate BIO tags for all words |
| 3. Save verified GT with BIO tagging |
|
|
| **Output Format:** |
| ```json |
| { |
| "entities": [ |
| { |
| "entity": "company_name", |
| "value": "Acme Corporation", |
| "bbox_indices": [5, 6] |
| }, |
| { |
| "entity": "invoice_date", |
| "value": "January 15, 2024", |
| "bbox_indices": [20, 21, 22] |
| } |
| ], |
| "word_labels": [ |
| "O", "O", "O", "O", "O", # Words 0-4: Outside entities |
| "B-company_name", "I-company_name", # Words 5-6: Company name |
| "O", "O", ..., # Words 7-19: Outside |
| "B-invoice_date", "I-invoice_date", "I-invoice_date" # Words 20-22 |
| ] |
| } |
| ``` |
|
|
| **BIO Tagging:** |
| - `B-entity`: Beginning of entity |
| - `I-entity`: Inside entity (continuation) |
| - `O`: Outside any entity |
|
|
| --- |
|
|
| #### **DLA (Document Layout Analysis) Format** |
|
|
| **Process:** |
| 1. Load layout element definitions from stage 06 |
| 2. Load visual element definitions from stage 08 |
| 3. Validate labels (must be in `valid_labels`) |
| 4. Check spatial constraints (no containment, minimal overlap) |
| 5. Merge visual elements into layout annotations |
| 6. Normalize bboxes to [0, 1] |
| 7. Save verified GT |
|
|
| **Output Format:** |
| ```json |
| { |
| "layout_elements": [ |
| { |
| "id": "layout_0", |
| "label": "title", |
| "bbox": [0.095, 0.128, 0.905, 0.192] # Normalized [x0, y0, x1, y1] |
| }, |
| { |
| "id": "layout_1", |
| "label": "text", |
| "bbox": [0.095, 0.213, 0.905, 0.534] |
| }, |
| { |
| "id": "ve0", |
| "label": "figure", # Visual element mapped to layout label |
| "bbox": [0.714, 0.895, 0.905, 0.980] |
| } |
| ] |
| } |
| ``` |
|
|
| **Visual Element Merging:** |
| Visual elements from stage 08 are converted to layout labels: |
| - `stamp` → `figure` (or custom mapping)
| - `logo` → `figure`
| - `chart` → `figure`
| - `barcode` → `figure`
| - `photo` → `figure`
|
|
| **Spatial Validation:** |
| - **No containment:** One bbox fully inside another → Error
| - **Minimal overlap:** Overlap between bboxes must stay below 5% of the smaller bbox's area; violations → Warning
| - **Valid labels:** All labels must be in `SynDatasetDefinition.valid_labels` |
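| 
| 
| The two geometric checks can be sketched as follows (names assumed; bboxes are `(x0, y0, x1, y1)` tuples):
| 
| ```python
| def contains(outer, inner):
|     """True when inner lies fully inside outer."""
|     return (outer[0] <= inner[0] and outer[1] <= inner[1]
|             and outer[2] >= inner[2] and outer[3] >= inner[3])
| 
| def overlap_ratio(a, b):
|     """Intersection area as a fraction of the smaller bbox's area."""
|     ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
|     iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
|     smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
|     return (ix * iy) / smaller if smaller > 0 else 0.0
| ```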
|
|
| --- |
|
|
| #### **CLS (Classification) Format** |
|
|
| **Process:** |
| 1. Load classification label from raw GT |
| 2. Validate label against expected classes |
| 3. Save verified GT |
|
|
| **Output Format:** |
| ```json |
| { |
| "document_class": "invoice", |
| "confidence": 1.0 |
| } |
| ``` |
|
|
| --- |
|
|
| **Fuzzy Matching Details:** |
|
|
| Uses Levenshtein distance (via the `fuzz` module of a fuzzy-matching library such as RapidFuzz) to handle OCR discrepancies:
| ```python |
| # Example: GT "INV-12345" vs OCR "INV-I2345" (OCR error) |
| similarity = fuzz.ratio("INV-12345", "INV-I2345") / 100.0 |
| # similarity = 0.89 (above 0.85 threshold) → Match!
| ``` |
|
|
| **Match Statistics Logged:** |
| - Total GT annotations |
| - Successfully matched annotations |
| - Failed matches (similarity < 0.85) |
| - Average similarity score |
|
|
| **Key Features:** |
| - **Fuzzy Matching:** Handles OCR errors gracefully |
| - **Task-Specific Formatting:** QA, KIE, DLA, CLS all handled differently |
| - **BIO Tagging:** Automatic generation for KIE |
| - **Visual Element Integration:** DLA merges visual elements as layout annotations |
| - **Spatial Validation:** Detects overlapping/contained layout elements |
| - **Similarity Tracking:** Logs match quality for analysis |
|
|
| **Error Handling:** |
| - Match failure (similarity < 0.85) → Logged, annotation skipped
| - Invalid labels (DLA) → Error, document marked invalid
| - Spatial violations (DLA) → Warning, elements flagged
| - Missing bboxes → Error, document marked invalid
|
|
| --- |
|
|
| ### Stage 18: Analyze |
|
|
| **File:** `pipeline_18_analyze.py` |
|
|
| **Purpose:** Generate comprehensive statistics, cost analysis, and error categorization for the entire dataset generation process. |
|
|
| **Key Functions:** |
| - `main()`: Main analysis orchestrator |
| - `calculate_api_costs()`: Compute LLM API costs from token usage |
|
|
| **Process:** |
| 1. Load all document logs from stages 01-17 |
| 2. Categorize documents: valid vs. invalid |
| 3. Calculate error distributions |
| 4. Compute API usage and costs |
| 5. Generate statistics (handwriting, visual elements, annotations) |
| 6. Save comprehensive dataset log |
|
|
| **Inputs:** |
| - All document logs from previous stages |
| - Batch results from stage 02 (for cost calculation) |
| - Message processing logs from stage 03 |
| - Prompt usage statistics |
|
|
| **Outputs:** |
| ``` |
| dataset_log.json # Comprehensive dataset statistics |
| logs/analysis/ # Analysis logs |
| ``` |
|
|
| **Dataset Log Structure:** |
| ```json |
| { |
| "metadata": { |
| "syndatadef_name": "docvqa_alpha=1.0", |
| "task": "qa", |
| "total_documents_requested": 1000, |
| "generation_date": "2026-02-07" |
| }, |
| |
| "prompting": { |
| "total_prompts": 100, |
| "total_batches": 10, |
| "llm_model": "claude-sonnet-4-20250514" |
| }, |
| |
| "total_cost_summary": { |
| "total_cost_usd": 123.45, |
| "input_tokens": 1500000, |
| "output_tokens": 800000, |
| "cached_tokens": 500000, |
| "cost_per_document": 0.12 |
| }, |
| |
| "valid_samples_stats": { |
| "total_valid": 847, |
| "total_invalid": 153, |
| "validity_rate": 0.847, |
| "avg_handwriting_regions_per_doc": 2.3, |
| "avg_visual_elements_per_doc": 1.5, |
| "avg_annotations_per_doc": 8.7, |
| "documents_with_handwriting": 654, |
| "documents_with_visual_elements": 512 |
| }, |
| |
| "valid_samples": [ |
| { |
| "doc_id": "doc_0001", |
| "seed_image": "docvqa_train_12345", |
| "has_handwriting": true, |
| "has_visual_elements": true, |
| "num_annotations": 10, |
| "num_words": 247, |
| "image_size": [1654, 2339] |
| } |
| ], |
| |
| "valid_samples_by_category": { |
| "invoice": 234, |
| "receipt": 198, |
| "form": 415 |
| }, |
| |
| "errors": { |
| "multipage_pdf": 12, |
| "missing_ocr_result": 5, |
| "failed_gt_verification": 38, |
| "rendering_timeout": 8, |
| "llm_parsing_error": 23, |
| "bbox_extraction_failed": 4, |
| "handwriting_generation_failed": 7, |
| "visual_element_generation_failed": 3, |
| "other": 53 |
| } |
| } |
| ``` |
|
|
| **Error Categories:** |
|
|
| | Error Category | Description | |
| |---------------|-------------| |
| | `multipage_pdf` | PDF rendered with multiple pages (invalid) | |
| | `missing_ocr_result` | OCR failed or returned empty result | |
| | `failed_gt_verification` | GT matching/validation failed in stage 17 | |
| | `rendering_timeout` | PDF rendering exceeded timeout | |
| | `llm_parsing_error` | Failed to extract HTML/GT from LLM response | |
| | `bbox_extraction_failed` | PyMuPDF failed to extract bboxes | |
| | `handwriting_generation_failed` | Diffusion model failed | |
| | `visual_element_generation_failed` | VE generation/selection failed | |
| | `other` | Miscellaneous errors | |
|
|
| **Cost Calculation:** |
|
|
| **Claude API Pricing (example):** |
| ```python |
| costs = { |
| "input": input_tokens * 0.003 / 1000, # $3 per 1M tokens |
| "output": output_tokens * 0.015 / 1000, # $15 per 1M tokens |
| "cached": cached_tokens * 0.0003 / 1000 # $0.30 per 1M tokens (prompt caching) |
| } |
| total_cost = costs["input"] + costs["output"] + costs["cached"] |
| ``` |
|
|
| **Statistics Computed:** |
| - **Validity Rate:** Percentage of documents passing all stages |
| - **Handwriting Stats:** Documents with handwriting, avg regions per doc |
| - **Visual Element Stats:** Documents with VEs, avg elements per doc |
| - **Annotation Stats:** Avg QA pairs/KIE entities/DLA elements per doc |
| - **Token Usage:** Input/output/cached token totals |
| - **Cost Metrics:** Total cost, cost per document, cost per valid document |
|
|
| **Key Features:** |
| - **Comprehensive Error Tracking:** All error categories logged |
| - **Cost Transparency:** Token-level cost breakdown |
| - **Quality Metrics:** Validity rate, avg annotations, etc. |
| - **Category Breakdown:** Valid samples grouped by document type |
| - **Per-Document Tracking:** Each valid document's metadata saved |
|
|
| **Usage:** |
| This log is essential for: |
| - Understanding dataset quality |
| - Optimizing pipeline (identify bottlenecks) |
| - Cost estimation for future runs |
| - Debugging (error distribution analysis) |
|
|
| --- |
|
|
| ### Stage 19: Create Debug Data |
|
|
| **File:** `pipeline_19_create_debug_data.py` |
|
|
| **Purpose:** Generate comprehensive debug visualizations for manual inspection and quality assurance. |
|
|
| **Key Functions:** |
| - `main()`: Main debug generator |
| - `visualize_visual_element_bboxes()`: Overlay VE bboxes on PDFs |
| - `visualize_final_bboxes_on_images()`: Overlay OCR bboxes on images |
| - `visualize_pdf_bboxes()`: Overlay PDF bboxes |
|
|
| **Process:** |
| 1. Load all generated documents |
| 2. Create debug subdirectories |
| 3. For each document: |
| - Generate PDF bbox overlays |
| - Generate VE bbox overlays |
| - Generate handwriting insertion region overlays |
| - Generate final OCR bbox overlays on images |
| 4. Copy raw HTML with debug.js script for browser inspection |
|
|
| **Inputs:** |
| - All intermediate and final outputs from stages 01-18 |
|
|
| **Outputs:** |
| ``` |
| debug/ |
| pdf_bboxes/ # PDF word bboxes overlaid (stage 05) |
| {doc_id}.pdf |
| visual_element_bboxes/ # VE bboxes overlaid (stage 08) |
| {doc_id}.pdf |
| handwriting_insertion/ # Handwriting regions overlaid (stage 12) |
| {doc_id}.pdf |
| final_bboxes_on_images/ # OCR bboxes on final images (stage 15) |
| {doc_id}.png |
| html_with_debug/ # Raw HTML + debug.js |
| {doc_id}.html |
| debug.js # Browser-based inspection script |
| ``` |
|
|
| **Debug Visualizations:** |
|
|
| **1. PDF BBoxes (Stage 05):** |
| - Red rectangles: Word bounding boxes from PyMuPDF |
| - Annotated with word text |
| - Purpose: Verify PDF text extraction quality |
|
|
| **2. Visual Element BBoxes (Stage 08):** |
| - Blue rectangles: Visual element placeholder regions |
| - Annotated with element type (stamp, logo, etc.) |
| - Purpose: Verify VE extraction and positioning |
|
|
| **3. Handwriting Insertion Regions (Stage 12):** |
| - Green rectangles: Handwriting bbox regions |
| - Annotated with handwriting text |
| - Purpose: Verify handwriting placement accuracy |
|
|
| **4. Final BBoxes on Images (Stage 15):** |
| - Orange rectangles: OCR word bounding boxes |
| - Overlaid on final rendered images |
| - Purpose: Verify OCR accuracy and coverage |
|
|
| **Debug JavaScript (debug.js):** |
| ```javascript |
| // Browser-based inspection tool |
| // Features: |
| // - Highlight elements on hover |
| // - Show geometry data in console |
| // - Toggle element visibility |
| // - Measure element dimensions |
| ``` |
|
|
| **Key Features:** |
| - **Color-Coded Overlays:** Different colors for different stages |
| - **Text Annotations:** Bboxes labeled with content |
| - **Multi-Format:** Both PDF and PNG visualizations |
| - **Browser Inspection:** HTML with interactive debug script |
| - **Selective Generation:** Only enabled with `DEBUG_MODE=true` flag |
|
|
| **Configuration:** |
| - `DEBUG_MODE`: Enable/disable debug output (default: false) |
| - `DEBUG_BBOX_LINE_WIDTH`: Line thickness for overlays (default: 2) |
| - `DEBUG_BBOX_OPACITY`: Overlay transparency (default: 0.5) |
|
|
| **When to Use:** |
| - Visual quality assurance |
| - Debugging bbox extraction issues |
| - Verifying handwriting/VE insertion |
| - Identifying OCR problems |
| - Manual inspection of edge cases |
|
|
| **Performance Note:** |
| Debug generation adds ~20% processing time; disabled by default in production. |
|
|
| --- |
|
|
| ## API Implementation |
|
|
| The DocGenie API provides a FastAPI-based REST service for synchronous document generation, integrating pipeline stages 01-06. |
|
|
| ### Architecture Overview |
|
|
| ``` |
| ┌──────────────────────────────────────────────────────────────────┐
| │                         API ARCHITECTURE                         │
| └──────────────────────────────────────────────────────────────────┘
| 
| Client Request
|       │
|       ▼
| ┌─────────────────────┐
| │   FastAPI Server    │
| │      (main.py)      │
| └─────────────────────┘
|       │
|       ├──► Validate Request (schemas.py)
|       │
|       ├──► Download Seed Images (utils.py)
|       │
|       ├──► Build Prompt (utils.py)
|       │
|       ├──► Call Claude API (Synchronous)
|       │      └──► Claude Sonnet 4.5
|       │
|       ├──► Extract HTML & GT (pipeline_03 functions)
|       │
|       ├──► Render PDF (pipeline_04 functions)
|       │      └──► Playwright/Chromium
|       │
|       ├──► Extract BBoxes (pipeline_05 functions)
|       │      └──► PyMuPDF
|       │
|       └──► Return Response
|              └──► JSON with base64-encoded PDFs
| ``` |
|
|
| --- |
|
|
| ### Endpoints |
|
|
| #### **GET /** |
| Health check endpoint. |
|
|
| **Response:** |
| ```json |
| { |
| "message": "DocGenie API is running", |
| "version": "1.0", |
| "status": "healthy" |
| } |
| ``` |
|
|
| --- |
|
|
| #### **GET /health** |
| Detailed health status. |
|
|
| **Response:** |
| ```json |
| { |
| "status": "healthy", |
| "api_version": "1.0", |
| "playwright_available": true, |
| "llm_model": "claude-sonnet-4-5-20250929" |
| } |
| ``` |
|
|
| --- |
|
|
| #### **POST /generate** |
| Generate documents with ground truth annotations. |
|
|
| **Request Body** (`schemas.GenerateDocumentRequest`): |
| ```json |
| { |
| "seed_images": [ |
| "https://example.com/seed1.jpg", |
| "https://example.com/seed2.jpg" |
| ], |
| "prompt_params": { |
| "language": "English", |
| "doc_type": "business and administrative", |
| "gt_type": "Multiple questions and their answers", |
| "gt_format": "{\"question\": \"answer\", ...}", |
| "num_solutions": 2 |
| } |
| } |
| ``` |
|
|
| **Request Validation:** |
| - `seed_images`: 1-10 URLs (HTTPS only) |
| - `num_solutions`: 1-5 documents |
| - All prompt parameters required |
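|
|
| As a rough sketch, the same rules can be checked with the standard library alone. The service itself enforces them through Pydantic in `api/schemas.py`; the helper name below is illustrative, not part of the codebase: |
|
| ```python |
| from urllib.parse import urlparse |
|
| def validate_seed_images(urls): |
|     """Mirror the documented request rules: 1-10 seed image URLs, HTTPS only.""" |
|     if not 1 <= len(urls) <= 10: |
|         raise ValueError(f"expected 1-10 seed images, got {len(urls)}") |
|     for url in urls: |
|         if urlparse(url).scheme != "https": |
|             raise ValueError(f"seed image URL must use HTTPS: {url}") |
|     return urls |
| ``` |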
|
|
| **Response** (`schemas.GenerateDocumentResponse`): |
| ```json |
| { |
| "success": true, |
| "message": "Successfully generated 2 documents", |
| "documents": [ |
| { |
| "document_id": "doc_20260207_001", |
| "html": "<html>...</html>", |
| "css": "body { ... }", |
| "ground_truth": { |
| "What is the invoice number?": "INV-12345" |
| }, |
| "pdf_base64": "JVBERi0xLjQK...", |
| "bboxes": [ |
| { |
| "x0": 20.0, |
| "y0": 30.0, |
| "x1": 60.0, |
| "y1": 45.0, |
| "text": "Invoice" |
| } |
| ], |
| "page_width_mm": 210.0, |
| "page_height_mm": 297.0 |
| } |
| ], |
| "total_documents": 2 |
| } |
| ``` |
|
|
| --- |
|
|
| #### **POST /generate-files** |
| Generate documents and return as downloadable files. |
|
|
| **Request:** Same as `/generate` |
|
|
| **Response:** |
| - **Content-Type:** `application/zip` |
| - **File:** ZIP archive containing: |
| - `doc_001.pdf` |
| - `doc_001_gt.json` |
| - `doc_001_bboxes.json` |
| - `doc_002.pdf` |
| - ... |
|
|
| **File Structure:** |
| ``` |
| generated_documents.zip |
| ├── doc_001.pdf |
| ├── doc_001_gt.json |
| ├── doc_001_bboxes.json |
| ├── doc_001_metadata.json |
| ├── doc_002.pdf |
| └── ... |
| ``` |
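|
|
| On the client side, the archive can be inspected in memory before anything is written to disk. A small stdlib sketch, assuming `zip_bytes` holds the raw response body (e.g. `response.content` when using `requests`): |
|
| ```python |
| import io |
| import zipfile |
|
| def list_zip_entries(zip_bytes: bytes) -> list: |
|     # Open the /generate-files archive from memory and list its members |
|     with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf: |
|         return zf.namelist() |
| ``` |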
|
|
| --- |
|
|
| ### Request/Response Schemas |
|
|
| Defined in `api/schemas.py`: |
|
|
| #### **PromptParameters** |
| ```python |
| class PromptParameters(BaseModel): |
| language: str = "English" |
| doc_type: str = "business and administrative" |
| gt_type: str = "Multiple questions and their answers" |
| gt_format: str = '{"question": "answer", ...}' |
| num_solutions: int = Field(default=1, ge=1, le=5) |
| ``` |
|
|
| #### **GenerateDocumentRequest** |
| ```python |
| class GenerateDocumentRequest(BaseModel): |
| seed_images: List[HttpUrl] = Field(..., min_items=1, max_items=10) |
| prompt_params: PromptParameters |
| ``` |
|
|
| #### **BoundingBox** |
| ```python |
| class BoundingBox(BaseModel): |
| x0: float |
| y0: float |
| x1: float |
| y1: float |
| text: str |
| ``` |
|
|
| #### **DocumentResult** |
| ```python |
| class DocumentResult(BaseModel): |
| document_id: str |
| html: str |
| css: str |
| ground_truth: Optional[dict] |
| pdf_base64: str |
| bboxes: List[BoundingBox] |
| page_width_mm: float |
| page_height_mm: float |
| ``` |
|
|
| #### **GenerateDocumentResponse** |
| ```python |
| class GenerateDocumentResponse(BaseModel): |
| success: bool |
| message: str |
| documents: List[DocumentResult] |
| total_documents: int |
| ``` |
|
|
| --- |
|
|
| ### API Pipeline Flow |
|
|
| Detailed integration with pipeline stages: |
|
|
| ```python |
| # Simplified API flow (from api/main.py) |
| |
| @app.post("/generate", response_model=GenerateDocumentResponse) |
| async def generate_documents(request: GenerateDocumentRequest): |
| # 1. Download and encode seed images |
| seed_images_base64 = await download_and_encode_images(request.seed_images) |
| |
| # 2. Build prompt from template |
| prompt = build_prompt_from_template( |
| seed_images=seed_images_base64, |
| params=request.prompt_params |
| ) |
| |
| # 3. Call Claude API (synchronous, not batched) |
| llm_response = await call_claude_api( |
| prompt=prompt, |
| model="claude-sonnet-4-5-20250929" |
| ) |
| |
| # 4. Extract HTML and GT (from pipeline_03) |
| documents_html = extract_html_from_message(llm_response) |
| documents_gt = extract_gt_from_html(documents_html) |
| |
| # 5. Validate HTML (from utils.py) |
| validate_html_structure(documents_html) |
| |
| # 6. Render PDFs (from pipeline_04) |
| pdfs = await render_pdfs_with_playwright(documents_html) |
| |
| # 7. Validate PDFs (from utils.py) |
| validate_pdf_pages(pdfs) |
| |
| # 8. Extract bboxes (from pipeline_05) |
| bboxes = extract_bboxes_from_pdfs(pdfs) |
| |
| # 9. Validate bboxes (from utils.py) |
| validate_bbox_completeness(bboxes) |
| |
| # 10. Encode PDFs to base64 |
| pdfs_base64 = encode_pdfs_to_base64(pdfs) |
| |
| # 11. Build response |
| return GenerateDocumentResponse( |
| success=True, |
| documents=[...], |
| total_documents=len(documents_html) |
| ) |
| ``` |
|
|
| --- |
|
|
| ### Integration with Pipeline Functions |
|
|
| **Reused from Pipeline:** |
| - `extract_html_from_message()` from `pipeline_03` |
| - `extract_gt_from_html()` from `pipeline_03` |
| - `render_pdf_with_playwright()` from `pipeline_04` |
| - `extract_bboxes_from_pdf()` from `pipeline_05` |
| - Various utilities from `docgenie.generation.utils` |
|
|
| **API-Specific Functions (api/utils.py):** |
| - `download_seed_images()`: Fetch images from URLs |
| - `encode_images_to_base64()`: Convert images for API transmission |
| - `build_prompt_from_template()`: Template-based prompt construction |
| - `call_claude_api_sync()`: Synchronous Claude API call (non-batched) |
| - `encode_pdf_to_base64()`: PDF encoding for response |
| - `validate_html_structure()`: HTML validation |
| - `validate_pdf_pages()`: PDF page count/size validation |
| - `validate_bbox_completeness()`: Ensure bboxes extracted |
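|
|
| The base64 round-trip behind `encode_pdf_to_base64()` is straightforward; a minimal sketch (the decode helper is added here for illustration). Note that `"JVBERi0xLjQK"` in the response example above is exactly the base64 encoding of the `%PDF-1.4` header: |
|
| ```python |
| import base64 |
|
| def encode_pdf_to_base64(pdf_bytes: bytes) -> str: |
|     # PDFs are binary, so they travel inside the JSON response as base64 text |
|     return base64.b64encode(pdf_bytes).decode("ascii") |
|
| def decode_pdf_from_base64(pdf_base64: str) -> bytes: |
|     # Client-side inverse: recover the raw PDF bytes |
|     return base64.b64decode(pdf_base64) |
| ``` |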
|
|
| --- |
|
|
| ### Configuration |
|
|
| #### **Environment Variables (.env)** |
| ```bash |
| ANTHROPIC_API_KEY=sk-ant-... # Required for Claude API |
| LLM_MODEL=claude-sonnet-4-5-20250929 # Default model |
| API_PORT=8000 # Server port |
| DEBUG_MODE=false # Enable debug logging |
| ``` |
|
|
| #### **Prompt Templates** |
| Located in `data/prompt_templates/`: |
|
|
| **Template Structure:** |
| ``` |
| data/prompt_templates/<template_name>/ |
| ├── system_prompt.txt           # System message |
| ├── user_prompt_template.txt    # User message template |
| └── example_output.html         # Example for few-shot |
| ``` |
|
|
| **Placeholder Substitution:** |
| ```python |
| # In user_prompt_template.txt |
| """ |
| Please generate {num_solutions} {doc_type} documents in {language}. |
| |
| Ground truth format: {gt_format} |
| Ground truth type: {gt_type} |
| """ |
| |
| # Substituted with request.prompt_params |
| ``` |
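|
|
| The substitution step amounts to filling `{placeholder}` slots with the request's `prompt_params` values. A minimal sketch (the real implementation in `api/utils.py` may differ): |
|
| ```python |
| def build_prompt_from_template(template: str, params: dict) -> str: |
|     # str.format fills each {placeholder}; it raises KeyError if a |
|     # required placeholder has no corresponding parameter |
|     return template.format(**params) |
|
| template = "Please generate {num_solutions} {doc_type} documents in {language}." |
| prompt = build_prompt_from_template( |
|     template, |
|     {"num_solutions": 2, "doc_type": "invoices", "language": "English"}, |
| ) |
| ``` |
|
| Note that any literal braces inside a template (e.g. an inline JSON example) would need escaping as `{{`/`}}` for `str.format` to pass them through unchanged. |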
|
|
| #### **CORS Configuration** |
| ```python |
| # In api/main.py |
| app.add_middleware( |
| CORSMiddleware, |
| allow_origins=["*"], # Open for development |
| allow_credentials=True, |
| allow_methods=["*"], |
| allow_headers=["*"], |
| ) |
| ``` |
|
|
| --- |
|
|
| ### Authentication & Security |
|
|
| **Current State:** |
| - **No built-in authentication:** API is open (for development) |
| - **API Key Management:** Claude API key in `.env`, not passed in requests |
| - **CORS:** Open (`allow_origins=["*"]`) for development |
|
|
| **Production Recommendations:** |
| - Add API key authentication (e.g., Bearer tokens) |
| - Restrict CORS origins to known frontends |
| - Rate limiting (e.g., Redis-based) |
| - Input sanitization for HTML injection prevention |
| - HTTPS only (terminate SSL at reverse proxy) |
|
|
| --- |
|
|
| ### Performance Considerations |
|
|
| **Async Rendering:** |
| - Playwright rendering uses async/await |
| - Concurrent requests supported by FastAPI |
| - Semaphore control prevents resource exhaustion |
|
|
| **Image Size Limits:** |
| - Seed images compressed before API transmission |
| - Max dimension: 1024px (configurable) |
| - JPEG quality: 85% (configurable) |
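|
|
| The resize math behind the max-dimension limit can be sketched as follows (the actual code likely delegates to PIL, e.g. `Image.thumbnail`; this only shows the dimension calculation): |
|
| ```python |
| def fit_within(width: int, height: int, max_dim: int = 1024) -> tuple: |
|     # Downscale so the longer side is at most max_dim, preserving the |
|     # aspect ratio; images already within the limit are left untouched |
|     scale = min(1.0, max_dim / max(width, height)) |
|     return round(width * scale), round(height * scale) |
| ``` |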
|
|
| **Timeouts:** |
| - PDF render timeout: 60 seconds per document |
| - Claude API timeout: 120 seconds |
| - Total request timeout: 300 seconds (5 minutes) |
|
|
| **Retry Logic:** |
| - PDF rendering: Up to 3 attempts |
| - Claude API: Up to 2 retries on network errors |
| - No retry on validation failures |
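|
|
| That retry policy can be sketched as a small helper: transient network errors are retried, anything else propagates immediately. This is an illustrative shape, not the API's actual implementation: |
|
| ```python |
| import time |
|
| def with_retries(fn, max_attempts=3, delay=2.0, |
|                  retriable=(ConnectionError, TimeoutError)): |
|     # Retry only the listed transient errors; validation errors and |
|     # other exceptions propagate on the first occurrence |
|     for attempt in range(1, max_attempts + 1): |
|         try: |
|             return fn() |
|         except retriable: |
|             if attempt == max_attempts: |
|                 raise |
|             time.sleep(delay) |
| ``` |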
|
|
| **Concurrency:** |
| - FastAPI default: Multiple workers (configurable) |
| - Playwright: Semaphore-controlled (10 concurrent renders) |
|
|
| --- |
|
|
| ### Limitations |
|
|
| **Current Limitations:** |
| 1. **Single-Page Only:** Multi-page PDFs flagged as errors |
| 2. **Seed Image Limit:** Maximum 10 seed images per request |
| 3. **Document Limit:** Maximum 5 document variations (`num_solutions`) |
| 4. **No Handwriting/Visual Elements:** Stages 07-13 not integrated (API stops at stage 06) |
| 5. **Synchronous LLM:** No batching (higher cost per document) |
| 6. **No Dataset Export:** No `/export-dataset` endpoint for full pipeline runs |
|
|
| **Known Issues:** |
| - Large documents (>10 pages' worth of content) may time out |
| - Complex CSS (animations, 3D transforms) may not render correctly |
| - Some Unicode characters may not display in PDFs |
|
|
| --- |
|
|
| ### Future Integration Plan |
|
|
| From `api/PIPELINE_INTEGRATION.md`: |
|
|
| #### **Stage 3: Handwriting & Visual Elements (Stages 07-11)** |
|
|
| **New Request Parameters:** |
| ```python |
| class PromptParameters(BaseModel): |
| # ... existing ... |
| enable_handwriting: bool = False |
| handwriting_ratio: float = Field(default=0.2, ge=0.0, le=1.0) |
| enable_visual_elements: bool = False |
| visual_element_types: List[str] = ["stamp", "logo"] |
| ``` |
|
|
| **New Response Fields:** |
| ```python |
| class DocumentResult(BaseModel): |
| # ... existing ... |
| handwriting_regions: Optional[List[HandwritingRegion]] |
| visual_elements: Optional[List[VisualElement]] |
| ``` |
|
|
| **Impact:** |
| - Longer processing time (diffusion model: ~5s per handwriting region) |
| - Larger response size (additional images) |
|
|
| --- |
|
|
| #### **Stage 4: Image Finalization & OCR (Stages 12-15)** |
|
|
| **New Response Fields:** |
| ```python |
| class DocumentResult(BaseModel): |
| # ... existing ... |
| image_base64: str # Final rendered image (PNG) |
| ocr_text: str # Full OCR text |
| ocr_confidence: float # Average OCR confidence |
| ``` |
|
|
| **Impact:** |
| - OCR API costs (~$1.50 per 1000 images) |
| - Additional 2-3 seconds per document (OCR latency) |
|
|
| --- |
|
|
| #### **Stage 5: Dataset Packaging (Stages 16-19)** |
|
|
| **New Endpoint:** |
| ```python |
| @app.post("/export-dataset") |
| async def export_dataset(request: ExportDatasetRequest): |
| """ |
| Run full pipeline (stages 01-19) and return packaged dataset. |
| """ |
| # Run pipeline with syndatadef |
| # Return ZIP with images, GTs, statistics |
| ``` |
|
|
| **Request:** |
| ```python |
| class ExportDatasetRequest(BaseModel): |
| syndatadef_config: dict # Full SynDatasetDefinition |
| output_format: str = "huggingface" # "huggingface", "coco", "custom" |
| ``` |
|
|
| **Response:** |
| - ZIP archive with full dataset |
| - Includes `dataset_log.json` from stage 18 |
|
|
| --- |
|
|
| ### Example Usage |
|
|
| #### **Python Client (api/example_usage.py)** |
| |
| ```python |
| import base64 |
| import json |
| from pathlib import Path |
|
| import requests |
| |
| # API endpoint |
| API_URL = "http://localhost:8000/generate" |
| |
| # Prepare request |
| request_data = { |
| "seed_images": [ |
| "https://example.com/invoice_seed.jpg", |
| "https://example.com/receipt_seed.jpg" |
| ], |
| "prompt_params": { |
| "language": "English", |
| "doc_type": "invoices and receipts", |
| "gt_type": "Multiple questions and their answers", |
| "gt_format": '{"question": "answer", ...}', |
| "num_solutions": 3 |
| } |
| } |
| |
| # Call API |
| response = requests.post(API_URL, json=request_data) |
| result = response.json() |
| |
| # Process results |
| if result["success"]: |
| for doc in result["documents"]: |
| doc_id = doc["document_id"] |
| |
| # Save PDF |
| pdf_data = base64.b64decode(doc["pdf_base64"]) |
| Path(f"{doc_id}.pdf").write_bytes(pdf_data) |
| |
| # Save GT |
| Path(f"{doc_id}_gt.json").write_text( |
| json.dumps(doc["ground_truth"], indent=2) |
| ) |
| |
| # Save HTML |
| Path(f"{doc_id}.html").write_text(doc["html"]) |
| |
| print(f"Saved: {doc_id}") |
| print(f" - BBoxes: {len(doc['bboxes'])}") |
| print(f" - GT Annotations: {len(doc['ground_truth'])}") |
| ``` |
| |
| #### **cURL Example** |
|
|
| ```bash |
| curl -X POST http://localhost:8000/generate \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "seed_images": [ |
| "https://example.com/seed.jpg" |
| ], |
| "prompt_params": { |
| "language": "English", |
| "doc_type": "business documents", |
| "gt_type": "Questions and answers", |
| "gt_format": "{\"question\": \"answer\"}", |
| "num_solutions": 1 |
| } |
| }' |
| ``` |
|
|
| #### **JavaScript/TypeScript Client** |
|
|
| ```typescript |
| const response = await fetch('http://localhost:8000/generate', { |
| method: 'POST', |
| headers: { 'Content-Type': 'application/json' }, |
| body: JSON.stringify({ |
| seed_images: ['https://example.com/seed.jpg'], |
| prompt_params: { |
| language: 'English', |
| doc_type: 'invoices', |
| gt_type: 'Questions and answers', |
| gt_format: '{"question": "answer"}', |
| num_solutions: 2 |
| } |
| }) |
| }); |
| |
| const result = await response.json(); |
| |
| // Decode PDF |
| const pdfBlob = new Blob( |
| [Uint8Array.from(atob(result.documents[0].pdf_base64), c => c.charCodeAt(0))], |
| { type: 'application/pdf' } |
| ); |
| |
| // Download PDF |
| const url = URL.createObjectURL(pdfBlob); |
| const a = document.createElement('a'); |
| a.href = url; |
| a.download = `${result.documents[0].document_id}.pdf`; |
| a.click(); |
| ``` |
|
|
| --- |
|
|
| ### Testing |
|
|
| **Test File:** `api/test_api.py` |
|
|
| ```python |
| import pytest |
| from fastapi.testclient import TestClient |
| from api.main import app |
| |
| client = TestClient(app) |
| |
| def test_health_check(): |
| response = client.get("/health") |
| assert response.status_code == 200 |
| assert response.json()["status"] == "healthy" |
| |
| def test_generate_documents(): |
| request_data = { |
| "seed_images": ["https://example.com/seed.jpg"], |
| "prompt_params": { |
| "language": "English", |
| "doc_type": "invoices", |
| "gt_type": "Questions", |
| "gt_format": "{\"q\": \"a\"}", |
| "num_solutions": 1 |
| } |
| } |
| response = client.post("/generate", json=request_data) |
| assert response.status_code == 200 |
| result = response.json() |
| assert result["success"] is True |
| assert len(result["documents"]) > 0 |
| |
| def test_invalid_request(): |
| request_data = { |
| "seed_images": [], # Invalid: empty |
| "prompt_params": {"num_solutions": 10} # Invalid: > 5 |
| } |
| response = client.post("/generate", json=request_data) |
| assert response.status_code == 422 # Validation error |
| ``` |
|
|
| **Run Tests:** |
| ```bash |
| cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie |
| pytest api/test_api.py -v |
| ``` |
|
|
| --- |
|
|
| ### Deployment |
|
|
| **Start Server:** |
| ```bash |
| cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie/api |
| chmod +x start.sh |
| ./start.sh |
| ``` |
|
|
| **start.sh Contents:** |
| ```bash |
| #!/bin/bash |
| export $(cat .env | xargs) |
| uvicorn main:app --host 0.0.0.0 --port 8000 --reload |
| ``` |
|
|
| **Production Deployment:** |
| ```bash |
| # With Gunicorn (multi-worker) |
| gunicorn main:app \ |
| --workers 4 \ |
| --worker-class uvicorn.workers.UvicornWorker \ |
| --bind 0.0.0.0:8000 \ |
| --timeout 300 |
| ``` |
|
|
| **Docker Deployment:** |
| ```dockerfile |
| FROM python:3.10-slim |
| |
| WORKDIR /app |
| COPY requirements.txt . |
| RUN pip install --no-cache-dir -r requirements.txt |
| |
| # Install Playwright browsers |
| RUN playwright install chromium |
| RUN playwright install-deps |
| |
| COPY . . |
| |
| CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] |
| ``` |
|
|
| --- |
|
|
| ## Core Models & Utilities |
|
|
| ### SynDatasetDefinition Model |
|
|
| **File:** `docgenie/generation/models/_syndatadef.py` |
|
|
| **Purpose:** Configuration object for synthetic dataset generation, loaded from YAML files. |
|
|
| **Key Attributes:** |
| ```python |
| @dataclass |
| class SynDatasetDefinition: |
| # Dataset metadata |
| name: str # "docvqa_alpha=1.0" |
| task: TaskType # TaskType.QA, .KIE, .DLA, .CLS |
| base_dataset: str # "docvqa", "cord", etc. |
| |
| # LLM configuration |
| llm_model: str # "claude-sonnet-4-20250514" |
| prompt_template: str # "DocGenie" |
| num_solutions: int # Documents per prompt |
| |
| # Prompting parameters |
| language: str # "English", "German", etc. |
| doc_type: str # "business and administrative" |
| gt_type: str # Task-specific GT description |
| gt_format: str # Expected GT format |
| |
| # Dataset parameters |
| num_samples: int # Total documents to generate |
| alpha: float # Clustering diversity parameter |
| |
| # Seed selection |
| seeds_per_cluster: int # Seeds sampled per cluster |
| clustering_method: str # "kmeans", "agglomerative" |
| |
| # Task-specific |
| valid_labels: Optional[List[str]] # For DLA: ["title", "text", ...] |
| |
| # Output paths |
| output_dir: Path # Root output directory |
| ``` |
|
|
| **Key Methods:** |
| ```python |
| def get_file_structure(self) -> FileStructure: |
| """Returns FileStructure manager for output directories.""" |
| |
| def build_prompt(self, seed_images: List[str]) -> str: |
| """Builds prompt from template with parameter substitution.""" |
| |
| def iter_document_logs(self) -> Iterator[DocumentLog]: |
| """Iterates over all document logs.""" |
| |
| def update_document_status(self, doc_id: str, status: Status): |
| """Updates document status in log.""" |
| |
| @classmethod |
| def from_yaml(cls, yaml_path: Path) -> "SynDatasetDefinition": |
| """Load configuration from YAML file.""" |
| ``` |
|
|
| **Example YAML:** |
| ```yaml |
| name: "docvqa_alpha=1.0" |
| task: "qa" |
| base_dataset: "docvqa" |
| |
| llm_model: "claude-sonnet-4-20250514" |
| prompt_template: "DocGenie" |
| num_solutions: 1 |
| |
| language: "English" |
| doc_type: "business and administrative documents" |
| gt_type: "Multiple questions and their answers" |
| gt_format: '{"question": "answer", ...}' |
| |
| num_samples: 1000 |
| alpha: 1.0 |
| seeds_per_cluster: 10 |
| clustering_method: "kmeans" |
| |
| output_dir: "data/datasets/synthesized_datasets/docvqa_alpha=1.0" |
| ``` |
|
|
| --- |
|
|
| ### PipelineParameters Model |
|
|
| **File:** `docgenie/generation/models/_pipeline.py` |
|
|
| **Purpose:** Runtime parameters for pipeline execution. |
|
|
| **Attributes:** |
| ```python |
| @dataclass |
| class PipelineParameters: |
| # Execution control |
| start_stage: int = 1 # First stage to execute |
| end_stage: int = 19 # Last stage to execute |
| skip_existing: bool = True # Skip documents with existing outputs |
| |
| # Parallelization |
| max_workers: int = 10 # Concurrent processing |
| chromium_concurrency: int = 10 # Parallel PDF renders |
| |
| # Debug mode |
| debug_mode: bool = False # Enable debug visualizations |
| |
| # Retry configuration |
| max_retries: int = 3 # Retry attempts |
| retry_delay: float = 2.0 # Seconds between retries |
| |
| # Timeouts |
| pdf_render_timeout: int = 60 # Seconds |
| ocr_timeout: int = 30 # Seconds |
| ``` |
|
|
| --- |
|
|
| ### FileStructure Model |
|
|
| **File:** `docgenie/generation/models/_file.py` |
|
|
| **Purpose:** Manages directory structure for generated data. |
|
|
| **Key Properties:** |
| ```python |
| @dataclass |
| class FileStructure: |
| root: Path # Root output directory |
| |
| # Directory properties (all return Path) |
| seeds_directory: Path |
| prompt_batches_directory: Path |
| message_results_directory: Path |
| raw_html_directory: Path |
| raw_annotations_directory: Path |
| geometries_directory: Path |
| pdf_initial_directory: Path |
| render_html_directory: Path |
| pdf_word_bboxes_directory: Path |
| pdf_char_bboxes_directory: Path |
| layout_element_definitions_directory: Path |
| handwriting_definitions_directory: Path |
| visual_element_definitions_directory: Path |
| handwriting_images_directory: Path |
| visual_element_images_directory: Path |
| pdf_without_handwriting_placeholder_directory: Path |
| pdf_with_handwriting_directory: Path |
| pdf_final_directory: Path |
| images_directory: Path |
| final_word_bboxes_directory: Path |
| final_segment_bboxes_directory: Path |
| normalized_word_bboxes_directory: Path |
| normalized_segment_bboxes_directory: Path |
| verified_gt_directory: Path |
| |
| # Debug subdirectories |
| debug_directory: Path |
| debug_pdf_bboxes_directory: Path |
| debug_visual_element_bboxes_directory: Path |
| debug_handwriting_insertion_directory: Path |
| debug_final_bboxes_on_images_directory: Path |
| debug_html_with_debug_directory: Path |
| ``` |
|
|
| **Key Methods:** |
| ```python |
| def create_all_directories(self): |
| """Create all output directories.""" |
| |
| def get_document_path(self, doc_id: str, stage: str) -> Path: |
| """Get path for specific document and stage.""" |
| ``` |
|
|
| --- |
|
|
| ### DocumentLog Model |
|
|
| **File:** `docgenie/generation/models/_log.py` |
|
|
| **Purpose:** Document-level metadata and status tracking. |
|
|
| **Attributes:** |
| ```python |
| @dataclass |
| class DocumentLog: |
| doc_id: str # Unique document ID |
| seed_image_id: str # Source seed image |
| prompt_call_id: str # Prompt batch/call ID |
| |
| status: Status # VALID, INVALID, PROCESSING |
| |
| # Stage completion flags |
| has_raw_html: bool = False |
| has_raw_gt: bool = False |
| has_pdf_initial: bool = False |
| has_geometries: bool = False |
| has_bboxes: bool = False |
| has_handwriting: bool = False |
| has_visual_elements: bool = False |
| has_final_image: bool = False |
| has_ocr_result: bool = False |
| has_verified_gt: bool = False |
| |
| # Statistics |
| num_words: int = 0 |
| num_annotations: int = 0 |
| num_handwriting_regions: int = 0 |
| num_visual_elements: int = 0 |
| |
| # Error tracking |
| error_stage: Optional[str] = None |
| error_message: Optional[str] = None |
| error_category: Optional[str] = None |
| |
| # Timestamps |
| created_at: datetime |
| updated_at: datetime |
| ``` |
|
|
| **Key Methods:** |
| ```python |
| def mark_stage_complete(self, stage: str): |
| """Mark pipeline stage as complete.""" |
| |
| def mark_error(self, stage: str, error_msg: str, category: str): |
| """Record error and mark document as invalid.""" |
| |
| def is_valid(self) -> bool: |
| """Check if document passed all stages.""" |
| ``` |
|
|
| --- |
|
|
| ### BBox Model |
|
|
| **File:** `docgenie/generation/models/_bbox.py` |
|
|
| **Purpose:** Bounding box representation with text content. |
|
|
| **Attributes:** |
| ```python |
| @dataclass |
| class BBox: |
| rect: Rect # {x0, y0, x1, y1} |
| text: str # Text content |
| metadata: Optional[dict] = None # Additional data |
| ``` |
|
|
| **Key Methods:** |
| ```python |
| @property |
| def width(self) -> float: |
| return self.rect["x1"] - self.rect["x0"] |
| |
| @property |
| def height(self) -> float: |
| return self.rect["y1"] - self.rect["y0"] |
| |
| def normalize(self, image_width: float, image_height: float) -> "BBox": |
| """Convert to normalized [0, 1] coordinates.""" |
| |
| def unnormalize(self, image_width: float, image_height: float) -> "BBox": |
| """Convert from normalized to pixel coordinates.""" |
| |
| def to_dict(self) -> dict: |
| """Serialize to dictionary.""" |
| |
| @classmethod |
| def from_dict(cls, data: dict) -> "BBox": |
| """Deserialize from dictionary.""" |
| ``` |
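|
|
| The `normalize()` property pair above can be illustrated with a trimmed-down, self-contained sketch (the real `BBox` keeps `metadata` and more methods; this only shows the coordinate mapping): |
|
| ```python |
| from dataclasses import dataclass |
|
| @dataclass |
| class BBox: |
|     rect: dict  # {"x0", "y0", "x1", "y1"} |
|     text: str |
|
|     def normalize(self, image_width: float, image_height: float) -> "BBox": |
|         # Map pixel coordinates into [0, 1] relative to the image size |
|         r = self.rect |
|         return BBox( |
|             rect={ |
|                 "x0": r["x0"] / image_width, |
|                 "y0": r["y0"] / image_height, |
|                 "x1": r["x1"] / image_width, |
|                 "y1": r["y1"] / image_height, |
|             }, |
|             text=self.text, |
|         ) |
| ``` |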
|
|
| --- |
|
|
| ### LayoutBBox Model |
|
|
| **File:** `docgenie/generation/models/_bbox.py` |
|
|
| **Purpose:** Layout element bounding box (for DLA tasks). |
|
|
| **Attributes:** |
| ```python |
| @dataclass |
| class LayoutBBox: |
| label: str # "title", "text", "table", etc. |
| rect: Rect # {x0, y0, x1, y1} |
| metadata: Optional[dict] = None |
| ``` |
|
|
| **Key Methods:** |
| ```python |
| def normalize(self, image_width: float, image_height: float) -> "LayoutBBox": |
| """Normalize coordinates.""" |
| |
| def contains(self, other: "LayoutBBox") -> bool: |
| """Check if this bbox fully contains another.""" |
| |
| def overlaps(self, other: "LayoutBBox") -> bool: |
| """Check if this bbox overlaps with another.""" |
| |
| def overlap_area(self, other: "LayoutBBox") -> float: |
| """Calculate overlap area with another bbox.""" |
| ``` |
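|
|
| The geometry behind `contains`/`overlaps`/`overlap_area` is standard axis-aligned rectangle math; a sketch on plain rect dicts (the real methods operate on `LayoutBBox` instances): |
|
| ```python |
| def overlap_area(a: dict, b: dict) -> float: |
|     # a, b: {"x0", "y0", "x1", "y1"}; area of the intersection |
|     # rectangle, 0.0 when the boxes are disjoint |
|     w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"]) |
|     h = min(a["y1"], b["y1"]) - max(a["y0"], b["y0"]) |
|     return max(0.0, w) * max(0.0, h) |
|
| def overlaps(a: dict, b: dict) -> bool: |
|     return overlap_area(a, b) > 0.0 |
|
| def contains(a: dict, b: dict) -> bool: |
|     # True when a fully encloses b |
|     return (a["x0"] <= b["x0"] and a["y0"] <= b["y0"] |
|             and a["x1"] >= b["x1"] and a["y1"] >= b["y1"]) |
| ``` |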
|
|
| --- |
|
|
| ### Utility Modules |
|
|
| #### **BBox Utilities (utils/bboxes.py)** |
|
|
| ```python |
| def load_bboxes_from_file(file_path: Path) -> List[BBox]: |
| """Load bboxes from JSON file.""" |
| |
| def save_bboxes_to_file(bboxes: List[BBox], file_path: Path): |
| """Save bboxes to JSON file.""" |
| |
| def visualize_bboxes_on_pdf( |
| pdf_path: Path, |
| bboxes: List[BBox], |
| output_path: Path, |
| color: str = "red" |
| ): |
| """Draw bbox overlays on PDF.""" |
| |
| def visualize_bboxes_on_image( |
| image_path: Path, |
| bboxes: List[BBox], |
| output_path: Path, |
| color: str = "orange" |
| ): |
| """Draw bbox overlays on image.""" |
| |
| def check_bbox_containment(bbox1: BBox, bbox2: BBox) -> bool: |
| """Check if bbox1 contains bbox2.""" |
| ``` |
|
|
| --- |
|
|
| #### **Geometry Utilities (utils/geos.py)** |
|
|
| ```python |
| def filter_layout_elements(geometries: dict) -> List[dict]: |
| """Extract elements with data-label attribute.""" |
| |
| def filter_handwriting_elements(geometries: dict) -> List[dict]: |
| """Extract elements with data-handwriting attribute.""" |
| |
| def filter_visual_elements(geometries: dict) -> List[dict]: |
| """Extract elements with data-visual-element attribute.""" |
| |
| def filter_by_css_class(geometries: dict, class_name: str) -> List[dict]: |
| """Extract elements with specific CSS class.""" |
| ``` |
|
|
| --- |
|
|
| #### **Serialization Utilities (utils/serialization.py)** |
|
|
| ```python |
| def encode_image_to_base64(image_path: Path) -> str: |
| """Encode image file to base64 string.""" |
| |
| def decode_base64_to_image(base64_str: str, output_path: Path): |
| """Decode base64 string and save as image.""" |
| |
| def serialize_dataclass(obj: Any) -> dict: |
| """Serialize dataclass to dictionary.""" |
| |
| def deserialize_dataclass(data: dict, cls: Type[T]) -> T: |
| """Deserialize dictionary to dataclass.""" |
| ``` |
|
|
| --- |
|
|
| ## Configuration & Constants |
|
|
| ### Key Constants (generation/constants.py) |
|
|
| ```python |
| # Document processing |
| HTML_PARSER = "html.parser" # BeautifulSoup parser |
| PDF_POINT_SCALING = 72 / 96 # CSS DPI to PDF DPI |
| |
| # Bounding boxes |
| BBOX_OVERLAP_THRESHOLD = 0.05 # 5% overlap tolerance |
| SPATIAL_MATCH_THRESHOLD = 10.0 # Pixel tolerance for matching |
| |
| # PDF rendering |
| CHROMIUM_CONCURRENCY = 10 # Parallel renders |
| PER_PDF_RENDER_TIMEOUT = 60 # Seconds |
| PER_PDF_RENDER_MAX_RETRIES = 3 # Retry attempts |
| |
| # Handwriting |
| MAX_HANDWRITING_CHARS = 7 # Max chars per diffusion generation |
| HANDWRITING_HEIGHT_PX = 40 # Image height |
| HANDWRITING_PADDING_PX = 0 # Horizontal padding |
| DIFFUSION_NUM_INFERENCE_STEPS = 50 # Generation quality |
| HANDWRITING_IMAGE_UPSCALE_FACTOR = 3 # Insertion scaling |
| MAX_HANDWRITING_RAND_X = 2 # Random X offset (pixels) |
| MAX_HANDWRITING_RAND_Y = 1 # Random Y offset (pixels) |
| |
| # Visual elements |
| VISUAL_ELEMENT_UPSCALE_FACTOR = 3 # Insertion scaling |
| STAMP_BORDER_WIDTH = 2 # Stamp border thickness |
| BARCODE_DPI = 300 # Barcode image quality |
| |
| # OCR |
| IMAGE_DPI = 200 # Final image DPI |
| OCR_CONFIDENCE_THRESHOLD = 0.8 # Min confidence |
| |
| # Ground truth |
| FUZZY_MATCH_THRESHOLD = 0.85 # Levenshtein similarity cutoff |
| |
| # Handwriting styles |
| HANDWRITING_STYLES = [ |
| "writer_0", "writer_1", "writer_2", ..., "writer_99" |
| ] |
| |
| # Visual element types |
| VISUAL_ELEMENT_TYPES = [ |
| "stamp", "logo", "barcode", "chart", "photo" |
| ] |
| |
| # Visual element type mapping (LLM output → standard) |
| VISUAL_ELEMENT_TYPE_MAPPING = { |
| "stamp": "stamp", |
| "company_stamp": "stamp", |
| "approval_stamp": "stamp", |
| "logo": "logo", |
| "company_logo": "logo", |
| "brand_logo": "logo", |
| "barcode": "barcode", |
| "code128": "barcode", |
| "chart": "chart", |
| "graph": "chart", |
| "figure": "chart", |
| "photo": "photo", |
| "image": "photo", |
| "picture": "photo" |
| } |
| ``` |
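|
|
| Applying `VISUAL_ELEMENT_TYPE_MAPPING` is a dictionary lookup after normalizing the LLM's label; the fallback default below is an assumption for illustration, not documented behavior: |
|
| ```python |
| # Subset of the mapping from constants.py shown above |
| VISUAL_ELEMENT_TYPE_MAPPING = { |
|     "stamp": "stamp", "company_stamp": "stamp", "approval_stamp": "stamp", |
|     "logo": "logo", "company_logo": "logo", "brand_logo": "logo", |
|     "barcode": "barcode", "code128": "barcode", |
|     "chart": "chart", "graph": "chart", "figure": "chart", |
|     "photo": "photo", "image": "photo", "picture": "photo", |
| } |
|
| def canonical_ve_type(llm_type: str, default: str = "photo") -> str: |
|     # Normalize case/whitespace before lookup; unknown labels fall back |
|     # to `default` (an assumption made for this sketch) |
|     return VISUAL_ELEMENT_TYPE_MAPPING.get(llm_type.strip().lower(), default) |
| ``` |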
|
|
| --- |
|
|
| ### Environment Variables |
|
|
| **Required:** |
| ```bash |
| ANTHROPIC_API_KEY=sk-ant-... # Claude API key |
| MICROSOFT_AZURE_OCR_KEY=... # Azure OCR key |
| MICROSOFT_AZURE_OCR_ENDPOINT=... # Azure OCR endpoint |
| ``` |
|
|
| **Optional:** |
| ```bash |
| LLM_MODEL=claude-sonnet-4-20250514 # Override model |
| DEBUG_MODE=false # Enable debug output |
| LOG_LEVEL=INFO # Logging verbosity |
| CHROMIUM_CONCURRENCY=10 # Override concurrency |
| ``` |
|
|
| --- |
|
|
| ## Error Handling & Debugging |
|
|
| ### Error Categories |
|
|
| **Comprehensive error tracking in stage 18:** |
|
|
| | Category | Description | Resolution | |
| |----------|-------------|------------| |
| | `multipage_pdf` | PDF rendered with >1 page | Check HTML content size, CSS page breaks | |
| | `missing_ocr_result` | OCR API failed or empty | Check Azure credentials, retry | |
| | `failed_gt_verification` | GT text not found in OCR | Review fuzzy match threshold, inspect HTML | |
| | `rendering_timeout` | PDF render exceeded timeout | Increase timeout, simplify HTML | |
| | `llm_parsing_error` | Failed to extract HTML/GT | Review LLM response format, update regex | |
| | `bbox_extraction_failed` | PyMuPDF extraction error | Check PDF validity, inspect fonts | |
| | `handwriting_generation_failed` | Diffusion model error | Check model checkpoint, GPU availability | |
| | `visual_element_generation_failed` | VE creation error | Check prefab directories, image validity | |
|
|
| --- |
|
|
| ### Debug Visualizations (Stage 19) |
|
|
| **Generated debug outputs:** |
|
|
| 1. **PDF BBoxes:** Verify PyMuPDF extraction quality |
| 2. **Visual Element BBoxes:** Verify VE extraction and positioning |
| 3. **Handwriting Insertion:** Verify handwriting placement |
| 4. **Final BBoxes on Images:** Verify OCR accuracy |
|
|
| **Enable debug mode:** |
| ```python |
| # In pipeline execution |
| pipeline_params = PipelineParameters( |
| debug_mode=True, |
| # ... other params |
| ) |
| ``` |
|
|
| --- |
|
|
| ### Logging |
|
|
| **Log locations:** |
| ``` |
| data/datasets/synthesized_datasets/<dataset_name>/logs/ |
| ├── pipeline_01/ |
| │   └── seed_selection.log |
| ├── pipeline_02/ |
| │   └── prompting.log |
| ├── pipeline_03/ |
| │   └── message_processing/ |
| │       ├── batch_001.log |
| │       └── batch_002.log |
| ├── ... |
| └── pipeline_19/ |
|     └── debug_generation.log |
| ``` |
|
|
| **Log format:** |
| ``` |
| [2026-02-07 14:32:15] [INFO] [pipeline_04] Starting PDF rendering for doc_0001 |
| [2026-02-07 14:32:18] [INFO] [pipeline_04] Successfully rendered doc_0001 (3.2s) |
| [2026-02-07 14:32:18] [ERROR] [pipeline_04] Rendering failed for doc_0002: Timeout |
| ``` |
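|
|
| That line shape maps directly onto a `logging.Formatter`; a sketch of how a stage logger could be configured (the pipeline's actual logging setup may differ): |
|
| ```python |
| import logging |
|
| # Formatter reproducing the documented shape: |
| # [2026-02-07 14:32:15] [INFO] [pipeline_04] message |
| STAGE_LOG_FORMATTER = logging.Formatter( |
|     "[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s", |
|     datefmt="%Y-%m-%d %H:%M:%S", |
| ) |
|
| def make_stage_logger(stage: str) -> logging.Logger: |
|     # One logger per pipeline stage, named e.g. "pipeline_04" |
|     logger = logging.getLogger(stage) |
|     if not logger.handlers: |
|         handler = logging.StreamHandler() |
|         handler.setFormatter(STAGE_LOG_FORMATTER) |
|         logger.addHandler(handler) |
|     logger.setLevel(logging.INFO) |
|     return logger |
| ``` |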
|
|
| --- |
|
|
| ### Common Issues & Solutions |
|
|
| **Issue: Multi-page PDFs** |
| - **Cause:** HTML content exceeds page size |
| - **Solution:** Reduce content, check for long tables, use CSS `overflow: hidden` |
|
|
| **Issue: Handwriting not appearing** |
| - **Cause:** Character bboxes not available (stage 05) |
| - **Solution:** Use simpler fonts in HTML, ensure PyMuPDF can extract chars |
|
|
| **Issue: OCR missing text** |
| - **Cause:** Low image DPI, poor contrast |
| - **Solution:** Increase `IMAGE_DPI` (stage 14), adjust HTML styling |
|
|
| **Issue: GT verification failures** |
| - **Cause:** OCR discrepancies, fuzzy match threshold too high |
| - **Solution:** Lower `FUZZY_MATCH_THRESHOLD`, improve HTML text rendering |
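|
|
| To build intuition for the threshold, GT/OCR similarity can be scored with the standard library's `difflib`; DocGenie's actual matcher may differ, so the threshold value and scoring below are illustrative: |
|
|
| ```python |
| from difflib import SequenceMatcher |
| |
| FUZZY_MATCH_THRESHOLD = 0.9  # illustrative value |
| |
| def gt_matches_ocr(gt_text: str, ocr_text: str) -> bool: |
|     """Return True if the GT string is similar enough to the OCR string.""" |
|     ratio = SequenceMatcher(None, gt_text.lower(), ocr_text.lower()).ratio() |
|     return ratio >= FUZZY_MATCH_THRESHOLD |
| |
| print(gt_matches_ocr("Invoice #12345", "Invoice #12345"))  # True |
| # Common OCR confusions ('1' read as 'l') drop the ratio to ~0.86: |
| print(gt_matches_ocr("Invoice #12345", "lnvoice #l2345"))  # False at 0.9 |
| ``` |
|
|
| Lowering the threshold tolerates more OCR noise, at the cost of accepting looser matches. |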
|
|
| **Issue: Claude API timeout** |
| - **Cause:** Large seed images, complex prompts |
| - **Solution:** Compress seed images, simplify prompt template |
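|
|
| Seed images can be downscaled and re-encoded with Pillow before being attached to a request; the size and quality targets below are illustrative assumptions, not tuned DocGenie defaults: |
|
|
| ```python |
| from io import BytesIO |
| |
| from PIL import Image |
| |
| def compress_seed_image(image, max_side=1536, quality=80): |
|     """Downscale so the longer side is <= max_side, then re-encode as JPEG.""" |
|     scale = max_side / max(image.size) |
|     if scale < 1.0: |
|         image = image.resize( |
|             (int(image.width * scale), int(image.height * scale)), |
|             Image.LANCZOS, |
|         ) |
|     buf = BytesIO() |
|     image.convert("RGB").save(buf, format="JPEG", quality=quality) |
|     return buf.getvalue() |
| ``` |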
|
|
| --- |
|
|
| ## Usage Examples |
|
|
| ### Full Pipeline Execution |
|
|
| ```python |
| import importlib |
| import pkgutil |
| |
| import docgenie.generation |
| from docgenie.generation.models import SynDatasetDefinition, PipelineParameters |
| |
| # Load configuration |
| syndatadef = SynDatasetDefinition.from_yaml( |
|     "data/syn_dataset_definitions/docvqa_alpha=1.0.yaml" |
| ) |
| |
| # Configure pipeline |
| params = PipelineParameters( |
|     start_stage=1, |
|     end_stage=19, |
|     skip_existing=True, |
|     debug_mode=False, |
|     max_workers=10, |
| ) |
| |
| # Stage modules carry descriptive suffixes (e.g. pipeline_01_select_seeds), |
| # so resolve each stage number to its full module name by prefix -- |
| # importlib.import_module does not accept wildcards. |
| stage_modules = { |
|     name.split("_")[1]: name |
|     for _, name, _ in pkgutil.iter_modules(docgenie.generation.__path__) |
|     if name.startswith("pipeline_") |
| } |
| |
| # Execute pipeline stages |
| for stage in range(params.start_stage, params.end_stage + 1): |
|     print(f"Executing stage {stage:02d}...") |
| |
|     stage_module = importlib.import_module( |
|         f"docgenie.generation.{stage_modules[f'{stage:02d}']}" |
|     ) |
|     stage_module.main(syndatadef, params) |
| |
|     print(f"Stage {stage:02d} complete.") |
| |
| print("Pipeline execution complete!") |
| ``` |
|
|
| --- |
|
|
| ### API Usage |
|
|
| See the [API Implementation](#api-implementation) section for detailed examples. |
|
|
| --- |
|
|
| ### Custom Dataset Definition |
|
|
| ```yaml |
| # data/syn_dataset_definitions/custom_invoices.yaml |
| |
| name: "custom_invoices_v1" |
| task: "kie" |
| base_dataset: "cord" |
| |
| llm_model: "claude-sonnet-4-20250514" |
| prompt_template: "ClaudeRefined12" |
| num_solutions: 1 |
| |
| language: "English" |
| doc_type: "invoices and receipts" |
| gt_type: "Key-value pairs (entity extraction)" |
| gt_format: '{"entity_name": "entity_value", ...}' |
| |
| num_samples: 500 |
| alpha: 0.75 |
| seeds_per_cluster: 5 |
| clustering_method: "kmeans" |
| |
| # KIE-specific |
| valid_labels: null # Not used for KIE |
| |
| output_dir: "data/datasets/synthesized_datasets/custom_invoices_v1" |
| ``` |
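|
|
| Before running the pipeline, a definition like this can be sanity-checked with a plain YAML load (a sketch using PyYAML; the required-field list is an illustrative subset, not the full schema): |
|
|
| ```python |
| import yaml |
| |
| # Illustrative subset of fields every definition should carry |
| REQUIRED_FIELDS = {"name", "task", "llm_model", "num_samples", "output_dir"} |
| |
| def load_definition(path: str) -> dict: |
|     """Load a dataset definition YAML and check a few required fields.""" |
|     with open(path) as f: |
|         definition = yaml.safe_load(f) |
|     missing = REQUIRED_FIELDS - definition.keys() |
|     if missing: |
|         raise ValueError(f"Definition missing fields: {sorted(missing)}") |
|     return definition |
| ``` |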
|
|
| --- |
|
|
| ## Conclusion |
|
|
| This documentation provides a comprehensive overview of the DocGenie generation pipeline and API. For additional support or questions, refer to: |
|
|
| - **Pipeline Integration Guide:** `api/PIPELINE_INTEGRATION.md` |
| - **API README:** `api/README.md` |
| - **Source Code:** `docgenie/generation/` and `api/` |
|
|
| **Key Takeaways:** |
|
|
| 1. **19-Stage Pipeline:** Modular design from seed selection to GT verification |
| 2. **Multi-Task Support:** QA, KIE, DLA, CLS with task-specific handling |
| 3. **Realistic Documents:** LLM-generated content, diffusion handwriting, visual elements |
| 4. **Quality Assurance:** Comprehensive validation, OCR verification, error tracking |
| 5. **API Integration:** FastAPI service for synchronous document generation (stages 01-06) |
| 6. **Extensibility:** Modular code, clear interfaces, easy to extend |
|
|
| **Pipeline Strengths:** |
|
|
| - Task-agnostic core with task-specific adapters |
| - Extensive logging and error tracking |
| - Parallel processing where applicable |
| - Debug visualizations at every stage |
| - Integration with state-of-the-art models |
|
|
| **Future Directions:** |
|
|
| - Full API integration (stages 07-19) |
| - Additional document types (forms, legal, medical) |
| - Multi-language support expansion |
| - Enhanced visual element generation (charts, diagrams) |
| - Real-time generation optimization |
|
|