# DocGenie Project Context Note (LLM Ready)

## 1) Executive Summary

DocGenie is an AI-driven synthetic document generation platform designed to create realistic, annotated datasets for document intelligence tasks.

The project combines:
- LLM-based document content and layout generation
- PDF rendering and geometric extraction
- Optional handwriting synthesis (diffusion model)
- Optional visual element insertion (logos, stamps, barcodes, charts, photos)
- OCR extraction and bbox normalization
- Ground-truth preparation for downstream machine learning
- API-first and async batch workflows for production-scale generation

The core idea is to transform a small set of real seed document images plus high-level generation parameters into large, diverse, reproducible synthetic datasets suitable for training and evaluation.

## 2) Problem Statement

Real document datasets are expensive and slow to collect, often constrained by privacy, class imbalance, and weak annotation quality. This limits model quality for tasks like DocVQA, KIE, and layout understanding.

Key challenges:
- Lack of large high-quality labeled datasets
- Domain mismatch between training and production documents
- Manual labeling cost and inconsistency
- Need for handwriting and visual artifacts in realistic layouts
- Need for reproducibility and controllable data generation

## 3) Proposed Solution

DocGenie proposes a modular synthetic dataset engine with controllable realism.

High-level solution flow:
1. Select and ingest seed images that represent target document style.
2. Use LLM prompting (vision + text) to generate HTML/CSS-based document variants and structured GT.
3. Render HTML to PDF and extract text geometry/bboxes.
4. Optionally replace selected text with generated handwriting.
5. Optionally insert visual elements (stamp/logo/barcode/photo/figure).
6. Produce final PDFs/images + OCR + normalized bboxes + verified GT + export packages.

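
The solution flow above can be read as a chain of stages, each consuming the output of the previous one, with a single seed making a run repeatable. The following is a minimal sketch of that idea, not DocGenie's actual code: the stage functions, the `ctx` dictionary, and `run_pipeline` are hypothetical names introduced for illustration, and the LLM and rendering steps are replaced by placeholders.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stage signature: each stage reads and extends a shared context.
Stage = Callable[[Dict], Dict]


def ingest_seeds(ctx: Dict) -> Dict:
    # Placeholder: a real stage would download and validate the seed images.
    ctx["seeds"] = list(ctx["seed_images"])
    return ctx


def generate_html(ctx: Dict) -> Dict:
    # Placeholder for the LLM call; the seeded RNG stands in for sampling choices.
    variant = ctx["rng"].randint(0, 9999)
    ctx["html"] = f"<html><body>variant {variant}</body></html>"
    return ctx


def render_pdf(ctx: Dict) -> Dict:
    # Placeholder: a real stage would render the HTML and extract text geometry.
    ctx["pdf"] = ctx["html"].encode("utf-8")
    return ctx


def run_pipeline(stages: List[Stage], seed_images: List[str], seed: int) -> Dict:
    # One RNG per run, created from the request seed, so reruns are repeatable.
    ctx: Dict = {"seed_images": seed_images, "rng": random.Random(seed)}
    for stage in stages:
        ctx = stage(ctx)
    return ctx


STAGES: List[Stage] = [ingest_seeds, generate_html, render_pdf]
```

Calling `run_pipeline(STAGES, urls, seed=42)` twice with the same inputs yields the same `html` value, which is the property a seeded run is meant to give. Real LLM calls are not deterministic by default, so in practice the seed would also need to be passed to the provider or the responses cached.
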
Design principles:
- Stage-wise pipeline (clear inputs/outputs per stage)
- Reproducibility via seeds
- Production-ready API endpoints
- Async job orchestration for large runs
- Separation of CPU API workloads and GPU handwriting inference workloads

## 4) Project Goals

Primary goals:
- Generate realistic synthetic documents at scale
- Support multiple document AI tasks with rich annotation
- Provide configurable realism controls (handwriting ratio, visual element types, OCR toggles)
- Minimize generation cost with batched LLM calls
- Enable operational deployment with monitoring and async processing

Secondary goals:
- Improve dataset diversity through seed and prompt strategies
- Support rapid experimentation for model development
- Keep architecture modular for independent upgrades

## 5) Core Capabilities

- Seed-image-guided generation (1-8 images per request)
- Configurable document language/type and GT format
- Multi-output generation per seed set (num_solutions)
- Handwriting synthesis with writer-style consistency
- Visual element synthesis and insertion
- OCR extraction from final rendered artifacts
- Normalized bbox outputs for ML pipelines
- Optional dataset packaging/export (for training pipelines)
- Async batch generation with status polling and result retrieval

## 6) 19-Stage Pipeline (Conceptual)

DocGenie follows a full multi-stage pipeline:
1. Seed selection/download
2. Prompt LLM
3. Process LLM response and extract HTML/GT
4. Render PDF and extract geometries
5. Extract text bboxes
6. Validate generated artifacts
7. Extract handwriting region definitions
8. Extract visual element definitions
9. Generate handwriting images
10. Generate visual element images
11. Re-render PDF (without placeholders where required)
12. Insert handwriting overlays
13. Insert visual overlays
14. Render document images
15. Run OCR
16. Normalize bboxes
17. Prepare/verify GT
18. Analyze run statistics
19. Create debug/export outputs

Important details:
- Browser geometries are typically expressed at 96 DPI and PDF geometries at 72 DPI, so coordinates must be transformed between the two.
- Handwriting insertion requires text-to-bbox matching and deduplication logic.

## 7) API Product Surface

Main API behavior is centered around three use patterns:

1) Synchronous generation endpoint
   - Returns generated documents and metadata directly in response.
   - Suitable for development and debugging.

2) Synchronous PDF/ZIP artifact endpoint
   - Returns packaged artifacts (PDF, metadata, optional assets) in downloadable form.
   - Suitable for practical batch outputs.

3) Asynchronous batch endpoint
   - Queues long-running generation jobs.
   - Returns request/task id.
   - Client polls status endpoint.
   - Client fetches/downloads final output when completed.
   - Best for production and larger workloads.

Typical request dimensions:
- seed_images: list of remote URLs
- prompt_params: language, doc type, GT settings, feature toggles, reproducibility seed

## 8) System Architecture

Monorepo-style architecture with independent service boundaries:

A) Core package
- Shared generation logic and pipeline stages.

B) API service (CPU)
- FastAPI interface
- Orchestrates generation pipeline
- Manages async queue and external integrations

C) Background worker (CPU)
- Executes queued async jobs
- Handles long-running generation and packaging workflows

D) Handwriting service (GPU)
- Separate service for diffusion-based handwriting generation
- Designed to be deployable independently

E) Data stores and platform services
- Queue broker (Redis)
- Metadata storage (database)
- File delivery/storage integration

Architecture intent:
- Keep API orchestration scalable and light
- Offload expensive handwriting generation to GPU service
- Enable independent deployment and scaling per component

## 9) Handwriting Subsystem

Handwriting generation is treated as a specialized capability:
- Uses diffusion-style generation with writer IDs/styles
- Supports per-word token generation and mapping
- Supports post-processing (blur, anti-aliasing, cropping)
- Designed for realism and style consistency within a document

Operational notes:
- Batch handling is optimized for service cost and startup overhead
- Some model/sampling settings are constrained by the underlying handwriting model implementation

## 10) Visual Element Subsystem

Visual elements include artifacts commonly found in real documents:
- logos
- stamps
- barcodes
- photos
- figures/charts

Key behavior:
- Placeholder-based extraction from generated HTML/geometries
- Type normalization and filtering by request settings
- Coordinate-aware insertion into final PDF/image artifacts

## 11) Data and Output Contracts

The project outputs ML-ready artifacts with rich metadata.

Typical outputs:
- Generated HTML/CSS
- Intermediate and final PDFs
- Rasterized page images
- Word/segment/layout bboxes
- Normalized coordinate variants
- Handwriting images and maps
- Visual element images and maps
- Ground-truth objects (task dependent)
- Optional packaged export for training pipelines

This enables direct use for training/evaluation datasets, debugging, and pipeline QA.

## 12) Deployment Strategy (Current Direction)

Recommended deployment split:
- API + worker on CPU-friendly platform
- Handwriting service on GPU-capable platform
- Redis and database as managed services

Why this split works:
- Different resource profiles (CPU orchestration vs GPU inference)
- Independent scaling and cost control
- Service isolation improves reliability and debugging

## 13) Testing and Quality Strategy

Project testing plan emphasizes:
- Unit tests per critical stage function
- Integration tests for service boundaries (LLM, handwriting service, queue)
- System tests for end-to-end generation
- Non-functional tests: performance, reliability, scalability, security

Key risk areas tested heavily:
- External API failures/retries
- Geometry and bbox alignment
- Async job state transitions
- Handwriting/visual overlay correctness

## 14) Known Constraints and Practical Considerations

- Quality depends on seed representativeness and prompt quality.
- External service availability (LLM providers, handwriting endpoint) impacts runtime reliability.
- Coordinate conversion and matching edge cases can affect overlay precision.
- Large batch jobs require async orchestration and observability.
- Some advanced generation realism features may still be iterative/improving.

## 15) Why This Project Matters

DocGenie addresses a real bottleneck in document AI: obtaining large, diverse, high-quality labeled training data.

It provides a controllable synthetic data engine that can:
- accelerate experimentation
- reduce dependence on private data access
- improve model robustness through diversity and controlled perturbations
- support multiple document AI tasks in one platform

## 16) Suggested Prompt Context for Future LLM Tasks

Use the following when asking an LLM to help with this codebase:

Project summary:
I am working on DocGenie, a synthetic document generation platform with a 19-stage pipeline.
It uses LLM-generated HTML/CSS from seed images, renders to PDF, extracts bboxes/geometries, optionally inserts diffusion-generated handwriting and visual elements, runs OCR, normalizes bboxes, verifies GT, and exports ML-ready artifacts. The system has a FastAPI service, async worker, and separate GPU handwriting service.

Primary objective:
Improve reliability, generation quality, and production scalability of synthetic dataset generation for DocVQA/KIE/layout tasks.

Technical priorities:
- API and worker robustness
- bbox/geometry correctness
- handwriting and visual insertion accuracy
- async job reliability and observability
- deployment and cost optimization

Constraints:
- External dependencies (LLM APIs, managed queue/db, GPU service)
- Need reproducibility through seeded runs
- Preserve compatibility of output metadata for downstream ML pipelines

When proposing changes:
- Keep stage boundaries clear
- Avoid breaking output contracts
- Include failure handling and retries
- Prefer measurable improvements (latency, cost, quality, reliability)

## 17) Fast Context Snapshot (Short Version)

DocGenie is an API-first synthetic document dataset generator for document AI. It takes seed images and generation settings, uses an LLM to generate document HTML/GT, renders PDFs, extracts geometry, optionally adds handwriting and visual artifacts, runs OCR, normalizes annotations, and returns/exports ML-ready data. It is built as a modular 19-stage pipeline with async job processing and a separate GPU handwriting service for scalable production usage.
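
The geometry extraction and annotation normalization summarized above rest on the 96 DPI to 72 DPI conversion noted in section 6. A minimal sketch of both steps follows, under stated assumptions: the function names and the `(x0, y0, x1, y1)` tuple layout are hypothetical, both coordinate systems are assumed to share a top-left origin, and normalization to the range 0 to 1 is only one possible convention (this note does not specify which normalized variants DocGenie emits).

```python
from typing import Tuple

# Hypothetical bbox layout: (x0, y0, x1, y1).
BBox = Tuple[float, float, float, float]

CSS_DPI = 96.0  # CSS pixels per inch in browser layout
PDF_DPI = 72.0  # PDF points per inch


def css_px_to_pdf_pt(bbox: BBox) -> BBox:
    """Scale a browser-space bbox (CSS px) into PDF points.

    Assumes both spaces use a top-left origin; a y-flip would be needed
    if the PDF side uses the native bottom-left origin.
    """
    scale = PDF_DPI / CSS_DPI  # 0.75
    x0, y0, x1, y1 = bbox
    return (x0 * scale, y0 * scale, x1 * scale, y1 * scale)


def normalize_bbox(bbox: BBox, page_width: float, page_height: float) -> BBox:
    """Map a bbox in page units to the 0..1 range, clamped to the page."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")

    def clamp(value: float) -> float:
        return min(1.0, max(0.0, value))

    x0, y0, x1, y1 = bbox
    return (
        clamp(x0 / page_width),
        clamp(y0 / page_height),
        clamp(x1 / page_width),
        clamp(y1 / page_height),
    )
```

For example, a browser bbox of `(96, 192, 288, 384)` becomes `(72.0, 144.0, 216.0, 288.0)` in PDF points, which can then be normalized against a 612 by 792 point page (US Letter). Clamping is a design choice for bboxes that slightly overshoot the page; a stricter pipeline might reject them instead.
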