# DocGenie Project Context Note (LLM Ready)
## 1) Executive Summary
DocGenie is an AI-driven synthetic document generation platform designed to create realistic, annotated datasets for document intelligence tasks.
The project combines:
- LLM-based document content and layout generation
- PDF rendering and geometric extraction
- Optional handwriting synthesis (diffusion model)
- Optional visual element insertion (logos, stamps, barcodes, charts, photos)
- OCR extraction and bbox normalization
- Ground-truth preparation for downstream machine learning
- API-first and async batch workflows for production-scale generation
The core idea is to transform a small set of real seed document images plus high-level generation parameters into large, diverse, reproducible synthetic datasets suitable for training and evaluation.
## 2) Problem Statement
Real document datasets are expensive and slow to collect, and they are often constrained by privacy, class imbalance, and weak annotation quality. These constraints limit model quality for tasks such as document visual question answering (DocVQA), key information extraction (KIE), and layout understanding.
Key challenges:
- Lack of large high-quality labeled datasets
- Domain mismatch between training and production documents
- Manual labeling cost and inconsistency
- Need for handwriting and visual artifacts in realistic layouts
- Need for reproducibility and controllable data generation
## 3) Proposed Solution
DocGenie proposes a modular synthetic dataset engine with controllable realism.
High-level solution flow:
1. Select and ingest seed images that represent target document style.
2. Use LLM prompting (vision + text) to generate HTML/CSS-based document variants and structured GT.
3. Render HTML to PDF and extract text geometry/bboxes.
4. Optionally replace selected text with generated handwriting.
5. Optionally insert visual elements (stamp/logo/barcode/photo/figure).
6. Produce final PDFs/images + OCR + normalized bboxes + verified GT + export packages.
Design principles:
- Stage-wise pipeline (clear inputs/outputs per stage)
- Reproducibility via seeds
- Production-ready API endpoints
- Async job orchestration for large runs
- Separation of CPU API workloads and GPU handwriting inference workloads
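The "reproducibility via seeds" principle can be illustrated with a minimal sketch. It assumes one base seed per request from which each document and stage derives its own seed; `derive_seed` and `pick_handwriting_words` are hypothetical names, not functions confirmed to exist in the codebase.

```python
import hashlib
import random
from typing import List


def derive_seed(base_seed: int, doc_index: int, stage: str) -> int:
    """Derive a stable per-document, per-stage seed from one base seed (hypothetical helper)."""
    key = f"{base_seed}:{doc_index}:{stage}".encode("utf-8")
    # SHA-256 is stable across processes, unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big")


def pick_handwriting_words(words: List[str], ratio: float, seed: int) -> List[str]:
    """Choose a reproducible subset of words to render as handwriting."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    count = round(len(words) * ratio)
    chosen = set(rng.sample(range(len(words)), count))
    # Preserve the original word order in the result.
    return [w for i, w in enumerate(words) if i in chosen]
```

Deriving seeds per stage rather than sharing one RNG means that enabling or disabling an optional stage should not change the random choices made by the others.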
## 4) Project Goals
Primary goals:
- Generate realistic synthetic documents at scale
- Support multiple document AI tasks with rich annotation
- Provide configurable realism controls (handwriting ratio, visual element types, OCR toggles)
- Minimize generation cost with batched LLM calls
- Enable operational deployment with monitoring and async processing
Secondary goals:
- Improve dataset diversity through seed and prompt strategies
- Support rapid experimentation for model development
- Keep architecture modular for independent upgrades
## 5) Core Capabilities
- Seed-image-guided generation (1-8 images per request)
- Configurable document language/type and GT format
- Multi-output generation per seed set (num_solutions)
- Handwriting synthesis with writer-style consistency
- Visual element synthesis and insertion
- OCR extraction from final rendered artifacts
- Normalized bbox outputs for ML pipelines
- Optional dataset packaging/export (for training pipelines)
- Async batch generation with status polling and result retrieval
## 6) 19-Stage Pipeline (Conceptual)
DocGenie follows a full multi-stage pipeline:
1. Seed selection/download
2. Prompt LLM
3. Process LLM response and extract HTML/GT
4. Render PDF and extract geometries
5. Extract text bboxes
6. Validate generated artifacts
7. Extract handwriting region definitions
8. Extract visual element definitions
9. Generate handwriting images
10. Generate visual element images
11. Re-render PDF (without placeholders where required)
12. Insert handwriting overlays
13. Insert visual overlays
14. Render document images
15. Run OCR
16. Normalize bboxes
17. Prepare/verify GT
18. Analyze run statistics
19. Create debug/export outputs
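A stage-wise pipeline like the one listed above can be sketched as a list of stages that read from and write to a shared context. This is an illustrative sketch only: the `Stage` class, the `run_pipeline` function, the context keys, and the toy stage bodies are all assumptions, not the project's actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

Context = Dict[str, Any]


@dataclass(frozen=True)
class Stage:
    name: str
    run: Callable[[Context], Context]  # returns only the new keys it produces
    requires: Tuple[str, ...] = ()     # keys that must already be in the context
    flag: Optional[str] = None         # optional stage: runs only if ctx[flag] is truthy


def run_pipeline(stages: List[Stage], ctx: Context) -> Context:
    """Run stages in order, skipping disabled optional stages."""
    ctx = dict(ctx)  # do not mutate the caller's dict
    executed: List[str] = []
    for stage in stages:
        if stage.flag is not None and not ctx.get(stage.flag):
            continue
        missing = [key for key in stage.requires if key not in ctx]
        if missing:
            raise KeyError(f"{stage.name}: missing inputs {missing}")
        ctx.update(stage.run(ctx))
        executed.append(stage.name)
    ctx["executed"] = executed
    return ctx


# Toy stages standing in for real pipeline work (hypothetical names and outputs).
STAGES = [
    Stage("prompt_llm", lambda c: {"html": f"<p>{c['doc_type']}</p>"}, ("doc_type",)),
    Stage("render_pdf", lambda c: {"pdf": c["html"].encode()}, ("html",)),
    Stage("insert_handwriting", lambda c: {"handwritten": True}, ("pdf",), "enable_handwriting"),
    Stage("run_ocr", lambda c: {"ocr_text": "..."}, ("pdf",), "enable_ocr"),
]
```

Declaring each stage's required inputs up front makes a broken stage boundary fail loudly at the stage that needs the data, rather than later in the run.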
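Important details:
- Browser geometry is typically measured in CSS pixels (96 per inch) while PDF geometry is measured in points (72 per inch), so coordinates must be scaled by 72/96 = 0.75 when moving from browser space to PDF space.
- Handwriting insertion requires text-to-bbox matching and deduplication logic.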
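The browser-to-PDF coordinate transform can be sketched as follows. The function name `css_px_to_pdf_pt` is hypothetical. The optional y-flip is an assumption about the target coordinate system: PDF default user space has its origin at the bottom-left, but many PDF libraries expose top-left coordinates, in which case the flip should be skipped.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

PX_PER_INCH = 96.0  # CSS reference pixel density
PT_PER_INCH = 72.0  # PDF points per inch


def css_px_to_pdf_pt(bbox: BBox, page_height_pt: Optional[float] = None) -> BBox:
    """Convert a browser bbox in CSS pixels to PDF points.

    If page_height_pt is given, also flip the y axis from a top-left
    origin to a bottom-left origin.
    """
    scale = PT_PER_INCH / PX_PER_INCH  # 0.75
    x0, y0, x1, y1 = (v * scale for v in bbox)
    if page_height_pt is None:
        return (x0, y0, x1, y1)
    # After the flip, the old bottom edge (y1) becomes the new lower y value.
    return (x0, page_height_pt - y1, x1, page_height_pt - y0)
```

For example, a box spanning 96 to 192 CSS pixels horizontally maps to 72 to 144 points.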
## 7) API Product Surface
The API centers on three usage patterns:
1) Synchronous generation endpoint
- Returns generated documents and metadata directly in response.
- Suitable for development and debugging.
2) Synchronous PDF/ZIP artifact endpoint
- Returns packaged artifacts (PDF, metadata, optional assets) in downloadable form.
- Suitable for practical batch outputs.
3) Asynchronous batch endpoint
- Queues long-running generation jobs.
- Returns request/task id.
- Client polls status endpoint.
- Client fetches/downloads final output when completed.
- Best for production and larger workloads.
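The client side of the asynchronous pattern (submit, poll, fetch) might look like the sketch below. The status payload shape and the state names `"completed"` and `"failed"` are assumptions, and `get_status` is an injected callable standing in for whatever HTTP call the real status endpoint requires.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATES = {"completed", "failed"}  # assumed state names


def wait_for_job(
    get_status: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    max_interval_s: float = 30.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll a job status callable until it reaches a terminal state.

    The delay doubles after each poll, capped at max_interval_s, so
    long-running jobs do not hammer the status endpoint.
    """
    deadline = clock() + timeout_s
    delay = interval_s
    while True:
        status = get_status()
        if status.get("state") in TERMINAL_STATES:
            return status
        if clock() + delay > deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; in production the defaults apply.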
Typical request dimensions:
- seed_images: list of remote URLs
- prompt_params: language, doc type, GT settings, feature toggles, reproducibility seed
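A request with these dimensions could be modeled and validated roughly as below. Only `seed_images`, `prompt_params`, and `num_solutions` appear in this note; every other field name and all default values are hypothetical. The 1 to 8 image limit comes from the Core Capabilities section.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class PromptParams:
    # Field names and defaults below are illustrative assumptions.
    language: str = "en"
    doc_type: str = "invoice"
    gt_format: str = "json"
    num_solutions: int = 1
    handwriting_ratio: float = 0.0
    enable_ocr: bool = True
    seed: Optional[int] = None  # reproducibility seed


@dataclass
class GenerationRequest:
    seed_images: List[str]
    prompt_params: PromptParams = field(default_factory=PromptParams)

    def validate(self) -> None:
        """Raise ValueError if the request is malformed."""
        if not 1 <= len(self.seed_images) <= 8:
            raise ValueError("seed_images must contain 1 to 8 URLs")
        for url in self.seed_images:
            parsed = urlparse(url)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                raise ValueError(f"not a remote URL: {url!r}")
        if self.prompt_params.num_solutions < 1:
            raise ValueError("num_solutions must be at least 1")
        if not 0.0 <= self.prompt_params.handwriting_ratio <= 1.0:
            raise ValueError("handwriting_ratio must be between 0 and 1")
```

Validating at the API boundary keeps malformed requests from reaching the queue, where failures are slower and harder to trace.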
## 8) System Architecture
Monorepo-style architecture with independent service boundaries:
A) Core package
- Shared generation logic and pipeline stages.
B) API service (CPU)
- FastAPI interface
- Orchestrates generation pipeline
- Manages async queue and external integrations
C) Background worker (CPU)
- Executes queued async jobs
- Handles long-running generation and packaging workflows
D) Handwriting service (GPU)
- Separate service for diffusion-based handwriting generation
- Designed to be deployable independently
E) Data stores and platform services
- Queue broker (Redis)
- Metadata storage (database)
- File delivery/storage integration
Architecture intent:
- Keep API orchestration scalable and light
- Offload expensive handwriting generation to GPU service
- Enable independent deployment and scaling per component
## 9) Handwriting Subsystem
Handwriting generation is treated as a specialized capability:
- Uses diffusion-style generation with writer IDs/styles
- Supports per-word token generation and mapping
- Supports post-processing (blur, anti-aliasing, cropping)
- Designed for realism and style consistency within a document
Operational notes:
- Handwriting requests are batched to reduce per-call service cost and amortize startup overhead
- Some model/sampling settings are constrained by the underlying handwriting model implementation
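Mapping generated handwriting words back to positions on the page needs text-to-bbox matching with deduplication, so that a word occurring several times is not assigned the same box twice. The sketch below shows one simple way to do this. It assumes case-insensitive exact matching in reading order; `match_words_to_bboxes` is a hypothetical name and the real matching logic may be more tolerant.

```python
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]


def match_words_to_bboxes(
    targets: List[str], words: List[Tuple[str, BBox]]
) -> List[Optional[int]]:
    """For each target word, return the index of an unused matching bbox, or None.

    `words` is the extracted (text, bbox) list in reading order. Each bbox
    index is handed out at most once, which is the deduplication step.
    """
    # Index occurrences by normalized text, preserving reading order.
    by_text: Dict[str, List[int]] = {}
    for idx, (text, _bbox) in enumerate(words):
        by_text.setdefault(text.strip().lower(), []).append(idx)

    matches: List[Optional[int]] = []
    for target in targets:
        candidates = by_text.get(target.strip().lower(), [])
        # pop(0) consumes the earliest unused occurrence.
        matches.append(candidates.pop(0) if candidates else None)
    return matches
```

A `None` entry signals a target with no remaining box, which a caller can log or skip instead of overlaying handwriting in the wrong place.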
## 10) Visual Element Subsystem
Visual elements include artifacts commonly found in real documents:
- logos
- stamps
- barcodes
- photos
- figures/charts
Key behavior:
- Placeholder-based extraction from generated HTML/geometries
- Type normalization and filtering by request settings
- Coordinate-aware insertion into final PDF/image artifacts
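Type normalization and filtering can be sketched as an alias table plus an allow-list taken from the request. The alias entries, the placeholder dict shape, and the function names are assumptions for illustration; only the five canonical types come from this note.

```python
from typing import Any, Dict, List, Optional, Set

# Hypothetical aliases mapping raw LLM output to the five canonical types.
ALIASES: Dict[str, str] = {
    "logo": "logo", "brand": "logo",
    "stamp": "stamp", "seal": "stamp",
    "barcode": "barcode", "qr": "barcode", "qrcode": "barcode",
    "photo": "photo", "image": "photo",
    "figure": "figure", "chart": "figure",
}


def normalize_type(raw: str) -> Optional[str]:
    """Map a raw type label to a canonical type, or None if unknown."""
    key = raw.strip().lower()
    for sep in ("-", "_", " "):
        key = key.replace(sep, "")
    return ALIASES.get(key)


def filter_elements(
    placeholders: List[Dict[str, Any]], allowed: Set[str]
) -> List[Dict[str, Any]]:
    """Keep placeholders whose normalized type is enabled for this request."""
    kept = []
    for placeholder in placeholders:
        kind = normalize_type(str(placeholder.get("type", "")))
        if kind is not None and kind in allowed:
            # Copy so the caller's placeholder dicts are not mutated.
            kept.append({**placeholder, "type": kind})
    return kept
```

Unknown labels are dropped rather than guessed, so an unexpected LLM output cannot insert an element type the request did not ask for.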
## 11) Data and Output Contracts
The project outputs ML-ready artifacts with rich metadata.
Typical outputs:
- Generated HTML/CSS
- Intermediate and final PDFs
- Rasterized page images
- Word/segment/layout bboxes
- Normalized coordinate variants
- Handwriting images and maps
- Visual element images and maps
- Ground-truth objects (task dependent)
- Optional packaged export for training pipelines
This enables direct use for training/evaluation datasets, debugging, and pipeline QA.
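As an illustration of the normalized coordinate variants, the sketch below rescales a bbox to an integer 0 to 1000 range relative to the page size. That range is an assumption: it is a convention used by several layout-aware models, and this note does not state which scale DocGenie actually emits.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def normalize_bbox(
    bbox: BBox, page_width: float, page_height: float, scale: int = 1000
) -> Tuple[int, int, int, int]:
    """Rescale (x0, y0, x1, y1) to integers in [0, scale], clamped to the page."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    x0, y0, x1, y1 = bbox
    # Tolerate swapped corners.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))

    def to_unit(value: float, limit: float) -> int:
        clamped = min(max(value, 0.0), limit)  # keep the box on the page
        return round(clamped * scale / limit)

    return (
        to_unit(x0, page_width),
        to_unit(y0, page_height),
        to_unit(x1, page_width),
        to_unit(y1, page_height),
    )
```

Normalized boxes are independent of render resolution, so the same annotation stays valid if page images are re-rasterized at a different DPI.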
## 12) Deployment Strategy (Current Direction)
Recommended deployment split:
- API + worker on CPU-friendly platform
- Handwriting service on GPU-capable platform
- Redis and database as managed services
Why this split works:
- Different resource profiles (CPU orchestration vs GPU inference)
- Independent scaling and cost control
- Service isolation improves reliability and debugging
## 13) Testing and Quality Strategy
The testing plan emphasizes:
- Unit tests per critical stage function
- Integration tests for service boundaries (LLM, handwriting service, queue)
- System tests for end-to-end generation
- Non-functional tests: performance, reliability, scalability, security
Key risk areas tested heavily:
- External API failures/retries
- Geometry and bbox alignment
- Async job state transitions
- Handwriting/visual overlay correctness
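Async job state transitions are easier to test when the allowed transitions are written down explicitly. The sketch below assumes four states and a retry path from running back to queued; the actual state names and transitions in the worker are not specified in this note.

```python
from enum import Enum
from typing import Dict, Set


class JobState(str, Enum):
    # Assumed state names; the real job model may differ.
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


ALLOWED: Dict[JobState, Set[JobState]] = {
    JobState.QUEUED: {JobState.RUNNING, JobState.FAILED},
    # RUNNING -> QUEUED models a requeue after a retryable error.
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED, JobState.QUEUED},
    JobState.COMPLETED: set(),  # terminal
    JobState.FAILED: set(),     # terminal
}


def transition(current: JobState, new: JobState) -> JobState:
    """Return the new state, or raise ValueError for an illegal transition."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

A unit test can then walk every pair of states and assert that only the listed transitions succeed.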
## 14) Known Constraints and Practical Considerations
- Quality depends on seed representativeness and prompt quality.
- External service availability (LLM providers, handwriting endpoint) impacts runtime reliability.
- Coordinate conversion and matching edge cases can affect overlay precision.
- Large batch jobs require async orchestration and observability.
- Some advanced realism features are still being iterated on and may change.
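Because runtime reliability depends on external services, calls to them are usually wrapped in retries. The following is a generic sketch of retry with exponential backoff and jitter, not the project's actual retry code; `call_with_retries` and its parameters are hypothetical.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = min(base_delay_s * 2 ** (attempt - 1), max_delay_s)
            # Jitter scales the delay to between 50% and 100% of its value,
            # which spreads out retries from concurrent workers.
            sleep(delay * (0.5 + rng() / 2))
    raise AssertionError("unreachable")
```

Only exceptions listed in `retry_on` are retried, so programming errors such as a `KeyError` still fail immediately.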
## 15) Why This Project Matters
DocGenie addresses a real bottleneck in document AI: obtaining large, diverse, high-quality labeled training data.
It provides a controllable synthetic data engine that can:
- accelerate experimentation
- reduce dependence on private data access
- improve model robustness through diversity and controlled perturbations
- support multiple document AI tasks in one platform
## 16) Suggested Prompt Context for Future LLM Tasks
Use the following when asking an LLM to help with this codebase:
Project summary:
I am working on DocGenie, a synthetic document generation platform with a 19-stage pipeline. It uses LLM-generated HTML/CSS from seed images, renders to PDF, extracts bboxes/geometries, optionally inserts diffusion-generated handwriting and visual elements, runs OCR, normalizes bboxes, verifies GT, and exports ML-ready artifacts. The system has a FastAPI service, async worker, and separate GPU handwriting service.
Primary objective:
Improve reliability, generation quality, and production scalability of synthetic dataset generation for DocVQA/KIE/layout tasks.
Technical priorities:
- API and worker robustness
- bbox/geometry correctness
- handwriting and visual insertion accuracy
- async job reliability and observability
- deployment and cost optimization
Constraints:
- External dependencies (LLM APIs, managed queue/db, GPU service)
- Need for reproducibility through seeded runs
- Must preserve compatibility of output metadata for downstream ML pipelines
When proposing changes:
- Keep stage boundaries clear
- Avoid breaking output contracts
- Include failure handling and retries
- Prefer measurable improvements (latency, cost, quality, reliability)
## 17) Fast Context Snapshot (Short Version)
DocGenie is an API-first synthetic document dataset generator for document AI. It takes seed images and generation settings, uses an LLM to generate document HTML/GT, renders PDFs, extracts geometry, optionally adds handwriting and visual artifacts, runs OCR, normalizes annotations, and returns/exports ML-ready data. It is built as a modular 19-stage pipeline with async job processing and a separate GPU handwriting service for scalable production usage.