Docgenie-API / LLM_PROJECT_CONTEXT_NOTE.md

Ahadhassan-2003

deploy: update HF Space

dc4e6da 22 days ago


	# DocGenie Project Context Note (LLM Ready)

	## 1) Executive Summary
	DocGenie is an AI-driven synthetic document generation platform designed to create realistic, annotated datasets for document intelligence tasks.

	The project combines:
	- LLM-based document content and layout generation
	- PDF rendering and geometric extraction
	- Optional handwriting synthesis (diffusion model)
	- Optional visual element insertion (logos, stamps, barcodes, charts, photos)
	- OCR extraction and bbox normalization
	- Ground-truth preparation for downstream machine learning
	- API-first and async batch workflows for production-scale generation

	The core idea is to transform a small set of real seed document images plus high-level generation parameters into large, diverse, reproducible synthetic datasets suitable for training and evaluation.

	## 2) Problem Statement
	Real document datasets are expensive and slow to collect, and they are often constrained by privacy, class imbalance, and weak annotation quality. This limits model quality for tasks such as document visual question answering (DocVQA), key information extraction (KIE), and layout understanding.

	Key challenges:
	- Lack of large high-quality labeled datasets
	- Domain mismatch between training and production documents
	- Manual labeling cost and inconsistency
	- Need for handwriting and visual artifacts in realistic layouts
	- Need for reproducibility and controllable data generation

	## 3) Proposed Solution
	DocGenie is a modular synthetic dataset engine with controllable realism.

	High-level solution flow:
	1. Select and ingest seed images that represent target document style.
	2. Use LLM prompting (vision + text) to generate HTML/CSS-based document variants and structured GT.
	3. Render HTML to PDF and extract text geometry/bboxes.
	4. Optionally replace selected text with generated handwriting.
	5. Optionally insert visual elements (stamp/logo/barcode/photo/figure).
	6. Produce final PDFs/images + OCR + normalized bboxes + verified GT + export packages.

	Design principles:
	- Stage-wise pipeline (clear inputs/outputs per stage)
	- Reproducibility via seeds
	- Production-ready API endpoints
	- Async job orchestration for large runs
	- Separation of CPU API workloads and GPU handwriting inference workloads
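
The stage-wise and seed-based principles above can be illustrated with a minimal sketch. Everything named here (`Stage`, `run_pipeline`, the two toy stages) is hypothetical and is not taken from the DocGenie codebase; it only shows one way to declare per-stage inputs and to drive all randomness from a single seed.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical contract: a stage reads the shared context plus a seeded RNG
# and returns the new keys it produces.
StageFn = Callable[[Dict[str, Any], random.Random], Dict[str, Any]]


@dataclass(frozen=True)
class Stage:
    name: str
    requires: Tuple[str, ...]  # context keys the stage reads
    run: StageFn


def run_pipeline(stages: List[Stage], context: Dict[str, Any], seed: int) -> Dict[str, Any]:
    """Run stages in order, checking declared inputs before each one."""
    rng = random.Random(seed)  # one seeded RNG makes the whole run repeatable
    ctx = dict(context)        # do not mutate the caller's dict
    for stage in stages:
        missing = [key for key in stage.requires if key not in ctx]
        if missing:
            raise KeyError(f"stage {stage.name!r} is missing inputs: {missing}")
        ctx.update(stage.run(ctx, rng))
    return ctx


# Two toy stages standing in for real ones.
def pick_seed_image(ctx: Dict[str, Any], rng: random.Random) -> Dict[str, Any]:
    return {"chosen": rng.choice(ctx["seed_images"])}


def make_variant(ctx: Dict[str, Any], rng: random.Random) -> Dict[str, Any]:
    return {"variant_id": f"{ctx['chosen']}-{rng.randint(0, 9999):04d}"}


STAGES = [
    Stage("select_seed", ("seed_images",), pick_seed_image),
    Stage("generate_variant", ("chosen",), make_variant),
]
```

Running `run_pipeline(STAGES, {"seed_images": [...]}, seed=7)` twice yields identical results. Because one RNG is shared, inserting or reordering stages changes later draws; a per-stage RNG derived from the run seed would avoid that, at the cost of slightly more bookkeeping.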

	## 4) Project Goals
	Primary goals:
	- Generate realistic synthetic documents at scale
	- Support multiple document AI tasks with rich annotation
	- Provide configurable realism controls (handwriting ratio, visual element types, OCR toggles)
	- Minimize generation cost with batched LLM calls
	- Enable operational deployment with monitoring and async processing

	Secondary goals:
	- Improve dataset diversity through seed and prompt strategies
	- Support rapid experimentation for model development
	- Keep architecture modular for independent upgrades

	## 5) Core Capabilities
	- Seed-image-guided generation (1-8 images per request)
	- Configurable document language/type and GT format
	- Multi-output generation per seed set (num_solutions)
	- Handwriting synthesis with writer-style consistency
	- Visual element synthesis and insertion
	- OCR extraction from final rendered artifacts
	- Normalized bbox outputs for ML pipelines
	- Optional dataset packaging/export (for training pipelines)
	- Async batch generation with status polling and result retrieval
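
As an illustration of the normalized bbox output listed above, the sketch below maps pixel boxes onto an integer grid. The note does not state which target range DocGenie uses; the 0 to 1000 grid here is an assumption (it is a common convention for layout-aware models), and `normalize_bbox` is a hypothetical helper name.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def normalize_bbox(
    bbox: BBox,
    page_width: float,
    page_height: float,
    scale: int = 1000,  # assumed target grid, not confirmed by the note
) -> Tuple[int, int, int, int]:
    """Map a pixel bbox onto an integer [0, scale] grid, clamping to the page."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    x0, y0, x1, y1 = bbox
    # Order the corners so x0 <= x1 and y0 <= y1 even if the input is flipped.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))

    def to_grid(value: float, extent: float) -> int:
        clamped = min(max(value, 0.0), extent)
        return int(round(scale * clamped / extent))

    return (
        to_grid(x0, page_width),
        to_grid(y0, page_height),
        to_grid(x1, page_width),
        to_grid(y1, page_height),
    )
```

Clamping keeps boxes that slightly overflow the page (for example after rounding in the renderer) inside the valid range instead of producing out-of-range labels.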

	## 6) 19-Stage Pipeline (Conceptual)
	DocGenie follows a full multi-stage pipeline:
	1. Seed selection/download
	2. Prompt LLM
	3. Process LLM response and extract HTML/GT
	4. Render PDF and extract geometries
	5. Extract text bboxes
	6. Validate generated artifacts
	7. Extract handwriting region definitions
	8. Extract visual element definitions
	9. Generate handwriting images
	10. Generate visual element images
	11. Re-render PDF (without placeholders where required)
	12. Insert handwriting overlays
	13. Insert visual overlays
	14. Render document images
	15. Run OCR
	16. Normalize bboxes
	17. Prepare/verify GT
	18. Analyze run statistics
	19. Create debug/export outputs

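	Important details:
	- Browser geometries are typically expressed in CSS pixels (96 per inch), while PDF geometries use points (72 per inch), so coordinates must be rescaled (by 72/96 = 0.75 when going from browser to PDF) before they can be compared or overlaid.
	- Handwriting insertion requires text-to-bbox matching and deduplication logic.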
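
The 96 DPI to 72 DPI conversion can be sketched as follows. The function names are hypothetical. Whether the vertical flip is needed is also an assumption that depends on the PDF library: the PDF default user space has its origin at the bottom-left, but some libraries already expose top-left coordinates.

```python
from typing import Tuple

CSS_PX_PER_INCH = 96.0  # CSS pixels per inch
PDF_PT_PER_INCH = 72.0  # PDF points per inch

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def css_px_to_pdf_pt(bbox: BBox) -> BBox:
    """Scale a browser bbox (CSS px) to PDF points, keeping the same origin."""
    factor = PDF_PT_PER_INCH / CSS_PX_PER_INCH  # 0.75
    x0, y0, x1, y1 = bbox
    return (x0 * factor, y0 * factor, x1 * factor, y1 * factor)


def flip_y(bbox: BBox, page_height: float) -> BBox:
    """Convert between top-left-origin and bottom-left-origin boxes.

    The bbox and page_height must be in the same unit.
    """
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


# A box measured in the browser, converted for a US Letter page (792 pt tall).
browser_box = (96.0, 192.0, 288.0, 384.0)
pdf_box_top_left = css_px_to_pdf_pt(browser_box)
pdf_box_bottom_left = flip_y(pdf_box_top_left, page_height=792.0)
```

Applying `flip_y` twice returns the original box, which makes it a convenient check when debugging overlay misalignment.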

	## 7) API Product Surface
	The API is built around three usage patterns:

	1) Synchronous generation endpoint
	- Returns generated documents and metadata directly in response.
	- Suitable for development and debugging.

	2) Synchronous PDF/ZIP artifact endpoint
	- Returns packaged artifacts (PDF, metadata, optional assets) in downloadable form.
	- Suitable for practical batch outputs.

	3) Asynchronous batch endpoint
	- Queues long-running generation jobs.
	- Returns request/task id.
	- Client polls status endpoint.
	- Client fetches/downloads final output when completed.
	- Best for production and larger workloads.
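
The poll-until-done flow of the asynchronous pattern might look like the sketch below. The state names (`queued`, `running`, `completed`, `failed`) and the shape of the status response are assumptions, since the note does not specify them. The HTTP call is injected as `fetch_status` so the loop can be exercised without a running service.

```python
import time
from typing import Callable, Dict

TERMINAL_STATES = {"completed", "failed"}  # assumed state names


def wait_for_job(
    fetch_status: Callable[[str], Dict[str, str]],
    task_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until the job reaches a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        status = fetch_status(task_id)
        if status.get("state") in TERMINAL_STATES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(
                f"job {task_id} still {status.get('state')!r} after {waited:.0f}s"
            )
        sleep(interval_s)
        waited += interval_s
```

In real use, `fetch_status` would wrap a GET request to the status endpoint and return the parsed JSON body; the caller then downloads the result only when the returned state is `completed`.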

	Typical request dimensions:
	- seed_images: list of remote URLs
	- prompt_params: language, doc type, GT settings, feature toggles, reproducibility seed
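
A request body along these lines could be modelled as below. Only `seed_images`, `prompt_params`, `num_solutions`, and the 1 to 8 image limit come from this note; the remaining key names inside `prompt_params` and the example URL are illustrative assumptions, not the actual schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List
from urllib.parse import urlparse


@dataclass
class GenerationRequest:
    seed_images: List[str]  # remote URLs
    prompt_params: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The note states 1-8 seed images per request.
        if not 1 <= len(self.seed_images) <= 8:
            raise ValueError("seed_images must contain between 1 and 8 URLs")
        for url in self.seed_images:
            parts = urlparse(url)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                raise ValueError(f"not a remote URL: {url!r}")

    def to_payload(self) -> Dict[str, Any]:
        """Return a plain dict suitable for JSON serialization."""
        return asdict(self)


request = GenerationRequest(
    seed_images=["https://example.com/invoice_01.png"],
    prompt_params={
        # Key names other than num_solutions are assumptions.
        "language": "en",
        "doc_type": "invoice",
        "num_solutions": 2,
        "enable_handwriting": True,
        "seed": 42,
    },
)
```

Validating on the client side catches malformed requests before they consume queue capacity or LLM budget.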

	## 8) System Architecture
	Monorepo-style architecture with independent service boundaries:

	A) Core package
	- Shared generation logic and pipeline stages.

	B) API service (CPU)
	- FastAPI interface
	- Orchestrates generation pipeline
	- Manages async queue and external integrations

	C) Background worker (CPU)
	- Executes queued async jobs
	- Handles long-running generation and packaging workflows

	D) Handwriting service (GPU)
	- Separate service for diffusion-based handwriting generation
	- Designed to be deployable independently

	E) Data stores and platform services
	- Queue broker (Redis)
	- Metadata storage (database)
	- File delivery/storage integration

	Architecture intent:
	- Keep API orchestration scalable and light
	- Offload expensive handwriting generation to GPU service
	- Enable independent deployment and scaling per component

	## 9) Handwriting Subsystem
	Handwriting generation is treated as a specialized capability:
	- Uses diffusion-style generation with writer IDs/styles
	- Supports per-word token generation and mapping
	- Supports post-processing (blur, anti-aliasing, cropping)
	- Designed for realism and style consistency within a document

	Operational notes:
	- Batch handling is tuned to reduce service cost and amortize startup overhead
	- Some model/sampling settings are constrained by the underlying handwriting model implementation
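
The per-word mapping mentioned above depends on matching each handwriting target to a word bbox without assigning the same box twice. The sketch below shows one simple approach under stated assumptions: exact, case-insensitive text matching and reading-order assignment. It is not the project's actual matching logic, and `match_words_to_bboxes` is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]


def _norm(text: str) -> str:
    """Normalize text for comparison (assumed: trim and lowercase only)."""
    return text.strip().lower()


def match_words_to_bboxes(
    targets: List[str],
    words: List[Tuple[str, BBox]],
) -> List[Optional[int]]:
    """For each target word, return the index of a distinct matching word bbox.

    A bbox index is used at most once, so repeated words map to successive
    occurrences in document order. Unmatched targets map to None.
    """
    # Candidate indices per normalized text, kept in document order.
    candidates: Dict[str, List[int]] = {}
    for index, (text, _bbox) in enumerate(words):
        candidates.setdefault(_norm(text), []).append(index)

    matches: List[Optional[int]] = []
    for target in targets:
        queue = candidates.get(_norm(target), [])
        matches.append(queue.pop(0) if queue else None)
    return matches
```

A production matcher would likely also need fuzzy comparison and spatial constraints, since rendered text can differ from the LLM output in whitespace, hyphenation, or line breaks.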

	## 10) Visual Element Subsystem
	Visual elements include artifacts commonly found in real documents:
	- logos
	- stamps
	- barcodes
	- photos
	- figures/charts

	Key behavior:
	- Placeholder-based extraction from generated HTML/geometries
	- Type normalization and filtering by request settings
	- Coordinate-aware insertion into final PDF/image artifacts
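
Type normalization and request-based filtering could be sketched as follows. The five canonical types come from the list above; the alias table, the element dict shape, and the function names are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List, Optional, Set

CANONICAL_TYPES = {"logo", "stamp", "barcode", "photo", "figure"}

# Assumed aliases that an LLM might emit for the canonical types.
ALIASES = {
    "logos": "logo",
    "stamps": "stamp",
    "seal": "stamp",
    "barcodes": "barcode",
    "qr": "barcode",
    "photos": "photo",
    "image": "photo",
    "figures": "figure",
    "chart": "figure",
    "charts": "figure",
}


def normalize_type(raw: str) -> Optional[str]:
    """Return the canonical type for a raw label, or None if unknown."""
    key = raw.strip().lower()
    key = ALIASES.get(key, key)
    return key if key in CANONICAL_TYPES else None


def filter_elements(
    elements: Iterable[Dict[str, Any]],
    enabled: Set[str],
) -> List[Dict[str, Any]]:
    """Keep elements whose normalized type is enabled for this request."""
    kept = []
    for element in elements:
        kind = normalize_type(str(element.get("type", "")))
        if kind is not None and kind in enabled:
            kept.append({**element, "type": kind})  # copy with canonical type
    return kept
```

Unknown labels are dropped rather than guessed, so an unexpected placeholder type never reaches the insertion stage.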

	## 11) Data and Output Contracts
	The project outputs ML-ready artifacts with rich metadata:

	Typical outputs:
	- Generated HTML/CSS
	- Intermediate and final PDFs
	- Rasterized page images
	- Word/segment/layout bboxes
	- Normalized coordinate variants
	- Handwriting images and maps
	- Visual element images and maps
	- Ground-truth objects (task dependent)
	- Optional packaged export for training pipelines

	This enables direct use for training/evaluation datasets, debugging, and pipeline QA.

	## 12) Deployment Strategy (Current Direction)
	Recommended deployment split:
	- API + worker on CPU-friendly platform
	- Handwriting service on GPU-capable platform
	- Redis and database as managed services

	Why this split works:
	- Different resource profiles (CPU orchestration vs GPU inference)
	- Independent scaling and cost control
	- Service isolation improves reliability and debugging

	## 13) Testing and Quality Strategy
	The project testing plan emphasizes:
	- Unit tests per critical stage function
	- Integration tests for service boundaries (LLM, handwriting service, queue)
	- System tests for end-to-end generation
	- Non-functional tests: performance, reliability, scalability, security

	Key risk areas tested heavily:
	- External API failures/retries
	- Geometry and bbox alignment
	- Async job state transitions
	- Handwriting/visual overlay correctness
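
Async job state transitions are easier to test when the allowed moves are written down explicitly. The sketch below assumes four states and a retry path from `failed` back to `queued`; neither the state names nor the transition table is taken from the DocGenie code.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition table: each state maps to the states it may move to.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: {JobState.QUEUED},  # assumed: failed jobs may be re-queued
}


def transition(current: JobState, new: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

A unit test can then enumerate every pair of states and assert that exactly the pairs in `ALLOWED` succeed, which guards against a worker silently overwriting a terminal state.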

	## 14) Known Constraints and Practical Considerations
	- Quality depends on seed representativeness and prompt quality.
	- External service availability (LLM providers, handwriting endpoint) impacts runtime reliability.
	- Coordinate conversion and matching edge cases can affect overlay precision.
	- Large batch jobs require async orchestration and observability.
	- Some advanced realism features are still being iterated on and improved.
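
Since external service availability affects runtime reliability, calls to LLM providers or the handwriting endpoint are natural candidates for retries. The sketch below shows exponential backoff with full jitter as one common approach; the helper name, defaults, and retried exception types are assumptions and not the project's actual policy.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call func, retrying on the given exceptions with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay cap doubles each attempt, bounded by max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(rng.uniform(0, cap))  # full jitter spreads out retries
    raise AssertionError("unreachable")
```

Only transient errors should be listed in `retry_on`; retrying on validation errors or authentication failures would waste time and budget without changing the outcome.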

	## 15) Why This Project Matters
	DocGenie addresses a real bottleneck in document AI: obtaining large, diverse, high-quality labeled training data.

	It provides a controllable synthetic data engine that can:
	- accelerate experimentation
	- reduce dependence on private data access
	- improve model robustness through diversity and controlled perturbations
	- support multiple document AI tasks in one platform

	## 16) Suggested Prompt Context for Future LLM Tasks
	Use the following when asking an LLM to help with this codebase:

	Project summary:
	I am working on DocGenie, a synthetic document generation platform with a 19-stage pipeline. It uses LLM-generated HTML/CSS from seed images, renders to PDF, extracts bboxes/geometries, optionally inserts diffusion-generated handwriting and visual elements, runs OCR, normalizes bboxes, verifies GT, and exports ML-ready artifacts. The system has a FastAPI service, async worker, and separate GPU handwriting service.

	Primary objective:
	Improve reliability, generation quality, and production scalability of synthetic dataset generation for DocVQA/KIE/layout tasks.

	Technical priorities:
	- API and worker robustness
	- bbox/geometry correctness
	- handwriting and visual insertion accuracy
	- async job reliability and observability
	- deployment and cost optimization

	Constraints:
	- External dependencies (LLM APIs, managed queue/db, GPU service)
	- Need for reproducibility through seeded runs
	- Need to preserve compatibility of output metadata for downstream ML pipelines

	When proposing changes:
	- Keep stage boundaries clear
	- Avoid breaking output contracts
	- Include failure handling and retries
	- Prefer measurable improvements (latency, cost, quality, reliability)

	## 17) Fast Context Snapshot (Short Version)
	DocGenie is an API-first synthetic document dataset generator for document AI. It takes seed images and generation settings, uses an LLM to generate document HTML/GT, renders PDFs, extracts geometry, optionally adds handwriting and visual artifacts, runs OCR, normalizes annotations, and returns/exports ML-ready data. It is built as a modular 19-stage pipeline with async job processing and a separate GPU handwriting service for scalable production usage.