| # Complete API Flow Documentation |
|
|
| ## Overview |
| The DocGenie API provides three endpoints for synthetic document generation, implementing a 19-stage pipeline that transforms seed images and prompts into complete datasets with OCR, ground truth, and optional handwriting/visual elements. |
|
|
| **Base URL**: `http://localhost:8000` (development) or Railway deployment |
| **Documentation**: `/docs` (FastAPI auto-generated Swagger UI) |
|
|
| --- |
|
|
| ## API Endpoints |
|
|
| ### 1. `/generate` - Legacy JSON Response (POST) |
| **Purpose**: Generate documents and return complete JSON metadata |
| **Response**: JSON with HTML, PDF (base64), bounding boxes, optional handwriting/visual elements |
| **Use Case**: Testing, development, full metadata inspection |
| **Pipeline Stages**: 1-19 (configurable via parameters) |
|
|
| ### 2. `/generate/pdf` - Sync PDF+Dataset ZIP (POST) |
| **Purpose**: Generate documents and return ZIP file with all artifacts |
| **Response**: ZIP file containing: |
| - `*.pdf` - Generated document PDFs |
| - `*_final.pdf` - PDFs with handwriting/visual elements (if enabled) |
| - `*.msgpack` - Dataset format (if export enabled) |
| - `metadata.json` - Complete generation metadata |
| - `handwriting/` - Individual handwriting images |
| - `visual_elements/` - Individual visual element images |
|
|
| **Use Case**: Production dataset generation, batch processing |
| **Pipeline Stages**: 1-19 (all features available) |
|
|
| ### 3. `/generate/async` - Async Batch Processing (POST) |
| **Purpose**: Queue large batch jobs via background worker (Redis Queue) |
| **Response**: Task ID for status polling |
| **Status Check**: `GET /generate/async/status/{task_id}` |
| **Result Download**: `GET /generate/async/result/{task_id}` (returns ZIP) |
| **Use Case**: Large-scale dataset generation (100+ documents) |
| **Pipeline Stages**: 1-19 (via worker.py) |
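|
| For reference, a minimal client sketch of this async flow, assuming the endpoint paths listed above and the request schema described under Request Parameters; the `task_id`/`status` field names follow the example response later in this document and are otherwise assumptions: |
|
| ```python |
| import time |
| import requests |
|  |
| BASE_URL = "http://localhost:8000" |
|  |
| payload = { |
|     "seed_images": ["https://example.com/seed1.jpg"], |
|     "prompt_params": {"doc_type": "invoice", "num_solutions": 1}, |
| } |
|  |
| # Queue the batch job and keep the task ID for polling |
| task_id = requests.post(f"{BASE_URL}/generate/async", json=payload).json()["task_id"] |
|  |
| # Poll the status endpoint until the worker finishes |
| while True: |
|     status = requests.get(f"{BASE_URL}/generate/async/status/{task_id}").json()["status"] |
|     if status in ("completed", "failed"): |
|         break |
|     time.sleep(10) |
|  |
| # Download the resulting ZIP archive |
| if status == "completed": |
|     with open("dataset.zip", "wb") as f: |
|         f.write(requests.get(f"{BASE_URL}/generate/async/result/{task_id}").content) |
| ``` |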
|
|
| --- |
|
|
| ## Request Parameters |
|
|
| ```python |
| class GenerateDocumentRequest: |
| seed_images: List[HttpUrl] # 1-8 seed images from web URLs |
| prompt_params: PromptParameters # Generation configuration |
| |
| class PromptParameters: |
| # Core Parameters |
| language: str = "english" # Document language |
| doc_type: str = "invoice" # Document type (invoice, receipt, form, etc.) |
| gt_type: str = "qa" # Ground truth format (qa, kie) |
| gt_format: str = "json" # GT encoding (json, annotation) |
| num_solutions: int = 1 # Documents per seed set |
| |
| # Feature Toggles (Stages 07-19) |
| enable_handwriting: bool = False # Stage 07-09, 12 |
| handwriting_ratio: float = 0.2 # Probabilistic filter (0.0-1.0) |
| enable_visual_elements: bool = False # Stage 08, 10, 13 |
| visual_element_types: List[str] = [] # Filter types: logo, photo, figure, barcode, etc. |
| enable_ocr: bool = True # Stage 15 |
| enable_bbox_normalization: bool = True # Stage 16 |
| enable_gt_verification: bool = False # Stage 17 |
| enable_analysis: bool = False # Stage 18 |
| enable_debug_visualization: bool = False # Stage 19 |
| enable_dataset_export: bool = False # Stage 19 (msgpack format) |
| dataset_export_format: str = "msgpack" # Currently only msgpack supported |
| |
| # Reproducibility |
| seed: Optional[int] = None # Random seed (null = random, int = reproducible) |
| ``` |
|
|
| --- |
|
|
| ## Pipeline Architecture: The 19 Stages |
|
|
| The API implements all 19 stages of the original batch pipeline in `docgenie/generation/`. Each stage is mapped to corresponding functions in `api/utils.py`. |
|
|
| ### **Phase 1: Core Pipeline (Stages 01-06)** |
| Generate base documents from seed images and LLM prompts. |
|
|
| #### **Stage 01: Seed Selection & Download** |
| - **Original**: `pipeline_01_select_seeds.py` |
| - **API**: `download_seed_images()` in `api/utils.py:117-161` |
| - **Process**: |
| 1. Accept user-provided seed image URLs (1-8 images) |
| 2. Download with retry logic (3 attempts, exponential backoff) |
| 3. Handle transient HTTP errors (502, 503, 504, 429) |
| 4. Convert to base64 for LLM input |
| - **Error Handling**: Retry with 2s, 4s, 8s delays; raise HTTPException on failure |
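|
| A minimal sketch of this retry behaviour; the function and constant names are illustrative and not the exact `api/utils.py` implementation (which raises `HTTPException` instead of `RuntimeError`): |
|
| ```python |
| import base64 |
| import time |
| import requests |
|  |
| RETRYABLE = {429, 502, 503, 504} |
|  |
| def download_seed_image(url: str, attempts: int = 3) -> str: |
|     """Download one seed image and return it base64-encoded for the LLM.""" |
|     for attempt in range(attempts): |
|         resp = requests.get(url, timeout=30) |
|         if resp.status_code == 200: |
|             return base64.b64encode(resp.content).decode("utf-8") |
|         if resp.status_code not in RETRYABLE or attempt == attempts - 1: |
|             break |
|         time.sleep(2 ** (attempt + 1))  # 2s, 4s, 8s backoff |
|     raise RuntimeError(f"Failed to download seed image: {url}") |
| ``` |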
|
|
| #### **Stage 02: Prompt LLM** |
| - **Original**: `pipeline_02_prompt_llm.py` |
| - **API**: `call_claude_api_direct()` in `api/utils.py:550-600` |
| - **Process**: |
| 1. Load prompt template: `data/prompt_templates/ClaudeRefined12/seed-based-json.txt` |
| 2. Build prompt with parameters: language, doc_type, gt_type, num_solutions |
| 3. Call Claude API (Anthropic Messages API v1) |
| - Model: `claude-3-5-sonnet-20241022` (configurable) |
| - Max tokens: 16,000 |
| - Temperature: 1.0 |
| - Vision: Send base64-encoded seed images |
| 4. Receive HTML documents with embedded ground truth |
| - **LLM Output Format**: Multiple `<!DOCTYPE html>...</html>` blocks with: |
| - CSS styling with page dimensions |
| - HTML elements with semantic classes |
| - Handwriting markers: `class="handwritten author1"` (author1, author2, etc.) |
| - Visual element placeholders: `data-placeholder="logo"`, `data-content="company-logo"` |
| - Ground truth: `<script id="GT">{...json...}</script>` |
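|
| A condensed sketch of the Messages API call with vision input; the real prompt template loading and parameter substitution live in `api/utils.py`, and the helper name here is illustrative: |
|
| ```python |
| import anthropic |
|  |
| client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment |
|  |
| def call_llm(prompt_text: str, seed_images_b64: list[str]) -> str: |
|     content = [ |
|         {"type": "image", |
|          "source": {"type": "base64", "media_type": "image/jpeg", "data": img}} |
|         for img in seed_images_b64 |
|     ] |
|     content.append({"type": "text", "text": prompt_text}) |
|  |
|     response = client.messages.create( |
|         model="claude-3-5-sonnet-20241022", |
|         max_tokens=16000, |
|         temperature=1.0, |
|         messages=[{"role": "user", "content": content}], |
|     ) |
|     # The HTML documents (with embedded GT) come back as plain text |
|     return response.content[0].text |
| ``` |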
| |
| #### **Stage 03: Process Response & Extract HTML** |
| - **Original**: `pipeline_03_process_response.py` |
| - **API**: `extract_html_documents_from_response()` in `api/utils.py:605-635` |
| - **Process**: |
| 1. Parse LLM response for `<!DOCTYPE html>...</html>` blocks (regex) |
| 2. Prettify HTML with BeautifulSoup |
| 3. Validate HTML structure |
| 4. Extract ground truth JSON from `<script id="GT">` tag |
| 5. Remove GT script tag, clean HTML for rendering |
| - **Validation**: Check for required elements, CSS, proper structure |
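|
| A compact sketch of this parsing step, using the regex and tag names described above; the production code adds the structural validation mentioned in the last bullet: |
|
| ```python |
| import json |
| import re |
| from bs4 import BeautifulSoup |
|  |
| HTML_BLOCK = re.compile(r"<!DOCTYPE html>.*?</html>", re.DOTALL | re.IGNORECASE) |
|  |
| def extract_documents(llm_response: str) -> list[tuple[str, dict]]: |
|     """Return (clean_html, ground_truth) pairs from one LLM response.""" |
|     documents = [] |
|     for block in HTML_BLOCK.findall(llm_response): |
|         soup = BeautifulSoup(block, "html.parser") |
|         gt_tag = soup.find("script", id="GT") |
|         ground_truth = json.loads(gt_tag.string) if gt_tag and gt_tag.string else {} |
|         if gt_tag: |
|             gt_tag.decompose()  # remove the GT script before rendering |
|         documents.append((soup.prettify(), ground_truth)) |
|     return documents |
| ``` |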
|
|
| #### **Stage 04: Render PDF & Extract Geometries** |
| - **Original**: `pipeline_04_render_pdf_and_extract_geos.py` |
| - **API**: `render_html_to_pdf()` in `api/utils.py:650-740` |
| - **Process**: |
| 1. Launch Playwright browser (Chromium) |
| 2. Set page dimensions from CSS `@page` rule |
| 3. Render HTML to PDF via `page.pdf()` |
| 4. Extract element geometries: |
| - Handwriting elements: `.handwritten` class → `{rect, text, classes, selectorTypes: ["handwriting"]}` |
| - Visual elements: `[data-placeholder]` attribute → `{rect, dataPlaceholder, dataContent, selectorTypes: ["visual_element"]}` |

| 5. Save PDF and geometries JSON |
| - **Output**: |
| - PDF at 72 DPI (PyMuPDF standard) |
| - Geometries at 96 DPI (browser rendering) |
| - Dimensions in mm |
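|
| A simplified sketch of the Playwright step, using the selectors and field names listed above; error handling and the mm conversion are omitted, and the helper name is illustrative: |
|
| ```python |
| from playwright.sync_api import sync_playwright |
|  |
| JS_COLLECT = """ |
| () => { |
|   const grab = (els, type) => Array.from(els).map(el => { |
|     const r = el.getBoundingClientRect(); |
|     return {rect: {x: r.x, y: r.y, width: r.width, height: r.height}, |
|             text: el.innerText, classes: el.className, |
|             dataPlaceholder: el.getAttribute('data-placeholder'), |
|             dataContent: el.getAttribute('data-content'), |
|             selectorTypes: [type]}; |
|   }); |
|   return grab(document.querySelectorAll('.handwritten'), 'handwriting') |
|     .concat(grab(document.querySelectorAll('[data-placeholder]'), 'visual_element')); |
| } |
| """ |
|  |
| def render_html(html: str, pdf_path: str) -> list[dict]: |
|     with sync_playwright() as p: |
|         browser = p.chromium.launch() |
|         page = browser.new_page() |
|         page.set_content(html, wait_until="networkidle") |
|         geometries = page.evaluate(JS_COLLECT)  # CSS pixels (96 DPI) |
|         page.pdf(path=pdf_path, prefer_css_page_size=True)  # honour the @page size |
|         browser.close() |
|     return geometries |
| ``` |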
|
|
| #### **Stage 05: Extract Bounding Boxes** |
| - **Original**: `pipeline_05_extract_bboxes_from_pdf.py` |
| - **API**: `extract_bboxes_from_rendered_pdf()` in `api/utils.py:750-825` |
| - **Process**: |
| 1. Open PDF with PyMuPDF (fitz) |
| 2. Extract text at word level: `page.get_text("words")` |
| 3. Structure bboxes as: |
| ```python |
| { |
| "text": "word", |
| "x0": float, # left |
| "y0": float, # top |
| "x1": float, # right (x2) |
| "y1": float, # bottom (y2) |
| "block_no": int, |
| "line_no": int, |
| "word_no": int |
| } |
| ``` |
| 4. Filter whitespace-only text |
| 5. Convert to OCRBox objects for processing |
| - **Coordinate System**: PDF points (72 DPI), origin top-left |
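|
| The extraction itself is small; a sketch that produces the structure above (whitespace filtering included): |
|
| ```python |
| import fitz  # PyMuPDF |
|  |
| def extract_word_bboxes(pdf_path: str) -> list[dict]: |
|     """Word-level boxes in PDF points (72 DPI), origin at the top-left.""" |
|     page = fitz.open(pdf_path)[0] |
|     words = [] |
|     for x0, y0, x1, y1, text, block_no, line_no, word_no in page.get_text("words"): |
|         if not text.strip(): |
|             continue  # drop whitespace-only tokens |
|         words.append({"text": text, "x0": x0, "y0": y0, "x1": x1, "y1": y1, |
|                       "block_no": block_no, "line_no": line_no, "word_no": word_no}) |
|     return words |
| ``` |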
| |
| #### **Stage 06: Validation** |
| - **Original**: `pipeline_06_validation.py` (implicit) |
| - **API**: `validate_html_structure()`, `validate_pdf()`, `validate_bboxes()` in `api/utils.py:830-890` |
| - **Checks**: |
| - HTML: Required DOCTYPE, head, body, CSS |
| - PDF: File readable, page count = 1, has text |
| - Bboxes: Minimum count (configurable), valid coordinates |
|
|
| --- |
|
|
| ### **Phase 2: Feature Synthesis (Stages 07-13)** |
| Add handwriting and visual elements to base documents. |
|
|
| #### **Stage 07: Extract Handwriting Definitions** |
| - **Original**: `pipeline_07_extract_handwriting.py` |
| - **API**: `process_stage3_complete()` section in `api/utils.py:1150-1235` |
| - **Process**: |
| 1. Filter geometries: `"handwriting" in geo['selectorTypes']` |
| 2. Parse classes: Extract `author1`, `author2`, etc. from `class="handwritten author1"` |
| 3. **Probabilistic filtering** (handwriting_ratio): |
| ```python |
| if random.random() > handwriting_ratio: |
| continue # Skip this element |
| ``` |
| - `ratio=0.0`: No handwriting (0%) |
| - `ratio=0.5`: ~50% of marked elements |
| - `ratio=1.0`: All marked elements (100%) |
| 4. Match geometries to word bboxes: |
| - Convert browser coords (96 DPI) to PDF coords (72 DPI): `scale = 72/96 = 0.75` |
| - Find consecutive word bboxes matching geometry text |
| - Check bboxes are within geometry rect (threshold: 0.7) |
| - Track taken bbox indices to avoid duplicates |
| 5. Build handwriting region definitions: |
| ```python |
| { |
| "id": "hw0", |
| "text": "Patient Name", |
| "author_id": "author1", |
| "is_signature": False, |
| "rect": {x, y, width, height}, # in points |
| "bboxes": ["0_0_0 Patient 10.0 20.0 50.0 35.0", ...] |
| } |
| ``` |
| - **Reproducibility**: Use `seed + i` for each region to maintain order consistency |
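|
| A sketch of the coordinate conversion and containment test described above; the threshold and helper names are illustrative, not the exact `api/utils.py` code: |
|
| ```python |
| BROWSER_TO_PDF = 72 / 96  # 0.75: CSS pixels -> PDF points |
|  |
| def rect_px_to_pt(rect: dict) -> dict: |
|     """Scale a browser-space rect (x, y, width, height) into PDF points.""" |
|     return {k: v * BROWSER_TO_PDF for k, v in rect.items()} |
|  |
| def bbox_inside(bbox: dict, rect: dict, threshold: float = 0.7) -> bool: |
|     """True if at least `threshold` of the word bbox area falls inside the region.""" |
|     ix0 = max(bbox["x0"], rect["x"]) |
|     iy0 = max(bbox["y0"], rect["y"]) |
|     ix1 = min(bbox["x1"], rect["x"] + rect["width"]) |
|     iy1 = min(bbox["y1"], rect["y"] + rect["height"]) |
|     inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0) |
|     area = (bbox["x1"] - bbox["x0"]) * (bbox["y1"] - bbox["y0"]) |
|     return area > 0 and inter / area >= threshold |
| ``` |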
| |
| #### **Stage 08: Extract Visual Element Definitions** |
| - **Original**: `pipeline_08_extract_visual_element_definitions.py` |
| - **API**: `process_stage3_complete()` section in `api/utils.py:1237-1275` |
| - **Process**: |
| 1. Filter geometries: `"visual_element" in geo['selectorTypes']` |
| 2. Parse attributes: |
| - `data-placeholder`: Element type (logo, photo, figure, chart, barcode, etc.) |
| - `data-content`: Semantic description (e.g., "company-logo", "product-photo") |
| 3. Normalize types using synonyms: |
| - "chart" β "figure" |
| - "image" β "photo" |
| 4. Filter by `visual_element_types` parameter (if specified) |
| 5. Convert coordinates: pixels (96 DPI) → mm |
| 6. Extract rotation from CSS `transform: rotate(Xdeg)` |
| 7. Build visual element definitions: |
| ```python |
| { |
| "id": "ve0", |
| "type": "logo", # normalized |
| "content": "company-logo", |
| "rect": {x, y, width, height}, # in mm |
| "rotation": 0 # degrees |
| } |
| ``` |
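|
| A sketch of the attribute parsing and unit conversion for one geometry; the synonym table is abbreviated, and the `cssTransform` field name is an assumption used here for illustration: |
|
| ```python |
| import re |
|  |
| PX_TO_MM = 25.4 / 96          # browser pixels -> millimetres |
| SYNONYMS = {"chart": "figure", "image": "photo"} |
| ROTATE_RE = re.compile(r"rotate\((-?\d+(?:\.\d+)?)deg\)") |
|  |
| def build_visual_element(idx: int, geo: dict) -> dict: |
|     ve_type = SYNONYMS.get(geo["dataPlaceholder"], geo["dataPlaceholder"]) |
|     rotation_match = ROTATE_RE.search(geo.get("cssTransform", "") or "") |
|     return { |
|         "id": f"ve{idx}", |
|         "type": ve_type,                                   # normalized type |
|         "content": geo.get("dataContent"), |
|         "rect": {k: v * PX_TO_MM for k, v in geo["rect"].items()}, |
|         "rotation": float(rotation_match.group(1)) if rotation_match else 0, |
|     } |
| ``` |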
| |
| #### **Stage 09: Create Handwriting Images** |
| - **Original**: `pipeline_09_create_handwriting_images.py` |
| - **API**: `call_handwriting_service_batch()` in `api/utils.py:785-920` |
| - **Handwriting Service**: RunPod serverless endpoint hosting WordStylist diffusion model |
| - **Service Implementation**: `handwriting_service/handler.py`, `handwriting_service/inference.py` |
|
|
| **Handwriting Service Integration Details:** |
|
|
| ##### **Service Architecture** |
| - **Platform**: RunPod Serverless (GPU: NVIDIA A4000, Cost: ~$0.00025/s active) |
| - **Model**: WordStylist (Diffusion-based handwriting synthesis) |
| - Architecture: UNet with conditional style embeddings |
| - Input: Text (A-Z, a-z only, no spaces), Writer style ID (0-656) |
| - Output: PNG image with transparent background |
| - Inference time: ~18s per text on A4000 |
| - Weights: `handwriting_service/WordStylist/models/` |
| - **Endpoints**: |
| - `/run` (async): Queue job, return ID, poll `/status/{id}` (10MB limit) |
| - `/runsync` (sync): Wait for completion, return result (20MB limit, used by API) |
|
|
| ##### **Batch Processing (Cost Optimization)** |
| The API uses TRUE batch processing to minimize RunPod activation overhead: |
|
|
| ```python |
| # ✅ NEW: Batch all texts in ONE request |
| runpod_request = { |
| "input": { |
| "texts": [ |
| {"text": "Hello", "author_id": 42, "hw_id": "hw0_b0_l0_w0"}, |
| {"text": "World", "author_id": 42, "hw_id": "hw0_b0_l0_w1"}, |
| # ... 10-100 texts |
| ], |
| "apply_blur": True |
| } |
| } |
| # Result: 1 worker activation × (N × 18s) → roughly 40-60% cost savings |
| ``` |
|
|
| **Cost Comparison for 10 texts:** |
| - ❌ OLD (parallel): 10 workers × 18s = 180 worker-seconds + 10× activation fees |
| - ✅ NEW (batched): 1 worker × 190s = 190 worker-seconds + 1× activation fee |
|
|
| ##### **API Processing Flow** |
| 1. **Group by region and line**: Split handwriting regions into word-level requests |
| ```python |
| # Text: "Patient Name" β 2 word-level generations |
| texts_to_generate = [ |
| {"text": "Patient", "author_id": 42, "hw_id": "hw0_b0_l0_w0"}, |
| {"text": "Name", "author_id": 42, "hw_id": "hw0_b0_l0_w1"} |
| ] |
| ``` |
|
|
| 2. **Map author IDs to numeric styles**: |
| ```python |
| # "author1" β WRITER_STYLES[1] = 42 (deterministic) |
| # "author2" β WRITER_STYLES[2] = 137 |
| # 657 total writer styles available |
| ``` |
|
|
| 3. **Sanitize text** (WordStylist constraint): |
| ```python |
| # Only A-Z, a-z allowed (no spaces, numbers, punctuation) |
| "Hello123!" β "Hello" |
| "first-name" β "firstname" |
| ``` |
|
|
| 4. **Send batch request** to RunPod `/runsync` endpoint: |
| ```python |
| POST https://api.runpod.ai/v2/{endpoint_id}/runsync |
| Authorization: Bearer {RUNPOD_API_KEY} |
| Content-Type: application/json |
| |
| { |
| "input": { |
| "texts": [...], |
| "apply_blur": True # Gaussian blur for realism |
| } |
| } |
| ``` |
|
|
| 5. **Handle async responses**: |
| - If `status: "IN_PROGRESS"`: Poll `/status/{job_id}` every 5-10s (max 30 polls) |
| - If `status: "COMPLETED"`: Extract `output.images[]` |
| - If `status: "FAILED"`: Raise exception (stops entire generation) |
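|
| A sketch of the request-plus-polling loop, assuming the environment variable names from the Configuration section; deriving the `/status/{id}` URL from the `/runsync` URL is an assumption made here for illustration: |
|
| ```python |
| import os |
| import time |
| import requests |
|  |
| def run_handwriting_batch(texts: list[dict]) -> list[dict]: |
|     url = os.environ["HANDWRITING_SERVICE_URL"]           # .../runsync |
|     headers = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"} |
|     payload = {"input": {"texts": texts, "apply_blur": True}} |
|  |
|     # Dynamic timeout: 20s per text + 30s buffer |
|     resp = requests.post(url, json=payload, headers=headers, |
|                          timeout=20 * len(texts) + 30).json() |
|  |
|     # /runsync may still answer IN_QUEUE / IN_PROGRESS for long batches |
|     for _ in range(30): |
|         status = resp.get("status") |
|         if status == "COMPLETED": |
|             return resp["output"]["images"] |
|         if status == "FAILED": |
|             raise RuntimeError(f"Handwriting generation failed: {resp}") |
|         time.sleep(10) |
|         status_url = url.rsplit("/", 1)[0] + f"/status/{resp['id']}" |
|         resp = requests.get(status_url, headers=headers).json() |
|     raise TimeoutError("Handwriting batch did not finish within the polling budget") |
| ``` |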
|
|
| 6. **Response format**: |
| ```python |
| { |
| "status": "COMPLETED", |
| "output": { |
| "images": [ |
| { |
| "image_base64": "iVBORw0KGgoAAAANSU...", |
| "width": 200, |
| "height": 64, |
| "text": "Patient", |
| "author_id": 42, |
| "hw_id": "hw0_b0_l0_w0" |
| }, |
| ... |
| ], |
| "total_generated": 2 |
| } |
| } |
| ``` |
|
|
| 7. **Store generated images**: Map `hw_id → image_base64` for insertion |
|
|
| ##### **Error Handling** |
| - **Retry logic**: 3 attempts with exponential backoff (matching seed download) |
| - **Timeouts**: Dynamic based on batch size: `20s × num_texts + 30s buffer` |
| - **Failure behavior**: **RAISE EXCEPTION** (since session fix) |
| - ❌ OLD: Silent continue → documents generated without handwriting |
| - ✅ NEW: Raise an exception → generation fails when the user requested handwriting |
|
|
| ##### **Service Code Structure** |
| **`handwriting_service/handler.py`** (RunPod handler): |
| ```python |
| # Initialize model ONCE at module level (not per request) |
| generator = HandwritingGenerator( |
| model_dir="WordStylist", |
| checkpoint_path="WordStylist/models", |
| device="cuda" |
| ) |
| |
| def handler(job): |
| """RunPod entry point - supports both /run and /runsync""" |
| texts = job["input"]["texts"] # Batch input |
| results = generator.generate_batch( |
| texts=[t["text"] for t in texts], |
| author_ids=[t["author_id"] for t in texts], |
| num_inference_steps=50, |
| temperature=1.0, |
| apply_blur=True |
| ) |
| return {"images": results, "total_generated": len(results)} |
| ``` |
| |
| **`handwriting_service/inference.py`** (WordStylist wrapper): |
| ```python |
| class HandwritingGenerator: |
| def generate_batch(self, texts, author_ids, ...): |
| results = [] |
| for text, author_id in zip(texts, author_ids): |
| # Load model checkpoint |
| unet = Unet(...) |
| unet.load_state_dict(checkpoint) |
| |
| # Prepare style condition |
| style_id_tensor = torch.tensor([author_id]) |
| |
| # Diffusion reverse process (50 steps) |
| img = self.sample(unet, style_id_tensor, text_length=len(text)) |
| |
| # Post-process: crop, resize, apply blur |
| img_pil = postprocess_image(img) |
| if apply_blur: |
| img_pil = img_pil.filter(ImageFilter.GaussianBlur(1.2)) |
| |
| # Encode to base64 |
| img_base64 = encode_pil_to_base64(img_pil) |
| results.append({ |
| "image_base64": img_base64, |
| "width": img_pil.width, |
| "height": img_pil.height |
| }) |
| |
| return results |
| ``` |
|
|
| #### **Stage 10: Create Visual Element Images** |
| - **Original**: `pipeline_10_create_visual_elements.py` |
| - **API**: `generate_visual_element_images()` in `api/utils.py:925-1020` |
| - **Process**: |
| 1. Load prefab images from `data/visual_element_prefabs/{type}/`: |
| - `logo/`: Company logos (50+ SVGs) |
| - `photo/`: Stock photos (100+ JPGs) |
| - `figure/`: Charts, graphs (30+ PNGs) |
| - `barcode/`: Generated barcodes |
| - `qr_code/`, `stamp/`, `signature/`, `checkbox/`, etc. |
| 2. **Random selection** (seed-based if provided): |
| ```python |
| if seed is not None: |
| random.seed(seed) |
| prefab_path = random.choice(list(prefab_dir.glob("*"))) |
| ``` |
| 3. **Special handling**: |
| - **Barcode**: Generate on-the-fly using `python-barcode` library |
| ```python |
| # Generate random EAN-13 barcode (12 digits + checksum) |
| barcode_num = random.randint(100000000000, 999999999999) |
| barcode = EAN13(str(barcode_num), writer=ImageWriter()) |
| ``` |
| - **QR Code**: Generate using `qrcode` library |
| - **Checkbox**: Render checked/unchecked SVG |
| 4. Load and convert to base64: |
| ```python |
| with open(prefab_path, 'rb') as f: |
| img_bytes = f.read() |
| img_base64 = base64.b64encode(img_bytes).decode('utf-8') |
| ``` |
| 5. Return mapping: `ve_id → image_base64` |
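|
| A sketch of the seeded prefab selection and on-the-fly QR generation, assuming the directory layout described above; the `qrcode` usage is a minimal example and the helper names are illustrative: |
|
| ```python |
| import base64 |
| import io |
| import random |
| from pathlib import Path |
| from typing import Optional |
|  |
| import qrcode |
|  |
| def pick_prefab(prefab_root: Path, ve_type: str, seed: Optional[int] = None) -> str: |
|     """Pick one prefab image for the element type and return it base64-encoded.""" |
|     rng = random.Random(seed)  # seeded selection keeps results reproducible |
|     prefab_path = rng.choice(sorted((prefab_root / ve_type).glob("*"))) |
|     return base64.b64encode(prefab_path.read_bytes()).decode("utf-8") |
|  |
| def make_qr_code(data: str) -> str: |
|     """Render a QR code on the fly and return it as a base64-encoded PNG.""" |
|     buf = io.BytesIO() |
|     qrcode.make(data).save(buf)  # PilImage writes PNG by default |
|     return base64.b64encode(buf.getvalue()).decode("utf-8") |
| ``` |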
| |
| #### **Stage 11: Make Text Transparent (Implicit)** |
| - **Original**: `pipeline_11_make_text_transparent.py` |
| - **API**: Implemented as "whiteout" in `process_stage3_complete()` at `api/utils.py:1415-1427` |
| - **Process**: |
| ```python |
| # Draw white rectangles over original text to hide it |
| for hw_region in handwriting_regions: |
| for bbox_str in hw_region['bboxes']: |
| bbox = parse_bbox(bbox_str) |
| rect = fitz.Rect(bbox.x0, bbox.y0, bbox.x2, bbox.y2) |
| page.draw_rect(rect, color=(1,1,1), fill=(1,1,1)) # White fill |
| ``` |
| - **Why not transparent?**: PyMuPDF doesn't support making existing text transparent, so we use white rectangles instead (same visual result) |
|
|
| #### **Stage 12: Insert Handwriting Images** |
| - **Original**: `pipeline_12_insert_handwriting_images.py` |
| - **API**: `process_stage3_complete()` section in `api/utils.py:1429-1520` |
| - **Process**: |
| 1. **Position calculation**: |
| ```python |
| # Get word bbox from PDF extraction |
| bbox_w = bbox.x2 - bbox.x0 # Width in points |
| bbox_h = bbox.y2 - bbox.y0 # Height in points |
| |
| # Resize handwriting image with aspect ratio |
| scale = min(bbox_w / img_width, bbox_h / img_height) |
| new_w = int(img_width * scale * SCALE_UP_FACTOR) # 3x upscale |
| new_h = int(img_height * scale * SCALE_UP_FACTOR) |
| |
| # Add random offsets for natural variation |
| offset_x = random.randint(-MAX_OFFSET_LEFT, MAX_OFFSET_RIGHT) + FIXED_OFFSET |
| offset_y = random.randint(-MAX_OFFSET_UP, MAX_OFFSET_DOWN) |
| |
| # Position at bbox coordinates |
| x0 = bbox.x0 + offset_x |
| y0 = bbox.y0 + offset_y - y_padding |
| ``` |
| |
| 2. **Insert into PDF**: |
| ```python |
| img_resized = img.resize((new_w, new_h), Image.LANCZOS).convert("RGBA") |
| img_bytes = pil_to_bytes(img_resized) |
| rect = fitz.Rect(x0, y0, x0 + bbox_w, y0 + bbox_h) |
| page.insert_image(rect, stream=img_bytes) |
| ``` |
| |
| 3. Save intermediate PDF: `{doc_id}_with_handwriting.pdf` |
| |
| #### **Stage 13: Insert Visual Elements** |
| - **Original**: `pipeline_13_insert_visual_elements.py` |
| - **API**: `process_stage3_complete()` section in `api/utils.py:1523-1625` |
| - **Process**: |
| 1. Convert mm → points: `mm_to_pt = 72 / 25.4` |
| 2. Resize with aspect ratio preservation (same as handwriting) |
| 3. Center image on white background (maintains bbox size) |
| 4. Insert into PDF at geometry coordinates |
| 5. Save final PDF: `{doc_id}_final.pdf` (includes both handwriting + visual elements) |
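|
| A sketch of the unit conversion and placement for one element; aspect-ratio handling and the white background are omitted, and the names are illustrative: |
|
| ```python |
| import base64 |
| import fitz  # PyMuPDF |
|  |
| MM_TO_PT = 72 / 25.4 |
|  |
| def insert_visual_element(page: fitz.Page, ve: dict, image_b64: str) -> None: |
|     r = ve["rect"]  # millimetres, from Stage 08 |
|     rect = fitz.Rect(r["x"] * MM_TO_PT, |
|                      r["y"] * MM_TO_PT, |
|                      (r["x"] + r["width"]) * MM_TO_PT, |
|                      (r["y"] + r["height"]) * MM_TO_PT) |
|     page.insert_image(rect, stream=base64.b64decode(image_b64)) |
| ``` |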
|
|
| --- |
|
|
| ### **Phase 3: Image Finalization & OCR (Stages 14-15)** |
| Convert final PDF to high-resolution image and extract OCR data. |
|
|
| #### **Stage 14: Render Image** |
| - **Original**: `pipeline_14_render_image.py` |
| - **API**: `process_stage4_ocr()` in `api/utils.py:1899-1940` |
| - **Process**: |
| ```python |
| # Render PDF page to high-res PNG |
| page = fitz.open(pdf_path)[0] |
| pix = page.get_pixmap(matrix=fitz.Matrix(3, 3)) # 3x scale ≈ 216 DPI |
| img_bytes = pix.tobytes("png") |
| img_base64 = base64.b64encode(img_bytes).decode('utf-8') |
| ``` |
| - **Output**: Base64-encoded PNG at ~216 DPI (configurable via the scale factor) |
|
|
| #### **Stage 15: Perform OCR** |
| - **Original**: `pipeline_15_perform_ocr.py` |
| - **API**: `run_paddle_ocr()` in `api/utils.py:1950-2080` |
| - **OCR Engine**: PaddleOCR v4 (multilingual) |
| - Models: `PP-OCRv4` detection + recognition |
| - Languages: Supports 80+ languages |
| - Accuracy: State-of-the-art open-source OCR |
| - **Process**: |
| 1. Render PDF to image via `pdf2image` at specified DPI (default: 300) |
| 2. Initialize PaddleOCR with language parameter |
| 3. Run detection + recognition: |
| ```python |
| ocr = PaddleOCR(lang=language, use_gpu=True) |
| results = ocr.ocr(img_array, cls=True) |
| ``` |
| 4. Parse results into word-level bboxes: |
| ```python |
| { |
| "text": "word", |
| "bbox": { |
| "x0": float, |
| "y0": float, |
| "x1": float, # right |
| "y1": float # bottom |
| }, |
| "confidence": 0.95 |
| } |
| ``` |
| - **Output**: Dictionary with `words` list, image dimensions, OCR engine info |
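|
| A sketch of flattening PaddleOCR's quadrilateral results into the axis-aligned word dictionaries above; the result layout follows what `PaddleOCR.ocr` returns for a single page: |
|
| ```python |
| def parse_paddle_results(results) -> list[dict]: |
|     words = [] |
|     for quad, (text, confidence) in results[0]:  # first (and only) page |
|         xs = [p[0] for p in quad] |
|         ys = [p[1] for p in quad] |
|         words.append({ |
|             "text": text, |
|             "bbox": {"x0": min(xs), "y0": min(ys), "x1": max(xs), "y1": max(ys)}, |
|             "confidence": float(confidence), |
|         }) |
|     return words |
| ``` |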
| |
| --- |
|
|
| ### **Phase 4: Dataset Packaging (Stages 16-19)** |
| Normalize, verify, analyze, and export final dataset. |
|
|
| #### **Stage 16: Normalize Bboxes** |
| - **Original**: `pipeline_16_normalize_bboxes.py` |
| - **API**: `normalize_bboxes()` in `api/utils.py:2100-2180` |
| - **Process**: |
| 1. Convert absolute pixel coordinates → normalized [0, 1] range: |
| ```python |
| norm_bbox = [ |
| bbox['x0'] / img_width, |
| bbox['y0'] / img_height, |
| bbox['x1'] / img_width, |
| bbox['y1'] / img_height |
| ] |
| ``` |
| 2. Clip to [0, 1]: `[max(0, min(1, x)) for x in norm_bbox]` |
| 3. Create word-level and segment-level bboxes |
| - **Output**: List of `{text, bbox: [x0, y0, x1, y1]}` where bbox is normalized |
| |
| #### **Stage 17: Ground Truth Verification** |
| - **Original**: `pipeline_17_gt_preparation_verification.py` |
| - **API**: `verify_ground_truth()` in `api/utils.py:2185-2250` |
| - **Checks**: |
| - GT structure: Valid JSON, required fields |
| - Text matching: GT text exists in OCR output |
| - Bbox coverage: GT answers have corresponding bboxes |
| - **Output**: Verification report with pass/fail status |
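|
| A minimal sketch of the text-matching check; the real verifier also validates the GT schema and bbox coverage, and the function name is illustrative: |
|
| ```python |
| def verify_ground_truth(ground_truth: dict, ocr_words: list[dict]) -> dict: |
|     """Flag GT answer values that do not appear anywhere in the OCR text.""" |
|     ocr_text = " ".join(w["text"] for w in ocr_words).lower() |
|     missing = [value for value in map(str, ground_truth.values()) |
|                if value.lower() not in ocr_text] |
|     return {"passed": not missing, "missing_answers": missing} |
| ``` |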
|
|
| #### **Stage 18: Analyze** |
| - **Original**: `pipeline_18_analyze.py` |
| - **API**: `analyze_document()` in `api/utils.py:2255-2320` |
| - **Metrics**: |
| - Word count, character count |
| - Average word length |
| - Handwriting regions count, coverage % |
| - Visual elements count by type |
| - OCR confidence statistics (mean, min, max) |
| - **Output**: Analysis dictionary with computed metrics |
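|
| A sketch of the metric computation over the OCR words, covering only a subset of the metrics listed above: |
|
| ```python |
| from statistics import mean |
|  |
| def analyze_document(ocr_words: list[dict], handwriting_regions: list[dict]) -> dict: |
|     texts = [w["text"] for w in ocr_words] |
|     confidences = [w["confidence"] for w in ocr_words] |
|     return { |
|         "word_count": len(texts), |
|         "char_count": sum(len(t) for t in texts), |
|         "avg_word_length": mean(len(t) for t in texts) if texts else 0.0, |
|         "num_handwriting_regions": len(handwriting_regions), |
|         "ocr_confidence": {"mean": mean(confidences) if confidences else 0.0, |
|                            "min": min(confidences, default=0.0), |
|                            "max": max(confidences, default=0.0)}, |
|     } |
| ``` |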
|
|
| #### **Stage 19: Create Debug Data & Export** |
| - **Original**: `pipeline_19_create_debug_data.py` |
| - **API**: `export_to_msgpack()` in `api/utils.py:2350-2520` |
| - **Debug Visualization**: |
| - Draw bboxes on image with different colors: |
| - Green: Word bboxes |
| - Red: Handwriting regions |
| - Blue: Visual elements |
| - Yellow: Ground truth target regions |
| - Save annotated image |
| - **Dataset Export (msgpack)**: |
| ```python |
| dataset_entry = { |
| "image": img_bytes, # PNG bytes |
| "words": ["hello", "world"], |
| "word_bboxes": [[0.1, 0.2, 0.15, 0.25], ...], # Normalized |
| "segment_bboxes": [...], |
| "ground_truth": {"question": "answer"}, |
| "metadata": { |
| "document_id": "...", |
| "has_handwriting": True, |
| "num_visual_elements": 3 |
| } |
| } |
| msgpack.dump(dataset_entry, f) |
| ``` |
| - **Output**: `.msgpack` file compatible with PyTorch DataLoader |
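|
| For consumers of the exported file, a sketch of reading entries back into a PyTorch-style dataset, assuming one msgpack object per file as written above: |
|
| ```python |
| import io |
|  |
| import msgpack |
| from PIL import Image |
| from torch.utils.data import Dataset |
|  |
| class DocGenieDataset(Dataset): |
|     def __init__(self, msgpack_paths: list[str]): |
|         self.entries = [] |
|         for path in msgpack_paths: |
|             with open(path, "rb") as f: |
|                 self.entries.append(msgpack.unpack(f, raw=False)) |
|  |
|     def __len__(self): |
|         return len(self.entries) |
|  |
|     def __getitem__(self, idx): |
|         entry = self.entries[idx] |
|         image = Image.open(io.BytesIO(entry["image"])).convert("RGB") |
|         return image, entry["words"], entry["word_bboxes"], entry["ground_truth"] |
| ``` |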
|
|
| --- |
|
|
| ## Pipeline Verification: API vs Original Implementation |
|
|
| ### ✅ **Stage-by-Stage Mapping** |
|
|
| Stage | Original File | API Function | Status |
|-------|--------------|--------------|--------|
| 01 | `pipeline_01_select_seeds.py` | `download_seed_images()` | ✅ Mapped (with retry logic) |
| 02 | `pipeline_02_prompt_llm.py` | `call_claude_api_direct()` | ✅ Mapped (uses Messages API) |
| 03 | `pipeline_03_process_response.py` | `extract_html_documents_from_response()` | ✅ Mapped |
| 04 | `pipeline_04_render_pdf_and_extract_geos.py` | `render_html_to_pdf()` | ✅ Mapped (Playwright) |
| 05 | `pipeline_05_extract_bboxes_from_pdf.py` | `extract_bboxes_from_rendered_pdf()` | ✅ Mapped |
| 06 | `pipeline_06_validation.py` | `validate_html_structure()`, `validate_pdf()` | ✅ Mapped |
| 07 | `pipeline_07_extract_handwriting.py` | `process_stage3_complete()` section | ✅ Mapped (with ratio filter) |
| 08 | `pipeline_08_extract_visual_element_definitions.py` | `process_stage3_complete()` section | ✅ Mapped |
| 09 | `pipeline_09_create_handwriting_images.py` | `call_handwriting_service_batch()` | ✅ Mapped (RunPod integration) |
| 10 | `pipeline_10_create_visual_elements.py` | `generate_visual_element_images()` | ✅ Mapped |
| 11 | `pipeline_11_make_text_transparent.py` | `process_stage3_complete()` (whiteout) | ✅ Mapped (white rectangles) |
| 12 | `pipeline_12_insert_handwriting_images.py` | `process_stage3_complete()` section | ✅ Mapped |
| 13 | `pipeline_13_insert_visual_elements.py` | `process_stage3_complete()` section | ✅ Mapped |
| 14 | `pipeline_14_render_image.py` | `process_stage4_ocr()` | ✅ Mapped |
| 15 | `pipeline_15_perform_ocr.py` | `run_paddle_ocr()` | ✅ Mapped |
| 16 | `pipeline_16_normalize_bboxes.py` | `normalize_bboxes()` | ✅ Mapped |
| 17 | `pipeline_17_gt_preparation_verification.py` | `verify_ground_truth()` | ✅ Mapped |
| 18 | `pipeline_18_analyze.py` | `analyze_document()` | ✅ Mapped |
| 19 | `pipeline_19_create_debug_data.py` | `export_to_msgpack()` | ✅ Mapped |
|
|
| ### **Key Differences: API vs Batch Pipeline** |
|
|
| #### **Processing Model** |
| - **Original**: Batch processing with file-based state management |
| - Input: CSV of seed selections, prompt parameters in JSON |
| - Output: Folder structure with intermediate files |
| - State: JSON logs per document + message |
| - Resumability: Can restart from any stage |
|
|
| - **API**: Request/response with in-memory processing |
| - Input: JSON request with seed URLs |
| - Output: JSON response or ZIP file |
| - State: Ephemeral (temporary directories) |
| - Resumability: None (single-shot generation) |
|
|
| #### **Handwriting Generation** |
| - **Original**: Local GPU with WordStylist model loaded in-process |
| - Location: `docgenie/generation/handwriting_diffusion/` |
| - Execution: `generate_handwriting_diffusion_raw.py` |
| - Cost: Free (local GPU) |
|
|
| - **API**: Remote RunPod serverless endpoint |
| - Location: `handwriting_service/` (deployed separately) |
| - Execution: HTTP POST to RunPod API |
| - Cost: ~$0.00025/s GPU time (pay-per-use) |
| - Benefit: No local GPU required, scales automatically |
|
|
| #### **Seed Selection** |
| - **Original**: Pre-crawled dataset with systematic selection |
| - Seeds stored in: `data/datasets/base_v2/` |
| - Selection: Clustering algorithm → balanced subset |
| - Tracking: CSV manifest with seed IDs |
|
|
| - **API**: User-provided URLs |
| - Seeds: Any publicly accessible image URL |
| - Selection: User chooses 1-8 images per request |
| - Tracking: URLs stored in request metadata |
|
|
| #### **Prompt Templates** |
| - **Original**: Multiple template versions in folders |
| - Path: `data/prompt_templates/{version}/seed-based-json.txt` |
| - Versioning: ClaudeRefined1 → ClaudeRefined12 |
| - Selection: Configurable per dataset |
|
|
| - **API**: Fixed template (latest version) |
| - Path: `data/prompt_templates/ClaudeRefined12/seed-based-json.txt` |
| - Hardcoded in: `api/main.py:171` |
| - **Future improvement**: Make template selectable via API parameter |
|
|
| --- |
|
|
| ## Complete Request Flow Example |
|
|
| ### Example Request (Sync Endpoint) |
| ```bash |
| POST /generate/pdf HTTP/1.1 |
| Content-Type: application/json |
| |
| { |
| "seed_images": [ |
| "https://example.com/seed1.jpg", |
| "https://example.com/seed2.jpg" |
| ], |
| "prompt_params": { |
| "language": "english", |
| "doc_type": "medical_form", |
| "gt_type": "kie", |
| "gt_format": "json", |
| "num_solutions": 2, |
| "enable_handwriting": true, |
| "handwriting_ratio": 0.3, |
| "enable_visual_elements": true, |
| "visual_element_types": ["logo", "signature"], |
| "enable_ocr": true, |
| "enable_dataset_export": true, |
| "seed": 42 |
| } |
| } |
| ``` |
|
|
| ### Processing Flow (Stages Executed) |
|
|
| **Phase 1: Core Document Generation (30-60s)** |
| 1. ✅ Download 2 seed images with retry → `[img1_b64, img2_b64]` |
| 2. ✅ Load prompt template → Build prompt for medical_form + KIE |
| 3. ✅ Call Claude API → LLM generates 2 HTML documents (~25s) |
| 4. ✅ Extract HTML + ground truth → 2 clean HTML files with GT JSON |
| 5. ✅ Render each HTML to PDF via Playwright → 2 PDFs + geometries |
| 6. ✅ Extract word bboxes from PDFs → ~200-500 words per document |
| |
| **Phase 2: Feature Synthesis (dominated by handwriting generation; ~270s for the 15 handwritten words in this example)** |
| 7. ✅ Parse geometries for handwriting markers |
| - Found: 12 elements with `class="handwritten"` |
| - Filtered by ratio: 12 × 0.3 ≈ 4 elements selected (probabilistic) |
| - Matched to word bboxes: 4 regions with 15 total words |
| 8. ✅ Parse geometries for visual elements |
| - Found: 3 elements (`data-placeholder="logo"`, `"signature"`, `"logo"`) |
| - Filtered by types: Keep logo + signature, remove others |
| - Result: 2 visual element definitions |
| 9. ✅ Generate handwriting images via RunPod |
| - **Batch request**: 15 words in ONE API call |
| - Map author IDs: `author1 → style 42`, `author2 → style 137` |
| - RunPod processing: 1 worker × (15 × 18s) ≈ 270s |
| - Result: 15 PNG images (base64-encoded) |
| 10. ✅ Generate visual element images |
| - Logo: Random selection from `data/visual_element_prefabs/logo/` (seed=42) |
| - Signature: Generate on-the-fly using signature prefab |
| - Result: 2 PNG images |
| 11. ✅ Whiteout original text: Draw white rectangles over 15 word positions |
| 12. ✅ Insert handwriting: Place 15 generated images at word bboxes with offsets |
| - Save: `doc1_with_handwriting.pdf`, `doc2_with_handwriting.pdf` |
| 13. ✅ Insert visual elements: Place logo + signature at geometry coords |
| - Save: `doc1_final.pdf`, `doc2_final.pdf` |
| |
| **Phase 3: Image + OCR (5-10s)** |
| 14. ✅ Render each final PDF to a ~216 DPI image → 2 PNG files (base64) |
| 15. ✅ Run PaddleOCR on each image |
| - Doc1: Detected 187 words, avg confidence 0.91 |
| - Doc2: Detected 203 words, avg confidence 0.94 |
| |
| **Phase 4: Dataset Packaging (2-5s)** |
| 16. ✅ Normalize OCR bboxes: Convert pixels → [0,1] range |
| 17. ✅ Verify ground truth: Check GT fields match OCR output (enabled=false, skipped) |
| 18. ✅ Analyze documents: Compute metrics (enabled=false, skipped) |
| 19. ✅ Export to msgpack: |
| - Doc1: Pack image + words + normalized bboxes + GT → `doc1.msgpack` |
| - Doc2: Pack image + words + normalized bboxes + GT → `doc2.msgpack` |
| |
| **Final Output: ZIP File Contents** |
| ``` |
| dataset.zip |
| ├── doc1_uuid_0.pdf              # Original rendered PDF |
| ├── doc1_uuid_0_final.pdf        # PDF with handwriting + visual elements |
| ├── doc1_uuid_0.msgpack          # Dataset format |
| ├── doc2_uuid_1.pdf |
| ├── doc2_uuid_1_final.pdf |
| ├── doc2_uuid_1.msgpack |
| ├── metadata.json                # Complete generation metadata |
| └── handwriting/ |
|     ├── hw0_b0_l0_w0.png         # Individual handwriting images |
|     ├── hw0_b0_l0_w1.png |
|     └── ... (13 more) |
| ``` |
| |
| ### Response (JSON Metadata) |
| ```json |
| { |
| "task_id": "uuid-here", |
| "status": "completed", |
| "num_documents": 2, |
| "processing_time_seconds": 305.7, |
| "stages_completed": [ |
| "seed_download", "llm_prompt", "html_extraction", |
| "pdf_render", "bbox_extraction", "handwriting_extraction", |
| "visual_element_extraction", "handwriting_generation", |
| "visual_element_generation", "handwriting_insertion", |
| "visual_element_insertion", "image_render", "ocr", |
| "bbox_normalization", "dataset_export" |
| ], |
| "documents": [ |
| { |
| "document_id": "doc1_uuid_0", |
| "ground_truth": {"patient_name": "John Doe", "date": "2024-01-15"}, |
| "num_words": 187, |
| "num_handwriting_regions": 2, |
| "num_visual_elements": 2, |
| "ocr_confidence_avg": 0.91 |
| }, |
| { |
| "document_id": "doc2_uuid_1", |
| "ground_truth": {"patient_name": "Jane Smith", "date": "2024-01-16"}, |
| "num_words": 203, |
| "num_handwriting_regions": 2, |
| "num_visual_elements": 2, |
| "ocr_confidence_avg": 0.94 |
| } |
| ], |
| "download_url": "/download/dataset_uuid.zip" |
| } |
| ``` |
|
|
| --- |
|
|
| ## Configuration & Environment |
|
|
| ### Required Environment Variables |
| ```bash |
| # LLM API |
| ANTHROPIC_API_KEY=sk-ant-... # Claude API key |
| CLAUDE_MODEL=claude-3-5-sonnet-20241022 # Default model |
| |
| # Handwriting Service (RunPod) |
| HANDWRITING_SERVICE_ENABLED=true |
| HANDWRITING_SERVICE_URL=https://api.runpod.ai/v2/{endpoint_id}/runsync |
| RUNPOD_API_KEY=... # RunPod API key |
| HANDWRITING_APPLY_BLUR=true # Gaussian blur for realism |
| HANDWRITING_SERVICE_MAX_RETRIES=3 |
| HANDWRITING_SERVICE_TIMEOUT=600 # 10 minutes for large batches |
| |
| # OCR Configuration |
| OCR_DPI=300 # Image resolution for OCR |
| OCR_LANGUAGE=en # PaddleOCR language code |
| |
| # File Paths |
| PROMPT_TEMPLATES_DIR=/path/to/data/prompt_templates |
| VISUAL_ELEMENT_PREFABS_DIR=/path/to/data/visual_element_prefabs |
| ``` |
|
|
| ### Docker Deployment (Railway) |
| ```dockerfile |
| # Dockerfile (api service) |
| FROM python:3.11-slim |
| # chromium + chromium-driver: Playwright dependencies; libgl1 + libglib2.0-0: PaddleOCR dependencies |
| RUN apt-get update && apt-get install -y \ |
| chromium chromium-driver \ |
| libgl1 libglib2.0-0 \ |
| && rm -rf /var/lib/apt/lists/* |
| |
| COPY api/ /app/api |
| COPY docgenie/ /app/docgenie |
| COPY data/ /app/data |
| WORKDIR /app/api |
| RUN pip install -r requirements.txt |
| CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] |
| ``` |
|
|
| **Handwriting service**: See `handwriting_service/Dockerfile` (deployed separately to RunPod) |
|
|
| --- |
|
|
| ## Performance & Costs |
|
|
| ### Timing Breakdown (Single Document) |
| | Stage | Time | Notes | |
| |-------|------|-------| |
| | Seed download | 0.5-2s | Depends on image size + network | |
| | LLM prompt | 20-40s | Claude API latency | |
| | PDF render | 1-3s | Playwright initialization | |
| Handwriting (10 words) | 180s | RunPod: 1 worker × (10 × 18s) |
| | Visual elements | 0.5-1s | Local file selection | |
| | OCR | 3-5s | PaddleOCR inference | |
| | Dataset export | 0.5-1s | msgpack serialization | |
| | **TOTAL (no handwriting)** | **25-50s** | |
| | **TOTAL (with handwriting)** | **200-230s** | Batched | |
|
|
| ### Cost Breakdown (Per Document) |
| | Component | Cost | Notes | |
| |-----------|------|-------| |
| Claude API | $0.015-0.03 | ~5K input tokens; output varies (16K max_tokens cap) |
| RunPod GPU (10 words) | $0.045 | 180s × $0.00025/s |
| | Storage | Negligible | Temporary files deleted | |
| | **TOTAL (no handwriting)** | **$0.015-0.03** | |
| | **TOTAL (with handwriting)** | **$0.06-0.08** | |
|
|
| **Optimization**: Batch multiple documents in ONE RunPod call to share worker activation overhead. |
|
|
| --- |
|
|
| ## Error Handling & Reliability |
|
|
| ### Retry Mechanisms |
| 1. **Seed image download**: 3 attempts, exponential backoff (2s, 4s, 8s) |
| 2. **Handwriting service**: 3 attempts, status polling up to 30 times |
| 3. **LLM API**: Built-in Anthropic SDK retries (rate limits, 529 errors) |
|
|
| ### Failure Modes |
| | Error Type | Behavior | User Impact | |
| |------------|----------|-------------| |
| | Seed download failure | Raise HTTP 400 | Request rejected immediately | |
| | LLM API error | Raise HTTP 500 | No charge, can retry | |
| | Handwriting service failure | **Raise exception** (NEW) | Generation fails, prevents invalid outputs | |
| | OCR failure | Log warning, continue | Document generated without OCR data | |
| | PDF render failure | Raise HTTP 500 | Request fails, no partial results | |
|
|
| ### Session Fixes Applied |
| - ✅ **Handwriting service failure now raises an exception** (previously silent) |
| - ✅ **Seed parameter defaults to null** (previously 0) |
| - ✅ **Seed image download retry logic** (handles 503 timeout errors) |
| - ✅ **API docs show correct examples** (seed: null, not 0) |
|
|
| --- |
|
|
| ## Future Enhancements |
|
|
| ### Short-term |
| 1. **Configurable prompt templates** via API parameter |
| 2. **Async endpoint progress tracking** (websocket or polling) |
| 3. **Batch ZIP download** with multiple documents in one archive |
| 4. **Cost estimation** before generation (preview mode) |
|
|
| ### Long-term |
| 1. **Custom visual element upload** (user-provided logos, signatures) |
| 2. **Multi-page document support** (currently single-page only) |
| 3. **Additional export formats** (COCO, YOLO, HuggingFace Datasets) |
| 4. **Fine-tuning handwriting styles** (train on user's handwriting samples) |
| 5. **LLM caching** (reduce cost for similar prompts) |
|
|
| --- |
|
|
| ## Troubleshooting |
|
|
| ### Common Issues |
|
|
| **Q: "Handwriting service not called, but enable_handwriting=true"** |
| - Check: LLM output contains `class="handwritten"` in HTML |
| - Check: `handwriting_ratio` > 0 (default 0.2) |
| - Check: `HANDWRITING_SERVICE_ENABLED=true` in environment |
| - Debug: Look for the "DEBUG - Handwriting Service Check" message in the logs |
| |
| **Q: "RunPod job stuck IN_PROGRESS"** |
| - Cause: Large batch timing out |
| - Solution: Increase `HANDWRITING_SERVICE_TIMEOUT` (default 600s) |
| - Or: Reduce batch size by lowering `handwriting_ratio` |
|
|
| **Q: "503 first byte timeout" on seed download** |
| - Cause: CDN/storage provider temporary unavailability |
| - Solution: Retry logic automatically handles this (3 attempts) |
| - If persists: Use different image hosting (imgur, cloudinary) |
|
|
| **Q: "Seed parameter still shows 0 in API docs"** |
| - Fixed: Added `examples=[None, 42]` to Field definition |
| - Clear browser cache if seeing old docs |
|
|
| --- |
|
|
| ## Testing |
|
|
| ### Unit Tests |
| ```bash |
| # Test individual stages |
| pytest api/tests/test_utils.py::test_download_seed_images |
| pytest api/tests/test_utils.py::test_handwriting_service_batch |
| ``` |
|
|
| ### Integration Tests |
| ```bash |
| # Test sync endpoint (included in repo) |
| python api/test_sync_pdf_api.py |
| |
| # Test async endpoint |
| python api/test_async_api.py |
| ``` |
|
|
| ### Manual Testing via Docs UI |
| 1. Navigate to `http://localhost:8000/docs` |
| 2. Expand `/generate/pdf` endpoint |
| 3. Click "Try it out" |
| 4. Paste example request JSON |
| 5. Click "Execute" |
| 6. Download resulting ZIP file |
|
|
| ### Example Test Request (Minimal) |
| ```json |
| { |
| "seed_images": [ |
| "https://i.imgur.com/example.jpg" |
| ], |
| "prompt_params": { |
| "language": "english", |
| "doc_type": "invoice", |
| "num_solutions": 1, |
| "enable_handwriting": false, |
| "enable_visual_elements": false, |
| "enable_ocr": true, |
| "enable_dataset_export": true |
| } |
| } |
| ``` |
|
|
| --- |
|
|
| ## Conclusion |
|
|
| The DocGenie API successfully implements all 19 stages of the original batch pipeline in a request/response model suitable for real-time generation. Key architectural differences: |
|
|
| 1. **Handwriting generation**: Offloaded to RunPod serverless (cost-efficient batching) |
| 2. **Seed selection**: User-provided URLs instead of pre-crawled dataset |
| 3. **State management**: Ephemeral in-memory processing vs file-based |
| 4. **Scalability**: Horizontal scaling via FastAPI workers + async processing |
|
|
| The API maintains feature parity with the batch pipeline while providing a simpler interface for integration with external systems (web apps, mobile apps, data pipelines). |
|
|
| **Total Processing Time**: 25-50s (no handwriting) or 200-230s (with handwriting) |
| **Cost Per Document**: $0.015-0.08 depending on features |
| **Output Formats**: PDF, PNG, msgpack, ZIP archive |
|
|
| For questions or issues, see `api/README.md` or `TESTING.md`. |
|
|