File size: 97,784 Bytes
dc4e6da | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 | # DocGenie Generation Pipeline & API Documentation
**Version:** 1.0
**Last Updated:** February 7, 2026
**Purpose:** Comprehensive reference for the DocGenie synthetic document generation system
---
## Table of Contents
1. [Overview](#overview)
2. [Pipeline Architecture](#pipeline-architecture)
3. [Pipeline Stages (01-19)](#pipeline-stages-01-19)
4. [API Implementation](#api-implementation)
5. [Core Models & Utilities](#core-models--utilities)
6. [Configuration & Constants](#configuration--constants)
7. [Usage Examples](#usage-examples)
8. [Error Handling & Debugging](#error-handling--debugging)
---
## Overview
DocGenie is a sophisticated 19-stage pipeline for generating synthetic document datasets with ground truth annotations. It supports multiple document understanding tasks:
- **Document Question Answering (QA)**
- **Key Information Extraction (KIE)**
- **Document Layout Analysis (DLA)**
- **Document Classification (CLS)**
### Key Features
- **LLM-Powered Generation**: Uses Claude/Gemini/Open-source models to generate diverse document content
- **Realistic Handwriting**: Diffusion model-based handwriting synthesis with author-specific styles
- **Visual Element Integration**: Stamps, logos, barcodes, charts, and photos
- **Multi-Task Support**: Task-specific ground truth formatting and validation
- **Quality Assurance**: Comprehensive validation, OCR verification, and error tracking
- **Modular Design**: Each pipeline stage is independently executable with clear inputs/outputs
### Technology Stack
- **LLM APIs**: Claude (Anthropic), Gemini, DeepSeek, Qwen
- **PDF Rendering**: Playwright (Chromium), PyMuPDF
- **OCR**: Microsoft Azure OCR
- **Handwriting**: Custom diffusion model
- **Image Processing**: PIL, OpenCV
- **API Framework**: FastAPI
- **Data Processing**: Pandas, NumPy
---
## Pipeline Architecture
### High-Level Flow
```
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DOCGENIE GENERATION PIPELINE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββ
β PHASE 1: SELECTION β
ββββββββββββββββββββββββ
β
[01] Select Seeds βββββββββββββΊ seeds.csv, clusters.csv
(Cluster-based diverse seed selection)
ββββββββββββββββββββββββ
β PHASE 2: LLM GEN β
ββββββββββββββββββββββββ
β
[02] Prompt LLM βββββββββββββββΊ batch_results/ (JSON)
β (Claude API batched calls)
β
[03] Process Response βββββββββΊ raw_html/, raw_annotations/
(Extract HTML & GT from responses)
ββββββββββββββββββββββββ
β PHASE 3: RENDERING β
ββββββββββββββββββββββββ
β
[04] Render PDF Initial βββββββΊ pdf_initial/, geometries/
β (HTMLβPDF with geometry extraction)
β
[05] Extract BBoxes βββββββββββΊ pdf_word_bboxes/, pdf_char_bboxes/
β (PyMuPDF text extraction)
β
[06] Extract Layout βββββββββββΊ layout_element_definitions/
(DLA/KIE-specific annotations)
ββββββββββββββββββββββββ
β PHASE 4: EXTRACTION β
ββββββββββββββββββββββββ
β
[07] Extract Handwriting ββββββΊ handwriting_definitions/
β (Identify handwriting regions)
β
[08] Extract Visual Elements ββΊ visual_element_definitions/
(Stamp/logo/barcode placeholders)
ββββββββββββββββββββββββ
β PHASE 5: GENERATION β
ββββββββββββββββββββββββ
β
[09] Create Handwriting βββββββΊ handwriting_images/
β (Diffusion model generation)
β
[10] Create Visual Elements βββΊ visual_element_images/
(Generate/select stamps, logos, etc.)
ββββββββββββββββββββββββ
β PHASE 6: COMPOSITION β
ββββββββββββββββββββββββ
β
[11] Render PDF (2nd Pass) ββββΊ pdf_without_handwriting_placeholder/
β (Remove handwriting placeholders)
β
[12] Insert Handwriting βββββββΊ pdf_with_handwriting/
β (Overlay handwriting images)
β
[13] Insert Visual Elements βββΊ pdf_final/
β (Overlay stamps, logos, etc.)
β
[14] Render Image βββββββββββββΊ images/
(PDFβPNG conversion)
ββββββββββββββββββββββββ
β PHASE 7: FINALIZATIONβ
ββββββββββββββββββββββββ
β
[15] Perform OCR ββββββββββββββΊ final_word_bboxes/, final_segment_bboxes/
β (Microsoft OCR)
β
[16] Normalize BBoxes βββββββββΊ normalized_word_bboxes/, normalized_segment_bboxes/
(Pixelβ[0,1] coordinates)
ββββββββββββββββββββββββ
β PHASE 8: VALIDATION β
ββββββββββββββββββββββββ
β
[17] GT Preparation βββββββββββΊ verified_gt/
β (Fuzzy matching, BIO tagging)
β
[18] Analyze ββββββββββββββββββΊ dataset_log.json
β (Statistics, cost analysis)
β
[19] Create Debug Data ββββββββΊ debug/ subdirectories
(Visualizations for inspection)
```
### Data Flow Between Stages
```
Seed Images βββ
ββββΊ [02] βββΊ HTML + GT βββΊ [04] βββΊ PDF + Geometries
Prompt Params β β
ββββΊ [05] βββΊ BBoxes
β β
β ββββΊ [07] βββΊ HW Defs βββΊ [09] βββΊ HW Images βββ
β β β
β ββββΊ [08] βββΊ VE Defs βββΊ [10] βββΊ VE Images βββ€
β β
ββββΊ [11] βββΊ PDF (no HW) βββ¬βββΊ [12] βββββββββββββββββββββββββββ€
β (Insert HW) β
ββββΊ [13] ββββββββββββββββββββββββββ
(Insert VE)
β
[14] βββΊ Image
β
[15] βββΊ OCR BBoxes
β
[16] βββΊ Normalized
β
[17] βββΊ Verified GT
```
---
## Pipeline Stages (01-19)
### Stage 01: Select Seeds
**File:** `pipeline_01_select_seeds.py`
**Purpose:** Select diverse seed images from base dataset using clustering algorithms to ensure variety in the generated documents.
**Key Functions:**
- `main()`: Orchestrates seed selection process
- `downscale_and_compress_seeds()`: Prepares seed images for efficient API transmission
- `plot_class_distribution()`: Visualizes class balance in selected seeds
- `visualize_cluster_histogram()`: Shows distribution across clusters
**Process:**
1. Load embeddings from base dataset
2. Perform clustering (KMeans or other algorithms)
3. Sample N seeds per cluster
4. Downscale and compress images (JPEG, max dimension)
5. Save seed manifest and cluster assignments
**Inputs:**
- `SynDatasetDefinition` configuration
- Base dataset name (e.g., `docvqa`, `cord`, `publaynet`)
- Clustering parameters from constants
**Outputs:**
```
seeds.csv # Selected seed document IDs per prompt call
clusters.csv # Cluster assignments for all documents
seeds/ (directory) # Preprocessed seed images (JPEG, compressed)
```
**Configuration Parameters:**
- `EMBEDDING_MODEL`: Specifies which embedding model was used
- `IMAGE_MAX_DIMENSION`: Max width/height for compression
- `JPEG_QUALITY`: Compression quality (0-100)
**Example Usage:**
```python
from docgenie.generation import pipeline_01_select_seeds
pipeline_01_select_seeds.main(
syndatadef_path="data/syn_dataset_definitions/docvqa_alpha=1.0.yaml",
base_dataset="docvqa"
)
```
---
### Stage 02: Prompt LLM
**File:** `pipeline_02_prompt_llm.py`
**Purpose:** Send batched prompts to LLM APIs (Claude, Gemini, DeepSeek, Qwen) with seed images to generate document HTML and ground truth.
**Key Functions:**
- `main()`: Main orchestrator for LLM prompting
- `create_batched_messages()`: Constructs API-compatible message batches
- `track_batch_completion()`: Polls API for batch status
- Cost calculation utilities in `pipeline_01/cost.py`
**Process:**
1. Load seed images and encode as base64
2. Build prompts from template with parameter injection
3. Create batched API requests (Claude Batch API for cost efficiency)
4. Submit batches and track completion
5. Save results for processing in stage 03
**Inputs:**
- Prompt template from `data/prompt_templates/<template_name>/`
- Seed images from stage 01
- API credentials from environment variables
- `SynDatasetDefinition` parameters
**Outputs:**
```
prompt_batches/ # Batch metadata (batch IDs, status)
message_results/ # JSON response files per batch
logs/ # Prompting logs and progress
```
**API Configuration:**
- **Claude:** Uses Batch API with prompt caching for cost efficiency
- **Batch Size:** Configurable via `BATCH_SIZE` constant
- **Polling Interval:** Configurable wait time between status checks
- **Model Selection:** Specified in `SynDatasetDefinition.llm_model`
**Cost Tracking:**
- Input/output token counts per request
- Cached token usage (for Claude)
- Total cost estimation per batch
**Example Configuration:**
```yaml
# In syn_dataset_definition YAML
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1 # Documents per prompt
language: "English"
doc_type: "business and administrative"
```
---
### Stage 03: Process Response
**File:** `pipeline_03_process_response.py`
**Purpose:** Extract and validate HTML documents and ground truth annotations from LLM responses.
**Key Functions:**
- `main()`: Main processor
- `extract_html_from_message()`: Regex-based HTML extraction from markdown code blocks
- `extract_gt_from_html()`: Parse JSON ground truth from `<script id="GT">` tags
- `validate_and_save_gt()`: Task-specific validation and formatting
**Process:**
1. Load message results from stage 02
2. Extract HTML content using regex patterns
3. Parse and validate ground truth JSON
4. Apply task-specific formatting (QA, KIE, DLA, CLS)
5. Save raw HTML and annotations separately
**Inputs:**
- Message results from `message_results/` (stage 02)
- `SynDatasetDefinition` task type and prompt format
**Outputs:**
```
raw_html/ # HTML files (one per document)
raw_annotations/ # Ground truth JSON files
qa/ # QA format: {"question": "answer", ...}
kie/ # KIE format: {"entity": "value", ...}
dla/ # DLA format: {"element_id": "label", ...}
cls/ # CLS format: {"class": "category"}
logs/message_processing/ # Processing logs per message
```
**Ground Truth Formats:**
**QA Format:**
```json
{
"What is the invoice number?": "INV-12345",
"What is the total amount?": "$1,234.56"
}
```
**KIE Format:**
```json
{
"company_name": "Acme Corp",
"invoice_date": "2024-01-15",
"total_amount": "1234.56"
}
```
**DLA Format:**
```json
{
"layout_0": "title",
"layout_1": "text",
"layout_2": "table"
}
```
**Validation Checks:**
- Expected document count matches actual
- Valid JSON structure in GT
- Task-specific field presence
- HTML completeness (valid tags)
**Error Handling:**
- Malformed HTML β Logged, skipped
- Missing GT β Document flagged in logs
- Invalid JSON β Parsing error logged
---
### Stage 04: Render PDF and Extract Geometries
**File:** `pipeline_04_render_pdf_and_extract_geos.py`
**Purpose:** Convert HTML to PDF with automatic content-based page sizing and extract element geometries (positions, dimensions) for downstream processing.
**Key Functions:**
- `main()`: Async batch rendering orchestrator
- `render_pdf_with_playwright()`: Single PDF rendering using Playwright/Chromium
- `preprocess_css()`: Remove conflicting CSS `@page` rules
- `validate_pdf()`: Check page count and file integrity
- `extract_geometries()`: Parse element positions from injected JavaScript
**Process:**
1. Load raw HTML from stage 03
2. Inject geometry extraction JavaScript
3. Preprocess CSS (remove fixed page sizes)
4. Render with Playwright in headless Chromium
5. Measure content dimensions dynamically
6. Export PDF with calculated page size
7. Extract and save element geometries
**Inputs:**
- Raw HTML files from `raw_html/` (stage 03)
**Outputs:**
```
pdf_initial/ # Initial PDFs
geometries/ # Element geometry JSON files
{doc_id}.json # Contains positions, dimensions, CSS classes
render_html/ # HTML with geometry extraction scripts
debug/pdf_with_geos/ # Debug PDFs with geometry overlays (optional)
logs/rendering/ # Rendering logs and errors
```
**Geometry JSON Structure:**
```json
{
"page_width_mm": 210.0,
"page_height_mm": 297.0,
"elements": [
{
"id": "layout_0",
"type": "div",
"class": "title",
"rect": {
"x": 20.0,
"y": 30.0,
"width": 170.0,
"height": 15.0
},
"text": "Invoice",
"attributes": {
"data-label": "title",
"data-handwriting": null,
"data-visual-element": null
}
}
]
}
```
**Key Features:**
- **Automatic Page Sizing:** No fixed dimensions; page size adapts to content
- **Concurrent Rendering:** Semaphore-controlled parallel processing
- **Geometry Extraction:** JavaScript injected to capture element positions
- **CSS Coordinate Conversion:** 96 DPI (CSS) β 72 DPI (PDF)
- **Retry Logic:** Up to 3 attempts with configurable timeout
- **Debug Visualizations:** Optional overlay of geometries on PDFs
**Configuration Constants:**
- `CHROMIUM_CONCURRENCY`: 10 (parallel render limit)
- `PER_PDF_RENDER_TIMEOUT`: 60 seconds
- `PER_PDF_RENDER_MAX_RETRIES`: 3 attempts
- `PDF_POINT_SCALING`: 72/96 for DPI conversion
**Error Handling:**
- Timeout β Retry with increased timeout
- Multi-page PDFs β Flagged and skipped
- Missing geometries β Error logged, document marked invalid
---
### Stage 05: Extract BBoxes from PDF
**File:** `pipeline_05_extract_bboxes_from_pdf.py`
**Purpose:** Extract word-level and character-level bounding boxes from PDFs using PyMuPDF for accurate text positioning.
**Key Functions:**
- `main()`: Main extraction orchestrator
- `extract_bboxes_from_pdf()`: PyMuPDF-based extraction
- `verify_char_to_word_mapping()`: Validate character-to-word relationships
- `create_bbox_debug_pdf()`: Generate debug visualizations
**Process:**
1. Load PDFs from stage 04
2. Extract words with PyMuPDF `get_text("words")`
3. Extract characters with `get_text("rawdict")`
4. Map characters to parent words
5. Validate mappings (required for handwriting splitting)
6. Save both word and character bboxes
**Inputs:**
- PDFs from `pdf_initial/` (stage 04)
**Outputs:**
```
pdf_word_bboxes/ # Word-level bounding boxes
{doc_id}.json # Format: [x0, y0, x1, y1, text]
pdf_char_bboxes/ # Character-level bounding boxes (if mappable)
{doc_id}.json # Format: [x0, y0, x1, y1, char, word_idx]
debug/pdf_bboxes/ # Debug PDFs with bbox overlays (optional)
logs/bbox_extraction/ # Extraction logs
```
**BBox JSON Format:**
**Word BBoxes:**
```json
[
{"x0": 20.0, "y0": 30.0, "x1": 60.0, "y1": 45.0, "text": "Invoice"},
{"x0": 65.0, "y0": 30.0, "x1": 95.0, "y1": 45.0, "text": "Date"}
]
```
**Char BBoxes (with word mapping):**
```json
[
{"x0": 20.0, "y0": 30.0, "x1": 25.0, "y1": 45.0, "char": "I", "word_idx": 0},
{"x0": 25.0, "y0": 30.0, "x1": 30.0, "y1": 45.0, "char": "n", "word_idx": 0}
]
```
**Key Features:**
- **Dual-Level Extraction:** Both word and character granularity
- **Mapping Validation:** Ensures characters can be mapped to words (critical for handwriting)
- **Debug Visualizations:** Color-coded bbox overlays on PDFs
- **PDF Coordinate System:** Uses 72 DPI (PDF points)
**Character-to-Word Mapping:**
Required for handwriting splitting in stage 07. If mapping fails (e.g., due to ligatures or complex fonts), char bboxes are not saved, and handwriting will use word-level fallback.
**Error Handling:**
- Empty PDFs β Skipped with warning
- Unmappable characters β Word-level fallback
- Invalid bboxes (negative dimensions) β Filtered out
---
### Stage 06: Extract Layout Element Definitions and Annotation GT
**File:** `pipeline_06_extract_layout_element_definitions_and_annotation_gt.py`
**Purpose:** Extract layout element definitions for Document Layout Analysis (DLA) and annotation-based ground truth for Key Information Extraction (KIE).
**Key Functions:**
- `main()`: Task router
- `handle_dla()`: Process DLA tasks
- `handle_kie()`: Process KIE tasks
- `parse_layout_elements()`: Extract layout elements from geometries
- `parse_kie_fields()`: Extract KIE annotated fields from geometries
**Process (DLA):**
1. Load geometries from stage 04
2. Filter elements with `data-label` attribute
3. Validate labels against task-specific valid labels
4. Extract bounding boxes and labels
5. Save layout element definitions
**Process (KIE):**
1. Load geometries from stage 04
2. Filter elements with `data-annotation` attribute
3. Extract entity names and text values
4. Validate against expected entity schema
5. Save raw annotations for stage 17
**Inputs:**
- Geometries from `geometries/` (stage 04)
- Valid labels from `SynDatasetDefinition.valid_labels`
- Task type from `SynDatasetDefinition.task`
**Outputs:**
```
layout_element_definitions/ # For DLA tasks
{doc_id}.json # [{id, label, rect}, ...]
raw_annotations/ # For KIE tasks
kie_annotations/
{doc_id}.json # {entity_name: text_value, ...}
```
**Layout Element Definition Format (DLA):**
```json
[
{
"id": "layout_0",
"label": "title",
"rect": {"x": 20.0, "y": 30.0, "width": 170.0, "height": 15.0}
},
{
"id": "layout_1",
"label": "text",
"rect": {"x": 20.0, "y": 50.0, "width": 170.0, "height": 50.0}
}
]
```
**KIE Annotation Format:**
```json
{
"company_name": "Acme Corporation",
"invoice_number": "INV-12345",
"invoice_date": "January 15, 2024",
"total_amount": "$1,234.56"
}
```
**Validation Checks:**
- **Missing labels:** Elements without `data-label` β Warning
- **Invalid labels:** Labels not in valid_labels β Error
- **Zero dimensions:** Width or height = 0 β Error
- **Multiple labels (KIE only):** One element, multiple entities β Error
- **Missing text:** KIE elements without text β Warning
**Task-Specific Handling:**
- **DLA:** Converts geometries to `LayoutBBox` format for stage 17
- **KIE:** Extracts entity-value pairs for BIO tagging in stage 17
- **QA/CLS:** No processing in this stage (uses raw GT from stage 03)
---
### Stage 07: Extract Handwriting
**File:** `pipeline_07_extract_handwriting.py`
**Purpose:** Identify text regions marked for handwriting generation, map them to character/word bounding boxes, and prepare definitions for the diffusion model.
**Key Functions:**
- `main()`: Main extraction orchestrator
- `parse_handwriting_geometries()`: Extract elements with `data-handwriting` attribute
- `map_handwriting_to_bboxes()`: Match handwriting text to OCR bboxes spatially
- `split_long_words()`: Split words exceeding diffusion model's character limit
- `extract_handwriting_style()`: Parse author ID from CSS class
**Process:**
1. Load geometries from stage 04
2. Filter elements with `data-handwriting` attribute
3. Load word/character bboxes from stage 05
4. Spatially match handwriting regions to bboxes
5. Split long words for diffusion model constraints
6. Extract author ID for style consistency
7. Save handwriting definitions with bbox mappings
**Inputs:**
- Geometries from `geometries/` (stage 04)
- Word bboxes from `pdf_word_bboxes/` (stage 05)
- Character bboxes from `pdf_char_bboxes/` (stage 05, if available)
**Outputs:**
```
handwriting_definitions/
{doc_id}.json # Handwriting region definitions
logs/handwriting_extraction/ # Extraction logs and warnings
```
**Handwriting Definition Format:**
```json
[
{
"id": "hw0",
"text": "John Smith",
"author_id": "author1",
"bboxes": [
"20.0,30.0,60.0,45.0,John",
"65.0,30.0,95.0,45.0,Smith"
],
"rect": {
"x": 20.0,
"y": 30.0,
"width": 75.0,
"height": 15.0
},
"is_signature": false
}
]
```
**Key Features:**
- **Author ID Extraction:** Parses CSS classes like `handwriting-author1` for style consistency
- **Spatial Matching:** Uses bbox overlap/proximity to map handwriting regions to text
- **Long Word Splitting:** Splits words > `MAX_HANDWRITING_CHARS` (default: 7) into multiple images
- **Signature Detection:** Identifies signature fields for special handling
- **Character-Level Fallback:** Uses word bboxes if char-level mapping unavailable
**Configuration Constants:**
- `MAX_HANDWRITING_CHARS`: 7 (max characters per diffusion generation)
- `SPATIAL_MATCH_THRESHOLD`: Pixel tolerance for bbox matching
**Mapping Logic:**
1. Find all words whose bboxes overlap/intersect with handwriting rect
2. Extract text from matched bboxes
3. Compare with expected handwriting text
4. If char bboxes available: split long words at char boundaries
5. If char bboxes unavailable: split at word boundaries
**Error Handling:**
- No bbox matches β Warning, handwriting region skipped
- Text mismatch β Logged, uses best match
- Missing char bboxes β Word-level fallback (may affect quality)
---
### Stage 08: Extract Visual Element Definitions
**File:** `pipeline_08_extract_visual_element_definitions.py`
**Purpose:** Extract placeholders for visual elements (stamps, logos, barcodes, charts, photos) from geometries for generation in stage 10.
**Key Functions:**
- `main()`: Main extraction orchestrator
- `parse_visual_element_geometries()`: Extract elements with `data-visual-element` attribute
- `parse_css_dimension()`: Parse width/height from CSS
- `parse_css_rotation()`: Extract rotation angle from CSS transform
**Process:**
1. Load geometries from stage 04
2. Filter elements with `data-visual-element` attribute
3. Parse element type (stamp, logo, barcode, etc.)
4. Extract dimensions and rotation from CSS
5. Parse content (e.g., stamp text, barcode value)
6. Validate element types and dimensions
7. Save visual element definitions
**Inputs:**
- Geometries from `geometries/` (stage 04)
**Outputs:**
```
visual_element_definitions/
{doc_id}.json # Visual element definitions
logs/visual_element_extraction/ # Extraction logs
```
**Visual Element Definition Format:**
```json
[
{
"id": "ve0",
"type": "stamp",
"type_unmapped": "stamp",
"content": "CONFIDENTIAL",
"rect": {
"x": 150.0,
"y": 250.0,
"width": 60.0,
"height": 30.0
},
"rotation": -15.0
},
{
"id": "ve1",
"type": "barcode",
"type_unmapped": "barcode",
"content": "1234567890",
"rect": {
"x": 20.0,
"y": 270.0,
"width": 100.0,
"height": 20.0
},
"rotation": 0.0
}
]
```
**Supported Visual Element Types:**
- `stamp`: Text-based stamps (e.g., "APPROVED", "CONFIDENTIAL")
- `logo`: Company/brand logos (selected from prefabs)
- `barcode`: Code128 barcodes
- `chart`: Charts and graphs (selected from prefabs)
- `photo`: Photographic images (selected from prefabs)
**Type Mapping:**
Maps LLM-generated type names to standard types using `VISUAL_ELEMENT_TYPE_MAPPING` constant.
**Key Features:**
- **Content Extraction:** Parses text content for stamps, values for barcodes
- **Rotation Support:** Extracts CSS `rotate()` transform angles
- **Dimension Parsing:** Handles CSS units (px, mm, %, etc.)
- **Type Validation:** Warns about unknown types, uses fallback mapping
**Validation Checks:**
- Unknown type β Logged, mapped to closest known type
- Zero dimensions β Error, element skipped
- Invalid rotation angle β Defaults to 0Β°
- Missing content (for stamps/barcodes) β Warning
**CSS Parsing Examples:**
```css
/* Rotation extraction */
transform: rotate(-15deg); β -15.0
transform: rotate(0.5turn); β 180.0
/* Dimension extraction */
width: 60mm; β 60.0 (mm)
height: 30px; β Converted to mm
```
---
### Stage 09: Create Handwriting Images
**File:** `pipeline_09_create_handwriting_images.py`
**Purpose:** Generate realistic handwriting images using a diffusion model, with per-author style consistency.
**Key Functions:**
- `main()`: Main generation orchestrator
- `generate_handwriting_diffusion()`: Call diffusion model API/local model
- `add_handwriting_blur()`: Optional post-processing for realism
**Process:**
1. Load handwriting definitions from stage 07
2. Group by author ID for style consistency
3. For each text segment:
- Generate handwriting image with diffusion model
- Apply optional blur for realism
- Save image with metadata
4. Track author-to-writer style mapping
5. Save generation logs
**Inputs:**
- Handwriting definitions from `handwriting_definitions/` (stage 07)
- Diffusion model checkpoint (local or API)
**Outputs:**
```
handwriting_images/
{doc_id}/
hw0_0.png # First bbox of handwriting region hw0
hw0_1.png # Second bbox (if multi-word)
hw1_0.png
logs/handwriting_generation/ # Generation logs with author mappings
```
**Handwriting Image Specifications:**
- **Format:** PNG with transparency
- **Height:** 40 pixels (configurable via `HANDWRITING_HEIGHT_PX`)
- **Padding:** 0 pixels horizontal (configurable via `HANDWRITING_PADDING_PX`)
- **Background:** Transparent
- **Color:** Black text on transparent
**Diffusion Model Integration:**
Located in `handwriting_diffusion/` module:
- `generate_handwriting_diffusion_raw.py`: Core generation logic
- `text_encoder.py`: Text encoding for model input
- `tokenizer.py`: Character tokenization
**Key Features:**
- **Author-Style Consistency:** Same author ID β consistent handwriting style
- **Writer Style Mapping:** Maps author IDs to writer style IDs (e.g., "author1" β writer_42)
- **Batch Processing:** Generates multiple images efficiently
- **Optional Blur:** Post-processing for more realistic appearance
- **Quality Control:** Validates generated images (non-empty, correct dimensions)
**Configuration Constants:**
- `HANDWRITING_HEIGHT_PX`: 40 pixels
- `HANDWRITING_PADDING_PX`: 0 pixels
- `DIFFUSION_NUM_INFERENCE_STEPS`: 50 (generation quality/speed tradeoff)
- `HANDWRITING_BLUR_ENABLED`: Optional blur toggle
- `HANDWRITING_STYLES`: List of available writer styles
**Author-to-Writer Mapping Example:**
```json
{
"author1": "writer_42",
"author2": "writer_17",
"author3": "writer_89"
}
```
**Error Handling:**
- Generation failure β Retry up to 3 times
- Empty image β Logged, document flagged
- Model timeout β Skipped, logged
---
### Stage 10: Create Visual Elements
**File:** `pipeline_10_create_visual_elements.py`
**Purpose:** Generate or select visual element images (stamps, logos, barcodes, charts, photos) based on definitions from stage 08.
**Key Functions:**
- `main()`: Main generation orchestrator
- `route_visual_element_generation()`: Router for element type
- `generate_stamp()`: Create stamp image with text
- `select_logo()`: Random selection from logo prefabs
- `generate_barcode()`: Create Code128 barcode
- `select_photo()`: Random selection from photo prefabs
- `select_chart()`: Random selection from chart prefabs
**Process:**
1. Load visual element definitions from stage 08
2. For each element:
- Route to type-specific generator
- Generate or select image
- Resize to target dimensions
- Save with transparency (if applicable)
3. Cache prefab directories for performance
4. Save generation logs
**Inputs:**
- Visual element definitions from `visual_element_definitions/` (stage 08)
- Prefab directories in `data/visual_element_prefabs/`:
- `logos/`
- `photos/`
- `charts/`
**Outputs:**
```
visual_element_images/
{doc_id}/
ve0.png # Stamp image
ve1.png # Logo image
ve2.png # Barcode image
logs/visual_element_generation/ # Generation logs
```
**Type-Specific Generation:**
**Stamps:**
- Font: Configurable (default: Arial Bold)
- Color: Configurable (default: red/blue)
- Background: Transparent
- Border: Optional rounded rectangle
- Rotation: Applied during insertion (stage 13)
**Logos:**
- Source: Random selection from `data/visual_element_prefabs/logos/`
- Format: PNG with transparency preserved
- Caching: Directory contents cached after first scan
**Barcodes:**
- Type: Code128 (supports alphanumeric)
- Library: `python-barcode`
- Background: White
- Content validation: Numeric or alphanumeric
**Photos:**
- Source: Random selection from `data/visual_element_prefabs/photos/`
- Format: JPEG or PNG
- Aspect ratio: Preserved during resize
**Charts/Figures:**
- Source: Random selection from `data/visual_element_prefabs/charts/`
- Format: PNG with transparency
- Types: Bar charts, line graphs, pie charts, etc.
**Key Features:**
- **Type-Specific Logic:** Each element type has dedicated generation function
- **Prefab Caching:** Directory scans cached for performance
- **Transparent Backgrounds:** Stamps and some logos support transparency
- **Content Validation:** Barcodes validate numeric content
- **Aspect Ratio Preservation:** Images scaled without distortion
**Configuration Constants:**
- `STAMP_FONT_SIZE`: Calculated from target dimensions
- `STAMP_BORDER_WIDTH`: 2 pixels
- `BARCODE_DPI`: 300 for high quality
- `PREFAB_CACHE_SIZE`: In-memory cache limit
**Error Handling:**
- Missing prefab directory β Error, element skipped
- Empty prefab directory β Warning, fallback placeholder
- Invalid barcode content β Logged, uses fallback text
- Image generation failure β Placeholder created
---
### Stage 11: Render PDF Second Pass
**File:** `pipeline_11_render_pdf_second_pass.py`
**Purpose:** Re-render PDF without handwriting placeholders (replaced with blank spaces) to prepare for handwriting image insertion in stage 12.
**Key Functions:**
- `main()`: Main rendering orchestrator
- `render_pdf_playwright_async()`: Async Playwright rendering
- `remove_handwriting_placeholders_from_html()`: Strip handwriting elements
**Process:**
1. Load HTML from `render_html/` (stage 04)
2. Remove all elements with `data-handwriting` attribute
3. Re-render PDF using same dimensions from stage 04
4. Validate PDF output
5. Save PDFs without handwriting placeholders
**Inputs:**
- HTML from `render_html/` (stage 04)
- Page dimensions from stage 04 logs
**Outputs:**
```
pdf_without_handwriting_placeholder/
{doc_id}.pdf # PDFs with handwriting regions blank
logs/rendering_second_pass/ # Rendering logs
```
**Key Differences from Stage 04:**
- **Uses Pre-calculated Dimensions:** No content measurement needed
- **Handwriting Elements Removed:** Not just hidden (visibility: hidden), but removed from DOM
- **No Geometry Extraction:** Geometries already saved in stage 04
- **Faster Rendering:** No JavaScript injection for measurement
**HTML Preprocessing:**
```html
<!-- Before (Stage 04) -->
<div data-handwriting="author1" class="handwriting">John Smith</div>
<!-- After (Stage 11) -->
<!-- Element completely removed -->
```
**Why This Stage Exists:**
Handwriting placeholders in HTML use system fonts, which don't match the diffusion-generated handwriting. Removing them creates blank spaces where handwriting images will be inserted in stage 12.
**Rendering Configuration:**
- Same timeout and retry logic as stage 04
- Reuses Playwright browser context for efficiency
- No debug output (geometries not extracted)
**Error Handling:**
- Multi-page PDF β Flagged and skipped
- Rendering timeout β Retry with increased timeout
- HTML parsing error β Logged, document marked invalid
---
### Stage 12: Insert Handwriting Images
**File:** `pipeline_12_insert_handwriting_images.py`
**Purpose:** Overlay generated handwriting images onto PDF pages using PyMuPDF, with precise positioning and natural variation.
**Key Functions:**
- `main()`: Main insertion orchestrator
- `insert_handwriting_into_pdf()`: Per-document insertion
- `scale_image_with_aspect_ratio()`: Resize images while preserving aspect
- `group_bboxes_by_line()`: Group multi-word handwriting by line
**Process:**
1. Load PDFs from stage 11
2. Load handwriting images from stage 09
3. Load handwriting definitions (bbox mappings) from stage 07
4. For each handwriting region:
- Group bboxes by line/block
- Scale images to fit bboxes
- Apply random offsets for natural variation
- Insert images at calculated positions
5. Save PDFs with handwriting
**Inputs:**
- PDFs from `pdf_without_handwriting_placeholder/` (stage 11)
- Handwriting images from `handwriting_images/` (stage 09)
- Handwriting definitions from `handwriting_definitions/` (stage 07)
**Outputs:**
```
pdf_with_handwriting/
{doc_id}.pdf # PDFs with handwriting inserted
logs/handwriting_insertion/ # Insertion logs
debug/handwriting_insertion/ # Debug PDFs with bbox overlays (optional)
```
**Insertion Logic:**
**1. Image Scaling:**
- High-res scaling: 3x upsampling before insertion
- Aspect ratio preserved
- Left-aligned within bbox (respects layout rect)
**2. Positioning:**
- **X coordinate:** Left edge of bbox + random offset
- **Y coordinate:** Top edge of bbox + random offset
- **Multi-word lines:** Consistent Y offset for line
**3. Random Offsets:**
```python
x_offset = random.uniform(-MAX_HANDWRITING_RAND_X, MAX_HANDWRITING_RAND_X)
y_offset = random.uniform(-MAX_HANDWRITING_RAND_Y, MAX_HANDWRITING_RAND_Y)
```
**4. Block/Line Grouping:**
For multi-word handwriting (e.g., "John Smith"):
- Group bboxes by Y coordinate (same line)
- Apply consistent Y offset to entire line
- Individual X offsets per word for natural spacing
**Key Features:**
- **High-Resolution Insertion:** 3x scaling for quality
- **Natural Variation:** Random offsets simulate handwriting imperfection
- **Line Consistency:** Multi-word lines maintain baseline
- **Aspect Ratio Preservation:** Images not distorted
- **Transparency Support:** PNG alpha channel preserved
**Configuration Constants:**
- `HANDWRITING_IMAGE_UPSCALE_FACTOR`: 3x
- `MAX_HANDWRITING_RAND_X`: Β±2 pixels
- `MAX_HANDWRITING_RAND_Y`: Β±1 pixel
- `HANDWRITING_LINE_Y_CONSISTENCY`: Same Y offset per line
**Coordinate System:**
- PDF uses 72 DPI (points)
- Bboxes from stage 07 are in PDF coordinates
- No conversion needed
**Error Handling:**
- Missing handwriting image β Warning, bbox skipped
- Image too large for bbox β Scaled down with warning
- PyMuPDF insertion failure β Logged, document flagged
---
### Stage 13: Insert Visual Elements
**File:** `pipeline_13_insert_visual_elements.py`
**Purpose:** Overlay visual element images (stamps, logos, barcodes, etc.) onto PDF pages with precise positioning and rotation.
**Key Functions:**
- `main()`: Main insertion orchestrator
- `insert_visual_elements_into_pdf()`: Per-document insertion
- `scale_image_with_aspect_ratio()`: Resize images (same as stage 12)
**Process:**
1. Load PDFs from stage 12
2. Load visual element images from stage 10
3. Load visual element definitions from stage 08
4. For each visual element:
- Scale image to fit bbox
- Calculate centered position
- Apply rotation (if specified)
- Insert image at calculated position
5. Save final PDFs
6. If no visual elements: copy PDF from stage 12
**Inputs:**
- PDFs from `pdf_with_handwriting/` (stage 12)
- Visual element images from `visual_element_images/` (stage 10)
- Visual element definitions from `visual_element_definitions/` (stage 08)
**Outputs:**
```
pdf_final/
{doc_id}.pdf # Final PDFs with all elements
logs/visual_element_insertion/ # Insertion logs
debug/visual_element_insertion/ # Debug PDFs (optional)
```
**Insertion Logic:**
**1. Image Scaling:**
- High-res scaling: 3x upsampling
- Aspect ratio preserved
- Centered within bbox (not left-aligned like handwriting)
**2. Positioning:**
- **X coordinate:** Center of bbox - half image width
- **Y coordinate:** Center of bbox - half image height
**3. Rotation:**
- Applied via PyMuPDF transformation matrix
- Rotation around image center
- Angle from visual element definition
**Key Differences from Stage 12 (Handwriting):**
- **Centered placement:** Visual elements centered in bbox
- **No random offsets:** Precise placement for logos/stamps
- **Rotation support:** Stamps often rotated for "APPROVED" effect
- **Fallback:** Copies PDF if no visual elements (ensures output exists)
**Rotation Transformation:**
```python
# PyMuPDF rotation matrix
rotation_matrix = fitz.Matrix(rotation_angle)
image_rect = image_rect * rotation_matrix
```
**Key Features:**
- **High-Resolution Insertion:** 3x scaling for quality
- **Centered Alignment:** Visual elements centered in bboxes
- **Rotation Support:** Arbitrary angles for stamps
- **Transparency Preservation:** PNG alpha channel maintained
- **Fallback Handling:** Copies PDF if no visual elements
**Configuration Constants:**
- `VISUAL_ELEMENT_UPSCALE_FACTOR`: 3x
- `ROTATION_PRECISION`: Angle precision in degrees
**Coordinate System:**
- Same as stage 12 (PDF 72 DPI)
- Rotation applied after positioning
**Error Handling:**
- Missing visual element image β Warning, element skipped
- Image too large for bbox β Scaled down with warning
- PyMuPDF insertion failure β Logged, document flagged
- No visual elements β PDF copied from stage 12
---
### Stage 14: Render Image
**File:** `pipeline_14_render_image.py`
**Purpose:** Convert final PDFs to high-quality PNG images for OCR and dataset distribution.
**Key Functions:**
- `main()`: Main conversion orchestrator
- `convert_pdf_to_image()`: PDF to PNG conversion
**Process:**
1. Load PDFs from stage 13
2. Convert each PDF to PNG using custom PDF-to-image module
3. Validate image dimensions
4. Save images
**Inputs:**
- PDFs from `pdf_final/` (stage 13)
**Outputs:**
```
images/
{doc_id}.png # Final document images
logs/image_rendering/ # Conversion logs
```
**Image Specifications:**
- **Format:** PNG
- **DPI:** Configurable (default: 200 DPI for quality OCR)
- **Color Mode:** RGB (24-bit)
- **Compression:** PNG lossless
**PDF-to-Image Module:**
Located in custom module (not standard library):
- Handles PDF rendering at specified DPI
- Single-page conversion only (multi-page PDFs skipped)
- Uses PDF coordinate system: 72 DPI internally
**Key Features:**
- **High DPI:** 200+ DPI for accurate OCR
- **Single Page Only:** Multi-page PDFs flagged as errors
- **Lossless Compression:** PNG preserves all details
- **Size Validation:** Checks image dimensions match PDF
**Configuration Constants:**
- `IMAGE_DPI`: 200 (OCR quality vs file size tradeoff)
- `IMAGE_MAX_DIMENSION`: Optional max width/height
**Coordinate System Conversion:**
```
PDF: 210mm Γ 297mm @ 72 DPI β 595 Γ 842 points
PNG: 210mm Γ 297mm @ 200 DPI β 1654 Γ 2339 pixels
```
**Why This Stage:**
- OCR performs better on high-DPI images
- Images are final output format for datasets
- PNG preserves quality better than JPEG for text
**Error Handling:**
- Multi-page PDF β Flagged and skipped
- Conversion failure β Logged, document marked invalid
- Empty image β Error, document flagged
- Dimension mismatch β Warning logged
---
### Stage 15: Perform OCR
**File:** `pipeline_15_perform_ocr.py`
**Purpose:** Perform Optical Character Recognition on final images to obtain accurate word and line-level bounding boxes, essential for documents with handwriting or visual elements.
**Key Functions:**
- `main()`: Main OCR orchestrator
- `call_microsoft_ocr()`: Microsoft Azure OCR API call
- `convert_ocr_to_bbox_format()`: Transform OCR results to internal format
- `aggregate_words_to_segments()`: Group words into lines
**Process:**
1. Determine which documents need OCR:
- Has handwriting β Requires OCR
- Has visual elements β Requires OCR
- Neither β Copy PDF bboxes from stage 05
2. For documents requiring OCR:
- Call Microsoft OCR service
- Parse word-level bboxes
- Aggregate into line-level segments
- Convert coordinates to PDF space
3. Save final bboxes
**Inputs:**
- Images from `images/` (stage 14)
- Handwriting definitions from `handwriting_definitions/` (stage 07)
- Visual element definitions from `visual_element_definitions/` (stage 08)
- PDF bboxes from `pdf_word_bboxes/` (stage 05, for non-OCR documents)
**Outputs:**
```
final_word_bboxes/
{doc_id}.json # Word-level bounding boxes
final_segment_bboxes/
{doc_id}.json # Line-level bounding boxes
ocr_results_cache/
{doc_id}.json # Raw OCR API responses (cached)
logs/ocr/ # OCR logs and errors
```
**OCR Decision Logic:**
```python
requires_ocr = (
has_handwriting(doc_id) or
has_visual_elements(doc_id)
)
if requires_ocr:
perform_microsoft_ocr()
else:
copy_pdf_bboxes() # Reuse stage 05 results
```
**Why Handwriting/Visual Elements Require OCR:**
- Handwriting images inserted in stage 12 β Not in PDF text layer
- Visual elements inserted in stage 13 β Not in PDF text layer
- PDF text extraction (stage 05) misses these elements
- OCR captures all visible text on rendered image
**Microsoft OCR API:**
- **Service:** Azure Computer Vision (Read API)
- **Input:** PNG image (from stage 14)
- **Output:** Word polygons, text, confidence scores
- **Caching:** Results cached to avoid re-processing
**OCR Response Format:**
```json
{
"readResults": [{
"page": 1,
"words": [
{
"boundingBox": [20, 30, 60, 30, 60, 45, 20, 45],
"text": "Invoice",
"confidence": 0.98
}
],
"lines": [
{
"boundingBox": [20, 30, 95, 30, 95, 45, 20, 45],
"text": "Invoice Date",
"words": [...]
}
]
}]
}
```
**Coordinate Conversion:**
OCR returns pixel coordinates; converted to PDF points:
```python
pdf_x = ocr_x * (pdf_width / image_width)
pdf_y = ocr_y * (pdf_height / image_height)
```
**Segment Aggregation:**
Groups words into lines based on Y coordinate proximity.
**Key Features:**
- **Selective OCR:** Only runs OCR when necessary (cost/time savings)
- **Result Caching:** Avoids redundant API calls
- **Dual-Level Output:** Word and line (segment) bboxes
- **Coordinate Conversion:** OCR pixels β PDF points
- **Fallback:** Copies PDF bboxes for documents without handwriting/VEs
**Configuration:**
- OCR API key from environment variables
- Timeout: 30 seconds per request
- Retry: Up to 3 attempts
**Error Handling:**
- OCR API failure β Retry, fallback to PDF bboxes if all retries fail
- Empty OCR result β Warning, uses PDF bboxes
- Coordinate conversion error β Logged, bbox skipped
---
### Stage 16: Normalize BBoxes
**File:** `pipeline_16_normalize_bboxes.py`
**Purpose:** Convert bounding box coordinates from PDF points (absolute pixels) to normalized [0, 1] coordinates for model training and evaluation.
**Key Functions:**
- `main()`: Main normalization orchestrator
- `normalize_word_and_segment_bboxes()`: Normalize word/segment bboxes
- `normalize_layout_bboxes()`: Normalize layout element bboxes (DLA only)
- `normalize_coordinates()`: Core coordinate transformation
**Process:**
1. Load final bboxes from stage 15
2. Load image dimensions (for normalization denominators)
3. For each bbox:
- Transform: `normalized_x = pixel_x / image_width`
- Transform: `normalized_y = pixel_y / image_height`
- Preserve text content
4. Save normalized bboxes
5. For DLA tasks: also normalize layout element bboxes
**Inputs:**
- Word bboxes from `final_word_bboxes/` (stage 15)
- Segment bboxes from `final_segment_bboxes/` (stage 15)
- Layout element definitions from `layout_element_definitions/` (stage 06, for DLA)
- Image dimensions from stage 14 logs
**Outputs:**
```
normalized_word_bboxes/
{doc_id}.json # Normalized word bboxes
normalized_segment_bboxes/
{doc_id}.json # Normalized segment bboxes
normalized_gt/ # For DLA tasks only
{doc_id}.json # Normalized layout elements
logs/normalization/ # Normalization logs
```
**Normalization Formula:**
```python
normalized_bbox = {
"x0": bbox["x0"] / image_width,
"y0": bbox["y0"] / image_height,
"x1": bbox["x1"] / image_width,
"y1": bbox["y1"] / image_height,
"text": bbox["text"] # Preserved
}
```
**Normalized BBox Format:**
```json
[
{
"x0": 0.095, # Was: 20 pixels out of 210mm @ 200dpi
"y0": 0.128, # Was: 30 pixels
"x1": 0.286, # Was: 60 pixels
"y1": 0.192, # Was: 45 pixels
"text": "Invoice"
}
]
```
**DLA-Specific Normalization:**
For Document Layout Analysis tasks, layout element bboxes are also normalized:
```json
[
{
"label": "title",
"bbox": [0.095, 0.128, 0.905, 0.192] # [x0, y0, x1, y1]
},
{
"label": "text",
"bbox": [0.095, 0.213, 0.905, 0.534]
}
]
```
**Why Normalization:**
- **Model Training:** Most models expect [0, 1] coordinates
- **Resolution Independence:** Works across different image sizes
- **Standard Format:** Matches common dataset formats (e.g., LayoutLM)
**Coordinate System Mapping:**
```
PDF Points (72 DPI):
595 Γ 842 points (A4)
β (stage 14 conversion @ 200 DPI)
Image Pixels (200 DPI):
1654 Γ 2339 pixels
β (stage 16 normalization)
Normalized [0, 1]:
0.0-1.0 Γ 0.0-1.0
```
**Key Features:**
- **Preserves Text:** Text content unchanged during normalization
- **Task-Specific:** DLA tasks get additional layout bbox normalization
- **Validation:** Checks for out-of-bounds coordinates (clamps to [0, 1])
- **Precision:** Full float precision maintained
**Error Handling:**
- Out-of-bounds coordinates β Clamped to [0, 1] with warning
- Missing image dimensions β Error, document skipped
- Zero image dimensions β Error, document marked invalid
---
### Stage 17: GT Preparation & Verification
**File:** `pipeline_17_gt_preparation_verification.py`
**Purpose:** Validate and prepare final ground truth annotations with fuzzy text matching, task-specific formatting, and comprehensive validation.
**Key Functions:**
- `main()`: Main verification orchestrator
- `route_task()`: Route to task-specific handler
- `handle_qa()`: QA ground truth processing
- `handle_kie()`: KIE ground truth with BIO tagging
- `handle_dla()`: DLA ground truth processing
- `fuzzy_match_text()`: Levenshtein-based text matching
**Process:**
1. Load raw GT from stage 03/06
2. Load final word bboxes from stage 15
3. Route to task-specific handler
4. Perform fuzzy text matching (GT text β OCR text)
5. Map GT annotations to bbox indices
6. Apply task-specific formatting
7. Validate and save verified GT
**Inputs:**
- Raw GT from `raw_annotations/` (stage 03/06)
- Word bboxes from `final_word_bboxes/` (stage 15)
- Layout elements from `layout_element_definitions/` (stage 06, for DLA)
- Visual elements from `visual_element_definitions/` (stage 08, for DLA)
**Outputs:**
```
verified_gt/
qa/
{doc_id}.json # QA format
kie/
{doc_id}.json # KIE format with BIO tagging
dla/
{doc_id}.json # DLA format with normalized bboxes
cls/
{doc_id}.json # Classification format
logs/gt_verification/ # Verification logs with match statistics
```
---
#### **QA (Question Answering) Format**
**Process:**
1. Load QA pairs: `{"question": "answer", ...}`
2. For each answer:
- Find words in bboxes matching answer text (fuzzy)
- Record bbox indices of matching words
3. Save verified GT with bbox mappings
**Output Format:**
```json
{
"questions": [
{
"question": "What is the invoice number?",
"answer": "INV-12345",
"answer_bbox_indices": [15, 16] # Word indices in final_word_bboxes
},
{
"question": "What is the total amount?",
"answer": "$1,234.56",
"answer_bbox_indices": [42]
}
]
}
```
**Fuzzy Matching:**
Uses Levenshtein distance with 0.85 similarity cutoff:
```python
similarity = fuzz.ratio(gt_answer, ocr_text) / 100.0
if similarity >= 0.85:
match_found = True
```
---
#### **KIE (Key Information Extraction) Format**
**Process:**
1. Load entity annotations: `{entity_name: text_value, ...}`
2. For each entity:
- Find words matching entity value (fuzzy)
- Generate BIO tags for all words
3. Save verified GT with BIO tagging
**Output Format:**
```json
{
"entities": [
{
"entity": "company_name",
"value": "Acme Corporation",
"bbox_indices": [5, 6]
},
{
"entity": "invoice_date",
"value": "January 15, 2024",
"bbox_indices": [20, 21, 22]
}
],
"word_labels": [
"O", "O", "O", "O", "O", # Words 0-4: Outside entities
"B-company_name", "I-company_name", # Words 5-6: Company name
"O", "O", ..., # Words 7-19: Outside
"B-invoice_date", "I-invoice_date", "I-invoice_date" # Words 20-22
]
}
```
**BIO Tagging:**
- `B-entity`: Beginning of entity
- `I-entity`: Inside entity (continuation)
- `O`: Outside any entity
---
#### **DLA (Document Layout Analysis) Format**
**Process:**
1. Load layout element definitions from stage 06
2. Load visual element definitions from stage 08
3. Validate labels (must be in `valid_labels`)
4. Check spatial constraints (no containment, minimal overlap)
5. Merge visual elements into layout annotations
6. Normalize bboxes to [0, 1]
7. Save verified GT
**Output Format:**
```json
{
"layout_elements": [
{
"id": "layout_0",
"label": "title",
"bbox": [0.095, 0.128, 0.905, 0.192] # Normalized [x0, y0, x1, y1]
},
{
"id": "layout_1",
"label": "text",
"bbox": [0.095, 0.213, 0.905, 0.534]
},
{
"id": "ve0",
"label": "figure", # Visual element mapped to layout label
"bbox": [0.714, 0.895, 0.905, 0.980]
}
]
}
```
**Visual Element Merging:**
Visual elements from stage 08 are converted to layout labels:
- `stamp` β `figure` (or custom mapping)
- `logo` β `figure`
- `chart` β `figure`
- `barcode` β `figure`
- `photo` β `figure`
**Spatial Validation:**
- **No containment:** One bbox fully inside another β Error
- **Minimal overlap:** Overlap area < 5% of smaller bbox β Warning
- **Valid labels:** All labels must be in `SynDatasetDefinition.valid_labels`
---
#### **CLS (Classification) Format**
**Process:**
1. Load classification label from raw GT
2. Validate label against expected classes
3. Save verified GT
**Output Format:**
```json
{
"document_class": "invoice",
"confidence": 1.0
}
```
---
**Fuzzy Matching Details:**
Uses Levenshtein distance (via `fuzz` library) to handle OCR discrepancies:
```python
# Example: GT "INV-12345" vs OCR "INV-I2345" (OCR error)
similarity = fuzz.ratio("INV-12345", "INV-I2345") / 100.0
# similarity = 0.89 (above 0.85 threshold) β Match!
```
**Match Statistics Logged:**
- Total GT annotations
- Successfully matched annotations
- Failed matches (similarity < 0.85)
- Average similarity score
**Key Features:**
- **Fuzzy Matching:** Handles OCR errors gracefully
- **Task-Specific Formatting:** QA, KIE, DLA, CLS all handled differently
- **BIO Tagging:** Automatic generation for KIE
- **Visual Element Integration:** DLA merges visual elements as layout annotations
- **Spatial Validation:** Detects overlapping/contained layout elements
- **Similarity Tracking:** Logs match quality for analysis
**Error Handling:**
- Match failure (similarity < 0.85) β Logged, annotation skipped
- Invalid labels (DLA) β Error, document marked invalid
- Spatial violations (DLA) β Warning, elements flagged
- Missing bboxes β Error, document marked invalid
---
### Stage 18: Analyze
**File:** `pipeline_18_analyze.py`
**Purpose:** Generate comprehensive statistics, cost analysis, and error categorization for the entire dataset generation process.
**Key Functions:**
- `main()`: Main analysis orchestrator
- `calculate_api_costs()`: Compute LLM API costs from token usage
**Process:**
1. Load all document logs from stages 01-17
2. Categorize documents: valid vs. invalid
3. Calculate error distributions
4. Compute API usage and costs
5. Generate statistics (handwriting, visual elements, annotations)
6. Save comprehensive dataset log
**Inputs:**
- All document logs from previous stages
- Batch results from stage 02 (for cost calculation)
- Message processing logs from stage 03
- Prompt usage statistics
**Outputs:**
```
dataset_log.json # Comprehensive dataset statistics
logs/analysis/ # Analysis logs
```
**Dataset Log Structure:**
```json
{
"metadata": {
"syndatadef_name": "docvqa_alpha=1.0",
"task": "qa",
"total_documents_requested": 1000,
"generation_date": "2026-02-07"
},
"prompting": {
"total_prompts": 100,
"total_batches": 10,
"llm_model": "claude-sonnet-4-20250514"
},
"total_cost_summary": {
"total_cost_usd": 123.45,
"input_tokens": 1500000,
"output_tokens": 800000,
"cached_tokens": 500000,
"cost_per_document": 0.12
},
"valid_samples_stats": {
"total_valid": 847,
"total_invalid": 153,
"validity_rate": 0.847,
"avg_handwriting_regions_per_doc": 2.3,
"avg_visual_elements_per_doc": 1.5,
"avg_annotations_per_doc": 8.7,
"documents_with_handwriting": 654,
"documents_with_visual_elements": 512
},
"valid_samples": [
{
"doc_id": "doc_0001",
"seed_image": "docvqa_train_12345",
"has_handwriting": true,
"has_visual_elements": true,
"num_annotations": 10,
"num_words": 247,
"image_size": [1654, 2339]
}
],
"valid_samples_by_category": {
"invoice": 234,
"receipt": 198,
"form": 415
},
"errors": {
"multipage_pdf": 12,
"missing_ocr_result": 5,
"failed_gt_verification": 38,
"rendering_timeout": 8,
"llm_parsing_error": 23,
"bbox_extraction_failed": 4,
"handwriting_generation_failed": 7,
"visual_element_generation_failed": 3,
"other": 53
}
}
```
**Error Categories:**
| Error Category | Description |
|---------------|-------------|
| `multipage_pdf` | PDF rendered with multiple pages (invalid) |
| `missing_ocr_result` | OCR failed or returned empty result |
| `failed_gt_verification` | GT matching/validation failed in stage 17 |
| `rendering_timeout` | PDF rendering exceeded timeout |
| `llm_parsing_error` | Failed to extract HTML/GT from LLM response |
| `bbox_extraction_failed` | PyMuPDF failed to extract bboxes |
| `handwriting_generation_failed` | Diffusion model failed |
| `visual_element_generation_failed` | VE generation/selection failed |
| `other` | Miscellaneous errors |
**Cost Calculation:**
**Claude API Pricing (example):**
```python
costs = {
"input": input_tokens * 0.003 / 1000, # $3 per 1M tokens
"output": output_tokens * 0.015 / 1000, # $15 per 1M tokens
"cached": cached_tokens * 0.0003 / 1000 # $0.30 per 1M tokens (prompt caching)
}
total_cost = costs["input"] + costs["output"] + costs["cached"]
```
**Statistics Computed:**
- **Validity Rate:** Percentage of documents passing all stages
- **Handwriting Stats:** Documents with handwriting, avg regions per doc
- **Visual Element Stats:** Documents with VEs, avg elements per doc
- **Annotation Stats:** Avg QA pairs/KIE entities/DLA elements per doc
- **Token Usage:** Input/output/cached token totals
- **Cost Metrics:** Total cost, cost per document, cost per valid document
**Key Features:**
- **Comprehensive Error Tracking:** All error categories logged
- **Cost Transparency:** Token-level cost breakdown
- **Quality Metrics:** Validity rate, avg annotations, etc.
- **Category Breakdown:** Valid samples grouped by document type
- **Per-Document Tracking:** Each valid document's metadata saved
**Usage:**
This log is essential for:
- Understanding dataset quality
- Optimizing pipeline (identify bottlenecks)
- Cost estimation for future runs
- Debugging (error distribution analysis)
---
### Stage 19: Create Debug Data
**File:** `pipeline_19_create_debug_data.py`
**Purpose:** Generate comprehensive debug visualizations for manual inspection and quality assurance.
**Key Functions:**
- `main()`: Main debug generator
- `visualize_visual_element_bboxes()`: Overlay VE bboxes on PDFs
- `visualize_final_bboxes_on_images()`: Overlay OCR bboxes on images
- `visualize_pdf_bboxes()`: Overlay PDF bboxes
**Process:**
1. Load all generated documents
2. Create debug subdirectories
3. For each document:
- Generate PDF bbox overlays
- Generate VE bbox overlays
- Generate handwriting insertion region overlays
- Generate final OCR bbox overlays on images
4. Copy raw HTML with debug.js script for browser inspection
**Inputs:**
- All intermediate and final outputs from stages 01-18
**Outputs:**
```
debug/
pdf_bboxes/ # PDF word bboxes overlaid (stage 05)
{doc_id}.pdf
visual_element_bboxes/ # VE bboxes overlaid (stage 08)
{doc_id}.pdf
handwriting_insertion/ # Handwriting regions overlaid (stage 12)
{doc_id}.pdf
final_bboxes_on_images/ # OCR bboxes on final images (stage 15)
{doc_id}.png
html_with_debug/ # Raw HTML + debug.js
{doc_id}.html
debug.js # Browser-based inspection script
```
**Debug Visualizations:**
**1. PDF BBoxes (Stage 05):**
- Red rectangles: Word bounding boxes from PyMuPDF
- Annotated with word text
- Purpose: Verify PDF text extraction quality
**2. Visual Element BBoxes (Stage 08):**
- Blue rectangles: Visual element placeholder regions
- Annotated with element type (stamp, logo, etc.)
- Purpose: Verify VE extraction and positioning
**3. Handwriting Insertion Regions (Stage 12):**
- Green rectangles: Handwriting bbox regions
- Annotated with handwriting text
- Purpose: Verify handwriting placement accuracy
**4. Final BBoxes on Images (Stage 15):**
- Orange rectangles: OCR word bounding boxes
- Overlaid on final rendered images
- Purpose: Verify OCR accuracy and coverage
**Debug JavaScript (debug.js):**
```javascript
// Browser-based inspection tool
// Features:
// - Highlight elements on hover
// - Show geometry data in console
// - Toggle element visibility
// - Measure element dimensions
```
**Key Features:**
- **Color-Coded Overlays:** Different colors for different stages
- **Text Annotations:** Bboxes labeled with content
- **Multi-Format:** Both PDF and PNG visualizations
- **Browser Inspection:** HTML with interactive debug script
- **Selective Generation:** Only enabled with `DEBUG_MODE=true` flag
**Configuration:**
- `DEBUG_MODE`: Enable/disable debug output (default: false)
- `DEBUG_BBOX_LINE_WIDTH`: Line thickness for overlays (default: 2)
- `DEBUG_BBOX_OPACITY`: Overlay transparency (default: 0.5)
**When to Use:**
- Visual quality assurance
- Debugging bbox extraction issues
- Verifying handwriting/VE insertion
- Identifying OCR problems
- Manual inspection of edge cases
**Performance Note:**
Debug generation adds ~20% processing time; disabled by default in production.
---
## API Implementation
The DocGenie API provides a FastAPI-based REST service for synchronous document generation, integrating pipeline stages 01-06.
### Architecture Overview
```
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β API ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Client Request
β
β
βββββββββββββββββββββββ
β FastAPI Server β
β (main.py) β
βββββββββββββββββββββββ
β
ββββΊ Validate Request (schemas.py)
β
ββββΊ Download Seed Images (utils.py)
β
ββββΊ Build Prompt (utils.py)
β
ββββΊ Call Claude API (Synchronous)
β ββββΊ Claude Sonnet 4.5
β
ββββΊ Extract HTML & GT (pipeline_03 functions)
β
ββββΊ Render PDF (pipeline_04 functions)
β ββββΊ Playwright/Chromium
β
ββββΊ Extract BBoxes (pipeline_05 functions)
β ββββΊ PyMuPDF
β
ββββΊ Return Response
ββββΊ JSON with base64-encoded PDFs
```
---
### Endpoints
#### **GET /**
Health check endpoint.
**Response:**
```json
{
"message": "DocGenie API is running",
"version": "1.0",
"status": "healthy"
}
```
---
#### **GET /health**
Detailed health status.
**Response:**
```json
{
"status": "healthy",
"api_version": "1.0",
"playwright_available": true,
"llm_model": "claude-sonnet-4-5-20250929"
}
```
---
#### **POST /generate**
Generate documents with ground truth annotations.
**Request Body** (`schemas.GenerateDocumentRequest`):
```json
{
"seed_images": [
"https://example.com/seed1.jpg",
"https://example.com/seed2.jpg"
],
"prompt_params": {
"language": "English",
"doc_type": "business and administrative",
"gt_type": "Multiple questions and their answers",
"gt_format": "{\"question\": \"answer\", ...}",
"num_solutions": 2
}
}
```
**Request Validation:**
- `seed_images`: 1-10 URLs (HTTPS only)
- `num_solutions`: 1-5 documents
- All prompt parameters required
**Response** (`schemas.GenerateDocumentResponse`):
```json
{
"success": true,
"message": "Successfully generated 2 documents",
"documents": [
{
"document_id": "doc_20260207_001",
"html": "<html>...</html>",
"css": "body { ... }",
"ground_truth": {
"What is the invoice number?": "INV-12345"
},
"pdf_base64": "JVBERi0xLjQK...",
"bboxes": [
{
"x0": 20.0,
"y0": 30.0,
"x1": 60.0,
"y1": 45.0,
"text": "Invoice"
}
],
"page_width_mm": 210.0,
"page_height_mm": 297.0
}
],
"total_documents": 2
}
```
---
#### **POST /generate-files**
Generate documents and return as downloadable files.
**Request:** Same as `/generate`
**Response:**
- **Content-Type:** `application/zip`
- **File:** ZIP archive containing:
- `doc_001.pdf`
- `doc_001_gt.json`
- `doc_001_bboxes.json`
- `doc_002.pdf`
- ...
**File Structure:**
```
generated_documents.zip
βββ doc_001.pdf
βββ doc_001_gt.json
βββ doc_001_bboxes.json
βββ doc_001_metadata.json
βββ doc_002.pdf
βββ ...
```
---
### Request/Response Schemas
Defined in `api/schemas.py`:
#### **PromptParameters**
```python
class PromptParameters(BaseModel):
language: str = "English"
doc_type: str = "business and administrative"
gt_type: str = "Multiple questions and their answers"
gt_format: str = '{"question": "answer", ...}'
num_solutions: int = Field(default=1, ge=1, le=5)
```
#### **GenerateDocumentRequest**
```python
class GenerateDocumentRequest(BaseModel):
seed_images: List[HttpUrl] = Field(..., min_items=1, max_items=10)
prompt_params: PromptParameters
```
#### **BoundingBox**
```python
class BoundingBox(BaseModel):
x0: float
y0: float
x1: float
y1: float
text: str
```
#### **DocumentResult**
```python
class DocumentResult(BaseModel):
document_id: str
html: str
css: str
ground_truth: Optional[dict]
pdf_base64: str
bboxes: List[BoundingBox]
page_width_mm: float
page_height_mm: float
```
#### **GenerateDocumentResponse**
```python
class GenerateDocumentResponse(BaseModel):
success: bool
message: str
documents: List[DocumentResult]
total_documents: int
```
---
### API Pipeline Flow
Detailed integration with pipeline stages:
```python
# Simplified API flow (from api/main.py)
@app.post("/generate", response_model=GenerateDocumentResponse)
async def generate_documents(request: GenerateDocumentRequest):
# 1. Download and encode seed images
seed_images_base64 = await download_and_encode_images(request.seed_images)
# 2. Build prompt from template
prompt = build_prompt_from_template(
seed_images=seed_images_base64,
params=request.prompt_params
)
# 3. Call Claude API (synchronous, not batched)
llm_response = await call_claude_api(
prompt=prompt,
model="claude-sonnet-4-5-20250929"
)
# 4. Extract HTML and GT (from pipeline_03)
documents_html = extract_html_from_message(llm_response)
documents_gt = extract_gt_from_html(documents_html)
# 5. Validate HTML (from utils.py)
validate_html_structure(documents_html)
# 6. Render PDFs (from pipeline_04)
pdfs = await render_pdfs_with_playwright(documents_html)
# 7. Validate PDFs (from utils.py)
validate_pdf_pages(pdfs)
# 8. Extract bboxes (from pipeline_05)
bboxes = extract_bboxes_from_pdfs(pdfs)
# 9. Validate bboxes (from utils.py)
validate_bbox_completeness(bboxes)
# 10. Encode PDFs to base64
pdfs_base64 = encode_pdfs_to_base64(pdfs)
# 11. Build response
return GenerateDocumentResponse(
success=True,
documents=[...],
total_documents=len(documents_html)
)
```
---
### Integration with Pipeline Functions
**Reused from Pipeline:**
- `extract_html_from_message()` from `pipeline_03`
- `extract_gt_from_html()` from `pipeline_03`
- `render_pdf_with_playwright()` from `pipeline_04`
- `extract_bboxes_from_pdf()` from `pipeline_05`
- Various utilities from `docgenie.generation.utils`
**API-Specific Functions (api/utils.py):**
- `download_seed_images()`: Fetch images from URLs
- `encode_images_to_base64()`: Convert images for API transmission
- `build_prompt_from_template()`: Template-based prompt construction
- `call_claude_api_sync()`: Synchronous Claude API call (non-batched)
- `encode_pdf_to_base64()`: PDF encoding for response
- `validate_html_structure()`: HTML validation
- `validate_pdf_pages()`: PDF page count/size validation
- `validate_bbox_completeness()`: Ensure bboxes extracted
---
### Configuration
#### **Environment Variables (.env)**
```bash
ANTHROPIC_API_KEY=sk-ant-... # Required for Claude API
LLM_MODEL=claude-sonnet-4-5-20250929 # Default model
API_PORT=8000 # Server port
DEBUG_MODE=false # Enable debug logging
```
#### **Prompt Templates**
Located in `data/prompt_templates/`:
**Template Structure:**
```
data/prompt_templates/<template_name>/
βββ system_prompt.txt # System message
βββ user_prompt_template.txt # User message template
βββ example_output.html # Example for few-shot
```
**Placeholder Substitution:**
```python
# In user_prompt_template.txt
"""
Please generate {num_solutions} {doc_type} documents in {language}.
Ground truth format: {gt_format}
Ground truth type: {gt_type}
"""
# Substituted with request.prompt_params
```
#### **CORS Configuration**
```python
# In api/main.py
app.add_middleware(
CORSMiddleware,
allow_origins=["*"], # Open for development
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
```
---
### Authentication & Security
**Current State:**
- **No built-in authentication:** API is open (for development)
- **API Key Management:** Claude API key in `.env`, not passed in requests
- **CORS:** Open (`allow_origins=["*"]`) for development
**Production Recommendations:**
- Add API key authentication (e.g., Bearer tokens)
- Restrict CORS origins to known frontends
- Rate limiting (e.g., Redis-based)
- Input sanitization for HTML injection prevention
- HTTPS only (terminate SSL at reverse proxy)
---
### Performance Considerations
**Async Rendering:**
- Playwright rendering uses async/await
- Concurrent requests supported by FastAPI
- Semaphore control prevents resource exhaustion
**Image Size Limits:**
- Seed images compressed before API transmission
- Max dimension: 1024px (configurable)
- JPEG quality: 85% (configurable)
**Timeouts:**
- PDF render timeout: 60 seconds per document
- Claude API timeout: 120 seconds
- Total request timeout: 300 seconds (5 minutes)
**Retry Logic:**
- PDF rendering: Up to 3 attempts
- Claude API: Up to 2 retries on network errors
- No retry on validation failures
**Concurrency:**
- FastAPI default: Multiple workers (configurable)
- Playwright: Semaphore-controlled (10 concurrent renders)
---
### Limitations
**Current Limitations:**
1. **Single-Page Only:** Multi-page PDFs flagged as errors
2. **Seed Image Limit:** Maximum 10 seed images per request
3. **Document Limit:** Maximum 5 document variations (`num_solutions`)
4. **No Handwriting/Visual Elements:** Stages 07-13 not integrated (API stops at stage 06)
5. **Synchronous LLM:** No batching (higher cost per document)
6. **No Dataset Export:** No `/export-dataset` endpoint for full pipeline runs
**Known Issues:**
- Large documents (>10 pages worth of content) may timeout
- Complex CSS (animations, 3D transforms) may not render correctly
- Some Unicode characters may not display in PDFs
---
### Future Integration Plan
From `api/PIPELINE_INTEGRATION.md`:
#### **Stage 3: Handwriting & Visual Elements (Stages 07-11)**
**New Request Parameters:**
```python
class PromptParameters(BaseModel):
# ... existing ...
enable_handwriting: bool = False
handwriting_ratio: float = Field(default=0.2, ge=0.0, le=1.0)
enable_visual_elements: bool = False
visual_element_types: List[str] = ["stamp", "logo"]
```
**New Response Fields:**
```python
class DocumentResult(BaseModel):
# ... existing ...
handwriting_regions: Optional[List[HandwritingRegion]]
visual_elements: Optional[List[VisualElement]]
```
**Impact:**
- Longer processing time (diffusion model: ~5s per handwriting region)
- Larger response size (additional images)
---
#### **Stage 4: Image Finalization & OCR (Stages 12-15)**
**New Response Fields:**
```python
class DocumentResult(BaseModel):
# ... existing ...
image_base64: str # Final rendered image (PNG)
ocr_text: str # Full OCR text
ocr_confidence: float # Average OCR confidence
```
**Impact:**
- OCR API costs (~$1.50 per 1000 images)
- Additional 2-3 seconds per document (OCR latency)
---
#### **Stage 5: Dataset Packaging (Stages 16-19)**
**New Endpoint:**
```python
@app.post("/export-dataset")
async def export_dataset(request: ExportDatasetRequest):
"""
Run full pipeline (stages 01-19) and return packaged dataset.
"""
# Run pipeline with syndatadef
# Return ZIP with images, GTs, statistics
```
**Request:**
```python
class ExportDatasetRequest(BaseModel):
syndatadef_config: dict # Full SynDatasetDefinition
output_format: str = "huggingface" # "huggingface", "coco", "custom"
```
**Response:**
- ZIP archive with full dataset
- Includes `dataset_log.json` from stage 18
---
### Example Usage
#### **Python Client (api/example_usage.py)**
```python
import requests
import base64
from pathlib import Path
# API endpoint
API_URL = "http://localhost:8000/generate"
# Prepare request
request_data = {
"seed_images": [
"https://example.com/invoice_seed.jpg",
"https://example.com/receipt_seed.jpg"
],
"prompt_params": {
"language": "English",
"doc_type": "invoices and receipts",
"gt_type": "Multiple questions and their answers",
"gt_format": '{"question": "answer", ...}',
"num_solutions": 3
}
}
# Call API
response = requests.post(API_URL, json=request_data)
result = response.json()
# Process results
if result["success"]:
for doc in result["documents"]:
doc_id = doc["document_id"]
# Save PDF
pdf_data = base64.b64decode(doc["pdf_base64"])
Path(f"{doc_id}.pdf").write_bytes(pdf_data)
# Save GT
Path(f"{doc_id}_gt.json").write_text(
json.dumps(doc["ground_truth"], indent=2)
)
# Save HTML
Path(f"{doc_id}.html").write_text(doc["html"])
print(f"Saved: {doc_id}")
print(f" - BBoxes: {len(doc['bboxes'])}")
print(f" - GT Annotations: {len(doc['ground_truth'])}")
```
#### **cURL Example**
```bash
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"seed_images": [
"https://example.com/seed.jpg"
],
"prompt_params": {
"language": "English",
"doc_type": "business documents",
"gt_type": "Questions and answers",
"gt_format": "{\"question\": \"answer\"}",
"num_solutions": 1
}
}'
```
#### **JavaScript/TypeScript Client**
```typescript
const response = await fetch('http://localhost:8000/generate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
seed_images: ['https://example.com/seed.jpg'],
prompt_params: {
language: 'English',
doc_type: 'invoices',
gt_type: 'Questions and answers',
gt_format: '{"question": "answer"}',
num_solutions: 2
}
})
});
const result = await response.json();
// Decode PDF
const pdfBlob = new Blob(
[Uint8Array.from(atob(result.documents[0].pdf_base64), c => c.charCodeAt(0))],
{ type: 'application/pdf' }
);
// Download PDF
const url = URL.createObjectURL(pdfBlob);
const a = document.createElement('a');
a.href = url;
a.download = `${result.documents[0].document_id}.pdf`;
a.click();
```
---
### Testing
**Test File:** `api/test_api.py`
```python
import pytest
from fastapi.testclient import TestClient
from api.main import app
client = TestClient(app)
def test_health_check():
response = client.get("/health")
assert response.status_code == 200
assert response.json()["status"] == "healthy"
def test_generate_documents():
request_data = {
"seed_images": ["https://example.com/seed.jpg"],
"prompt_params": {
"language": "English",
"doc_type": "invoices",
"gt_type": "Questions",
"gt_format": "{\"q\": \"a\"}",
"num_solutions": 1
}
}
response = client.post("/generate", json=request_data)
assert response.status_code == 200
result = response.json()
assert result["success"] is True
assert len(result["documents"]) > 0
def test_invalid_request():
request_data = {
"seed_images": [], # Invalid: empty
"prompt_params": {"num_solutions": 10} # Invalid: > 5
}
response = client.post("/generate", json=request_data)
assert response.status_code == 422 # Validation error
```
**Run Tests:**
```bash
cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie
pytest api/test_api.py -v
```
---
### Deployment
**Start Server:**
```bash
cd /media/ahad-hassan/Volume_E/FYP/FYP/docgenie/api
chmod +x start.sh
./start.sh
```
**start.sh Contents:**
```bash
#!/bin/bash
export $(cat .env | xargs)
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
**Production Deployment:**
```bash
# With Gunicorn (multi-worker)
gunicorn main:app \
--workers 4 \
--worker-class uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:8000 \
--timeout 300
```
**Docker Deployment:**
```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Install Playwright browsers
RUN playwright install chromium
RUN playwright install-deps
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
---
## Core Models & Utilities
### SynDatasetDefinition Model
**File:** `docgenie/generation/models/_syndatadef.py`
**Purpose:** Configuration object for synthetic dataset generation, loaded from YAML files.
**Key Attributes:**
```python
@dataclass
class SynDatasetDefinition:
# Dataset metadata
name: str # "docvqa_alpha=1.0"
task: TaskType # TaskType.QA, .KIE, .DLA, .CLS
base_dataset: str # "docvqa", "cord", etc.
# LLM configuration
llm_model: str # "claude-sonnet-4-20250514"
prompt_template: str # "DocGenie"
num_solutions: int # Documents per prompt
# Prompting parameters
language: str # "English", "German", etc.
doc_type: str # "business and administrative"
gt_type: str # Task-specific GT description
gt_format: str # Expected GT format
# Dataset parameters
num_samples: int # Total documents to generate
alpha: float # Clustering diversity parameter
# Seed selection
seeds_per_cluster: int # Seeds sampled per cluster
clustering_method: str # "kmeans", "agglomerative"
# Task-specific
valid_labels: Optional[List[str]] # For DLA: ["title", "text", ...]
# Output paths
output_dir: Path # Root output directory
```
**Key Methods:**
```python
def get_file_structure(self) -> FileStructure:
"""Returns FileStructure manager for output directories."""
def build_prompt(self, seed_images: List[str]) -> str:
"""Builds prompt from template with parameter substitution."""
def iter_document_logs(self) -> Iterator[DocumentLog]:
"""Iterates over all document logs."""
def update_document_status(self, doc_id: str, status: Status):
"""Updates document status in log."""
@classmethod
def from_yaml(cls, yaml_path: Path) -> "SynDatasetDefinition":
"""Load configuration from YAML file."""
```
**Example YAML:**
```yaml
name: "docvqa_alpha=1.0"
task: "qa"
base_dataset: "docvqa"
llm_model: "claude-sonnet-4-20250514"
prompt_template: "DocGenie"
num_solutions: 1
language: "English"
doc_type: "business and administrative documents"
gt_type: "Multiple questions and their answers"
gt_format: '{"question": "answer", ...}'
num_samples: 1000
alpha: 1.0
seeds_per_cluster: 10
clustering_method: "kmeans"
output_dir: "data/datasets/synthesized_datasets/docvqa_alpha=1.0"
```
---
### PipelineParameters Model
**File:** `docgenie/generation/models/_pipeline.py`
**Purpose:** Runtime parameters for pipeline execution.
**Attributes:**
```python
@dataclass
class PipelineParameters:
# Execution control
start_stage: int = 1 # First stage to execute
end_stage: int = 19 # Last stage to execute
skip_existing: bool = True # Skip documents with existing outputs
# Parallelization
max_workers: int = 10 # Concurrent processing
chromium_concurrency: int = 10 # Parallel PDF renders
# Debug mode
debug_mode: bool = False # Enable debug visualizations
# Retry configuration
max_retries: int = 3 # Retry attempts
retry_delay: float = 2.0 # Seconds between retries
# Timeouts
pdf_render_timeout: int = 60 # Seconds
ocr_timeout: int = 30 # Seconds
```
---
### FileStructure Model
**File:** `docgenie/generation/models/_file.py`
**Purpose:** Manages directory structure for generated data.
**Key Properties:**
```python
@dataclass
class FileStructure:
root: Path # Root output directory
# Directory properties (all return Path)
seeds_directory: Path
prompt_batches_directory: Path
message_results_directory: Path
raw_html_directory: Path
raw_annotations_directory: Path
geometries_directory: Path
pdf_initial_directory: Path
render_html_directory: Path
pdf_word_bboxes_directory: Path
pdf_char_bboxes_directory: Path
layout_element_definitions_directory: Path
handwriting_definitions_directory: Path
visual_element_definitions_directory: Path
handwriting_images_directory: Path
visual_element_images_directory: Path
pdf_without_handwriting_placeholder_directory: Path
pdf_with_handwriting_directory: Path
pdf_final_directory: Path
images_directory: Path
final_word_bboxes_directory: Path
final_segment_bboxes_directory: Path
normalized_word_bboxes_directory: Path
normalized_segment_bboxes_directory: Path
verified_gt_directory: Path
# Debug subdirectories
debug_directory: Path
debug_pdf_bboxes_directory: Path
debug_visual_element_bboxes_directory: Path
debug_handwriting_insertion_directory: Path
debug_final_bboxes_on_images_directory: Path
debug_html_with_debug_directory: Path
```
**Key Methods:**
```python
def create_all_directories(self):
"""Create all output directories."""
def get_document_path(self, doc_id: str, stage: str) -> Path:
"""Get path for specific document and stage."""
```
---
### DocumentLog Model
**File:** `docgenie/generation/models/_log.py`
**Purpose:** Document-level metadata and status tracking.
**Attributes:**
```python
@dataclass
class DocumentLog:
doc_id: str # Unique document ID
seed_image_id: str # Source seed image
prompt_call_id: str # Prompt batch/call ID
status: Status # VALID, INVALID, PROCESSING
# Stage completion flags
has_raw_html: bool = False
has_raw_gt: bool = False
has_pdf_initial: bool = False
has_geometries: bool = False
has_bboxes: bool = False
has_handwriting: bool = False
has_visual_elements: bool = False
has_final_image: bool = False
has_ocr_result: bool = False
has_verified_gt: bool = False
# Statistics
num_words: int = 0
num_annotations: int = 0
num_handwriting_regions: int = 0
num_visual_elements: int = 0
# Error tracking
error_stage: Optional[str] = None
error_message: Optional[str] = None
error_category: Optional[str] = None
# Timestamps
created_at: datetime
updated_at: datetime
```
**Key Methods:**
```python
def mark_stage_complete(self, stage: str):
"""Mark pipeline stage as complete."""
def mark_error(self, stage: str, error_msg: str, category: str):
"""Record error and mark document as invalid."""
def is_valid(self) -> bool:
"""Check if document passed all stages."""
```
---
### BBox Model
**File:** `docgenie/generation/models/_bbox.py`
**Purpose:** Bounding box representation with text content.
**Attributes:**
```python
@dataclass
class BBox:
rect: Rect # {x0, y0, x1, y1}
text: str # Text content
metadata: Optional[dict] = None # Additional data
```
**Key Methods:**
```python
@property
def width(self) -> float:
return self.rect["x1"] - self.rect["x0"]
@property
def height(self) -> float:
return self.rect["y1"] - self.rect["y0"]
def normalize(self, image_width: float, image_height: float) -> "BBox":
"""Convert to normalized [0, 1] coordinates."""
def unnormalize(self, image_width: float, image_height: float) -> "BBox":
"""Convert from normalized to pixel coordinates."""
def to_dict(self) -> dict:
"""Serialize to dictionary."""
@classmethod
def from_dict(cls, data: dict) -> "BBox":
"""Deserialize from dictionary."""
```
---
### LayoutBBox Model
**File:** `docgenie/generation/models/_bbox.py`
**Purpose:** Layout element bounding box (for DLA tasks).
**Attributes:**
```python
@dataclass
class LayoutBBox:
label: str # "title", "text", "table", etc.
rect: Rect # {x0, y0, x1, y1}
metadata: Optional[dict] = None
```
**Key Methods:**
```python
def normalize(self, image_width: float, image_height: float) -> "LayoutBBox":
"""Normalize coordinates."""
def contains(self, other: "LayoutBBox") -> bool:
"""Check if this bbox fully contains another."""
def overlaps(self, other: "LayoutBBox") -> bool:
"""Check if this bbox overlaps with another."""
def overlap_area(self, other: "LayoutBBox") -> float:
"""Calculate overlap area with another bbox."""
```
---
### Utility Modules
#### **BBox Utilities (utils/bboxes.py)**
```python
def load_bboxes_from_file(file_path: Path) -> List[BBox]:
"""Load bboxes from JSON file."""
def save_bboxes_to_file(bboxes: List[BBox], file_path: Path):
"""Save bboxes to JSON file."""
def visualize_bboxes_on_pdf(
pdf_path: Path,
bboxes: List[BBox],
output_path: Path,
color: str = "red"
):
"""Draw bbox overlays on PDF."""
def visualize_bboxes_on_image(
image_path: Path,
bboxes: List[BBox],
output_path: Path,
color: str = "orange"
):
"""Draw bbox overlays on image."""
def check_bbox_containment(bbox1: BBox, bbox2: BBox) -> bool:
"""Check if bbox1 contains bbox2."""
```
---
#### **Geometry Utilities (utils/geos.py)**
```python
def filter_layout_elements(geometries: dict) -> List[dict]:
"""Extract elements with data-label attribute."""
def filter_handwriting_elements(geometries: dict) -> List[dict]:
"""Extract elements with data-handwriting attribute."""
def filter_visual_elements(geometries: dict) -> List[dict]:
"""Extract elements with data-visual-element attribute."""
def filter_by_css_class(geometries: dict, class_name: str) -> List[dict]:
"""Extract elements with specific CSS class."""
```
---
#### **Serialization Utilities (utils/serialization.py)**
```python
def encode_image_to_base64(image_path: Path) -> str:
"""Encode image file to base64 string."""
def decode_base64_to_image(base64_str: str, output_path: Path):
"""Decode base64 string and save as image."""
def serialize_dataclass(obj: Any) -> dict:
"""Serialize dataclass to dictionary."""
def deserialize_dataclass(data: dict, cls: Type[T]) -> T:
"""Deserialize dictionary to dataclass."""
```
---
## Configuration & Constants
### Key Constants (generation/constants.py)
```python
# Document processing
HTML_PARSER = "html.parser" # BeautifulSoup parser
PDF_POINT_SCALING = 72 / 96 # CSS DPI to PDF DPI
# Bounding boxes
BBOX_OVERLAP_THRESHOLD = 0.05 # 5% overlap tolerance
SPATIAL_MATCH_THRESHOLD = 10.0 # Pixel tolerance for matching
# PDF rendering
CHROMIUM_CONCURRENCY = 10 # Parallel renders
PER_PDF_RENDER_TIMEOUT = 60 # Seconds
PER_PDF_RENDER_MAX_RETRIES = 3 # Retry attempts
# Handwriting
MAX_HANDWRITING_CHARS = 7 # Max chars per diffusion generation
HANDWRITING_HEIGHT_PX = 40 # Image height
HANDWRITING_PADDING_PX = 0 # Horizontal padding
DIFFUSION_NUM_INFERENCE_STEPS = 50 # Generation quality
HANDWRITING_IMAGE_UPSCALE_FACTOR = 3 # Insertion scaling
MAX_HANDWRITING_RAND_X = 2 # Random X offset (pixels)
MAX_HANDWRITING_RAND_Y = 1 # Random Y offset (pixels)
# Visual elements
VISUAL_ELEMENT_UPSCALE_FACTOR = 3 # Insertion scaling
STAMP_BORDER_WIDTH = 2 # Stamp border thickness
BARCODE_DPI = 300 # Barcode image quality
# OCR
IMAGE_DPI = 200 # Final image DPI
OCR_CONFIDENCE_THRESHOLD = 0.8 # Min confidence
# Ground truth
FUZZY_MATCH_THRESHOLD = 0.85 # Levenshtein similarity cutoff
# Handwriting styles
HANDWRITING_STYLES = [
"writer_0", "writer_1", "writer_2", ..., "writer_99"
]
# Visual element types
VISUAL_ELEMENT_TYPES = [
"stamp", "logo", "barcode", "chart", "photo"
]
# Visual element type mapping (LLM output β standard)
VISUAL_ELEMENT_TYPE_MAPPING = {
"stamp": "stamp",
"company_stamp": "stamp",
"approval_stamp": "stamp",
"logo": "logo",
"company_logo": "logo",
"brand_logo": "logo",
"barcode": "barcode",
"code128": "barcode",
"chart": "chart",
"graph": "chart",
"figure": "chart",
"photo": "photo",
"image": "photo",
"picture": "photo"
}
```
---
### Environment Variables
**Required:**
```bash
ANTHROPIC_API_KEY=sk-ant-... # Claude API key
MICROSOFT_AZURE_OCR_KEY=... # Azure OCR key
MICROSOFT_AZURE_OCR_ENDPOINT=... # Azure OCR endpoint
```
**Optional:**
```bash
LLM_MODEL=claude-sonnet-4-20250514 # Override model
DEBUG_MODE=false # Enable debug output
LOG_LEVEL=INFO # Logging verbosity
CHROMIUM_CONCURRENCY=10 # Override concurrency
```
---
## Error Handling & Debugging
### Error Categories
**Comprehensive error tracking in stage 18:**
| Category | Description | Resolution |
|----------|-------------|------------|
| `multipage_pdf` | PDF rendered with >1 page | Check HTML content size, CSS page breaks |
| `missing_ocr_result` | OCR API failed or empty | Check Azure credentials, retry |
| `failed_gt_verification` | GT text not found in OCR | Review fuzzy match threshold, inspect HTML |
| `rendering_timeout` | PDF render exceeded timeout | Increase timeout, simplify HTML |
| `llm_parsing_error` | Failed to extract HTML/GT | Review LLM response format, update regex |
| `bbox_extraction_failed` | PyMuPDF extraction error | Check PDF validity, inspect fonts |
| `handwriting_generation_failed` | Diffusion model error | Check model checkpoint, GPU availability |
| `visual_element_generation_failed` | VE creation error | Check prefab directories, image validity |
---
### Debug Visualizations (Stage 19)
**Generated debug outputs:**
1. **PDF BBoxes:** Verify PyMuPDF extraction quality
2. **Visual Element BBoxes:** Verify VE extraction and positioning
3. **Handwriting Insertion:** Verify handwriting placement
4. **Final BBoxes on Images:** Verify OCR accuracy
**Enable debug mode:**
```python
# In pipeline execution
pipeline_params = PipelineParameters(
debug_mode=True,
# ... other params
)
```
---
### Logging
**Log locations:**
```
data/datasets/synthesized_datasets/<dataset_name>/logs/
βββ pipeline_01/
β βββ seed_selection.log
βββ pipeline_02/
β βββ prompting.log
βββ pipeline_03/
β βββ message_processing/
β βββ batch_001.log
β βββ batch_002.log
βββ ...
βββ pipeline_19/
βββ debug_generation.log
```
**Log format:**
```
[2026-02-07 14:32:15] [INFO] [pipeline_04] Starting PDF rendering for doc_0001
[2026-02-07 14:32:18] [INFO] [pipeline_04] Successfully rendered doc_0001 (3.2s)
[2026-02-07 14:32:18] [ERROR] [pipeline_04] Rendering failed for doc_0002: Timeout
```
---
### Common Issues & Solutions
**Issue: Multi-page PDFs**
- **Cause:** HTML content exceeds page size
- **Solution:** Reduce content, check for long tables, use CSS `overflow: hidden`
**Issue: Handwriting not appearing**
- **Cause:** Character bboxes not available (stage 05)
- **Solution:** Use simpler fonts in HTML, ensure PyMuPDF can extract chars
**Issue: OCR missing text**
- **Cause:** Low image DPI, poor contrast
- **Solution:** Increase `IMAGE_DPI` (stage 14), adjust HTML styling
**Issue: GT verification failures**
- **Cause:** OCR discrepancies, fuzzy match threshold too high
- **Solution:** Lower `FUZZY_MATCH_THRESHOLD`, improve HTML text rendering
**Issue: Claude API timeout**
- **Cause:** Large seed images, complex prompts
- **Solution:** Compress seed images, simplify prompt template
---
## Usage Examples
### Full Pipeline Execution
```python
from docgenie.generation import pipeline_01_select_seeds
from docgenie.generation.models import SynDatasetDefinition, PipelineParameters
# Load configuration
syndatadef = SynDatasetDefinition.from_yaml(
"data/syn_dataset_definitions/docvqa_alpha=1.0.yaml"
)
# Configure pipeline
params = PipelineParameters(
start_stage=1,
end_stage=19,
skip_existing=True,
debug_mode=False,
max_workers=10
)
# Execute pipeline stages
for stage in range(params.start_stage, params.end_stage + 1):
print(f"Executing stage {stage:02d}...")
# Import and run stage module
stage_module = importlib.import_module(
f"docgenie.generation.pipeline_{stage:02d}_*"
)
stage_module.main(syndatadef, params)
print(f"Stage {stage:02d} complete.")
print("Pipeline execution complete!")
```
---
### API Usage
**See API Implementation section for detailed examples.**
---
### Custom Dataset Definition
```yaml
# data/syn_dataset_definitions/custom_invoices.yaml
name: "custom_invoices_v1"
task: "kie"
base_dataset: "cord"
llm_model: "claude-sonnet-4-20250514"
prompt_template: "ClaudeRefined12"
num_solutions: 1
language: "English"
doc_type: "invoices and receipts"
gt_type: "Key-value pairs (entity extraction)"
gt_format: '{"entity_name": "entity_value", ...}'
num_samples: 500
alpha: 0.75
seeds_per_cluster: 5
clustering_method: "kmeans"
# KIE-specific
valid_labels: null # Not used for KIE
output_dir: "data/datasets/synthesized_datasets/custom_invoices_v1"
```
---
## Conclusion
This documentation provides a comprehensive overview of the DocGenie generation pipeline and API. For additional support or questions, refer to:
- **Pipeline Integration Guide:** `api/PIPELINE_INTEGRATION.md`
- **API README:** `api/README.md`
- **Source Code:** `docgenie/generation/` and `api/`
**Key Takeaways:**
1. **19-Stage Pipeline:** Modular design from seed selection to GT verification
2. **Multi-Task Support:** QA, KIE, DLA, CLS with task-specific handling
3. **Realistic Documents:** LLM-generated content, diffusion handwriting, visual elements
4. **Quality Assurance:** Comprehensive validation, OCR verification, error tracking
5. **API Integration:** FastAPI service for synchronous document generation (stages 01-06)
6. **Extensibility:** Modular code, clear interfaces, easy to extend
**Pipeline Strengths:**
- Task-agnostic core with task-specific adapters
- Extensive logging and error tracking
- Parallel processing where applicable
- Debug visualizations at every stage
- Integration with state-of-the-art models
**Future Directions:**
- Full API integration (stages 07-19)
- Additional document types (forms, legal, medical)
- Multi-language support expansion
- Enhanced visual element generation (charts, diagrams)
- Real-time generation optimization
|