"""
Generate a synthetic KYC document dataset for training a VLM on
document extraction and classification tasks.

Produces document images for: Aadhar, PAN, Passport, Visa, Election Card
with corresponding extraction ground truth in JSON format.

Usage:
    pip install datasets Pillow faker huggingface_hub
    python generate_kyc_dataset.py

Output:
    Pushes to HuggingFace Hub as Jwalit/kyc-document-extraction-vlm
"""

# See full script at: https://huggingface.co/datasets/Jwalit/kyc-document-extraction-vlm
# The dataset has already been generated and pushed.
# Re-run this script only if you want to regenerate with different parameters.

print("Dataset already generated at: https://huggingface.co/datasets/Jwalit/kyc-document-extraction-vlm")
print("To regenerate, uncomment the code below and run.")