"""
Generate a synthetic KYC document dataset for training a VLM on document
extraction and classification tasks.

Produces document images for Aadhaar, PAN, Passport, Visa, and Election Card,
each with corresponding extraction ground truth in JSON format.

Usage:
    pip install datasets Pillow faker huggingface_hub
    python generate_kyc_dataset.py

Output: Pushes to the Hugging Face Hub as Jwalit/kyc-document-extraction-vlm
"""
# See full script at: https://huggingface.co/datasets/Jwalit/kyc-document-extraction-vlm
# The dataset has already been generated and pushed.
# Re-run this script only if you want to regenerate with different parameters.
print("Dataset already generated at: https://huggingface.co/datasets/Jwalit/kyc-document-extraction-vlm")
print("To regenerate, use the full script from the dataset page linked above.")
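
# The full generation code is not included in this file. As a rough
# illustration only, the sketch below shows how one ground-truth JSON record
# might be built for two of the document types named in the docstring.
# Assumptions: the field names ("document_type", "fields", "name", "id_number"),
# the helper names, and the placeholder name list are hypothetical and may not
# match the real dataset schema. It uses only the standard library in place of
# faker, and it skips image rendering and the Hub upload entirely.

```python
import json
import random
import string

# Hypothetical placeholder names; the real script presumably draws these from faker.
PLACEHOLDER_NAMES = ["Asha Rao", "Ravi Kumar", "Meera Nair"]


def fake_pan(rng: random.Random) -> str:
    """Return a PAN-shaped string: 5 letters, 4 digits, 1 letter."""
    letters = "".join(rng.choices(string.ascii_uppercase, k=5))
    digits = "".join(rng.choices(string.digits, k=4))
    return letters + digits + rng.choice(string.ascii_uppercase)


def fake_aadhaar(rng: random.Random) -> str:
    """Return an Aadhaar-shaped string: 12 digits in groups of four.

    No checksum is computed, so this is shape-only synthetic data.
    """
    digits = "".join(rng.choices(string.digits, k=12))
    return " ".join(digits[i:i + 4] for i in range(0, 12, 4))


# Only two document types are covered in this sketch.
ID_GENERATORS = {"pan": fake_pan, "aadhaar": fake_aadhaar}


def make_record(doc_type: str, rng: random.Random) -> dict:
    """Build one hypothetical ground-truth record for a synthetic document."""
    if doc_type not in ID_GENERATORS:
        raise ValueError(f"unsupported document type in this sketch: {doc_type}")
    return {
        "document_type": doc_type,
        "fields": {
            "name": rng.choice(PLACEHOLDER_NAMES),
            "id_number": ID_GENERATORS[doc_type](rng),
        },
    }


if __name__ == "__main__":
    # A seeded generator keeps the synthetic output reproducible across runs.
    rng = random.Random(42)
    print(json.dumps(make_record("pan", rng), indent=2))
```

# Passing a seeded random.Random into every helper, rather than using the
# module-level functions, is one way to make regeneration deterministic.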