Commit bf014b7 by Jwalit · verified · 1 Parent(s): 15dad3d

Add dataset generation script reference

Files changed (1)
  1. generate_kyc_dataset.py +20 -0
generate_kyc_dataset.py ADDED
@@ -0,0 +1,20 @@
+ """
+ Generate a synthetic KYC document dataset for training a VLM on document
+ extraction and classification tasks.
+
+ Produces document images for: Aadhar, PAN, Passport, Visa, Election Card
+ with corresponding extraction ground truth in JSON format.
+
+ Usage:
+ pip install datasets Pillow faker huggingface_hub
+ python generate_kyc_dataset.py
+
+ Output: Pushes to HuggingFace Hub as Jwalit/kyc-document-extraction-vlm
+ """
+
+ # See full script at: https://huggingface.co/datasets/Jwalit/kyc-document-extraction-vlm
+ # The dataset has already been generated and pushed.
+ # Re-run this script only if you want to regenerate with different parameters.
+
+ print("Dataset already generated at: https://huggingface.co/datasets/Jwalit/kyc-document-extraction-vlm")
+ print("To regenerate, uncomment the code below and run.")
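
The committed file contains only the docstring and two print statements, so the generation logic itself is not visible in this diff. As a rough illustration of the "extraction ground truth in JSON format" described in the docstring, the sketch below builds one synthetic ground-truth record using only the standard library. It is not the author's script: the function names (`fake_pan_number`, `make_record`), the label strings in `DOC_TYPES`, the JSON field names, and the sample name list are all hypothetical, and image rendering with Pillow and the push to the Hub are deliberately left out.

```python
import json
import random
import string

# Document types listed in the docstring; the exact label strings are assumed.
DOC_TYPES = ["aadhar", "pan", "passport", "visa", "election_card"]

# Tiny placeholder name pool (hypothetical; the real script installs `faker`).
SAMPLE_NAMES = ["Asha Patel", "Rohan Mehta", "Neha Shah", "Vikram Rao"]


def fake_pan_number(rng):
    """Return a PAN-shaped string: 5 letters, 4 digits, 1 letter.

    Purely synthetic; no checksum or category rules are applied.
    """
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(5))
    digits = "".join(rng.choice(string.digits) for _ in range(4))
    return letters + digits + rng.choice(string.ascii_uppercase)


def make_record(doc_type, rng):
    """Build one ground-truth record and return it as a JSON string."""
    if doc_type not in DOC_TYPES:
        raise ValueError(f"unknown document type: {doc_type}")

    # Days are capped at 28 so every generated date is valid in any month.
    dob = f"{rng.randint(1, 28):02d}/{rng.randint(1, 12):02d}/{rng.randint(1950, 2005)}"
    fields = {"name": rng.choice(SAMPLE_NAMES), "date_of_birth": dob}
    if doc_type == "pan":
        fields["pan_number"] = fake_pan_number(rng)

    # Assumed schema: a classification label plus the fields to extract.
    return json.dumps({"doc_type": doc_type, "fields": fields}, sort_keys=True)


if __name__ == "__main__":
    print(make_record("pan", random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the records reproducible, which makes it easier to regenerate the same dataset split after changing parameters.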