---
license: apache-2.0
base_model: google/gemma-4-E4B-it
tags:
- sft
- trl
- peft
- qlora
- kyc
- document-extraction
- document-classification
- aadhaar
- pan-card
- passport
- visa
- election-card
- gemma4
- vision-language-model
- vllm
datasets:
- Jwalit/kyc-document-extraction-vlm
pipeline_tag: image-text-to-text
library_name: transformers
---
# Gemma 4 E4B – KYC Document Extractor & Classifier

**Production-ready Vision-Language Model for Indian KYC Document Extraction and Classification**

Fine-tuned from [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it) using QLoRA SFT on a synthetic KYC document dataset covering 5 Indian identity document types.
## Capabilities

| Task | Description |
|------|-------------|
| **Document Classification** | Classify the document as Aadhaar Card, PAN Card, Passport, Visa, or Election Card (Voter ID) |
| **Field Extraction** | Extract all structured fields (name, DOB, ID number, address, etc.) as JSON |
| **Combined** | Classify and extract in a single pass |

## Supported Document Types

| Document | Fields Extracted |
|----------|-----------------|
| **Aadhaar Card** | full_name, date_of_birth, gender, father_name, aadhaar_number, address, VID |
| **PAN Card** | full_name, father_name, date_of_birth, pan_number |
| **Passport** | surname, given_name, nationality, gender, date_of_birth, passport_number, place_of_birth, date_of_issue, date_of_expiry, place_of_issue |
| **Visa** | issuing_country, visa_type, visa_category, visa_number, full_name, nationality, gender, date_of_birth, passport_number, date_of_issue, date_of_expiry, entries |
| **Election Card** | voter_id, full_name, relative_name, gender, date_of_birth, age, state, constituency, address |
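The field lists above can double as a lightweight schema check on the model's output. A minimal sketch, assuming JSON output with a `document_type` key as shown later in this card (the `EXPECTED_FIELDS` mapping and `missing_fields` helper are illustrative, not part of the model's API):

```python
# Expected fields per document type, taken from the table above
# (abbreviated to three document types for brevity).
EXPECTED_FIELDS = {
    "aadhaar_card": ["full_name", "date_of_birth", "gender", "father_name",
                     "aadhaar_number", "address", "vid"],
    "pan_card": ["full_name", "father_name", "date_of_birth", "pan_number"],
    "passport": ["surname", "given_name", "nationality", "gender", "date_of_birth",
                 "passport_number", "place_of_birth", "date_of_issue",
                 "date_of_expiry", "place_of_issue"],
}

def missing_fields(extracted: dict) -> list[str]:
    """Return the expected fields absent from an extraction result."""
    doc_type = extracted.get("document_type", "")
    expected = EXPECTED_FIELDS.get(doc_type, [])
    return [f for f in expected if f not in extracted]

# Example: a PAN extraction that is missing its ID number
result = {"document_type": "pan_card", "full_name": "A. Sharma",
          "father_name": "B. Sharma", "date_of_birth": "01/01/1990"}
print(missing_fields(result))  # -> ['pan_number']
```

A check like this makes it easy to route incomplete extractions to a retry or a manual-review queue.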
## Quick Start

### With Transformers

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

model_id = "Jwalit/gemma4-e4b-kyc-document-extractor"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

image = Image.open("document.jpg").convert("RGB")
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are an expert KYC document analyst. Always respond with accurate, structured JSON output."}]},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Classify this document and extract all information as structured JSON."}
    ]}
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    # do_sample=True so that the temperature setting takes effect
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.1)

result = processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(result)
```
### With vLLM (Production Deployment)

```bash
# Start an OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model Jwalit/gemma4-e4b-kyc-document-extractor \
  --trust-remote-code \
  --max-model-len 4096 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.9
```

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

with open("document.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Jwalit/gemma4-e4b-kyc-document-extractor",
    messages=[
        {"role": "system", "content": "You are an expert KYC document analyst. Always respond with accurate, structured JSON output."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "Classify and extract all fields from this KYC document as JSON."}
        ]}
    ],
    max_tokens=1024,
    temperature=0.1,
)
print(response.choices[0].message.content)
```
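Even at a low temperature, chat models sometimes wrap their JSON in Markdown fences or add surrounding prose, so it is worth parsing the response defensively before relying on it. A minimal sketch (the `extract_json` helper is illustrative, not part of this model's tooling):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Parse a JSON object from model output, tolerating Markdown fences or extra prose."""
    # Prefer a ```json ... ``` fenced block if one is present
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # Otherwise fall back to the first {...} span in the text
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError("no JSON object found in model output")

raw = '```json\n{"document_type": "pan_card", "pan_number": "ABCDE1234F"}\n```'
print(extract_json(raw)["pan_number"])  # -> ABCDE1234F
```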
### With vLLM Offline (Batch Processing)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Jwalit/gemma4-e4b-kyc-document-extractor",
    trust_remote_code=True,
    max_model_len=4096,
    dtype="bfloat16",
)
sampling_params = SamplingParams(temperature=0.1, max_tokens=1024)

# Use llm.chat() with image messages for batch processing
```
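The batch call hinted at above takes OpenAI-style message lists, one per document. A sketch of assembling such a batch (file paths and prompt wording are illustrative; `llm` and `sampling_params` come from the block above):

```python
import base64

def to_messages(image_path: str) -> list[dict]:
    """Build one OpenAI-style message list for a single document image."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    return [
        {"role": "system", "content": "You are an expert KYC document analyst. Always respond with accurate, structured JSON output."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "Classify and extract all fields from this KYC document as JSON."},
        ]},
    ]

# batch = [to_messages(p) for p in ["doc1.jpg", "doc2.jpg"]]
# outputs = llm.chat(batch, sampling_params)  # one generation per document
```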
## Training Details

### Method

- **Base Model**: `google/gemma-4-E4B-it` (~8B params, Gemma4ForConditionalGeneration)
- **Fine-tuning**: QLoRA SFT (4-bit NF4 quantization + LoRA rank 16 on the text decoder)
- **Vision Encoder**: Frozen SigLIP (280 tokens per image, 768-dim, 16 layers)
- **Framework**: TRL SFTTrainer + PEFT + BitsAndBytes

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Batch Size | 2 × 8 (gradient accumulation) = 16 effective |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Optimizer | AdamW (fused) |
| LR Scheduler | Cosine with 5% warmup |
| Precision | bf16 |
| Gradient Checkpointing | Enabled |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |

### Dataset

- **Dataset**: [`Jwalit/kyc-document-extraction-vlm`](https://huggingface.co/datasets/Jwalit/kyc-document-extraction-vlm)
- **Size**: 2,704 train / 296 eval samples
- **Document Types**: 5 (Aadhaar, PAN, Passport, Visa, Election Card)
- **Task Types**: Classification, Extraction, Combined (balanced across all three)
- **Format**: Conversational VLM (messages with `{"type": "image"}` + `{"type": "text"}`)
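For reference, a single sample in this conversational format looks roughly like the following (the prompt and answer wording are illustrative; only the message structure, with an image part followed by a text part, is prescribed by the dataset format):

```python
# Sketch of one conversational VLM training sample; the {"type": "image"}
# slot is paired with the actual PIL image at training time.
sample = {
    "messages": [
        {"role": "system", "content": [
            {"type": "text", "text": "You are an expert KYC document analyst."}
        ]},
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Classify this document and extract all fields as JSON."}
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": '{"document_type": "pan_card", "pan_number": "ABCDE1234F"}'}
        ]},
    ]
}

content_types = [part["type"] for part in sample["messages"][1]["content"]]
print(content_types)  # -> ['image', 'text']
```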
### Architecture

```
Gemma4ForConditionalGeneration
├── Vision Encoder (SigLIP, FROZEN)
│   ├── 16 layers, 768-dim, 12 attention heads
│   ├── Patch size: 16, Pooling kernel: 3
│   └── Output: 280 soft tokens per image
├── Text Decoder (LoRA applied here)
│   ├── 42 layers (36 sliding + 6 full attention)
│   ├── 2560 hidden, 8 heads, GQA
│   ├── 262K vocab, 131K context
│   └── LoRA on: q/k/v/o_proj + gate/up/down_proj
└── Audio Encoder (unused, frozen)
```
## Reproduce Training

```bash
# Install dependencies
pip install torch transformers trl datasets peft accelerate bitsandbytes trackio flash-attn pillow

# Run training (requires a GPU with >=24GB VRAM; recommended: A100 80GB)
python train_kyc_vlm.py
```

Or via the TRL CLI:

```bash
trl sft \
  --model_name_or_path google/gemma-4-E4B-it \
  --dataset_name Jwalit/kyc-document-extraction-vlm \
  --output_dir ./gemma4-kyc-extractor \
  --learning_rate 2e-4 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --bf16 \
  --gradient_checkpointing \
  --push_to_hub \
  --hub_model_id Jwalit/gemma4-e4b-kyc-document-extractor
```
## Performance & Deployment Notes

- **vLLM compatible**: Native support via the `Gemma4ForConditionalGeneration` architecture
- **280 image tokens**: Efficient; processes a document image in ~280 tokens (vs. 1024+ for many other VLMs)
- **128K context**: Can handle multiple document pages in a single request
- **QLoRA deployment**: Merge the adapters for full-speed inference, or serve with PEFT for memory efficiency

### Merging Adapters (recommended before vLLM serving in production)

```python
import torch
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "Jwalit/gemma4-e4b-kyc-document-extractor",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-kyc-extractor")
# Then push the merged model for faster vLLM serving
```
## Expected Output Format

```json
{
  "document_type": "aadhaar_card",
  "full_name": "Rajesh Kumar Singh",
  "date_of_birth": "15/03/1985",
  "gender": "Male",
  "father_name": "Suresh Kumar Singh",
  "aadhaar_number": "1234 5678 9012",
  "address": "123, MG Road, Mumbai, Maharashtra - 400001",
  "vid": "1234 5678 9012 3456"
}
```
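Downstream KYC pipelines usually validate extracted values before accepting them. A minimal sketch checking the number formats shown above (the patterns follow the example output, not an official specification; real Aadhaar validation would also verify the Verhoeff checksum):

```python
import re

# Format checks derived from the example output above.
AADHAAR_RE = re.compile(r"\d{4} \d{4} \d{4}")       # 12 digits in groups of 4
PAN_RE = re.compile(r"[A-Z]{5}\d{4}[A-Z]")          # 5 letters, 4 digits, 1 letter

def looks_valid(extracted: dict) -> bool:
    """Cheap format sanity check on an extraction result."""
    if extracted.get("document_type") == "aadhaar_card":
        return bool(AADHAAR_RE.fullmatch(extracted.get("aadhaar_number", "")))
    if extracted.get("document_type") == "pan_card":
        return bool(PAN_RE.fullmatch(extracted.get("pan_number", "")))
    return True  # no format check for the other document types

print(looks_valid({"document_type": "aadhaar_card", "aadhaar_number": "1234 5678 9012"}))  # -> True
print(looks_valid({"document_type": "pan_card", "pan_number": "abc"}))  # -> False
```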
## Limitations

- Trained on **synthetic** KYC documents; accuracy on real-world documents will improve with fine-tuning on real (anonymized) KYC samples
- Best results when further fine-tuned with 200-500 real document images per type
- The vision encoder is frozen, so the model cannot learn visual features beyond the base SigLIP capabilities
- Indian documents only (Aadhaar, PAN, Passport, Visa, Election Card)

## License

Apache 2.0 (same as the base model, `google/gemma-4-E4B-it`)