---
license: apache-2.0
base_model: google/gemma-4-E4B-it
tags:
- sft
- trl
- peft
- qlora
- kyc
- document-extraction
- document-classification
- aadhaar
- pan-card
- passport
- visa
- election-card
- gemma4
- vision-language-model
- vllm
datasets:
- Jwalit/kyc-document-extraction-vlm
pipeline_tag: image-text-to-text
library_name: transformers
---
# Gemma 4 E4B – KYC Document Extractor & Classifier

**Production-ready Vision-Language Model for Indian KYC Document Extraction and Classification**

Fine-tuned from [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it) using QLoRA SFT on a synthetic KYC document dataset covering 5 Indian identity document types.
## Capabilities

| Task | Description |
|------|-------------|
| **Document Classification** | Classify the document as Aadhaar Card, PAN Card, Passport, Visa, or Election Card (Voter ID) |
| **Field Extraction** | Extract all structured fields (name, DOB, ID number, address, etc.) as JSON |
| **Combined** | Classify and extract in a single pass |

## Supported Document Types

| Document | Fields Extracted |
|----------|-----------------|
| **Aadhaar Card** | full_name, date_of_birth, gender, father_name, aadhaar_number, address, VID |
| **PAN Card** | full_name, father_name, date_of_birth, pan_number |
| **Passport** | surname, given_name, nationality, gender, date_of_birth, passport_number, place_of_birth, date_of_issue, date_of_expiry, place_of_issue |
| **Visa** | issuing_country, visa_type, visa_category, visa_number, full_name, nationality, gender, date_of_birth, passport_number, date_of_issue, date_of_expiry, entries |
| **Election Card** | voter_id, full_name, relative_name, gender, date_of_birth, age, state, constituency, address |
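The field lists above can double as a lightweight schema check on the model's output. A minimal sketch, assuming JSON output with a `document_type` key as shown later in this card (the `EXPECTED_FIELDS` mapping and `missing_fields` helper are illustrative, not part of the model's API):

```python
# Expected fields per document type, taken from the table above
# (abbreviated to three document types for brevity).
EXPECTED_FIELDS = {
    "aadhaar_card": ["full_name", "date_of_birth", "gender", "father_name",
                     "aadhaar_number", "address", "vid"],
    "pan_card": ["full_name", "father_name", "date_of_birth", "pan_number"],
    "passport": ["surname", "given_name", "nationality", "gender", "date_of_birth",
                 "passport_number", "place_of_birth", "date_of_issue",
                 "date_of_expiry", "place_of_issue"],
}

def missing_fields(extracted: dict) -> list[str]:
    """Return the expected fields absent from an extraction result."""
    doc_type = extracted.get("document_type", "")
    expected = EXPECTED_FIELDS.get(doc_type, [])
    return [f for f in expected if f not in extracted]

# Example: a PAN extraction that is missing its ID number
result = {"document_type": "pan_card", "full_name": "A. Sharma",
          "father_name": "B. Sharma", "date_of_birth": "01/01/1990"}
print(missing_fields(result))  # -> ['pan_number']
```

A check like this makes it easy to route incomplete extractions to a retry or a manual-review queue.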
## Quick Start

### With Transformers

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

model_id = "Jwalit/gemma4-e4b-kyc-document-extractor"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

image = Image.open("document.jpg").convert("RGB")
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are an expert KYC document analyst. Always respond with accurate, structured JSON output."}]},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Classify this document and extract all information as structured JSON."}
    ]}
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    # do_sample=True so that the temperature setting takes effect
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.1)

result = processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(result)
```
### With vLLM (Production Deployment)

```bash
# Start an OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model Jwalit/gemma4-e4b-kyc-document-extractor \
  --trust-remote-code \
  --max-model-len 4096 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.9
```

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

with open("document.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Jwalit/gemma4-e4b-kyc-document-extractor",
    messages=[
        {"role": "system", "content": "You are an expert KYC document analyst. Always respond with accurate, structured JSON output."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "Classify and extract all fields from this KYC document as JSON."}
        ]}
    ],
    max_tokens=1024,
    temperature=0.1,
)
print(response.choices[0].message.content)
```
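Even at a low temperature, chat models sometimes wrap their JSON in Markdown fences or add surrounding prose, so it is worth parsing the response defensively before relying on it. A minimal sketch (the `extract_json` helper is illustrative, not part of this model's tooling):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Parse a JSON object from model output, tolerating Markdown fences or extra prose."""
    # Prefer a ```json ... ``` fenced block if one is present
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # Otherwise fall back to the first {...} span in the text
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError("no JSON object found in model output")

raw = '```json\n{"document_type": "pan_card", "pan_number": "ABCDE1234F"}\n```'
print(extract_json(raw)["pan_number"])  # -> ABCDE1234F
```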
### With vLLM Offline (Batch Processing)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Jwalit/gemma4-e4b-kyc-document-extractor",
    trust_remote_code=True,
    max_model_len=4096,
    dtype="bfloat16",
)
sampling_params = SamplingParams(temperature=0.1, max_tokens=1024)

# Use llm.chat() with image messages for batch processing
```
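The batch call hinted at above takes OpenAI-style message lists, one per document. A sketch of assembling such a batch (file paths and prompt wording are illustrative; `llm` and `sampling_params` come from the block above):

```python
import base64

def to_messages(image_path: str) -> list[dict]:
    """Build one OpenAI-style message list for a single document image."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    return [
        {"role": "system", "content": "You are an expert KYC document analyst. Always respond with accurate, structured JSON output."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "Classify and extract all fields from this KYC document as JSON."},
        ]},
    ]

# batch = [to_messages(p) for p in ["doc1.jpg", "doc2.jpg"]]
# outputs = llm.chat(batch, sampling_params)  # one generation per document
```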
## Training Details

### Method

- **Base Model**: `google/gemma-4-E4B-it` (~8B params, Gemma4ForConditionalGeneration)
- **Fine-tuning**: QLoRA SFT (4-bit NF4 quantization + LoRA rank 16 on the text decoder)
- **Vision Encoder**: Frozen SigLIP (280 tokens per image, 768-dim, 16 layers)
- **Framework**: TRL SFTTrainer + PEFT + BitsAndBytes

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Batch Size | 2 × 8 (gradient accumulation) = 16 effective |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Optimizer | AdamW (fused) |
| LR Scheduler | Cosine with 5% warmup |
| Precision | bf16 |
| Gradient Checkpointing | Enabled |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |

### Dataset

- **Dataset**: [`Jwalit/kyc-document-extraction-vlm`](https://huggingface.co/datasets/Jwalit/kyc-document-extraction-vlm)
- **Size**: 2,704 train / 296 eval samples
- **Document Types**: 5 (Aadhaar, PAN, Passport, Visa, Election Card)
- **Task Types**: Classification, Extraction, Combined (balanced across all three)
- **Format**: Conversational VLM (messages with `{"type": "image"}` + `{"type": "text"}`)
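For reference, a single sample in this conversational format looks roughly like the following (the prompt and answer wording are illustrative; only the message structure, with an image part followed by a text part, is prescribed by the dataset format):

```python
# Sketch of one conversational VLM training sample; the {"type": "image"}
# slot is paired with the actual PIL image at training time.
sample = {
    "messages": [
        {"role": "system", "content": [
            {"type": "text", "text": "You are an expert KYC document analyst."}
        ]},
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Classify this document and extract all fields as JSON."}
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": '{"document_type": "pan_card", "pan_number": "ABCDE1234F"}'}
        ]},
    ]
}

content_types = [part["type"] for part in sample["messages"][1]["content"]]
print(content_types)  # -> ['image', 'text']
```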
### Architecture

```
Gemma4ForConditionalGeneration
├── Vision Encoder (SigLIP, FROZEN)
│   ├── 16 layers, 768-dim, 12 attention heads
│   ├── Patch size: 16, Pooling kernel: 3
│   └── Output: 280 soft tokens per image
├── Text Decoder (LoRA applied here)
│   ├── 42 layers (36 sliding + 6 full attention)
│   ├── 2560 hidden, 8 heads, GQA
│   ├── 262K vocab, 131K context
│   └── LoRA on: q/k/v/o_proj + gate/up/down_proj
└── Audio Encoder (unused, frozen)
```
## Reproduce Training

```bash
# Install dependencies
pip install torch transformers trl datasets peft accelerate bitsandbytes trackio flash-attn pillow

# Run training (requires a GPU with >=24GB VRAM; recommended: A100 80GB)
python train_kyc_vlm.py
```

Or via the TRL CLI:

```bash
trl sft \
  --model_name_or_path google/gemma-4-E4B-it \
  --dataset_name Jwalit/kyc-document-extraction-vlm \
  --output_dir ./gemma4-kyc-extractor \
  --learning_rate 2e-4 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --bf16 \
  --gradient_checkpointing \
  --push_to_hub \
  --hub_model_id Jwalit/gemma4-e4b-kyc-document-extractor
```
## Performance & Deployment Notes

- **vLLM compatible**: Native support via the `Gemma4ForConditionalGeneration` architecture
- **280 image tokens**: Efficient; processes a document image in ~280 tokens (vs. 1024+ for many other VLMs)
- **128K context**: Can handle multiple document pages in a single request
- **QLoRA deployment**: Merge the adapters for full-speed inference, or serve with PEFT for memory efficiency

### Merging Adapters (recommended before vLLM serving in production)

```python
import torch
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "Jwalit/gemma4-e4b-kyc-document-extractor",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-kyc-extractor")
# Then push the merged model for faster vLLM serving
```
## Expected Output Format

```json
{
  "document_type": "aadhaar_card",
  "full_name": "Rajesh Kumar Singh",
  "date_of_birth": "15/03/1985",
  "gender": "Male",
  "father_name": "Suresh Kumar Singh",
  "aadhaar_number": "1234 5678 9012",
  "address": "123, MG Road, Mumbai, Maharashtra - 400001",
  "vid": "1234 5678 9012 3456"
}
```
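Downstream KYC pipelines usually validate extracted values before accepting them. A minimal sketch checking the number formats shown above (the patterns follow the example output, not an official specification; real Aadhaar validation would also verify the Verhoeff checksum):

```python
import re

# Format checks derived from the example output above.
AADHAAR_RE = re.compile(r"\d{4} \d{4} \d{4}")       # 12 digits in groups of 4
PAN_RE = re.compile(r"[A-Z]{5}\d{4}[A-Z]")          # 5 letters, 4 digits, 1 letter

def looks_valid(extracted: dict) -> bool:
    """Cheap format sanity check on an extraction result."""
    if extracted.get("document_type") == "aadhaar_card":
        return bool(AADHAAR_RE.fullmatch(extracted.get("aadhaar_number", "")))
    if extracted.get("document_type") == "pan_card":
        return bool(PAN_RE.fullmatch(extracted.get("pan_number", "")))
    return True  # no format check for the other document types

print(looks_valid({"document_type": "aadhaar_card", "aadhaar_number": "1234 5678 9012"}))  # -> True
print(looks_valid({"document_type": "pan_card", "pan_number": "abc"}))  # -> False
```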
## Limitations

- Trained on **synthetic** KYC documents; accuracy on real-world documents will improve with fine-tuning on real (anonymized) KYC samples
- Best results when further fine-tuned with 200-500 real document images per type
- The vision encoder is frozen, so the model cannot learn visual features beyond the base SigLIP capabilities
- Indian documents only (Aadhaar, PAN, Passport, Visa, Election Card)

## License

Apache 2.0 (same as the base model, `google/gemma-4-E4B-it`)