File size: 8,556 Bytes
5138c52
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
---
license: apache-2.0
base_model: google/gemma-4-E4B-it
tags:
  - sft
  - trl
  - peft
  - qlora
  - kyc
  - document-extraction
  - document-classification
  - aadhaar
  - pan-card
  - passport
  - visa
  - election-card
  - gemma4
  - vision-language-model
  - vllm
datasets:
  - Jwalit/kyc-document-extraction-vlm
pipeline_tag: image-text-to-text
library_name: transformers
---

# Gemma 4 E4B β€” KYC Document Extractor & Classifier

**Production-ready Vision-Language Model for Indian KYC Document Extraction and Classification**

Fine-tuned from [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it) using QLoRA SFT on a synthetic KYC document dataset covering 5 Indian identity document types.

## 🎯 Capabilities

| Task | Description |
|------|-------------|
| **Document Classification** | Classify document as: Aadhaar Card, PAN Card, Passport, Visa, or Election Card (Voter ID) |
| **Field Extraction** | Extract all structured fields (name, DOB, ID number, address, etc.) as JSON |
| **Combined** | Classify + Extract in a single pass |

## πŸ“‹ Supported Document Types

| Document | Fields Extracted |
|----------|-----------------|
| **Aadhaar Card** | full_name, date_of_birth, gender, father_name, aadhaar_number, address, VID |
| **PAN Card** | full_name, father_name, date_of_birth, pan_number |
| **Passport** | surname, given_name, nationality, gender, date_of_birth, passport_number, place_of_birth, date_of_issue, date_of_expiry, place_of_issue |
| **Visa** | issuing_country, visa_type, visa_category, visa_number, full_name, nationality, gender, date_of_birth, passport_number, date_of_issue, date_of_expiry, entries |
| **Election Card** | voter_id, full_name, relative_name, gender, date_of_birth, age, state, constituency, address |

## πŸš€ Quick Start

### With Transformers

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

model_id = "Jwalit/gemma4-e4b-kyc-document-extractor"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

image = Image.open("document.jpg").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are an expert KYC document analyst. Always respond with accurate, structured JSON output."}]},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Classify this document and extract all information as structured JSON."}
    ]}
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt", images=[image]
).to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, temperature=0.1)
    
result = processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(result)
```

### With vLLM (Production Deployment)

```bash
# Start OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model Jwalit/gemma4-e4b-kyc-document-extractor \
    --trust-remote-code \
    --max-model-len 4096 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9
```

```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

with open("document.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Jwalit/gemma4-e4b-kyc-document-extractor",
    messages=[
        {"role": "system", "content": "You are an expert KYC document analyst. Always respond with accurate, structured JSON output."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "Classify and extract all fields from this KYC document as JSON."}
        ]}
    ],
    max_tokens=1024,
    temperature=0.1
)
print(response.choices[0].message.content)
```

### With vLLM Offline (Batch Processing)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Jwalit/gemma4-e4b-kyc-document-extractor",
    trust_remote_code=True,
    max_model_len=4096,
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.1, max_tokens=1024)
# Use llm.chat() with image messages for batch processing
```

## πŸ‹οΈ Training Details

### Method
- **Base Model**: `google/gemma-4-E4B-it` (~8B params, Gemma4ForConditionalGeneration)
- **Fine-tuning**: QLoRA SFT (4-bit NF4 quantization + LoRA rank-16 on text decoder)
- **Vision Encoder**: Frozen SigLIP (280 tokens per image, 768-dim, 16 layers)
- **Framework**: TRL SFTTrainer + PEFT + BitsAndBytes

### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Batch Size | 2 Γ— 8 (gradient accumulation) = 16 effective |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Optimizer | AdamW (fused) |
| LR Scheduler | Cosine with 5% warmup |
| Precision | bf16 |
| Gradient Checkpointing | βœ… |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |

### Dataset
- **Dataset**: [`Jwalit/kyc-document-extraction-vlm`](https://huggingface.co/datasets/Jwalit/kyc-document-extraction-vlm)
- **Size**: 2,704 train / 296 eval samples
- **Document Types**: 5 (Aadhaar, PAN, Passport, Visa, Election Card)
- **Task Types**: Classification, Extraction, Combined (balanced across all)
- **Format**: Conversational VLM (messages with `{"type": "image"}` + `{"type": "text"}`)

### Architecture

```
Gemma4ForConditionalGeneration
β”œβ”€β”€ Vision Encoder (SigLIP, FROZEN)
β”‚   β”œβ”€β”€ 16 layers, 768-dim, 12 attention heads
β”‚   β”œβ”€β”€ Patch size: 16, Pooling kernel: 3
β”‚   └── Output: 280 soft tokens per image
β”œβ”€β”€ Text Decoder (LoRA applied here)
β”‚   β”œβ”€β”€ 42 layers (36 sliding + 6 full attention)
β”‚   β”œβ”€β”€ 2560 hidden, 8 heads, GQA
β”‚   β”œβ”€β”€ 262K vocab, 131K context
β”‚   └── LoRA on: q/k/v/o_proj + gate/up/down_proj
└── Audio Encoder (unused, frozen)
```

## πŸ”§ Reproduce Training

```bash
# Install dependencies
pip install torch transformers trl datasets peft accelerate bitsandbytes trackio flash-attn pillow

# Run training (requires GPU with β‰₯24GB VRAM, recommended: A100 80GB)
python train_kyc_vlm.py
```

Or via TRL CLI:
```bash
trl sft \
    --model_name_or_path google/gemma-4-E4B-it \
    --dataset_name Jwalit/kyc-document-extraction-vlm \
    --output_dir ./gemma4-kyc-extractor \
    --learning_rate 2e-4 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --bf16 \
    --gradient_checkpointing \
    --push_to_hub \
    --hub_model_id Jwalit/gemma4-e4b-kyc-document-extractor
```

## ⚑ Performance & Deployment Notes

- **vLLM compatible**: Native support via `Gemma4ForConditionalGeneration` architecture
- **280 image tokens**: Efficient β€” processes document images in ~280 tokens (vs 1024+ for other VLMs)
- **128K context**: Can handle multiple document pages in a single request
- **QLoRA deployment**: Merge adapters for full-speed inference, or serve with PEFT for memory efficiency

### Merging Adapters (for production β€” recommended before vLLM serving)

```python
from peft import AutoPeftModelForCausalLM
import torch

model = AutoPeftModelForCausalLM.from_pretrained(
    "Jwalit/gemma4-e4b-kyc-document-extractor",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-kyc-extractor")
# Then push merged model for faster vLLM serving
```

## πŸ“Š Expected Output Format

```json
{
  "document_type": "aadhaar_card",
  "full_name": "Rajesh Kumar Singh",
  "date_of_birth": "15/03/1985",
  "gender": "Male",
  "father_name": "Suresh Kumar Singh",
  "aadhaar_number": "1234 5678 9012",
  "address": "123, MG Road, Mumbai, Maharashtra - 400001",
  "vid": "1234 5678 9012 3456"
}
```

## ⚠️ Limitations

- Trained on **synthetic** KYC documents β€” accuracy on real-world documents will improve with fine-tuning on real (anonymized) KYC samples
- Best results when further fine-tuned with 200-500 real document images per type
- Vision encoder is frozen β€” cannot learn new visual features beyond base SigLIP capabilities
- Indian documents only (Aadhaar, PAN, Passport, Visa, Election Card)

## πŸ“ License

Apache 2.0 (same as base model `google/gemma-4-E4B-it`)