Privacy Masker - Gemma-4-E4B LoRA Adapter

LoRA adapter for PII (Personally Identifiable Information) detection on mobile UI screenshots. Stage 2 of a two-stage pipeline (PaddleOCR -> Gemma classifier).

Team: Aurimas Bžėskis · Tomas Stankevicius · Aida Katkauskaitė
Course: VU Deep Learning, 2026

Intended use

Given a list of OCR text regions extracted from a mobile screenshot, classify each region as one of 9 PII classes or null. The model expects a layout-aware prompt of the form [index@x,y] "text" with coordinates normalized to a 1000x1000 grid.
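The prompt format described above can be sketched as follows. This is a minimal illustration, not the project's actual prompt builder: the `OcrRegion` type and `format_regions` helper are hypothetical, and the sketch assumes that indices start at 0, that `x,y` is the center of the region's bounding box, and that each region goes on its own line. None of these details are stated in this card.

```python
from dataclasses import dataclass


@dataclass
class OcrRegion:
    """Hypothetical container for one OCR hit, with pixel coordinates."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def normalize(value: float, size: float, grid: int = 1000) -> int:
    """Map a pixel coordinate onto the 0..grid range and clamp it."""
    return max(0, min(grid, round(value / size * grid)))


def format_regions(regions: list[OcrRegion], width: int, height: int) -> str:
    """Render regions as [index@x,y] "text" lines (center point is an assumption)."""
    lines = []
    for index, region in enumerate(regions):
        x = normalize((region.x_min + region.x_max) / 2, width)
        y = normalize((region.y_min + region.y_max) / 2, height)
        lines.append(f'[{index}@{x},{y}] "{region.text}"')
    return "\n".join(lines)


# Example on a 1080x1920 screenshot.
regions = [
    OcrRegion("john@example.com", 100, 200, 500, 260),
    OcrRegion("Balance: $1,250.00", 540, 960, 1080, 1000),
]
prompt_body = format_regions(regions, width=1080, height=1920)
print(prompt_body)
```

Normalizing to a fixed grid keeps the coordinates comparable across screenshots of different resolutions.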

Training

  • Base model: google/gemma-4-E4B (4-bit NF4)
  • LoRA: r=32, α=64, dropout=0.05, language-model projections only
  • Trainable parameters: 69.8M (0.87%)
  • Dataset: pii_v5 (9,989 mobile UI screenshots from RICO-ScreenQA)
  • Optimizer: AdamW, cosine LR schedule, bf16
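As a quick sanity check on the figures above, the following arithmetic derives two implied quantities. It assumes the standard LoRA formulation, in which the low-rank update is scaled by α/r, and it treats the reported 0.87% as trainable parameters divided by total parameters. Both values in the list are rounded, so the result is approximate.

```python
# Hyperparameters and counts copied from the training summary above.
r = 32
alpha = 64
trainable_params = 69.8e6
trainable_fraction = 0.0087  # 0.87%, rounded in the card

# Standard LoRA scales the low-rank update by alpha / r (assumption: no rsLoRA).
scaling = alpha / r

# Implied total parameter count of the model the percentage was computed against.
implied_total = trainable_params / trainable_fraction

print(f"LoRA scaling factor: {scaling}")
print(f"Implied total parameters: {implied_total / 1e9:.2f}B")
```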

Results (pii_v5 test split, 997 screens)

Metric         Value
Micro F1       0.586
Macro F1       0.536
JSON validity  100%

Per-class F1: email_address 0.84 · phone_number 0.59 · address 0.56 · full_name 0.56 · account_balance 0.56 · transaction_amount 0.55 · username 0.47 · other_sensitive 0.42 · date_of_birth 0.28.
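Macro F1 is the unweighted mean of the per-class F1 scores, so it can be recomputed from the values listed above. The sketch below does that with the rounded two-decimal scores, so a small difference from the reported 0.536 is expected.

```python
# Per-class F1 scores as reported above (rounded to two decimals).
per_class_f1 = {
    "email_address": 0.84,
    "phone_number": 0.59,
    "address": 0.56,
    "full_name": 0.56,
    "account_balance": 0.56,
    "transaction_amount": 0.55,
    "username": 0.47,
    "other_sensitive": 0.42,
    "date_of_birth": 0.28,
}

# Macro F1: every class counts equally, regardless of how often it occurs.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"{macro_f1:.3f}")  # prints 0.537
```

Micro F1 (0.586) cannot be recomputed this way because it pools true positives, false positives, and false negatives across classes, and those counts are not listed here.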

How to use

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Load the base model in 4-bit NF4, the same quantization listed under Training.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                        bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained("google/gemma-4-E4B",
                                            quantization_config=bnb,
                                            device_map="auto")
# Attach the LoRA adapter and load the tokenizer from the adapter repo.
model = PeftModel.from_pretrained(base, "tomasstankevicius/privacy-masker-gemma4-lora")
tok = AutoTokenizer.from_pretrained("tomasstankevicius/privacy-masker-gemma4-lora")
model.eval()  # inference only: disables dropout

See the project repo for the full inference pipeline (PaddleOCR + prompt construction + JSON parsing).
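For orientation, the JSON parsing step might look roughly like the sketch below. The output schema is an assumption made for illustration: this card does not specify it, and the sketch supposes a JSON object that maps each region index (as a string) to one of the nine class names or null. The `parse_labels` helper is hypothetical; the class names are taken from the per-class results above.

```python
import json

# The nine PII classes listed in the results section.
PII_CLASSES = {
    "email_address", "phone_number", "address", "full_name",
    "account_balance", "transaction_amount", "username",
    "other_sensitive", "date_of_birth",
}


def parse_labels(raw: str, n_regions: int) -> list:
    """Parse model output into one label (or None) per OCR region.

    Assumed schema (not confirmed by the card): {"0": "email_address", "1": null}
    Raises ValueError if the output is not valid under that schema.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping index to label")

    labels = [None] * n_regions  # regions the model omits default to null
    for key, label in data.items():
        index = int(key)  # raises ValueError for non-numeric keys
        if not 0 <= index < n_regions:
            raise ValueError(f"region index out of range: {index}")
        if label is not None and label not in PII_CLASSES:
            raise ValueError(f"unknown class: {label!r}")
        labels[index] = label
    return labels


labels = parse_labels('{"0": "email_address", "1": null}', n_regions=3)
print(labels)  # ['email_address', None, None]
```

Rejecting unknown class names and out-of-range indices keeps malformed generations from silently producing wrong masks downstream.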

Limitations

  • OCR ceiling: end-to-end recall is bounded above by PaddleOCR recall. Text that the OCR stage misses (low-contrast text, icons) never reaches the classifier.
  • date_of_birth regresses on pii_v5 due to label noise (the class mixes date strings with age integers).
  • other_sensitive is structurally weak (mixes biometrics, credentials, demographics).
  • Trained on English UIs only.

License

Inherits the Gemma license from the base model.
