PIXAR-13B

PIXAR-13B is a Vision-Language Model (VLM) for image tampering analysis, introduced in the paper "From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering".

Given a query image, PIXAR-13B jointly performs:

  • Binary classification: real or tampered
  • Object classification: identifies which of the 81 COCO categories was modified
  • Pixel-level localization: generates a segmentation mask over the tampered region
  • Natural language description: describes what was changed and how

Model Description

  • Developed by: MBZUAI VILA Lab
  • Model type: Multimodal Vision-Language Model for Image Tampering Detection
  • License: MIT
  • Base model: SIDA-13B (LLaVA + LLaMA-2)
  • Paper: From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Architecture

PIXAR-13B is built on a LLaVA + LLaMA-2 backbone with LoRA fine-tuning (rank 8), integrated with SAM ViT-H for pixel-level decoding and CLIP ViT-L/14 for visual-language alignment. Three special tokens are inserted into the token sequence to anchor multi-task prediction heads:

| Token | Role |
| --- | --- |
| [CLS] | 3-way classification (real / fully synthetic / tampered) via a linear head |
| [OBJ] | Multi-label object recognition over 81 COCO categories via a linear head |
| [SEG] | Pixel-level segmentation mask generation via SAM, optionally fused with the generated text embedding |
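The head mechanism can be illustrated with a minimal NumPy sketch: the LLM hidden states at the special-token positions are read out and passed through per-task linear heads. All dimensions, positions, and weights below are illustrative placeholders, not the model's real parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16          # hidden size (illustrative; the real model uses the LLM's dim)
N_CLASSES = 3        # real / fully synthetic / tampered
N_OBJECTS = 81       # COCO categories

# Hidden states for a token sequence; positions of [CLS] and [OBJ] are known.
hidden = rng.normal(size=(10, HIDDEN))
cls_pos, obj_pos = 0, 1

# Per-task linear heads (random weights, for illustration only).
W_cls = rng.normal(size=(HIDDEN, N_CLASSES))
W_obj = rng.normal(size=(HIDDEN, N_OBJECTS))

# 3-way classification from the [CLS] hidden state (argmax over class logits).
cls_logits = hidden[cls_pos] @ W_cls
cls_pred = int(np.argmax(cls_logits))

# Multi-label object recognition from the [OBJ] hidden state (sigmoid per class).
obj_probs = 1.0 / (1.0 + np.exp(-(hidden[obj_pos] @ W_obj)))
obj_pred = np.flatnonzero(obj_probs > 0.5)
```

The [SEG] hidden state would analogously be projected into SAM's prompt-embedding space to drive mask decoding; that step is omitted here since it requires the SAM decoder.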

Key Contributions

Existing tampering benchmarks rely on coarse object masks as ground truth, which conflates unedited pixels inside the mask with actual tamper evidence and misses subtle edits outside the mask. PIXAR replaces binary masks with per-pixel difference maps $D = |I_\text{orig} - I_\text{gen}|$, thresholded at a tunable $\tau$ to produce dynamic ground truth $M_\tau$ that captures edits at multiple scales. The PIXAR benchmark provides:

  • 420K+ training pairs with pixel-level $M_\tau$ maps, semantic class labels, and natural language descriptions
  • 40K balanced test pairs spanning 8 manipulation types (replace, remove, splice, inpaint, attribute change, colorization, etc.)
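The difference-map construction described above can be sketched in a few lines, assuming images are float arrays in [0, 1] (the function name, channel reduction, and toy shapes are illustrative):

```python
import numpy as np

def tamper_mask(i_orig, i_gen, tau=0.05):
    """Per-pixel difference map D = |I_orig - I_gen|, thresholded at tau."""
    d = np.abs(i_orig.astype(np.float32) - i_gen.astype(np.float32))
    if d.ndim == 3:            # reduce RGB channels to a single map
        d = d.max(axis=-1)
    return d > tau             # dynamic ground-truth mask M_tau

# Toy 4x4 grayscale pair with a single edited pixel.
orig = np.zeros((4, 4), dtype=np.float32)
gen = orig.copy()
gen[1, 2] = 0.3
mask = tamper_mask(orig, gen, tau=0.05)
print(mask.sum())  # -> 1
```

Lowering τ admits subtler edits into the ground truth; raising it keeps only strong pixel changes, which is what makes the supervision scale-tunable.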

PIXAR-13B achieves a 2.6× IoU improvement over the prior state of the art on the PIXAR benchmark.
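For reference, the pixel-level IoU behind that comparison can be computed from boolean masks as follows (a generic sketch, not the benchmark's evaluation code):

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt = np.array([[1, 0], [0, 0]], dtype=bool)
print(iou(pred, gt))  # -> 0.5
```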

How to Get Started

For interactive inference, see the project repository:

python chat.py --version jiachengcui888/PIXAR-13B --precision bf16 --seg_prompt_mode seg_only

Training Details

Training Data

The PIXAR benchmark: 420K+ image pairs with pixel-faithful tamper labels, semantic categories, and natural language descriptions, spanning 8 manipulation types across COCO-category objects.

Training Procedure

Fine-tuned with DeepSpeed on a LLaVA + LLaMA-2 backbone using LoRA (rank 8). Key hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| LoRA rank | 8 |
| Learning rate | 1e-4 |
| Batch size | 2 |
| Precision | bf16 |
| Threshold τ | 0.05 |
| Loss weight: text | 3.0 |
| Loss weight: cls | 1.0 |
| Loss weight: bce | 1.0 |
| Loss weight: dice | 1.0 |
| Loss weight: sem | 0.5 |
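Assuming the text/cls/bce/dice/sem entries in the table are per-task loss weights, the combined training objective would plausibly be a weighted sum, sketched below with placeholder loss values (not the project's actual training code):

```python
# Hypothetical combined objective using the table's weights.
WEIGHTS = {"text": 3.0, "cls": 1.0, "bce": 1.0, "dice": 1.0, "sem": 0.5}

def total_loss(losses):
    """Weighted sum of per-task losses, keyed to match WEIGHTS."""
    return sum(WEIGHTS[k] * v for k, v in losses.items())

# Placeholder per-task loss values for illustration.
example = {"text": 0.2, "cls": 0.1, "bce": 0.3, "dice": 0.4, "sem": 0.2}
print(total_loss(example))  # -> 1.5
```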

Citation

