PIXAR-13B

PIXAR-13B is a Vision-Language Model (VLM) for image tampering analysis, introduced in the paper "From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering".

Given a query image, PIXAR-13B jointly performs:

  • Binary classification: real or tampered
  • Object classification: identifies which of the 81 COCO categories was modified
  • Pixel-level localization: generates a segmentation mask over the tampered region
  • Natural language description: describes what was changed and how

Model Description

  • Developed by: MBZUAI VILA Lab
  • Model type: Multimodal Vision-Language Model for Image Tampering Detection
  • License: MIT
  • Base model: SIDA-13B (LLaVA + LLaMA-2)
  • Paper: From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Architecture

PIXAR-13B is built on a LLaVA + LLaMA-2 backbone with LoRA fine-tuning (rank 8), integrated with SAM ViT-H for pixel-level decoding and CLIP ViT-L/14 for visual-language alignment. Three special tokens are inserted into the token sequence to anchor multi-task prediction heads:

| Token | Role |
| --- | --- |
| [CLS] | 3-way classification (real / fully synthetic / tampered) via a linear head |
| [OBJ] | Multi-label object recognition over 81 COCO categories via a linear head |
| [SEG] | Pixel-level segmentation mask generation via SAM, optionally fused with the generated text embedding |
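The head mechanism can be illustrated with a minimal NumPy sketch: the LLM hidden states at the special-token positions are read out and passed through per-task linear heads. All dimensions, positions, and weights below are illustrative placeholders, not the model's real parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16          # hidden size (illustrative; the real model uses the LLM's dim)
N_CLASSES = 3        # real / fully synthetic / tampered
N_OBJECTS = 81       # COCO categories

# Hidden states for a token sequence; positions of [CLS] and [OBJ] are known.
hidden = rng.normal(size=(10, HIDDEN))
cls_pos, obj_pos = 0, 1

# Per-task linear heads (random weights, for illustration only).
W_cls = rng.normal(size=(HIDDEN, N_CLASSES))
W_obj = rng.normal(size=(HIDDEN, N_OBJECTS))

# 3-way classification from the [CLS] hidden state (argmax over class logits).
cls_logits = hidden[cls_pos] @ W_cls
cls_pred = int(np.argmax(cls_logits))

# Multi-label object recognition from the [OBJ] hidden state (sigmoid per class).
obj_probs = 1.0 / (1.0 + np.exp(-(hidden[obj_pos] @ W_obj)))
obj_pred = np.flatnonzero(obj_probs > 0.5)
```

The [SEG] hidden state would analogously be projected into SAM's prompt-embedding space to drive mask decoding; that step is omitted here since it requires the SAM decoder.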

Key Contributions

Existing tampering benchmarks rely on coarse object masks as ground truth, which conflates unedited pixels inside the mask with actual tamper evidence and misses subtle edits outside the mask. PIXAR replaces binary masks with per-pixel difference maps $D = |I_\text{orig} - I_\text{gen}|$, thresholded at a tunable $\tau$ to produce dynamic ground truth $M_\tau$ that captures edits at multiple scales. The PIXAR benchmark provides:

  • 420K+ training pairs with pixel-level $M_\tau$ maps, semantic class labels, and natural language descriptions
  • 40K balanced test pairs spanning 8 manipulation types (replace, remove, splice, inpaint, attribute change, colorization, etc.)
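The difference-map construction described above can be sketched in a few lines, assuming images are float arrays in [0, 1] (the function name, channel reduction, and toy shapes are illustrative):

```python
import numpy as np

def tamper_mask(i_orig, i_gen, tau=0.05):
    """Per-pixel difference map D = |I_orig - I_gen|, thresholded at tau."""
    d = np.abs(i_orig.astype(np.float32) - i_gen.astype(np.float32))
    if d.ndim == 3:            # reduce RGB channels to a single map
        d = d.max(axis=-1)
    return d > tau             # dynamic ground-truth mask M_tau

# Toy 4x4 grayscale pair with a single edited pixel.
orig = np.zeros((4, 4), dtype=np.float32)
gen = orig.copy()
gen[1, 2] = 0.3
mask = tamper_mask(orig, gen, tau=0.05)
print(mask.sum())  # -> 1
```

Lowering τ admits subtler edits into the ground truth; raising it keeps only strong pixel changes, which is what makes the supervision scale-tunable.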

PIXAR-13B achieves a 2.6× IoU improvement over the prior state of the art on the PIXAR benchmark.
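For reference, the pixel-level IoU behind that comparison can be computed from boolean masks as follows (a generic sketch, not the benchmark's evaluation code):

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt = np.array([[1, 0], [0, 0]], dtype=bool)
print(iou(pred, gt))  # -> 0.5
```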

How to Get Started

For interactive inference, see the project repository:

python chat.py --version jiachengcui888/PIXAR-13B --precision bf16 --seg_prompt_mode seg_only

Training Details

Training Data

The PIXAR benchmark: 420K+ image pairs with pixel-faithful tamper labels, semantic categories, and natural language descriptions, spanning 8 manipulation types across COCO-category objects.

Training Procedure

Fine-tuned with DeepSpeed on a LLaVA + LLaMA-2 backbone using LoRA (rank 8). Key hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| LoRA rank | 8 |
| Learning rate | 1e-4 |
| Batch size | 2 |
| Precision | bf16 |
| Threshold τ | 0.05 |
| Loss weight: text | 3.0 |
| Loss weight: cls | 1.0 |
| Loss weight: bce | 1.0 |
| Loss weight: dice | 1.0 |
| Loss weight: sem | 0.5 |
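Assuming the text/cls/bce/dice/sem entries in the table are per-task loss weights, the combined training objective would plausibly be a weighted sum, sketched below with placeholder loss values (not the project's actual training code):

```python
# Hypothetical combined objective using the table's weights.
WEIGHTS = {"text": 3.0, "cls": 1.0, "bce": 1.0, "dice": 1.0, "sem": 0.5}

def total_loss(losses):
    """Weighted sum of per-task losses, keyed to match WEIGHTS."""
    return sum(WEIGHTS[k] * v for k, v in losses.items())

# Placeholder per-task loss values for illustration.
example = {"text": 0.2, "cls": 0.1, "bce": 0.3, "dice": 0.4, "sem": 0.2}
print(total_loss(example))  # -> 1.5
```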

Citation

