# PIXAR-7B
PIXAR-7B is a Vision-Language Model (VLM) for image tampering analysis, introduced in the paper "From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering".
Given a query image, PIXAR-7B jointly performs:
- Binary detection → real or tampered
- Object classification → identifies which of 81 COCO categories was modified
- Pixel-level localization → generates a segmentation mask over the tampered region
- Natural language description → describes what was changed and how
## Model Description
- Developed by: MBZUAI VILA Lab
- Model type: Multimodal Vision-Language Model for Image Tampering Detection
- License: MIT
- Base model: SIDA-7B (LLaVA + LLaMA-2)
- Paper: From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
## Architecture
PIXAR-7B is built on a LLaVA + LLaMA-2 backbone with LoRA fine-tuning (rank 8), integrated with SAM ViT-H for pixel-level decoding and CLIP ViT-L/14 for visual-language alignment. Three special tokens are inserted into the token sequence to anchor multi-task prediction heads:
| Token | Role |
|---|---|
| `[CLS]` | 3-way classification (real / fully synthetic / tampered) via a linear head |
| `[OBJ]` | Multi-label object recognition over 81 COCO categories via a linear head |
| `[SEG]` | Pixel-level segmentation mask generation via SAM, optionally fused with the generated text embedding |
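The role of the anchor tokens can be sketched schematically: each token's final hidden state is fed to its own head. The snippet below is a minimal NumPy illustration, not the model's actual code; the hidden size, token positions, and random weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16                                 # illustrative hidden size, not the real LLM width
seq = rng.standard_normal((10, hidden))     # final hidden states for a 10-token sequence

# Assumed positions of the special tokens within the sequence
cls_pos, obj_pos, seg_pos = 3, 4, 5

W_cls = rng.standard_normal((hidden, 3))    # 3-way: real / fully synthetic / tampered
W_obj = rng.standard_normal((hidden, 81))   # multi-label over 81 COCO categories

cls_logits = seq[cls_pos] @ W_cls           # linear head on the [CLS] hidden state
obj_logits = seq[obj_pos] @ W_obj           # linear head on the [OBJ] hidden state
seg_embed = seq[seg_pos]                    # [SEG] embedding is handed to SAM's mask decoder
```

In the real model the `[SEG]` embedding conditions SAM's ViT-H decoder rather than a linear layer, which is why it is extracted as an embedding here instead of being projected to logits.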
## Key Contributions
Existing tampering benchmarks rely on coarse object masks as ground truth, which conflates unedited pixels inside the mask with actual tamper evidence and misses subtle edits outside the mask. PIXAR replaces binary masks with per-pixel difference maps $D = |I_\text{orig} - I_\text{gen}|$, thresholded at a tunable $\tau$ to produce dynamic ground truth $M_\tau$ that captures edits at multiple scales.
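The dynamic ground-truth construction can be sketched as follows. This is a minimal NumPy illustration of $M_\tau = \mathbb{1}[D > \tau]$; the channel reduction (max over RGB) and the `[0, 1]` float range are assumptions, not necessarily the paper's exact pipeline.

```python
import numpy as np

def dynamic_mask(i_orig: np.ndarray, i_gen: np.ndarray, tau: float = 0.05) -> np.ndarray:
    """Per-pixel difference map D = |I_orig - I_gen|, thresholded at tau.

    Both images are float arrays in [0, 1] of shape (H, W, C).
    Returns a binary mask M_tau of shape (H, W).
    """
    diff = np.abs(i_orig - i_gen)           # per-pixel, per-channel difference
    diff = diff.max(axis=-1)                # reduce channels (assumption: max over RGB)
    return (diff > tau).astype(np.uint8)    # 1 = tampered pixel, 0 = unchanged

# Toy example: a 4x4 image where a single pixel was edited
orig = np.zeros((4, 4, 3))
gen = orig.copy()
gen[1, 2] = 0.8                             # simulate an edit at one pixel
mask = dynamic_mask(orig, gen, tau=0.05)    # mask is 1 only at (1, 2)
```

Lowering `tau` makes the ground truth more sensitive to subtle edits (e.g. colorization), while raising it keeps only strong pixel changes, which is how a single tunable threshold yields masks at multiple scales.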
The PIXAR benchmark provides:
- 420K+ training pairs with pixel-level $M_\tau$ maps, semantic class labels, and natural language descriptions
- 40K balanced test pairs spanning 8 manipulation types (replace, remove, splice, inpaint, attribute change, colorization, etc.)
PIXAR-7B achieves a 2.6× IoU improvement over the prior state of the art on the PIXAR benchmark.
## How to Get Started
For interactive inference, see the project repository:
```shell
python chat.py --version jiachengcui888/PIXAR-7B --precision bf16 --seg_prompt_mode seg_only
```
## Training Details
### Training Data
The PIXAR benchmark: 420K+ image pairs with pixel-faithful tamper labels, semantic categories, and natural language descriptions, spanning 8 manipulation types across COCO-category objects.
### Training Procedure
Fine-tuned with DeepSpeed on a LLaVA + LLaMA-2 backbone using LoRA (rank 8). Key hyperparameters:
| Hyperparameter | Value |
|---|---|
| LoRA rank | 8 |
| Learning rate | 1e-4 |
| Batch size | 2 |
| Precision | bf16 |
| Threshold τ | 0.05 |
| Loss weight (text) | 3.0 |
| Loss weight (cls) | 1.0 |
| Loss weight (bce) | 1.0 |
| Loss weight (dice) | 1.0 |
| Loss weight (sem) | 0.5 |
## Evaluation
PIXAR-7B achieves a 2.6× IoU improvement over the prior state of the art on the PIXAR test benchmark. For full evaluation results, refer to the paper.
## Citation