# Guardian — Multi-View VLM for Robotic Planning & Execution Failure Detection (Thinking variant)
Guardian is a vision-language model fine-tuned for unified planning and execution verification in robotic manipulation. Given an instruction and one or more images of the robot scene, it predicts whether a proposed plan is correct (planning verification) or whether a subtask was successfully executed (execution verification), and emits an explicit chain-of-thought reasoning trace alongside the final answer.
This checkpoint (`guardian-thinking`) is the thinking variant: it is trained and run with an explicit `<think> ... </think>` reasoning trace emitted before the final `<answer>` and `<category>` tokens. A lighter no-CoT counterpart (`guardian-vanilla`) is released separately.
| Project page | Paper | Code | Data |
|---|---|---|---|
| di.ens.fr/willow/research/guardian | arXiv:2512.01946 | GitHub | 🤗 Guardian collection |
## Model summary
- Architecture: InternVL3-8B (Qwen2.5-7B LLM + InternViT-300M-448px-V2.5), fine-tuned with LoRA (rank 16) on the LLM only; visual encoder and MLP connector kept frozen.
- Capabilities:
  - Planning verification — from an initial scene image and a proposed list of subtasks, decide whether the plan is correct.
  - Execution verification — from before/after observations of a subtask (single-view or multi-view), decide whether the subtask succeeded.
  - Thinking mode — every prediction is preceded by an explicit reasoning trace.
- Output format (thinking): `<think> reasoning </think> <answer> True|False </answer> <category> ... </category>` (a parsing sketch follows this list).
- Training data: FailCoT (RLBench-Fail + BridgeDataV2-Fail), ~30K planning + execution failures with reasoning traces. See the paper Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation (Pacaud et al., 2026).
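For reference, here is a minimal sketch of splitting a raw thinking-mode completion into its components. The `parse_guardian_output` helper is illustrative only; the released wrapper (see Quick start) already returns the parsed `answer` and `category` for you.

```python
import re

def parse_guardian_output(text: str):
    """Split a raw Guardian (thinking) completion into (reasoning, answer, category).

    Assumes the documented format:
    <think> ... </think> <answer> True|False </answer> <category> ... </category>
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    category = re.search(r"<category>(.*?)</category>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() == "True" if answer else None,
        category.group(1).strip() if category else "",
    )
```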
## Quick start
The simplest way to run Guardian is to use the lightweight wrapper shipped in the Guardian repo (`examples/guardian.py`):
```python
from examples.guardian import Guardian

guardian = Guardian(
    model_path="<path>/guardian-thinking",
    thinking=True,
)

# Planning verification: 1 image of the initial scene
answer, category = guardian.verify_plan(
    img_paths_list=["/path/to/start_img.png"],
    task_instruction="stack the red cup on the blue cup",
    plan=str([
        "grasp red cup",
        "move grasped object on top of blue cup",
        "release",
    ]),
)

# Execution verification: 2, 6, or 8 images (before/after, possibly multi-view)
answer, category = guardian.verify_subtask(
    img_paths_list=[
        "/path/to/start_left.png",
        "/path/to/start_right.png",
        "/path/to/start_wrist.png",
        "/path/to/end_left.png",
        "/path/to/end_right.png",
        "/path/to/end_wrist.png",
    ],
    task_instruction="stack the red cup on the blue cup",
    subtask_instruction="grasp red cup",
)
```
For execution verification, the wrapper accepts (the helper sketch after this list shows the ordering):
- 2 images — single-view: `[start, end]`
- 6 images — three views: `[start_left, start_right, start_wrist, end_left, end_right, end_wrist]`
- 8 images — four views, similarly ordered.
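A small helper can keep this ordering straight. The `ordered_image_paths` function and the view names below are illustrative conventions, not part of the released API:

```python
def ordered_image_paths(start: dict, end: dict, views: list[str]) -> list[str]:
    """Flatten per-view before/after image paths into the order Guardian expects:
    all start-view images first, then all end-view images, in the same view order."""
    return [start[v] for v in views] + [end[v] for v in views]

# Example: three views (left, right, wrist) -> 6 images
views = ["left", "right", "wrist"]
start = {v: f"/path/to/start_{v}.png" for v in views}
end = {v: f"/path/to/end_{v}.png" for v in views}
img_paths_list = ordered_image_paths(start, end, views)
```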
See docs/RUN_DEMO.md in the Guardian repo for the full demo.
## Downloading the checkpoint
```bash
hf download paulpacaud/guardian-thinking \
  --local-dir ./data/failure_forge/models/guardian-thinking
```
The codebase expects the checkpoint to live under `./data/failure_forge/models/guardian-thinking/`.
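If you prefer downloading from Python, `huggingface_hub.snapshot_download` achieves the same result; the target directory below simply mirrors the path the codebase expects:

```python
from huggingface_hub import snapshot_download

# Download the full checkpoint into the directory the Guardian codebase expects
snapshot_download(
    repo_id="paulpacaud/guardian-thinking",
    local_dir="./data/failure_forge/models/guardian-thinking",
)
```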
## Evaluation
Guardian is evaluated on three real-robot OOD benchmarks bundled at paulpacaud/Guardian-FailCoT-OOD-datasets — UR5-Fail, RoboFail, RoboVQA — plus the in-distribution test splits of FailCoT (RLBench-Fail / BridgeDataV2-Fail).
Reproduce evaluation results following docs/Offline_VQA_Evaluation.md in the Guardian repo. Headline numbers from Table II of the paper:
| Benchmark | Execution acc. | Planning acc. |
|---|---|---|
| RoboFail | 0.86 | 0.70 |
| UR5-Fail | 0.77 | 0.89 |
| RoboVQA | 0.85 | — |
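If you run your own offline evaluation with the wrapper above, the numbers in the table are plain binary accuracies over the True/False verdicts. A minimal sketch of the metric (the example predictions below are made up):

```python
def binary_accuracy(predictions: list[bool], labels: list[bool]) -> float:
    """Fraction of examples where Guardian's True/False verdict matches the label."""
    assert len(predictions) == len(labels)
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# e.g. predictions collected from guardian.verify_subtask(...) over a test split
print(binary_accuracy([True, False, True, True], [True, False, False, True]))  # 0.75
```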
## Intended use
Guardian is designed as a plug-and-play verification module for robotic manipulation pipelines (e.g. as the verifier in 3D-LOTUS++): at each planning step or subtask boundary, query Guardian; on a failure, trigger replanning or re-execution.
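A minimal sketch of such a closed loop; the `planner`, `executor`, and `capture_images` objects here are placeholders for whatever pipeline Guardian is plugged into, not part of this release:

```python
def run_task(guardian, planner, executor, task_instruction, capture_images):
    """Closed-loop execution with Guardian as the verifier (illustrative sketch)."""
    plan = planner.propose(task_instruction)

    # Planning verification on the initial scene
    ok, category = guardian.verify_plan(
        img_paths_list=capture_images(),
        task_instruction=task_instruction,
        plan=str(plan),
    )
    if not ok:
        plan = planner.replan(task_instruction, feedback=category)

    for subtask in plan:
        before = capture_images()
        executor.run(subtask)
        after = capture_images()

        # Execution verification on before/after observations (start views, then end views)
        ok, category = guardian.verify_subtask(
            img_paths_list=before + after,
            task_instruction=task_instruction,
            subtask_instruction=subtask,
        )
        if not ok:
            executor.retry(subtask)  # or trigger replanning, depending on the failure category
```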
## Citation
```bibtex
@misc{pacaud2026guardian_failcot,
  title         = {Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation},
  author        = {Paul Pacaud and Ricardo Garcia and Shizhe Chen and Cordelia Schmid},
  year          = {2026},
  eprint        = {2512.01946},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO}
}
```
If you specifically build on the earlier Guardian workshop paper:
```bibtex
@inproceedings{pacaud2025guardian,
  title     = {Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models},
  author    = {Paul Pacaud and Ricardo Garcia Pinel and Shizhe Chen and Cordelia Schmid},
  booktitle = {Workshop on Making Sense of Data in Robotics: Composition, Curation, and Interpretability at Scale at CoRL 2025},
  year      = {2025},
  url       = {https://openreview.net/forum?id=wps46mtC9B}
}
```
## License
Released under the Apache 2.0 license, inheriting the license of the InternVL3-8B base model.