---
library_name: transformers
pipeline_tag: image-text-to-text
base_model: google/paligemma2-3b-mix-224
tags:
  - robotics
  - failure-detection
  - vision-language
---
# I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models

I-FailSense is a vision-language model (VLM) framework designed to detect language-conditioned robotic failures. It focuses on identifying semantic misalignment errors, where a robot executes a task that is semantically meaningful but inconsistent with the user's instruction.

## Model Description

The model architecture consists of a base VLM (PaliGemma-2 3B) fine-tuned using LoRA, combined with lightweight classification heads (FS blocks) attached to internal layers. An ensembling mechanism aggregates predictions from these blocks to provide grounded arbitration for failure detection. While trained primarily on semantic misalignment, I-FailSense generalizes well to broader robotic failure categories and different environments.
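To make the ensembling idea concrete, here is a minimal sketch of how predictions from several FS blocks could be aggregated into a single failure decision. The aggregation rule shown (softmax per block, then mean failure probability with a 0.5 threshold) is an illustrative assumption, not the exact arbitration mechanism used by I-FailSense; see the official repository for the real implementation.

```python
import numpy as np

def ensemble_failure_probability(block_logits: list[np.ndarray]) -> float:
    """Aggregate binary (success, failure) logits from several FS blocks.

    Each entry of `block_logits` is one 2-element logit vector produced by
    a classification head attached to an internal VLM layer. The mean of
    per-block failure probabilities is an assumed aggregation rule.
    """
    probs = []
    for logits in block_logits:
        # Softmax over the two classes (shifted for numerical stability).
        z = logits - logits.max()
        p = np.exp(z) / np.exp(z).sum()
        probs.append(p[1])  # index 1 = probability of "failure"
    return float(np.mean(probs))

# Three hypothetical FS blocks scoring one episode.
blocks = [np.array([0.2, 1.5]), np.array([-0.3, 0.9]), np.array([1.0, 0.4])]
score = ensemble_failure_probability(blocks)
is_failure = score > 0.5
```

Averaging probabilities rather than raw logits keeps each block's vote bounded, so a single overconfident head cannot dominate the ensemble.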

## How to Get Started

To use or evaluate the model, please use the implementation provided in the official GitHub repository.

## Evaluation

You can evaluate the model (loading both the LoRA weights and the FS block weights) on the CALVIN benchmark with the following command:

```bash
python src/evaluate.py \
    --vlm_model_id ACIDE/FailSense-Calvin-1p-3b \
    --fs_id FS/FailSense-Calvin-1p-3b \
    --dataset_name calvin \
    --result_folder results_calvin_1p
```

## Citation

If you find this work useful, please cite:

```bibtex
@inproceedings{ifailsense2026,
  title        = {I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models},
  author       = {Clemence Grislain and Hamed Rahimi and Olivier Sigaud and Mohamed Chetouani},
  booktitle    = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
  year         = {2026},
  url          = {https://arxiv.org/abs/2509.16072}
}
```