
ProcVLM-2B

ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.

Homepage | arXiv | Code

Model Details

  • Model name: ce-amtic/ProcVLM-2B
  • Model type: Vision-language model for robot progress reward inference
  • Architecture: Qwen3-VL-style multimodal causal language model
  • Input: One or more RGB images sampled from a robot trajectory, plus a natural-language task description
  • Output: Textual reasoning and a completion estimate formatted as <progress>XX%</progress>
  • Primary use case: Frame-wise progress reward prediction for robotic manipulation videos

Intended Use

ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:

  • estimating task completion progress from robot videos;
  • producing dense progress rewards from sparse demonstrations;
  • adapting progress prediction to a new environment with one-shot LoRA fine-tuning.

This model is not intended for use as a safety-critical controller without downstream validation.

Quick Start

Clone the ProcVLM repository and install the environment:

git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM

uv sync --python 3.10
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation

Run progress reward inference on a video:

python evqa/inference.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_predictions.jsonl \
    --task "fold the red T-shirt" \
    --window_size 8

Each JSONL row contains a sampled frame_index and its corresponding progress prediction.
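The JSONL output can be loaded back with a few lines of standard-library Python. This is a minimal sketch that assumes each line is a JSON object carrying at least the `frame_index` and `progress` keys described above:

```python
import json

def load_progress_predictions(jsonl_path):
    """Read per-frame progress predictions from the inference output.

    Assumes each non-empty line is a JSON object with at least
    "frame_index" and "progress" keys, as described above.
    """
    records = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                records.append(json.loads(line))
    return records
```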

You can also visualize predictions as a video:

python evqa/eval/visualize_progress_video.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_visualization.mp4 \
    --task "fold the red T-shirt" \
    --window_size 8

Python API

The same inference workflow is available through infer_progress_from_video():

from evqa.inference import infer_progress_from_video

records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)

for item in records:
    print(item["frame_index"], item["progress"])

The returned records include:

  • frame_index: source video frame index;
  • timestamp_sec: source video timestamp;
  • window_frame_indices: frame indices used as the model input window;
  • progress: parsed progress value in [0, 100];
  • reasoning: model reasoning with the progress tag removed;
  • model_output: raw model output.
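One common way to use these records downstream is to turn the per-frame progress estimates into dense reward increments. The sketch below computes the change in normalized progress between consecutive sampled frames; this is an illustrative assumption, not the paper's reward definition:

```python
def progress_deltas(records):
    """Turn per-frame progress estimates into dense reward increments.

    `records` is the list returned by infer_progress_from_video();
    each item carries a "frame_index" and a "progress" in [0, 100].
    The reward at each sampled frame is the change in normalized
    progress since the previous one (a simple sketch, not the
    paper's reward definition).
    """
    ordered = sorted(records, key=lambda r: r["frame_index"])
    rewards = []
    prev = 0.0
    for rec in ordered:
        p = rec["progress"] / 100.0  # normalize to [0, 1]
        rewards.append((rec["frame_index"], p - prev))
        prev = p
    return rewards
```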

Prompt Format

ProcVLM uses a procedural progress prompt. The default template is:

Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.
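When building prompts programmatically, the template above can be filled in with an ordinary `str.format` call; the constant name here is an illustrative choice, not an identifier from the repository:

```python
# Hypothetical constant holding the default template shown above.
PROGRESS_PROMPT = (
    'Given the recent observation and the task "{task}", first infer the '
    "remaining atomic actions required to complete the task. Then estimate "
    "the current completion percentage and output it as a float wrapped by "
    "<progress> tags."
)

# Substitute the task description into the template.
prompt = PROGRESS_PROMPT.format(task="fold the red T-shirt")
```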

The model should answer with reasoning and a final progress tag, for example:

To complete the task: Tower the blocks, the following steps are required:
1. Grasp the green block.
2. Place the green block onto the red block.
Therefore, the estimated progress percentage is <progress>84.13%</progress>.

Or if the task is finished:

The task requires: Tower the blocks. Images show no block outside the tower, no further steps required. 
Therefore, the estimated progress percentage is <progress>100.00%</progress>.
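Responses like the two above can be post-processed with a small regex. This is a hedged sketch (the repository may parse the tag differently) that extracts the value inside the `<progress>` tags and clamps it to the valid range:

```python
import re

# Match a number (with optional decimals and "%") inside <progress> tags.
_PROGRESS_RE = re.compile(
    r"<progress>\s*([0-9]+(?:\.[0-9]+)?)\s*%?\s*</progress>"
)

def parse_progress(model_output):
    """Extract the progress value in [0, 100] from a model response.

    Returns the parsed float, or None when no well-formed
    <progress> tag is present.
    """
    match = _PROGRESS_RE.search(model_output)
    if match is None:
        return None
    value = float(match.group(1))
    return min(max(value, 0.0), 100.0)  # clamp to the valid range
```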

vLLM Batch Inference

For high-throughput multi-image inference, the ProcVLM repository provides evqa.model.batch_chat_with_vllm():

from evqa.model import batch_chat_with_vllm

outputs = batch_chat_with_vllm(
    batch_items=[
        {
            "image": [
                "frames/frame_000000.jpg",
                "frames/frame_000010.jpg",
                "frames/frame_000020.jpg",
            ],
            "conversations": [
                {
                    "from": "human",
                    "value": 'Given the recent observation and the task "fold the red T-shirt", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.',
                }
            ],
        }
    ],
    model_path="ce-amtic/ProcVLM-2B",
    max_new_tokens=1024,
    temperature=0.0,
    tp=1,
)
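A batch can be assembled mechanically from a list of frame paths. The helper below follows the `batch_items` format shown above and builds one item per sliding window; the window sampling scheme and the function name are illustrative assumptions, not the repository's exact implementation:

```python
def build_batch_items(frame_paths, task, window_size=8, stride=1):
    """Assemble vLLM batch items, one per sliding window of frames.

    Follows the batch_items format shown above; the sliding-window
    sampling here is an illustrative assumption.
    """
    prompt = (
        f'Given the recent observation and the task "{task}", first infer '
        "the remaining atomic actions required to complete the task. Then "
        "estimate the current completion percentage and output it as a "
        "float wrapped by <progress> tags."
    )
    items = []
    # One item per window of `window_size` consecutive frames.
    for end in range(window_size, len(frame_paths) + 1, stride):
        window = frame_paths[end - window_size:end]
        items.append({
            "image": list(window),
            "conversations": [{"from": "human", "value": prompt}],
        })
    return items
```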

One-Shot LoRA Adaptation

ProcVLM can be adapted to a new environment from a single successful task demonstration, optionally supplemented with additional successful or failed demonstrations. See the one-shot adaptation guide for:

  • annotating coarse sub-task stages with the visual UI;
  • generating a LoRA fine-tuning dataset;
  • running evqa/one-shot/lora_oneshot.sh;
  • using the saved LoRA checkpoint with evqa/inference.py --use_lora.

Limitations

  • The model estimates progress from visual observations and task text; it may be unreliable under strong domain shift, severe occlusion, unusual camera viewpoints, or ambiguous task descriptions.
  • The progress output is a learned estimate, not a calibrated physical measurement.
  • For long-horizon videos, inference quality depends on the sampled frame window and the task description.
  • The model should be validated in the target robot environment before being used as a reward signal for training or deployment.

Citation

If you use ProcVLM, please cite the paper:

@misc{feng2026procvlmlearningproceduregroundedprogress,
      title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation}, 
      author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
      year={2026},
      eprint={2605.08774},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.08774}, 
}

License

Please refer to the license information on this model repository and the upstream base model license before using the weights.
