Update README.md
README.md
CHANGED
@@ -19,7 +19,7 @@ base_model:
 
 # ProcVLM-2B
 
-ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage
+ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.
 
 <p align="center">
 <a href="https://procvlm.github.io/">Homepage</a> |
@@ -42,7 +42,6 @@ ProcVLM-2B is designed for research on robot learning, progress reward modeling,
 
 - estimating task completion progress from robot videos;
 - producing dense progress rewards from sparse demonstrations;
-- visualizing progress over time for manipulation rollouts;
 - adapting progress prediction to a new environment with one-shot LoRA fine-tuning.
 
 This model is not intended to be used as a safety-critical controller without downstream validation.
@@ -122,8 +121,17 @@ Given the recent observation and the task "{task}", first infer the remaining at
 The model should answer with reasoning and a final progress tag, for example:
 
 ```text
-
-
+To complete the task: Tower the blocks, the following steps are required:
+1. Grasp the green block.
+2. Place the green block onto the red block.
+Therefore, the estimated progress percentage is <progress>84.13%</progress>.
+```
+
+Or if the task is finished:
+
+```text
+The task requires: Tower the blocks. Images show no block outside the tower, no further steps required.
+Therefore, the estimated progress percentage is <progress>100.00%</progress>.
 ```
 
 ## vLLM Batch Inference
@@ -177,14 +185,14 @@ ProcVLM can be adapted to a new environment with one successful task demonstrati
 If you use ProcVLM, please cite the paper:
 
 ```bibtex
-@misc{
-
-
-
-
-
-
-
+@misc{feng2026procvlmlearningproceduregroundedprogress,
+title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation},
+author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
+year={2026},
+eprint={2605.08774},
+archivePrefix={arXiv},
+primaryClass={cs.RO},
+url={https://arxiv.org/abs/2605.08774},
 }
 ```
 
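The example responses in this change end with a `<progress>…</progress>` tag holding a percentage. A minimal sketch of how a caller might extract that value is given here. The helper name `parse_progress`, the choice to take the last tag in the response, and the clamping to the 0 to 100 range are assumptions made for illustration; they are not part of the ProcVLM code.

```python
import re
from typing import Optional

# Matches the tag format shown in the example responses,
# e.g. <progress>84.13%</progress>.
PROGRESS_TAG = re.compile(r"<progress>\s*(\d+(?:\.\d+)?)\s*%\s*</progress>")


def parse_progress(response: str) -> Optional[float]:
    """Return the tagged percentage, or None if no tag is found.

    Hypothetical helper: it takes the last tag in the response
    (assuming the final answer follows the reasoning) and clamps
    the value to [0, 100] as a defensive assumption.
    """
    matches = PROGRESS_TAG.findall(response)
    if not matches:
        return None
    value = float(matches[-1])
    return min(max(value, 0.0), 100.0)


if __name__ == "__main__":
    reply = (
        "To complete the task: Tower the blocks, the following steps are required:\n"
        "1. Grasp the green block.\n"
        "2. Place the green block onto the red block.\n"
        "Therefore, the estimated progress percentage is <progress>84.13%</progress>."
    )
    print(parse_progress(reply))
```

Returning `None` for a missing tag lets the caller decide whether to retry the generation or skip the frame, rather than silently treating a malformed response as zero progress.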