language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- robotics
- vision-language-model
- progress-reward
- robot-manipulation
- qwen3-vl
- procvlm
license: cc-by-4.0
datasets:
- ce-amtic/ProcVQA-20M-annotations
- ce-amtic/ProcCorpus-60M
base_model:
- Qwen/Qwen3-VL-2B-Instruct

# ProcVLM-2B

ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.

<p align="center">
  <a href="https://procvlm.github.io/">Homepage</a> |
  <a href="https://arxiv.org/abs/2605.08774">arXiv</a> |
  <a href="https://github.com/ProcVLM/ProcVLM">Code</a>
</p>

## Model Details

- **Model name:** `ce-amtic/ProcVLM-2B`
- **Model type:** Vision-language model for robot progress reward inference
- **Architecture:** Qwen3-VL-style multimodal causal language model
- **Input:** One or more RGB images sampled from a robot trajectory, plus a natural-language task description
- **Output:** Textual reasoning and a completion estimate formatted as `<progress>XX%</progress>`
- **Primary use case:** Frame-wise progress reward prediction for robotic manipulation videos

## Intended Use

ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:

- estimating task completion progress from robot videos;
- producing dense progress rewards from sparse demonstrations;
- adapting progress prediction to a new environment with one-shot LoRA fine-tuning.

This model is not intended to be used as a safety-critical controller without downstream validation.

## Quick Start

Clone the ProcVLM repository and install the environment:

```bash
git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM
uv sync --python 3.10
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation
```

Run progress reward inference on a video:

```bash
python evqa/inference.py \
  --model_path ce-amtic/ProcVLM-2B \
  --video_path path/to/your/video.mp4 \
  --output_path path/to/progress_predictions.jsonl \
  --task "fold the red T-shirt" \
  --window_size 8
```

Each JSONL row contains a sampled `frame_index` and its corresponding `progress` prediction.

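
As a minimal sketch of consuming that file, the snippet below reads a JSONL output and returns `(frame_index, progress)` pairs sorted by frame. It assumes each row is a JSON object carrying the `frame_index` and `progress` keys described above and that `progress` is numeric. The helper name `load_progress_jsonl` and the demo rows are illustrative and are not part of the ProcVLM repository.

```python
import json
import tempfile
from pathlib import Path


def load_progress_jsonl(path):
    """Read (frame_index, progress) pairs from a JSONL file, sorted by frame."""
    pairs = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            pairs.append((int(row["frame_index"]), float(row["progress"])))
    return sorted(pairs)


# Demo with made-up rows (not real model output), written to a temporary file.
demo_rows = [
    {"frame_index": 20, "progress": 41.5},
    {"frame_index": 0, "progress": 3.0},
    {"frame_index": 10, "progress": 18.25},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "progress_predictions.jsonl"
    demo_path.write_text(
        "\n".join(json.dumps(r) for r in demo_rows) + "\n", encoding="utf-8"
    )
    pairs = load_progress_jsonl(demo_path)

for frame_index, progress in pairs:
    print(f"frame {frame_index:6d}: {progress:6.2f}%")
```

In practice, `load_progress_jsonl` would be pointed at the file passed as `--output_path`.
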
You can also visualize predictions as a video:

```bash
python evqa/eval/visualize_progress_video.py \
  --model_path ce-amtic/ProcVLM-2B \
  --video_path path/to/your/video.mp4 \
  --output_path path/to/progress_visualization.mp4 \
  --task "fold the red T-shirt" \
  --window_size 8
```

## Python API

The same inference workflow is available through `infer_progress_from_video()`:

```python
from evqa.inference import infer_progress_from_video

records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)

for item in records:
    print(item["frame_index"], item["progress"])
```

The returned records include:

- `frame_index`: source video frame index;
- `timestamp_sec`: source video timestamp;
- `window_frame_indices`: frame indices used as the model input window;
- `progress`: parsed progress value in `[0, 100]`;
- `reasoning`: model reasoning with the progress tag removed;
- `model_output`: raw model output.

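
One way such records might be turned into a dense reward signal is to reward the change in progress between consecutive sampled frames. The sketch below shows this common shaping choice; it is not necessarily the scheme used in the ProcVLM paper. The helper `progress_to_rewards`, the assumption that the task starts at 0% progress, and the sample records are all hypothetical.

```python
def progress_to_rewards(records, start_progress=0.0):
    """Reward each sampled frame by its progress gain over the previous frame.

    Progress is given in [0, 100]; rewards are scaled to fractions of the task,
    so the rewards sum to (final_progress - start_progress) / 100.
    """
    ordered = sorted(records, key=lambda r: r["frame_index"])
    rewards = []
    prev = float(start_progress)  # assumption: the episode starts at 0% progress
    for rec in ordered:
        cur = float(rec["progress"])
        rewards.append(
            {"frame_index": rec["frame_index"], "reward": (cur - prev) / 100.0}
        )
        prev = cur
    return rewards


# Made-up records that only carry the two fields this sketch needs.
sample_records = [
    {"frame_index": 0, "progress": 0.0},
    {"frame_index": 10, "progress": 25.0},
    {"frame_index": 20, "progress": 20.0},  # a regression yields a negative reward
    {"frame_index": 30, "progress": 100.0},
]
rewards = progress_to_rewards(sample_records)
for r in rewards:
    print(r["frame_index"], round(r["reward"], 4))
```

Because the rewards telescope, their sum depends only on the first and last progress values, so noisy intermediate estimates shift reward between frames without changing the episode total.
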
## Prompt Format

ProcVLM uses a procedural progress prompt. The default template is:

```text
Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.
```

The model should answer with reasoning and a final progress tag, for example:

```text
To complete the task: Tower the blocks, the following steps are required:
1. Grasp the green block.
2. Place the green block onto the red block.
Therefore, the estimated progress percentage is <progress>84.13%</progress>.
```

Or if the task is finished:

```text
The task requires: Tower the blocks. Images show no block outside the tower, no further steps required.
Therefore, the estimated progress percentage is <progress>100.00%</progress>.
```

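
The prompt and answer format above can be handled with a few lines of standard-library Python. The following is a minimal sketch, not the repository's own parser: `build_prompt`, `parse_progress`, and the regular expression are hypothetical helpers written against the template and example answers shown here, and the clamping to `[0, 100]` is an assumption.

```python
import re

# Default template copied from the model card; {task} is the only placeholder.
PROMPT_TEMPLATE = (
    'Given the recent observation and the task "{task}", first infer the '
    "remaining atomic actions required to complete the task. Then estimate "
    "the current completion percentage and output it as a float wrapped by "
    "<progress> tags."
)

# Matches e.g. <progress>84.13%</progress>; the percent sign is optional.
PROGRESS_RE = re.compile(r"<progress>\s*([0-9]+(?:\.[0-9]+)?)\s*%?\s*</progress>")


def build_prompt(task):
    """Fill the task description into the default template."""
    return PROMPT_TEMPLATE.format(task=task)


def parse_progress(model_output):
    """Return (progress, reasoning); progress is None if no tag is found."""
    matches = list(PROGRESS_RE.finditer(model_output))
    if not matches:
        return None, model_output.strip()
    value = float(matches[-1].group(1))  # use the last tag in the output
    value = min(100.0, max(0.0, value))  # assumption: clamp to [0, 100]
    reasoning = PROGRESS_RE.sub("", model_output).strip()
    return value, reasoning


example_output = (
    "To complete the task: Tower the blocks, the following steps are required:\n"
    "1. Grasp the green block.\n"
    "2. Place the green block onto the red block.\n"
    "Therefore, the estimated progress percentage is <progress>84.13%</progress>."
)
progress, reasoning = parse_progress(example_output)
print(build_prompt("fold the red T-shirt"))
print(progress)
```

Returning `None` for a missing tag, rather than a default number, lets the caller decide whether to retry generation or drop the frame.
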
## vLLM Batch Inference

For high-throughput multi-image inference, the ProcVLM repository provides `evqa.model.batch_chat_with_vllm()`:

```python
from evqa.model import batch_chat_with_vllm

outputs = batch_chat_with_vllm(
    batch_items=[
        {
            "image": [
                "frames/frame_000000.jpg",
                "frames/frame_000010.jpg",
                "frames/frame_000020.jpg",
            ],
            "conversations": [
                {
                    "from": "human",
                    "value": 'Given the recent observation and the task "fold the red T-shirt", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.',
                }
            ],
        }
    ],
    model_path="ce-amtic/ProcVLM-2B",
    max_new_tokens=1024,
    temperature=0.0,
    tp=1,
)
```

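
When many frames need scoring, the `batch_items` list can be generated rather than written by hand. The sketch below builds one item per frame, each carrying a trailing window of the most recent frames, using the item schema from the example above. How the repository itself samples windows is not shown here, so the trailing-window strategy, the helper name `build_batch_items`, and the frame paths are assumptions.

```python
PROMPT_TEMPLATE = (
    'Given the recent observation and the task "{task}", first infer the '
    "remaining atomic actions required to complete the task. Then estimate "
    "the current completion percentage and output it as a float wrapped by "
    "<progress> tags."
)


def build_batch_items(frame_paths, task, window_size=8):
    """Build one batch item per frame from a trailing window of frames.

    Item i contains up to `window_size` frames ending at frame i, so early
    items hold fewer images than later ones.
    """
    prompt = PROMPT_TEMPLATE.format(task=task)
    items = []
    for end in range(1, len(frame_paths) + 1):
        start = max(0, end - window_size)
        items.append(
            {
                "image": list(frame_paths[start:end]),
                "conversations": [{"from": "human", "value": prompt}],
            }
        )
    return items


# Hypothetical frame files sampled every 10 frames.
frame_paths = [f"frames/frame_{i:06d}.jpg" for i in range(0, 50, 10)]
batch_items = build_batch_items(frame_paths, "fold the red T-shirt", window_size=3)
print(len(batch_items), batch_items[-1]["image"])
```

The resulting list has the same shape as the `batch_items` argument in the call above and could be passed to `batch_chat_with_vllm()` in its place.
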
## One-Shot LoRA Adaptation

ProcVLM can be adapted to a new environment with one successful task demonstration, plus optional additional successful or unsuccessful demonstrations. See the [one-shot adaptation guide](https://github.com/ProcVLM/ProcVLM/blob/main/evqa/docs/oneshot_adaptation.md) for:

- annotating coarse sub-task stages with the visual UI;
- generating a LoRA fine-tuning dataset;
- running `evqa/one-shot/lora_oneshot.sh`;
- using the saved LoRA checkpoint with `evqa/inference.py --use_lora`.

## Limitations

- The model estimates progress from visual observations and task text; it may be unreliable under strong domain shift, severe occlusion, unusual camera viewpoints, or ambiguous task descriptions.
- The progress output is a learned estimate, not a calibrated physical measurement.
- For long-horizon videos, inference quality depends on the sampled frame window and the task description.
- The model should be validated in the target robot environment before being used as a reward signal for training or deployment.

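
As one possible starting point for such validation, the sketch below measures how often predicted progress is non-decreasing across consecutive sampled frames of a successful demonstration. It rests on the assumption that progress on a successful run should rise roughly monotonically; it is a heuristic sanity check, not an evaluation protocol from the paper. The helper `monotonicity_score`, its `tolerance` parameter, and the sample records are hypothetical.

```python
def monotonicity_score(records, tolerance=0.0):
    """Fraction of consecutive frame pairs whose progress does not drop.

    A drop smaller than or equal to `tolerance` (in percentage points)
    is still counted as non-decreasing.
    """
    ordered = sorted(records, key=lambda r: r["frame_index"])
    values = [float(r["progress"]) for r in ordered]
    if len(values) < 2:
        return 1.0  # nothing to compare
    ok = sum(1 for a, b in zip(values, values[1:]) if b >= a - tolerance)
    return ok / (len(values) - 1)


# Made-up predictions for a successful demonstration.
demo_records = [
    {"frame_index": 0, "progress": 0.0},
    {"frame_index": 10, "progress": 25.0},
    {"frame_index": 20, "progress": 20.0},
    {"frame_index": 30, "progress": 60.0},
    {"frame_index": 40, "progress": 100.0},
]
strict_score = monotonicity_score(demo_records)
tolerant_score = monotonicity_score(demo_records, tolerance=5.0)
print(strict_score, tolerant_score)
```

A low score on known-good demonstrations would suggest the estimates are too noisy in that environment to serve as a reward without adaptation or smoothing.
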
## Citation

If you use ProcVLM, please cite the paper:

```bibtex
@misc{feng2026procvlmlearningproceduregroundedprogress,
      title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation},
      author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
      year={2026},
      eprint={2605.08774},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.08774},
}
```

## License

Please refer to the license information on this model repository and the upstream base model license before using the weights.