### Overview
This document describes how to integrate a Vision-Language-Action-Critic (VLAC) into the SimpleVLA-RL training stack. The goal is to replace the simulator-provided terminal success signal during training with a VLAC-predicted value, while preserving the simulator signal for evaluation.
### Repository structure (high level)
- verl
Core RL training framework for SimpleVLA-RL.
Uses Ray for distributed orchestration and vLLM for fast inference/parallelization.
Originally built for LLM RL, adapted here for VLA training.
Ships with a pre-trained OpenVLA-OFT model and the LIBERO simulation environment as Python packages.
- examples
Entry scripts and helpers for training and evaluation.
`run_openvla_oft_rl.sh` is the main training entrypoint.
`eval_openvla_oft.sh` runs evaluation (with `trainer.val_only=True`).
- evo_vlac
External module imported from the VLAC project.
Provides models to predict task progress/success for robot manipulation.
`evo_vlac/examples` shows how to run the critic on images/videos and obtain pairwise critic, per-frame values, and done/success estimates.
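Since the exact `evo_vlac` API is not reproduced in this document, a hedged sketch of the interface the integration could target may help. Everything below (`VLACritic`, `detect_done`, `frame_values`, `DummyCritic`) is a hypothetical placeholder, not the actual `evo_vlac` API; the real calls should be looked up in `evo_vlac/examples`.

```python
from typing import List, Protocol

import numpy as np


class VLACritic(Protocol):
    """Hypothetical interface for the VLAC critic (placeholder, not the real evo_vlac API)."""

    def detect_done(self, frames: List[np.ndarray], task: str) -> bool:
        """Return True if the task is judged complete given the frames so far."""
        ...

    def frame_values(self, frames: List[np.ndarray], task: str) -> List[float]:
        """Return a per-frame progress/value estimate in [0, 1]."""
        ...


class DummyCritic:
    """Trivial stand-in for smoke tests: 'done' when the last frame is bright."""

    def detect_done(self, frames, task):
        return bool(np.mean(frames[-1]) > 200)

    def frame_values(self, frames, task):
        # Fake monotone progress signal, one value per frame.
        return [min(1.0, (i + 1) / len(frames)) for i in range(len(frames))]
```

Coding against a small protocol like this keeps the training loop testable with `DummyCritic` before the real service is wired in.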
### Objective
- Training
Do not use the environment `done` during training. Termination and terminal rewards are determined by VLAC only.
After each step, call VLAC to detect `done`.
- If VLAC detects `done == True`: terminate the episode immediately and set the terminal reward to 1.0 (no need to query the VLAC value again).
- Else if `max_step` is reached: terminate the rollout, call the VLAC model to compute the per-frame value list, and use the last value as the reward.
- Else: continue the episode.
- Evaluation
Keep using the environment `done` signal to compute success rate. VLAC is not used to determine evaluation success.
- Rollout termination logic
- Training: "done" signal from VLAC
- Testing: "done" signal from the environment
- Reward logic
- Training: if VLAC detects `done`, reward = 1.0; otherwise, the reward is the last VLAC value.
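The training-time termination and reward rules above can be sketched as a per-step loop. This is a sketch under stated assumptions: `env`, `policy`, and `critic` (with `detect_done` / `frame_values`) are hypothetical names standing in for the real SimpleVLA-RL and VLAC objects.

```python
def run_training_episode(env, policy, critic, task: str, max_step: int):
    """Roll out one training episode; termination and reward come from VLAC only.

    `env`, `policy`, and `critic` are placeholders for the real objects.
    The environment's own `done` flag is deliberately ignored during training.
    """
    frames = []
    obs = env.reset()
    for _ in range(max_step):
        action = policy.act(obs)
        obs, _env_done, frame = env.step(action)  # env done ignored in training
        frames.append(frame)

        if critic.detect_done(frames, task):
            # VLAC says the task is complete: terminate immediately with
            # terminal reward 1.0; no extra value query is needed.
            return frames, 1.0

    # max_step reached: reward is the last entry of the VLAC value list.
    values = critic.frame_values(frames, task)
    return frames, values[-1]
```

At evaluation time this loop would not be used; the environment `done` keeps driving the success-rate computation.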
### Integration approach (service-oriented)
- Why service
`verl` is Ray/vLLM-managed; VLAC is managed via Hugging Face and ms-swift. To keep the two systems decoupled and independently switchable, expose VLAC as a lightweight service that `verl` calls when needed.
- Deployment flexibility
The VLAC service can run on the same node (sharing GPU memory fraction) or on a different node.
- Transport options
During training, frames are kept in memory as a Python list (TODO: double-confirm this in the code during implementation); they are not written to disk until the episode finishes. Therefore, prefer sending the frames directly in the service request (e.g., serialized into the request body) rather than exchanging file paths.
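Since the frames live in memory, one lightweight transport option is to serialize them straight into the request body. Below is a minimal sketch using base64-encoded raw arrays; the endpoint URL and JSON schema are assumptions for illustration, not an existing VLAC service API.

```python
import base64
import json

import numpy as np


def encode_frame(frame: np.ndarray) -> dict:
    """Pack one frame (e.g., H x W x 3 uint8) as base64 plus shape metadata."""
    return {
        "shape": list(frame.shape),
        "dtype": str(frame.dtype),
        "data": base64.b64encode(frame.tobytes()).decode("ascii"),
    }


def decode_frame(blob: dict) -> np.ndarray:
    """Inverse of encode_frame, used on the service side."""
    raw = base64.b64decode(blob["data"])
    return np.frombuffer(raw, dtype=blob["dtype"]).reshape(blob["shape"])


def build_request(frames, task: str) -> str:
    """JSON body for a hypothetical POST /vlac/value endpoint."""
    return json.dumps({"task": task, "frames": [encode_frame(f) for f in frames]})


# The actual call would be something like (left unexecuted here):
#   requests.post("http://vlac-host:8000/vlac/value", data=build_request(frames, task))
```

JPEG/PNG compression (e.g., via Pillow or OpenCV) would shrink payloads considerably; raw base64 is shown only because it needs no extra dependencies.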
### Additional Notes
- Keep the implementation simple; avoid complicated environment setups such as Docker.
- GPU resource sharing: Likely to share GPUs with SimpleVLA-RL on H100 80GB cards. During rollout generation, usage is ~20–30 GB; during actor gradient updates, peaks at ~60–70 GB.
- Development environment: Use this desktop only for editing and light, non-intrusive smoke tests. I will push to a server for real training/evaluation.
- Collaboration: Before starting, I will confirm the open questions (e.g., the frame-buffer TODO) with you and then proceed with the plan.