---
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
task_categories:
- reinforcement-learning
- robotics
- vision-language-modelling
tags:
- autonomous-driving
- carla
- imitation-learning
- vlm
- found-rl
size_categories:
- 10G-100G
---
# Found-RL's fine-tuned Vision-Language Models (VLMs)
## Overview
These VLMs accompany the paper "Found-RL: Foundation Model-Enhanced Reinforcement Learning for Autonomous Driving".
In this work, we use fine-tuned VLMs to provide feedback for reinforcement learning agents in autonomous driving scenarios.
- Paper: [Found-RL: Foundation Model-Enhanced Reinforcement Learning for Autonomous Driving](https://arxiv.org/abs/2602.10458)
- Code & Usage: https://github.com/ys-qu/found-rl
- Dataset: https://huggingface.co/datasets/ys-qu/found-rl_dataset
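Since the models target the `image-text-to-text` pipeline in `transformers`, loading one for inference should look roughly like the sketch below. The repo id `ys-qu/found-rl-vlm` is a hypothetical placeholder (use the actual checkpoint names from this repo), and the prompt is illustrative.

```python
# Minimal inference sketch, assuming a transformers image-text-to-text checkpoint.
# "ys-qu/found-rl-vlm" is a hypothetical placeholder repo id, not a confirmed name.
from transformers import pipeline
from PIL import Image

pipe = pipeline("image-text-to-text", model="ys-qu/found-rl-vlm")

image = Image.open("front_view.png")  # e.g., a front-view RGB frame from CARLA
result = pipe(
    images=image,
    text="Assess the ego vehicle's current driving decision.",  # illustrative prompt
    max_new_tokens=128,
)
print(result[0]["generated_text"])
```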
## Fine-tuning strategies
**RGB + Text (LoRA SFT):**
- Visual Input: Front-view RGB camera images (900 × 256).
- Method: LoRA (Low-Rank Adaptation) supervised fine-tuning; a minimal configuration sketch follows this list.
- Purpose: To enable the VLM to understand visual scenes and follow driving instructions based on realistic camera feeds.
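As a rough illustration of the LoRA setup, the sketch below wraps a base VLM with PEFT adapters. The base checkpoint id, target modules, and rank are assumptions for illustration, not the values used in the paper.

```python
# LoRA SFT sketch using PEFT; all hyperparameters here are illustrative
# assumptions, not the settings reported in the Found-RL paper.
from transformers import AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

base = AutoModelForImageTextToText.from_pretrained("base-vlm-checkpoint")  # placeholder id

lora_cfg = LoraConfig(
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```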
**Rendered BEV + Text (Full SFT):**
- Visual Input: Rendered bird's-eye-view (BEV) semantic maps (192 × 192).
- Method: Full-parameter supervised fine-tuning; see the sketch after this list.
- Purpose: To provide a holistic spatial understanding of the driving environment, allowing the VLM to act as an expert.
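For the full-parameter variant, no adapters are involved: every weight is updated. A minimal sketch, again with assumed checkpoint names and hyperparameters; `train_dataset` stands in for a preprocessed set of BEV + text pairs such as those in the linked dataset repo.

```python
# Full-parameter SFT sketch: unlike the LoRA variant, every weight is updated.
# Checkpoint id, dataset, and hyperparameters are assumptions for illustration.
from transformers import AutoModelForImageTextToText, Trainer, TrainingArguments

model = AutoModelForImageTextToText.from_pretrained("base-vlm-checkpoint")  # placeholder id

args = TrainingArguments(
    output_dir="found-rl-bev-sft",
    per_device_train_batch_size=4,  # assumed
    learning_rate=2e-5,             # assumed; full SFT typically uses a smaller LR than LoRA
    num_train_epochs=1,
    bf16=True,
)

# train_dataset: preprocessed BEV semantic maps paired with text (assumed prepared upstream)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```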
## Citation
If you use these VLMs in your research, please cite our paper:
```bibtex
@misc{qu2026foundrl,
      title={Found-RL: foundation model-enhanced reinforcement learning for autonomous driving},
      author={Yansong Qu and Zihao Sheng and Zilin Huang and Jiancong Chen and Yuhao Luo and Tianyi Wang and Yiheng Feng and Samuel Labi and Sikai Chen},
      year={2026},
      eprint={2602.10458},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.10458},
}
```