---
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
task_categories:
- reinforcement-learning
- robotics
- vision-language-modelling
tags:
- autonomous-driving
- carla
- imitation-learning
- vlm
- found-rl
size_categories:
- 10G-100G
---

# Found-RL's fine-tuned Vision-Language Models (VLMs)

## 📜 Overview

These VLMs accompany the paper **"Found-RL: Foundation Model-Enhanced Reinforcement Learning for Autonomous Driving"**. In this work, we use the fine-tuned VLMs to provide feedback to reinforcement learning agents in autonomous driving scenarios.

- **📄 Paper:** [Found-RL: foundation model-enhanced reinforcement learning for autonomous driving](https://www.arxiv.org/pdf/2602.10458)
- **💻 Code & Usage:** [https://github.com/ys-qu/found-rl](https://github.com/ys-qu/found-rl)
- **📂 Dataset:** [https://huggingface.co/datasets/ys-qu/found-rl_dataset](https://huggingface.co/datasets/ys-qu/found-rl_dataset)

## 📦 Fine-tuning strategies

1. **RGB + Text (LoRA SFT):**
   - **Visual Input:** Front-view RGB camera images (shape = 900 × 256).
   - **Method:** **LoRA (Low-Rank Adaptation)** Supervised Fine-Tuning.
   - **Purpose:** To enable the VLM to understand visual scenes and follow driving instructions based on realistic camera feeds.

2. **Rendered BEV + Text (Full SFT):**
   - **Visual Input:** Rendered Bird's-Eye-View (BEV) semantic maps (shape = 192 × 192).
   - **Method:** **Full-Parameter** Supervised Fine-Tuning.
   - **Purpose:** To provide a holistic spatial understanding of the driving environment, allowing the VLM to act as an expert.

## 📚 Citation

If you use these VLMs in your research, please cite our paper:

```bibtex
@misc{qu2026foundrl,
  title={Found-RL: foundation model-enhanced reinforcement learning for autonomous driving},
  author={Yansong Qu and Zihao Sheng and Zilin Huang and Jiancong Chen and Yuhao Luo and Tianyi Wang and Yiheng Feng and Samuel Labi and Sikai Chen},
  year={2026},
  eprint={2602.10458},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.10458},
}
```
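
## 🚀 Usage sketch

Below is a minimal inference sketch for the full-SFT BEV checkpoint, assuming it loads through the standard `transformers` image-text-to-text interface (`AutoProcessor` + `AutoModelForImageTextToText`). The repository id, image path, and prompt are placeholders, and the exact chat format depends on the underlying base VLM; see the GitHub repository linked above for the actual usage.

```python
# Minimal sketch, assuming the checkpoint follows the standard
# transformers image-text-to-text API. The repo id, image path, and
# prompt below are placeholders, not the released names.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "ys-qu/found-rl-vlm-bev-sft"  # placeholder repo id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# A rendered 192 x 192 BEV semantic map, as described above.
image = Image.open("bev_example.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Assess the ego vehicle's current driving decision in this scene."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```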
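
The RGB + Text variant was fine-tuned with LoRA. If its weights are published as a PEFT adapter rather than already merged into the base model, one way to load them is the sketch below; the base-model and adapter ids are placeholders, not the actual released repositories.

```python
# Hypothetical sketch: only relevant if the RGB + Text LoRA weights are
# released as a separate PEFT adapter rather than merged into the base
# model. Both repo ids below are placeholders.
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

base_id = "Qwen/Qwen2-VL-2B-Instruct"        # placeholder base VLM
adapter_id = "ys-qu/found-rl-vlm-rgb-lora"   # placeholder adapter repo

processor = AutoProcessor.from_pretrained(base_id)
base_model = AutoModelForImageTextToText.from_pretrained(base_id, device_map="auto")

# Attach the LoRA adapter; merge_and_unload() folds the low-rank updates
# into the base weights so inference runs without the peft wrapper.
model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.merge_and_unload()
```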