---
base_model: Qwen/Qwen3-VL-4B-Thinking
library_name: transformers
model_name: PhysicalAI-reason-VLA
tags:
- generated_from_trainer
- sft
- trl
- vision-language
- autonomous-driving
- reasoning
license: mit
datasets:
- mjf-su/PhysicalAI-reason-US
---

# PhysicalAI-reason-VLA

A vision-language driving policy fine-tuned from [mjf-su/PhysicalAI-base-VLA](https://huggingface.co/mjf-su/PhysicalAI-base-VLA) (itself based on [Qwen/Qwen3-VL-4B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking)) using supervised fine-tuning with [TRL](https://github.com/huggingface/trl).

This model extends the base waypoint-prediction VLA with **structured chain-of-thought reasoning** and **discrete driving decisions**, trained on 10k Gemini-annotated driving scenes for 2 epochs.

---

## Input / Output

**Inputs**
- A forward-facing camera image
- Past ego-vehicle waypoints in the vehicle's relative frame

**Output**
```
{
  "scene": "2–3 sentence static scene description",
  "move_justification": "2–3 sentence causal explanation linking scene to decisions",
}
[x.xx,y.yy,t.tttt]
[x.xx,y.yy,t.tttt]
...
```

The model produces three outputs in sequence: a reasoning trace (``), discrete longitudinal and lateral driving decisions (``), and future trajectory waypoints (``).

---

## Decision Tokens

Each `` block contains exactly one longitudinal and one lateral token.

**Longitudinal:** `` · `` · `` · `` · `` · `` · ``

**Lateral:** `` · `` · `` · `` · `` · `` · `` · `` · `` · `` · `` · ``

These are registered as genuine single tokens in the vocabulary (not subword decompositions), enabling efficient probability measurement over the full decision space with a single forward pass.

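Because each decision is a single vocabulary token, a distribution over the candidate decisions can be read from one next-token logit vector. The following is a minimal sketch of that idea, not the authors' evaluation code: `decision_distribution`, the toy logits, and the `decision_a`/`decision_b`/`decision_c` labels are all hypothetical. In practice the logit vector would presumably come from a forward pass (for example `model(**inputs).logits[0, -1]`) and the ids from `processor.tokenizer.convert_tokens_to_ids(...)`. Note that the result is renormalized over the candidates only, so it is not the raw full-vocabulary probability.

```python
import math
from typing import Dict, Sequence


def decision_distribution(
    logits: Sequence[float], candidate_ids: Dict[str, int]
) -> Dict[str, float]:
    """Softmax restricted to the candidate decision tokens.

    `logits` is the next-token logit vector at the position where a decision
    token is expected; `candidate_ids` maps a label to its vocabulary id.
    """
    selected = {name: logits[idx] for name, idx in candidate_ids.items()}
    peak = max(selected.values())  # subtract the max for numerical stability
    weights = {name: math.exp(value - peak) for name, value in selected.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Toy example: a 6-entry "vocabulary" with three hypothetical decision tokens.
toy_logits = [0.1, 2.0, -1.0, 1.0, 0.0, 3.0]
hypothetical_ids = {"decision_a": 1, "decision_b": 3, "decision_c": 5}

probs = decision_distribution(toy_logits, hypothetical_ids)
best = max(probs, key=probs.get)  # label with the highest probability
```

Running the same function once with the longitudinal ids and once with the lateral ids would give one distribution per decision axis from the same logit vector, provided both tokens are scored at the appropriate positions.
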

---

## Training

| | |
|---|---|
| **Base model** | [mjf-su/PhysicalAI-base-VLA](https://huggingface.co/mjf-su/PhysicalAI-base-VLA) |
| **Dataset** | [mjf-su/PhysicalAI-reason-US](https://huggingface.co/datasets/mjf-su/PhysicalAI-reason-US) |
| **Annotation** | Gemini batch API (chain-of-thought labels on real US driving data) |
| **Samples** | 10,000 |
| **Epochs** | 2 |
| **Method** | Completion-only SFT via TRL |

---

## Quick Start

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

model_id = "mjf-su/PhysicalAI-reason-VLA"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# Forward-facing camera frame and past ego waypoints, one [x,y,t] triple per line
image = Image.open("forward_camera.jpg")
past_waypoints = "[0.00,0.00,0.0000]\n[0.51,0.00,0.0001]\n..."

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful AI assistant ..."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": f"[PAST-VEHICLE-MOTION]:\n{past_waypoints}"}
        ]
    }
]

# Render the chat template to a prompt string, then tokenize it with the image
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# The returned sequence contains the prompt followed by the generated completion
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```

---

## Citation

```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2022,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```