| --- |
| license: mit |
| tags: |
| - autonomous-driving |
| - trajectory-prediction |
| - kitscenes |
| - kinematic-bicycle-model |
| --- |
| |
| # KITScenes LongTail Challenge — Solution |
|
|
| > **Challenge**: [KIT-MRT/KITScenes-LongTail-Challenge](https://huggingface.co/spaces/KIT-MRT/KITScenes-LongTail-Challenge) |
| > **Dataset**: [KIT-MRT/KITScenes-LongTail](https://huggingface.co/datasets/KIT-MRT/KITScenes-LongTail) |
| > **Paper**: [arXiv:2603.23607](https://arxiv.org/abs/2603.23607) |
|
|
| ## Approach: Few-Shot CoT Kinematic Bicycle Model |
|
|
| Based on the best open-source method from the paper (MMS 4.24 with Gemma 3 12B), this solution uses: |
|
|
| 1. **VLM Reasoning** — analyzes driving scenarios using past trajectory + instruction |
| 2. **Kinematic Bicycle Model** (Kong et al. 2015) — converts action commands → 25 precise waypoints |
| 3. **Few-Shot Chain-of-Thought** — 3 training examples guide the VLM's structured output |
|
|
| ### Supported VLM Backends |
|
|
| | Backend | Command | Notes | |
| |---|---|---| |
| | **Ollama** (local) | `--backend ollama --ollama-model qwen3.5-vl:9b` | No rate limits, free, best for batch | |
| | **HF Inference** | `--backend hf` | Llama-4-Scout-17B via novita/groq | |
| | **Gemini** | `--backend gemini` | Needs `GOOGLE_API_KEY`, free tier rate-limited | |
| | **Fallback** | `--no-vlm` | Instruction-keyword heuristic, no API needed | |
|
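The `--no-vlm` fallback is described only as an instruction-keyword heuristic. As a rough illustration of that idea, and not the repository's actual rule set, the sketch below maps an instruction string to one of the five steering commands from the calibration table. The function name, the keyword order, and the choice to treat overtaking as a slight move to the left (right-hand traffic) are all assumptions.

```python
# Hypothetical keyword rules: (substring to look for, steering command).
# The first matching rule wins, so more specific keywords come first.
KEYWORD_RULES = [
    ("u-turn", "turning left"),            # assumption: U-turns go left
    ("turn left", "turning left"),
    ("turn right", "turning right"),
    ("left lane", "turning slightly left"),
    ("right lane", "turning slightly right"),
    ("overtake", "turning slightly left"),  # assumption: right-hand traffic
]


def heuristic_steer(instruction):
    """Pick a steering command from keywords in the driving instruction."""
    text = instruction.lower()
    for keyword, command in KEYWORD_RULES:
        if keyword in text:
            return command
    # No keyword matched (e.g. "drive straight on"): keep the wheel straight.
    return "steering straight"
```

A rule list keeps the behaviour easy to audit against the instruction types in the test set; anything the VLM would have inferred from context is simply lost in this mode.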
|
| ## Quick Start (Local with Ollama) |
|
|
| ```bash |
| # 1. Clone |
| git clone https://huggingface.co/ValeraZSD/kitscenes-longtail-solution |
| cd kitscenes-longtail-solution |
| |
| # 2. Install deps |
| pip install numpy huggingface_hub pyarrow |
| |
| # 3. Start the Ollama server (in another terminal), then pull the model |
| ollama serve |
| ollama pull qwen3.5-vl:9b # or gemma4:e4b |
| |
| # 4. Run generation |
| python scripts/generate_production.py \ |
| --backend ollama \ |
| --ollama-model qwen3.5-vl:9b \ |
| --metadata data/test_metadata.json \ |
| --output submissions/submission_ollama.jsonl \ |
| --upload-repo ValeraZSD/kitscenes-submissions \ |
| --upload-filename submission_ollama_v1.jsonl |
| |
| # 5. Validate |
| python scripts/validate.py submissions/submission_ollama.jsonl |
| ``` |
|
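To make step 5 more concrete, here is a minimal sketch of the kind of per-line check a submission validator might run. It is not the code in `scripts/validate.py`. It assumes each JSONL line is an object with `scenario_id`, `future_trajectory` (25 `[x, y]` pairs), and `reasoning`, as named in the pipeline description; `check_record` is a hypothetical helper.

```python
import json
import math

N_WAYPOINTS = 25  # 5 s of future at 5 Hz, per the pipeline description


def check_record(line):
    """Return a list of problems for one JSONL line (empty list means OK)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(rec, dict):
        return ["record is not a JSON object"]

    problems = []
    for key in ("scenario_id", "future_trajectory", "reasoning"):
        if key not in rec:
            problems.append(f"missing field: {key}")

    if "future_trajectory" in rec:
        traj = rec["future_trajectory"]
        if not isinstance(traj, list) or len(traj) != N_WAYPOINTS:
            problems.append(
                f"future_trajectory must be a list of {N_WAYPOINTS} points"
            )
        else:
            for i, pt in enumerate(traj):
                ok = (
                    isinstance(pt, list)
                    and len(pt) == 2
                    and all(
                        isinstance(c, (int, float))
                        and not isinstance(c, bool)
                        and math.isfinite(c)
                        for c in pt
                    )
                )
                if not ok:
                    problems.append(f"waypoint {i} is not a finite [x, y] pair")
    return problems
```

Returning a list of messages rather than raising on the first error makes it possible to report every problem in a 400-line file in a single pass.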
|
| ## Pipeline Architecture |
|
|
| ``` |
| Input: past_trajectory (21 pts @ 5Hz) + driving_instruction |
| ↓ |
| Few-Shot CoT Prompt (3 training examples + query) |
| → VLM (Ollama / HF Inference / Gemini) |
| → XML output with 9 structured fields |
| ↓ |
| Parse XML → Normalize to 5 accel × 5 steer commands |
| ↓ |
| Kinematic Bicycle Model (Kong et al. 2015) |
| Phase 1: 0–3s (15 steps) Phase 2: 3–5s (10 steps) |
| → 25 waypoints in ego-centric coordinates (+x=fwd, +y=left) |
| ↓ |
| Output: scenario_id + future_trajectory (25×2) + reasoning (English, 9 fields) |
| ``` |
|
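The bicycle-model stage of the diagram can be sketched as follows. This is a minimal forward-Euler rollout of the kinematic bicycle model in the form given by Kong et al. (2015), not the repository's implementation. The axle distances `L_F` and `L_R` are assumed values, the `rollout` helper and its `(steps, acceleration, steering angle)` phase tuples are hypothetical, and the sketch assumes each command is held constant for its whole phase.

```python
import math

DT = 0.2    # time step in seconds (5 Hz, from the pipeline description)
L_F = 1.35  # assumed distance from center of gravity to front axle [m]
L_R = 1.35  # assumed distance from center of gravity to rear axle [m]


def rollout(v0, phases):
    """Integrate the kinematic bicycle model with forward Euler.

    v0:     initial speed [m/s]
    phases: list of (n_steps, accel [m/s^2], steer [deg]) tuples
    Returns a list of (x, y) waypoints in the ego frame
    (+x forward, +y left), one per step.
    """
    x = y = psi = 0.0
    v = v0
    waypoints = []
    for n_steps, accel, steer_deg in phases:
        delta = math.radians(steer_deg)
        # Slip angle at the center of gravity for front-wheel steering.
        beta = math.atan(L_R / (L_F + L_R) * math.tan(delta))
        for _ in range(n_steps):
            x += v * math.cos(psi + beta) * DT
            y += v * math.sin(psi + beta) * DT
            psi += v / L_R * math.sin(beta) * DT
            v = max(0.0, v + accel * DT)  # no reversing
            waypoints.append((x, y))
    return waypoints


# Phase 1: 15 steps (0-3 s), Phase 2: 10 steps (3-5 s), driving straight.
traj = rollout(10.0, [(15, 0.0, 0.0), (10, 0.0, 0.0)])
```

With a positive steering angle meaning left, the sign convention matches the ego frame in the diagram: a left command produces positive `y`.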
|
| ## Steering Angle Calibration |
|
|
| The Table 6 values in the paper (±30° at low speed) are meant for instantaneous inputs. In the bicycle model a command is sustained over a multi-second phase, so we calibrated the angles against expert trajectories: |
|
|
| | Command | Paper ≤60 km/h | **Calibrated** | >60 km/h | |
| |---|---|---|---| |
| | turning left | 30° | **6°** | 0.3° | |
| | turning slightly left | 10° | **1°** | 0.1° | |
| | steering straight | 0° | 0° | 0° | |
| | turning slightly right | -10° | **-1°** | -0.1° | |
| | turning right | -30° | **-6°** | -0.3° | |
|
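One way to use this table in code is a lookup keyed by the normalized command text. The sketch below is illustrative only: it assumes the **Calibrated** column applies at or below 60 km/h and the last column above it, and the helper names, the fuzzy-matching cutoff, and the fall-back to "steering straight" for unrecognized text are hypothetical choices.

```python
import difflib

# Degrees, positive = left. Each entry is
# (calibrated angle at <= 60 km/h, angle at > 60 km/h), taken from the table.
# Assumption: these two columns are the ones applied at run time.
STEER_DEG = {
    "turning left": (6.0, 0.3),
    "turning slightly left": (1.0, 0.1),
    "steering straight": (0.0, 0.0),
    "turning slightly right": (-1.0, -0.1),
    "turning right": (-6.0, -0.3),
}


def normalize_steer(text):
    """Map free-form VLM text onto one of the five steering commands."""
    key = " ".join(text.lower().split())  # lower-case, collapse whitespace
    if key in STEER_DEG:
        return key
    match = difflib.get_close_matches(key, list(STEER_DEG), n=1, cutoff=0.6)
    # Hypothetical policy: unrecognized text means "keep the wheel straight".
    return match[0] if match else "steering straight"


def steer_angle_deg(command, speed_kmh):
    """Steering angle in degrees for a command at the given speed."""
    low_speed, high_speed = STEER_DEG[normalize_steer(command)]
    return low_speed if speed_kmh <= 60.0 else high_speed
```

For example, `steer_angle_deg("Turning  Left", 30)` returns `6.0`, while the same command at 100 km/h returns `0.3`.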
|
| ## Files |
|
|
| | File | Description | |
| |---|---| |
| | `scripts/generate_production.py` | Main pipeline — VLM + bicycle model + validation + upload | |
| | `scripts/validate.py` | Standalone submission validator (mirrors challenge code) | |
| | `data/test_metadata.json` | Pre-extracted test metadata (400 scenarios, 190 KB, no images) | |
| | `configs/action_vocabulary.json` | Action → parameter mapping with calibrated values | |
| | `submissions/` | Generated submission JSONL files | |
| | `requirements.txt` | Python dependencies | |
|
|
| ## Test Data Distribution |
|
|
| | Instruction | Count | Scenario Type | Count | |
| |---|---|---|---| |
| | drive straight on | 196 | intersection | 125 | |
| | turn right | 43 | overtake/lane change | 102 | |
| | use left lane | 34 | specifically selected | 68 | |
| | use right lane | 30 | construction zone | 36 | |
| | overtake truck | 25 | heavy rain | 27 | |
| | turn left | 20 | snow & wintry mix | 23 | |
| | u-turn | 8 | nighttime | 19 | |
|
|
| Speed: min 0, mean 52, max 130 km/h. 65% of scenarios are urban (≤60 km/h) and 35% are highway (>60 km/h). |
|
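The 60 km/h split can be applied per scenario once a current speed is known. A minimal sketch, assuming the past trajectory is a sequence of `(x, y)` points in meters sampled at 5 Hz and that the speed is taken from the last two points; both helper names are hypothetical.

```python
import math

DT = 0.2  # seconds between past-trajectory points (5 Hz)


def current_speed_kmh(past_xy):
    """Estimate the current speed from the last two past-trajectory points."""
    (x0, y0), (x1, y1) = past_xy[-2], past_xy[-1]
    return math.hypot(x1 - x0, y1 - y0) / DT * 3.6  # m/s -> km/h


def speed_regime(speed_kmh):
    """Classify a speed using the 60 km/h threshold from the statistics."""
    return "urban" if speed_kmh <= 60.0 else "highway"
```

A finite difference over a single step is sensitive to position noise; averaging over the last few steps is a straightforward variation if the estimate turns out to be jittery.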
|