---
license: apache-2.0
language:
- en
library_name: peft
base_model: Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
tags:
- vision-language
- autonomous-driving
- faithfulness
- critic
- lora
- grpo-reward
- waypoint-prediction
---

# FaithfulnessCritic

LoRA adapters over **Qwen3-VL-4B-Instruct** that score whether a vision-language driving planner's **reasoning (R)**, **meta-action (A)**, and **24-step waypoint plan (W)** are mutually self-consistent given the camera scene.

The critic emits a single token directly after a forced `<verdict>` prefix; the score `P(CONSISTENT) ∈ (0,1)` is recovered by softmaxing the logits over the two single-token verdict words `CONSISTENT` and `INCONSISTENT`. The model is intended as a frozen reward signal during GRPO planner training and as a faithfulness-auditing tool offline.

## Variants

The repo contains four adapter checkpoints under separate subfolders. They differ in (i) which **input class** the critic sees and (ii) which **counterfactual augmentation** strategies were used to construct the negative training examples.

| Subfolder | Input class | Negative strategies | Notes |
|---|---|---|---|
| `GB-S12`  | BEV plot + speed profile          | S1, S2 | Lighter: no scene-description corruption. |
| `GB-S123` | BEV plot + speed profile          | S1, S2, S3 | All three failure modes. |
| `GP-S12`  | Forward camera overlay + speed    | S1, S2 | First-person view; uses calibration parquets. |
| `GP-S123` | Forward camera overlay + speed    | S1, S2, S3 | All three failure modes. |

Where:
- **GB** = Gemini-curated dataset, **B**EV input.
- **GP** = Gemini-curated dataset, first-**P**erson input.
- **S1**: waypoint substitution. `W` is replaced with geometrically incompatible donor waypoints.
- **S2**: move-justification substitution. Only `R.move_justification` is swapped from a donor.
- **S3**: scene-description substitution. `R.scene` is swapped from a different scene.

Validation sets always include all three strategies in equal proportions, regardless of training mix, so the variants are directly comparable on the same benchmark.

## Quick start

Each subfolder is a standalone PEFT adapter. Load it on top of the base VLM:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

BASE = "Qwen/Qwen3-VL-4B-Instruct"
ADAPTER = "mjf-su/FaithfulnessCritic"
SUBFOLDER = "GB-S12"  # or GB-S123, GP-S12, GP-S123

processor = AutoProcessor.from_pretrained(BASE, trust_remote_code=True)
processor.tokenizer.padding_side = "left"

base = AutoModelForImageTextToText.from_pretrained(
    BASE, dtype=torch.bfloat16, trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, ADAPTER, subfolder=SUBFOLDER)
model.eval().to("cuda")

# Build the chat-template prompt with image(s) + text and append "<verdict>"
# at the end so the next-token logits are over CONSISTENT / INCONSISTENT.
# See `critic_rewards.py:CriticRewardBase._build_prompt` for the full template
# and `_score_logit_mode` for the scoring path used to produce P(CONSISTENT).
```

The reference end-to-end pipeline lives at https://github.com/mjf-su/fms4navigation under `critic_library/Gemini_samples/{BEV,fPOV}/`.

## Inputs

A single `(Image, R, A, W)` tuple:
- **Image**: forward-facing camera frame of the driving scene. What the adapter actually consumes depends on the variant:
  - `GB-*` adapters consume a BEV trajectory plot plus a speed-vs-time strip, both rendered purely from `W`.
  - `GP-*` adapters consume the camera frame with `W` projected as a teal polyline (full calibration and egomotion required), plus the same speed strip.
- **R**: `<think>{ "scene": ..., "move_justification": ... }</think>`.
- **A**: `<action> Longitudinal: <label> | Lateral: <label> </action>` from the canonical 7-longitudinal × 11-lateral vocabulary.
- **W**: 24 lines of `<wp>[x, y, θ]</wp>`, vehicle-relative, 0.25 s spacing, 6 s horizon.
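
As a rough illustration of the text side of the input, the sketch below serializes `A` and `W` in the tag layout listed above. The helper names (`format_action`, `format_waypoints`), the two-decimal number formatting, the placeholder action labels, and the choice to put the first waypoint at t = 0.25 s are all assumptions made for illustration; the exact template used in training lives in the reference pipeline.

```python
N_WAYPOINTS = 24    # waypoints per plan, per the list above
DT_SECONDS = 0.25   # spacing between consecutive waypoints


def format_action(longitudinal: str, lateral: str) -> str:
    """Render the meta-action A in the <action> tag layout."""
    return f"<action> Longitudinal: {longitudinal} | Lateral: {lateral} </action>"


def format_waypoints(waypoints: list[tuple[float, float, float]]) -> str:
    """Render W as one <wp>[x, y, theta]</wp> line per waypoint.

    ASSUMPTION: two-decimal formatting; the real pipeline may differ.
    """
    if len(waypoints) != N_WAYPOINTS:
        raise ValueError(f"expected {N_WAYPOINTS} waypoints, got {len(waypoints)}")
    return "\n".join(
        f"<wp>[{x:.2f}, {y:.2f}, {theta:.2f}]</wp>" for x, y, theta in waypoints
    )


# Hypothetical plan: straight-line motion at 5 m/s with zero heading.
straight = [(5.0 * DT_SECONDS * (i + 1), 0.0, 0.0) for i in range(N_WAYPOINTS)]
wp_text = format_waypoints(straight)

# The labels below are placeholders, not entries from the canonical vocabulary.
action_text = format_action("maintain speed", "keep lane")

horizon_seconds = N_WAYPOINTS * DT_SECONDS  # 24 * 0.25 s = 6 s
```

Rejecting plans that do not have exactly 24 waypoints keeps malformed planner outputs from silently reaching the critic.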

## Output

The critic emits a single token after a forced `<verdict>` prefix. Two scoring paths are supported:

| Mode | What it does | Range |
|---|---|---|
| `logit` (default) | Softmax over the two single-token verdict ids at the prompt's last position. | `P(CONSISTENT) ∈ (0,1)` |
| `generate` | Greedy-decode 8 tokens, regex-parse `CONSISTENT` / `INCONSISTENT`. | `{0.0, 0.5, 1.0}` |

Use `logit` mode for reward signals (smooth) and `generate` mode for human-readable verdicts.
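
The two scoring paths can be sketched in plain Python as follows. This is a minimal illustration, not the repository's `_score_logit_mode`: the function names are hypothetical, the two logits are assumed to have already been read from the model's last-position output at the `CONSISTENT` and `INCONSISTENT` token ids, and mapping an unparseable generation to `0.5` is an assumption about how the `{0.0, 0.5, 1.0}` range is used.

```python
import math
import re


def p_consistent(logit_consistent: float, logit_inconsistent: float) -> float:
    """Two-way softmax over the verdict logits, returning P(CONSISTENT)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_consistent, logit_inconsistent)
    e_c = math.exp(logit_consistent - m)
    e_ic = math.exp(logit_inconsistent - m)
    return e_c / (e_c + e_ic)


# "INCONSISTENT" contains "CONSISTENT" as a substring, so the word
# boundaries are needed to keep the two verdicts apart.
_VERDICT_RE = re.compile(r"\b(INCONSISTENT|CONSISTENT)\b")


def parse_verdict(decoded: str) -> float:
    """Map decoded text to a score.

    ASSUMPTION: CONSISTENT -> 1.0, INCONSISTENT -> 0.0, no match -> 0.5.
    """
    match = _VERDICT_RE.search(decoded)
    if match is None:
        return 0.5
    return 1.0 if match.group(1) == "CONSISTENT" else 0.0
```

Because only two logits enter the softmax, `p_consistent` equals a sigmoid of their difference, which is why `logit` mode gives a smooth signal while `generate` mode collapses to three values.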

## Training

- **Base**: Qwen/Qwen3-VL-4B-Instruct (frozen).
- **Adaptation**: LoRA (`r=256`, `lr=1e-4`).
- **Loss**: standard SFT next-token cross-entropy, supervising only the `CONSISTENT` / `INCONSISTENT` verdict token.
- **Positives**: ground-truth `(R, A, W)` triplets from a Gemini-curated subset of [PhysicalAI-Reason-US](https://huggingface.co/datasets/mjf-su/PhysicalAI-Reason-US).
- **Negatives**: counterfactual triplets built per strategy; donor eligibility requires both action axes to differ, different `scene_id`, same train/val split.
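
The donor-eligibility rule for negatives can be written as a small predicate. This is a sketch under assumed names: the `Sample` class, its field names, and the example labels are hypothetical and only mirror the quantities named in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # Hypothetical record layout; the real dataset schema may differ.
    scene_id: str
    split: str          # "train" or "val"
    longitudinal: str   # longitudinal meta-action label
    lateral: str        # lateral meta-action label


def is_eligible_donor(target: Sample, donor: Sample) -> bool:
    """Both action axes differ, the scene differs, and the split matches."""
    return (
        donor.longitudinal != target.longitudinal
        and donor.lateral != target.lateral
        and donor.scene_id != target.scene_id
        and donor.split == target.split
    )


target = Sample("scene-001", "train", "accelerate", "keep lane")
candidates = [
    Sample("scene-002", "train", "decelerate", "turn left"),   # eligible
    Sample("scene-003", "train", "accelerate", "turn left"),   # same longitudinal label
    Sample("scene-004", "val", "decelerate", "turn left"),     # different split
    Sample("scene-001", "train", "decelerate", "turn left"),   # same scene
]
donors = [c for c in candidates if is_eligible_donor(target, c)]
```

Requiring the same split keeps validation scenes out of training negatives, and requiring both action axes to differ makes the swapped content clearly incompatible with the target.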

## Evaluation

Each variant scored 125 randomly drawn (`seed=42`) planner outputs from two driving VLM planners, with `gemini-3-pro-preview` (few-shot, system-prompt + 6 worked examples) used as the LLM judge. Per-axis verdicts are aggregated to a single `overall ∈ {CONSISTENT, INCONSISTENT, AMBIGUOUS}`. **Agreement = accuracy treating Gemini's `overall` as ground truth**, computed on the subset where both Gemini and the critic returned a non-null verdict (Gemini parse failures and `AMBIGUOUS` are skipped).

```
Planner          Critic     Agreement   P      R     F1    μP|C   μP|IC
─────────────────────────────────────────────────────────────────────────
MetaAction-1e    GB-S12       0.764    0.763  0.750  0.756  0.750  0.222
MetaAction-1e    GB-S123      0.724    0.732  0.683  0.707  0.683  0.238
MetaAction-1e    GP-S12       0.732    0.729  0.717  0.723  0.717  0.254
MetaAction-1e    GP-S123      0.732    0.737  0.700  0.718  0.700  0.238
ADEnReward       GB-S12       0.694    0.672  0.717  0.694  0.717  0.328
ADEnReward       GB-S123      0.653    0.644  0.633  0.639  0.633  0.328
ADEnReward       GP-S12       0.734    0.714  0.750  0.732  0.750  0.281
ADEnReward       GP-S123      0.694    0.696  0.650  0.672  0.650  0.266
```

- **P / R / F1** treat `CONSISTENT` as the positive class.
- **μP\|C**: mean critic `P(CONSISTENT)` on Gemini-CONSISTENT records (higher is better).
- **μP\|IC**: mean critic `P(CONSISTENT)` on Gemini-INCONSISTENT records (lower is better). The spread `μP|C − μP|IC` is ≈ 0.45–0.53 on MetaAction-1e and ≈ 0.31–0.47 on ADEnReward, which indicates the critic separates the two classes despite a non-trivial decision-boundary error rate.
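
For readers who want to recompute these columns on their own records, here is a minimal sketch. The function name `summarize`, the record layout `(judge_overall, critic_p_consistent)`, and the `0.5` decision threshold are assumptions; the card does not state which threshold was used to turn `P(CONSISTENT)` into a verdict.

```python
from typing import Optional

Record = tuple[Optional[str], Optional[float]]  # (judge overall, critic P(CONSISTENT))


def _safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def summarize(records: list[Record], threshold: float = 0.5) -> dict[str, float]:
    """Agreement, P/R/F1 (CONSISTENT as positive), and mean P per judge class."""
    # Skip judge parse failures, AMBIGUOUS verdicts, and critic nulls.
    kept = [
        (j, p) for j, p in records
        if j in ("CONSISTENT", "INCONSISTENT") and p is not None
    ]
    # ASSUMPTION: the critic says CONSISTENT when p >= threshold.
    tp = sum(1 for j, p in kept if j == "CONSISTENT" and p >= threshold)
    fn = sum(1 for j, p in kept if j == "CONSISTENT" and p < threshold)
    fp = sum(1 for j, p in kept if j == "INCONSISTENT" and p >= threshold)
    tn = sum(1 for j, p in kept if j == "INCONSISTENT" and p < threshold)
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    p_c = [p for j, p in kept if j == "CONSISTENT"]
    p_ic = [p for j, p in kept if j == "INCONSISTENT"]
    return {
        "agreement": _safe_div(tp + tn, len(kept)),
        "precision": precision,
        "recall": recall,
        "f1": _safe_div(2 * precision * recall, precision + recall),
        "mean_p_given_c": _safe_div(sum(p_c), len(p_c)),
        "mean_p_given_ic": _safe_div(sum(p_ic), len(p_ic)),
    }
```

The two mean columns are computed on the raw probabilities, so they do not depend on the threshold, whereas agreement and P/R/F1 do.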

Best per planner: `GB-S12` for MetaAction-1e (0.764), `GP-S12` for ADEnReward (0.734). Adding S3 (scene-description corruption) to the training mix did not improve agreement on either planner in this benchmark.

## Intended use

- Frozen reward model in GRPO/PPO planner fine-tuning where faithfulness of the (R, A, W) chain matters.
- Offline auditing of candidate planner outputs.
- Counterfactual-failure-mode analysis when paired with the variant ablation (S12 vs S123).

## Out-of-scope use

- The critic is **not** a safety verifier. A `CONSISTENT` verdict means R/A/W are mutually self-consistent and consistent with the scene; it does **not** mean the trajectory is collision-free, comfortable, or legally compliant.
- The critic was trained on a US-centric driving dataset; performance on non-US driving cultures, weather conditions, or sensor configurations not present in the training set is unverified.
- Single-camera, single-frame input only: no temporal stack, no surround views.

## Limitations

- Greedy decoding only in `generate` mode; the reward signal is best read via `logit` mode.
- The critic occasionally produces `null` (parse / render failure) when calibration parquets or camera frames are missing; see `n_critic_failure` in the eval summaries.
- Like the judge it's evaluated against, the critic can be confidently wrong on edge cases involving rare action combinations (lane-change-during-pull-over, etc.).

## Files

```
mjf-su/FaithfulnessCritic/
├── GB-S12/      adapter_config.json + adapter_model.safetensors
├── GB-S123/     ...
├── GP-S12/      ...
└── GP-S123/     ...
```