	---
	license: apache-2.0
	language:
	- en
	library_name: peft
	base_model: Qwen/Qwen3-VL-4B-Instruct
	pipeline_tag: image-text-to-text
	tags:
	- vision-language
	- autonomous-driving
	- faithfulness
	- critic
	- lora
	- grpo-reward
	- waypoint-prediction
	---

	# FaithfulnessCritic

	LoRA adapters over Qwen3-VL-4B-Instruct that score whether a vision-language driving planner's reasoning (R), meta-action (A), and 24-step waypoint plan (W) are mutually self-consistent given the camera scene.

	The critic emits a single token directly after a forced `<verdict>` prefix; the score `P(CONSISTENT) ∈ (0,1)` is recovered by softmaxing the logits over the two single-token verdict words `CONSISTENT` and `INCONSISTENT`. The model is intended as a frozen reward signal during GRPO planner training and as a faithfulness-auditing tool offline.
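Written out, the two-way softmax reduces to a sigmoid of the logit margin. The symbols `z_C` and `z_IC` are introduced here for illustration and denote the next-token logits of `CONSISTENT` and `INCONSISTENT` at the position right after `<verdict>`:

```latex
P(\text{CONSISTENT})
  = \frac{e^{z_C}}{e^{z_C} + e^{z_{IC}}}
  = \sigma\!\left(z_C - z_{IC}\right),
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}}
```

All other vocabulary logits are ignored, so the score depends only on the margin `z_C − z_IC`.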

	## Variants

	The repo contains four adapter checkpoints under separate subfolders. They differ in (i) which input class the critic sees and (ii) which counterfactual augmentation strategies were used to construct the negative training examples.

	| Subfolder | Input class | Negative strategies | Notes |
	|---|---|---|---|
	| `GB-S12` | BEV plot + speed profile | S1, S2 | Lighter: no scene-description corruption. |
	| `GB-S123` | BEV plot + speed profile | S1, S2, S3 | All three failure modes. |
	| `GP-S12` | Forward camera overlay + speed | S1, S2 | First-person view; uses calibration parquets. |
	| `GP-S123` | Forward camera overlay + speed | S1, S2, S3 | All three failure modes. |

	Where:
	- GB: Gemini-curated dataset, bird's-eye-view (BEV) input.
	- GP: Gemini-curated dataset, first-person input.
	- S1: waypoint substitution. `W` is replaced with geometrically incompatible donor waypoints.
	- S2: move-justification substitution. Only `R.move_justification` is swapped in from a donor.
	- S3: scene-description substitution. `R.scene` is swapped in from a different scene.

	Validation sets always include all three strategies in equal proportions, regardless of training mix, so the variants are directly comparable on the same benchmark.

	## Quick start

	Each subfolder is a standalone PEFT adapter. Load it on top of the base VLM:

	```python
	import torch
	from peft import PeftModel
	from transformers import AutoModelForImageTextToText, AutoProcessor

	BASE = "Qwen/Qwen3-VL-4B-Instruct"
	ADAPTER = "mjf-su/FaithfulnessCritic"
	SUBFOLDER = "GB-S12" # or GB-S123, GP-S12, GP-S123

	processor = AutoProcessor.from_pretrained(BASE, trust_remote_code=True)
	processor.tokenizer.padding_side = "left"

	base = AutoModelForImageTextToText.from_pretrained(
	    BASE, dtype=torch.bfloat16, trust_remote_code=True,
	)
	model = PeftModel.from_pretrained(base, ADAPTER, subfolder=SUBFOLDER)
	model.eval().to("cuda")

	# Build the chat-template prompt with image(s) + text and append "<verdict>"
	# at the end so the next-token logits are over CONSISTENT / INCONSISTENT.
	# See `critic_rewards.py:CriticRewardBase._build_prompt` for the full template
	# and `_score_logit_mode` for the scoring path used to produce P(CONSISTENT).
	```

	The reference end-to-end pipeline lives at https://github.com/mjf-su/fms4navigation under `critic_library/Gemini_samples/{BEV,fPOV}/`.
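The scoring step that the Quick start leaves to `_score_logit_mode` can be sketched as below. This is a minimal illustration, not the repository's implementation: `p_consistent` and its arguments are hypothetical names, and it assumes you already have the logits at the last prompt position plus the vocabulary ids of the two verdict tokens.

```python
import math
from typing import Sequence


def p_consistent(last_logits: Sequence[float], id_consistent: int, id_inconsistent: int) -> float:
    """Two-way softmax over the verdict tokens, ignoring the rest of the vocabulary."""
    z_c = float(last_logits[id_consistent])
    z_ic = float(last_logits[id_inconsistent])
    margin = z_c - z_ic
    # softmax([z_c, z_ic])[0] equals sigmoid(margin); branch to avoid overflow in exp().
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


# Hypothetical wiring with the Quick start objects (not executed here):
#   out = model(**inputs)                 # `inputs` is the prompt ending in "<verdict>"
#   last = out.logits[0, -1]              # left padding keeps the final position aligned
#   id_c = processor.tokenizer.encode("CONSISTENT", add_special_tokens=False)[0]
#   id_ic = processor.tokenizer.encode("INCONSISTENT", add_special_tokens=False)[0]
#   score = p_consistent(last, id_c, id_ic)
# Whether the verdict words tokenize exactly this way is an assumption; check the
# reference `_score_logit_mode` for the ids actually used.
```

Because only the difference between the two logits matters, the score is unaffected by how much probability mass the model places on unrelated tokens.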

	## Inputs

	A single `(Image, R, A, W)` sample, i.e. one image input plus the planner's `(R, A, W)` triplet:
	- Image: forward-facing camera frame of the driving scene.
	  - `GB-*` adapters consume a BEV trajectory plot plus a speed-vs-time strip rendered purely from `W`.
	  - `GP-*` adapters consume the camera frame with `W` projected as a teal polyline (full calibration and egomotion required), plus the same speed strip.
	- R: `<think>{ "scene": ..., "move_justification": ... }</think>`.
	- A: `<action> Longitudinal: <label> | Lateral: <label> </action>` from the canonical 7-longitudinal × 11-lateral vocabulary.
	- W: 24 lines of `<wp>[x, y, θ]</wp>`, vehicle-relative, 0.25 s spacing, 6 s horizon.
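As an illustration of the `W` format, the sketch below serializes 24 waypoints into `<wp>` lines and parses them back with their time offsets. The function names, the number of decimals, and the choice that the first waypoint sits at t = 0.25 s (so the last lands on the 6 s horizon) are assumptions made for this example; the reference pipeline may differ.

```python
import re

N_STEPS = 24
DT_S = 0.25  # spacing from the card; 24 * 0.25 s = 6 s horizon

_NUM = r"([-+]?[0-9.]+(?:[eE][-+]?[0-9]+)?)"
WP_RE = re.compile(rf"<wp>\[\s*{_NUM}\s*,\s*{_NUM}\s*,\s*{_NUM}\s*\]</wp>")


def format_waypoints(waypoints):
    """Render (x, y, theta) tuples as one <wp> line each."""
    if len(waypoints) != N_STEPS:
        raise ValueError(f"expected {N_STEPS} waypoints, got {len(waypoints)}")
    # Decimal precision is an assumption, not taken from the card.
    return "\n".join(f"<wp>[{x:.2f}, {y:.2f}, {th:.3f}]</wp>" for x, y, th in waypoints)


def parse_waypoints(text):
    """Recover (t, x, y, theta) tuples, where t is the step's time offset in seconds."""
    rows = [tuple(float(v) for v in m.groups()) for m in WP_RE.finditer(text)]
    if len(rows) != N_STEPS:
        raise ValueError(f"expected {N_STEPS} <wp> lines, found {len(rows)}")
    # Assumption: step i (0-based) is at (i + 1) * DT_S, so the last step is at 6.0 s.
    return [((i + 1) * DT_S, x, y, th) for i, (x, y, th) in enumerate(rows)]
```

Validating the waypoint count on both sides catches truncated planner outputs before they reach the critic.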

	## Output

	The critic emits a single token after a forced `<verdict>` prefix. Two scoring paths are supported:

	| Mode | What it does | Range |
	|---|---|---|
	| `logit` (default) | Softmax over the two single-token verdict ids at the prompt's last position. | `P(CONSISTENT) ∈ (0,1)` |
	| `generate` | Greedy-decode 8 tokens, regex-parse `CONSISTENT` / `INCONSISTENT`. | `{0.0, 0.5, 1.0}` |

	Use `logit` mode for reward signals (smooth) and `generate` mode for human-readable verdicts.
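The parsing step in `generate` mode might look like the sketch below. The helper name is hypothetical, and mapping unparseable output to `0.5` is an assumption: the card lists `0.5` in the range but does not say when it is returned.

```python
import re

# "INCONSISTENT" contains "CONSISTENT" as a substring, so match whole words only.
VERDICT_RE = re.compile(r"\b(INCONSISTENT|CONSISTENT)\b")


def verdict_to_score(decoded: str) -> float:
    """Map decoded critic text to a score in {0.0, 0.5, 1.0}."""
    match = VERDICT_RE.search(decoded)
    if match is None:
        return 0.5  # assumption: no recognizable verdict is treated as neutral
    return 1.0 if match.group(1) == "CONSISTENT" else 0.0
```

A naive substring test such as `"CONSISTENT" in decoded` would score every `INCONSISTENT` verdict as consistent, which is why the word boundaries matter here.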

	## Training

	- Base: Qwen/Qwen3-VL-4B-Instruct (frozen).
	- Adaptation: LoRA (`r=256`, `lr=1e-4`).
	- Loss: standard SFT next-token cross-entropy, supervising only the `CONSISTENT` / `INCONSISTENT` verdict token.
	- Positives: ground-truth `(R, A, W)` triplets from a Gemini-curated subset of [PhysicalAI-Reason-US](https://huggingface.co/datasets/mjf-su/PhysicalAI-Reason-US).
	- Negatives: counterfactual triplets built per strategy; donor eligibility requires both action axes to differ, different `scene_id`, same train/val split.
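The donor-eligibility rule for negatives can be expressed as a small predicate. This is a sketch under assumptions: the `Sample` record and its field names (other than `scene_id`, which the card mentions) are hypothetical, and the action labels used in the usage lines are placeholders rather than entries from the canonical vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    scene_id: str
    split: str         # "train" or "val"
    longitudinal: str  # longitudinal meta-action label
    lateral: str       # lateral meta-action label


def is_eligible_donor(target: Sample, donor: Sample) -> bool:
    """Both action axes must differ, scenes must differ, and the split must match."""
    return (
        donor.longitudinal != target.longitudinal
        and donor.lateral != target.lateral
        and donor.scene_id != target.scene_id
        and donor.split == target.split
    )


# Placeholder labels, for illustration only.
target = Sample("scene_a", "train", "accelerate", "keep_lane")
donors = [
    Sample("scene_b", "train", "brake", "turn_left"),      # eligible
    Sample("scene_c", "train", "accelerate", "turn_left"),  # same longitudinal label
    Sample("scene_d", "val", "brake", "turn_left"),         # different split
]
eligible = [d for d in donors if is_eligible_donor(target, d)]
```

Requiring the donor to share the target's split keeps validation scenes out of training negatives.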

	## Evaluation

	Each variant scored 125 randomly drawn (`seed=42`) planner outputs from two driving VLM planners, with `gemini-3-pro-preview` (few-shot, system-prompt + 6 worked examples) used as the LLM judge. Per-axis verdicts are aggregated to a single `overall ∈ {CONSISTENT, INCONSISTENT, AMBIGUOUS}`. Agreement = accuracy treating Gemini's `overall` as ground truth, computed on the subset where both Gemini and the critic returned a non-null verdict (Gemini parse failures and `AMBIGUOUS` are skipped).

	```
	Planner        Critic    Agreement  P      R      F1     μP|C   μP|IC
	─────────────────────────────────────────────────────────────────────
	MetaAction-1e  GB-S12    0.764      0.763  0.750  0.756  0.750  0.222
	MetaAction-1e  GB-S123   0.724      0.732  0.683  0.707  0.683  0.238
	MetaAction-1e  GP-S12    0.732      0.729  0.717  0.723  0.717  0.254
	MetaAction-1e  GP-S123   0.732      0.737  0.700  0.718  0.700  0.238
	ADEnReward     GB-S12    0.694      0.672  0.717  0.694  0.717  0.328
	ADEnReward     GB-S123   0.653      0.644  0.633  0.639  0.633  0.328
	ADEnReward     GP-S12    0.734      0.714  0.750  0.732  0.750  0.281
	ADEnReward     GP-S123   0.694      0.696  0.650  0.672  0.650  0.266
	```

	- P / R / F1 treat `CONSISTENT` as the positive class.
	- μP|C: mean critic `P(CONSISTENT)` on Gemini-CONSISTENT records (higher is better).
	- μP|IC: mean critic `P(CONSISTENT)` on Gemini-INCONSISTENT records (lower is better). The spread `μP|C − μP|IC` is ≈ 0.44–0.53 on MetaAction-1e and ≈ 0.30–0.47 on ADEnReward, so the critic separates the two classes on average despite a non-trivial decision-boundary error rate.

	Best per planner: `GB-S12` for MetaAction-1e (0.764), `GP-S12` for ADEnReward (0.734). Adding S3 (scene-description corruption) to the training mix did not improve agreement on either planner in this benchmark.
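The reported metrics can be reproduced from per-record outputs along the lines of the sketch below. The record layout, the function names, and the `0.5` threshold used to turn `P(CONSISTENT)` into a hard verdict are assumptions made for illustration; the evaluation scripts in the reference repository are the authority.

```python
import math


def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else math.nan


def summarize(records, threshold=0.5):
    """records: iterable of (judge_overall, critic_p) pairs; critic_p may be None."""
    # Skip judge parse failures, AMBIGUOUS verdicts, and critic failures.
    kept = [
        (judge == "CONSISTENT", p)
        for judge, p in records
        if judge in ("CONSISTENT", "INCONSISTENT") and p is not None
    ]
    tp = sum(1 for pos, p in kept if pos and p >= threshold)
    fn = sum(1 for pos, p in kept if pos and p < threshold)
    fp = sum(1 for pos, p in kept if not pos and p >= threshold)
    tn = sum(1 for pos, p in kept if not pos and p < threshold)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "n": len(kept),
        "agreement": _ratio(tp + tn, len(kept)),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "mu_p_given_c": _ratio(sum(p for pos, p in kept if pos), tp + fn),
        "mu_p_given_ic": _ratio(sum(p for pos, p in kept if not pos), fp + tn),
    }


example = summarize([
    ("CONSISTENT", 0.9), ("CONSISTENT", 0.4),
    ("INCONSISTENT", 0.2), ("INCONSISTENT", 0.7),
    ("AMBIGUOUS", 0.8), ("CONSISTENT", None), (None, 0.3),
])
```

In this toy input only four of the seven records survive filtering, which mirrors how the agreement denominator in the table can be smaller than the 125 sampled outputs.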

	## Intended use

	- Frozen reward model in GRPO/PPO planner fine-tuning where faithfulness of the (R, A, W) chain matters.
	- Offline auditing of candidate planner outputs.
	- Counterfactual-failure-mode analysis when paired with the variant ablation (S12 vs S123).

	## Out-of-scope use

	- The critic is not a safety verifier. A `CONSISTENT` verdict means R/A/W are mutually self-consistent and consistent with the scene; it does not mean the trajectory is collision-free, comfortable, or legally compliant.
	- The critic was trained on a US-centric driving dataset; performance on non-US driving cultures, weather conditions, or sensor configurations not present in the training set is unverified.
	- Single-camera, single-frame input only — no temporal stack, no surround views.

	## Limitations

	- Greedy decoding only in `generate` mode; the reward signal is best read via `logit` mode.
	- The critic occasionally produces `null` (parse / render failure) when calibration parquets or camera frames are missing — see `n_critic_failure` in the eval summaries.
	- Like the judge it's evaluated against, the critic can be confidently wrong on edge cases involving rare action combinations (lane-change-during-pull-over, etc.).

	## Files

	```
	mjf-su/FaithfulnessCritic/
	├── GB-S12/    adapter_config.json + adapter_model.safetensors
	├── GB-S123/   ...
	├── GP-S12/    ...
	└── GP-S123/   ...
	```