---
license: mit
tags:
  - robotics
  - flow-matching
  - ood-detection
  - visual-servoing
  - conditioning-energy
  - uncertainty-quantification
pipeline_tag: robotics
library_name: pytorch
---

# Familiarity-Flow OneBox 8-Layer

Flow-matching policy for stereo-image-conditioned 3D grasp-offset prediction,
trained on the **OneBox** synthetic Isaac-Sim dataset. The full learning
dynamics — value of the prediction, geometry of the flow, and
Jacobian-of-conditioning OOD signal — are studied in the
[Familiarity-Flow repo](https://github.com/Finding-Familiarity/Familiarity-Flow).

Intended primarily as the **conditioning-energy OOD-detection backend** for
robotic-policy gating, exposed through the
[familiarity-planner](https://github.com/tomnotch/familiarity-planner) package.

**This checkpoint comes from a 150,000-step extended-training study**
that explored flow / OOD-separation dynamics well past the conventional
convergence point. See
[`docs/long_run_analysis.md`](https://github.com/Finding-Familiarity/Familiarity-Flow/blob/main/docs/long_run_analysis.md)
in the repo for the full write-up (multi-descent behaviour observed, not
the monotone-plateau or terminal-collapse initially hypothesised).

---

## Checkpoint summary

| Field | Value |
|---|---|
| Architecture | `FlowMatchingPolicy`, 8 cross-attention layers |
| Vision encoder | DINOv2-B (ViT-B/14, frozen) |
| Action space | ℝ³ (3-DoF grasp offset) |
| Time sampling | Beta(1.5, 1) (π₀ schedule) |
| Training data | OneBox (synthetic Isaac Sim, ZED-Mini stereo) |
| Training steps | 128,250 (best val_loss checkpoint of 150k-step run) |
| Best val_loss | **0.0639** |
| Best val L2 error | **0.1462** |
| Parameters | 244 M total, 35.6 M trainable (encoder frozen) |
| License | MIT |

### OOD-separation at this checkpoint (step 128,250)

| Metric | ID | OOD (clutter) | WILD (real) | OOD/ID | WILD/ID |
|---|---|---|---|---|---|
| CE  | 0.642 | 3.341 | 2.077 | **5.20×** | 3.23× |
| DCE | 0.062 | 0.303 | 0.186 | **4.87×** | 2.99× |

AUROC(ID vs OOD) and AUROC(ID vs WILD) are both **1.000** (rank-based
separation is perfect and has been since step ≈ 8k).

Reported directly from the training log at
`outputs/csv/onebox/version_15` in the repo.
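The rank-based AUROC claim above can be checked directly from per-sample CE
scores. A minimal sketch using the Mann-Whitney formulation (the scores below
are illustrative, not taken from the training log):

```python
def auroc(id_scores, ood_scores):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly drawn OOD score exceeds a randomly drawn ID score
    (ties count half). 1.0 means perfect rank separation."""
    wins = sum(
        1.0 if o > i else 0.5 if o == i else 0.0
        for o in ood_scores
        for i in id_scores
    )
    return wins / (len(id_scores) * len(ood_scores))

# Toy scores: every OOD CE above every ID CE -> AUROC = 1.0.
id_ce = [0.51, 0.64, 0.70, 0.58]
ood_ce = [2.90, 3.34, 3.10, 4.02]
print(auroc(id_ce, ood_ce))  # 1.0
```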

### vs the previous checkpoint (step 21,850, val_loss 0.0726)

Better on every tabulated metric; the one exception is noted below the table:

| | Previous | This checkpoint | Δ |
|---|---|---|---|
| val/loss | 0.0726 | **0.0639** | −12.0% |
| val/l2_error | 0.1755 | **0.1462** | −16.7% |
| ood/loss | 4.414 | 4.241 | −3.9% |
| ood/l2_error | 1.371 | 1.271 | −7.3% |
| CE WILD/ID | 2.79× | **3.23×** | +15.8% |
| DCE OOD/ID | 4.32× | **4.87×** | +12.7% |
| DCE WILD/ID | 2.41× | **2.99×** | +24.1% |

(CE OOD/ID drifted −2.1%, well inside the run-to-run variance observed
during the extended run.)

> **Threshold-shift note**: absolute CE/DCE values in this checkpoint
> are ~3× larger than in the previous one (CE_ID 0.225 → 0.642). A
> downstream OOD detector using an absolute threshold needs to be
> re-calibrated — ratios are preserved but the raw scale is not.
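One simple way to absorb such a scale shift is to re-derive the threshold from
a small held-out ID calibration set instead of reusing an absolute number. A
hedged sketch (the quantile choice and the sample distributions are
illustrative, not from the repo):

```python
import numpy as np

def calibrate_threshold(id_ce_scores, quantile=0.99):
    """Set the OOD threshold at a high quantile of in-distribution CE,
    so it tracks each checkpoint's raw CE scale automatically."""
    return float(np.quantile(id_ce_scores, quantile))

# Simulated ID CE for the two checkpoints (means from the note above,
# spreads invented for illustration).
rng = np.random.default_rng(0)
old_id_ce = rng.normal(0.225, 0.03, size=500)
new_id_ce = rng.normal(0.642, 0.09, size=500)

old_thr = calibrate_threshold(old_id_ce)
new_thr = calibrate_threshold(new_id_ce)
# The re-derived threshold scales with the checkpoint; a frozen
# absolute threshold from the old run would flag most new ID samples.
```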

---

## Usage

### Download

```python
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
    repo_id="TomNotch/familiarity-flow-onebox-8L",
    filename="onebox_8L.ckpt",
)
```

### Load directly (Familiarity-Flow must be installed)

```python
from familiarity_flow.lightning.module import FlowMatchingModule

module = FlowMatchingModule.load_from_checkpoint(ckpt_path, map_location="cuda")
module.eval()
policy = module.ema_policy   # EMA-averaged weights used for inference
```

### Score a batch for OOD-ness

```python
# images: list of stereo image tensors, each shaped (B, 3, 224, 224)
ce = policy.ood_score(images, num_steps=10)   # shape: (B,)
# Higher CE = more OOD
```

### Via familiarity-planner

```python
from familiarity_planner.familiarity import Familiarity

fam = Familiarity(
    "conditioning_energy",
    checkpoint_path="TomNotch/familiarity-flow-onebox-8L",   # auto-downloaded
)
score = fam(stereo_observation)   # smaller = more familiar
```

---

## Method

Conditional flow matching with linear interpolation and independent coupling
(Lipman et al., *ICLR 2023*). The **conditioning energy**

$$\mathrm{CE}(c) = \int_0^1 \left\lVert \frac{\partial v_\theta}{\partial c}(x_t, t, c) \right\rVert_F^2 \, \mathrm{d}t$$

is measured along the deterministic Euler ODE trajectory from noise
(`x_1 ∼ N(0, I)`) to the predicted action (`x_0`). Its endpoint-Jacobian
cousin DCE measures the squared Frobenius norm of `∂φ/∂c` where `φ` is
the full ODE map. Both scale as out-of-distribution inputs excite the
learned velocity field's sensitivity to conditioning — a signal that
falls out of the geometry of the flow without any auxiliary classifier.
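Discretised, CE is a Riemann sum of the conditioning-Jacobian norm along the
Euler trajectory. A toy sketch with a hand-written linear velocity field
`v(x, t, c) = -x + W @ c`, chosen so that `∂v/∂c = W` is analytic (the real
policy differentiates `v_θ` with autograd; `W`, the shapes, and the step count
here are all illustrative):

```python
import numpy as np

def euler_ce(x1, c, W, num_steps=10):
    """Conditioning energy along the Euler path from noise x1 (t=1)
    to the action x0 (t=0), for the toy field v(x, t, c) = -x + W @ c.
    Since dv/dc = W at every step, the Riemann sum collapses to
    num_steps * dt * ||W||_F^2 = ||W||_F^2."""
    dt = 1.0 / num_steps
    x, ce = x1.copy(), 0.0
    for _ in range(num_steps):
        dv_dc = W                        # analytic conditioning Jacobian
        ce += dt * np.sum(dv_dc ** 2)    # squared-Frobenius-norm term
        x = x - dt * (-x + W @ c)        # Euler step toward t = 0
    return x, ce

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x0, ce = euler_ce(rng.normal(size=3), rng.normal(size=4), W)
print(np.isclose(ce, np.sum(W ** 2)))  # True
```

For a nonlinear field the Jacobian changes along the trajectory, which is
exactly what lets OOD conditioning inflate the integral.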

---

## Limitations

- Trained on a **single synthetic domain** (OneBox Isaac Sim renderings).
  Generalisation across robots, object sets, or camera rigs is **not**
  claimed.
- Action head predicts only a 3-DoF grasp offset; not a full pose or
  trajectory.
- OOD-detection quality (CE/DCE) is strong on the OneBox `clutter` and
  `wild` eval sets used during training — behaviour on arbitrary
  out-of-domain inputs is untested.
- **Not for deployment on physical robots** without independent
  validation. Intended as a research artefact and as a concrete
  backend for methodology study.

---

## Related work

- Lipman et al., *Flow Matching for Generative Modeling*, ICLR 2023
  ([arXiv:2210.02747](https://arxiv.org/abs/2210.02747))
- Black et al., *π₀: A Vision-Language-Action Flow Model for General
  Robot Control* ([arXiv:2410.24164](https://arxiv.org/abs/2410.24164))
- Chen et al., *Neural Ordinary Differential Equations*, NeurIPS 2018
  ([arXiv:1806.07366](https://arxiv.org/abs/1806.07366))
- Liu et al., *Simple and Principled Uncertainty Estimation (SNGP)*,
  NeurIPS 2020 ([arXiv:2006.10108](https://arxiv.org/abs/2006.10108))
- Nakkiran et al., *Deep Double Descent*, ICLR 2020
  ([arXiv:1912.02292](https://arxiv.org/abs/1912.02292))

---

## Author

Mukai (Tom Notch) Yu — Carnegie Mellon University, Robotics Institute.
Course project for 16-832 / 16-761 (Spring 2026).