# Design Doc: Motion- & Coverage-Aware Key Frame Selection

**Author:** Brian Clark  
**Last Updated:** 2025-11-07  
**Target Components:** `_compute_selected_frames`, Stream3R inference outputs  
**Goal:** Replace naive fixed-rate (FPS) sampling with a strategy that keeps only frames that add new camera poses or meaningful scene coverage, reducing point-cloud clutter and improving 2D scene graphs.

---

## 1. Overview

We combine two complementary signals:

1. **Motion-aware downsampling (Option A):** ensure key frames are spaced by actual camera movement (SE(3) distance), not just time.
2. **Coverage-driven selection (Option B):** prefer frames that contribute new high-confidence geometry after Stream3R processing.

The final key frame list is built by enforcing motion diversity first, then greedily adding the frames that contribute the most not-yet-covered geometry until we reach a target budget.

---

## 2. Inputs & Prerequisites

- Per-frame camera extrinsics (`extrinsic`) from Stream3R.
- Optional per-frame quality metrics (blur/confidence) from camera head.
- Stream3R `world_points` and `world_points_conf` (or post-voxel-reduction point maps) to evaluate coverage.
- Library support: NumPy + SciPy (for SE(3) distances), optional Open3D or custom KD-tree for point coverage.

---

## 3. Motion Metrics (Option A)

### 3.1 Pose difference
- Compute translation delta: `||t_i - t_j||`.
- Compute rotation delta: the angle of `R_rel = R_i * R_j^{-1}`, via `acos((trace(R_rel) - 1) / 2)` with the argument clamped to `[-1, 1]` to absorb round-off.
- Combine with weights (e.g., `motion = w_t * Δpos + w_r * Δrot`), with defaults `w_t=1.0`, `w_r=0.5 m/rad`.
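
The pose difference above can be sketched in NumPy as follows. This is a minimal sketch, not the project's implementation: `pose_distance` is a hypothetical helper, and it assumes each pose is a camera-to-world matrix (3x4 or 4x4) whose last column is the camera position. If Stream3R's `extrinsic` turns out to be world-to-camera, the camera centre would be `-R.T @ t` and should be computed before taking the translation delta.

```python
import numpy as np


def pose_distance(T_i, T_j, w_t=1.0, w_r=0.5):
    """Weighted motion score between two poses (hypothetical helper).

    Assumes T_i and T_j are camera-to-world matrices (3x4 or 4x4).
    w_t is unitless and w_r is in m/rad, matching the defaults above.
    """
    R_i, t_i = T_i[:3, :3], T_i[:3, 3]
    R_j, t_j = T_j[:3, :3], T_j[:3, 3]

    # Translation delta: Euclidean distance between camera positions.
    d_pos = np.linalg.norm(t_i - t_j)

    # Relative rotation R_i * R_j^{-1}; the inverse of a rotation is its transpose.
    R_rel = R_i @ R_j.T

    # Clamp to [-1, 1] so round-off cannot push arccos out of its domain.
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    d_rot = np.arccos(cos_angle)

    return float(w_t * d_pos + w_r * d_rot)
```

For example, a 1 m translation combined with a 90° rotation scores `1.0 * 1.0 + 0.5 * π/2 ≈ 1.79` under the default weights.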

### 3.2 Greedy spacing (temporal pass)
1. Initialize with first frame as key.
2. For each subsequent frame:
   - Accumulate motion distance since the last key (sum of per-frame deltas).
   - If the accumulated distance ≥ `motion_threshold` OR the time since the last key ≥ `max_gap`, mark the frame as a key and reset the accumulator.
   - Optional: enforce a minimum time gap (`min_gap`) between keys to avoid bursty picks.
3. Result: `motion_keys` – baseline set with adequate pose coverage.
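
The temporal pass above might look like the sketch below. It assumes the per-frame motion deltas and timestamps (in seconds) have already been computed; the argument layout is an illustration rather than a settled interface, and the accumulator reset after each key is an assumption about the intended behaviour.

```python
def greedy_motion_selection(deltas, timestamps, motion_threshold=0.3,
                            min_gap=0.25, max_gap=2.0):
    """Return indices of motion key frames (illustrative sketch).

    deltas[i] is the motion score between frame i-1 and frame i
    (deltas[0] is ignored); timestamps are in seconds.
    """
    if len(timestamps) == 0:
        return []

    keys = [0]          # the first frame is always a key
    accumulated = 0.0   # motion accumulated since the last key

    for i in range(1, len(timestamps)):
        accumulated += deltas[i]
        elapsed = timestamps[i] - timestamps[keys[-1]]

        # Too soon after the last key: keep accumulating, but do not pick.
        if elapsed < min_gap:
            continue

        if accumulated >= motion_threshold or elapsed >= max_gap:
            keys.append(i)
            accumulated = 0.0  # assumed: restart accumulation at each key

    return keys
```

Because skipped frames still add to the accumulator, a burst of fast motion yields a key as soon as `min_gap` has elapsed instead of being lost.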

### 3.3 Quality gating (optional)
- Discard frames with low focus / brightness (if metadata available).
- Use confidence summary (mean `world_points_conf`) to veto worst frames before motion selection.
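
One way to express the confidence veto is a per-frame mask computed before the motion pass. The sketch below is an assumption-heavy illustration: it assumes `world_points_conf` is shaped `(N, H, W)`, and `drop_fraction` is a hypothetical knob that does not appear in the configuration table.

```python
import numpy as np


def quality_mask(world_points_conf, drop_fraction=0.1):
    """Boolean mask over frames; False marks frames vetoed for low confidence.

    Assumes world_points_conf has shape (N, H, W). drop_fraction is a
    hypothetical parameter: roughly that share of the lowest-scoring
    frames is vetoed.
    """
    conf = np.asarray(world_points_conf, dtype=float)

    # Confidence summary: mean confidence per frame.
    mean_conf = conf.reshape(conf.shape[0], -1).mean(axis=1)

    # Frames below the drop_fraction quantile are vetoed.
    cutoff = np.quantile(mean_conf, drop_fraction)
    return mean_conf >= cutoff
```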

---

## 4. Coverage Metrics (Option B)

### 4.1 Coverage data
- For each frame, gather the subset of point cloud indices it contributes above a confidence threshold.
  - Option 1: Use raw `world_points_conf` mask per frame.
  - Option 2: After voxel reduction, store voxel IDs touched by each frame (during inference loop).
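
Both options reduce to "which cells does this frame touch with confident points". A minimal sketch of the voxel variant follows, assuming one frame's `world_points` is shaped `(H, W, 3)` and its `world_points_conf` is `(H, W)`; `frame_voxel_ids` and the `voxel_size` default of 0.05 m are hypothetical.

```python
import numpy as np


def frame_voxel_ids(world_points, world_points_conf,
                    conf_thres=0.3, voxel_size=0.05):
    """Set of voxel IDs touched by one frame's confident points (sketch).

    Assumes world_points is (H, W, 3) and world_points_conf is (H, W).
    voxel_size (metres) is a hypothetical default.
    """
    pts = np.asarray(world_points, dtype=float).reshape(-1, 3)
    conf = np.asarray(world_points_conf, dtype=float).reshape(-1)

    # Keep only points at or above the confidence threshold.
    keep = conf >= conf_thres

    # Quantize to integer voxel coordinates.
    voxels = np.floor(pts[keep] / voxel_size).astype(np.int64)

    # Tuples are hashable, so the set removes duplicate voxels.
    return {tuple(v) for v in voxels.tolist()}
```

Storing sets of integer triples keeps the later coverage pass to plain set differences; a packed integer hash would use less memory if that becomes a concern.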

### 4.2 Greedy coverage selection
1. Start with `coverage_keys = []`, `covered = set()`.
2. For each candidate frame (ordered by motion selection or confidence):
   - Compute `gain = |new_points| / total_points`, where `new_points` is the set of the frame's points not yet in `covered`.
   - Keep a priority queue sorted by gain (breaking ties via motion distance or confidence).
3. While `coverage_keys` is smaller than the target (`top_k` or auto budget) and the best gain is at least `min_gain_ratio`:
   - Pop the frame with the highest gain.
   - Add it to `coverage_keys` and update `covered`.
   - Recompute gains lazily: a frame's gain can only shrink as `covered` grows, so a stored value is an upper bound and needs refreshing only when the frame reaches the top of the queue.
4. Merge with `motion_keys`: `selected = sorted(motion_keys ∪ coverage_keys)` preserving chronological order.
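
The coverage pass can be sketched with a lazy-greedy priority queue. This is an illustration under stated assumptions: `contributions` maps a frame index to its set of point or voxel IDs, and ties fall to the lower frame index instead of motion distance or confidence.

```python
import heapq


def greedy_coverage_selection(contributions, budget, min_gain_ratio=0.01):
    """Pick frames by marginal coverage gain (lazy-greedy sketch).

    contributions: dict mapping frame index -> set of point/voxel IDs.
    Returns the chosen frame indices in pick order.
    """
    total = len(set().union(*contributions.values()))
    if total == 0 or budget <= 0:
        return []

    covered, chosen = set(), []

    # heapq is a min-heap, so gains are negated. Stored gains are upper
    # bounds: a frame's true gain only shrinks as `covered` grows.
    heap = [(-len(ids), idx) for idx, ids in contributions.items()]
    heapq.heapify(heap)

    while heap and len(chosen) < budget:
        neg_gain, idx = heapq.heappop(heap)
        fresh = len(contributions[idx] - covered)

        if fresh < -neg_gain:
            # Stored value was stale: reinsert with the refreshed gain.
            heapq.heappush(heap, (-fresh, idx))
            continue

        # The popped gain is current, hence the true maximum.
        if fresh / total < min_gain_ratio:
            break

        chosen.append(idx)
        covered |= contributions[idx]

    return chosen
```

The merge step is then `sorted(set(motion_keys) | set(coverage_keys))`, which restores chronological order when frame indices increase with time.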

### 4.3 Parameters
| Parameter | Purpose | Default |
|-----------|---------|---------|
| `coverage_conf_thres` | Minimum confidence per point | 0.3 |
| `top_k` | Max key frames (if >0) | Taken from payload |
| `auto_budget_seconds` | If `top_k` is not set, target key-frame rate used to derive the budget from scene duration | 0.4 fps (≈12 frames for 30 s) |
| `min_gain_ratio` | Stop if marginal gain < threshold | 0.01 |
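
As a worked example of how the budget might be resolved from these parameters: an explicit positive `top_k` wins, otherwise the budget is the target rate times the scene duration. `resolve_budget` is a hypothetical helper, and the floor of one frame is an assumption.

```python
def resolve_budget(top_k, duration_s, auto_fps=0.4):
    """Key-frame budget: explicit top_k if positive, else rate * duration.

    Hypothetical helper; the minimum of one frame is an assumption.
    """
    if top_k is not None and top_k > 0:
        return int(top_k)
    return max(1, round(auto_fps * duration_s))
```

With the defaults, a 30 s scene and no `top_k` gives `round(0.4 * 30) = 12` key frames.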

---

## 5. Algorithm Outline

```text
1. Precompute per-frame metadata:
   - Motion deltas & cumulative distance
   - Frame quality/confidence
   - Coverage contributions (voxel IDs or hashed points)

2. Motion pass:
   motion_keys = greedy_motion_selection(frames, motion_threshold, min_gap, max_gap)

3. Coverage pass:
   candidates = frames filtered by quality & (if large scenes) downsampled using motion_keys as seeds
   coverage_keys = greedy_coverage_selection(candidates, contributions, budget)

4. Combine & finalize:
   selected = sort(unique(motion_keys ∪ coverage_keys))
   if len(selected) > budget: prune lowest coverage gain while keeping motion anchors
   collect metadata (confidence, motion distance, coverage gain) for diagnostics

5. Optional reinflation pass (if enabled) to restore splat density for the selected frames only.

6. Emit diagnostics in `selected_frames.json`.
```

---

## 6. Integration Points

### 6.1 `_compute_selected_frames`

- Extend the signature to accept:
  - `frame_records` (already present)
  - `extrinsics`, `world_points`, `world_points_conf`
  - optional `confidence_summary`, `frame_timestamps`
- Return a list of dicts with fields such as `frame_id`, `motion_score`, `coverage_gain`, and `cum_motion`, so the artifact can explain the reasoning.

### 6.2 Inference loop

- While iterating frames, record:
  - Pose deltas (stored in arrays for later use).
  - Coverage bitsets, e.g., hashed voxel indices (`np.floor(world_points / voxel_size)`).
  - Quality metrics (mean confidence, brightness).

### 6.3 Job artifacts

- Include selection diagnostics in `selected_frames.json`:

      {
        "frame_id": "...",
        "motion_distance": 0.45,
        "coverage_gain": 0.12,
        "decision": "coverage"
      }

- This enables auditing the chosen frames.

### 6.4 Two-pass pipeline hook

- Add a config flag (e.g., `STREAM3R_KEYFRAME_PREPASS`) to toggle a lightweight pre-pass.
- Pre-pass steps:
  1. Collect frames as usual.
  2. Run a reduced inference loop (camera head only, or full Stream3R with artifact generation disabled) to gather motion and coverage metadata.
  3. Execute the key-frame selection algorithm to produce selected indices.
- Main pass:
  1. Filter `frame_records` to the selected indices.
  2. If the batch size is below a configured maximum, switch inference to full attention; otherwise remain in window mode.
  3. Run the full artifact pipeline (pointmaps, GLB, reinflation) on the reduced set.
  4. Persist selection diagnostics alongside artifacts.
- Provide a fallback path: if the pre-pass fails or returns too few frames, revert to the original sampling strategy so the job still succeeds.
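
The fallback path could be wrapped as in the sketch below. All names here are hypothetical: `prepass_select` and `fallback_select` stand in for the key-frame pre-pass and the original sampling strategy, and `min_frames` is an assumed threshold for "too few frames".

```python
import logging

logger = logging.getLogger(__name__)


def select_with_fallback(frame_records, prepass_select, fallback_select,
                         min_frames=2):
    """Run the pre-pass; revert to the original sampling if it fails.

    prepass_select and fallback_select are hypothetical callables that
    take frame_records and return frame indices. Returns a tuple of
    (indices, source), where source is "prepass" or "fallback".
    """
    try:
        indices = sorted(set(prepass_select(frame_records)))
    except Exception:
        # Any pre-pass failure must not fail the whole job.
        logger.exception("Key-frame pre-pass failed; using original sampling")
        return fallback_select(frame_records), "fallback"

    if len(indices) < min_frames:
        logger.warning("Pre-pass returned %d frames (< %d); using original sampling",
                       len(indices), min_frames)
        return fallback_select(frame_records), "fallback"

    return indices, "prepass"
```

Returning the source alongside the indices lets the diagnostics record which path produced the selection.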

---

## 7. Configuration & Defaults

| Setting | Description | Default |
|---------|-------------|---------|
| `STREAM3R_KEYFRAME_MOTION_THRESH` | Motion distance (m) to trigger new key | 0.3 |
| `STREAM3R_KEYFRAME_ROT_THRESH` | Rotation angle weight `w_r` (m/rad) | 0.5 |
| `STREAM3R_KEYFRAME_MIN_GAP` | Minimum time gap (s) | 0.25 |
| `STREAM3R_KEYFRAME_MAX_GAP` | Max time between keys (s) | 2.0 |
| `STREAM3R_KEYFRAME_TOP_K` | Max number of key frames | 18 (overridable per payload) |
| `STREAM3R_KEYFRAME_MIN_GAIN` | Coverage gain stop threshold | 0.01 |
| `STREAM3R_KEYFRAME_CONF_THRESH` | Confidence threshold for coverage | 0.3 |
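
A minimal sketch of reading these settings from the environment, falling back to the defaults in the table. `load_keyframe_config` is a hypothetical helper, and treating an empty string as "unset" is an assumption.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "STREAM3R_KEYFRAME_MOTION_THRESH": 0.3,
    "STREAM3R_KEYFRAME_ROT_THRESH": 0.5,
    "STREAM3R_KEYFRAME_MIN_GAP": 0.25,
    "STREAM3R_KEYFRAME_MAX_GAP": 2.0,
    "STREAM3R_KEYFRAME_TOP_K": 18,
    "STREAM3R_KEYFRAME_MIN_GAIN": 0.01,
    "STREAM3R_KEYFRAME_CONF_THRESH": 0.3,
}


def load_keyframe_config(env=None):
    """Return the settings as a dict, reading overrides from `env`.

    env defaults to os.environ; pass a plain dict in tests.
    """
    env = os.environ if env is None else env
    config = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        # Cast to the default's type: int for TOP_K, float otherwise.
        config[name] = type(default)(raw) if raw not in (None, "") else default
    return config
```

A per-payload `top_k` override would then replace `config["STREAM3R_KEYFRAME_TOP_K"]` after loading.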

---

## 8. Validation Plan

1. **Quantitative**
   - Compare key frame counts vs. baseline (2 fps sampling).
   - Measure point coverage retention (% of original points represented by key frames).
   - Evaluate overlap with heuristic linear sampling (should be reduced).
2. **Qualitative**
   - Visual inspection: point cloud clutter reduction, better 2D scene graph clarity.
   - Spot-check key-frame artifacts (diagnostic metadata) to ensure decisions align with expectations.
3. **Performance**
   - Ensure coverage computations remain efficient (hash-based; track memory usage).
   - Add timing logs in `_compute_selected_frames`.
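
The coverage-retention metric could be computed as below. This sketch assumes coverage is tracked as per-frame sets of point or voxel IDs; `coverage_retention` is a hypothetical helper, and measuring retention over voxels instead of raw points is a simplification.

```python
def coverage_retention(contributions, selected):
    """Fraction of all covered IDs still represented by the selected frames.

    contributions: dict mapping frame index -> set of point/voxel IDs.
    selected: iterable of selected frame indices.
    """
    all_ids = set().union(*contributions.values())
    if not all_ids:
        return 1.0  # nothing to cover, so nothing is lost

    kept = set().union(*(contributions[i] for i in selected))
    return len(kept) / len(all_ids)
```

Multiplying the result by 100 gives the percentage figure described in the quantitative checks.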

---

## 9. Future Extensions

- Integrate image-content heuristics (entropy, saliency) into coverage scoring.
- Multi-pass selection: first ensure 360° orientation coverage, then fill gaps.
- Adaptive budgets based on room size / path length (use total motion distance).
- Optionally, trigger reinflation of selected frames only for visualization.

---

## Deliverables

1. Updated `_compute_selected_frames` with motion + coverage logic.
2. Supporting utilities for pose distance and coverage hashing.
3. Config hooks & optional environment variables.
4. Tests covering edge cases (no motion, tiny coverage gains, payload `top_k` override).
5. Documentation updates describing new behavior and tuning knobs.