STream3R — Jobs, Events, and Storage Design

Executive Summary

The STream3R Job System provides an asynchronous GPU job orchestration layer for 3D scene reconstruction and perception tasks.
It standardizes how pose/world-coordinate extraction and scene-model builds are executed, stored, and tracked across services.

Primary Goals

  • Asynchronous GPU processing:
    All heavy inference runs on background RQ workers; FastAPI services only enqueue jobs and monitor progress.
  • Unified observability:
    • Redis Streams for job lifecycle events and progress (stream3r:events)
    • Postgres (stream3r_jobs) as the canonical job record
    • S3/Backblaze for durable artifacts and results
  • Two calling modes:
    • get_pose_and_world_coords → Streams-based (Option A) for near-real-time updates
    • create_model → Polling (Option B) for long-running model generation
  • Consistent storage under /scene/{scene_id}/stream3r/, containing:
    • kv_cache.pt — serialized key/value cache state
    • predictions.npz — packed outputs from the model build
    • session_settings.json — runtime/config parameters
    • selected_frames.json — frame subset selection
    • scene.glb — final assembled scene model
    • poses.jsonl — per-frame extrinsics (camera poses)
    • pointmaps/*.npz — per-frame world coordinates + confidence maps

Key Outcomes

  • Clean separation of API ↔ GPU worker responsibilities
  • Event-driven feedback for quick jobs; reliable polling for long ones
  • Durable, versioned scene data under a unified layout
  • End-to-end traceability of all STream3R jobs via Redis + Postgres + S3

1. Queues, Streams, and Locks

Component         Purpose                                                                Notes
pose_pointmap     RQ queue for latency-sensitive pose_pointmap jobs
model_build       RQ queue for long-running model_build jobs
stream3r:events   Redis Stream for all job events (started, progress, finished, failed)  trimmed periodically
gpu:lock          Redis lock ensuring a single GPU job at a time per machine

Each Stream event is a flat map of strings:


job_id, job_type, scene_id,
status, progress, result_url, model_dir,
error, ts
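As a sketch, a worker could publish these flat string maps with redis-py; the field set follows the schema above, while the helper names and the trim length are illustrative:

```python
import time

def build_event_fields(job_id, job_type, scene_id, status, progress=0, **extra):
    """Flat map of strings, matching the event schema above."""
    return {
        "job_id": job_id,
        "job_type": job_type,
        "scene_id": scene_id,
        "status": status,
        "progress": str(progress),
        "ts": f"{time.time():.2f}",
        **extra,
    }

def emit_event(r, **kwargs):
    """Publish to stream3r:events; r is a redis-py client.

    MAXLEN ~50000 keeps the stream bounded; approximate trimming is cheap.
    """
    return r.xadd("stream3r:events", build_event_fields(**kwargs),
                  maxlen=50_000, approximate=True)
```

Keeping every value a string matters because Redis Stream entries are field/value pairs of bytes or strings; consumers parse progress and ts back out.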

2. S3 / Backblaze Storage Layout

All STream3R artifacts live under a scene folder:


s3://<bucket>/scene/{scene_id}/stream3r/
  results/
    {job_id}.json                   # per-job result JSON (pose_pointmap)
  models/
    kv_cache.pt                     # serialized KV cache
    predictions.npz                 # packed model outputs
    session_settings.json           # runtime/config settings
    selected_frames.json            # frame subset indices
    scene.glb                       # fused 3D scene
    poses.jsonl                     # per-frame extrinsics
    summary.json                    # canonical model_build result JSON
  pointmaps/
    {frame_token}.npz               # per-frame world_coords + confidence

Key Result URLs

  • Pose/pointmap job → s3://.../scene/{scene_id}/stream3r/results/{job_id}.json
  • Model build job → s3://.../scene/{scene_id}/stream3r/models/summary.json
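Both canonical URLs are fully determined by scene_id and job_type, so callers can derive them; a hypothetical helper (bucket value and function name are illustrative):

```python
def result_url(bucket: str, scene_id: str, job_type: str, job_id: str) -> str:
    """Build the canonical result URL for a job (layout from section 2)."""
    base = f"s3://{bucket}/scene/{scene_id}/stream3r"
    if job_type == "pose_pointmap":
        # One result JSON per job.
        return f"{base}/results/{job_id}.json"
    if job_type == "model_build":
        # One summary per scene's model directory.
        return f"{base}/models/summary.json"
    raise ValueError(f"unknown job_type: {job_type}")
```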

3. Database: stream3r_jobs

Canonical job table in Postgres.

CREATE TABLE IF NOT EXISTS stream3r_jobs (
  job_id         UUID PRIMARY KEY,
  job_type       TEXT NOT NULL,           -- 'pose_pointmap' | 'model_build'
  scene_id       TEXT NOT NULL,
  status         TEXT NOT NULL,           -- 'queued' | 'started' | 'finished' | 'failed'
  created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  started_at     TIMESTAMPTZ,
  completed_at   TIMESTAMPTZ,
  payload        JSONB,                   -- enqueue-time payload
  result         JSONB,                   -- URLs / metrics
  error          TEXT
);

CREATE INDEX IF NOT EXISTS stream3r_jobs_scene_id_idx ON stream3r_jobs(scene_id);
CREATE INDEX IF NOT EXISTS stream3r_jobs_status_idx   ON stream3r_jobs(status);

Upsert pattern:

  • Insert on enqueue (queued)
  • Update on start → started
  • Update on finish → finished, add result
  • Update on failure → failed, add error
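One possible mapping of these transitions to SQL against the table above; a sketch assuming psycopg-style %s placeholders:

```python
# Transition -> parametrized SQL (Postgres). The insert is idempotent so a
# retried enqueue with the same job_id does not clobber an existing row.
UPSERT_SQL = {
    "queued": """
        INSERT INTO stream3r_jobs (job_id, job_type, scene_id, status, payload)
        VALUES (%s, %s, %s, 'queued', %s)
        ON CONFLICT (job_id) DO NOTHING
    """,
    "started": """
        UPDATE stream3r_jobs
        SET status = 'started', started_at = now()
        WHERE job_id = %s
    """,
    "finished": """
        UPDATE stream3r_jobs
        SET status = 'finished', completed_at = now(), result = %s
        WHERE job_id = %s
    """,
    "failed": """
        UPDATE stream3r_jobs
        SET status = 'failed', completed_at = now(), error = %s
        WHERE job_id = %s
    """,
}
```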

4. Result JSON Schemas

a. Pose + World Coords (per-frame)

s3://…/scene/{scene_id}/stream3r/results/{job_id}.json

{
  "job_id": "uuid",
  "job_type": "pose_pointmap",
  "scene_id": "SCENE123",
  "artifacts": {
    "pointmap_url": "s3://.../scene/SCENE123/stream3r/pointmaps/frame_000010.npz"
  },
  "pose": { "R": [[...]], "t": [x, y, z] },
  "intrinsics": { "fx":..., "fy":..., "cx":..., "cy":... },
  "metrics": { "runtime_s": 1.23 },
  "stream3r": { "cfg": "configs/stream3r_base.yaml", "commit": "<git_sha>" }
}
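The pointmap_url above points at an .npz whose array names follow section 6 ({xyz, conf}); unpacking it might look like this (helper name and shapes are illustrative):

```python
import io
import numpy as np

def load_pointmap(npz_bytes: bytes):
    """Unpack a pointmaps/{frame_token}.npz payload into (xyz, conf) arrays."""
    with np.load(io.BytesIO(npz_bytes)) as data:
        # xyz: per-pixel world coordinates; conf: per-pixel confidence map.
        return data["xyz"], data["conf"]
```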

b. Model Build (scene-level)

s3://…/scene/{scene_id}/stream3r/models/summary.json

{
  "job_id": "uuid",
  "job_type": "model_build",
  "scene_id": "SCENE123",
  "artifacts": {
    "model_dir":        "s3://.../scene/SCENE123/stream3r/models/",
    "kv_cache":         "s3://.../scene/SCENE123/stream3r/models/kv_cache.pt",
    "predictions":      "s3://.../scene/SCENE123/stream3r/models/predictions.npz",
    "session_settings": "s3://.../scene/SCENE123/stream3r/models/session_settings.json",
    "selected_frames":  "s3://.../scene/SCENE123/stream3r/models/selected_frames.json",
    "scene_glb":        "s3://.../scene/SCENE123/stream3r/models/scene.glb",
    "poses_jsonl":      "s3://.../scene/SCENE123/stream3r/models/poses.jsonl"
  },
  "metrics": { "frames": 128, "runtime_s": 42.3 },
  "stream3r": { "cfg": "configs/stream3r_base.yaml", "commit": "<git_sha>" }
}

5. Caller API Responsibilities

get_pose_and_world_coords → Option A (Streams)

  1. Enqueue job → get job_id

  2. XREAD BLOCK on stream3r:events until status=finished

  3. On finish:

    • Fetch result_url
    • Load JSON → retrieve pose, intrinsics, and pointmap_url
    • Download .npz to get world_coords + confidence
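The blocking wait in step 2 could be sketched with redis-py's XREAD; assumes a client created with decode_responses=True, and the helper name is illustrative:

```python
def wait_for_job(r, job_id, timeout_ms=5_000):
    """Block on stream3r:events until this job reports finished or failed.

    r is a redis-py client with decode_responses=True.
    """
    last_id = "$"  # only see events published after we start listening
    while True:
        resp = r.xread({"stream3r:events": last_id}, block=timeout_ms, count=50)
        if not resp:
            continue  # timed out with no new events; keep waiting
        for _stream, entries in resp:
            for entry_id, fields in entries:
                last_id = entry_id  # resume from the last-seen entry
                if fields.get("job_id") != job_id:
                    continue  # event for some other job
                if fields.get("status") in ("finished", "failed"):
                    return fields
```

The caller then fetches result_url from the returned fields and proceeds with step 3.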

create_model → Option B (Polling)

  1. Enqueue job → return job_id immediately

  2. Periodically poll GET /jobs/{job_id}

  3. On finished:

    • Read result with result_url + model_dir
    • Download summary.json and listed model files
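The polling loop can be sketched generically; here fetch stands in for a GET /jobs/{job_id} call, and the callable-based shape, interval, and attempt cap are illustrative:

```python
import time

def poll_job(fetch, job_id, interval_s=5.0, max_attempts=720):
    """Poll until the job reaches a terminal status.

    fetch(job_id) should return the job record as a dict, e.g. the JSON
    body of GET /jobs/{job_id}.
    """
    for _ in range(max_attempts):
        job = fetch(job_id)
        if job["status"] in ("finished", "failed"):
            return job
        time.sleep(interval_s)  # long model builds: poll gently
    raise TimeoutError(f"job {job_id} did not finish in time")
```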

6. Worker Event & Persistence Flow

  1. Acquire GPU lock

  2. Emit started

  3. Upsert DB row (stream3r_jobs)

  4. Run inference, emitting progress events (every N frames)

  5. Save artifacts to S3:

    • pointmaps/*.npz with {xyz, conf}
    • poses.jsonl
    • Model outputs listed above
  6. Write result JSON → emit finished

  7. Update DB row → status=finished, result=…

  8. On error → emit failed, update DB
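Step 1 could use redis-py's built-in Lock for gpu:lock; a minimal sketch, where the helper name and TTL are illustrative (choose a TTL longer than the worst-case job runtime so a crashed worker cannot hold the GPU forever):

```python
def run_with_gpu_lock(r, fn, lock_name="gpu:lock", ttl_s=3600):
    """Serialize GPU work per machine via a Redis lock.

    r is a redis-py client; fn is the inference callable.
    """
    lock = r.lock(lock_name, timeout=ttl_s, blocking_timeout=None)
    with lock:  # blocks until acquired; released on exit, even on error
        return fn()
```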


7. Example Event Payloads (Redis Stream)

Started

job_id=uuid
job_type=pose_pointmap
scene_id=SCENE123
status=started
progress=1
ts=1730312345.12

Progress

job_id=uuid
job_type=model_build
scene_id=SCENE123
status=progress
progress=40
ts=1730312456.22

Finished

job_id=uuid
job_type=model_build
scene_id=SCENE123
status=finished
progress=100
result_url=s3://bucket/scene/SCENE123/stream3r/models/summary.json
model_dir=s3://bucket/scene/SCENE123/stream3r/models/
ts=1730312567.33

Failed

job_id=uuid
job_type=pose_pointmap
scene_id=SCENE123
status=failed
error=RuntimeError: CUDA OOM
ts=1730312570.00

8. Operational Guidelines

Concern                  Best Practice
GPU Safety               Use gpu:lock to serialize jobs per GPU
Redis Stream retention   XTRIM stream3r:events MAXLEN ~ 50000
Durability               All artifacts and summaries must persist to S3/Backblaze
DB Reliability           Upsert on each transition; retry writes if the DB is unavailable
Idempotency              Support caller-supplied job_id or request_id
Security                 Keep Redis internal; use signed or private S3 URLs
Backpressure             Enqueueing API should reject (429) when queue depth is too large
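The backpressure rule can be sketched as a pre-enqueue check; the exception type and max_depth default are illustrative (with RQ, queue depth is len(queue), and the API layer maps the exception to HTTP 429):

```python
class QueueFull(Exception):
    """Raised when the queue is too deep; map to HTTP 429 in the API."""

def check_backpressure(queue_depth: int, max_depth: int = 100) -> None:
    # Reject before enqueueing so the worker pool never accumulates an
    # unbounded backlog of GPU jobs.
    if queue_depth >= max_depth:
        raise QueueFull(f"queue depth {queue_depth} >= {max_depth}")
```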

9. End-to-End Flows

🔹 Pose + World Coords (short job)

  1. API enqueues job → returns job_id
  2. Client subscribes via Redis Stream (blocking XREAD)
  3. Worker runs inference → writes pointmap.npz + result.json
  4. Worker emits finished → client downloads results

🔹 Model Build (long job)

  1. API enqueues → returns job_id
  2. Client polls GET /jobs/{id} or DB row
  3. Worker fuses frames → writes full scene model files
  4. Worker updates DB + emits finished
  5. Client retrieves summary.json + artifacts under /scene/{scene_id}/stream3r/models/

10. Summary

Component                  Responsibility                             Persistence
FastAPI API                Enqueue jobs, expose /jobs/{id}            DB (via worker), Redis (events)
GPU Worker                 Execute STream3R inference, emit events    S3/Backblaze, DB
Redis Streams              Event bus for progress + completion        ephemeral
Postgres (stream3r_jobs)   Canonical job record                       durable
S3/Backblaze /scene/       Scene artifacts, model data                durable

Outcome: This design provides an asynchronous, event-driven, and durable framework for managing STream3R GPU jobs, with standardized scene storage, traceable job metadata, and clear integration points for both real-time and long-running workflows.