STream3R: Jobs, Events, and Storage Design
Executive Summary
The STream3R Job System provides an asynchronous GPU job orchestration layer for 3D scene reconstruction and perception tasks.
It standardizes how pose and world-coordinate extraction and scene model building are executed, stored, and tracked across services.
Primary Goals
- Asynchronous GPU processing: all heavy inference runs on background RQ workers; FastAPI services only enqueue jobs and monitor progress.
- Unified observability:
  - Redis Streams for job lifecycle events and progress (stream3r:events)
  - Postgres (stream3r_jobs) as the canonical job record
  - S3/Backblaze for durable artifacts and results
- Two calling modes:
  - get_pose_and_world_coords – Streams-based (Option A) for near-real-time updates
  - create_model – polling (Option B) for long-running model generation
- Consistent storage under /scene/{scene_id}/stream3r/, containing:
  - kv_cache.pt – serialized key/value cache state
  - predictions.npz – packed outputs from the model build
  - session_settings.json – runtime/config parameters
  - selected_frames.json – frame subset selection
  - scene.glb – final assembled scene model
  - poses.jsonl – per-frame extrinsics (camera poses)
  - pointmaps/*.npz – per-frame world coordinates + confidence maps
Key Outcomes
- Clean separation of API and GPU worker responsibilities
- Event-driven feedback for quick jobs; reliable polling for long ones
- Durable, versioned scene data under a unified layout
- End-to-end traceability of all STream3R jobs via Redis + Postgres + S3
1. Queues, Streams, and Locks

| Component | Purpose | Notes |
|---|---|---|
| pose_pointmap | RQ queue for latency-sensitive pose_pointmap jobs | |
| model_build | RQ queue for long model_build jobs | |
| stream3r:events | Redis Stream for all job events (started, progress, finished, failed) | trimmed periodically |
| gpu:lock | Redis lock ensuring a single GPU job at a time per machine | |

Each Stream event is a flat map of strings: job_id, job_type, scene_id, status, progress, result_url, model_dir, error, ts.
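Since Stream fields must be flat strings, a small helper can normalize an event before XADD. A minimal sketch in Python (the helper name and defaults are illustrative, not part of the system contract):

```python
import time

def make_event(job_id, job_type, scene_id, status, progress=0, **extra):
    """Build a flat string->string map suitable for XADD to stream3r:events."""
    event = {
        "job_id": job_id,
        "job_type": job_type,
        "scene_id": scene_id,
        "status": status,
        "progress": str(progress),
        "ts": f"{time.time():.2f}",
    }
    # Optional fields are included only when set, so consumers can rely on
    # their presence implying a value.
    for key in ("result_url", "model_dir", "error"):
        if extra.get(key):
            event[key] = str(extra[key])
    return event
```

A worker would then emit with redis-py, e.g. `r.xadd("stream3r:events", make_event(...))`.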
2. S3 / Backblaze Storage Layout
All STream3R artifacts live under a scene folder:

```
s3://<bucket>/scene/{scene_id}/stream3r/
  results/
    {job_id}.json           # per-job result JSON (pose_pointmap)
  models/
    kv_cache.pt             # serialized KV cache
    predictions.npz         # packed model outputs
    session_settings.json   # runtime/config settings
    selected_frames.json    # frame subset indices
    scene.glb               # fused 3D scene
    poses.jsonl             # per-frame extrinsics
    summary.json            # canonical model_build result JSON
  pointmaps/
    {frame_token}.npz       # per-frame world_coords + confidence
```
Key Result URLs
- Pose/pointmap job → s3://.../scene/{scene_id}/stream3r/results/{job_id}.json
- Model build job → s3://.../scene/{scene_id}/stream3r/models/summary.json
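To keep workers and callers from drifting off this layout, the object keys can be derived with trivial helpers. A sketch (function names are illustrative; the bucket is left to the S3 client configuration):

```python
def stream3r_prefix(scene_id: str) -> str:
    """Root key prefix for all STream3R artifacts of one scene."""
    return f"scene/{scene_id}/stream3r"

def pose_result_key(scene_id: str, job_id: str) -> str:
    """Key for a per-job pose_pointmap result JSON."""
    return f"{stream3r_prefix(scene_id)}/results/{job_id}.json"

def model_summary_key(scene_id: str) -> str:
    """Key for the canonical model_build summary JSON."""
    return f"{stream3r_prefix(scene_id)}/models/summary.json"

def pointmap_key(scene_id: str, frame_token: str) -> str:
    """Key for a per-frame world-coordinate/confidence .npz."""
    return f"{stream3r_prefix(scene_id)}/pointmaps/{frame_token}.npz"
```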
3. Database: stream3r_jobs
Canonical job table in Postgres.
```sql
CREATE TABLE IF NOT EXISTS stream3r_jobs (
    job_id        UUID PRIMARY KEY,
    job_type      TEXT NOT NULL,        -- 'pose_pointmap' | 'model_build'
    scene_id      TEXT NOT NULL,
    status        TEXT NOT NULL,        -- 'queued' | 'started' | 'finished' | 'failed'
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    started_at    TIMESTAMPTZ,
    completed_at  TIMESTAMPTZ,
    payload       JSONB,                -- enqueue-time payload
    result        JSONB,                -- URLs / metrics
    error         TEXT
);

CREATE INDEX IF NOT EXISTS stream3r_jobs_scene_id_idx ON stream3r_jobs(scene_id);
CREATE INDEX IF NOT EXISTS stream3r_jobs_status_idx ON stream3r_jobs(status);
```
Upsert pattern:
- Insert on enqueue (queued)
- Update on start → started
- Update on finish → finished, add result
- Update on failure → failed, add error
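These transitions can be captured as parameterized SQL. A sketch using psycopg-style %s placeholders (the helper is illustrative; production code would run the statements inside the worker's DB session, with retries per the operational guidelines below):

```python
import json

# Insert on enqueue; a duplicate job_id is a no-op, which also gives
# idempotency for caller-supplied job ids.
UPSERT_SQL = (
    "INSERT INTO stream3r_jobs (job_id, job_type, scene_id, status, payload) "
    "VALUES (%s, %s, %s, 'queued', %s) ON CONFLICT (job_id) DO NOTHING"
)

def transition_update(job_id, status, result=None, error=None):
    """Return (sql, params) for one status transition on stream3r_jobs."""
    if status == "started":
        return ("UPDATE stream3r_jobs SET status='started', started_at=now() "
                "WHERE job_id=%s", (job_id,))
    if status == "finished":
        return ("UPDATE stream3r_jobs SET status='finished', completed_at=now(), "
                "result=%s WHERE job_id=%s", (json.dumps(result or {}), job_id))
    if status == "failed":
        return ("UPDATE stream3r_jobs SET status='failed', completed_at=now(), "
                "error=%s WHERE job_id=%s", (error or "", job_id))
    raise ValueError(f"unknown transition: {status}")
```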
4. Result JSON Schemas
a. Pose + World Coords (per-frame)
s3://…/scene/{scene_id}/stream3r/results/{job_id}.json
```
{
  "job_id": "uuid",
  "job_type": "pose_pointmap",
  "scene_id": "SCENE123",
  "artifacts": {
    "pointmap_url": "s3://.../scene/SCENE123/stream3r/pointmaps/frame_000010.npz"
  },
  "pose": { "R": [[...]], "t": [x, y, z] },
  "intrinsics": { "fx": ..., "fy": ..., "cx": ..., "cy": ... },
  "metrics": { "runtime_s": 1.23 },
  "stream3r": { "cfg": "configs/stream3r_base.yaml", "commit": "<git_sha>" }
}
```
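A caller consuming this document only needs a few of these fields. A small, illustrative parser (the function name and the returned shape are assumptions, not a fixed API):

```python
def parse_pose_result(doc: dict) -> dict:
    """Pull the caller-facing fields out of a pose_pointmap result JSON.

    Raises if the document is not a pose_pointmap result, so callers fail
    fast on a mixed-up result_url.
    """
    if doc.get("job_type") != "pose_pointmap":
        raise ValueError("not a pose_pointmap result")
    return {
        "pose": doc["pose"],
        "intrinsics": doc["intrinsics"],
        "pointmap_url": doc["artifacts"]["pointmap_url"],
    }
```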
b. Model Build (scene-level)
s3://…/scene/{scene_id}/stream3r/models/summary.json
```
{
  "job_id": "uuid",
  "job_type": "model_build",
  "scene_id": "SCENE123",
  "artifacts": {
    "model_dir": "s3://.../scene/SCENE123/stream3r/models/",
    "kv_cache": "s3://.../scene/SCENE123/stream3r/models/kv_cache.pt",
    "predictions": "s3://.../scene/SCENE123/stream3r/models/predictions.npz",
    "session_settings": "s3://.../scene/SCENE123/stream3r/models/session_settings.json",
    "selected_frames": "s3://.../scene/SCENE123/stream3r/models/selected_frames.json",
    "scene_glb": "s3://.../scene/SCENE123/stream3r/models/scene.glb",
    "poses_jsonl": "s3://.../scene/SCENE123/stream3r/models/poses.jsonl"
  },
  "metrics": { "frames": 128, "runtime_s": 42.3 },
  "stream3r": { "cfg": "configs/stream3r_base.yaml", "commit": "<git_sha>" }
}
```
5. Caller API Responsibilities
get_pose_and_world_coords – Option A (Streams)
1. Enqueue job → get job_id
2. XREAD BLOCK on stream3r:events until status=finished
3. On finish:
   - Fetch result_url
   - Load the JSON → retrieve pose, intrinsics, and pointmap_url
   - Download the .npz to get world_coords + confidence
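The Option A wait loop can be sketched as follows. The client is assumed to follow redis-py's xread signature (e.g. a redis.Redis(decode_responses=True) instance); the helper name and timeouts are illustrative:

```python
import time

def wait_for_finish(client, job_id, stream="stream3r:events",
                    block_ms=5000, timeout_s=300.0):
    """Block on XREAD until our job reports finished or failed.

    Returns the terminal event's field map; raises TimeoutError if the
    job does not reach a terminal status within timeout_s.
    """
    last_id = "$"  # only consume events newer than now
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = client.xread({stream: last_id}, block=block_ms)
        for _stream, entries in resp or []:
            for entry_id, fields in entries:
                last_id = entry_id          # resume after this entry next round
                if fields.get("job_id") != job_id:
                    continue                # event belongs to another job
                if fields.get("status") in ("finished", "failed"):
                    return fields
    raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")
```

On a finished event the caller then fetches result_url and proceeds as in the steps above.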
create_model – Option B (Polling)
1. Enqueue job → return job_id immediately
2. Periodically poll GET /jobs/{job_id}
3. On finished:
   - Read result with result_url + model_dir
   - Download summary.json and the listed model files
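The Option B loop reduces to a generic poller. A sketch with the HTTP call injected as a callable, so the cadence stays testable (names and defaults are illustrative; get_job would wrap the actual GET /jobs/{job_id} request):

```python
import time

def poll_until_done(get_job, job_id, interval_s=5.0, timeout_s=3600.0,
                    sleep=time.sleep):
    """Poll until the job reaches a terminal status.

    get_job(job_id) must return the job row as a dict with at least
    'status'; on success the 'result' field (result_url + model_dir)
    is returned to the caller.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = get_job(job_id)
        if job["status"] == "finished":
            return job["result"]
        if job["status"] == "failed":
            raise RuntimeError(job.get("error", "job failed"))
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} still running after {timeout_s}s")
```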
6. Worker Event & Persistence Flow
1. Acquire the GPU lock (gpu:lock)
2. Emit started
3. Upsert the DB row (stream3r_jobs)
4. Run inference, emitting progress events (every N frames)
5. Save artifacts to S3:
   - pointmaps/*.npz with {xyz, conf}
   - poses.jsonl
   - model outputs listed above
6. Write the result JSON → emit finished
7. Update the DB row → status=finished, result=…
8. On error → emit failed, update the DB
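The lifecycle above can be expressed as one worker function with its side effects injected. This is a sketch, not the actual worker: gpu_lock, emit, upsert, infer, and save are assumed interfaces standing in for the Redis lock, Stream XADD, Postgres upsert, model inference, and the S3 uploads.

```python
def run_stream3r_job(job, deps):
    """Drive one job through the started/finished/failed lifecycle.

    deps provides: gpu_lock (context manager), emit(event_dict),
    upsert(job_id, status, **kw), infer(job) -> result dict,
    save(result) -> result_url. All names are illustrative.
    """
    with deps.gpu_lock:                       # serialize GPU access
        deps.emit({"job_id": job["job_id"], "status": "started"})
        deps.upsert(job["job_id"], "started")
        try:
            result = deps.infer(job)          # emits its own progress events
            result_url = deps.save(result)    # artifacts + result JSON to S3
            deps.emit({"job_id": job["job_id"], "status": "finished",
                       "result_url": result_url})
            deps.upsert(job["job_id"], "finished",
                        result={"result_url": result_url})
        except Exception as exc:
            deps.emit({"job_id": job["job_id"], "status": "failed",
                       "error": str(exc)})
            deps.upsert(job["job_id"], "failed", error=str(exc))
            raise                             # let RQ record the failure too
```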
7. Example Event Payloads (Redis Stream)
Started
```
job_id=uuid
job_type=pose_pointmap
scene_id=SCENE123
status=started
progress=1
ts=1730312345.12
```
Progress
```
job_id=uuid
job_type=model_build
scene_id=SCENE123
status=progress
progress=40
ts=1730312456.22
```
Finished
```
job_id=uuid
job_type=model_build
scene_id=SCENE123
status=finished
progress=100
result_url=s3://bucket/scene/SCENE123/stream3r/models/summary.json
model_dir=s3://bucket/scene/SCENE123/stream3r/models/
ts=1730312567.33
```
Failed
```
job_id=uuid
job_type=pose_pointmap
scene_id=SCENE123
status=failed
error=RuntimeError: CUDA OOM
ts=1730312570.00
```
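Consumers reading these payloads via XREAD/XRANGE may receive bytes keys and values depending on client configuration (redis-py returns bytes unless decode_responses=True). A small, illustrative normalizer:

```python
def decode_event(fields) -> dict:
    """Normalize a raw Stream entry field map into a str -> str dict."""
    def as_str(v):
        return v.decode("utf-8") if isinstance(v, (bytes, bytearray)) else str(v)
    return {as_str(k): as_str(v) for k, v in fields.items()}
```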
8. Operational Guidelines
| Concern | Best Practice |
|---|---|
| GPU Safety | Use gpu:lock to serialize jobs per GPU |
| Redis Stream retention | XTRIM stream3r:events MAXLEN ~50000 |
| Durability | All artifacts and summaries must persist to S3/Backblaze |
| DB Reliability | Upsert on each transition; retry writes if DB unavailable |
| Idempotency | Support caller-supplied job_id or request_id |
| Security | Keep Redis internal; use signed or private S3 URLs |
| Backpressure | Enqueueing API should reject (429) when queue depth too large |
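The backpressure rule in the table reduces to a pure check that the enqueue endpoint applies before accepting a job. A sketch (the depth threshold of 100 is an illustrative default, not a value from this design):

```python
def admission_status(queue_depth: int, max_depth: int = 100) -> int:
    """HTTP status the enqueue endpoint should return for the current
    RQ queue depth: 202 Accepted below the threshold, 429 otherwise."""
    return 202 if queue_depth < max_depth else 429
```

Retention from the same table could similarly be enforced on a timer with redis-py, e.g. r.xtrim("stream3r:events", maxlen=50000, approximate=True).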
9. End-to-End Flows
🔹 Pose + World Coords (short job)
- API enqueues job → returns job_id
- Client subscribes via Redis Stream (blocking XREAD)
- Worker runs inference → writes the pointmap .npz + result JSON
- Worker emits finished → client downloads results

🔹 Model Build (long job)
- API enqueues → returns job_id
- Client polls GET /jobs/{id} or the DB row
- Worker fuses frames → writes the full scene model files
- Worker updates DB + emits finished
- Client retrieves summary.json + artifacts under /scene/{scene_id}/stream3r/models/
10. Summary
| Component | Responsibility | Persistence |
|---|---|---|
| FastAPI API | Enqueue jobs, expose /jobs/{id} | DB (via worker), Redis (events) |
| GPU Worker | Execute STream3R inference, emit events | S3/Backblaze, DB |
| Redis Streams | Event bus for progress + completion | ephemeral |
| Postgres (stream3r_jobs) | Canonical job record | durable |
| S3/Backblaze /scene/ | Scene artifacts, model data | durable |
Outcome: This design provides an asynchronous, event-driven, and durable framework for managing STream3R GPU jobs, with standardized scene storage, traceable job metadata, and clear integration points for both real-time and long-running workflows.