phanerozoic commited on
Commit
51eb329
Β·
verified Β·
1 Parent(s): e8fb6fe

Rewrite README: explain the decomposition before naming it, mention EUPE, measured register

Browse files
Files changed (1) hide show
  1. README.md +17 -13
README.md CHANGED
@@ -12,21 +12,25 @@ library_name: pytorch
12
 
13
  # Prism
14
 
15
- One head architecture for all spatial vision tasks. The structure separates features by scale; the output weights select the task.
16
 
17
- The cofiber decomposition (average pool, subtract, iterate) produces multi-scale feature bands from a frozen backbone's spatial output. Each band isolates information present at one spatial scale but absent from the next coarser scale. The decomposition is analytic β€” zero learned parameters. Task-specific prediction is a single linear layer per task on the shared decomposed features.
 
 
18
 
19
  ```
20
- Frozen backbone β†’ spatial features β†’ cofiber decomposition (0 params)
21
- ↓
22
- scale band 0 (stride 16)
23
- scale band 1 (stride 32)
24
- scale band 2 (stride 64)
25
- ↓
26
- β”Œβ”€β”€ Detection weights (768 β†’ 80 classes + 4 box + 1 ctr)
27
- β”œβ”€β”€ Segmentation weights (768 β†’ 150 classes)
28
- β”œβ”€β”€ Depth weights (768 β†’ 256 bins)
29
- └── Any other spatial task (768 β†’ N)
 
 
30
  ```
31
 
32
- The architecture is the decomposition. The task is the last matrix multiply.
 
12
 
13
  # Prism
14
 
15
+ A multi-task head that attaches to a frozen vision backbone and produces detection, segmentation, and depth predictions from a single shared feature decomposition. Developed on EUPE-ViT-B (86M parameters, Meta FAIR) but applicable to any backbone that outputs dense spatial features.
16
 
17
+ The spatial features from the backbone are separated into scale bands by iteratively subtracting a downsampled-then-upsampled copy of the features from the features themselves. Each subtraction isolates the information present at one spatial resolution but absent from the next coarser resolution. This decomposition requires no learned parameters β€” it is a fixed sequence of average pooling, bilinear upsampling, and subtraction. The resulting scale bands are the input to all task heads.
18
+
19
+ Each task is a linear projection on the shared scale bands. Detection, segmentation, and depth differ only in their output weight matrices. The decomposition and the per-scale normalization are shared across all tasks and computed once per image.
20
 
21
  ```
22
+ Image β†’ EUPE-ViT-B (frozen, 86M params) β†’ spatial features [768, H, W]
23
+
24
+ Scale decomposition (0 learned params):
25
+ band_0 = features - upsample(pool(features)) stride 16
26
+ band_1 = pool(features) - upsample(pool(pool(features))) stride 32
27
+ band_2 = pool(pool(features)) stride 64
28
+
29
+ Task outputs (one linear layer each):
30
+ β”œβ”€β”€ Detection (768 β†’ 85 per location) ~65K params
31
+ β”œβ”€β”€ Segmentation (768 β†’ 150 per location) ~115K params
32
+ β”œβ”€β”€ Depth (768 β†’ 256 per location) ~197K params
33
+ └── [any spatial task] (768 β†’ N)
34
  ```
35
 
36
+ The scale decomposition is derived from the adjoint structure of the upsample/downsample pair. A machine-checked proof of the decomposition's cross-scale orthogonality properties is available at `phanerozoic/threshold-cofiber-detection`.