Spaces:
Running
Running
File size: 3,533 Bytes
23680f2 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 |
# HyperView System Architecture
## The Integrated Pipeline Approach
HyperView is built as a three-stage pipeline that turns raw multimodal data into an interactive, fairness-aware view of a dataset. Each stage uses the tool best suited for the job:
* **Ingestion – Python (PyTorch/Geoopt):** Differentiable manifold operations and training of the Hyperbolic Adapter.
* **Storage & Retrieval – Rust (Qdrant):** Low-latency vector search with a custom Poincaré distance metric.
* **Visualization – Browser (WebGL/Deck.gl):** GPU-accelerated rendering of the Poincaré disk in the browser.
## System Diagram
<p align="center">
<img src="../assets/hyperview_architecture.png" alt="HyperView System Architecture: The Integrated Pipeline Approach" width="100%">
</p>
## Component Breakdown
### 1. Ingestion: Hyperbolic Adapter (Python)
* **Role:** The bridge between flat (Euclidean) model embeddings and curved (hyperbolic) space.
* **Input:** Raw data (images/text) → standard model embeddings (e.g. CLIP/ResNet vectors).
* **Tech:** PyTorch, Geoopt.
* **Function:**
* Learns a small Hyperbolic Adapter using differentiable manifold operations.
* Uses the exponential map (`expmap0`) to project Euclidean vectors into the Poincaré ball.
* This is where minority and rare cases are expanded away from the crowded center so they remain distinguishable.
### 2. Storage & Retrieval: Vector Engine (Rust / Qdrant)
* **Role:** The memory that stores and retrieves hyperbolic embeddings at scale.
* **Tech:** Qdrant (forked/extended in Rust).
* **Challenge:** Standard vector DBs only support dot, cosine, or Euclidean distance.
* **Solution:**
* Implement a custom `PoincareDistance` metric in Rust:
$$d(u, v) = \text{arccosh}\left(1 + 2 \frac{\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)$$
* Plug this metric into Qdrant’s HNSW index for fast nearest-neighbor search in hyperbolic space.
* This allows search results to respect the hierarchy in the data instead of collapsing the long tail.
### 3. Visualization: Poincaré Disk Viewer (WebGL)
* **Role:** The lens that lets humans explore the structure of the dataset.
* **Tech:** React, Deck.gl, custom WebGL shaders.
* **Challenge:** Rendering 1M points in non-Euclidean geometry directly in the browser.
* **Solution:**
* Send raw hyperbolic coordinates to the GPU and render them directly onto the Poincaré disk using a custom shader (no CPU-side projection).
* Provide pan/zoom/selection so curators can inspect minority clusters, isolate rare subgroups at the boundary, and export curated subsets.
## Data Flow: The Fairness Pipeline
1. **Ingest:** User uploads a dataset (e.g. medical images, biodiversity data).
2. **Embed:** Standard models (CLIP/ResNet/Whisper) produce Euclidean embeddings.
3. **Expand:** The Hyperbolic Adapter projects them into the Poincaré ball; rare cases move towards the boundary instead of being crushed.
4. **Index:** Qdrant stores these hyperbolic vectors with the custom Poincaré distance metric.
5. **Query:** A user clicks on a minority example or defines a region of interest.
6. **Search:** Qdrant returns semantic neighbors according to Poincaré distance, preserving the hierarchy between majority, minority, and rare subgroups.
7. **Visualize & Curate:** The browser renders the Poincaré disk, highlighting clusters and long-tail regions so users can see gaps, remove duplicates, and build fairer training sets.
|