Datasets

Creating a Dataset

import hyperview as hv

# Persistent dataset (default) - survives restarts
dataset = hv.Dataset("my_dataset")

# In-memory dataset - lost when process exits
dataset = hv.Dataset("my_dataset", persist=False)

Storage location: ~/.hyperview/datasets/ (configurable via HYPERVIEW_DATABASE_DIR)

Internally, each dataset is stored as two Lance tables (directories) inside that folder:

  • hyperview_{dataset_name}.lance/ (samples)
  • hyperview_{dataset_name}_meta.lance/ (metadata like label colors)
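To locate a dataset's tables on disk, the naming scheme above can be sketched as follows. This is only an illustration: it assumes HYPERVIEW_DATABASE_DIR is read as an environment variable, and the helper name `lance_table_paths` is hypothetical, not part of the HyperView API.

```python
import os
from pathlib import Path


def lance_table_paths(dataset_name: str) -> tuple[Path, Path]:
    """Return the (samples, metadata) table directories for a dataset.

    Hypothetical helper that mirrors the naming scheme described above.
    """
    # Assumption: HYPERVIEW_DATABASE_DIR is an environment variable that
    # overrides the default storage location.
    base = Path(
        os.environ.get("HYPERVIEW_DATABASE_DIR", "~/.hyperview/datasets")
    ).expanduser()
    samples = base / f"hyperview_{dataset_name}.lance"
    meta = base / f"hyperview_{dataset_name}_meta.lance"
    return samples, meta


samples_dir, meta_dir = lance_table_paths("my_dataset")
print(samples_dir.name)  # hyperview_my_dataset.lance
print(meta_dir.name)     # hyperview_my_dataset_meta.lance
```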

Adding Samples

From HuggingFace

dataset.add_from_huggingface(
    "uoft-cs/cifar100",
    split="train",
    image_key="img",
    label_key="fine_label",
    max_samples=1000,
)

From Directory

dataset.add_images_dir("/path/to/images", label_from_folder=True)
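The exact behavior of label_from_folder=True is not spelled out here. A plausible reading, following the common one-folder-per-class convention, is that each image takes the name of its parent folder as its label. The sketch below illustrates that convention only; `iter_labeled_images` and the list of accepted suffixes are hypothetical and not taken from HyperView.

```python
from pathlib import Path

# Assumption: which file types count as images is a guess for illustration.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def iter_labeled_images(root: str):
    """Yield (image_path, label) pairs, using the parent folder name as label."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            yield path, path.parent.name
```

Under this convention, a file at /path/to/images/cat/001.jpg would be labeled "cat".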

Persistence Model: Additive

HyperView uses an additive persistence model:

| Action | Behavior |
|---|---|
| Add samples | New samples inserted, existing skipped by ID |
| Request fewer than exist | Existing samples preserved (no deletion) |
| Request more than exist | Only new samples added |
| Embeddings | Cached per-sample, reused across sessions |
| Projections | Recomputed when new samples added (UMAP requires refit) |

Example:

dataset = hv.Dataset("my_dataset")

dataset.add_from_huggingface(..., max_samples=200)  # 200 samples
dataset.add_from_huggingface(..., max_samples=400)  # +200 new → 400 total
dataset.add_from_huggingface(..., max_samples=300)  # no change → 400 total
dataset.add_from_huggingface(..., max_samples=500)  # +100 new → 500 total

Samples are never deleted implicitly. To remove data, delete the whole dataset explicitly with hv.Dataset.delete("name").
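The counts in the example above can be reproduced with a toy in-memory model of additive insertion. This is a sketch, not HyperView's implementation: `AdditiveStore` is a hypothetical class, and it assumes the source yields samples in a stable order with stable IDs, so that a larger max_samples always extends the earlier prefix.

```python
class AdditiveStore:
    """Toy model of additive persistence (hypothetical, for illustration)."""

    def __init__(self):
        self.samples = {}  # sample ID -> sample payload

    def add(self, source, max_samples):
        """Insert up to max_samples items from source, skipping known IDs.

        Returns the number of newly inserted samples.
        """
        added = 0
        for sample_id, payload in source[:max_samples]:
            if sample_id in self.samples:
                continue  # existing sample: skipped, never overwritten or deleted
            self.samples[sample_id] = payload
            added += 1
        return added


source = [(f"sample_{i}", i) for i in range(1000)]
store = AdditiveStore()
for requested in (200, 400, 300, 500):
    added = store.add(source, requested)
    print(f"requested={requested} added={added} total={len(store.samples)}")
# requested=200 added=200 total=200
# requested=400 added=200 total=400
# requested=300 added=0 total=400
# requested=500 added=100 total=500
```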

Computing Embeddings

# High-dimensional embeddings (CLIP/ResNet)
dataset.compute_embeddings(model="clip", show_progress=True)

# 2D projections for visualization
dataset.compute_visualization()  # UMAP to Euclidean + Hyperbolic

Embeddings are stored per-sample. If a sample already has embeddings, it's skipped.
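The skip-if-present rule can be illustrated with a small stand-alone function. The dict layout (an "image" key plus an optional "embedding" key) and the name `compute_missing_embeddings` are assumptions made for this sketch; they do not describe HyperView's schema or API.

```python
def compute_missing_embeddings(samples, embed):
    """Embed only the samples that have no cached embedding yet.

    samples: list of dicts with an "image" key and an optional "embedding" key
             (hypothetical layout).
    embed:   callable mapping an image to a vector.
    Returns the number of samples embedded in this call.
    """
    computed = 0
    for sample in samples:
        if sample.get("embedding") is not None:
            continue  # cached from an earlier run: skip
        sample["embedding"] = embed(sample["image"])
        computed += 1
    return computed
```

Calling the function a second time on the same samples does no work, which is what makes rerunning the embedding step cheap after only a few samples have been added.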

Listing & Deleting Datasets

# List all persistent datasets
hv.Dataset.list_datasets()  # ['cifar100_demo', 'my_dataset', ...]

# Delete a dataset
hv.Dataset.delete("my_dataset")

# Check existence
hv.Dataset.exists("my_dataset")  # True/False

Dataset Info

len(dataset)           # Number of samples
dataset.name           # Dataset name
dataset.labels         # Unique labels
dataset.samples        # Iterator over all samples
dataset[sample_id]     # Get sample by ID