# Datasets

## Creating a Dataset

```python
import hyperview as hv

# Persistent dataset (default) - survives restarts
dataset = hv.Dataset("my_dataset")

# In-memory dataset - lost when process exits
dataset = hv.Dataset("my_dataset", persist=False)
```

**Storage location:** `~/.hyperview/datasets/` (configurable via `HYPERVIEW_DATABASE_DIR`)

Internally, each dataset is stored as two Lance tables (directories) inside that folder:

- `hyperview_{dataset_name}.lance/` (samples)
- `hyperview_{dataset_name}_meta.lance/` (metadata like label colors)
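
To make the on-disk layout concrete, the sketch below derives the two table paths for a given dataset name. It is not HyperView's own code: the helper name `lance_table_paths` is hypothetical, and it assumes that `HYPERVIEW_DATABASE_DIR`, when set, simply replaces the default `~/.hyperview/datasets/` folder.

```python
import os
from pathlib import Path


def lance_table_paths(dataset_name, env=None):
    """Return (samples_path, meta_path) following the naming scheme above.

    Hypothetical helper for illustration; not part of the hyperview API.
    """
    env = os.environ if env is None else env
    # Assumption: the environment variable, if set, replaces the default folder.
    default_dir = Path.home() / ".hyperview" / "datasets"
    base = Path(env.get("HYPERVIEW_DATABASE_DIR") or default_dir)
    samples = base / f"hyperview_{dataset_name}.lance"
    meta = base / f"hyperview_{dataset_name}_meta.lance"
    return samples, meta


samples_path, meta_path = lance_table_paths(
    "my_dataset", env={"HYPERVIEW_DATABASE_DIR": "/tmp/hv"}
)
print(samples_path)
print(meta_path)
```

A helper like this can be handy for checking disk usage or confirming that a dataset was actually written, without opening the tables themselves.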

## Adding Samples

### From HuggingFace

```python
dataset.add_from_huggingface(
    "uoft-cs/cifar100",
    split="train",
    image_key="img",
    label_key="fine_label",
    max_samples=1000,
)
```

### From Directory

```python
dataset.add_images_dir("/path/to/images", label_from_folder=True)
```

## Persistence Model: Additive

HyperView uses an **additive** persistence model:

| Action | Behavior |
|--------|----------|
| Add samples | New samples inserted, existing skipped by ID |
| Request fewer than exist | Existing samples preserved (no deletion) |
| Request more than exist | Only new samples added |
| Embeddings | Cached per-sample, reused across sessions |
| Projections | Recomputed when new samples added (UMAP requires refit) |

**Example:**

```python
dataset = hv.Dataset("my_dataset")
dataset.add_from_huggingface(..., max_samples=200)  # 200 samples
dataset.add_from_huggingface(..., max_samples=400)  # +200 new → 400 total
dataset.add_from_huggingface(..., max_samples=300)  # no change → 400 total
dataset.add_from_huggingface(..., max_samples=500)  # +100 new → 500 total
```

Samples are **never implicitly deleted**. To remove data explicitly, delete the whole dataset with `hv.Dataset.delete("name")`.
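
The arithmetic of the additive model can be reproduced with a toy store. The sketch below is not HyperView's implementation: `ToyStore` and `add_first_n` are hypothetical names, and it assumes that sample IDs are derived deterministically from the position in the source, so repeated requests map to the same IDs.

```python
class ToyStore:
    """Toy additive store: inserts unseen IDs and never deletes."""

    def __init__(self):
        self.samples = {}  # sample_id -> payload

    def add_first_n(self, source, max_samples):
        """Insert the first `max_samples` items of `source`, skipping known IDs.

        Returns the number of newly inserted samples.
        """
        added = 0
        for index, payload in enumerate(source[:max_samples]):
            sample_id = f"sample_{index}"  # assumption: ID is stable per source index
            if sample_id in self.samples:
                continue  # existing sample: skipped, not overwritten
            self.samples[sample_id] = payload
            added += 1
        return added


source = list(range(1000))  # stand-in for a remote dataset split
store = ToyStore()
history = []
for requested in (200, 400, 300, 500):
    added = store.add_first_n(source, requested)
    history.append((requested, added, len(store.samples)))

print(history)
# [(200, 200, 200), (400, 200, 400), (300, 0, 400), (500, 100, 500)]
```

Each tuple is `(requested, newly added, total)`. The third call requests fewer samples than already exist, so nothing is added and nothing is removed.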

## Computing Embeddings

```python
# High-dimensional embeddings (CLIP/ResNet)
dataset.compute_embeddings(model="clip", show_progress=True)

# 2D projections for visualization
dataset.compute_visualization()  # UMAP to Euclidean + Hyperbolic
```

Embeddings are stored per sample. A sample that already has an embedding is skipped, so repeated calls only process samples that do not have one yet.
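
The skip-if-cached behavior can be sketched in a few lines. This is an illustration under stated assumptions rather than HyperView's code: `compute_missing_embeddings` and `fake_embed` are hypothetical names, samples are modeled as plain dicts, and the embedding function is a stand-in for a real model such as CLIP.

```python
def compute_missing_embeddings(samples, embed_fn):
    """Embed only samples that lack an embedding; return how many were computed.

    `samples` maps sample_id -> dict with a "data" key and an optional
    "embedding" key.
    """
    computed = 0
    for sample in samples.values():
        if sample.get("embedding") is not None:
            continue  # cached from an earlier run: leave untouched
        sample["embedding"] = embed_fn(sample["data"])
        computed += 1
    return computed


def fake_embed(data):
    # Stand-in for a real model; returns a tiny fixed-size vector.
    return [float(len(data)), float(sum(map(ord, data)) % 7)]


samples = {
    "a": {"data": "cat"},
    "b": {"data": "dog", "embedding": [0.1, 0.2]},
}
first = compute_missing_embeddings(samples, fake_embed)
second = compute_missing_embeddings(samples, fake_embed)
print(first, second)  # 1 0
```

The first pass embeds only sample `"a"`; the second pass finds nothing left to do, which is why re-running the embedding step on a growing dataset costs time only for the new samples.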

## Listing & Deleting Datasets

```python
# List all persistent datasets
hv.Dataset.list_datasets()  # ['cifar100_demo', 'my_dataset', ...]

# Delete a dataset
hv.Dataset.delete("my_dataset")

# Check existence
hv.Dataset.exists("my_dataset")  # True/False
```

## Dataset Info

```python
len(dataset)        # Number of samples
dataset.name        # Dataset name
dataset.labels      # Unique labels
dataset.samples     # Iterator over all samples
dataset[sample_id]  # Get sample by ID
```