---
license: mit
library_name: stellar
pipeline_tag: image-feature-extraction
datasets:
  - imagenet-1k
tags:
  - vision
  - self-supervised-learning
  - representation-learning
  - sparse-tokens
  - vision-transformer
  - image-feature-extraction
---

# STELLAR — Sparse Visual Representations via Spatial–Semantic Factorization

**STELLAR** learns a **unified sparse visual representation** that supports both
**reconstruction** and **semantics** using as few as **16 tokens**. By factorizing
*"what"* (semantics) from *"where"* (spatial layout), each image is encoded as the
low-rank product of a **localization** matrix and a **semantics** matrix.

<p align="center">
  <img src="factorization.svg" alt="Spatial–semantic factorization" width="720">
</p>
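
The low-rank structure described above can be illustrated with random tensors. This is a minimal sketch, not the model's actual forward pass: it assumes the dense map is the plain matrix product of a `(P, K)` localization matrix and a `(K, D)` semantics matrix, and that each patch's localization row is a softmax over tokens. The released model may use extra projections or a different normalization.

```python
import torch

torch.manual_seed(0)
P, K, D = 196, 16, 768  # patches, sparse tokens, embedding dim (ViT-B/16 at 224x224)

# "where": one row per patch; the softmax over tokens is an assumption of this sketch
localization = torch.softmax(torch.randn(P, K, dtype=torch.float64), dim=-1)
# "what": one embedding per sparse token
semantics = torch.randn(K, D, dtype=torch.float64)

# Dense per-patch map reassembled from the two factors
dense = localization @ semantics  # (P, D)

# The product of a (P, K) and a (K, D) matrix has rank at most K
rank = torch.linalg.matrix_rank(dense).item()
print(tuple(dense.shape), rank)
```

Storing the two factors takes `P*K + K*D` numbers instead of `P*D` for the full dense map, which is where the sparsity of the representation comes from.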

- 📄 **Paper:** [arXiv:2602.01905](https://arxiv.org/abs/2602.01905) (ICML 2026)
- 💻 **Code:** [github.com/microsoft/STELLAR](https://github.com/microsoft/STELLAR)

These checkpoints contain the **full set of trained STELLAR modules** (encoder, sparse
tokens, projections, reconstruction decoder, and clustering heads), so a single file
supports feature extraction, image reconstruction, and continued pretraining. All
models are pretrained with self-supervision on **ImageNet-1K** at 224×224.

## Highlights

- **Sparse & unified** — one small set of tokens serves both high-level semantics
  and pixel-level reconstruction.
- **Factorized latents** — each token captures a concept (*what*) together with a
  spatial map of *where* it appears.
- **Strong on both axes** — STELLAR-H reaches **2.60 FID** (reconstruction) and
  **79.1%** ImageNet linear-probing accuracy with just **16 tokens**.

## Available models

| Model | Backbone | Tokens | Params | Type | File |
| :--- | :--- | :---: | :---: | :--- | :--- |
| `stellar-b16` | ViT-B/16 | 16 | 88M | main | [`stellar-b16.safetensors`](stellar-b16.safetensors) |
| `stellar-l16` | ViT-L/16 | 16 | 307M | main | [`stellar-l16.safetensors`](stellar-l16.safetensors) |
| `stellar-h16` | ViT-H/14 | 16 | 636M | main | [`stellar-h16.safetensors`](stellar-h16.safetensors) |
| `stellar-b8`  | ViT-B/16 | 8  | 88M | ablation | [`stellar-b8.safetensors`](stellar-b8.safetensors) |
| `stellar-b24` | ViT-B/16 | 24 | 88M | ablation | [`stellar-b24.safetensors`](stellar-b24.safetensors) |

The main models (`b16`, `l16`, `h16`) are recommended for downstream use; the 8- and
24-token base models are ablations on the number of sparse tokens.

## Usage

Install the STELLAR code and the Hub helpers:

```bash
pip install huggingface_hub safetensors
git clone https://github.com/microsoft/STELLAR && cd STELLAR
pip install -r requirements.txt
```

### Quick start

From the STELLAR code directory, use the [`load_stellar.py`](load_stellar.py) helper
(it downloads the weights from the Hub for you):

```python
import torch
from load_stellar import load_stellar, list_models

print(list_models())                   # ['stellar-b16', 'stellar-l16', ...]
model = load_stellar("stellar-b16")     # purpose="encode" (default)

# RGB image in [0, 1], resized to 224×224 (ImageNet normalization is applied internally)
image = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    out = model.encode(image)

out["sparse"]    # (1, K, D)   sparse concept tokens   ("what")
out["spatial"]   # (1, P, K)   per-token spatial maps  ("where")
out["dense"]     # (1, P, D)   dense per-patch features
out["cls"]       # (1, 1, D)   global image token
```
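
To look at the "where" maps, the `(B, P, K)` tensor can be reshaped into one 2D grid per token. The helper below is a hypothetical sketch, not part of the STELLAR code: it assumes the `P` patches form a square grid in row-major order and that no extra (e.g. class) token is included in `P`. A random tensor stands in for `out["spatial"]`.

```python
import math

import torch
import torch.nn.functional as F


def spatial_to_grids(spatial: torch.Tensor) -> torch.Tensor:
    """Reshape (B, P, K) spatial maps into (B, K, H, W) grids with H = W = sqrt(P)."""
    batch, patches, tokens = spatial.shape
    side = math.isqrt(patches)
    if side * side != patches:
        raise ValueError(f"{patches} patches do not form a square grid")
    # Move the token axis forward, then unfold the patch axis into rows and columns
    return spatial.transpose(1, 2).reshape(batch, tokens, side, side)


spatial = torch.rand(1, 196, 16)    # stand-in for out["spatial"] from stellar-b16
grids = spatial_to_grids(spatial)   # (1, 16, 14, 14)

# Upsample to the input resolution so each map can be overlaid on the image
heatmaps = F.interpolate(grids, size=(224, 224), mode="bilinear", align_corners=False)
print(tuple(grids.shape), tuple(heatmaps.shape))
```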

### Reconstruction & continued pretraining

The same checkpoint can be loaded for other purposes via the `purpose` argument. Image
reconstruction and continued pretraining use the decoder, which predicts
[MaskGIT-VQGAN](https://huggingface.co/fun-research/TiTok) tokens — pass the tokenizer
path as `vq_model`:

```python
# 1. encode -> factorized features (sparse concept tokens + spatial maps)
model = load_stellar("stellar-b16", purpose="reconstruct", vq_model=VQGAN_PATH)
features = model.encode(image)          # dict: sparse (B,K,D), spatial (B,P,K), ...

# 2. decode the factorized features -> VQGAN decoder -> pixels
out = model.reconstruct(features)       # or model.reconstruct(features["sparse"], features["spatial"])
pixels = out["reconstruction"]          # (B, 3, H, W) RGB in [0, 1]
#                                         224x224 for /16 models, 256x256 for the /14 H model
#   out["tokens"] : (B, P) predicted VQGAN token ids
#   out["logits"] : (B, P, 1024) raw codebook logits

# continued pretraining (all modules, gradients enabled)
model = load_stellar("stellar-b16", purpose="pretrain", vq_model=VQGAN_PATH)
losses = model({"image": image, "labels": labels, ...})["predictions"]
```

`reconstruct` is the **decoder half** of STELLAR: it takes the factorized features,
reassembles the low-rank dense map, runs the ViT decoder to predict VQGAN tokens, and
passes those tokens through the VQGAN decoder to return RGB pixels. See
[`examples/reconstruction.ipynb`](examples/reconstruction.ipynb) for an end-to-end demo
that loads an image and displays the reconstruction.


### What the model returns

| Key | Shape | Description | Typical use |
| :--- | :--- | :--- | :--- |
| `sparse` | `(B, K, D)` | sparse concept tokens | classification, retrieval |
| `spatial` | `(B, P, K)` | spatial map of each token | segmentation, visualization |
| `dense` | `(B, P, D)` | dense per-patch features | segmentation |
| `lowrank` | `(B, P, D)` | reassembled dense map | reconstruction |
| `cls` | `(B, 1, D)` | global representation | classification |

`B` = batch, `K` = number of sparse tokens, `P` = number of patches (196 for /16 at
224², 256 for /14), `D` = embedding dim (768 / 1024 / 1280 for B / L / H).
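
The shape bookkeeping above can be sketched as a small lookup. `MODEL_SPECS` and `expected_shapes` are hypothetical names for illustration only (they are not part of the STELLAR code); the patch sizes, token counts, and embedding dims are copied from the tables in this card.

```python
# Patch size, sparse-token count, and embedding dim per checkpoint (from this card)
MODEL_SPECS = {
    "stellar-b16": {"patch": 16, "tokens": 16, "dim": 768},
    "stellar-l16": {"patch": 16, "tokens": 16, "dim": 1024},
    "stellar-h16": {"patch": 14, "tokens": 16, "dim": 1280},
    "stellar-b8": {"patch": 16, "tokens": 8, "dim": 768},
    "stellar-b24": {"patch": 16, "tokens": 24, "dim": 768},
}


def expected_shapes(name: str, batch: int = 1, image_size: int = 224) -> dict:
    """Return the expected output shapes for one checkpoint and batch size."""
    spec = MODEL_SPECS[name]
    side = image_size // spec["patch"]  # patches per image side
    p, k, d = side * side, spec["tokens"], spec["dim"]
    return {
        "sparse": (batch, k, d),
        "spatial": (batch, p, k),
        "dense": (batch, p, d),
        "lowrank": (batch, p, d),
        "cls": (batch, 1, d),
    }


print(expected_shapes("stellar-b16")["spatial"])  # (1, 196, 16)
print(expected_shapes("stellar-h16")["dense"])    # (1, 256, 1280)
```

Comparing these tuples against the tensors returned by `model.encode` is a quick sanity check after loading a checkpoint.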

### Loading the weights manually

```python
import json

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from src.models.stellar_model import STELLARModel

repo = "microsoft/STELLAR"

# config.json maps each model name to its hyperparameters and weight file
with open(hf_hub_download(repo, "config.json")) as f:
    cfg = json.load(f)["models"]["stellar-b16"]
state = load_file(hf_hub_download(repo, cfg["weights"]))

model = STELLARModel(
    num_sparse_tokens=cfg["num_sparse_tokens"],
    num_decoder_layers=cfg["num_decoder_layers"],
    spatial_temp=cfg["spatial_temp"],
    vit_pretrained=cfg["backbone"],
    do_recon=False, do_clustering=False, vq_model=None,
)
model.load_state_dict(state, strict=False)   # encoder-only build ignores decoder/heads
model.eval()

with torch.no_grad():                        # inference only, no gradients needed
    features = model.encode(torch.rand(1, 3, 224, 224))
```
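
Because `strict=False` also hides genuine mismatches, it can be worth inspecting what was skipped. PyTorch's `load_state_dict` returns an object with `missing_keys` and `unexpected_keys`; the sketch below wraps that check in a hypothetical helper, `load_checked`, and demonstrates it on a toy module. The `decoder.` prefix is an assumption for illustration, since the actual key names in the STELLAR checkpoints may differ.

```python
import torch
from torch import nn


def load_checked(model: nn.Module, state: dict, allowed_unexpected: tuple = ()) -> list:
    """Load with strict=False, but fail on missing weights or stray checkpoint keys."""
    result = model.load_state_dict(state, strict=False)
    if result.missing_keys:
        raise RuntimeError(f"weights missing from checkpoint: {result.missing_keys}")
    # Keys in the checkpoint that the model does not use and that we did not expect to skip
    stray = [k for k in result.unexpected_keys if not k.startswith(allowed_unexpected)]
    if stray:
        raise RuntimeError(f"unexpected checkpoint keys: {stray}")
    return list(result.unexpected_keys)


# Toy stand-in: an "encoder" loaded from a checkpoint that also holds decoder weights
encoder = nn.Linear(4, 2)
state = {
    "weight": torch.zeros(2, 4),
    "bias": torch.zeros(2),
    "decoder.weight": torch.zeros(3, 3),  # hypothetical key skipped by the encoder-only build
}
skipped = load_checked(encoder, state, allowed_unexpected=("decoder.",))
print(skipped)  # ['decoder.weight']
```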

> **Tip:** download with `huggingface_hub` (as above) rather than `git clone`;
> clones are not counted in the Hub's download stats.

## Model details

- **Architecture:** ViT encoder (MAE-initialized) + learned sparse latent queries with
  spatial–semantic factorization.
- **Pretraining data:** ImageNet-1K (self-supervised; labels not used).
- **Input:** RGB images in `[0, 1]`, resized to 224×224 (bicubic). ImageNet mean/std
  normalization is applied **inside** the model — pass raw `[0, 1]` images.
- **Weights:** the complete set of trained STELLAR modules (encoder, sparse tokens,
  projections, reconstruction decoder, and clustering heads), stored in `safetensors`.
  Only the third-party MaskGIT-VQGAN tokenizer is excluded — it is downloaded separately
  (from [TiTok](https://huggingface.co/fun-research/TiTok)) and passed via `vq_model`.
- **Framework:** PyTorch.
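
The input convention above can be sketched as a small preprocessing helper. `to_model_input` is a hypothetical name, and resizing the full image without cropping is an assumption of this sketch; the card only specifies bicubic resizing to 224×224 and raw `[0, 1]` values.

```python
import torch
import torch.nn.functional as F


def to_model_input(image_u8: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Convert an (H, W, 3) uint8 RGB image to a (1, 3, size, size) float tensor in [0, 1]."""
    # HWC uint8 -> NCHW float in [0, 1]
    x = image_u8.permute(2, 0, 1).unsqueeze(0).float() / 255.0
    x = F.interpolate(x, size=(size, size), mode="bicubic", align_corners=False)
    # Bicubic interpolation can overshoot slightly, so clamp back into range.
    # No mean/std normalization here: the model applies it internally.
    return x.clamp(0.0, 1.0)


raw = torch.randint(0, 256, (300, 400, 3), dtype=torch.uint8)  # stand-in for a decoded image
image = to_model_input(raw)
print(tuple(image.shape), image.dtype)
```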

## Intended uses & limitations

- **Intended use:** extracting compact sparse/dense visual features for downstream
  recognition, segmentation, retrieval, reconstruction, and analysis.
- **Limitations:** pretrained on ImageNet-1K at 224×224, so features reflect that
  distribution; very different domains (e.g. medical, satellite) may require
  fine-tuning. The models are research artifacts and have not been safety-tested for
  production decision-making.
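
As one illustration of the retrieval use case, the sparse tokens can be pooled into a single vector per image and compared by cosine similarity. This is a minimal sketch under stated assumptions: mean pooling over the `K` tokens is a choice made here, not a method described in this card, and `pooled_embedding` and `top_matches` are hypothetical helper names. Random tensors stand in for `out["sparse"]`.

```python
import torch
import torch.nn.functional as F


def pooled_embedding(sparse: torch.Tensor) -> torch.Tensor:
    """Mean-pool (B, K, D) sparse tokens to (B, D) and L2-normalize each row."""
    return F.normalize(sparse.mean(dim=1), dim=-1)


def top_matches(query: torch.Tensor, gallery: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Return, for each query image, the indices of the k most similar gallery images."""
    # Dot products of unit vectors are cosine similarities: (num_queries, num_gallery)
    scores = pooled_embedding(query) @ pooled_embedding(gallery).T
    return scores.topk(k, dim=-1).indices


torch.manual_seed(0)
gallery = torch.randn(10, 16, 768)   # stand-in for sparse tokens of 10 images
query = gallery[4:5].clone()         # query with a copy of gallery image 4
matches = top_matches(query, gallery, k=3)
print(matches[0, 0].item())          # index of the best match
```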

## Citation

```bibtex
@inproceedings{zhao2026stellar,
  title     = {Learning Sparse Visual Representations via Spatial-Semantic Factorization},
  author    = {Zhao, Theodore Zhengde and Kiblawi, Sid and Yang, Jianwei and Usuyama, Naoto and Tan, Reuben and Codella, Noel C and Naumann, Tristan and Poon, Hoifung and Wei, Mu},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.01905},
}
```

## License

Released under the [MIT License](LICENSE).