---
license: mit
library_name: stellar
pipeline_tag: image-feature-extraction
datasets:
- imagenet-1k
tags:
- vision
- self-supervised-learning
- representation-learning
- sparse-tokens
- vision-transformer
- image-feature-extraction
---

# STELLAR: Sparse Visual Representations via Spatial–Semantic Factorization

**STELLAR** learns a **unified sparse visual representation** that supports both **reconstruction** and **semantics** using as few as **16 tokens**. By factorizing *"what"* (semantics) from *"where"* (spatial layout), each image is encoded as the low-rank product of a **localization** matrix and a **semantics** matrix.

Spatial–semantic factorization
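
The factorization can be illustrated with a small NumPy sketch. The shapes (196 patches, 16 tokens, 768 dimensions) match the ViT-B/16 setting, but the random matrices, the softmax used to build the localization matrix, and all variable names are assumptions made for illustration; this is not the STELLAR implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, D = 196, 16, 768  # patches, sparse tokens, embedding dim (ViT-B/16 at 224x224)

# "What": one D-dimensional concept vector per sparse token.
semantics = rng.standard_normal((K, D))

# "Where": for each patch, a distribution over the K tokens.
# Assumption: a softmax over tokens; the real model may normalize differently.
logits = rng.standard_normal((P, K))
localization = np.exp(logits - logits.max(axis=1, keepdims=True))
localization /= localization.sum(axis=1, keepdims=True)

# Low-rank dense map: every patch feature is a mixture of the K concept vectors.
dense_lowrank = localization @ semantics  # (P, D)

rank = np.linalg.matrix_rank(dense_lowrank)
factored_size = localization.size + semantics.size  # P*K + K*D
full_size = dense_lowrank.size                      # P*D
print(dense_lowrank.shape, rank, factored_size, full_size)
```

Because the product has rank at most `K`, the two factors hold 15,424 numbers in this setting versus 150,528 for a full dense map, roughly a tenfold reduction.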

- 📄 **Paper:** [arXiv:2602.01905](https://arxiv.org/abs/2602.01905) (ICML 2026)
- 💻 **Code:** [github.com/microsoft/STELLAR](https://github.com/microsoft/STELLAR)

These checkpoints contain the **full set of trained STELLAR modules** (encoder, sparse tokens, projections, reconstruction decoder, and clustering heads), so a single file supports feature extraction, image reconstruction, and continued pretraining. All models are self-supervised on **ImageNet-1K** at 224×224.

## Highlights

- **Sparse & unified:** one small set of tokens serves both high-level semantics and pixel-level reconstruction.
- **Factorized latents:** each token captures a concept (*what*) together with a spatial map of *where* it appears.
- **Strong on both axes:** STELLAR-H reaches **2.60 FID** (reconstruction) and **79.1%** ImageNet linear-probing accuracy with just **16 tokens**.

## Available models

| Model | Backbone | Tokens | Params | Type | File |
| :--- | :--- | :---: | :---: | :--- | :--- |
| `stellar-b16` | ViT-B/16 | 16 | 88M | main | [`stellar-b16.safetensors`](stellar-b16.safetensors) |
| `stellar-l16` | ViT-L/16 | 16 | 307M | main | [`stellar-l16.safetensors`](stellar-l16.safetensors) |
| `stellar-h16` | ViT-H/14 | 16 | 636M | main | [`stellar-h16.safetensors`](stellar-h16.safetensors) |
| `stellar-b8` | ViT-B/16 | 8 | 88M | ablation | [`stellar-b8.safetensors`](stellar-b8.safetensors) |
| `stellar-b24` | ViT-B/16 | 24 | 88M | ablation | [`stellar-b24.safetensors`](stellar-b24.safetensors) |

The main models (`b16`, `l16`, `h16`) are recommended for downstream use; the 8- and 24-token base models are ablations on the number of sparse tokens. The number in each model name is the token count, not the patch size.

## Usage

Install the STELLAR code and the Hub helpers:

```bash
pip install huggingface_hub safetensors
git clone https://github.com/microsoft/STELLAR && cd STELLAR
pip install -r requirements.txt
```

### Quick start

From the STELLAR code directory, use the [`load_stellar.py`](load_stellar.py) helper (it downloads the weights from the Hub for you):

```python
import torch
from load_stellar import load_stellar, list_models

print(list_models())                 # ['stellar-b16', 'stellar-l16', ...]

model = load_stellar("stellar-b16")  # purpose="encode" (default)

# RGB image in [0, 1], resized to 224×224 (ImageNet normalization is applied internally)
image = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    out = model.encode(image)

out["sparse"]   # (1, K, D) sparse concept tokens ("what")
out["spatial"]  # (1, P, K) per-token spatial maps ("where")
out["dense"]    # (1, P, D) dense per-patch features
out["cls"]      # (1, 1, D) global image token
```

### Reconstruction & continued pretraining

The same checkpoint can be loaded for other purposes via the `purpose` argument. Image reconstruction and continued pretraining use the decoder, which predicts [MaskGIT-VQGAN](https://huggingface.co/fun-research/TiTok) tokens. Pass the tokenizer path as `vq_model`:

```python
# VQGAN_PATH: path to the separately downloaded MaskGIT-VQGAN tokenizer
# image:      (B, 3, 224, 224) RGB tensor in [0, 1], as in the quick start

# 1. encode -> factorized features (sparse concept tokens + spatial maps)
model = load_stellar("stellar-b16", purpose="reconstruct", vq_model=VQGAN_PATH)
features = model.encode(image)        # dict: sparse (B,K,D), spatial (B,P,K), ...

# 2. decode the factorized features -> VQGAN decoder -> pixels
out = model.reconstruct(features)     # or model.reconstruct(features["sparse"], features["spatial"])
pixels = out["reconstruction"]        # (B, 3, H, W) RGB in [0, 1]
                                      # 224x224 for /16 models, 256x256 for the /14 H model
# out["tokens"] : (B, P) predicted VQGAN token ids
# out["logits"] : (B, P, 1024) raw codebook logits

# continued pretraining (all modules, gradients enabled)
model = load_stellar("stellar-b16", purpose="pretrain", vq_model=VQGAN_PATH)
# Schematic call: "..." stands for the remaining batch fields, which are not listed here.
losses = model({"image": image, "labels": labels, ...})["predictions"]
```

`reconstruct` is the **decoder half** of STELLAR: it takes the factorized features and runs low-rank dense map → ViT decoder → VQGAN decoder to return RGB pixels. See [`examples/reconstruction.ipynb`](examples/reconstruction.ipynb) for an end-to-end demo that loads an image and displays the reconstruction.

### What the model returns

| Key | Shape | Description | Typical use |
| :--- | :--- | :--- | :--- |
| `sparse` | `(B, K, D)` | sparse concept tokens | classification, retrieval |
| `spatial` | `(B, P, K)` | spatial map of each token | segmentation, visualization |
| `dense` | `(B, P, D)` | dense per-patch features | segmentation |
| `lowrank` | `(B, P, D)` | reassembled dense map | reconstruction |
| `cls` | `(B, 1, D)` | global representation | classification |

`B` = batch size, `K` = number of sparse tokens, `P` = number of patches (196 for /16 at 224², 256 for /14), `D` = embedding dimension (768 / 1024 / 1280 for B / L / H).

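
As a sketch of how these outputs might be consumed downstream, the snippet below uses random NumPy arrays with the documented shapes in place of real model outputs. Mean-pooling the sparse tokens, reshaping the spatial maps into a square grid, and the row-major patch order are illustrative assumptions, not steps prescribed by STELLAR. With real outputs you would first convert the tensors, for example `out["sparse"].cpu().numpy()`.

```python
import numpy as np

rng = np.random.default_rng(0)
B, K, P, D = 2, 16, 196, 768  # documented shapes for a /16 base model at 224x224

# Stand-ins for model outputs (random values, not real features).
sparse = rng.standard_normal((B, K, D))  # "what"
spatial = rng.random((B, P, K))          # "where"

# One global vector per image by averaging the K concept tokens
# (an assumed pooling choice, e.g. as input to a linear classifier).
pooled = sparse.mean(axis=1)             # (B, D)

# Per-token heatmaps: move K in front of P, then unfold P into a square grid.
# Assumption: patches are stored in row-major order.
side = int(round(P ** 0.5))              # 14 for 196 patches
assert side * side == P, "assumes a square patch grid"
heatmaps = spatial.transpose(0, 2, 1).reshape(B, K, side, side)  # (B, K, 14, 14)

# Hard assignment of each patch to its strongest token (a crude segmentation).
segments = spatial.argmax(axis=-1).reshape(B, side, side)        # (B, 14, 14)

print(pooled.shape, heatmaps.shape, segments.shape)
```

Each `heatmaps[b, k]` can be upsampled to the input resolution and overlaid on the image to inspect where token `k` is active.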
### Loading the weights manually

```python
import json

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

from src.models.stellar_model import STELLARModel

repo = "microsoft/STELLAR"
with open(hf_hub_download(repo, "config.json")) as f:
    cfg = json.load(f)["models"]["stellar-b16"]
state = load_file(hf_hub_download(repo, cfg["weights"]))

model = STELLARModel(
    num_sparse_tokens=cfg["num_sparse_tokens"],
    num_decoder_layers=cfg["num_decoder_layers"],
    spatial_temp=cfg["spatial_temp"],
    vit_pretrained=cfg["backbone"],
    do_recon=False,
    do_clustering=False,
    vq_model=None,
)
model.load_state_dict(state, strict=False)  # encoder-only build ignores decoder/heads
model.eval()

features = model.encode(torch.rand(1, 3, 224, 224))
```

> **Tip:** download with `huggingface_hub` (as above) rather than `git clone` so that
> downloads are registered on the Hub; `git clone` is not counted in download stats.

## Model details

- **Architecture:** ViT encoder (MAE-initialized) + learned sparse latent queries with spatial–semantic factorization.
- **Pretraining data:** ImageNet-1K (self-supervised; labels not used).
- **Input:** RGB images in `[0, 1]`, resized to 224×224 (bicubic). ImageNet mean/std normalization is applied **inside** the model, so pass raw `[0, 1]` images.
- **Weights:** the complete set of trained STELLAR modules (encoder, sparse tokens, projections, reconstruction decoder, and clustering heads), stored in `safetensors`. Only the third-party MaskGIT-VQGAN tokenizer is excluded; it is downloaded separately (from [TiTok](https://huggingface.co/fun-research/TiTok)) and passed via `vq_model`.
- **Framework:** PyTorch.

## Intended uses & limitations

- **Intended use:** extracting compact sparse/dense visual features for downstream recognition, segmentation, retrieval, reconstruction, and analysis.
- **Limitations:** pretrained on ImageNet-1K at 224×224, so features reflect that distribution; performance on very different domains (e.g. medical, satellite) may require fine-tuning. The models are research artifacts and are not safety-tested for production decision-making.

## Citation

```bibtex
@inproceedings{zhao2026stellar,
  title     = {Learning Sparse Visual Representations via Spatial-Semantic Factorization},
  author    = {Zhao, Theodore Zhengde and Kiblawi, Sid and Yang, Jianwei and Usuyama, Naoto and Tan, Reuben and Codella, Noel C and Naumann, Tristan and Poon, Hoifung and Wei, Mu},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.01905},
}
```

## License

Released under the [MIT License](LICENSE).