STELLAR / README.md

initial commit

6794be9 4 days ago

8.31 kB

	---
	license: mit
	library_name: stellar
	pipeline_tag: image-feature-extraction
	datasets:
	- imagenet-1k
	tags:
	- vision
	- self-supervised-learning
	- representation-learning
	- sparse-tokens
	- vision-transformer
	- image-feature-extraction
	---

	# STELLAR — Sparse Visual Representations via Spatial–Semantic Factorization

	STELLAR learns a unified sparse visual representation that supports both
	reconstruction and semantics using as few as 16 tokens. By factorizing
	"what" (semantics) from "where" (spatial layout), each image is encoded as the
	low-rank product of a localization matrix and a semantics matrix.

	<p align="center">
	<img src="factorization.svg" alt="Spatial–semantic factorization" width="720">
	</p>

	- 📄 Paper: [arXiv:2602.01905](https://arxiv.org/abs/2602.01905) (ICML 2026)
	- 💻 Code: [github.com/microsoft/STELLAR](https://github.com/microsoft/STELLAR)

	These checkpoints contain the full set of trained STELLAR modules (encoder, sparse
	tokens, projections, reconstruction decoder, and clustering heads), so a single file
	supports feature extraction, image reconstruction, and continued pretraining. All
	models are self-supervised on ImageNet-1K at 224×224.

	## Highlights

	- Sparse & unified — one small set of tokens serves both high-level semantics
	and pixel-level reconstruction.
	- Factorized latents — each token captures a concept (what) together with a
	spatial map of where it appears.
	- Strong on both axes — STELLAR-H reaches 2.60 FID (reconstruction) and
	79.1% ImageNet linear-probing accuracy with just 16 tokens.

	## Available models

	\| Model \| Backbone \| Tokens \| Params \| Type \| File \|
	\| :--- \| :--- \| :---: \| :---: \| :--- \| :--- \|
	\| `stellar-b16` \| ViT-B/16 \| 16 \| 88M \| main \| [`stellar-b16.safetensors`](stellar-b16.safetensors) \|
	\| `stellar-l16` \| ViT-L/16 \| 16 \| 307M \| main \| [`stellar-l16.safetensors`](stellar-l16.safetensors) \|
	\| `stellar-h16` \| ViT-H/14 \| 16 \| 636M \| main \| [`stellar-h16.safetensors`](stellar-h16.safetensors) \|
	\| `stellar-b8` \| ViT-B/16 \| 8 \| 88M \| ablation \| [`stellar-b8.safetensors`](stellar-b8.safetensors) \|
	\| `stellar-b24` \| ViT-B/16 \| 24 \| 88M \| ablation \| [`stellar-b24.safetensors`](stellar-b24.safetensors) \|

	The main models (`b16`, `l16`, `h16`) are recommended for downstream use; the 8- and
	24-token base models are ablations on the number of sparse tokens.

	## Usage

	Install the STELLAR code and the Hub helpers:

	```bash
	pip install huggingface_hub safetensors
	git clone https://github.com/microsoft/STELLAR && cd STELLAR
	pip install -r requirements.txt
	```

	### Quick start

	From the STELLAR code directory, use the [`load_stellar.py`](load_stellar.py) helper
	(it downloads the weights from the Hub for you):

	```python
	import torch
	from load_stellar import load_stellar, list_models

	print(list_models()) # ['stellar-b16', 'stellar-l16', ...]
	model = load_stellar("stellar-b16") # purpose="encode" (default)

	# RGB image in [0, 1], resized to 224×224 (ImageNet normalization is applied internally)
	image = torch.rand(1, 3, 224, 224)
	with torch.no_grad():
	out = model.encode(image)

	out["sparse"] # (1, K, D) sparse concept tokens ("what")
	out["spatial"] # (1, P, K) per-token spatial maps ("where")
	out["dense"] # (1, P, D) dense per-patch features
	out["cls"] # (1, 1, D) global image token
	```

	### Reconstruction & continued pretraining

	The same checkpoint can be loaded for other purposes via the `purpose` argument. Image
	reconstruction and continued pretraining use the decoder, which predicts
	[MaskGIT-VQGAN](https://huggingface.co/fun-research/TiTok) tokens — pass the tokenizer
	path as `vq_model`:

	```python
	# 1. encode -> factorized features (sparse concept tokens + spatial maps)
	model = load_stellar("stellar-b16", purpose="reconstruct", vq_model=VQGAN_PATH)
	features = model.encode(image) # dict: sparse (B,K,D), spatial (B,P,K), ...

	# 2. decode the factorized features -> VQGAN decoder -> pixels
	out = model.reconstruct(features) # or model.reconstruct(features["sparse"], features["spatial"])
	pixels = out["reconstruction"] # (B, 3, H, W) RGB in [0, 1]
	# 224x224 for /16 models, 256x256 for the /14 H model
	# out["tokens"] : (B, P) predicted VQGAN token ids
	# out["logits"] : (B, P, 1024) raw codebook logits

	# continued pretraining (all modules, gradients enabled)
	model = load_stellar("stellar-b16", purpose="pretrain", vq_model=VQGAN_PATH)
	losses = model({"image": image, "labels": labels, ...})["predictions"]
	```

	`reconstruct` is the decoder half of STELLAR: it takes the factorized features and
	runs low-rank dense map → ViT decoder → VQGAN decoder to return RGB pixels. See
	[`examples/reconstruction.ipynb`](examples/reconstruction.ipynb) for an end-to-end demo
	that loads an image and displays the reconstruction.


	### What the model returns

	\| Key \| Shape \| Description \| Typical use \|
	\| :--- \| :--- \| :--- \| :--- \|
	\| `sparse` \| `(B, K, D)` \| sparse concept tokens \| classification, retrieval \|
	\| `spatial` \| `(B, P, K)` \| spatial map of each token \| segmentation, visualization \|
	\| `dense` \| `(B, P, D)` \| dense per-patch features \| segmentation \|
	\| `lowrank` \| `(B, P, D)` \| reassembled dense map \| reconstruction \|
	\| `cls` \| `(B, 1, D)` \| global representation \| classification \|

	`B` = batch, `K` = number of sparse tokens, `P` = number of patches (196 for /16 at
	224², 256 for /14), `D` = embedding dim (768 / 1024 / 1280 for B / L / H).

	### Loading the weights manually

	```python
	import json, torch
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	from src.models.stellar_model import STELLARModel

	repo = "microsoft/STELLAR"
	cfg = json.load(open(hf_hub_download(repo, "config.json")))["models"]["stellar-b16"]
	state = load_file(hf_hub_download(repo, cfg["weights"]))

	model = STELLARModel(
	num_sparse_tokens=cfg["num_sparse_tokens"],
	num_decoder_layers=cfg["num_decoder_layers"],
	spatial_temp=cfg["spatial_temp"],
	vit_pretrained=cfg["backbone"],
	do_recon=False, do_clustering=False, vq_model=None,
	)
	model.load_state_dict(state, strict=False) # encoder-only build ignores decoder/heads
	model.eval()
	features = model.encode(torch.rand(1, 3, 224, 224))
	```

	> Tip: download with `huggingface_hub` (as above) rather than `git clone` so that
	> downloads are registered on the Hub — `git clone` is not counted in download stats.

	## Model details

	- Architecture: ViT encoder (MAE-initialized) + learned sparse latent queries with
	spatial–semantic factorization.
	- Pretraining data: ImageNet-1K (self-supervised; labels not used).
	- Input: RGB images in `[0, 1]`, resized to 224×224 (bicubic). ImageNet mean/std
	normalization is applied inside the model — pass raw `[0, 1]` images.
	- Weights: the complete set of trained STELLAR modules (encoder, sparse tokens,
	projections, reconstruction decoder, and clustering heads), stored in `safetensors`.
	Only the third-party MaskGIT-VQGAN tokenizer is excluded — it is downloaded separately
	(from [TiTok](https://huggingface.co/fun-research/TiTok)) and passed via `vq_model`.
	- Framework: PyTorch.

	## Intended uses & limitations

	- Intended use: extracting compact sparse/dense visual features for downstream
	recognition, segmentation, retrieval, reconstruction, and analysis.
	- Limitations: pretrained on ImageNet-1K at 224×224, so features reflect that
	distribution; performance on very different domains (e.g. medical, satellite) may
	require fine-tuning. The models are research artifacts and are not safety-tested for
	production decision-making.

	## Citation

	```bibtex
	@inproceedings{zhao2026stellar,
	title = {Learning Sparse Visual Representations via Spatial-Semantic Factorization},
	author = {Zhao, Theodore Zhengde and Kiblawi, Sid and Yang, Jianwei and Usuyama, Naoto and Tan, Reuben and Codella, Noel C and Naumann, Tristan and Poon, Hoifung and Wei, Mu},
	booktitle = {International Conference on Machine Learning (ICML)},
	year = {2026},
	url = {https://arxiv.org/abs/2602.01905},
	}
	```

	## License

	Released under the [MIT License](LICENSE).