license: apache-2.0
tags:
  - terravision
  - terramind
  - tokenizer
  - vqvae
  - fsq
  - divae
  - geospatial
  - remote-sensing
  - cir
library_name: terratorch

TerraVision Tokenizer — CIR

A DiVAE (Diffusion VQ-VAE) tokenizer for Color InfraRed (CIR) aerial imagery (3-channel, uint8), trained on Dutch national geospatial data. Part of the TerraVision-NL project.

Architecture

| Component | Value |
|---|---|
| Encoder | ViT-B (vit_b_enc) |
| Decoder | Patched UNet (unet_patched) |
| Quantizer | FSQ (codebook: 8-8-8-6-5, vocab: 15,360) |
| Image size | 448×448 px |
| Patch size | 16×16 px |
| Token grid | 28×28 = 784 tokens per image |
| Input channels | 3 (Near-Infrared, Red, Green) |
| Latent dim | 5 |
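
The vocabulary size and token count in the table follow directly from the FSQ levels and the patch grid. The sketch below reproduces that arithmetic and shows one common way to flatten a 5-dimensional FSQ code into a single token id (mixed-radix encoding, first dimension varying fastest). The helper names and the dimension ordering are assumptions for illustration; the actual tokenizer may order the dimensions differently.

```python
from math import prod

LEVELS = [8, 8, 8, 6, 5]   # FSQ codebook "8-8-8-6-5" from the table
IMAGE_SIZE = 448
PATCH_SIZE = 16

vocab_size = prod(LEVELS)             # 8 * 8 * 8 * 6 * 5
grid_side = IMAGE_SIZE // PATCH_SIZE  # tokens per row / column
tokens_per_image = grid_side ** 2

# Mixed-radix basis: basis[i] is the product of all levels before dimension i.
# ASSUMPTION: first dimension varies fastest; the real ordering may differ.
BASIS = [prod(LEVELS[:i]) for i in range(len(LEVELS))]


def codes_to_index(codes):
    """Flatten per-dimension level indices (one per latent dim) into one token id."""
    assert len(codes) == len(LEVELS)
    assert all(0 <= c < l for c, l in zip(codes, LEVELS))
    return sum(c * b for c, b in zip(codes, BASIS))


def index_to_codes(index):
    """Inverse of codes_to_index."""
    return [(index // b) % l for b, l in zip(BASIS, LEVELS)]


print(vocab_size, grid_side, tokens_per_image)
print(codes_to_index([7, 7, 7, 5, 4]))  # largest possible token id
```

The latent dimension of 5 matches the number of FSQ levels: each latent channel is quantized independently to one of its levels, and the combination identifies the token.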

Geospatial Properties

All TerraVision tokenizers map the same ground window onto 448×448 pixels, regardless of the underlying raster resolution. This keeps token grids spatially aligned across modalities for cross-modal pretraining.

  • Pixel size: 0.08 m
  • Source: Dutch national CIR photography (Beeldmateriaal HRL 2025)
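
As a rough illustration of what this means on the ground, the sketch below derives the window size and the footprint of a single token from the pixel size. It assumes the 0.08 m pixel size applies to the 448×448 model input and that the raster is north-up with the origin at the top-left corner. The function names are hypothetical.

```python
PIXEL_SIZE_M = 0.08   # from the model card
IMAGE_SIZE_PX = 448
PATCH_SIZE_PX = 16


def window_size_m():
    """Side length of the ground window covered by one input image."""
    return IMAGE_SIZE_PX * PIXEL_SIZE_M


def token_size_m():
    """Side length of the ground area covered by one token (one 16x16 patch)."""
    return PATCH_SIZE_PX * PIXEL_SIZE_M


def token_bounds_m(row, col):
    """Offsets in metres of token (row, col) from the window's top-left corner.

    Returns (x_min, y_min, x_max, y_max), with x to the right and y downwards.
    ASSUMPTION: north-up raster, row 0 at the top.
    """
    side = token_size_m()
    return (col * side, row * side, (col + 1) * side, (row + 1) * side)


print(window_size_m())        # about 35.84 m per side
print(token_size_m())         # about 1.28 m per side
print(token_bounds_m(27, 27)) # bottom-right token of the 28x28 grid
```

Because every modality shares this window, token (row, col) refers to the same patch of ground in each tokenizer's grid.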

Normalization

Input data should be normalized before encoding:

  • Scheme: standard (clip [0, 255], mean=127.5, std=127.5 → [-1, 1])

See config.json for exact normalization parameters.
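
A minimal NumPy sketch of the scheme above, assuming the mean and standard deviation listed here match config.json (check the file before relying on them). The function names and the channel-last tile layout are illustrative assumptions.

```python
import numpy as np

# Values from the model card; config.json is the authoritative source.
MEAN = 127.5
STD = 127.5


def normalize_cir(image_u8):
    """Clip to [0, 255] and scale to [-1, 1]."""
    x = np.asarray(image_u8, dtype=np.float32)
    x = np.clip(x, 0.0, 255.0)
    return (x - MEAN) / STD


def denormalize_cir(x):
    """Map values in [-1, 1] back to uint8, e.g. for viewing a reconstruction."""
    x = np.asarray(x, dtype=np.float32) * STD + MEAN
    return np.clip(np.rint(x), 0, 255).astype(np.uint8)


# Hypothetical 448x448 CIR tile in (H, W, C) layout with channels NIR, R, G.
rng = np.random.default_rng(0)
tile = rng.integers(0, 256, size=(448, 448, 3), dtype=np.uint8)

x = normalize_cir(tile)
batch = np.transpose(x, (2, 0, 1))[None]  # (1, 3, 448, 448), the shape used under Usage
print(batch.shape, float(batch.min()), float(batch.max()))
```

The resulting array can be converted with `torch.from_numpy(batch)` before being passed to the tokenizer.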

Usage

import torch
from huggingface_hub import hf_hub_download
from terratorch.models.backbones.terramind.tokenizer.vqvae import DiVAE

# Download weights (replace YOUR_REPO_ID with this repository's ID)
weights_path = hf_hub_download(repo_id="YOUR_REPO_ID", filename="tokenizer.pt")

# Instantiate model
tokenizer = DiVAE(
    image_size=448,
    patch_size=16,
    n_channels=3,
    enc_type="vit_b_enc",
    dec_type="unet_patched",
    quant_type="fsq",
    codebook_size="8-8-8-6-5",
    latent_dim=5,
    post_mlp=True,
    norm_codes=True,
)

# Load weights
state_dict = torch.load(weights_path, map_location="cpu")
tokenizer.load_state_dict(state_dict)
tokenizer.eval()

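# Encode: image → tokens (dummy input; real images must be normalized to [-1, 1])
x = torch.randn(1, 3, 448, 448)
with torch.no_grad():
    quant, code_loss, tokens = tokenizer.encode(x)
print(tokens.shape)  # (1, 28, 28)

# Reconstruct: image → tokens → image (diffusion sampling with 50 steps)
with torch.no_grad():
    recon = tokenizer(x, timesteps=50)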

Training

Trained with the TerraVision-NL codebase using DiVAE (diffusion-based VQ-VAE) following the TerraMind paper methodology (Section 8.1).

  • Checkpoint: cir-best-epoch-0058.ckpt
  • Diffusion: 1000 training timesteps, linear noise schedule, sample prediction (the decoder predicts the clean image rather than the noise)
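
To make the diffusion settings concrete, here is a small pure-Python sketch of a 1000-step linear schedule with a sample-prediction objective. The beta endpoints (1e-4 and 2e-2) are the common DDPM defaults and are an assumption; the model card does not state them. The function names are illustrative and do not come from the TerraVision-NL codebase.

```python
import math
import random
from itertools import accumulate
from operator import mul

NUM_TIMESTEPS = 1000
BETA_START, BETA_END = 1e-4, 2e-2  # ASSUMPTION: DDPM defaults, not from the card

# Linear schedule: beta_t rises linearly from BETA_START to BETA_END.
betas = [
    BETA_START + (BETA_END - BETA_START) * t / (NUM_TIMESTEPS - 1)
    for t in range(NUM_TIMESTEPS)
]
# Cumulative product of (1 - beta_t): fraction of signal variance left at step t.
alpha_bars = list(accumulate((1.0 - b for b in betas), mul))


def add_noise(x0, t, rng=random):
    """Forward process on a flat list of values: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def sample_prediction_loss(x0, x0_pred):
    """Mean squared error against the clean sample (not against the noise)."""
    return sum((p - v) ** 2 for p, v in zip(x0_pred, x0)) / len(x0)


x0 = [-1.0, 0.0, 0.5, 1.0]            # toy "pixels" in the normalized range
x_t = add_noise(x0, t=500)            # what a denoiser would see at step 500
print(alpha_bars[0], alpha_bars[500], alpha_bars[-1])
print(sample_prediction_loss(x0, x_t))  # loss if the model just echoed its input
```

With sample prediction, the training target is the clean image itself, so the loss compares the decoder output with `x0` directly.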