license: apache-2.0
tags:
  - terravision
  - terramind
  - tokenizer
  - vqvae
  - fsq
  - divae
  - geospatial
  - remote-sensing
  - ahn
library_name: terratorch

TerraVision Tokenizer — AHN

A DiVAE (Diffusion VQ-VAE) tokenizer for fused AHN6 DSM + DTM elevation rasters (2-channel, float32), trained on Dutch national geospatial data. Part of the TerraVision-NL project.

Architecture

Component	Value
Encoder	ViT-B (vit_b_enc)
Decoder	Patched UNet (unet_patched)
Quantizer	FSQ (codebook: `8-8-8-6-5`, vocab: 15,360)
Image size	448×448 px
Patch size	16×16 px
Token grid	28×28 = 784 tokens per image
Input channels	2 (Digital Surface Model, Digital Terrain Model)
Latent dim	5
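
The vocabulary size and token count in the table follow directly from the FSQ levels and the image/patch sizes. The sketch below reproduces that arithmetic. The `fsq_index` helper is hypothetical and only illustrates how five per-dimension levels can be flattened into one token id with a mixed-radix scheme; the digit ordering is an assumption, and the ordering used by the actual quantizer may differ.

```python
from math import prod

# Values taken from the architecture table.
FSQ_LEVELS = (8, 8, 8, 6, 5)   # one entry per latent dimension (latent dim = 5)
IMAGE_SIZE = 448
PATCH_SIZE = 16

vocab_size = prod(FSQ_LEVELS)          # 8 * 8 * 8 * 6 * 5
grid_side = IMAGE_SIZE // PATCH_SIZE   # tokens along one image side
tokens_per_image = grid_side ** 2


def fsq_index(digits, levels=FSQ_LEVELS):
    """Flatten per-dimension FSQ levels into a single token id (mixed radix).

    ASSUMPTION: the first dimension is the least significant digit.
    This is an illustrative helper, not the library's implementation.
    """
    if len(digits) != len(levels):
        raise ValueError("expected one digit per FSQ dimension")
    index, base = 0, 1
    for digit, n_levels in zip(digits, levels):
        if not 0 <= digit < n_levels:
            raise ValueError(f"digit {digit} out of range for {n_levels} levels")
        index += digit * base
        base *= n_levels
    return index


print(vocab_size, grid_side, tokens_per_image)
```

Every valid digit tuple maps to a distinct id in `range(vocab_size)`, which is why the vocabulary size equals the product of the levels.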

Geospatial Properties

All TerraVision tokenizers cover the same spatial window with 448×448 pixels, regardless of the underlying raster resolution. This keeps token grids spatially aligned across modalities for cross-modal pretraining.

Pixel size: 0.08 m
Source: Actueel Hoogtebestand Nederland 6 (AHN6) at 7.5 cm resolution
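
As a rough check on what these numbers mean on the ground, the sketch below computes the tile and per-token footprint from the pixel size and the image/patch sizes stated in this card. The last line assumes the 7.5 cm source raster is resampled onto the 0.08 m grid; that resampling step is an inference, not something this card states.

```python
# Values stated in this card.
IMAGE_SIZE_PX = 448
PATCH_SIZE_PX = 16
PIXEL_SIZE_M = 0.08      # pixel size of the tokenizer input
SOURCE_RES_M = 0.075     # native AHN6 resolution (7.5 cm)

# Ground extent covered by one image tile and by one token (one patch).
tile_extent_m = IMAGE_SIZE_PX * PIXEL_SIZE_M     # about 35.84 m per side
token_extent_m = PATCH_SIZE_PX * PIXEL_SIZE_M    # about 1.28 m per side

# ASSUMPTION: the source is resampled from 7.5 cm to the 0.08 m grid,
# so one tile spans this many native source pixels per side.
source_pixels_per_tile = tile_extent_m / SOURCE_RES_M

print(f"tile: {tile_extent_m:.2f} m, token: {token_extent_m:.2f} m")
```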

Normalization

Input data should be normalized before encoding:

Scheme: minmax (clip [-20, 80] → [0, 1])

See config.json for exact normalization parameters.
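
A minimal sketch of the minmax scheme, assuming a plain linear mapping of the clip range onto [0, 1] and the same range for both channels. The function names are illustrative, nodata handling is not covered, and the exact parameters should still be read from config.json.

```python
import numpy as np

# Clip range from the scheme above. ASSUMPTION: applied identically to DSM and DTM.
CLIP_MIN, CLIP_MAX = -20.0, 80.0


def minmax_normalize(elevation):
    """Clip elevation values to [CLIP_MIN, CLIP_MAX] and rescale linearly to [0, 1]."""
    elevation = np.asarray(elevation, dtype=np.float32)
    clipped = np.clip(elevation, CLIP_MIN, CLIP_MAX)
    return (clipped - CLIP_MIN) / (CLIP_MAX - CLIP_MIN)


def minmax_denormalize(normalized):
    """Map values in [0, 1] back to the clipped elevation range."""
    normalized = np.asarray(normalized, dtype=np.float32)
    return normalized * (CLIP_MAX - CLIP_MIN) + CLIP_MIN


# Values outside the clip range saturate at 0 or 1.
sample = minmax_normalize([-50.0, -20.0, 30.0, 80.0, 120.0])
```

Because of the clipping, the mapping is only invertible for values that were inside the clip range to begin with.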

Usage

import torch
from huggingface_hub import hf_hub_download
from terratorch.models.backbones.terramind.tokenizer.vqvae import DiVAE

# Download weights
weights_path = hf_hub_download(repo_id="YOUR_REPO_ID", filename="tokenizer.pt")

# Instantiate model
tokenizer = DiVAE(
    image_size=448,
    patch_size=16,
    n_channels=2,
    enc_type="vit_b_enc",
    dec_type="unet_patched",
    quant_type="fsq",
    codebook_size="8-8-8-6-5",
    latent_dim=5,
    post_mlp=True,
    norm_codes=True,
)

# Load weights
state_dict = torch.load(weights_path, map_location="cpu")
tokenizer.load_state_dict(state_dict)
tokenizer.eval()

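# Encode: image → tokens
# Dummy input for a shape check; real inputs should be normalized to [0, 1] first
x = torch.rand(1, 2, 448, 448)
with torch.no_grad():
    quant, code_loss, tokens = tokenizer.encode(x)
    print(tokens.shape)  # (1, 28, 28)

    # Reconstruct: full forward pass on the image (encode, then decode via diffusion sampling)
    recon = tokenizer(x, timesteps=50)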

Training

Trained with the TerraVision-NL codebase using DiVAE (diffusion-based VQ-VAE) following the TerraMind paper methodology (Section 8.1).

Checkpoint: ahn-best-epoch-0002.ckpt
Diffusion: 1000 timesteps, linear schedule, sample prediction (the network predicts the clean sample rather than the noise)
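
To make "1000 timesteps, linear schedule" concrete, here is a sketch of a linear beta schedule in the standard DDPM formulation. The endpoint values `BETA_START` and `BETA_END` are assumptions (common DDPM defaults) and are not taken from this card or its config; only the step count and the linear shape come from the line above.

```python
NUM_TIMESTEPS = 1000                 # from this card
BETA_START, BETA_END = 1e-4, 0.02    # ASSUMPTION: common DDPM defaults, not from this card


def linear_beta_schedule(n_steps, start, end):
    """Return n_steps noise variances spaced linearly from start to end."""
    step = (end - start) / (n_steps - 1)
    return [start + i * step for i in range(n_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t), the fraction of signal variance left at step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


betas = linear_beta_schedule(NUM_TIMESTEPS, BETA_START, BETA_END)
alphas_cumprod = cumulative_alphas(betas)

# In the DDPM forward process a clean sample x0 is noised as
#   x_t = sqrt(alphas_cumprod[t]) * x0 + sqrt(1 - alphas_cumprod[t]) * noise
# With sample prediction, the network is trained to recover x0 from x_t.
```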