---
license: apache-2.0
tags:
- terravision
- terramind
- tokenizer
- vqvae
- fsq
- divae
- geospatial
- remote-sensing
- ahn
library_name: terratorch
---

# TerraVision Tokenizer — AHN

A DiVAE (Diffusion VQ-VAE) tokenizer for **fused AHN6 DSM + DTM elevation (2-channel, float32)**, trained on Dutch national geospatial data. Part of the TerraVision-NL project.

## Architecture

| Component | Value |
|-----------|-------|
| Encoder | ViT-B (`vit_b_enc`) |
| Decoder | Patched UNet (`unet_patched`) |
| Quantizer | FSQ (codebook: `8-8-8-6-5`, vocab: 15,360) |
| Image size | 448 × 448 px |
| Patch size | 16 × 16 px |
| Token grid | 28 × 28 = 784 tokens per image |
| Input channels | 2 (Digital Surface Model, Digital Terrain Model) |
| Latent dim | 5 |

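The FSQ vocabulary size in the table follows directly from the level spec: each of the 5 latent dimensions is quantized independently to a fixed number of levels, and the vocabulary is the product of those level counts. A quick arithmetic check:

```python
# FSQ codebook spec "8-8-8-6-5": per-dimension quantization levels.
levels = [8, 8, 8, 6, 5]

# Vocabulary size is the product of the per-dimension level counts.
vocab_size = 1
for n in levels:
    vocab_size *= n

print(vocab_size)  # 15360, matching the table above
```
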
## Geospatial Properties

All TerraVision tokenizers cover the **same spatial window** with 448 pixels, regardless of the underlying raster resolution. This keeps token grids spatially aligned across modalities for cross-modal pretraining.

- **Pixel size**: 0.08 m
- **Source**: Actueel Hoogtebestand Nederland 6 (AHN6) at 7.5 cm native resolution
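
The pixel and patch sizes above imply the following ground geometry; this is plain arithmetic derived from the numbers on this card, not part of the tokenizer API:

```python
# Ground footprint implied by the image, pixel, and patch sizes above.
image_px = 448      # image size in pixels
pixel_m = 0.08      # pixel size in metres
patch_px = 16       # patch size in pixels

window_m = image_px * pixel_m          # spatial window covered by one image
tokens_per_side = image_px // patch_px # tokens per side of the grid
token_m = patch_px * pixel_m           # ground footprint of one token

print(round(window_m, 2))   # 35.84 m per side
print(tokens_per_side)      # 28 (28 × 28 = 784 tokens)
print(round(token_m, 2))    # 1.28 m per token
```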

## Normalization

Input data should be normalized before encoding:

- **Scheme**: min-max (clip to [-20, 80] m, then rescale to [0, 1])

See `config.json` for the exact normalization parameters.

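A minimal sketch of this min-max scheme, assuming the clip bounds stated above; confirm the exact values against `config.json` before use:

```python
import numpy as np

LOW, HIGH = -20.0, 80.0  # clip range in metres, from the scheme above

def normalize(x: np.ndarray) -> np.ndarray:
    """Clip elevations to [LOW, HIGH] and rescale to [0, 1]."""
    return (np.clip(x, LOW, HIGH) - LOW) / (HIGH - LOW)

def denormalize(y: np.ndarray) -> np.ndarray:
    """Invert the rescaling; values clipped away are not recoverable."""
    return y * (HIGH - LOW) + LOW

elev = np.array([-25.0, 0.0, 30.0, 100.0], dtype=np.float32)
print(normalize(elev))  # → 0.0, 0.2, 0.5, 1.0 (out-of-range values clipped)
```
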
## Usage

```python
import torch
from huggingface_hub import hf_hub_download
from terratorch.models.backbones.terramind.tokenizer.vqvae import DiVAE

# Download weights
weights_path = hf_hub_download(repo_id="YOUR_REPO_ID", filename="tokenizer.pt")

# Instantiate the model with the architecture settings from this card
tokenizer = DiVAE(
    image_size=448,
    patch_size=16,
    n_channels=2,
    enc_type="vit_b_enc",
    dec_type="unet_patched",
    quant_type="fsq",
    codebook_size="8-8-8-6-5",
    latent_dim=5,
    post_mlp=True,
    norm_codes=True,
)

# Load weights
state_dict = torch.load(weights_path, map_location="cpu")
tokenizer.load_state_dict(state_dict)
tokenizer.eval()

# Encode: image → tokens
x = torch.randn(1, 2, 448, 448)
quant, code_loss, tokens = tokenizer.encode(x)
print(tokens.shape)  # (1, 28, 28)

# Reconstruct via diffusion sampling (full forward pass: encode + decode)
recon = tokenizer(x, timesteps=50)
```

## Training

Trained with the TerraVision-NL codebase using DiVAE (a diffusion-based VQ-VAE), following the methodology of the TerraMind paper (Section 8.1).

- **Checkpoint**: `ahn-best-epoch-0002.ckpt`
- **Diffusion**: 1000 timesteps, linear schedule, predicts the sample (x₀)
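
For intuition, a 1000-step linear schedule can be sketched as below. The β endpoints (1e-4, 2e-2) are the common DDPM defaults and are an assumption here, not values read from this checkpoint:

```python
import numpy as np

T = 1000                             # diffusion timesteps, as listed above
betas = np.linspace(1e-4, 2e-2, T)   # linear noise schedule (endpoints assumed)
alphas_cumprod = np.cumprod(1.0 - betas)

# By the final step almost all signal has been replaced by noise;
# the sample-predicting decoder learns to invert this corruption.
print(alphas_cumprod[-1])  # effectively zero remaining signal
```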