---
license: apache-2.0
tags:
- terravision
- terramind
- tokenizer
- vqvae
- fsq
- divae
- geospatial
- remote-sensing
- ahn
library_name: terratorch
---

# TerraVision Tokenizer — AHN

A DiVAE (Diffusion VQ-VAE) tokenizer for **Fused AHN6 DSM + DTM elevation (2-channel, float32)** trained on
Dutch national geospatial data. Part of the TerraVision-NL project.

## Architecture

| Component | Value |
|-----------|-------|
| Encoder | ViT-B (vit_b_enc) |
| Decoder | Patched UNet (unet_patched) |
| Quantizer | FSQ (codebook: `8-8-8-6-5`, vocab: 15,360) |
| Image size | 448×448 px |
| Patch size | 16×16 px |
| Token grid | 28×28 = 784 tokens per image |
| Input channels | 2 (Digital Surface Model, Digital Terrain Model) |
| Latent dim | 5 |
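
The FSQ vocabulary follows directly from the codebook spec: each latent dimension is quantized independently into a fixed number of levels, and the effective vocabulary size is the product of those level counts. A quick sanity check of the numbers in the table:

```python
# FSQ codebook spec: one level count per latent dimension.
levels = [int(n) for n in "8-8-8-6-5".split("-")]

# Vocabulary size is the product of per-dimension level counts.
vocab_size = 1
for n in levels:
    vocab_size *= n

print(len(levels), vocab_size)  # 5 15360
```

This is why the latent dim is 5 and the vocabulary is 15,360: 8 × 8 × 8 × 6 × 5 = 15,360.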

## Geospatial Properties

All TerraVision tokenizers cover the **same spatial window** within their 448-pixel input,
regardless of the underlying raster resolution. This keeps token grids spatially aligned
across modalities for cross-modal pretraining.

- **Pixel size**: 0.08 m
- **Source**: Actueel Hoogtebestand Nederland 6 (AHN6) at 7.5 cm resolution
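
The pixel size and patch size above imply a fixed ground footprint per input tile and per token; a quick check of the arithmetic:

```python
# Ground footprint of one input tile and one token for the AHN tokenizer.
pixel_size_m = 0.08  # metres per pixel (table above)
image_px = 448       # input tile width/height in pixels
patch_px = 16        # one token corresponds to one 16x16 patch

window_m = image_px * pixel_size_m  # spatial extent of one tile, ~35.84 m per side
token_m = patch_px * pixel_size_m   # ground footprint of one token, ~1.28 m per side
print(window_m, token_m)
```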

## Normalization

Input data should be normalized before encoding:
- **Scheme**: minmax (clip [-20, 80] → [0, 1])

See `config.json` for exact normalization parameters.
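
A minimal sketch of the min-max scheme described above (clip to [-20, 80] m, then rescale to [0, 1]), written in plain Python for illustration; the exact parameters, and any per-channel differences, should be taken from `config.json`:

```python
def normalize_elevation(values, lo=-20.0, hi=80.0):
    """Clip elevation values (metres) to [lo, hi], then rescale to [0, 1]."""
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]

print(normalize_elevation([-50.0, -20.0, 0.0, 30.0, 80.0, 120.0]))
# [0.0, 0.0, 0.2, 0.5, 1.0, 1.0]
```

The same clip-and-scale would be applied element-wise to the 2-channel input tensor before calling `tokenizer.encode`.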

## Usage

```python
import torch
from huggingface_hub import hf_hub_download
from terratorch.models.backbones.terramind.tokenizer.vqvae import DiVAE

# Download weights
weights_path = hf_hub_download(repo_id="YOUR_REPO_ID", filename="tokenizer.pt")

# Instantiate model
tokenizer = DiVAE(
    image_size=448,
    patch_size=16,
    n_channels=2,
    enc_type="vit_b_enc",
    dec_type="unet_patched",
    quant_type="fsq",
    codebook_size="8-8-8-6-5",
    latent_dim=5,
    post_mlp=True,
    norm_codes=True,
)

# Load weights
state_dict = torch.load(weights_path, map_location="cpu")
tokenizer.load_state_dict(state_dict)
tokenizer.eval()

# Encode: image → tokens
x = torch.randn(1, 2, 448, 448)
quant, code_loss, tokens = tokenizer.encode(x)
print(tokens.shape)  # (1, 28, 28)

# Reconstruct: full forward pass, image → tokens → image (diffusion sampling)
recon = tokenizer(x, timesteps=50)
```

## Training

Trained with the TerraVision-NL codebase using DiVAE (diffusion-based VQ-VAE)
following the TerraMind paper methodology (Section 8.1).

- **Checkpoint**: `ahn-best-epoch-0002.ckpt`
- **Diffusion**: 1000 timesteps, linear schedule, sample (x₀) prediction
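
For reference, a linear noise schedule over 1000 timesteps can be sketched as below. The beta range (1e-4 to 0.02) is the common DDPM default and an assumption here, not a confirmed training parameter:

```python
def linear_beta_schedule(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_t for t = 0 .. timesteps-1."""
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]

betas = linear_beta_schedule()
print(len(betas))  # 1000 values, from 1e-4 up to 0.02
```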