---
license: apache-2.0
tags:
- terravision
- terramind
- tokenizer
- vqvae
- fsq
- divae
- geospatial
- remote-sensing
- ahn
library_name: terratorch
---
# TerraVision Tokenizer — AHN
A DiVAE (Diffusion VQ-VAE) tokenizer for **Fused AHN6 DSM + DTM elevation (2-channel, float32)** trained on
Dutch national geospatial data. Part of the TerraVision-NL project.
## Architecture
| Component | Value |
|-----------|-------|
| Encoder | ViT-B (vit_b_enc) |
| Decoder | Patched UNet (unet_patched) |
| Quantizer | FSQ (codebook: `8-8-8-6-5`, vocab: 15,360) |
| Image size | 448×448 px |
| Patch size | 16×16 px |
| Token grid | 28×28 = 784 tokens per image |
| Input channels | 2 (Digital Surface Model, Digital Terrain Model) |
| Latent dim | 5 |
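
The vocabulary size, latent dimension, and token grid in the table follow directly from the FSQ levels and the patch geometry. A minimal sketch of that arithmetic, using only the values listed above (the variable names are illustrative and not part of the terratorch API):

```python
import math

# FSQ levels, parsed from the codebook string in the table
fsq_levels = [int(n) for n in "8-8-8-6-5".split("-")]
image_size, patch_size = 448, 16

vocab_size = math.prod(fsq_levels)     # one code per combination of levels
latent_dim = len(fsq_levels)           # one latent channel per FSQ level
grid_side = image_size // patch_size   # tokens along each image side
tokens_per_image = grid_side ** 2

print(vocab_size, latent_dim, grid_side, tokens_per_image)
# 15360 5 28 784
```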
## Geospatial Properties
All TerraVision tokenizers cover the **same spatial window** with a 448×448 px input,
regardless of the underlying raster resolution. This keeps token grids spatially aligned
across modalities for cross-modal pretraining.
- **Pixel size**: 0.08 m
- **Source**: Actueel Hoogtebestand Nederland 6 (AHN6) at 7.5 cm resolution
## Normalization
Input data should be normalized before encoding:
- **Scheme**: minmax (clip [-20, 80] → [0, 1])

See `config.json` for the exact normalization parameters.
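
A minimal sketch of the minmax scheme with NumPy. It assumes the clip bounds are in metres and apply to both channels (DSM and DTM); the function names are hypothetical, and `config.json` remains the authoritative source for the parameters. Nodata handling is not covered here.

```python
import numpy as np

# Clip bounds from the scheme above (assumed to be metres, shared by both channels)
ELEV_MIN, ELEV_MAX = -20.0, 80.0

def normalize_elevation(raster: np.ndarray) -> np.ndarray:
    """Clip elevations to [ELEV_MIN, ELEV_MAX] and rescale to [0, 1]."""
    clipped = np.clip(raster.astype(np.float32), ELEV_MIN, ELEV_MAX)
    return (clipped - ELEV_MIN) / (ELEV_MAX - ELEV_MIN)

def denormalize_elevation(normalized: np.ndarray) -> np.ndarray:
    """Map [0, 1] values back to elevations (clipped values are not recovered)."""
    return normalized * (ELEV_MAX - ELEV_MIN) + ELEV_MIN

# Hypothetical 2-channel sample (DSM, DTM) with two out-of-range values
sample = np.array([[[-25.0, 0.0]], [[30.0, 120.0]]], dtype=np.float32)
normalized = normalize_elevation(sample)
```

Values below -20 map to 0 and values above 80 map to 1, so the inverse transform only recovers elevations that were inside the clip range.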
## Usage
```python
import torch
from huggingface_hub import hf_hub_download
from terratorch.models.backbones.terramind.tokenizer.vqvae import DiVAE

# Download weights (replace YOUR_REPO_ID with this repository's ID)
weights_path = hf_hub_download(repo_id="YOUR_REPO_ID", filename="tokenizer.pt")

# Instantiate model
tokenizer = DiVAE(
    image_size=448,
    patch_size=16,
    n_channels=2,
    enc_type="vit_b_enc",
    dec_type="unet_patched",
    quant_type="fsq",
    codebook_size="8-8-8-6-5",
    latent_dim=5,
    post_mlp=True,
    norm_codes=True,
)

# Load weights
state_dict = torch.load(weights_path, map_location="cpu")
tokenizer.load_state_dict(state_dict)
tokenizer.eval()

# Dummy input in [0, 1]; real data should be normalized as described above
x = torch.rand(1, 2, 448, 448)

with torch.no_grad():
    # Encode: image → tokens
    quant, code_loss, tokens = tokenizer.encode(x)
    print(tokens.shape)  # (1, 28, 28)

    # Reconstruct: encode x, then decode via diffusion sampling (50 steps)
    recon = tokenizer(x, timesteps=50)
```
## Training
Trained with the TerraVision-NL codebase using DiVAE (diffusion-based VQ-VAE)
following the TerraMind paper methodology (Section 8.1).
- **Checkpoint**: `ahn-best-epoch-0002.ckpt`
- **Diffusion**: 1000 timesteps, linear noise schedule, sample prediction (the decoder predicts the clean sample rather than the noise)
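
For orientation, a linear schedule over 1000 timesteps can be sketched as follows. The beta endpoints are an assumption (the common DDPM defaults of 1e-4 and 0.02); this card does not state the values used in training, and `noise_sample` is a hypothetical helper, not part of the TerraVision-NL codebase.

```python
import math

NUM_TIMESTEPS = 1000
BETA_START, BETA_END = 1e-4, 0.02  # ASSUMED DDPM defaults, not taken from this card

# Linear schedule: beta rises linearly from BETA_START to BETA_END
betas = [
    BETA_START + (BETA_END - BETA_START) * t / (NUM_TIMESTEPS - 1)
    for t in range(NUM_TIMESTEPS)
]

# Cumulative product of (1 - beta): the fraction of signal variance kept at step t
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)

def noise_sample(x0: float, eps: float, t: int) -> float:
    """Forward process for one scalar: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    a_t = alphas_cumprod[t]
    return math.sqrt(a_t) * x0 + math.sqrt(1.0 - a_t) * eps
```

With sample prediction, the training target for the decoder at step `t` is the clean value `x0` itself, not the noise `eps`.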