---
license: apache-2.0
tags:
- terravision
- terramind
- tokenizer
- vqvae
- fsq
- divae
- geospatial
- remote-sensing
- ahn
library_name: terratorch
---

# TerraVision Tokenizer — AHN

A DiVAE (Diffusion VQ-VAE) tokenizer for **fused AHN6 DSM + DTM elevation (2-channel, float32)**, trained on Dutch national geospatial data. Part of the TerraVision-NL project.

## Architecture

| Component | Value |
|-----------|-------|
| Encoder | ViT-B (`vit_b_enc`) |
| Decoder | Patched UNet (`unet_patched`) |
| Quantizer | FSQ (codebook: `8-8-8-6-5`, vocab: 15,360) |
| Image size | 448 × 448 px |
| Patch size | 16 × 16 px |
| Token grid | 28 × 28 = 784 tokens per image |
| Input channels | 2 (Digital Surface Model, Digital Terrain Model) |
| Latent dim | 5 |

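The FSQ vocabulary size in the table follows directly from the level spec: each of the 5 latent dimensions is quantized independently to a fixed number of levels, and the vocabulary is the product of those level counts. A quick arithmetic check:

```python
# FSQ codebook spec "8-8-8-6-5": per-dimension quantization levels.
levels = [8, 8, 8, 6, 5]

# Vocabulary size is the product of the per-dimension level counts.
vocab_size = 1
for n in levels:
    vocab_size *= n

print(vocab_size)  # 15360, matching the table above
```
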
## Geospatial Properties

All TerraVision tokenizers cover the **same spatial window** with 448 pixels, regardless of the underlying raster resolution. This keeps token grids spatially aligned across modalities for cross-modal pretraining.

- **Pixel size**: 0.08 m
- **Source**: Actueel Hoogtebestand Nederland 6 (AHN6) at 7.5 cm native resolution
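
The pixel and patch sizes above imply the following ground geometry; this is plain arithmetic derived from the numbers on this card, not part of the tokenizer API:

```python
# Ground footprint implied by the image, pixel, and patch sizes above.
image_px = 448      # image size in pixels
pixel_m = 0.08      # pixel size in metres
patch_px = 16       # patch size in pixels

window_m = image_px * pixel_m          # spatial window covered by one image
tokens_per_side = image_px // patch_px # tokens per side of the grid
token_m = patch_px * pixel_m           # ground footprint of one token

print(round(window_m, 2))   # 35.84 m per side
print(tokens_per_side)      # 28 (28 × 28 = 784 tokens)
print(round(token_m, 2))    # 1.28 m per token
```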

## Normalization

Input data should be normalized before encoding:

- **Scheme**: min-max (clip to [-20, 80] m, then rescale to [0, 1])

See `config.json` for the exact normalization parameters.

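A minimal sketch of this min-max scheme, assuming the clip bounds stated above; confirm the exact values against `config.json` before use:

```python
import numpy as np

LOW, HIGH = -20.0, 80.0  # clip range in metres, from the scheme above

def normalize(x: np.ndarray) -> np.ndarray:
    """Clip elevations to [LOW, HIGH] and rescale to [0, 1]."""
    return (np.clip(x, LOW, HIGH) - LOW) / (HIGH - LOW)

def denormalize(y: np.ndarray) -> np.ndarray:
    """Invert the rescaling; values clipped away are not recoverable."""
    return y * (HIGH - LOW) + LOW

elev = np.array([-25.0, 0.0, 30.0, 100.0], dtype=np.float32)
print(normalize(elev))  # → 0.0, 0.2, 0.5, 1.0 (out-of-range values clipped)
```
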
## Usage

```python
import torch
from huggingface_hub import hf_hub_download
from terratorch.models.backbones.terramind.tokenizer.vqvae import DiVAE

# Download weights
weights_path = hf_hub_download(repo_id="YOUR_REPO_ID", filename="tokenizer.pt")

# Instantiate the model with the architecture settings from this card
tokenizer = DiVAE(
    image_size=448,
    patch_size=16,
    n_channels=2,
    enc_type="vit_b_enc",
    dec_type="unet_patched",
    quant_type="fsq",
    codebook_size="8-8-8-6-5",
    latent_dim=5,
    post_mlp=True,
    norm_codes=True,
)

# Load weights
state_dict = torch.load(weights_path, map_location="cpu")
tokenizer.load_state_dict(state_dict)
tokenizer.eval()

# Encode: image → tokens
x = torch.randn(1, 2, 448, 448)
quant, code_loss, tokens = tokenizer.encode(x)
print(tokens.shape)  # (1, 28, 28)

# Reconstruct via diffusion sampling (full forward pass: encode + decode)
recon = tokenizer(x, timesteps=50)
```

## Training

Trained with the TerraVision-NL codebase using DiVAE (a diffusion-based VQ-VAE), following the methodology of the TerraMind paper (Section 8.1).

- **Checkpoint**: `ahn-best-epoch-0002.ckpt`
- **Diffusion**: 1000 timesteps, linear schedule, predicts the sample (x₀)
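
For intuition, a 1000-step linear schedule can be sketched as below. The β endpoints (1e-4, 2e-2) are the common DDPM defaults and are an assumption here, not values read from this checkpoint:

```python
import numpy as np

T = 1000                             # diffusion timesteps, as listed above
betas = np.linspace(1e-4, 2e-2, T)   # linear noise schedule (endpoints assumed)
alphas_cumprod = np.cumprod(1.0 - betas)

# By the final step almost all signal has been replaced by noise;
# the sample-predicting decoder learns to invert this corruption.
print(alphas_cumprod[-1])  # effectively zero remaining signal
```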