AIOne-GeoSeg

AIOne-GeoSeg-330M

A 330M-parameter Vision Transformer with our TCAM decoder for Korean land-cover semantic segmentation on aerial and satellite imagery.

📄 Paper (coming soon)


Model Description

AIOne-GeoSeg-330M is a semantic segmentation model that classifies every pixel of a Korean aerial / satellite ortho-image into one of 11 land-cover categories (background, building, parking lot, road, street tree, paddy, greenhouse, dry field, forest, bare land, farmland).

The model pairs a DINOv3 ViT-L/16 backbone (sat-493M pretrain) with the TCAM (Terrain-Context Adaptive Mutual-attention) head, a segmentation decoder we designed in-house for Korean remote-sensing imagery. TCAM infers an implicit terrain context (mountain / plain / urban) from RGB alone, weights the backbone scales per pixel, and runs bidirectional cross-attention between learnable class prototypes and spatial features so that visually similar classes such as paddy / dry field / farmland or street tree / forest can be separated more reliably.

  • Custom TCAM head. 6 in-house submodules: FeatureBridge, TCE, SAA, CFMA ×3, BoundaryGuidedUpsampler, Classifier. +8.4 % relative mIoU over DPT on the same backbone and data.
  • DINOv3-ViT-L/16 backbone. 24 layers, 1024 hidden, 4 register tokens, satellite-pretrained (sat-493M), unfrozen during training.
  • Single-shot, full-resolution mask. 512×512 RGB input → 512×512 per-class logits, no sliding window required.
  • HF Transformers compatible. Loads via AutoModel with trust_remote_code=True; ships an AIOne_GeoSegImageProcessor.

Key Capabilities

  • Pixel-wise classification of Korean ortho-imagery into 11 land-cover classes.
  • Robust separation of similar classes (paddy / dry field / farmland, street tree / forest) via class-prototype mutual attention.
  • Sharp object and parcel boundaries via boundary-guided upsampling, useful for buildings, roads, and field edges.
  • Built-in Korean label set and per-class RGB palette for direct visualization.
  • One-call loading through transformers.AutoModel / AutoImageProcessor.

Classes

| ID | Korean | English | RGB |
|---|---|---|---|
| 0 | 배경 | Background | (0, 0, 0) |
| 1 | 건물 | Building | (184, 131, 237) |
| 2 | 주차장 | Parking lot | (16, 64, 178) |
| 3 | 도로 | Road | (42, 65, 247) |
| 4 | 가로수 | Street tree | (200, 229, 155) |
| 5 | 논 | Paddy | (191, 255, 255) |
| 6 | 비닐하우스 | Greenhouse | (220, 240, 255) |
| 7 | 밭 | Dry field | (102, 249, 247) |
| 8 | 산림 | Forest | (45, 75, 42) |
| 9 | 나지 | Bare land | (255, 242, 159) |
| 10 | 농경지 | Farmland | (210, 180, 140) |
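
The palette can also be applied in reverse, for example to turn a saved color mask back into class IDs. Below is a minimal NumPy sketch with the RGB values copied from the table; the helper names `colorize` and `decode` and the sentinel value 255 are illustrative assumptions, not part of the released package.

```python
import numpy as np

# RGB palette copied from the class table; row index == class ID.
PALETTE = np.array([
    (0, 0, 0),        # 0 background
    (184, 131, 237),  # 1 building
    (16, 64, 178),    # 2 parking lot
    (42, 65, 247),    # 3 road
    (200, 229, 155),  # 4 street tree
    (191, 255, 255),  # 5 paddy
    (220, 240, 255),  # 6 greenhouse
    (102, 249, 247),  # 7 dry field
    (45, 75, 42),     # 8 forest
    (255, 242, 159),  # 9 bare land
    (210, 180, 140),  # 10 farmland
], dtype=np.uint8)


def colorize(mask: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of class IDs to an (H, W, 3) RGB image."""
    return PALETTE[mask]


def decode(color_mask: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) RGB image back to class IDs; unknown colors become 255."""
    # Compare every pixel against every palette entry -> (H, W, 11) booleans.
    matches = (color_mask[:, :, None, :] == PALETTE[None, None, :, :]).all(axis=-1)
    ids = matches.argmax(axis=-1).astype(np.uint8)
    ids[~matches.any(axis=-1)] = 255  # hypothetical "no class" sentinel
    return ids


mask = np.array([[0, 1, 5], [7, 8, 10]])
roundtrip = decode(colorize(mask))
print(np.array_equal(roundtrip, mask))
```

Exact color matching only works on lossless files such as PNG; JPEG compression shifts the RGB values and would send most pixels to the sentinel.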

TCAM Head

TCAM = Terrain-Context Adaptive Mutual-attention. It takes multi-layer hidden states from DINOv3-ViT-L/16 at stages {5, 11, 17, 23} and decodes them through 6 submodules:

| # | Submodule | Role | Core idea |
|---|---|---|---|
| 1 | FeatureBridge ×4 | Reshape ViT hidden states (B, N, C) to 2D maps (B, D, H, W) | CLS readout is projected and concatenated to patch tokens, then a 1×1 conv compresses channels |
| 2 | TCE (Terrain Context Estimator) | Infer an implicit terrain context (mountain / plain / urban) from the deepest feature; output FiLM parameters (γ, β) | GAP → MLP. No DEM required; γ is initialized with a +1.0 residual for an identity start |
| 3 | SAA (Scale-Adaptive Aggregation) | Per-pixel softmax fusion of the 4 scales | FiLM-modulated by (γ, β) so that forest pixels lean on deep layers and building / road pixels lean on shallow layers automatically |
| 4 | CFMA ×3 (Class–Feature Mutual Attention) | Bidirectional cross-attention between learnable class prototypes and the fused spatial feature | (1) Class → Feature: each class embedding queries the spatial feature; (2) Feature → Class: spatial features re-query the refreshed prototypes. Major gain on confusable classes (paddy / dry field / farmland, street tree / forest) |
| 5 | BoundaryGuidedUpsampler | Extract a boundary attention map (sigmoid) from a shallow feature and run a 4-stage ConvTranspose for 16× upsampling | Recovers sharp parcel and mountain-ridge boundaries |
| 6 | Classifier | 3×3 conv → 1×1 conv | Pixel-wise logits |

Pipeline: FeatureBridge ×4 → TCE → SAA → CFMA ×3 → BoundaryGuidedUpsampler → Classifier
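
To make the TCE and SAA steps more concrete, here is a NumPy sketch of the arithmetic they describe: global average pooling and an MLP produce FiLM parameters, each scale is modulated, and a per-pixel softmax fuses the four scales. The weight shapes, the ReLU, the zero initialization of the last MLP layer, and the per-scale scoring weights `w_score` are all assumptions made for illustration; the released implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, H, W, S = 1, 384, 32, 32, 4   # batch, channels, token grid, number of scales
T = 192                             # TCE hidden size (tcam_tce_hidden_size)

# Four FeatureBridge outputs, shallow -> deep, each (B, D, H, W). Random stand-ins.
feats = [rng.standard_normal((B, D, H, W)).astype(np.float32) for _ in range(S)]

# --- TCE: global average pool -> MLP -> FiLM parameters ---------------------
w1 = rng.standard_normal((D, T)).astype(np.float32) * 0.02
w2 = np.zeros((T, 2 * D), dtype=np.float32)   # assumed zero init => identity FiLM

pooled = feats[-1].mean(axis=(2, 3))          # (B, D): GAP over the deepest feature
hidden = np.maximum(pooled @ w1, 0.0)         # (B, T): one ReLU layer (assumed)
d_gamma, beta = np.split(hidden @ w2, 2, axis=1)
gamma = 1.0 + d_gamma                         # +1.0 residual: gamma starts at 1

# --- SAA: FiLM-modulate each scale, then fuse with a per-pixel softmax ------
w_score = rng.standard_normal((S, D)).astype(np.float32) * 0.02  # hypothetical 1x1 convs

modulated = np.stack(
    [gamma[:, :, None, None] * f + beta[:, :, None, None] for f in feats], axis=1
)                                             # (B, S, D, H, W)
scores = np.einsum("bsdhw,sd->bshw", modulated, w_score)  # one score per scale and pixel
scores -= scores.max(axis=1, keepdims=True)   # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True) # softmax over the 4 scales
fused = (weights[:, :, None] * modulated).sum(axis=1)     # (B, D, H, W)

print(fused.shape)
```

Because the softmax runs over the scale axis at every pixel, each location can draw on a different mix of shallow and deep features.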

Parameter Breakdown

| Module | Params |
|---|---|
| FeatureBridge ×4 | 9.98 M |
| TCE | 0.26 M |
| SAA | 0.59 M |
| CFMA ×3 | 7.11 M |
| BoundaryGuidedUpsampler | 8.10 M |
| Classifier | 1.33 M |
| TCAM Head total | 27.37 M |
| Backbone (DINOv3-ViT-L/16) | 303.13 M |
| Total | 330.50 M |

Compared with the DPT head (24.13 M), TCAM adds 3.24 M parameters: +13.4 % for the head alone and +0.99 % for the full model.
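
The totals and percentages can be reproduced from the per-module counts. The short check below uses only the numbers in the table plus the 24.13 M DPT head size reported in the Results section.

```python
# Parameter counts in millions, copied from the table above.
head = {
    "FeatureBridge x4": 9.98,
    "TCE": 0.26,
    "SAA": 0.59,
    "CFMA x3": 7.11,
    "BoundaryGuidedUpsampler": 8.10,
    "Classifier": 1.33,
}
backbone = 303.13
dpt_head = 24.13  # DPT head size from the Results table

tcam_head = round(sum(head.values()), 2)
total = round(backbone + tcam_head, 2)
extra = round(tcam_head - dpt_head, 2)
head_growth = 100 * extra / dpt_head                # relative to the DPT head
total_growth = 100 * extra / (backbone + dpt_head)  # relative to the full DPT model

print(f"TCAM head {tcam_head:.2f} M, total {total:.2f} M")
print(f"vs DPT: +{extra:.2f} M (+{head_growth:.1f} % head, +{total_growth:.2f} % total)")
```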

TCAM Hyperparameters

| Field | Value |
|---|---|
| tcam_hidden_size (D) | 384 |
| tcam_num_heads (CFMA) | 12 |
| tcam_cfma_layers | 3 |
| tcam_tce_hidden_size | 192 |
| tcam_boundary_channels | 64 |
| readout_type | project |
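
These values fix the attention geometry of CFMA: 384 channels split over 12 heads gives 32 channels per head, applied for 3 layers over 11 class prototypes. The NumPy sketch below walks through the Class → Feature and Feature → Class steps with those sizes. It is a simplification under stated assumptions: there are no learned projections, normalization layers, or feed-forward blocks, and the 32×32 token grid for the fused feature is assumed rather than confirmed.

```python
import numpy as np

D, HEADS, LAYERS, K = 384, 12, 3, 11   # hidden size, heads, CFMA layers, classes
N = 32 * 32                            # spatial tokens (assumed 32x32 grid)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query, memory, heads=HEADS):
    """Multi-head cross-attention without learned projections (simplification)."""
    d_head = query.shape[-1] // heads                           # 384 / 12 = 32
    q = query.reshape(-1, heads, d_head).transpose(1, 0, 2)     # (heads, Nq, d_head)
    m = memory.reshape(-1, heads, d_head).transpose(1, 0, 2)    # (heads, Nm, d_head)
    attn = softmax(q @ m.transpose(0, 2, 1) / np.sqrt(d_head))  # (heads, Nq, Nm)
    out = attn @ m                                              # (heads, Nq, d_head)
    return out.transpose(1, 0, 2).reshape(query.shape)


rng = np.random.default_rng(0)
prototypes = rng.standard_normal((K, D)) * 0.02   # stand-in for learnable class embeddings
features = rng.standard_normal((N, D))            # stand-in for the fused spatial feature

for _ in range(LAYERS):
    # (1) Class -> Feature: each prototype gathers evidence from the image.
    prototypes = prototypes + cross_attend(prototypes, features)
    # (2) Feature -> Class: each location re-queries the refreshed prototypes.
    features = features + cross_attend(features, prototypes)

print(prototypes.shape, features.shape)   # (11, 384) (1024, 384)
```

With only 11 prototypes, each attention map is 11 × 1024 per head, far smaller than the 1024 × 1024 map of full self-attention over the same grid.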

Training

| Item | Value |
|---|---|
| Dataset | Korean aerial photography + AI Hub land-cover |
| Classes | 11 (background + 10 land-cover) |
| Input | 512 × 512, 3-channel RGB |
| Backbone | DINOv3-ViT-L/16 (sat-493M pretrain), unfrozen |
| Selected layers | {5, 11, 17, 23} |
| Patch size | 16 (32 × 32 token grid) |
| Loss | Focal (γ = 2.5) + Dice |
| Optimizer | AdamW, lr 1.3e-4, wd 0.01 |
| Scheduler | CosineAnnealing, η_min = 1e-6 |
| Batch size | 32 |
| Mixed precision | bf16 |
| Sampler | Weighted (class-balanced) |
| Epochs | 40 |
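
As an illustration of the loss row, the following NumPy sketch combines a multi-class focal term (γ = 2.5) with a soft Dice term. The 1:1 weighting, the class-averaged Dice, and the function name `focal_dice_loss` are assumptions; the table does not state how the two terms are balanced in the actual training code.

```python
import numpy as np


def focal_dice_loss(logits, target, gamma=2.5, eps=1e-6):
    """logits: (B, C, H, W) floats; target: (B, H, W) integer class IDs."""
    num_classes = logits.shape[1]

    # Softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)

    one_hot = np.moveaxis(np.eye(num_classes)[target], -1, 1)   # (B, C, H, W)

    # Focal term: down-weight pixels that are already classified confidently.
    p_t = (probs * one_hot).sum(axis=1)                         # (B, H, W)
    focal = (-((1.0 - p_t) ** gamma) * np.log(p_t + eps)).mean()

    # Soft Dice term, averaged over classes.
    inter = (probs * one_hot).sum(axis=(0, 2, 3))
    denom = probs.sum(axis=(0, 2, 3)) + one_hot.sum(axis=(0, 2, 3))
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

    return focal + dice   # assumed 1:1 weighting


rng = np.random.default_rng(0)
target = rng.integers(0, 11, size=(2, 64, 64))
random_logits = rng.standard_normal((2, 11, 64, 64))
perfect_logits = 20.0 * np.moveaxis(np.eye(11)[target], -1, 1)

loss_random = focal_dice_loss(random_logits, target)
loss_perfect = focal_dice_loss(perfect_logits, target)
print(loss_random, loss_perfect)   # the second value is close to 0
```

The focal exponent suppresses the contribution of easy pixels, while the Dice term is computed per class and so keeps small classes from being swamped by large ones.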

Results

TCAM vs DPT (same backbone, same data, 40 epochs)

| Head | Params | Best Val mIoU | Best Epoch |
|---|---|---|---|
| DPT | 24.13 M | 0.6505 | 39 |
| TCAM (ours) | 27.37 M | 0.7054 | 40 |
| Δ | +3.24 M | +0.0549 (+8.4 % rel.) | n/a |
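
For readers who want to reproduce the metric on their own masks, here is a sketch of mIoU computed from a confusion matrix, followed by the absolute and relative gain from the table. Whether the reported numbers include the background class or skip absent classes is not stated, so the averaging rule below is an assumption.

```python
import numpy as np


def mean_iou(pred, target, num_classes=11):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    # Confusion matrix: rows = ground truth, columns = prediction.
    idx = target.ravel() * num_classes + pred.ravel()
    cm = np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    valid = union > 0
    return (inter[valid] / union[valid]).mean()


target = np.array([[0, 0, 1, 1], [2, 2, 2, 2]])
pred = np.array([[0, 1, 1, 1], [2, 2, 2, 0]])
toy_miou = mean_iou(pred, target)
print(round(toy_miou, 4))   # per-class IoU 1/3, 2/3, 3/4 -> 0.5833

dpt, tcam = 0.6505, 0.7054
gain_line = f"{tcam - dpt:+.4f} absolute, {100 * (tcam - dpt) / dpt:+.1f} % relative"
print(gain_line)            # +0.0549 absolute, +8.4 % relative
```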

Training Dynamics

  • Epochs 1–10 (warm-up). mIoU oscillates between 0.53 and 0.66; train loss falls quickly from 0.486 to 0.361.
  • Epochs 11–35 (stabilization). mIoU climbs monotonically from 0.64 to 0.697; the focal + Dice combination drives minority-class learning.
  • Epochs 36–40 (fine convergence). With the LR in the 3e-5 → 1.6e-5 range, mIoU levels off, moving from 0.694 to 0.705.

Quick Start

Installation

pip install "transformers>=4.45" torch pillow

Inference

import torch
import numpy as np
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

MODEL_ID = "JDONE-Research/AIOne-GeoSeg-330M"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).to(device).eval()
processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("aerial.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(pixel_values=inputs["pixel_values"]).logits  # (1, 11, 512, 512)

mask = logits.argmax(dim=1)[0].cpu().numpy()  # (512, 512) int class IDs

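# Normalize keys to int so the lookup works whether the config stores "3" or 3.
id2label = {int(k): v for k, v in model.config.id2label.items()}
print("Detected classes:", sorted({id2label[int(i)] for i in np.unique(mask)}))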

Colorized mask

palette = np.array(model.config.label_colors, dtype=np.uint8)  # (11, 3)
color_mask = palette[mask]                                     # (512, 512, 3)
Image.fromarray(color_mask).save("mask.png")

# Blended overlay (50% image, 50% mask)
resized = image.resize((512, 512))
overlay = (np.array(resized) * 0.5 + color_mask * 0.5).astype(np.uint8)
Image.fromarray(overlay).save("overlay.png")

Batch inference

# `paths` is a list of image file paths that you supply.
images = [Image.open(p).convert("RGB") for p in paths]
inputs = processor(images=images, return_tensors="pt").to(device)
with torch.no_grad():
    masks = model(pixel_values=inputs["pixel_values"]).logits.argmax(dim=1).cpu().numpy()
# masks: (B, 512, 512)

Model Specs

| Field | Value |
|---|---|
| Architecture | AIOne_GeoSeg |
| Backbone | DINOv3 ViT-L/16, sat-493M pretrain (24 layers, 1024 hidden, 16 heads, 4 register tokens) |
| Segmentation head | TCAM (custom): FeatureBridge ×4, TCE, SAA, CFMA ×3, BoundaryGuidedUpsampler, Classifier |
| Total parameters | 330.50 M (Backbone 303.13 M + TCAM Head 27.37 M) |
| Weights on disk | 1.3 GB (FP32) |
| Input | RGB image, 512×512 |
| Output | Per-pixel logits, shape (B, 11, 512, 512) |
| Backbone feature taps | stages [5, 11, 17, 23] |
| Number of classes | 11 (Korean land-cover) |
| Validation mIoU | 0.7054 |
| Training precision | bf16 |
| Released checkpoint precision | float32 |
| Domain | Korean aerial / satellite ortho-imagery |
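
The shapes in this table follow from the patch size and token layout. The sketch below traces them with NumPy stand-ins. It assumes that the CLS and register tokens precede the patch tokens in each tapped hidden state, and that each of the four upsampling stages doubles the resolution; both are assumptions about the implementation, not confirmed details.

```python
import numpy as np

IMAGE, PATCH, HIDDEN = 512, 16, 1024
N_CLS, N_REG = 1, 4                    # CLS + register tokens (assumed to come first)

grid = IMAGE // PATCH                  # 32 tokens per side
n_patches = grid * grid                # 1024 patch tokens
seq_len = N_CLS + N_REG + n_patches    # 1029 tokens per hidden state

# Stand-in for one tapped hidden state of shape (B, seq_len, HIDDEN).
hidden = np.zeros((1, seq_len, HIDDEN), dtype=np.float32)

# Drop the special tokens and fold the patch tokens back into a 2D map.
patches = hidden[:, N_CLS + N_REG:, :]                             # (1, 1024, 1024)
fmap = patches.transpose(0, 2, 1).reshape(1, HIDDEN, grid, grid)   # (1, 1024, 32, 32)

# Four stages that each double the resolution give the 16x upsample.
size = grid
for _ in range(4):
    size *= 2

print(seq_len, fmap.shape, size)   # 1029 (1, 1024, 32, 32) 512
```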

Intended Use

  • Korean land-cover mapping from aerial or satellite ortho-imagery.
  • Change-detection pipelines (run the model on two epochs and diff the masks).
  • Urban-planning, agriculture, and forestry analytics that need per-pixel Korean class labels.
  • Research baseline for comparing other segmentation heads against TCAM.
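
For the change-detection use case, diffing two masks can be as simple as the NumPy sketch below, which returns a per-pixel change map and a from → to transition count matrix. It assumes both masks are co-registered and cover the same footprint at the same resolution; the function name `change_summary` is illustrative.

```python
import numpy as np

NUM_CLASSES = 11


def change_summary(mask_before, mask_after):
    """Return a boolean change map and an (11, 11) from->to transition count matrix."""
    changed = mask_before != mask_after
    idx = mask_before.ravel() * NUM_CLASSES + mask_after.ravel()
    transitions = np.bincount(idx, minlength=NUM_CLASSES**2).reshape(
        NUM_CLASSES, NUM_CLASSES
    )
    return changed, transitions


before = np.full((4, 4), 8)   # all forest (class 8)
after = before.copy()
after[:2, :2] = 9             # a 2x2 patch becomes bare land (class 9)

changed, transitions = change_summary(before, after)
print(changed.sum(), transitions[8, 9])   # 4 4
```

Off-diagonal entries of the transition matrix count pixels that moved between classes, so `transitions[8, 9]` is the number of forest pixels that became bare land.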

Out-of-Scope Use

  • Commercial use. This release is governed by CC-BY-NC-4.0; do not use it in revenue-generating products or services.
  • Imagery from regions or sensors that differ substantially from the Korean ortho-imagery training distribution (expect degraded accuracy).
  • Sole-source decision-making in legal, regulatory, or safety-critical contexts.
  • Any analysis that infringes on personal privacy, property rights, or applicable geospatial-data regulations.

License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0) license.

  • Free to use, share, and adapt for non-commercial purposes with attribution.
  • Not permitted for commercial use. Contact the authors for a commercial license.
  • Provided "as is" without warranties of any kind.

Citation

If you use AIOne-GeoSeg in your research, please cite:

@misc{aione_geoseg_330m,
  title        = {AIOne-GeoSeg-330M: A DINOv3 Vision Transformer with TCAM Head for Korean Land-Cover Segmentation},
  author       = {JDONE Research},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/JDONE-Research/AIOne-GeoSeg-330M}}
}

A paper describing the TCAM head, training procedure, and full ablations will be released soon; citation details will be updated here when available.
