AIOne-GeoSeg-330M

A 330M-parameter Vision Transformer with our TCAM decoder for Korean land-cover semantic segmentation on aerial and satellite imagery.

📄 Paper (coming soon)

Model Description

AIOne-GeoSeg-330M is a semantic segmentation model that classifies every pixel of a Korean aerial / satellite ortho-image into one of 11 land-cover categories (background, building, parking lot, road, street tree, paddy, greenhouse, dry field, forest, bare land, farmland).

The model pairs a DINOv3 ViT-L/16 backbone (sat-493M pretrain) with TCAM (Terrain-Context Adaptive Mutual-attention) Head — a segmentation decoder we designed in-house for Korean remote-sensing imagery. TCAM infers an implicit terrain context (mountain / plain / urban) from RGB alone, picks the right backbone scale per pixel, and runs a bidirectional cross-attention between learnable class prototypes and spatial features so that visually similar classes such as paddy / dry field / farmland or street tree / forest can be separated reliably.

Custom TCAM head. 6 in-house submodules: FeatureBridge, TCE, SAA, CFMA ×3, BoundaryGuidedUpsampler, Classifier. +8.4 % relative mIoU over DPT on the same backbone and data.
DINOv3-ViT-L/16 backbone. 24 layers, 1024 hidden, 4 register tokens, satellite-pretrained (sat-493M), unfrozen during training.
Single-shot, full-resolution mask. 512×512 RGB input → 512×512 per-class logits, no sliding window required.
HF Transformers compatible. Loads via AutoModel with trust_remote_code=True; ships an AIOne_GeoSegImageProcessor.

Key Capabilities

Pixel-wise classification of Korean ortho-imagery into 11 land-cover classes.
Robust separation of similar classes (paddy / dry field / farmland, street tree / forest) via class-prototype mutual attention.
Sharp object and parcel boundaries via boundary-guided upsampling — useful for buildings, roads, and field edges.
Built-in Korean label set and per-class RGB palette for direct visualization.
One-call loading through transformers.AutoModel / AutoImageProcessor.

Classes

ID	Korean	English	RGB
0	배경	Background	(0, 0, 0)
1	건물	Building	(184, 131, 237)
2	주차장	Parking lot	(16, 64, 178)
3	도로	Road	(42, 65, 247)
4	가로수	Street tree	(200, 229, 155)
5	논	Paddy	(191, 255, 255)
6	비닐하우스	Greenhouse	(220, 240, 255)
7	밭	Dry field	(102, 249, 247)
8	산림	Forest	(45, 75, 42)
9	나지	Bare land	(255, 242, 159)
10	농경지	Farmland	(210, 180, 140)
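When only a saved color mask is available (for example the `mask.png` written in the Colorized mask step), the table above can be inverted to recover class IDs. The following is a minimal NumPy sketch: `PALETTE`, `colorize`, and `decolorize` are illustrative names, not part of the released code, and the RGB values are copied from the table.

```python
import numpy as np

# RGB values copied from the class table; row index = class ID.
PALETTE = np.array([
    (0, 0, 0),        # 0 Background
    (184, 131, 237),  # 1 Building
    (16, 64, 178),    # 2 Parking lot
    (42, 65, 247),    # 3 Road
    (200, 229, 155),  # 4 Street tree
    (191, 255, 255),  # 5 Paddy
    (220, 240, 255),  # 6 Greenhouse
    (102, 249, 247),  # 7 Dry field
    (45, 75, 42),     # 8 Forest
    (255, 242, 159),  # 9 Bare land
    (210, 180, 140),  # 10 Farmland
], dtype=np.uint8)


def colorize(mask):
    """Map an (H, W) array of class IDs to an (H, W, 3) RGB image."""
    return PALETTE[mask]


def decolorize(color_mask):
    """Map an (H, W, 3) RGB image back to class IDs; unknown colors become -1."""
    ids = np.full(color_mask.shape[:2], -1, dtype=np.int64)
    for class_id, rgb in enumerate(PALETTE):
        ids[np.all(color_mask == rgb, axis=-1)] = class_id
    return ids
```

The round trip is only exact for losslessly stored masks such as PNG; JPEG compression alters the colors, and affected pixels would come back as -1.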

TCAM Head

TCAM = Terrain-Context Adaptive Mutual-attention. It takes multi-layer hidden states from DINOv3-ViT-L/16 at stages {5, 11, 17, 23} and decodes them through 6 submodules:

#	Submodule	Role	Core idea
1	FeatureBridge ×4	Reshape ViT hidden states `(B, N, C)` to 2D maps `(B, D, H, W)`	CLS readout is projected and concatenated to patch tokens, then 1×1 conv compresses channels
2	TCE (Terrain Context Estimator)	Infer an implicit terrain context (mountain / plain / urban) from the deepest feature; output FiLM parameters (γ, β)	Global average pooling (GAP) → MLP. No digital elevation model (DEM) is required. γ is initialized with a +1.0 residual so that the modulation starts as an identity
3	SAA (Scale-Adaptive Aggregation)	Per-pixel softmax fusion of the 4 scales	FiLM-modulated by (γ, β) so that forest pixels lean on deep layers and building / road pixels lean on shallow layers automatically
4	CFMA ×3 (Class–Feature Mutual Attention)	Bidirectional cross-attention between learnable class prototypes and the fused spatial feature	(1) Class → Feature: each class embedding queries the spatial feature; (2) Feature → Class: spatial features re-query the refreshed prototypes. Major gain on confusable classes (paddy / dry field / farmland, street tree / forest)
5	BoundaryGuidedUpsampler	Extract a boundary attention map (sigmoid) from a shallow feature and run 4-stage ConvTranspose for 16× upsample	Recovers sharp parcel and mountain-ridge boundaries
6	Classifier	3×3 conv → 1×1 conv	Pixel-wise logits

Pipeline: FeatureBridge ×4 → TCE → SAA → CFMA ×3 → BoundaryGuidedUpsampler → Classifier
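The TCE and SAA steps can be illustrated with a small NumPy sketch. This is not the released implementation: the two-layer MLP, the ReLU, the scoring vector `w_score`, and the choice to fuse the modulated features are all assumptions made for illustration. Only the overall shape of the idea (GAP → MLP → FiLM parameters with a +1.0 residual on γ, then a per-pixel softmax over the 4 scales) comes from the table above.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D, H, W, HID = 4, 384, 32, 32, 192  # scales, channels, token grid, TCE hidden size


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def tce(deepest, w1, w2):
    """GAP -> 2-layer MLP -> FiLM parameters (gamma, beta), each of shape (D,)."""
    pooled = deepest.mean(axis=(1, 2))        # global average pooling: (D,)
    hidden = np.maximum(pooled @ w1, 0.0)     # ReLU (assumed activation): (HID,)
    delta_gamma, beta = np.split(hidden @ w2, 2)
    return 1.0 + delta_gamma, beta            # +1.0 residual: identity when the MLP outputs zero


def saa(feats, gamma, beta, w_score):
    """FiLM-modulate every scale, then fuse the scales with per-pixel softmax weights."""
    mod = gamma[None, :, None, None] * feats + beta[None, :, None, None]  # (S, D, H, W)
    scores = np.einsum("sdhw,d->shw", mod, w_score)                       # (S, H, W)
    weights = softmax(scores, axis=0)                                     # sums to 1 over scales
    fused = (weights[:, None] * mod).sum(axis=0)                          # (D, H, W)
    return fused, weights


feats = rng.standard_normal((S, D, H, W))      # stand-in for the 4 FeatureBridge outputs
w1 = rng.standard_normal((D, HID)) * 0.01
w2 = np.zeros((HID, 2 * D))                    # zero output layer -> gamma = 1, beta = 0
gamma, beta = tce(feats[-1], w1, w2)
fused, weights = saa(feats, gamma, beta, rng.standard_normal(D))
```

With a zero-initialized output layer the FiLM modulation leaves the features unchanged, which is one way to read the "identity start" described for TCE.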

Parameter Breakdown

Module	Params
FeatureBridge × 4	9.98 M
TCE	0.26 M
SAA	0.59 M
CFMA × 3	7.11 M
BoundaryGuidedUpsampler	8.10 M
Classifier	1.33 M
TCAM Head total	27.37 M
Backbone (DINOv3-ViT-L/16)	303.13 M
Total	330.50 M

Compared with the DPT head (24.13 M), the TCAM head adds 3.24 M parameters: +13.4 % relative to the head and +0.99 % relative to the full model.
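The totals and percentages above follow directly from the table. A short check in plain Python, using only figures stated in this card (the DPT head size is taken from the Results table):

```python
# Parameter counts in millions, copied from the breakdown table.
head_params_m = {
    "FeatureBridge x4": 9.98,
    "TCE": 0.26,
    "SAA": 0.59,
    "CFMA x3": 7.11,
    "BoundaryGuidedUpsampler": 8.10,
    "Classifier": 1.33,
}
backbone_m = 303.13
dpt_head_m = 24.13  # DPT head size from the Results table

tcam_head_m = round(sum(head_params_m.values()), 2)
total_m = round(backbone_m + tcam_head_m, 2)
extra_m = round(tcam_head_m - dpt_head_m, 2)

# Growth relative to the DPT head alone and to the full DPT-based model.
head_growth_pct = round(100 * extra_m / dpt_head_m, 1)
total_growth_pct = round(100 * extra_m / (backbone_m + dpt_head_m), 2)

print(tcam_head_m, total_m, extra_m, head_growth_pct, total_growth_pct)
```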

TCAM Hyperparameters

Field	Value
`tcam_hidden_size` (D)	384
`tcam_num_heads` (CFMA)	12
`tcam_cfma_layers`	3
`tcam_tce_hidden_size`	192
`tcam_boundary_channels`	64
`readout_type`	`project`
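To show how these hyperparameters fit together, here is a hedged PyTorch sketch of a bidirectional class/feature attention stack with D = 384, 12 heads (head dimension 32), and 3 layers. `CFMABlock` and `CFMAStack` are hypothetical names. The residual connections, LayerNorm placement, and prototype initialization are assumptions, so this sketch is not expected to match the released module or its parameter count.

```python
import torch
from torch import nn


class CFMABlock(nn.Module):
    """One round of Class -> Feature, then Feature -> Class cross-attention (illustrative)."""

    def __init__(self, dim=384, heads=12):
        super().__init__()
        self.class_queries_feat = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feat_queries_class = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_class = nn.LayerNorm(dim)
        self.norm_feat = nn.LayerNorm(dim)

    def forward(self, protos, feat):
        # protos: (B, K, D) class prototypes; feat: (B, D, H, W) fused spatial feature.
        b, d, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)                      # (B, H*W, D)
        # (1) Class -> Feature: each prototype queries the spatial tokens.
        update, _ = self.class_queries_feat(protos, tokens, tokens)
        protos = self.norm_class(protos + update)
        # (2) Feature -> Class: spatial tokens query the refreshed prototypes.
        update, _ = self.feat_queries_class(tokens, protos, protos)
        tokens = self.norm_feat(tokens + update)
        return protos, tokens.transpose(1, 2).reshape(b, d, h, w)


class CFMAStack(nn.Module):
    def __init__(self, num_classes=11, dim=384, heads=12, layers=3):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.blocks = nn.ModuleList([CFMABlock(dim, heads) for _ in range(layers)])

    def forward(self, feat):
        protos = self.prototypes.unsqueeze(0).expand(feat.shape[0], -1, -1)
        for block in self.blocks:
            protos, feat = block(protos, feat)
        return protos, feat


torch.manual_seed(0)
stack = CFMAStack().eval()
with torch.no_grad():
    protos_out, feat_out = stack(torch.randn(2, 384, 8, 8))  # small grid for a quick check
```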

Training

Item	Value
Dataset	Korean aerial photography + AI Hub land-cover
Classes	11 (background + 10 land-cover)
Input	512 × 512, 3-channel RGB
Backbone	DINOv3-ViT-L/16 (sat-493M pretrain), unfrozen
Selected layers	{5, 11, 17, 23}
Patch size	16 (32 × 32 token grid)
Loss	Focal (γ = 2.5) + Dice
Optimizer	AdamW, lr 1.3e-4, wd 0.01
Scheduler	CosineAnnealing, η_min = 1e-6
Batch size	32
Mixed precision	bf16
Sampler	Weighted (class-balanced)
Epochs	40
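The card lists the loss as Focal (γ = 2.5) + Dice but does not give the weighting or reduction. The sketch below is one common formulation, assuming an unweighted sum, mean reduction for the focal term, and a soft Dice term averaged over classes; `focal_dice_loss` and `eps` are illustrative names and values.

```python
import torch
import torch.nn.functional as F


def focal_dice_loss(logits, target, gamma=2.5, eps=1e-6):
    """logits: (B, C, H, W) float; target: (B, H, W) int64 class IDs."""
    log_p = F.log_softmax(logits, dim=1)
    p = log_p.exp()

    # Focal term: down-weight pixels that are already classified confidently.
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # (B, H, W)
    pt = log_pt.exp()
    focal = (-(1.0 - pt).clamp(min=0.0) ** gamma * log_pt).mean()

    # Soft Dice term, computed per class over the whole batch.
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (p * onehot).sum(dim=(0, 2, 3))
    denom = p.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

    return focal + dice  # 1:1 weighting is an assumption


torch.manual_seed(0)
target = torch.randint(0, 11, (2, 16, 16))
random_loss = focal_dice_loss(torch.randn(2, 11, 16, 16), target)
perfect_logits = 50.0 * F.one_hot(target, 11).permute(0, 3, 1, 2).float()
perfect_loss = focal_dice_loss(perfect_logits, target)
```

The focal exponent concentrates the gradient on hard pixels, while the Dice term is insensitive to class frequency, which is consistent with the card's remark that the combination helps minor classes.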

Results

TCAM vs DPT (same backbone, same data, 40 epochs)

Head	Params	Best Val mIoU	Best Epoch
DPT	24.13 M	0.6505	39
TCAM (ours)	27.37 M	0.7054	40
Δ	+3.24 M	+0.0549 (+8.4 % rel.)	—
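For anyone reproducing the comparison with another head, mIoU can be computed from a confusion matrix as sketched below. The card does not say whether background is included or how classes absent from the validation set are handled; this sketch averages over all classes that appear in either the prediction or the ground truth, which is an assumption. The last line reproduces the relative gain reported in the table.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes=11):
    """Rows = ground-truth class, columns = predicted class."""
    idx = target.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def mean_iou(conf):
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0                      # skip classes that never occur
    return (inter[present] / union[present]).mean()


# Tiny worked example: class 0 IoU = 1/2, class 1 IoU = 2/3.
target = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
example_miou = mean_iou(confusion_matrix(pred, target))

# Relative gain of TCAM over DPT, from the table values.
relative_gain_pct = round(100 * (0.7054 - 0.6505) / 0.6505, 1)
```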

Training Dynamics

Epochs 1–10 (warm-up). mIoU oscillates between 0.53 and 0.66; train loss falls quickly from 0.486 to 0.361.
Epochs 11–35 (stabilization). mIoU climbs monotonically from 0.64 to 0.697; the Focal + Dice combination drives minor-class learning.
Epochs 36–40 (fine convergence). With the LR in the 3e-5 → 1.6e-5 range, mIoU levels off, moving from 0.694 to 0.705.

Quick Start

Installation

pip install "transformers>=4.45" torch pillow

Inference

import torch
import numpy as np
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

MODEL_ID = "JDONE-Research/AIOne-GeoSeg-330M"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).to(device).eval()
processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("aerial.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(pixel_values=inputs["pixel_values"]).logits  # (1, 11, 512, 512)

mask = logits.argmax(dim=1)[0].cpu().numpy()  # (512, 512) int class IDs

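# id2label keys may be int or str depending on how the config was loaded; normalize to int.
id2label = {int(k): v for k, v in model.config.id2label.items()}
print("Detected classes:", sorted(id2label[int(i)] for i in np.unique(mask)))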

Colorized mask

palette = np.array(model.config.label_colors, dtype=np.uint8)  # (11, 3)
color_mask = palette[mask]                                     # (512, 512, 3)
Image.fromarray(color_mask).save("mask.png")

# Blended overlay (50% image, 50% color mask)
resized = image.resize((512, 512))
overlay = (np.array(resized) * 0.5 + color_mask * 0.5).astype(np.uint8)
Image.fromarray(overlay).save("overlay.png")

Batch inference

paths = ["tile_0.jpg", "tile_1.jpg"]  # hypothetical file names; replace with your own
images = [Image.open(p).convert("RGB") for p in paths]
inputs = processor(images=images, return_tensors="pt").to(device)
with torch.no_grad():
    masks = model(pixel_values=inputs["pixel_values"]).logits.argmax(dim=1).cpu().numpy()
# masks: (B, 512, 512)
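Scenes larger than 512×512 need to be split before they reach the model. The card does not describe how larger scenes were handled, so the following is only one possible approach: non-overlapping tiles with edge padding. `predict_tiled` and `predict_fn` are hypothetical names; `predict_fn` stands for any function that maps a 512×512 RGB array to a 512×512 array of class IDs, for example a wrapper around the Inference snippet above.

```python
import numpy as np


def predict_tiled(image, predict_fn, tile=512):
    """Run predict_fn on non-overlapping tiles of an (H, W, 3) array and stitch the masks."""
    h, w = image.shape[:2]
    pad_h, pad_w = (-h) % tile, (-w) % tile          # pad up to a multiple of the tile size
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="edge")
    out = np.zeros(padded.shape[:2], dtype=np.int64)
    for y in range(0, padded.shape[0], tile):
        for x in range(0, padded.shape[1], tile):
            out[y:y + tile, x:x + tile] = predict_fn(padded[y:y + tile, x:x + tile])
    return out[:h, :w]                               # crop the padding away


# Stand-in predictor so the sketch runs without the model: threshold the red channel.
def fake_predict(tile_rgb):
    return (tile_rgb[..., 0] > 127).astype(np.int64)


scene = np.random.default_rng(0).integers(0, 256, size=(700, 900, 3), dtype=np.uint8)
scene_mask = predict_tiled(scene, fake_predict)
```

Non-overlapping tiles can leave visible seams at tile borders; overlapping tiles with averaged logits are a common remedy but are not shown here.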

Model Specs

Field	Value
Architecture	`AIOne_GeoSeg`
Backbone	DINOv3 ViT-L/16, sat-493M pretrain (24 layers, 1024 hidden, 16 heads, 4 register tokens)
Segmentation head	TCAM (custom) — FeatureBridge ×4, TCE, SAA, CFMA ×3, BoundaryGuidedUpsampler, Classifier
Total parameters	330.50 M (Backbone 303.13 M + TCAM Head 27.37 M)
Weights on disk	1.3 GB (FP32)
Input	RGB image, 512×512
Output	Per-pixel logits, shape `(B, 11, 512, 512)`
Backbone feature taps	stages `[5, 11, 17, 23]`
Number of classes	11 (Korean land-cover)
Validation mIoU	0.7054
Training precision	bf16
Released checkpoint precision	float32
Domain	Korean aerial / satellite ortho-imagery

Intended Use

Korean land-cover mapping from aerial or satellite ortho-imagery.
Change-detection pipelines (run the model on imagery from two acquisition dates and diff the masks).
Urban-planning, agriculture, and forestry analytics that need per-pixel Korean class labels.
Research baseline for comparing other segmentation heads against TCAM.
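The change-detection use case can be sketched in a few lines of NumPy, assuming the two masks are co-registered and have the same shape. `transition_matrix` and `changed_pixels` are illustrative helpers, not part of the released code.

```python
import numpy as np


def transition_matrix(mask_before, mask_after, num_classes=11):
    """Entry [i, j] counts pixels that moved from class i to class j."""
    idx = mask_before.ravel() * num_classes + mask_after.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def changed_pixels(mask_before, mask_after):
    """Boolean (H, W) map of pixels whose class differs between the two dates."""
    return mask_before != mask_after


# Toy example: one forest (8) pixel and one paddy (5) pixel become building (1).
before = np.array([[8, 8], [5, 5]])
after = np.array([[8, 1], [5, 1]])
transitions = transition_matrix(before, after)
change_map = changed_pixels(before, after)
```

Off-diagonal entries of the matrix summarize which land-cover conversions occurred; in practice, small isolated changes often reflect misregistration or prediction noise rather than real change.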

Out-of-Scope Use

Commercial use. This release is governed by CC-BY-NC-4.0; do not use it in revenue-generating products or services.
Imagery from regions or sensors that differ substantially from the Korean ortho-imagery training distribution (expect degraded accuracy).
Sole-source decision-making in legal, regulatory, or safety-critical contexts.
Any analysis that infringes on personal privacy, property rights, or applicable geospatial-data regulations.

License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0) license.

Free to use, share, and adapt for non-commercial purposes with attribution.
Not permitted for commercial use. Contact the authors for a commercial license.
Provided "as is" without warranties of any kind.

Citation

If you use AIOne-GeoSeg in your research, please cite:

@misc{aione_geoseg_330m,
  title        = {AIOne-GeoSeg-330M: A DINOv3 Vision Transformer with TCAM Head for Korean Land-Cover Segmentation},
  author       = {JDONE Research},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/JDONE-Research/AIOne-GeoSeg-330M}}
}

A paper describing the TCAM head, training procedure, and full ablations will be released soon — citation details will be updated here when available.
