galamsey-v9-e3
Fine-tune of LiquidAI/LFM2.5-VL-450M for detecting illegal small-scale gold mining ("galamsey") in Sentinel-2 satellite imagery over Ghana. Used as the perception layer of GalamseyWatch, a two-layer agentic Earth-observation system (perception VLM + LFM2 tool-calling policy) submitted to the Liquid AI × DPhi Space "AI in Space" hackathon.
The browser/WebGPU sibling of this checkpoint is samwell/galamsey-v9-e3-onnx.
Live demo
A click-to-detect dashboard running this model fully in-browser via WebGPU: galamseywatch.vercel.app. Click anywhere over Ghana and the page pulls a Sentinel-2 tile, runs the model on-device, and renders bounding boxes plus a description. ~1 GB one-time download, then cached; nothing leaves the device.
What it does
Given paired RGB and SWIR false-color composites of a 1.28 km Sentinel-2 tile (10 m/px), the model returns:
- A JSON list of bounding boxes for every visible mining pit, normalized to [0, 1].
- A natural-language description of the scene (e.g. "Multiple active excavation pits with sediment plumes and exposed lateritic soil").
Both outputs come from the same fine-tune, with two prompts. The grounding prompt emits boxes; the description prompt emits prose. See the GalamseyWatch repo for the exact prompts and the post-processing (NMS, min-area filter).
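A minimal sketch of that post-processing, assuming greedy IoU-based NMS and an area cutoff on the normalized `[x1, y1, x2, y2]` boxes. The thresholds and function names here are illustrative; the GalamseyWatch repo defines the real values.

```python
# Hypothetical post-processing sketch: greedy NMS plus a minimum-area
# filter on the model's normalized [x1, y1, x2, y2] boxes.

def box_iou(a, b):
    """IoU of two normalized [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def postprocess(boxes, iou_thresh=0.5, min_area=1e-4):
    """Drop sub-threshold boxes, then keep larger boxes greedily."""
    boxes = [b for b in boxes if (b[2] - b[0]) * (b[3] - b[1]) >= min_area]
    boxes.sort(key=lambda b: (b[2] - b[0]) * (b[3] - b[1]), reverse=True)
    kept = []
    for b in boxes:
        if all(box_iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept
```

Sorting by area rather than confidence is a simplification here: the grounding prompt does not ask the model for per-box scores.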
How it composes with the LFM2 agent
In GalamseyWatch's on-orbit pipeline, this VLM is the perception layer. Its structured output (bounding boxes, derived confidence, scene description) is handed to an LFM2-2.6B tool-calling policy that decides, per tile, what to do under a bandwidth budget:
- `downlink_now`: high-confidence detection worth the bandwidth.
- `flag_for_review`: moderate confidence, log a text-only entry (cheap).
- `discard`: no signal, forest, water, or cloud-obscured.
- `request_neighbor_tile`: feature continues off-frame.
- `request_higher_resolution`: small candidate needs more pixels.
This split lets each model work on inputs sized to its parameter budget: the 450M VLM on pixels, the 2.6B LLM on text plus per-pass scalar context (bandwidth remaining, cloud cover, captured-at timestamp, neighbor tiles). The compression boundary sits exactly where the VLM hands off bounding boxes and prose to the agent; the two models never share an image input.
See orchestrator/agentic_eo/models/agent.py for the agent setup, system prompt, and tool definitions.
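To make the division of labor concrete, here is the five-way decision rewritten as a plain threshold heuristic. This is an illustration only: in GalamseyWatch the choice is made by the LFM2-2.6B tool-calling policy, and these field names and thresholds are assumptions, not values from the repo.

```python
# Illustrative stand-in for the LFM2 policy: maps per-tile scalars to one
# of the five tool names. All thresholds are assumed, not the repo's.

def choose_action(confidence, cloud_cover, touches_edge, max_box_area,
                  bandwidth_remaining):
    if cloud_cover > 0.8:
        return "discard"                    # optical sensor, nothing usable
    if touches_edge and confidence > 0.3:
        return "request_neighbor_tile"      # feature continues off-frame
    if 0.0 < max_box_area < 1e-3 and confidence > 0.3:
        return "request_higher_resolution"  # small candidate, needs pixels
    if confidence > 0.7 and bandwidth_remaining > 0:
        return "downlink_now"               # worth the bandwidth
    if confidence > 0.3:
        return "flag_for_review"            # cheap text-only log entry
    return "discard"
```

The point of using an LLM instead of such a heuristic is that the policy can also weigh the free-text scene description and adapt its thresholds to the remaining bandwidth budget.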
Performance
Evaluated on the SmallMinesDS test split, full pixel-IoU pipeline (bbox-to-mask scoring), RGB + SWIR two-image prompt.
Lift over base model:
| Metric | Base LFM2.5-VL-450M | galamsey-v9-e3 | Δ |
|---|---|---|---|
| Pixel IoU | 0.069 | 0.332 | +0.263 (~4.8×) |
Full evaluation, galamsey-v9-e3:
| Metric | Value |
|---|---|
| Pixel IoU | 0.332 |
| Pixel recall | 0.592 |
| Pixel SDC F1 | 0.499 |
| Patch accuracy | 0.795 |
Recall, F1, and patch accuracy were not separately recorded for the base run; the IoU lift is the headline number.
Honest ceiling
Galamsey pits are irregular polygons, but this model emits axis-aligned bounding boxes. Even with perfect bbox predictions (every box exactly circumscribing its ground-truth mask), the maximum achievable pixel IoU against the SmallMinesDS pixel-level masks is 0.4692, computed directly by converting every GT mask to its tightest bbox and scoring that against the mask.
So 0.332 = 71% of the achievable ceiling for any bbox-emitting method on this benchmark. A pixel-mask architecture (e.g. a U-Net) operates in a different regime and is not directly comparable.
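The per-mask ceiling computation described above can be sketched as follows; `bbox_ceiling_iou` is a name introduced here, not from the repo.

```python
import numpy as np

# Replace a ground-truth mask with its tightest axis-aligned bbox and
# score that bbox-as-mask against the original mask with pixel IoU.
# This is the best any perfect bbox predictor can do on that mask.

def bbox_ceiling_iou(mask):
    """mask: 2-D boolean array for one ground-truth pit."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return 1.0  # empty mask: trivially matched by predicting nothing
    box = np.zeros_like(mask, dtype=bool)
    box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True
    inter = np.logical_and(mask, box).sum()
    union = np.logical_or(mask, box).sum()
    return inter / union
```

For a rectangular mask the ceiling is 1.0; the more amorphous the pit, the further below 1.0 it falls, which is where the dataset-wide 0.4692 comes from.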
Training
| Detail | Value |
|---|---|
| Base model | LiquidAI/LFM2.5-VL-450M |
| Dataset | SmallMinesDS (Ofori-Ampofo et al., 2025): 4,270 labeled Ghana patches, CC-BY-SA-4.0 |
| Inputs | Paired RGB and SWIR false-color composites (two-image prompt) |
| Augmentation | 4× D4 dihedral group (flips + rotations) |
| Method | Full fine-tuning (no LoRA) |
| Epochs | 3 (17,719 steps) |
| Batch size | 4 |
| Learning rate | 2e-5, with separate rates for LM / projector / vision tower |
| Hardware | 1× NVIDIA H100 via Modal |
| Final training loss | 0.175 (from 2.10 at step 1) |
Intended use
- Detection of illegal gold-mining pits in 10 m/px Sentinel-2 imagery over Ghana.
- Grounded inspection workflows where a human reviewer wants both bounding boxes and a natural-language second opinion.
- Edge / on-orbit deployment: the 450M parameter count and ONNX export make this practical for satellite-class compute.
It is not a general-purpose mining detector. Performance outside Ghana, outside Sentinel-2, or outside the visible/SWIR composites it was trained on is not guaranteed.
Known failure modes
Surfaced honestly so downstream users can plan around them:
- Cloud-occluded tiles. Sentinel-2 is optical, not SAR; the model can't see through cloud and may hallucinate mining where it sees only whiteness. Pre-filter on `cloud_cover` if possible.
- Legal quarries. There is no visual signal in a single patch that distinguishes licensed quarrying from galamsey; cross-reference with concession polygons at the post-inference layer.
- Freshly-cleared farmland. Similar SWIR signature to exposed soil. The geometric shape (rectilinear vs. amorphous) is the disambiguating cue, not the spectrum.
- Tiny pits (2–3 pixels). Bbox effectively a point; pixel IoU is noisy at this regime.
- Out-of-distribution geology. Eastern / Volta regions outside the SmallMinesDS training geography.
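For the cloud failure mode, a pre-filter can be as small as the sketch below. The `cloud_cover` field name echoes the note above; treating it as a 0-1 fraction and the 0.6 cutoff are assumptions.

```python
# Hypothetical pre-filter: skip tiles whose metadata reports heavy cloud
# before spending inference on them. Missing metadata is treated as fully
# clouded (conservative), so such tiles are skipped too.

def filter_tiles(tiles, max_cloud=0.6):
    """tiles: iterable of dicts with a 0-1 `cloud_cover` field."""
    return [t for t in tiles if t.get("cloud_cover", 1.0) <= max_cloud]
```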
Inference
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

model = AutoModelForImageTextToText.from_pretrained(
    "samwell/galamsey-v9-e3", device_map="auto", dtype="bfloat16"
)
processor = AutoProcessor.from_pretrained("samwell/galamsey-v9-e3")

rgb = Image.open("tile_rgb.png")    # B4+B3+B2 composite
swir = Image.open("tile_swir.png")  # B12+B11+B8 false-color composite

GROUNDING_PROMPT = (
    "You are viewing two images of the same Sentinel-2 patch: a natural-color RGB "
    "composite and a SWIR false-color composite. Using both views, detect any "
    "illegal small-scale gold mining pits. Include any exposed soil, excavation, "
    "or sediment-laden water even if you are uncertain, err toward detection. "
    'Provide result as a valid JSON: [{"label": str, "bbox": [x1,y1,x2,y2]}, ...]. '
    "Coordinates must be normalized to 0-1. Only return [] if the scene is entirely "
    "pristine forest, clean water, or urban built-up area with no disturbance."
)

conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": rgb},
        {"type": "image", "image": swir},
        {"type": "text", "text": GROUNDING_PROMPT},
    ],
}]

inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True, return_tensors="pt",
    return_dict=True, tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
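Turning the decoded reply into pixel-space boxes can be sketched as below, assuming the assistant's reply is the valid JSON the prompt asks for; `parse_boxes` is a name introduced here, not the repo's API.

```python
import json

# Sketch: parse the JSON detection list the grounding prompt requests and
# scale normalized [x1, y1, x2, y2] boxes to pixel coordinates. Minimal
# error handling; malformed output yields an empty list.

def parse_boxes(reply, width, height):
    """reply: the assistant's JSON string; returns pixel-space boxes."""
    try:
        detections = json.loads(reply)
    except json.JSONDecodeError:
        return []
    out = []
    for d in detections:
        x1, y1, x2, y2 = d["bbox"]
        out.append({
            "label": d["label"],
            "bbox": [x1 * width, y1 * height, x2 * width, y2 * height],
        })
    return out
```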
For the description prompt and the full inference pipeline (NMS, min-bbox-area filter, area estimation), see app/src/lib/inference.ts (browser path) and orchestrator/agentic_eo/models/vlm.py (Python path).
Citation
If you use this model, please cite:
- The base model: Liquid AI's LFM2 Technical Report (arXiv:2511.23404).
- The dataset: Ofori-Ampofo et al., 2025, SmallMinesDS, IEEE GRSL.
- This fine-tune: samwell/galamsey-v9-e3 and the GalamseyWatch repo.
License
LFM Open License v1.0, inherited from the base model.