InstinctSAM — ViT-B vision encoder for compressed SAM3 (concept / box / point)

A distilled ViT-B/16 vision encoder (~99M parameters) that recovers all three of SAM3's promptable-segmentation paths (text/concept, box, point) at ≥80% of the teacher's score on each path, while being ~4.7× smaller than SAM3's ~463M-parameter PE-ViT vision encoder.

Code & full write-up: https://github.com/william-Dic/InstinctSAM (see docs/CONCEPT_DISTILL.md).

Results (held-out COCO val2017; teacher = official facebook/sam3)

| Vision encoder | Params | Stage-A cos | Concept IoU (n=500) | Box mIoU | Point mIoU |
|---|---|---|---|---|---|
| SAM3 teacher | ~463M | 1.00 | 0.798 (100%) | 0.940 (100%) | 0.676 (100%) |
| TinyViT-11M | ~28M | 0.66 | 0.513 (64%) | 0.743 (79%) | 0.521 (77%) |
| TinyViT-21M | ~31M | 0.66 | 0.584 (73%) | 0.743 (79%) | 0.518 (77%) |
| ViT-B/16 (this) | ~99M | 0.75 | 0.751 (94%) | 0.805 (86%) | 0.545 (81%) |
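
The percentages in parentheses read as student score divided by teacher score, and the "~4.7× smaller" figure as the ratio of parameter counts. The short check below recomputes both from the numbers in the table; the dictionary layout and function name are illustrative only, not part of the repo.

```python
# Values copied from the results table (params in millions).
TEACHER = {"params_m": 463, "concept": 0.798, "box": 0.940, "point": 0.676}
VIT_B = {"params_m": 99, "concept": 0.751, "box": 0.805, "point": 0.545}


def percent_of_teacher(student: float, teacher: float) -> int:
    """Student score as a rounded percentage of the teacher score."""
    return round(100 * student / teacher)


# Retention per prompt path for the ViT-B student.
retention = {
    path: percent_of_teacher(VIT_B[path], TEACHER[path])
    for path in ("concept", "box", "point")
}

# How many times smaller the student vision encoder is.
size_ratio = TEACHER["params_m"] / VIT_B["params_m"]

print(retention)
print(f"{size_ratio:.1f}x smaller")
```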

Key finding: SAM3 concept distillation is bound by vision-encoder capacity, not by the text side, the data, or the training objective. TinyViT saturates at a Stage-A cosine similarity of 0.66 to the teacher's features, and on TinyViT-21M both output KD and region-text alignment plateau at ~0.58 concept IoU. A higher-fidelity ViT-B (Stage-A cosine 0.75) raises concept IoU to 0.751 (94% of the teacher).
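
For readers unfamiliar with the Stage-A cosine figure, the following is a minimal sketch of one plausible way to compute it: the mean cosine similarity between student and teacher channel vectors at each spatial location. This is an assumption; the exact reduction used in the repo (per location, per image, or over the flattened map) is not stated here, and the function name is hypothetical.

```python
import math


def mean_token_cosine(student, teacher):
    """Mean per-location cosine similarity between two feature maps.

    Each argument is a list of channel vectors, i.e. a (C, H, W) map
    flattened to H*W vectors of length C. Assumed metric, for illustration.
    """
    if len(student) != len(teacher):
        raise ValueError("feature maps must have the same number of locations")
    total = 0.0
    for s, t in zip(student, teacher):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        # Treat an all-zero vector as contributing zero similarity.
        total += dot / norm if norm > 0 else 0.0
    return total / len(student)


# Two locations, two channels: the first matches, the second is orthogonal.
student_feats = [[1.0, 0.0], [0.0, 1.0]]
teacher_feats = [[1.0, 0.0], [1.0, 0.0]]
print(mean_token_cosine(student_feats, teacher_feats))
```

Under this definition a value of 1.00 means the student reproduces the direction of every teacher feature vector, which is why the teacher row reports 1.00.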

Files (our original weights only — NO Meta-gated SAM3 weights)

  • vit_base_stageA.pt — ViT-B trunk after Stage-A feature distillation to the teacher's (1024, 72, 72) trunk feature (cosine 0.75).
  • concept_vitb_trunk_step6000.pt — ViT-B trunk after end-to-end concept self-distillation (the final model's vision encoder). This is the trunk that produces the results above.

Both files store a dict of the form {'trunk': state_dict, 'backbone': 'vit_base', 'model_name': 'base'}.
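
A minimal sketch of reading one of these files and sanity-checking its layout before use. The helper names are hypothetical, and it assumes the files were written with `torch.save`; the validation itself needs only the three keys listed above.

```python
EXPECTED_KEYS = {"trunk", "backbone", "model_name"}


def check_trunk_checkpoint(ckpt: dict) -> dict:
    """Validate the documented layout and return the trunk state_dict."""
    missing = EXPECTED_KEYS - ckpt.keys()
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")
    if ckpt["backbone"] != "vit_base" or ckpt["model_name"] != "base":
        raise ValueError("checkpoint is not a ViT-B ('vit_base' / 'base') trunk")
    return ckpt["trunk"]


def load_trunk_state_dict(path: str) -> dict:
    """Load a trunk file from disk (assumes it was saved with torch.save)."""
    import torch  # imported lazily so the check above works without PyTorch

    ckpt = torch.load(path, map_location="cpu")
    return check_trunk_checkpoint(ckpt)
```

For example, `load_trunk_state_dict("concept_vitb_trunk_step6000.pt")` would return the state_dict of the final vision encoder, ready to be merged with the SAM3 heads as described in the next section.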

License / how to assemble the full model

These files contain only the ViT-B vision-encoder weights we trained — they do not include SAM3's detector / mask-decoder / scoring / presence heads, which are Meta-gated under the SAM License.

To run the full model, obtain SAM3 from Meta (gated) and assemble per the repo:

# 1. get SAM3 (Meta-gated) -> checkpoints/official/sam3.pt
# 2. build merged init (SAM3 heads + this ViT-B trunk + distilled MobileCLIP-S1 text)
python src/build_merged_stage3.py --backbone vit_base --model_name base \
  --trunk vit_base_stageA.pt --text_type MobileCLIP-S1 --out merged_vitb.pt
# 3. (optional) reproduce concept training end-to-end
bash scripts/run_vitb_full.sh
# 4. evaluate
python src/eval_concept_es3.py --ckpt <merged-or-concept.pt> --backbone vit_base --model_name base \
  --text_type MobileCLIP-S1 --ann annotations/instances_val2017_eval500.json --n_images 500
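
The metrics reported by the evaluation step are IoU-based. As a reminder of what that measures, here is a minimal sketch of mask IoU and its mean over a set of predictions. It is not the repo's evaluation code: the function names, the plain-list mask format, and the convention for two empty masks are all assumptions made for illustration.

```python
def mask_iou(pred, target):
    """IoU of two binary masks given as equally shaped 2-D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average mask IoU over (prediction, target) pairs."""
    scores = [mask_iou(p, t) for p, t in pairs]
    return sum(scores) / len(scores)


pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
print(mask_iou(pred, target))
print(mean_iou([(target, target), (pred, target)]))
```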

Method (brief)

The final model is trained by end-to-end self-distillation of SAM3's concept (DETR) path, starting from a ViT-B trunk that was first feature-distilled to the teacher (Stage A). The training signal combines teacher↔student Hungarian detection matching, soft KD, CLoCKDistill encoder-memory KD, region-text alignment, and EMA. The teacher supplies all targets; no ground-truth labels are used.
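
Two of the ingredients named above, soft KD and EMA, can be sketched generically. The code below uses the textbook temperature-softened KL formulation and a plain exponential moving average over a flat list of parameters. It is not taken from the repo, whose actual loss form (for example, sigmoid rather than softmax logits), temperature, decay, and which quantities are averaged may all differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def ema_update(ema_params, params, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, params)]


# The loss vanishes when the student matches the teacher, and is positive otherwise.
print(soft_kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(soft_kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
print(ema_update([0.0], [1.0], decay=0.9))
```

Because the teacher's outputs are the only targets, a loss of this kind needs no ground-truth labels, which is consistent with the label-free setup described above.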
