SAM 2.1 (Hiera-Tiny) image encoder: LiteRT GPU
On-device LiteRT / TFLite conversion of the image encoder of
SAM 2.1 Hiera-Tiny (Meta, Apache-2.0),
running fully on the mobile GPU via the LiteRT CompiledModel API (ML Drift / LITERT_CL delegate).
The whole graph is GPU-resident: no CPU/XNNPACK fallback ops.
This is the heavy backbone of the Segment Anything 2 image path: it turns an RGB image into the multi-scale feature pyramid that a small prompt encoder and mask decoder then query for each click or box.
| Property | Value |
|---|---|
| Task | Image encoder for promptable segmentation (SAM 2 image path) |
| Backbone | Hiera-Tiny (hierarchical ViT, window + global attention) + FPN neck |
| Input | [1, 3, 1024, 1024] NCHW float32, ImageNet-normalized |
| Outputs | 3 FPN feature maps: [1,256,256,256], [1,256,128,128], [1,256,64,64] |
| Precision / size | FP16, 80 MB |
| Device | Pixel 8a, LiteRT GPU (Accelerator.GPU), ~7 ms / image |
| Residency | Replacing 862 out of 862 node(s) with delegate (LITERT_CL) (full, single partition) |
Preprocessing (must match)
resize to 1024x1024 (bilinear) -> x/255 -> (x - mean) / std
mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225] # ImageNet, RGB, NCHW
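The normalization above can be sketched host-side in NumPy. This is a minimal illustration, not the author's pipeline: the function name `preprocess` is hypothetical, and it assumes the image has already been resized to 1024x1024 with bilinear interpolation by whatever image library you use.

```python
import numpy as np

# ImageNet statistics in RGB order, as listed above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(rgb_u8: np.ndarray) -> np.ndarray:
    """Convert an already-resized HWC uint8 RGB image to the encoder input.

    Assumption: the bilinear resize to 1024x1024 happened before this call.
    Returns a contiguous float32 array of shape [1, 3, 1024, 1024] (NCHW).
    """
    if rgb_u8.shape != (1024, 1024, 3) or rgb_u8.dtype != np.uint8:
        raise ValueError("expected a uint8 RGB image of shape (1024, 1024, 3)")
    x = rgb_u8.astype(np.float32) / 255.0      # x / 255
    x = (x - IMAGENET_MEAN) / IMAGENET_STD     # (x - mean) / std, per channel
    x = x.transpose(2, 0, 1)[None]             # HWC -> NCHW with batch dim
    return np.ascontiguousarray(x)
```

The channel order matters: the statistics are applied in RGB order, so a BGR image (for example from OpenCV) would need to be converted first.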
GPU-clean conversion (what was re-authored)
Converted with litert-torch. SAM 2's Hiera encoder is not GPU-clean out of the box; these exact,
weights-faithful rewrites were applied (model-side only, no converter patch):
- window_partition / window_unpartition: the 6-D view + permute window reshape rejected by the GPU delegate (>4-D) is re-expressed as a sequence of ≤4-D reshape/transpose ops (numerically exact, verified vs the original).
- Sam2MultiScaleAttention: the 5-D fused-QKV reshape is decomposed into separate q/k/v, and attention runs as a 3-D batched SDPA ([B*heads, N, d]). A 4-D SDPA makes the delegate emit a [C,C] -> [nW,ws,C,C] BROADCAST_TO on every windowed block; the 3-D form removes all 9.
- Windowed positional embedding: the bicubic-interpolate + tile of the constant pos_embed is baked to a buffer (add only), which removes a runtime interpolate of a constant.
- Neck: the (constant, shape-only) sine FPN position encodings are dropped from the graph (compute them host-side), which removes the remaining BROADCAST_TO ops.
- Overflow-safe LayerNorm (scale-before-square) as an fp16 safety margin for the deep stages.
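To illustrate the scale-before-square LayerNorm item, here is a NumPy sketch of one possible formulation. It is an assumption about the general idea, not the converted model's actual code: the centered activations are divided by their per-row maximum magnitude before squaring, so the squared values stay at or below 1 instead of exceeding the fp16 maximum (about 65504). The function names are hypothetical and the affine weight and bias are omitted.

```python
import numpy as np


def layernorm_fp32(x, eps=1e-5):
    """Plain LayerNorm in float32, used here only as a reference."""
    x = x.astype(np.float32)
    centered = x - x.mean(axis=-1, keepdims=True)
    var = (centered * centered).mean(axis=-1, keepdims=True)
    return centered / np.sqrt(var + eps)


def layernorm_naive_fp16(x, eps=1e-5):
    """Textbook LayerNorm in float16: squaring overflows once |x - mean| > ~255."""
    x = x.astype(np.float16)
    with np.errstate(over="ignore", invalid="ignore"):
        centered = x - x.mean(axis=-1, keepdims=True)
        var = (centered * centered).mean(axis=-1, keepdims=True)  # may become inf
        return centered / np.sqrt(var + np.float16(eps))


def layernorm_scaled_fp16(x, eps=1e-5):
    """Scale-before-square variant in float16 (illustrative sketch).

    With centered = s * u, LayerNorm equals u / sqrt(var(u) + eps / s**2),
    so the square is taken on u, whose magnitude is at most 1.
    """
    x = x.astype(np.float16)
    with np.errstate(under="ignore"):
        centered = x - x.mean(axis=-1, keepdims=True)
        scale = np.abs(centered).max(axis=-1, keepdims=True)
        scale = np.maximum(scale, np.float16(1.0))  # never scale small inputs up
        u = centered / scale
        var_u = (u * u).mean(axis=-1, keepdims=True)
        # eps / s**2 computed as two divisions so s**2 itself is never formed.
        eps_scaled = np.float16(eps) / scale / scale
        return u / np.sqrt(var_u + eps_scaled)
```

For activations with magnitudes in the thousands, the naive fp16 version produces an infinite variance, while the scaled version stays close to the float32 reference.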
Net: banned ops = NONE, >4-D tensors = 0, full GPU residency.
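As an illustration of the window reshape rewrite listed above, the following NumPy sketch shows one way to express window partitioning with tensors of at most four dimensions and checks it against the usual 6-D formulation. The exact decomposition used in the converted model is not given here, so treat this as an assumed equivalent; the function names are hypothetical, and padding for sizes not divisible by the window size is omitted.

```python
import numpy as np


def window_partition_6d(x, ws):
    """Reference: 6-D view + permute, as in typical windowed-attention code."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // ws, ws, w // ws, ws, c)           # 6-D tensor
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, c)


def window_partition_4d(x, ws):
    """Same result using only <=4-D reshapes and one 4-D transpose."""
    b, h, w, c = x.shape
    n_h, n_w = h // ws, w // ws
    x = x.reshape(b * n_h, ws, n_w, ws * c)   # merge (batch, row-block) and (col, chan)
    x = x.transpose(0, 2, 1, 3)               # bring the column-block axis forward
    return x.reshape(b * n_h * n_w, ws, ws, c)


def window_unpartition_4d(windows, ws, b, h, w):
    """Inverse of window_partition_4d, also limited to 4-D tensors."""
    c = windows.shape[-1]
    n_h, n_w = h // ws, w // ws
    x = windows.reshape(b * n_h, n_w, ws, ws * c)
    x = x.transpose(0, 2, 1, 3)
    return x.reshape(b, h, w, c)
```

Because both versions only move elements around, the comparison can be exact (`np.array_equal`) rather than approximate.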
Fidelity (honest)
Eager re-authoring is numerically exact (cos = 1.000, mae = 0). On-device GPU output vs the
CPU reference, per FPN level:
| Output | cosine |
|---|---|
| FPN-0 256x256 (high-res, drives mask detail) | 0.99998 |
| FPN-1 128x128 | 0.99994 |
| FPN-2 64x64 (coarse image embedding) | 0.99253 |
The deepest 64×64 feature drifts slightly on the GPU. This is not LayerNorm overflow: scale-before-square LayerNorm doesn't change it, and the CPU fp16 model matches PyTorch fp32 at corr 0.999999. Rather, the mobile GPU computes the deep-stage global attention (64×64 = 4096 tokens) in true fp16, where the CPU path upcasts to fp32. The high-resolution features that carry mask boundaries are near-exact, so mask quality is preserved in practice.
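A comparison like the per-level cosine figures above can be reproduced with a few lines of NumPy once the GPU and CPU outputs have been dumped to arrays. This sketch assumes the metric is the cosine similarity of the flattened feature maps, which is not stated explicitly; the helper names are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two tensors, computed on flattened float64 copies."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError("tensors must have the same number of elements")
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero tensor")
    return float(np.dot(a, b) / denom)


def per_level_cosine(gpu_outputs, cpu_outputs):
    """Compare matching FPN levels, e.g. three (gpu, cpu) array pairs."""
    return [cosine_similarity(g, c) for g, c in zip(gpu_outputs, cpu_outputs)]
```

Accumulating in float64 keeps the metric itself from adding rounding noise to an fp16-versus-fp32 comparison.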
Usage (Android / LiteRT CompiledModel)
val model = CompiledModel.create(context.assets, "sam2_tiny_image_encoder_fp16.tflite",
CompiledModel.Options(Accelerator.GPU), null)
// input: [1,3,1024,1024] NCHW, ImageNet-normalized
// outputs: 3 FPN feature maps -> feed to the SAM 2 prompt encoder + mask decoder
Training data & PII
SAM 2 was trained by Meta on SA-1B (licensed photos) and SA-V (licensed videos) with
model-in-the-loop mask annotation. No new training was performed for this conversion; it is a
weights-faithful format change of the public facebook/sam2.1-hiera-tiny checkpoint. Because the
source data is real-world imagery, it may incidentally contain people, faces, vehicles, signage and
other PII; no PII was deliberately collected and this conversion adds none. Apply your own content/PII
filtering as appropriate. See the SAM 2 release and
paper for full dataset details.
License
Apache-2.0, inherited from the upstream SAM 2.1. This is a format conversion; all credit to the original authors (Meta AI).
Variant: decoder-ready (sam2_tiny_image_encoder_v2_fp16.tflite)
A second file in this repo, sam2_tiny_image_encoder_v2_fp16.tflite, additionally folds the SAM 2 mask
decoder's conv_s0 (256→32) / conv_s1 (256→64) projections and the no_memory embedding into the
graph, so it directly emits decoder-ready features:
image_embeddings [1,256,64,64], feat_s1 [1,64,128,128], feat_s0 [1,32,256,256]. Pair it with the
SAM 2.1 Hiera-Tiny mask decoder
for promptable "tap to segment" (see the LiteRT interactive_segmentation sample). Same GPU-clean
re-authoring and fidelity as the base encoder above; FP16, ~80 MB, full LITERT_CL residency (867/867).
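What the v2 file folds in can be sketched in NumPy. This assumes, as in the upstream SAM 2 mask decoder, that conv_s0 and conv_s1 are 1x1 convolutions (so each is a per-pixel matrix multiply over channels) and that the no_memory embedding is a 256-element vector added to the coarsest map. The function names are hypothetical, and real weights would come from the checkpoint.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """1x1 convolution on NCHW data.

    x: [B, C_in, H, W], weight: [C_out, C_in], bias: [C_out].
    """
    return np.einsum("oc,bchw->bohw", weight, x) + bias[None, :, None, None]


def to_decoder_ready(fpn0, fpn1, fpn2, w_s0, b_s0, w_s1, b_s1, no_mem_embed):
    """Map the three base-encoder FPN outputs to decoder-ready features.

    Returns (image_embeddings, feat_s1, feat_s0) in the order listed above.
    """
    feat_s0 = conv1x1(fpn0, w_s0, b_s0)                          # 256 -> 32 channels
    feat_s1 = conv1x1(fpn1, w_s1, b_s1)                          # 256 -> 64 channels
    image_embeddings = fpn2 + no_mem_embed[None, :, None, None]  # channels unchanged
    return image_embeddings, feat_s1, feat_s0
```

With the base encoder's output shapes, this yields [1,256,64,64], [1,64,128,128] and [1,32,256,256], matching the shapes listed for the v2 file.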