SAM 2.1 MLX

MLX-native ports of Meta/Facebook SAM 2.1 models for Apple Silicon.

This model is converted from Meta's SAM 2.1 checkpoints and the official facebookresearch/sam2 implementation. It is intended for local image segmentation and video object tracking with MLX, without requiring PyTorch at runtime.

Project repo: https://github.com/avbiswas/sam2-mlx
Model collection: https://huggingface.co/collections/avbiswas/sam2-mlx-6a0a0dcfbbbcb089d13d23cd
Original SAM2 repo: https://github.com/facebookresearch/sam2
Original models: https://huggingface.co/facebook

Install

pip install mlx-sam

or with uv:

uv pip install mlx-sam

Usage

import numpy as np
from mlx_sam import SAM2VideoPredictor

predictor = SAM2VideoPredictor.from_pretrained(
    "avbiswas/sam2.1-hiera-small-mlx"  # replace with this model repo id
)

state = predictor.init_state("path/to/video_or_frames")

predictor.add_new_points_or_box(
    state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[625.0, 429.0]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
    # masks: NumPy float32 array shaped [objects, 1, height, width]
    pass

Benchmarks

Benchmarks were run on an Apple M2 Max with 32 GB unified memory. Video tests use the SAM2 dog demo clip: 1280x720, 289 frames, 29.97 FPS, 9.64 s.

FP32 MLX vs Torch/MPS

Prompted first-frame fixture at 1024x1024 internal resolution.

Model	Size	Torch/MPS	MLX	Speedup	Parity vs Torch
`sam2.1-hiera-tiny-mlx`	`172.6 MiB`	`96.6 ms`	`71.3 ms`	`1.36x`	mask mean abs `1.17e-05`
`sam2.1-hiera-small-mlx`	`199.7 MiB`	`112.5 ms`	`84.5 ms`	`1.33x`	mask mean abs `8.14e-06`
`sam2.1-hiera-base-plus-mlx`	`336.4 MiB`	`203.5 ms`	`144.7 ms`	`1.41x`	mask mean abs `5.04e-06`
`sam2.1-hiera-large-mlx`	`892.2 MiB`	`433.0 ms`	`341.1 ms`	`1.27x`	mask mean abs `7.84e-06`

Video Tracking

For sam2.1-hiera-small-mlx on the 9.64 second dog clip:

Workload	Torch/MPS	MLX	Result
Full video, post-prompt propagation	`331 ms/frame`	`189 ms/frame`	MLX `1.75x` faster
Full video, total run	`100.5 s`	`94.8 s`	MLX faster end to end
Raw propagation, no save/overlay/final resize	`407 ms/frame`	`287 ms/frame`	MLX `1.42x` faster

Experimental preview mode at 768x768 internal resolution:

Setting	Propagation	Quality vs 1024
`1024x1024` baseline	`268.5 ms/frame`	reference
`768x768`, fp16 memory attention	`52.9 ms/frame`	mean IoU `0.949`, presence `80 / 80` on 80-frame dog clip

Quantized Variants

Quantized models reduce download size and memory footprint. On current MLX kernels, quantization should not be assumed to speed up video tracking; it primarily helps memory and distribution size.

Variant	Typical Size Reduction	Notes
`*-mlx-16bit`	about `2x` smaller	fp16 weights, closest quantized parity
`*-mlx-8bit`	about `2.5x-3x` smaller	int8 linear quantization
`*-mlx-4bit`	about `3.5x` smaller	mixed recipe: int8 trunk/mask decoder, int4 memory/object-pointer layers

Example small model parity vs fp32 MLX:

Model	Size	Parity vs fp32 MLX
`sam2.1-hiera-small-mlx-16bit`	`99.9 MiB`	mask mean abs `8.24e-03`
`sam2.1-hiera-small-mlx-8bit`	`76.7 MiB`	mask mean abs `2.99e-02`
`sam2.1-hiera-small-mlx-4bit`	`56.4 MiB`	mask mean abs `2.87e-02`