---
library_name: mlx
base_model: facebook/sam3.1
tags:
- mlx
- sam3
- sam3.1
- segmentation
- detection
- tracking
- object-multiplex
---

# sam3.1-bf16

This model was converted to MLX format from [`facebook/sam3.1`](https://huggingface.co/facebook/sam3.1) using mlx-vlm version **0.4.3**.

Open-vocabulary **object detection**, **instance segmentation**, and **video tracking** with **Object Multiplex** on Apple Silicon (~873M parameters).

SAM 3.1 extends SAM 3 with:

- **MultiplexMaskDecoder**: processes 16 objects simultaneously (2.4-4x faster tracking)
- **TriViTDetNeck**: 3 parallel FPN heads (detection, interactive, propagation)
- **DecoupledMemoryAttention**: image cross-attention with RoPE
- Improved detection accuracy (0.90 vs 0.87 on the cat benchmark below)

## Quick Start

```bash
pip install "mlx-vlm>=0.4.3"
```

```python
from PIL import Image

from mlx_vlm.utils import load_model, get_model_path
from mlx_vlm.models.sam3.generate import Sam3Predictor
from mlx_vlm.models.sam3_1.processing_sam3_1 import Sam31Processor

model_path = get_model_path("mlx-community/sam3.1-bf16")
model = load_model(model_path)
processor = Sam31Processor.from_pretrained(str(model_path))
predictor = Sam3Predictor(model, processor, score_threshold=0.3)
```

## Object Detection

```python
image = Image.open("photo.jpg")
result = predictor.predict(image, text_prompt="a dog")

for i in range(len(result.scores)):
    x1, y1, x2, y2 = result.boxes[i]
    print(f"[{result.scores[i]:.2f}] box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```

## Instance Segmentation

```python
result = predictor.predict(image, text_prompt="a person")
# result.boxes  -> (N, 4) xyxy bounding boxes
# result.masks  -> (N, H, W) binary segmentation masks
# result.scores -> (N,) confidence scores

import numpy as np

overlay = np.array(image).copy()
W, H = image.size
for i in range(len(result.scores)):
    mask = result.masks[i]
    if mask.shape != (H, W):
        mask = np.array(Image.fromarray(mask.astype(np.float32)).resize((W, H)))
    binary = mask > 0
    overlay[binary] = (overlay[binary] * 0.5 + np.array([255, 0, 0]) * 0.5).astype(np.uint8)
```

## Multi-Prompt Detection

```python
from mlx_vlm.models.sam3_1.generate import predict_multi

result = predict_multi(predictor, image, ["a cat", "a remote control"])
for i in range(len(result.scores)):
    x1, y1, x2, y2 = result.boxes[i]
    print(f"[{result.scores[i]:.2f}] {result.labels[i]} box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```

## Box-Guided Detection

```python
import numpy as np

boxes = np.array([[100, 50, 400, 350]])  # xyxy pixel coords
result = predictor.predict(image, text_prompt="a cat", boxes=boxes)
```

## CLI

```bash
# Object detection
python -m mlx_vlm.models.sam3_1.generate --task detect --image photo.jpg --prompt "a cat" --model mlx-community/sam3.1-bf16

# Instance segmentation
python -m mlx_vlm.models.sam3_1.generate --image photo.jpg --prompt "a cat" --model mlx-community/sam3.1-bf16

# Video tracking
python -m mlx_vlm.models.sam3_1.generate --task track --video input.mp4 --prompt "a car" --model mlx-community/sam3.1-bf16

# Real-time webcam (optimized: backbone caching + tracker propagation)
python -m mlx_vlm.models.sam3_1.generate --task realtime --prompt "a person" --model mlx-community/sam3.1-bf16 --resolution 224
```

| Flag | Default | Description |
|------|---------|-------------|
| `--task` | `segment` | `detect`, `segment`, `track`, `realtime` |
| `--prompt` | *(required)* | Text prompt(s); supports multiple |
| `--resolution` | `1008` | Input resolution (224 for faster realtime) |
| `--detect-every` | `15` | Re-run full detection every N frames |
| `--backbone-every` | `30` | Re-run ViT backbone every N frames |

## Benchmarks (M3 Max, bf16)

### Detection Accuracy

| Prompt | SAM 3 | SAM 3.1 |
|--------|-------|---------|
| "a cat" (2 cats) | 0.87, 0.82 | **0.90, 0.86** |
| "a remote control" | 0.95, 0.94 | 0.94, 0.94 |

### Tracker Multiplex Speed

| Objects | SAM 3 | SAM 3.1 | Speedup |
|---------|-------|---------|---------|
| 3 | 547ms/frame | 227ms/frame | **2.4x** |
| 4 | 608ms/frame | 203ms/frame | **3.0x** |
| 5 | 766ms/frame | 190ms/frame | **4.0x** |

### Optimized Realtime (224px)

| Metric | Value |
|--------|-------|
| Cached frame | 38ms (26 FPS) |
| Sustained average | ~40ms (25 FPS) |
| Baseline (no optimization) | ~212ms (5 FPS) |
| **Total speedup** | **4.6x** |

## Original Model

[facebook/sam3.1](https://huggingface.co/facebook/sam3.1) · [Code](https://github.com/facebookresearch/sam3)

## License

The original SAM 3.1 model weights are released by Meta under the [SAM License](https://huggingface.co/facebook/sam3.1/blob/main/LICENSE), a custom permissive license for commercial and research use.
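
## Appendix: Mask Post-Processing

The instance-segmentation section notes that `result.masks` is an `(N, H, W)` stack of binary masks while `result.boxes` holds xyxy boxes. When only boxes are needed from a mask (for cropping, IoU checks, etc.), the reduction can be done with plain NumPy. A minimal sketch, independent of mlx-vlm; the helper name `mask_to_xyxy` is ours, not part of any library:

```python
import numpy as np

def mask_to_xyxy(mask):
    """Reduce a binary (H, W) mask to an (x1, y1, x2, y2) box, or None if empty."""
    ys, xs = np.nonzero(mask)          # row/column indices of foreground pixels
    if xs.size == 0:
        return None                    # empty mask: no box
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())

# Toy mask: a filled rectangle spanning x=2..5, y=1..3
mask = np.zeros((6, 8), dtype=bool)
mask[1:4, 2:6] = True
print(mask_to_xyxy(mask))  # (2.0, 1.0, 5.0, 3.0)
```

In a real pipeline, this would be applied per object, e.g. `[mask_to_xyxy(m) for m in result.masks]`, after any resizing to the original image size as shown in the segmentation snippet.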