prince-canuma committed on
Commit 1f6152b · verified · 1 parent(s): 0794c1f

Upload SAM 3.1 bf16 weights converted from facebook/sam3.1

Files changed (3):
  1. README.md +150 -0
  2. config.json +894 -0
  3. model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,150 @@
---
library_name: mlx
base_model: facebook/sam3.1
tags:
- mlx
- sam3
- sam3.1
- segmentation
- detection
- tracking
- object-multiplex
---

# sam3.1-bf16

[facebook/sam3.1](https://huggingface.co/facebook/sam3.1) converted to MLX (bfloat16, 3.3 GB).

Open-vocabulary **object detection**, **instance segmentation**, and **video tracking** with **Object Multiplex** on Apple Silicon (~873M parameters).

SAM 3.1 extends SAM 3 with:
- **MultiplexMaskDecoder**: processes 16 objects simultaneously (2.4-4x faster tracking)
- **TriViTDetNeck**: 3 parallel FPN heads (detection, interactive, propagation)
- **DecoupledMemoryAttention**: image cross-attention with RoPE
- Improved detection accuracy (0.90 vs 0.87 on the two-cat example in the benchmarks below)

## Quick Start

```bash
pip install mlx-vlm
```

```python
from PIL import Image
from mlx_vlm.utils import load_model, get_model_path
from mlx_vlm.models.sam3.generate import Sam3Predictor
from mlx_vlm.models.sam3_1.processing_sam3_1 import Sam31Processor

model_path = get_model_path("mlx-community/sam3.1-bf16")
model = load_model(model_path)
processor = Sam31Processor.from_pretrained(str(model_path))
predictor = Sam3Predictor(model, processor, score_threshold=0.3)
```

## Object Detection

```python
image = Image.open("photo.jpg")
result = predictor.predict(image, text_prompt="a dog")

for i in range(len(result.scores)):
    x1, y1, x2, y2 = result.boxes[i]
    print(f"[{result.scores[i]:.2f}] box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```
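
To visualize the detections, the returned xyxy boxes and scores can be drawn with Pillow. A minimal sketch — the `draw_boxes` helper and the dummy data are illustrative, not part of mlx-vlm; with real predictions, pass `result.boxes` and `result.scores`:

```python
import numpy as np
from PIL import Image, ImageDraw

def draw_boxes(image, boxes, scores, color=(255, 0, 0)):
    # Draw each xyxy box and its score onto a copy of the image.
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        draw.rectangle([x1, y1, x2, y2], outline=color, width=3)
        draw.text((x1, max(0, y1 - 12)), f"{score:.2f}", fill=color)
    return out

# Dummy box/score stand in for result.boxes / result.scores here.
canvas = Image.new("RGB", (640, 480), "white")
annotated = draw_boxes(canvas, np.array([[100.0, 50.0, 400.0, 350.0]]), np.array([0.92]))
annotated.save("annotated.jpg")
```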

## Instance Segmentation

```python
result = predictor.predict(image, text_prompt="a person")

# result.boxes  -> (N, 4) xyxy bounding boxes
# result.masks  -> (N, H, W) binary segmentation masks
# result.scores -> (N,) confidence scores

import numpy as np
overlay = np.array(image).copy()
W, H = image.size
for i in range(len(result.scores)):
    mask = result.masks[i]
    if mask.shape != (H, W):
        mask = np.array(Image.fromarray(mask.astype(np.float32)).resize((W, H)))
    binary = mask > 0
    overlay[binary] = (overlay[binary] * 0.5 + np.array([255, 0, 0]) * 0.5).astype(np.uint8)
```
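
The 50/50 red blend above can be sanity-checked without running the model. A tiny synthetic example — the canvas and mask here are fabricated for illustration:

```python
import numpy as np
from PIL import Image

image = Image.new("RGB", (8, 8), (0, 0, 255))      # solid blue canvas
overlay = np.array(image).copy()
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                              # stand-in for a predicted mask
# Same blend as above: half original pixel, half pure red.
overlay[mask] = (overlay[mask] * 0.5 + np.array([255, 0, 0]) * 0.5).astype(np.uint8)
print(overlay[3, 3])  # blended pixel: half red, half blue -> [127   0 127]
```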

## Multi-Prompt Detection

```python
from mlx_vlm.models.sam3_1.generate import predict_multi

result = predict_multi(predictor, image, ["a cat", "a remote control"])
for i in range(len(result.scores)):
    x1, y1, x2, y2 = result.boxes[i]
    print(f"[{result.scores[i]:.2f}] {result.labels[i]} box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```

## Box-Guided Detection

```python
import numpy as np
boxes = np.array([[100, 50, 400, 350]])  # xyxy pixel coords
result = predictor.predict(image, text_prompt="a cat", boxes=boxes)
```

## CLI

```bash
# Object detection
python -m mlx_vlm.models.sam3_1.generate --task detect --image photo.jpg --prompt "a cat" --model mlx-community/sam3.1-bf16

# Instance segmentation
python -m mlx_vlm.models.sam3_1.generate --image photo.jpg --prompt "a cat" --model mlx-community/sam3.1-bf16

# Video tracking
python -m mlx_vlm.models.sam3_1.generate --task track --video input.mp4 --prompt "a car" --model mlx-community/sam3.1-bf16

# Real-time webcam (optimized: backbone caching + tracker propagation)
python -m mlx_vlm.models.sam3_1.generate --task realtime --prompt "a person" --model mlx-community/sam3.1-bf16 --resolution 224
```

| Flag | Default | Description |
|------|---------|-------------|
| `--task` | `segment` | `detect`, `segment`, `track`, `realtime` |
| `--prompt` | *(required)* | Text prompt(s), supports multiple |
| `--resolution` | `1008` | Input resolution (224 for faster realtime) |
| `--detect-every` | `15` | Re-run full detection every N frames |
| `--backbone-every` | `30` | Re-run ViT backbone every N frames |

## Benchmarks (M3 Max, bf16)

### Detection Accuracy

| Prompt | SAM 3 | SAM 3.1 |
|--------|-------|---------|
| "a cat" (2 cats) | 0.87, 0.82 | **0.90, 0.86** |
| "a remote control" | 0.95, 0.94 | 0.94, 0.94 |

### Tracker Multiplex Speed

| Objects | SAM 3 | SAM 3.1 | Speedup |
|---------|-------|---------|---------|
| 3 | 547 ms/frame | 227 ms/frame | **2.4x** |
| 4 | 608 ms/frame | 203 ms/frame | **3.0x** |
| 5 | 766 ms/frame | 190 ms/frame | **4.0x** |

### Optimized Realtime (224px)

| Metric | Value |
|--------|-------|
| Cached frame | 38 ms (26 FPS) |
| Sustained average | ~40 ms (25 FPS) |
| Baseline (no optimization) | ~212 ms (5 FPS) |
| **Total speedup** | **4.6x** |
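
The speedup column follows directly from the per-frame times (speedup = SAM 3 ms ÷ SAM 3.1 ms, and FPS = 1000 ÷ ms). A quick check against the tracker-multiplex table:

```python
# Recompute the tracker-multiplex speedups from the table's ms/frame numbers.
times = {3: (547, 227), 4: (608, 203), 5: (766, 190)}  # objects -> (SAM 3, SAM 3.1)
for n_objects, (sam3_ms, sam31_ms) in times.items():
    speedup = sam3_ms / sam31_ms
    fps = 1000 / sam31_ms
    print(f"{n_objects} objects: {speedup:.1f}x speedup, {fps:.1f} FPS")
# -> 2.4x, 3.0x, 4.0x, matching the table.
```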

## Original Model

[facebook/sam3.1](https://huggingface.co/facebook/sam3.1) · [Code](https://github.com/facebookresearch/sam3)

## License

The original SAM 3.1 model weights are released by Meta under the [SAM License](https://huggingface.co/facebook/sam3.1/blob/main/LICENSE), a custom permissive license for commercial and research use.
config.json ADDED
@@ -0,0 +1,894 @@
{
  "architectures": [
    "Sam3VideoModel"
  ],
  "assoc_iou_thresh": 0.1,
  "decrease_trk_keep_alive_for_empty_masklets": false,
  "det_nms_thresh": 0.1,
  "detector_config": {
    "detr_decoder_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "box_rpb_mode": "log",
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "dropout": 0.1,
      "dtype": null,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_act": "relu",
      "hidden_dropout": 0.0,
      "hidden_size": 256,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_range": 0.02,
      "intermediate_size": 2048,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "layer_norm_eps": 1e-06,
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "sam3_detr_decoder",
      "no_repeat_ngram_size": 0,
      "num_attention_heads": 8,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_layers": 6,
      "num_queries": 200,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "typical_p": 1.0,
      "use_presence_token": true
    },
    "detr_encoder_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "dropout": 0.1,
      "dtype": null,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_act": "relu",
      "hidden_dropout": 0.0,
      "hidden_size": 256,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_range": 0.02,
      "intermediate_size": 2048,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "layer_norm_eps": 1e-06,
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "sam3_detr_encoder",
      "no_repeat_ngram_size": 0,
      "num_attention_heads": 8,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_layers": 6,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "typical_p": 1.0
    },
    "geometry_encoder_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "dropout": 0.1,
      "dtype": null,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_act": "relu",
      "hidden_dropout": 0.0,
      "hidden_size": 256,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_range": 0.02,
      "intermediate_size": 2048,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "layer_norm_eps": 1e-06,
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "sam3_geometry_encoder",
      "no_repeat_ngram_size": 0,
      "num_attention_heads": 8,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_layers": 3,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "roi_size": 7,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "typical_p": 1.0
    },
    "initializer_range": 0.02,
    "mask_decoder_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "dropout": 0.0,
      "dtype": null,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_size": 256,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_range": 0.02,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "layer_norm_eps": 1e-06,
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "sam3_mask_decoder",
      "no_repeat_ngram_size": 0,
      "num_attention_heads": 8,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_return_sequences": 1,
      "num_upsampling_stages": 3,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "typical_p": 1.0
    },
    "model_type": "sam3",
    "text_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "attention_dropout": 0.0,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": 49406,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "dtype": null,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": 49407,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_act": "gelu",
      "hidden_size": 1024,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_factor": 1.0,
      "initializer_range": 0.02,
      "intermediate_size": 4096,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "layer_norm_eps": 1e-05,
      "length_penalty": 1.0,
      "max_length": 20,
      "max_position_embeddings": 32,
      "min_length": 0,
      "model_type": "clip_text_model",
      "no_repeat_ngram_size": 0,
      "num_attention_heads": 16,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_hidden_layers": 24,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": 1,
      "prefix": null,
      "problem_type": null,
      "projection_dim": 512,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "typical_p": 1.0,
      "vocab_size": 49408
    },
    "vision_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "backbone_config": {
        "_name_or_path": "",
        "add_cross_attention": false,
        "architectures": null,
        "attention_dropout": 0.0,
        "bad_words_ids": null,
        "begin_suppress_tokens": null,
        "bos_token_id": null,
        "chunk_size_feed_forward": 0,
        "cross_attention_hidden_size": null,
        "decoder_start_token_id": null,
        "diversity_penalty": 0.0,
        "do_sample": false,
        "dtype": null,
        "early_stopping": false,
        "encoder_no_repeat_ngram_size": 0,
        "eos_token_id": null,
        "exponential_decay_length_penalty": null,
        "finetuning_task": null,
        "forced_bos_token_id": null,
        "forced_eos_token_id": null,
        "global_attn_indexes": [
          7,
          15,
          23,
          31
        ],
        "hidden_act": "gelu",
        "hidden_dropout": 0.0,
        "hidden_size": 1024,
        "id2label": {
          "0": "LABEL_0",
          "1": "LABEL_1"
        },
        "image_size": 1008,
        "initializer_range": 0.02,
        "intermediate_size": 4736,
        "is_decoder": false,
        "is_encoder_decoder": false,
        "label2id": {
          "LABEL_0": 0,
          "LABEL_1": 1
        },
        "layer_norm_eps": 1e-06,
        "layer_scale_init_value": null,
        "length_penalty": 1.0,
        "max_length": 20,
        "min_length": 0,
        "model_type": "sam3_vit_model",
        "no_repeat_ngram_size": 0,
        "num_attention_heads": 16,
        "num_beam_groups": 1,
        "num_beams": 1,
        "num_channels": 3,
        "num_hidden_layers": 32,
        "num_return_sequences": 1,
        "output_attentions": false,
        "output_hidden_states": false,
        "output_scores": false,
        "pad_token_id": null,
        "patch_size": 14,
        "prefix": null,
        "pretrain_image_size": 336,
        "problem_type": null,
        "qkv_bias": true,
        "remove_invalid_values": false,
        "repetition_penalty": 1.0,
        "return_dict": true,
        "return_dict_in_generate": false,
        "rope_theta": 10000.0,
        "sep_token_id": null,
        "suppress_tokens": null,
        "task_specific_params": null,
        "temperature": 1.0,
        "tie_encoder_decoder": false,
        "tie_word_embeddings": true,
        "tokenizer_class": null,
        "top_k": 50,
        "top_p": 1.0,
        "typical_p": 1.0,
        "window_size": 24
      },
      "backbone_feature_sizes": [
        [
          288,
          288
        ],
        [
          144,
          144
        ],
        [
          72,
          72
        ]
      ],
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "dtype": null,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "fpn_hidden_size": 256,
      "fpn_kernel_size": 2,
      "fpn_stride": 2,
      "hidden_act": "gelu",
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_range": 0.02,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "layer_norm_eps": 1e-06,
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "sam3_vision_model",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_feature_levels": 3,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "scale_factors": [
        4.0,
        2.0,
        1.0
      ],
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "typical_p": 1.0
    }
  },
  "dtype": "float32",
  "fill_hole_area": 16,
  "high_conf_thresh": 0.8,
  "high_iou_thresh": 0.8,
  "hotstart_delay": 15,
  "hotstart_dup_thresh": 8,
  "hotstart_unmatch_thresh": 8,
  "init_trk_keep_alive": 30,
  "initializer_range": 0.02,
  "low_res_mask_size": 288,
  "max_num_objects": 10000,
  "max_trk_keep_alive": 30,
  "min_trk_keep_alive": -1,
  "model_type": "sam3.1_video",
  "new_det_thresh": 0.7,
  "recondition_every_nth_frame": 16,
  "recondition_on_trk_masks": false,
  "score_threshold_detection": 0.5,
  "suppress_overlapping_based_on_recent_occlusion_threshold": 0.7,
  "suppress_unmatched_only_within_hotstart": true,
  "tracker_config": {
    "enable_occlusion_spatial_embedding": true,
    "enable_temporal_pos_encoding_for_object_pointers": true,
    "image_size": 1008,
    "initializer_range": 0.02,
    "mask_decoder_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "attention_downsample_rate": 2,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "dtype": null,
      "dynamic_multimask_stability_delta": 0.05,
      "dynamic_multimask_stability_thresh": 0.98,
      "dynamic_multimask_via_stability": true,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_act": "gelu",
      "hidden_size": 256,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "iou_head_depth": 3,
      "iou_head_hidden_dim": 256,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "mlp_dim": 2048,
      "model_type": "",
      "no_repeat_ngram_size": 0,
      "num_attention_heads": 8,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_hidden_layers": 2,
      "num_multimask_outputs": 3,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "typical_p": 1.0
    },
    "mask_downsampler_embed_dim": 256,
    "mask_downsampler_hidden_act": "gelu",
    "mask_downsampler_kernel_size": 3,
    "mask_downsampler_padding": 1,
    "mask_downsampler_stride": 2,
    "mask_downsampler_total_stride": 16,
    "max_cond_frame_num": 4,
    "max_object_pointers_in_encoder": 16,
    "memory_attention_downsample_rate": 1,
    "memory_attention_dropout": 0.1,
    "memory_attention_feed_forward_hidden_act": "relu",
    "memory_attention_feed_forward_hidden_size": 2048,
    "memory_attention_hidden_size": 256,
    "memory_attention_num_attention_heads": 1,
    "memory_attention_num_layers": 4,
    "memory_attention_rope_dropout": 0.1,
    "memory_attention_rope_feat_sizes": [
      72,
      72
    ],
    "memory_attention_rope_theta": 10000,
    "memory_encoder_hidden_size": 256,
    "memory_encoder_output_channels": 64,
    "memory_fuser_embed_dim": 256,
    "memory_fuser_hidden_act": "gelu",
    "memory_fuser_intermediate_dim": 1024,
    "memory_fuser_kernel_size": 7,
    "memory_fuser_layer_scale_init_value": 1e-06,
    "memory_fuser_num_layers": 2,
    "memory_fuser_padding": 3,
    "model_type": "sam3_tracker_video",
    "multimask_max_pt_num": 1,
    "multimask_min_pt_num": 0,
    "multimask_output_for_tracking": true,
    "multimask_output_in_sam": true,
    "num_maskmem": 7,
    "prompt_encoder_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "dtype": null,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_act": "gelu",
      "hidden_size": 256,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "image_size": 1008,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "layer_norm_eps": 1e-06,
      "length_penalty": 1.0,
      "mask_input_channels": 16,
      "max_length": 20,
      "min_length": 0,
      "model_type": "",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_point_embeddings": 4,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "patch_size": 14,
      "prefix": null,
      "problem_type": null,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "scale": 1,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "typical_p": 1.0
    },
    "sigmoid_bias_for_mem_enc": -10.0,
    "sigmoid_scale_for_mem_enc": 20.0,
    "vision_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "backbone_config": {
        "_name_or_path": "",
        "add_cross_attention": false,
        "architectures": null,
        "attention_dropout": 0.0,
        "bad_words_ids": null,
        "begin_suppress_tokens": null,
        "bos_token_id": null,
        "chunk_size_feed_forward": 0,
        "cross_attention_hidden_size": null,
        "decoder_start_token_id": null,
        "diversity_penalty": 0.0,
        "do_sample": false,
        "dtype": null,
        "early_stopping": false,
        "encoder_no_repeat_ngram_size": 0,
        "eos_token_id": null,
        "exponential_decay_length_penalty": null,
        "finetuning_task": null,
        "forced_bos_token_id": null,
        "forced_eos_token_id": null,
        "global_attn_indexes": [
          7,
          15,
          23,
          31
        ],
        "hidden_act": "gelu",
        "hidden_dropout": 0.0,
        "hidden_size": 1024,
        "id2label": {
          "0": "LABEL_0",
          "1": "LABEL_1"
        },
        "image_size": 1008,
        "initializer_range": 0.02,
        "intermediate_size": 4736,
        "is_decoder": false,
        "is_encoder_decoder": false,
        "label2id": {
          "LABEL_0": 0,
          "LABEL_1": 1
        },
        "layer_norm_eps": 1e-06,
        "layer_scale_init_value": null,
        "length_penalty": 1.0,
        "max_length": 20,
        "min_length": 0,
        "model_type": "sam3_vit_model",
        "no_repeat_ngram_size": 0,
        "num_attention_heads": 16,
        "num_beam_groups": 1,
        "num_beams": 1,
        "num_channels": 3,
        "num_hidden_layers": 32,
        "num_return_sequences": 1,
        "output_attentions": false,
        "output_hidden_states": false,
        "output_scores": false,
        "pad_token_id": null,
        "patch_size": 14,
        "prefix": null,
        "pretrain_image_size": 336,
        "problem_type": null,
        "qkv_bias": true,
        "remove_invalid_values": false,
        "repetition_penalty": 1.0,
        "return_dict": true,
        "return_dict_in_generate": false,
        "rope_theta": 10000.0,
        "sep_token_id": null,
        "suppress_tokens": null,
        "task_specific_params": null,
        "temperature": 1.0,
        "tie_encoder_decoder": false,
        "tie_word_embeddings": true,
        "tokenizer_class": null,
        "top_k": 50,
        "top_p": 1.0,
        "typical_p": 1.0,
        "window_size": 24
      },
      "backbone_feature_sizes": [
        [
          288,
          288
        ],
        [
          144,
          144
        ],
        [
          72,
          72
        ]
      ],
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "dtype": null,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "fpn_hidden_size": 256,
      "fpn_kernel_size": 2,
      "fpn_stride": 2,
      "hidden_act": "gelu",
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_range": 0.02,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "layer_norm_eps": 1e-06,
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "sam3_vision_model",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_feature_levels": 3,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "scale_factors": [
        4.0,
        2.0,
        1.0
      ],
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "typical_p": 1.0
    }
  },
  "transformers_version": "5.0.0.dev0",
  "trk_assoc_iou_thresh": 0.5
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a1b1c19dcc9bdd68438bcd74433fadc90740e73c37a1f386872672d134879c42
size 3493016552