---
license: apache-2.0
datasets:
- violetcliff/SmartHome-Bench
language:
- en
base_model:
- Qwen/Qwen2.5-3B-Instruct
---
# Model Card for Memories-S0

**Memories-S0** is a highly efficient, 3-billion-parameter video understanding model designed specifically for the security and surveillance domain. It leverages synthetic data generation (via Veo 3) and extreme optimization strategies to achieve state-of-the-art performance on edge devices.

## Model Details

* **Model Name:** Memories-S0
* **Organization:** Memories.ai Research
* **Model Architecture:** 3B Parameter VideoLLM
* **Release Date:** Jan 2026
* **License:** Apache 2.0
* **Paper:** [Memories-S0: An Efficient and Accurate Framework for Security Video Understanding](https://memories.ai/research/Camera)
* **Code Repository:** [https://github.com/Memories-ai-labs/memories-s0](https://github.com/Memories-ai-labs/memories-s0)

### Model Description

Memories-S0 is designed to address two key challenges in security video understanding: data scarcity and deployment efficiency on resource-constrained devices.

* **Data Innovation:** The model is pre-trained on a massive, diverse set of synthetic surveillance videos generated by advanced video generation models (like Veo 3). This allows for pixel-perfect annotations and covers diverse scenarios (e.g., dimly lit hallways, unattended packages).
* **Extreme Efficiency:** It utilizes an innovative input token compression algorithm that dynamically prunes redundant background tokens, focusing computation on foreground objects and motion. This allows the 3B model to run efficiently on mobile/edge hardware.
* **Post-Training:** The model employs a unique post-training strategy using Reinforcement Learning (RL) and event-based temporal shuffling to enhance sequential understanding without expensive full fine-tuning.
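
The token-compression idea above can be sketched as a motion-based pruning toy. This is purely illustrative (the function name `prune_static_tokens`, the patch size, and the keep ratio are assumptions, not the model's actual algorithm): patches whose pixels barely change across frames are discarded before they would become tokens, so compute concentrates on foreground motion.

```python
import numpy as np

def prune_static_tokens(frames, patch=16, keep_ratio=0.25):
    """Illustrative sketch: score each spatial patch by its temporal
    change and keep only the most dynamic fraction of patches."""
    T, H, W = frames.shape
    gh, gw = H // patch, W // patch
    # Cut each frame into a (gh x gw) grid of flattened patches
    patches = frames[:, :gh * patch, :gw * patch].reshape(T, gh, patch, gw, patch)
    patches = patches.transpose(0, 1, 3, 2, 4).reshape(T, gh * gw, -1)
    # Motion score: mean absolute frame-to-frame change per patch
    motion = np.abs(np.diff(patches, axis=0)).mean(axis=(0, 2))
    k = max(1, int(keep_ratio * gh * gw))
    keep = np.argsort(motion)[-k:]  # indices of the most dynamic patches
    return keep, patches[:, keep]

# Toy example: a 64x64 video where only the top-left 16x16 corner moves
rng = np.random.default_rng(0)
frames = np.zeros((8, 64, 64), dtype=np.float32)
frames[:, :16, :16] = rng.random((8, 16, 16))  # "foreground" motion
keep, kept = prune_static_tokens(frames, patch=16, keep_ratio=0.25)
print(len(keep), kept.shape)  # 4 (8, 4, 256)
```

With a 25% keep ratio, 4 of the 16 patches survive, and the moving corner patch is always among them.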

## Installation

```bash
conda create -n memories-s0 python=3.10 -y
conda activate memories-s0

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install dependencies for the Qwen2.5-VL architecture and Flash Attention
pip install "transformers>=4.37.0" accelerate qwen_vl_utils
pip install flash-attn --no-build-isolation
```

## Inference

The following script runs **Memories-S0** on a video file. The model weights are downloaded automatically from the official Hugging Face repository.

```python
import torch
import argparse
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Official Model Repository
MODEL_ID = "Memories-ai/security_model"

def run_inference(video_path, model_id=MODEL_ID):
    # Load Model with Flash Attention 2 for efficiency
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    # Define Security Analysis Prompt
    prompt_text = """YOUR_PROMPT"""

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]

    # Preprocessing
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        **video_kwargs,
    )
    inputs = inputs.to(model.device)

    # Generate
    generated_ids = model.generate(**inputs, max_new_tokens=768)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    print(output_text[0])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--video_path", type=str, required=True, help="Path to input video")
    args = parser.parse_args()
    run_inference(args.video_path)

```
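
The script leaves `prompt_text` as a `YOUR_PROMPT` placeholder. A hypothetical prompt for anomaly detection might look like the following; the wording is an illustration, not the model's official prompt:

```python
# Hypothetical security-analysis prompt; substitute it for YOUR_PROMPT above
# and tune the wording for your own deployment.
SECURITY_PROMPT = (
    "You are a security analyst reviewing surveillance footage.\n"
    "1. Describe the scene, including any people and vehicles.\n"
    "2. Note any anomalous or suspicious activity.\n"
    "3. End with a single label: 'anomaly' or 'normal'."
)
```

Prompting for a fixed final label (e.g., `'anomaly'` vs. `'normal'`) makes the output easy to parse when scoring against benchmarks such as SmartHome-Bench.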

## Intended Use

### Primary Use Cases

* **Security & Surveillance:** Detecting anomalies, tracking suspicious activities, and monitoring public safety.
* **Smart Home Monitoring:** Analyzing video feeds for unusual events (e.g., falls, intruders) as benchmarked on SmartHomeBench.
* **Edge Computing:** Deploying high-performance video analysis directly on cameras or local gateways with limited memory and compute power.

### Out-of-Scope Use Cases

* General open-domain video understanding (e.g., movie classification): the model is specialized for surveillance viewpoints and events, so performance on general content may be suboptimal.
* Biometric identification (face recognition): the model's focus is on action and event understanding, not identity matching.

## Performance (SmartHomeBench)

We evaluated Memories-S0 (3B) on the **SmartHomeBench** dataset, a recognized benchmark for smart-home video anomaly detection.

Despite having only **3B parameters**, our model achieves an **F1-score of 79.21** with a simple **zero-shot** prompt, surpassing larger models such as VILA-13b and performing competitively against GPT-4o and Claude-3.5-Sonnet (which rely on more complex Chain-of-Thought prompting).

| Model | Params | Prompting Method | Accuracy | Precision | Recall | **F1-score** |
| --- | --- | --- | --- | --- | --- | --- |
| **Memories-S0 (Ours)** | **3B** | **Zero-shot** | **71.33** | **73.04** | **86.51** | **79.21** |
| VILA-13b | 13B | Few-shot CoT | 67.17 | 69.18 | 70.57 | 69.87 |
| GPT-4o | Closed | Zero-shot | 68.41 | 80.09 | 55.16 | 65.33 |
| Gemini-1.5-Pro | Closed | Zero-shot | 57.36 | 84.34 | 25.73 | 39.43 |
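
As a quick sanity check, the reported F1-score for the Memories-S0 row is consistent with its precision and recall:

```python
# F1 = 2 * P * R / (P + R), using the Memories-S0 row of the table above
precision, recall = 73.04, 86.51
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 79.21
```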

## Citation

If you use this model or framework in your research, please cite our technical report:

```bibtex
@techreport{memories_s0_2025,
  title       = {{Memories-S0}: An Efficient and Accurate Framework for Security Video Understanding},
  author      = {{Memories.ai Research}},
  institution = {Memories.ai},
  year        = {2025},
  month       = oct,
  url         = {https://huggingface.co/Memories-ai/security_model},
  note        = {Accessed: 2025-11-20}
}

```