---
license: apache-2.0
datasets:
- violetcliff/SmartHome-Bench
language:
- en
base_model:
- Qwen/Qwen2.5-3B-Instruct
---
# Model Card for Memories-S0
**Memories-S0** is a 3-billion-parameter video understanding model built for the security and surveillance domain. It combines synthetic training data (generated with Veo 3) with aggressive efficiency optimizations to target state-of-the-art performance on edge devices.
## Model Details
* **Model Name:** Memories-S0
* **Organization:** Memories.ai Research
* **Model Architecture:** 3B Parameter VideoLLM
* **Release Date:** Jan 2026
* **License:** Apache 2.0
* **Paper:** [Memories-S0: An Efficient and Accurate Framework for Security Video Understanding](https://memories.ai/research/Camera)
* **Code Repository:** [https://github.com/Memories-ai-labs/memories-s0](https://github.com/Memories-ai-labs/memories-s0)
### Model Description
Memories-S0 is designed to address two key challenges in security video understanding: data scarcity and deployment efficiency on resource-constrained devices.
* **Data Innovation:** The model is pre-trained on a massive, diverse set of synthetic surveillance videos generated by advanced video generation models (like Veo 3). This allows for pixel-perfect annotations and covers diverse scenarios (e.g., dimly lit hallways, unattended packages).
* **Extreme Efficiency:** It utilizes an innovative input token compression algorithm that dynamically prunes redundant background tokens, focusing computation on foreground objects and motion. This allows the 3B model to run efficiently on mobile/edge hardware.
* **Post-Training:** The model employs a unique post-training strategy using Reinforcement Learning (RL) and event-based temporal shuffling to enhance sequential understanding without expensive full fine-tuning.
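
To make the token-compression idea above concrete, here is a minimal sketch of motion-based token pruning. It is not the Memories-S0 algorithm, which this card does not specify. It assumes that each frame is already split into patch tokens and that a patch counts as "background" when it barely changes from the previous frame. The function name `prune_static_tokens` and the `threshold` value are hypothetical.

```python
from typing import List, Tuple

Token = List[float]   # one patch embedding (toy size)
Frame = List[Token]   # all patch tokens of one frame


def prune_static_tokens(
    frames: List[Frame], threshold: float = 0.1
) -> List[Tuple[int, int, Token]]:
    """Keep every token of the first frame, then only tokens that changed.

    Returns (frame_index, patch_index, token) triples so that each kept
    token still carries its position in time and space.
    """
    kept = []
    for t, frame in enumerate(frames):
        for p, token in enumerate(frame):
            if t == 0:
                # The first frame is kept in full as the scene reference.
                kept.append((t, p, token))
                continue
            prev = frames[t - 1][p]
            # Mean absolute change of this patch relative to the previous frame.
            change = sum(abs(a - b) for a, b in zip(token, prev)) / len(token)
            if change > threshold:
                kept.append((t, p, token))
    return kept


# Toy clip: 3 frames x 4 patches x 2 features; only patch 2 changes over time.
static = [0.5, 0.5]
frames = [
    [static, static, [0.0, 0.0], static],
    [static, static, [0.4, 0.4], static],
    [static, static, [0.9, 0.9], static],
]
kept = prune_static_tokens(frames)
print(len(kept), "of", sum(len(f) for f in frames), "tokens kept")
```

In this toy clip the static patches of later frames are dropped, so the language model would see half of the original tokens. A real implementation would operate on vision-encoder features and would likely use a learned or adaptive criterion rather than a fixed threshold.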
## Installation
```bash
conda create -n memories-s0 python=3.10 -y
conda activate memories-s0
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install dependencies for Qwen2.5-VL architecture and Flash Attention
# (quote the version specifier so the shell does not treat ">" as a redirect)
pip install "transformers>=4.37.0" accelerate qwen_vl_utils
pip install flash-attn --no-build-isolation
```
## Inference
The following script demonstrates how to run the **Memories-S0** model. It automatically handles the loading of weights from the official Hugging Face repository.
```python
import torch
import argparse
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Official Model Repository
MODEL_ID = "Memories-ai/security_model"


def run_inference(video_path, model_id=MODEL_ID):
    # Load Model with Flash Attention 2 for efficiency
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    # Define Security Analysis Prompt
    prompt_text = """YOUR_PROMPT"""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]

    # Preprocessing
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        **video_kwargs,
    )
    inputs = inputs.to("cuda")

    # Generate
    generated_ids = model.generate(**inputs, max_new_tokens=768)
    # Strip the prompt tokens so only the newly generated tokens are decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_text[0])


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--video_path", type=str, required=True, help="Path to input video")
    args = parser.parse_args()
    run_inference(args.video_path)
```
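
The `messages` structure in the script can also carry per-video sampling hints. The sketch below builds the same single-turn layout with optional `fps` and `max_pixels` keys, which `qwen_vl_utils` reads from the video entry when it loads frames. The helper name `build_messages`, the example path, and the example prompt are hypothetical and only illustrate the shape of the payload. Sensible values for this model are not documented here.

```python
from typing import Any, Dict, List, Optional


def build_messages(
    video_path: str,
    prompt_text: str,
    fps: Optional[float] = None,
    max_pixels: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Build a single-turn chat message list in the layout used by the script."""
    video_entry: Dict[str, Any] = {"type": "video", "video": video_path}
    # Optional sampling hints: fewer frames or pixels means fewer visual tokens.
    if fps is not None:
        video_entry["fps"] = fps
    if max_pixels is not None:
        video_entry["max_pixels"] = max_pixels
    return [
        {
            "role": "user",
            "content": [video_entry, {"type": "text", "text": prompt_text}],
        }
    ]


# Hypothetical path and prompt, for illustration only.
example = build_messages(
    "clips/front_door.mp4",
    "Describe any unusual activity in this clip.",
    fps=1.0,
    max_pixels=360 * 420,
)
print(example[0]["content"][0])
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in place of the hard-coded `messages` variable.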
## Intended Use
### Primary Use Cases
* **Security & Surveillance:** Detecting anomalies, tracking suspicious activities, and monitoring public safety.
* **Smart Home Monitoring:** Analyzing video feeds for unusual events (e.g., falls, intruders), as benchmarked on SmartHome-Bench.
* **Edge Computing:** Deploying high-performance video analysis directly on cameras or local gateways with limited memory and compute power.
### Out-of-Scope Use Cases
* General open-domain video understanding (e.g., movie classification): the model is specialized for surveillance camera angles and events, so results on other content may be suboptimal.
* Biometric identification (e.g., face recognition): this is not a design goal; the focus is on action and event understanding.
## Performance (SmartHome-Bench)
We evaluated Memories-S0 (3B) on the **SmartHome-Bench** dataset, a benchmark for smart home video anomaly detection.
Despite having only **3B parameters**, our model achieves an **F1-score of 79.21** with a simple **zero-shot** prompt. This surpasses larger models such as VILA-13b and is competitive with GPT-4o and Claude-3.5-Sonnet (which require complex Chain-of-Thought prompting).
| Model | Params | Prompting Method | Accuracy | Precision | Recall | **F1-score** |
| --- | --- | --- | --- | --- | --- | --- |
| **Memories-S0 (Ours)** | **3B** | **Zero-shot** | **71.33** | **73.04** | **86.51** | **79.21** |
| VILA-13b | 13B | Few-shot CoT | 67.17 | 69.18 | 70.57 | 69.87 |
| GPT-4o | Closed | Zero-shot | 68.41 | 80.09 | 55.16 | 65.33 |
| Gemini-1.5-Pro | Closed | Zero-shot | 57.36 | 84.34 | 25.73 | 39.43 |
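
As a quick sanity check on the table, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall columns above (all values are percentages). The function name `f1_score` is illustrative and is not part of the released code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "Memories-S0": (73.04, 86.51),
    "VILA-13b": (69.18, 70.57),
    "GPT-4o": (80.09, 55.16),
    "Gemini-1.5-Pro": (84.34, 25.73),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```

The recomputed values match the F1 column to two decimal places. They also show why a model with high precision but low recall (Gemini-1.5-Pro here) ends up with a low F1.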
## Citation
If you use this model or framework in your research, please cite our technical report:
```bibtex
@techreport{memories_s0_2025,
title = {{Memories-S0}: An Efficient and Accurate Framework for Security Video Understanding},
author = {{Memories.ai Research}},
institution = {Memories.ai},
year = {2025},
month = oct,
url = {https://huggingface.co/Memories-ai/security_model},
note = {Accessed: 2025-11-20}
}
```