# arc-vision

Vision models for [arc-vision](https://github.com/llamadrone/llamadrone), the autonomous drone vision pipeline.

## Files

| File | Size | Description |
|------|------|-------------|
| `yolov11s.hef` | 16 MB | YOLO11s object detection (640×640, 80 COCO classes, Hailo-10H) |
| `clip_vit_b_16_text_encoder.safetensors` | 254 MB | CLIP ViT-B/16 text encoder weights for target designation |
| `clip_tokenizer.json` | 3.6 MB | CLIP tokenizer |
| `clip_vit_b_16_image_encoder.hef` | 76 MB | CLIP ViT-B/16 image encoder compiled for the Hailo-10H NPU |
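
The detector's 640×640 input implies the camera frame must be resized before inference. A minimal sketch of the conventional YOLO-style letterbox preprocessing in numpy (an assumption for illustration; this repo does not ship preprocessing code, and the Hailo runtime may handle it differently):

```python
import numpy as np

def letterbox(frame: np.ndarray, size: int = 640) -> np.ndarray:
    """Resize an HxWx3 uint8 frame to size x size, padding to keep aspect ratio."""
    h, w = frame.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour resize via index maps (stand-in for cv2.resize).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = frame[rows][:, cols]
    # Pad with gray (114, the usual YOLO fill) to a square canvas, image centred.
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # e.g. one camera frame
print(letterbox(frame).shape)  # (640, 640, 3)
```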

## Architecture

- **Detection:** YOLO11s on the Hailo-10H NPU (~46 FPS)
- **Target designation:** CLIP text encoder (CPU, candle) + image encoder (NPU, ~53 FPS) for zero-shot matching
- **Re-ID:** CLIP image embeddings for cross-frame track association
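
Cross-frame re-ID can be sketched as greedy matching of existing track embeddings against the current frame's detection embeddings by cosine similarity. This is a hypothetical illustration, not the pipeline's actual matcher:

```python
import numpy as np

def associate(track_embs: np.ndarray, det_embs: np.ndarray, min_sim: float = 0.6):
    """Greedily pair tracks with detections by cosine similarity.

    Returns a list of (track_idx, det_idx) pairs; each track and each
    detection is used at most once, and pairs below min_sim are dropped.
    """
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    sim = t @ d.T                      # cosine similarity matrix
    matches = []
    while sim.size and sim.max() >= min_sim:
        ti, di = np.unravel_index(np.argmax(sim), sim.shape)
        matches.append((int(ti), int(di)))
        sim[ti, :] = -1.0              # mark this track as consumed
        sim[:, di] = -1.0              # mark this detection as consumed
    return matches
```

Unmatched tracks would then coast or expire, and unmatched detections spawn new tracks; the similarity threshold trades identity switches against fragmentation.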

Describe a target in natural language ("silver car", "person in red jacket") and the pipeline matches it against live camera frames.
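
The zero-shot match reduces to cosine similarity between the prompt's text embedding and each detection crop's image embedding. A minimal numpy sketch, where the embedding values are mock stand-ins for the CLIP encoder outputs:

```python
import numpy as np

def best_match(text_emb: np.ndarray, crop_embs: np.ndarray):
    """Return (index, score) of the detection crop most similar to the prompt."""
    t = text_emb / np.linalg.norm(text_emb)
    c = crop_embs / np.linalg.norm(crop_embs, axis=1, keepdims=True)
    scores = c @ t                     # cosine similarity per crop
    i = int(np.argmax(scores))
    return i, float(scores[i])

# Mock embeddings: in the real pipeline these come from the text encoder
# ("silver car") and the image encoder run over each detected object's crop.
text_emb = np.array([0.9, 0.1, 0.0])
crop_embs = np.array([[0.1, 0.9, 0.0],   # e.g. a person
                      [0.8, 0.2, 0.1]])  # e.g. a silver car
idx, score = best_match(text_emb, crop_embs)
print(idx)  # 1
```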

## Sources

- YOLO11s + CLIP ViT-B/16 image encoder: compiled from the [Hailo Model Zoo](https://github.com/hailo-ai/hailo_model_zoo) v5.2.0
- CLIP text encoder + tokenizer: exported from [`openai/clip-vit-base-patch16`](https://huggingface.co/openai/clip-vit-base-patch16)