danielmunicio committed on
Commit a0bb31d · verified · 1 Parent(s): 76d62ff

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +116 -0
  2. checkpoints/latest-checkpoint.pt +3 -0
  3. config.json +22 -0
README.md ADDED
@@ -0,0 +1,116 @@
+ ---
+ license: mit
+ tags:
+ - vla
+ - robotics
+ - drone-navigation
+ - prismatic
+ - angle-prediction
+ - qwen2.5
+ - dinov2
+ - siglip
+ library_name: prismatic
+ ---
+
+ # MiniVLA Angle Selector
+
+ A Vision-Language-Action (VLA) model for drone angle prediction. Given a forward-facing drone camera image and a navigation prompt (e.g., "Navigate to the red cube"), it predicts a flight direction as one of 36 discrete angles (0-350 degrees, in 10-degree increments).
+
+ ## Architecture
+
+ - **Vision Backbone**: DINOv2 + SigLIP, fused, @ 224px
+ - **LLM Backbone**: Qwen2.5 0.5B
+ - **Projector**: FusedGeLU MLP (no-align, single-stage)
+ - **Total Parameters**: ~1.26B (vision + projector + LLM)
+ - **Inference VRAM**: ~2.5 GB (bf16)
+
+ ## Training
+
+ 1. **VLM Pretraining**: single-stage training on the LLaVA 665k dataset (projector + LLM trained jointly, with no separate alignment stage)
+ 2. **Angle Fine-tuning**: LoRA (r=16, alpha=32) plus unfrozen embeddings, on 21k drone navigation samples; see the configuration sketch below
+
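+ A minimal sketch of the fine-tuning setup described above, assuming Hugging Face `peft`; the `target_modules` list and the base-model handle are illustrative assumptions, not the repository's recorded settings:
+
+ ```python
+ from peft import LoraConfig, get_peft_model
+ from transformers import AutoModelForCausalLM
+
+ # Assumption: the Qwen2.5 0.5B backbone as a plain HF causal LM
+ llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
+
+ # LoRA hyperparameters from this README: r=16, alpha=32
+ lora_config = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections
+     modules_to_save=["embed_tokens", "lm_head"],  # embeddings stay trainable ("unfrozen")
+ )
+ llm = get_peft_model(llm, lora_config)
+ llm.print_trainable_parameters()
+ # After training, adapters can be folded back in: llm = llm.merge_and_unload()
+ ```
+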
+ ## Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | Val Accuracy (exact match) | 77.7% |
+ | Val Angular Error | 3.4 degrees |
+ | Angle Bins | 36 (10-degree steps) |
+
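+ The angular error above is presumably the minimal circular distance between predicted and ground-truth angles, averaged over the validation set; a sketch under that assumption (the helper name is hypothetical):
+
+ ```python
+ def angular_error_deg(pred_deg: float, true_deg: float) -> float:
+     # Minimal circular difference between two angles, in [0, 180]
+     diff = abs(pred_deg - true_deg) % 360.0
+     return min(diff, 360.0 - diff)
+
+ assert angular_error_deg(350.0, 10.0) == 20.0  # wraps around 0/360
+ ```
+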
+ ## Usage
+
+ ### With Prismatic (openvla-mini)
+
+ ```python
+ from prismatic import load
+
+ vlm = load("path/to/minivla-angle-selector")
+ ```
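+
+ Note: `prismatic.load` expects a standard Prismatic run directory; this repository appears to follow that layout (`config.json` at the root plus `checkpoints/latest-checkpoint.pt`). If the high-level loader does not accept it in your environment, the standalone recipe below builds the model explicitly.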
+
+ ### Standalone
+
+ ```python
+ import torch
+ from PIL import Image
+ from prismatic.models.materialize import (
+     get_vision_backbone_and_transform,
+     get_llm_backbone_and_tokenizer,
+     get_vlm,
+ )
+
+ # Build the model skeleton (backbones + projector)
+ vision_backbone, _ = get_vision_backbone_and_transform(
+     "dinosiglip-vit-so-224px", "resize-naive", image_sequence_len=1
+ )
+ llm_backbone, tokenizer = get_llm_backbone_and_tokenizer(
+     "qwen25-0_5b-pure", llm_max_length=2048, inference_mode=True,
+ )
+ vlm = get_vlm(
+     model_id="minivla-angle-selector",
+     arch_specifier="no-align+fused-gelu-mlp",
+     vision_backbone=vision_backbone,
+     llm_backbone=llm_backbone,
+ )
+
+ # Load the fine-tuned weights shipped in this repo
+ ckpt = torch.load("checkpoints/latest-checkpoint.pt", map_location="cpu")["model"]
+ vlm.projector.load_state_dict(ckpt["projector"])
+ vlm.llm_backbone.load_state_dict(ckpt["llm_backbone"])
+ vlm.vision_backbone.load_state_dict(ckpt["vision_backbone"])
+ vlm.to("cuda", dtype=torch.bfloat16)
+ vlm.eval()
+
+ # Predict an angle from an image and an instruction
+ image = Image.open("drone_view.png")
+ prompt_builder = vlm.get_prompt_builder()
+ prompt_builder.add_turn("human", "Navigate the drone to the red cube")
+ input_prompt = prompt_builder.get_prompt()
+
+ tok = vlm.llm_backbone.tokenizer
+ input_ids = tok(input_prompt, return_tensors="pt").input_ids.to("cuda")
+ pixel_values = vlm.vision_backbone.get_image_transform()(image)
+ # The fused DinoSigLIP transform may return a dict of tensors (one per
+ # sub-backbone); add a batch dimension either way
+ if isinstance(pixel_values, dict):
+     pixel_values = {k: v[None, ...].to("cuda", dtype=torch.bfloat16) for k, v in pixel_values.items()}
+ else:
+     pixel_values = pixel_values[None, ...].to("cuda", dtype=torch.bfloat16)
+
+ with torch.no_grad():
+     output = vlm.forward(input_ids=input_ids, pixel_values=pixel_values, return_dict=True)
+ num_patches = vlm.vision_backbone.num_patches
+ # Logits at the final sequence position (text positions start after the image patches)
+ action_logits = output.logits[0, num_patches:, :][-1, :]
+ token_id = action_logits.argmax().item()
+
+ # Convert the predicted token to an angle (see "Action Space" below)
+ vocab_size = len(tok)
+ angle_code = (vocab_size - 1 - token_id) % 36
+ angle_degrees = angle_code * 10
+ print(f"Predicted angle: {angle_degrees} degrees")
+ ```
+
+ ## Action Space
+
+ | Code | Angle | Direction |
+ |------|-------|-----------|
+ | 0 | 0 deg | +X (right) |
+ | 9 | 90 deg | +Y (forward) |
+ | 18 | 180 deg | -X (left) |
+ | 27 | 270 deg | -Y (backward) |
+
+ Token mapping: `token_id = vocab_size - 1 - angle_code`
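+
+ A worked sketch of this mapping (helper names are hypothetical; `vocab_size` is taken from `config.json`):
+
+ ```python
+ VOCAB_SIZE = 151665   # "vocab_size" in config.json
+ NUM_ANGLE_CODES = 36  # 10-degree bins
+
+ def angle_to_token(angle_degrees: int) -> int:
+     # Angle codes occupy the last 36 vocab slots; code 0 maps to the last token id
+     code = (angle_degrees // 10) % NUM_ANGLE_CODES
+     return VOCAB_SIZE - 1 - code
+
+ def token_to_angle(token_id: int) -> int:
+     code = (VOCAB_SIZE - 1 - token_id) % NUM_ANGLE_CODES
+     return code * 10
+
+ assert token_to_angle(angle_to_token(270)) == 270
+ ```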
checkpoints/latest-checkpoint.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d68ef9aa95ff6ceb2b5f5529afdb4bf058267c74c8a320443f6e9e753c7d7913
+ size 5009405931
config.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "model": {
+     "model_id": "minivla-angle-selector",
+     "vision_backbone_id": "dinosiglip-vit-so-224px",
+     "llm_backbone_id": "qwen25-0_5b-pure",
+     "arch_specifier": "no-align+fused-gelu-mlp",
+     "image_resize_strategy": "resize-naive",
+     "image_sequence_len": 1,
+     "llm_max_length": 2048
+   },
+   "training": {
+     "type": "angle_prediction",
+     "num_angle_codes": 36,
+     "angle_step_deg": 10,
+     "vocab_size": 151665,
+     "val_accuracy": 0.7771428571428571,
+     "val_angular_error_deg": 3.390151515151515,
+     "epoch": 6,
+     "pretrained_vlm": "LLaVA 665k (single-stage, no-align)",
+     "finetuned_with": "LoRA r=16 alpha=32 + unfrozen embeddings (merged)"
+   }
+ }