---
license: mit
tags:
- vla
- robotics
- drone-navigation
- prismatic
- angle-prediction
- qwen2.5
- dinov2
- siglip
library_name: prismatic
---

# MiniVLA Angle Selector

A Vision-Language-Action (VLA) model for drone angle prediction. Given a forward-facing drone camera image and a navigation prompt (e.g., "Navigate to the red cube"), it predicts a flight direction as one of 36 discrete angles (0-350 degrees, in 10-degree increments).
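
The binning rule itself is not specified on this card; a minimal sketch, assuming nearest-bin rounding onto the 10-degree grid:

```python
def discretize_heading(theta_deg: float) -> int:
    """Snap a continuous heading in degrees to one of 36 codes (0-35)."""
    return round(theta_deg / 10) % 36

assert discretize_heading(87.0) == 9   # -> 90 degrees
assert discretize_heading(356.0) == 0  # wraps around to 0 degrees
```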

## Architecture

- **Vision Backbone**: DINOv2 + SigLIP, fused @ 224px
- **LLM Backbone**: Qwen2.5 0.5B
- **Projector**: FusedGeLU MLP (no-align, single-stage)
- **Total Parameters**: ~1.26B (vision + projector + LLM)
- **Inference VRAM**: ~2.5 GB (bf16; see the estimate below)
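
The VRAM figure is roughly what the bf16 weights alone occupy; a back-of-the-envelope check (activations and KV cache add overhead on top of this):

```python
params = 1.26e9      # total parameter count (vision + projector + LLM)
bytes_per_param = 2  # bf16
print(f"~{params * bytes_per_param / 1e9:.1f} GB for weights")  # ~2.5 GB
```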

## Training

1. **VLM Pretraining**: Single-stage training on the LLaVA 665k dataset (projector + LLM trained jointly, with no separate alignment stage)
2. **Angle Fine-tuning**: LoRA (r=16, alpha=32) with unfrozen embeddings on 21k drone navigation samples; an equivalent configuration is sketched below
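
The card does not pin down the LoRA implementation; a sketch of an equivalent setup in Hugging Face PEFT terms, where the target and saved module names are assumptions (typical for Qwen-style attention blocks), not read from this repo:

```python
from peft import LoraConfig

# Illustrative mirror of "LoRA (r=16, alpha=32) + unfrozen embeddings"
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    modules_to_save=["embed_tokens", "lm_head"],  # keeps embeddings trainable
    task_type="CAUSAL_LM",
)
```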

## Performance

| Metric | Value |
|--------|-------|
| Val Accuracy (exact match) | 80.0% |
| Val Angular Error | 3.2 degrees |
| Angle Bins | 36 (10-degree steps) |
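
The evaluation script is not included here, but the two headline numbers are consistent under a circular distance; a sketch of one common way to compute the angular-error metric:

```python
import numpy as np

def mean_angular_error_deg(pred_deg: np.ndarray, true_deg: np.ndarray) -> float:
    """Mean circular distance in degrees, each term in [0, 180]."""
    d = np.abs(pred_deg - true_deg) % 360
    return float(np.mean(np.minimum(d, 360 - d)))

# e.g., 80% exact matches (0 error) with misses averaging ~16 degrees
# gives 0.8 * 0 + 0.2 * 16 = 3.2 degrees of mean error
```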

## Usage

### With Prismatic (openvla-mini)

```python
from prismatic import load

vlm = load("path/to/minivla-angle-selector")
```
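
For quick inference after `load(...)`, a minimal sketch assuming the stock `PrismaticVLM.generate(image, prompt_text, ...)` helper from upstream prismatic (verify against your openvla-mini checkout); the generated token must still be mapped back to an angle as in the standalone example below:

```python
import torch
from PIL import Image

vlm.to("cuda", dtype=torch.bfloat16)
vlm.eval()

image = Image.open("drone_view.png")
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn(role="human", message="Navigate the drone to the red cube")

# Greedy-decode one action token; the raw text still needs the token -> angle mapping
generated_text = vlm.generate(image, prompt_builder.get_prompt(), do_sample=False, max_new_tokens=1)
```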

### Standalone

```python
import torch
from PIL import Image

from prismatic.models.materialize import (
    get_vision_backbone_and_transform,
    get_llm_backbone_and_tokenizer,
    get_vlm,
)

# Build the model skeleton (fused DINOv2 + SigLIP vision backbone, Qwen2.5 0.5B LLM)
vision_backbone, _ = get_vision_backbone_and_transform(
    "dinosiglip-vit-so-224px", "resize-naive", image_sequence_len=1
)
llm_backbone, tokenizer = get_llm_backbone_and_tokenizer(
    "qwen25-0_5b-pure", llm_max_length=2048, inference_mode=True,
)
vlm = get_vlm(
    model_id="minivla-angle-selector",
    arch_specifier="no-align+fused-gelu-mlp",
    vision_backbone=vision_backbone,
    llm_backbone=llm_backbone,
)

# Load the fine-tuned weights from a Prismatic-style checkpoint
ckpt = torch.load("checkpoints/latest-checkpoint.pt", map_location="cpu")["model"]
vlm.projector.load_state_dict(ckpt["projector"])
vlm.llm_backbone.load_state_dict(ckpt["llm_backbone"])
vlm.vision_backbone.load_state_dict(ckpt["vision_backbone"])
vlm.to("cuda", dtype=torch.bfloat16)
vlm.eval()

# Build the multimodal prompt
image = Image.open("drone_view.png")
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn("human", "Navigate the drone to the red cube")
input_prompt = prompt_builder.get_prompt()

input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")
pixel_values = vlm.vision_backbone.get_image_transform()(image)
# The fused DINOv2 + SigLIP transform returns a dict of tensors (one per sub-encoder),
# so batch and move each entry rather than indexing the result directly
pixel_values = {
    k: v[None, ...].to("cuda", dtype=torch.bfloat16) for k, v in pixel_values.items()
}

# Greedy-decode a single action token from the logits at the final position
# (the projected image patches are prepended to the text sequence, so the last
# position is still the next-token prediction)
with torch.no_grad():
    output = vlm(input_ids=input_ids, pixel_values=pixel_values, return_dict=True)
    token_id = output.logits[0, -1].argmax().item()

# Map the action token back to an angle (see "Action Space" below)
vocab_size = len(tokenizer)
angle_code = (vocab_size - 1 - token_id) % 36
angle_degrees = angle_code * 10
print(f"Predicted angle: {angle_degrees} degrees")
```

## Action Space

Angles increase from +X toward +Y in 10-degree steps (codes 0-35); the cardinal directions:

| Code | Angle | Direction |
|------|-------|-----------|
| 0 | 0 deg | +X (right) |
| 9 | 90 deg | +Y (forward) |
| 18 | 180 deg | -X (left) |
| 27 | 270 deg | -Y (backward) |

Token mapping: `token_id = vocab_size - 1 - angle_code`
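
The 36 angle tokens therefore occupy the last 36 entries of the vocabulary (code 0 maps to the final token id). A round-trip sketch of the mapping; the `VOCAB_SIZE` constant below is a hypothetical placeholder, use `len(tokenizer)` in practice:

```python
def angle_to_token_id(angle_degrees: int, vocab_size: int) -> int:
    angle_code = (angle_degrees // 10) % 36  # 10-degree bins -> codes 0-35
    return vocab_size - 1 - angle_code

def token_id_to_angle(token_id: int, vocab_size: int) -> int:
    return ((vocab_size - 1 - token_id) % 36) * 10

VOCAB_SIZE = 151_665  # hypothetical placeholder, not the real tokenizer size
assert angle_to_token_id(0, VOCAB_SIZE) == VOCAB_SIZE - 1
assert token_id_to_angle(angle_to_token_id(270, VOCAB_SIZE), VOCAB_SIZE) == 270
```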