---
license: mit
tags:
  - vla
  - robotics
  - drone-navigation
  - prismatic
  - angle-prediction
  - qwen2.5
  - dinov2
  - siglip
library_name: prismatic
---

# MiniVLA Angle Selector

A Vision-Language-Action (VLA) model for drone angle prediction. Given a forward-facing drone camera image and a navigation prompt (e.g., "Navigate to the red cube"), the model predicts a flight direction as one of 36 discrete angles (0-350 degrees, in 10-degree increments).

## Architecture

- **Vision Backbone**: DINOv2 + SigLIP, fused @ 224px
- **LLM Backbone**: Qwen2.5 0.5B
- **Projector**: FusedGeLU MLP (no-align, single-stage)
- **Total Parameters**: ~1.26B (vision + projector + LLM)
- **Inference VRAM**: ~2.5 GB (bf16)

## Training

1. **VLM Pretraining**: Single-stage training on the LLaVA 665k dataset (projector + LLM trained jointly, no separate alignment stage)
2. **Angle Fine-tuning**: LoRA (r=16, alpha=32) with unfrozen embeddings on 21k drone navigation samples

## Performance

| Metric | Value |
|--------|-------|
| Val Accuracy (exact match) | 80.0% |
| Val Angular Error | 3.2 degrees |
| Angle Bins | 36 (10-degree steps) |

## Usage

### With Prismatic (openvla-mini)

```python
from prismatic import load

vlm = load("path/to/minivla-angle-selector")
```

### Standalone

```python
import torch
from PIL import Image

from prismatic.models.materialize import (
    get_vision_backbone_and_transform,
    get_llm_backbone_and_tokenizer,
    get_vlm,
)

# Build the model (DINOv2 + SigLIP vision backbone, Qwen2.5 0.5B LLM backbone)
vision_backbone, _ = get_vision_backbone_and_transform(
    "dinosiglip-vit-so-224px", "resize-naive", image_sequence_len=1
)
llm_backbone, tokenizer = get_llm_backbone_and_tokenizer(
    "qwen25-0_5b-pure",
    llm_max_length=2048,
    inference_mode=True,
)
vlm = get_vlm(
    model_id="minivla-angle-selector",
    arch_specifier="no-align+fused-gelu-mlp",
    vision_backbone=vision_backbone,
    llm_backbone=llm_backbone,
)

# Load weights
ckpt = torch.load("checkpoints/latest-checkpoint.pt", map_location="cpu")["model"]
vlm.projector.load_state_dict(ckpt["projector"])
vlm.llm_backbone.load_state_dict(ckpt["llm_backbone"])
vlm.vision_backbone.load_state_dict(ckpt["vision_backbone"])
vlm.to("cuda", dtype=torch.bfloat16)
vlm.eval()

# Predict an angle from an image
image = Image.open("drone_view.png")
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn("human", "Navigate the drone to the red cube")
input_prompt = prompt_builder.get_prompt()

tok = vlm.llm_backbone.tokenizer
input_ids = tok(input_prompt, return_tensors="pt").input_ids.to("cuda")
pixel_values = vlm.vision_backbone.get_image_transform()(image)
pixel_values = pixel_values[None, ...].to("cuda", dtype=torch.bfloat16)

with torch.no_grad():
    output = vlm.forward(input_ids=input_ids, pixel_values=pixel_values, return_dict=True)

# Take the logits at the last text position (i.e., after the image patch tokens)
num_patches = vlm.vision_backbone.num_patches
action_logit = output.logits[0, num_patches:, :][-1, :]
token_id = action_logit.argmax().item()

# Convert the predicted token to an angle
vocab_size = len(tok)
angle_code = (vocab_size - 1 - token_id) % 36
angle_degrees = angle_code * 10
print(f"Predicted angle: {angle_degrees} degrees")
```

## Action Space

| Code | Angle | Direction |
|------|-------|-----------|
| 0 | 0 deg | +X (right) |
| 9 | 90 deg | +Y (forward) |
| 18 | 180 deg | -X (left) |
| 27 | 270 deg | -Y (backward) |

Token mapping: `token_id = vocab_size - 1 - angle_code`
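
The table and the token mapping above imply a simple bidirectional conversion between angle bins, token ids, and direction vectors. The sketch below (the helper names are illustrative and not part of the released code) shows that conversion, plus an optional variant of the decoding step that restricts the argmax to the 36 reserved angle tokens instead of the full vocabulary:

```python
import math


def angle_code_to_token_id(angle_code: int, vocab_size: int) -> int:
    """Map an angle bin (0-35) to its reserved token id at the end of the vocabulary."""
    return vocab_size - 1 - angle_code


def token_id_to_angle_degrees(token_id: int, vocab_size: int) -> int:
    """Invert the mapping: recover the angle bin and convert it to degrees."""
    angle_code = (vocab_size - 1 - token_id) % 36
    return angle_code * 10


def angle_to_direction(angle_degrees: float) -> tuple[float, float]:
    """Convert an angle to a unit (x, y) direction using the convention in the
    Action Space table (0 deg = +X / right, 90 deg = +Y / forward)."""
    rad = math.radians(angle_degrees)
    return (math.cos(rad), math.sin(rad))


# Optional: given `action_logit` and `vocab_size` from the usage snippet above,
# restrict the argmax to the 36 reserved angle tokens rather than the whole vocabulary:
#
#   angle_token_ids = [angle_code_to_token_id(code, vocab_size) for code in range(36)]
#   best_token_id = max(angle_token_ids, key=lambda tid: action_logit[tid].item())
#   print(token_id_to_angle_degrees(best_token_id, vocab_size))
```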