AIC Robotics Policy: JEPA Vision Encoder + Custom Policy Head

This repository contains the trained weights and deployment code for a complete visual imitation-learning pipeline for the AIC cable-insertion task.

Architecture

This model utilizes a two-stage "Option B" configuration:

  1. Vision Brain (lewm_epoch_74_object.ckpt): A Joint-Embedding Predictive Architecture (JEPA) encoder, pre-trained for 74 epochs on the AIC LeRobot dataset to learn latent representations of robotic maneuvers from camera images.
  2. Reflex Controller (policy_head_aic.pth): A lightweight 3-layer action MLP that maps the 192-dimensional visual latents directly to 6-DOF (degrees of freedom) Tool Center Point (TCP) targets. The head was trained on top of the JEPA backbone, which was kept frozen (its weights were not updated).
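The data flow through the two stages can be sketched as follows. This is a minimal, dependency-free illustration, not the code in inference_policy_aic.py: the frozen encoder is replaced by a hypothetical stub (`fake_jepa_encoder`), and the hidden width of 256 and the ReLU activations are assumptions. Only the 192-dimensional latent, the three layers and the 6-DOF output come from this card.

```python
import math
import random

LATENT_DIM = 192  # from this card
HIDDEN_DIM = 256  # assumption: the real hidden width is not documented
ACTION_DIM = 6    # X, Y, Z, roll, pitch, yaw


def make_linear(in_dim, out_dim, rng):
    """Random weights (out_dim rows of in_dim values), uniform in +/-1/sqrt(in_dim); zero bias."""
    bound = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-bound, bound) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Dense layer: y = W x + b."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class ActionMLP:
    """Hypothetical 3-layer head: 192 -> 256 -> 256 -> 6 (hidden sizes assumed)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.layers = [
            make_linear(LATENT_DIM, HIDDEN_DIM, rng),
            make_linear(HIDDEN_DIM, HIDDEN_DIM, rng),
            make_linear(HIDDEN_DIM, ACTION_DIM, rng),
        ]

    def __call__(self, latent):
        assert len(latent) == LATENT_DIM
        h = relu(linear(latent, self.layers[0]))
        h = relu(linear(h, self.layers[1]))
        return linear(h, self.layers[2])  # no activation: raw 6-DOF regression output


def fake_jepa_encoder(seed):
    """Hypothetical stub for the frozen JEPA encoder: returns a 192-d latent vector."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


head = ActionMLP()
action = head(fake_jepa_encoder(seed=42))
print(len(action))  # 6
```

Because the backbone is frozen, only the head's parameters are updated during training. In PyTorch this is commonly done by setting `requires_grad = False` on the encoder's parameters and putting the encoder in `eval()` mode.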

Performance

  • Inference Latency: ~115 ms per frame, measured on consumer hardware (NVIDIA RTX 4050, 6 GB).
  • Control Rate: ~8-10 Hz control loop.
  • Outputs: linear X/Y/Z and angular roll/pitch/yaw target offsets.
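The control-rate figure follows from the latency: if one inference of about 115 ms is the only work per control step, the loop cannot run faster than 1000 / 115 ≈ 8.7 Hz. The helper below reproduces that arithmetic and compares it with the 20 Hz rate at which the training data was recorded (see Dataset Details). Treating inference as the only per-step cost is a simplifying assumption; image capture and command publishing add overhead in a real loop.

```python
DATASET_HZ = 20.0  # recording rate of the AIC LeRobot dataset (from this card)


def max_control_hz(latency_ms: float) -> float:
    """Upper bound on the loop rate when each step costs `latency_ms` of inference."""
    return 1000.0 / latency_ms


def dataset_frames_per_step(latency_ms: float, dataset_hz: float = DATASET_HZ) -> float:
    """Number of dataset frames that elapse while one inference runs."""
    return latency_ms / (1000.0 / dataset_hz)


latency = 115.0
print(f"{max_control_hz(latency):.1f} Hz")               # 8.7 Hz
print(f"{dataset_frames_per_step(latency):.1f} frames")  # 2.3 frames
```

The quoted 8-10 Hz range corresponds to per-step latencies between 125 ms and 100 ms.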

🚀 Team Deployment Guide

If you are a team member evaluating this model, follow these steps to run the policy (about 115 ms per inference) on your own machine.

Step 1: Environment Setup

First, install the Python packages required to run the JEPA network locally (package versions are not pinned):

# Git LFS is needed to download the large weight files (.ckpt/.pth)
git lfs install

# Clone this Hugging Face repository
git clone https://huggingface.co/Rupesh386/aic-robotics-policy
cd aic-robotics-policy

# Set up a virtual environment with uv and install the dependencies
uv venv
source .venv/bin/activate
uv pip install torch torchvision torchaudio omegaconf einops transformers
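Before moving on, it can help to confirm that the packages are importable from the activated environment. The helper below is not part of this repository; it is a small sketch that uses only the standard library's `importlib.util.find_spec`, so it runs even when some packages are missing.

```python
import importlib.util

# Import names of the packages installed in Step 1.
REQUIRED = ["torch", "torchvision", "torchaudio", "omegaconf", "einops", "transformers"]


def check_packages(names):
    """Return {package_name: True/False} depending on whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    status = check_packages(REQUIRED)
    for name, ok in status.items():
        print(f"{name:<14} {'ok' if ok else 'MISSING'}")
    missing = [n for n, ok in status.items() if not ok]
    print("All packages found." if not missing else f"Missing: {', '.join(missing)}")
```

Save it under any name (for example, a hypothetical check_env.py) and run it with `python check_env.py` inside the activated environment.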

Step 2: Run the Inference Latency Test

The repository includes the Reflex Controller script (inference_policy_aic.py). It loads both the JEPA encoder (lewm_epoch_74_object.ckpt) and the action head (policy_head_aic.pth) automatically.

# Execute the visual inference latency test
uv run python inference_policy_aic.py

Expected Terminal Output: The script loads a sample of 3 consecutive frames of history, then prints the inference time in milliseconds and the raw 6-DOF output (linear X, Y, Z offsets and angular roll, pitch, yaw offsets).
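To reproduce this kind of measurement around your own inference call, the sketch below shows one common pattern: keep a rolling buffer of the last three frames, skip a few warm-up calls, and time the remaining calls with `time.perf_counter`. The `policy` callable, the frame objects and the axis order of the output are hypothetical placeholders, not the API of inference_policy_aic.py. When timing a PyTorch model on a GPU, call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels run asynchronously and the timer would otherwise stop before the work finishes.

```python
import time
from collections import deque
from statistics import mean

HISTORY_LEN = 3  # matches the 3-frame sample above; how the real script uses history is not documented
AXES = ("x", "y", "z", "roll", "pitch", "yaw")  # assumed output order: linear, then angular


class FrameHistory:
    """Rolling window that keeps only the most recent HISTORY_LEN frames."""

    def __init__(self, length=HISTORY_LEN):
        self.frames = deque(maxlen=length)

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame is dropped once the window is full

    def ready(self):
        return len(self.frames) == self.frames.maxlen

    def snapshot(self):
        return list(self.frames)  # oldest first


def label_action(raw):
    """Attach axis names to a raw 6-value policy output."""
    if len(raw) != len(AXES):
        raise ValueError(f"expected {len(AXES)} values, got {len(raw)}")
    return dict(zip(AXES, raw))


def time_policy(policy, frames, warmup=3, runs=20):
    """Mean latency of policy(frames) in milliseconds, excluding warm-up calls."""
    for _ in range(warmup):
        policy(frames)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        policy(frames)
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def dummy_policy(frames):
    """Hypothetical stand-in for the real model: returns zeros for every axis."""
    return [0.0] * 6


history = FrameHistory()
for t in range(5):  # simulate a camera stream of 5 frames
    history.push(f"frame_{t}")
print(history.snapshot())  # ['frame_2', 'frame_3', 'frame_4']
print(label_action(dummy_policy(history.snapshot())))
print(f"{time_policy(dummy_policy, history.snapshot()):.3f} ms")
```

Excluding warm-up calls matters on GPUs, where the first calls typically include one-time costs such as memory allocation and kernel selection.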


Dataset Details

Trained on the AIC LeRobot dataset, recorded at 20 Hz (80 episodes, 22,090 frames in total). The original dataset inputs consist of multi-view AV1-encoded video and continuous kinematic states stored as floats.
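The dataset figures translate into durations with simple arithmetic: at 20 Hz each frame covers 50 ms, so 22,090 frames are about 18.4 minutes of recorded data, and the average episode is about 276 frames, or roughly 13.8 seconds. These are averages only; individual episode lengths are not listed on this card.

```python
TOTAL_FRAMES = 22_090
EPISODES = 80
FPS = 20.0  # dataset recording rate

avg_frames = TOTAL_FRAMES / EPISODES       # 276.125 frames per episode
avg_seconds = avg_frames / FPS             # about 13.8 s per episode
total_minutes = TOTAL_FRAMES / FPS / 60.0  # about 18.4 min of data

print(f"{avg_frames:.1f} frames/episode, {avg_seconds:.1f} s/episode, {total_minutes:.1f} min total")
# 276.1 frames/episode, 13.8 s/episode, 18.4 min total
```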
