AIC Robotics Policy: JEPA Vision Encoder + Custom Policy Head
This repository contains the trained weights and deployment code for a complete visual imitation-learning pipeline tackling the AIC cable-insertion task.
Architecture
This model utilizes a two-stage "Option B" configuration:
- Vision Brain (lewm_epoch_74_object.ckpt): A Joint-Embedding Predictive Architecture (JEPA), pre-trained for 74 epochs on the AIC LeRobot dataset to build a robust latent understanding of robotic maneuvers.
- Reflex Controller (policy_head_aic.pth): A lightning-fast, 3-layer Action MLP that maps the 192-dimensional visual latents straight to precise 6-DOF (degrees of freedom) Tool Center Point coordinates. This head was trained alongside the frozen JEPA backbone.
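To make the shape of the Reflex Controller concrete, here is a minimal PyTorch sketch of a 3-layer MLP that maps a 192-dimensional latent to six outputs. Only the input size, output size, and layer count come from this README. The hidden width (256), the ReLU activations, and the class name `PolicyHeadSketch` are assumptions for illustration, so this module is not guaranteed to match the weights in policy_head_aic.pth.

```python
import torch
from torch import nn

LATENT_DIM = 192   # size of the JEPA visual latent (from this README)
ACTION_DIM = 6     # X, Y, Z, roll, pitch, yaw (from this README)
HIDDEN_DIM = 256   # assumption: the real hidden width is not documented here


class PolicyHeadSketch(nn.Module):
    """Hypothetical 3-layer MLP; not necessarily the layout of policy_head_aic.pth."""

    def __init__(self, latent_dim=LATENT_DIM, hidden_dim=HIDDEN_DIM, action_dim=ACTION_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),  # assumption: activation choice is illustrative
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, latent):
        return self.net(latent)


head = PolicyHeadSketch().eval()
with torch.no_grad():
    # Stand-in for an encoder output: a batch of one 192-dimensional latent.
    action = head(torch.randn(1, LATENT_DIM))
print(tuple(action.shape))  # (1, 6)
```

Because the backbone is frozen, only a small head like this would need gradients during training, which keeps the trainable parameter count low.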
Performance
- Inference Latency: ~115 ms per frame on a consumer GPU (RTX 4050, 6 GB).
- Target Frequency: ~8-10 Hz control loop (115 ms per frame corresponds to roughly 8.7 Hz).
- Output Tracking: Linear X/Y/Z and Angular Roll/Pitch/Yaw target offsets.
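The relationship between the latency and the target frequency can be checked with a few lines of arithmetic. The sketch below assumes inference runs serially, one frame at a time, with no other work in the loop. The variable names are illustrative.

```python
LATENCY_MS = 115.0            # per-frame inference latency quoted above
TARGET_BAND_HZ = (8.0, 10.0)  # target control-loop band quoted above

# Highest rate achievable if each frame is processed back to back.
max_rate_hz = 1000.0 / LATENCY_MS
print(f"max back-to-back rate: {max_rate_hz:.1f} Hz")

for target_hz in TARGET_BAND_HZ:
    period_ms = 1000.0 / target_hz
    slack_ms = period_ms - LATENCY_MS  # negative means the budget is exceeded
    print(f"{target_hz:.0f} Hz -> period {period_ms:.0f} ms, slack {slack_ms:+.0f} ms")
```

Under these assumptions the 8 Hz end of the band leaves about 10 ms of slack per cycle, while the 10 Hz end would need inference to finish in 100 ms or less.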
Team Deployment Guide
If you are a team member evaluating this model, follow these exact steps to run the 115ms predictive controller on your own machine.
Step 1: Environment Setup
First, install the Python packages required to run the JEPA network locally:
# Clone this Hugging Face repository
git clone https://huggingface.co/Rupesh386/aic-robotics-policy
cd aic-robotics-policy
# Set up a virtual environment using uv
uv venv
source .venv/bin/activate
uv pip install torch torchvision torchaudio omegaconf einops transformers
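Before moving on, it can help to confirm that the packages from the install command are visible in the active environment. The following is a small sketch using only the Python standard library. The helper name `missing_packages` is hypothetical and is not part of this repository.

```python
import importlib.util

# Import names of the packages installed in Step 1.
REQUIRED = ["torch", "torchvision", "torchaudio", "omegaconf", "einops", "transformers"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("missing:", missing if missing else "none")
```

If any name is reported as missing, check that the virtual environment is activated and rerun the install command.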
Step 2: Running the 115ms Inference
The repository includes the Reflex Controller script, inference_policy_aic.py, which loads both the JEPA checkpoint (lewm_epoch_74_object.ckpt) and the policy head (policy_head_aic.pth).
# Execute the visual inference latency test
uv run python inference_policy_aic.py
Expected Terminal Output: The script takes a 3-frame history sample, then prints the time in milliseconds it took to compute the target, along with the raw 6-DOF output (linear X/Y/Z and angular roll/pitch/yaw offsets).
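For readers who want to reproduce a latency figure on their own hardware, one common approach is to time repeated calls with `time.perf_counter` after a few warm-up runs. The sketch below is not the code in inference_policy_aic.py. The `time_call_ms` helper and the `fake_inference` workload are placeholders standing in for the real encoder and policy head call.

```python
import statistics
import time


def time_call_ms(fn, warmup=3, runs=20):
    """Return per-call wall-clock times in milliseconds for fn()."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (caches, lazy initialisation)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def fake_inference():
    # Placeholder workload; swap in the real encoder + policy head call.
    return sum(i * i for i in range(50_000))


samples = time_call_ms(fake_inference)
print(f"median {statistics.median(samples):.2f} ms over {len(samples)} runs")
```

When timing a model on a CUDA device, call `torch.cuda.synchronize()` before reading the clock, since GPU kernels are launched asynchronously and the call may otherwise return before the work has finished.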
Dataset Details
Trained on the 20 Hz LeRobot AIC dataset (80 episodes, 22,090 frames in total). The original inputs are multi-angle AV1 video and continuous floating-point kinematic states.
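The dataset figures above imply an average episode length and a total recording time, which the short calculation below works out. It assumes every frame was recorded at a constant 20 Hz and counts each timestep once rather than once per camera angle.

```python
FPS = 20               # dataset frame rate
EPISODES = 80          # number of episodes
TOTAL_FRAMES = 22_090  # total frames across all episodes

frames_per_episode = TOTAL_FRAMES / EPISODES    # 276.125
seconds_per_episode = frames_per_episode / FPS  # about 13.8 s
total_minutes = TOTAL_FRAMES / FPS / 60         # about 18.4 min

print(f"{frames_per_episode:.1f} frames/episode, "
      f"{seconds_per_episode:.1f} s/episode, {total_minutes:.1f} min total")
```

Under those assumptions, an average episode lasts a little under 14 seconds and the whole dataset amounts to roughly 18 minutes of demonstration data.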