# EMMA: Extracting Multiple physical parameters from Multimodal Data
**CVPR 2026**
[Farhat Shaikh](https://scholar.google.com/citations?hl=en&user=mbAOSW0AAAAJ), [Ayan Banerjee](https://scholar.google.com/citations?user=UAlc7tEAAAAJ&hl=en), [Sandeep K. S. Gupta](https://scholar.google.com/citations?user=U9bcQkMAAAAJ&hl=en)
**IMPACT Lab, School of Computing & Augmented Intelligence (SCAI), Arizona State University**
[**Project page**](https://impactlabasu.github.io/EMMA-CVPR2026/) · [**Demo video**](https://youtu.be/Uo79pVlM6Rk)
---
## Overview
EMMA is a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Prior video-only approaches struggle with occluded states and hidden actuation inputs, and they typically assume known initial conditions. EMMA instead jointly infers **explicit parameters**, **implicit dynamical components**, and **calibration invariants** within a unified continuous-time model.
The user supplies the parametric structure of the governing ODE; EMMA solves the inverse problem of recovering its parameters, along with any latent forcing and invariants, from multimodal observations.
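To make the inverse problem concrete, here is a minimal sketch. It is not EMMA's implementation: it assumes the Torricelli drainage ODE has the form dh/dt = -k·sqrt(h), generates synthetic "observations" with the ground-truth value from the results table, and substitutes a brute-force scan over candidate values for the LTC network and differentiable solver. The names `simulate_height` and `fit_k` are hypothetical.

```python
import math


def simulate_height(k, h0, dt, steps):
    """Integrate dh/dt = -k * sqrt(h) with classic RK4 (assumed ODE form)."""
    def rhs(h):
        return -k * math.sqrt(max(h, 0.0))

    h = h0
    heights = [h0]
    for _ in range(steps):
        k1 = rhs(h)
        k2 = rhs(h + 0.5 * dt * k1)
        k3 = rhs(h + 0.5 * dt * k2)
        k4 = rhs(h + dt * k3)
        h = max(h + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0, 0.0)
        heights.append(h)
    return heights


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fit_k(observed, h0, dt, candidates):
    """Return the candidate k whose simulated trajectory best matches the data."""
    steps = len(observed) - 1
    return min(
        candidates,
        key=lambda k: mse(simulate_height(k, h0, dt, steps), observed),
    )


# Synthetic observations; in practice these would come from video tracking.
true_k = 0.0128
observed = simulate_height(true_k, h0=0.30, dt=0.5, steps=40)

# Scan k from 0.0050 to 0.0300 in steps of 0.0001.
candidates = [i * 1e-4 for i in range(50, 301)]
k_hat = fit_k(observed, h0=0.30, dt=0.5, candidates=candidates)
print(f"recovered k = {k_hat:.4f}")
```

The scan only works here because there is a single parameter; with several parameters and latent forcing, a gradient-based fit through a differentiable solver is the more practical route.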
## Key contributions
- **Multimodal dynamical parameter extraction** from video, audio, and time series reconstructed from visual charts.
- **Recovery under unobserved forcing inputs** by inferring latent actuation (e.g. wheel speed) from audio.
- **Estimation of implicit dynamics** associated with unmeasured physical effects (e.g. frictional drag).
- **Invariant calibration from raw video**, eliminating assumptions about known initial conditions or coordinate frames.
- **Extensive validation** on 100+ scenarios: Delfys benchmark (75 videos), real-world rover and quadrotor, and simulation charts.
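
As an illustration of how audio can carry information about a hidden input, the sketch below estimates the dominant frequency of a short signal with a naive discrete Fourier transform. This is an assumption about one plausible front end, not EMMA's audio pipeline; the synthetic 40 Hz tone and the function name `dominant_frequency` are hypothetical, and the mapping from tone frequency to wheel speed is platform-specific and not shown.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC bin of a naive DFT."""
    n = len(samples)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * i / n)
            for i, s in enumerate(samples)
        )
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


# Synthetic stand-in for a motor hum: a pure 40 Hz tone sampled at 1 kHz.
sample_rate = 1000
tone = [math.sin(2 * math.pi * 40.0 * i / sample_rate) for i in range(500)]
freq = dominant_frequency(tone, sample_rate)
print(f"dominant frequency: {freq:.1f} Hz")
```

A real recording would need windowing and a fast FFT; the quadratic-time loop here is only meant to keep the example dependency-free.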
## Architecture
<p align="center"><img src="docs/EMMA-arc.png" alt="EMMA architecture" width="780" /></p>
EMMA follows a three-step pipeline: **Sense · Learn · Verify**.
1. **Sense.** Video, audio, and chart images are converted into time-aligned signals through modality-specific pipelines.
2. **Learn.** A Liquid Time-Constant (LTC) network models the system's latent dynamics in continuous time.
3. **Verify.** A differentiable ODE solver simulates the recovered parameters and checks them against the observations under a physics-informed loss.
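
For readers unfamiliar with LTC networks, the following sketch shows a single scalar LTC neuron advanced with the fused semi-implicit Euler update from the original LTC formulation, where the state follows dx/dt = -(1/τ + f)·x + f·A and f is a bounded nonlinearity of the state and input. The weights, time constant, and input below are arbitrary illustrative values; EMMA's actual network size, parameters, and solver settings are not specified here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ltc_step(x, u, dt, tau, A, w_x, w_u, b):
    """One fused-Euler step of dx/dt = -(1/tau + f) * x + f * A."""
    f = sigmoid(w_x * x + w_u * u + b)  # input- and state-dependent gate
    return (x + dt * f * A) / (1.0 + dt * (1.0 / tau + f))


# Illustrative parameters (hypothetical, not taken from the paper).
tau, A, w_x, w_u, b = 2.0, 1.5, 0.5, 1.0, 0.0

x = 0.0
for _ in range(2000):
    x = ltc_step(x, u=1.0, dt=0.05, tau=tau, A=A, w_x=w_x, w_u=w_u, b=b)

# With a constant input the state settles where x = f * A / (1/tau + f).
f_final = sigmoid(w_x * x + w_u * 1.0 + b)
steady_state = f_final * A / (1.0 / tau + f_final)
print(f"state after 2000 steps: {x:.4f}, steady state: {steady_state:.4f}")
```

Because the gate f multiplies both the decay and the drive, the effective time constant changes with the input, which is what "liquid" refers to.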
## Results
EMMA delivers accurate multi-parameter recovery across diverse physical systems. Full tables and ablations are in the [paper](docs/42612.pdf).
| System | Parameters recovered | EMMA error | Best baseline |
|--------|----------------------|------------|---------------|
| Pendulum (90 cm) | Length *L*, damping *τ* | **L = 0.86 ± 0.07 m** (GT 0.90) | Delfys, PySINDy |
| Torricelli (med.) | Drainage *k* | **0.0132 ± 0.0008** (GT 0.0128) | matches Delfys |
| Sliding block (med.) | Angle *α*, friction *μ* | **α = 24.72°, μ = 0.205** (GT 25°, 0.20) | Delfys, PySINDy |
| LED decay (med.) | γ | **0.91 ± 0.0** (GT 0.92) | matches Delfys |
| Rover | 9 params (5 with known ground truth) | **8.8 % ± 1.7 %** mean error | *first work under hidden forcing* |
| Quadrotor | 12 params (7 with known ground truth) | **15.9 % ± 7.4 %** mean error | *first work under hidden forcing* |
| Simulation charts | Lotka-Volterra, Lorenz, F8 Crusader, HIV, AID | **>10× lower error** than PySINDy on implicit dynamics | PySINDy |
EMMA is compared against **PAIG**, **NIRPI**, and **Delfys** on the video benchmarks and against **PySINDy** on the chart-based simulations.
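To relate the table entries to percentage errors, the sketch below computes the relative error of each reported point estimate against its ground truth. It assumes the plain definition |estimate − GT| / |GT|; the paper may define the rover and quadrotor mean error differently, so treat this as a reading aid rather than the paper's metric.

```python
def relative_error_percent(estimate, ground_truth):
    """Relative error of an estimate with respect to ground truth, in percent."""
    return abs(estimate - ground_truth) / abs(ground_truth) * 100.0


# (estimate, ground truth) pairs copied from the results table above.
rows = {
    "pendulum length L (m)": (0.86, 0.90),
    "Torricelli k": (0.0132, 0.0128),
    "sliding block angle (deg)": (24.72, 25.0),
    "sliding block friction": (0.205, 0.20),
    "LED decay gamma": (0.91, 0.92),
}

errors = {
    name: relative_error_percent(est, gt) for name, (est, gt) in rows.items()
}
for name, err in errors.items():
    print(f"{name}: {err:.2f} %")
```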
## Supported systems
| Category | Systems |
|----------|---------|
| Delfys benchmark | Pendulum, Torricelli drainage, Sliding block, LED decay, Free fall |
| Real-world platforms | Differential-drive rover (9 params), 6-DoF quadrotor (12 params) |
| Simulation charts | Lotka-Volterra, Chaotic Lorenz, F8 Crusader, HIV therapy, AID (Type-1 diabetes) |
## Installation
Tested with **Python 3.10+** on macOS and Linux.
```bash
git clone https://github.com/ImpactLabASU/EMMA-CVPR2026.git
cd EMMA-CVPR2026
python3 -m venv .venv && source .venv/bin/activate # optional but recommended
pip install -r requirements.txt
```
**System tools**
- [FFmpeg](https://ffmpeg.org/) on your `PATH` (MoviePy uses it for audio extraction): `brew install ffmpeg` (macOS) or `sudo apt install ffmpeg` (Ubuntu).
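- YOLO weights (default `yolo11m.pt`): run `pip install ultralytics`; Ultralytics fetches `yolo11m.pt` automatically the first time the model is loaded by name (for example with `yolo predict model=yolo11m.pt`). Alternatively, download the file from the Ultralytics releases page.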
- A CUDA GPU is optional; every script falls back to CPU automatically.
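
Before running a pipeline, a quick pre-flight check along these lines can confirm the system tools listed above. It is a hypothetical helper, not a script shipped with the repository: it looks for `ffmpeg` on `PATH` and reports which device PyTorch would use, defaulting to CPU when PyTorch or CUDA is unavailable.

```python
import shutil


def preflight():
    """Report whether FFmpeg is reachable and which compute device is usable."""
    device = "cpu"
    torch_installed = True
    try:
        import torch  # imported lazily so the check works without PyTorch
    except ImportError:
        torch_installed = False
    else:
        if torch.cuda.is_available():
            device = "cuda"
    return {
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "torch_installed": torch_installed,
        "device": device,
    }


report = preflight()
for key, value in report.items():
    print(f"{key}: {value}")
```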
## Repository layout
| Folder | Purpose | Entry points |
| --- | --- | --- |
| `Baseline/` | Physics-informed EMMA pipelines (Free Fall, LED, Pendulum, Sliding Block, Torricelli) plus ablation utilities. | `FreeFall/free_fall.py`, `LED/led.py`, `Pendulum/run-*.py`, `Sliding block/sliding_block*.py`, `Torricelli/toricelli*.py`, `architecture_ablation.py`, `run_additional_ablations.py` |
| `Rover/` | Rover perception, parameter estimation, multimodal ablations, helper shell script. | `run.py`, `rover-ablation.py`, `rover_multimodal_ablation.py`, `run_rover_ablation.sh` |
| `Drone/` | Drone pipeline orchestrator (vision + audio + EMMA optimization). | `new_run.py` |
| `CGM/` | Continuous glucose monitor chart digitizer. | `extract_cgm_data.py` |
## Data
- **Baseline datasets** come from the Delfys "Physical Parameter Prediction" set on Kaggle (https://www.kaggle.com/datasets/jaswar/physical-parameter-prediction). Download it and copy the experiment folders into `Baseline/`; the scripts discover the data automatically.
- **Sample rover and drone videos** are available here: **[Dropbox](https://www.dropbox.com/scl/fo/cjiym1h53puvv2ml6o8vn/APkfhTz64DnkYkHt554ZPj0?rlkey=hw3odtpzn6vl2nsfbe4pkekcq&dl=0)**. Place them under `Rover/` and `Drone/`.
## Usage
### Baseline pipelines
Each baseline follows the same recipe:
1. `cd Baseline/<Experiment>/`
2. Edit the configuration block inside `main()`:
   - `video_path`: path to the source video; leave empty to reuse existing data files.
   - `weights_path`: YOLO weights (`yolo11m.pt` by default).
   - `pixel_to_meter` (Free Fall, Torricelli, Sliding Block): set from your calibration grid.
   - `output_folder`: a unique run directory (e.g. `run_01`); the script creates `output/` and `data/` under it.
3. Run `python3 <script>.py`.
4. Optional: `python3 <script>.py --simulation-only` skips retraining and reuses the latest `*_coefficients.csv` and `*_emma_final_model.pth` (Free Fall, LED, Pendulum).
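
The recipe above can be pictured with the following sketch. It is an illustration under assumptions, not the code inside any `main()`: the `RunConfig` dataclass, the helper names, and the example grid measurements are hypothetical, while the four setting names and the `--simulation-only` flag come from the steps above.

```python
import argparse
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunConfig:
    video_path: str = ""              # empty: reuse existing data files
    weights_path: str = "yolo11m.pt"
    pixel_to_meter: float = 1.0       # metres per pixel
    output_folder: str = "run_01"


def pixel_to_meter_from_grid(grid_spacing_m, grid_spacing_px):
    """Scale factor from a calibration grid: real spacing over pixel spacing."""
    if grid_spacing_px <= 0:
        raise ValueError("grid spacing in pixels must be positive")
    return grid_spacing_m / grid_spacing_px


def prepare_run_dirs(config, base_dir):
    """Create <base_dir>/<output_folder>/output and .../data."""
    run_dir = Path(base_dir) / config.output_folder
    for name in ("output", "data"):
        (run_dir / name).mkdir(parents=True, exist_ok=True)
    return run_dir


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--simulation-only", action="store_true")
    return parser.parse_args(argv)


# Example: grid squares 5 cm apart measure 125 px in the frame.
scale = pixel_to_meter_from_grid(0.05, 125.0)
config = RunConfig(pixel_to_meter=scale)

with tempfile.TemporaryDirectory() as tmp:
    run_dir = prepare_run_dirs(config, tmp)
    created = sorted(p.name for p in run_dir.iterdir())

args = parse_args(["--simulation-only"])
print(config, created, args.simulation_only)
```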
| Experiment | Script | Key outputs |
| --- | --- | --- |
| Free Fall | `FreeFall/free_fall.py` (`free_fall-m.py` for the medium set) | trajectory CSV, `free_fall_coefficients.csv`, trained model, annotated video |
| LED decay | `LED/led.py` | trajectory CSV, `led_coefficients.csv`, trained model, intensity figures |
| Pendulum | `Pendulum/run-45.py`, `run-90.py`, `run-150.py` | `thetaData.txt`, `omegaData.txt`, `pendulum_coefficients.csv`, trained model |
| Sliding block | `Sliding block/sliding_block.py` (`-low`, `-med` variants) | trajectory CSVs, `sliding_block_coefficients.csv`, trained model |
| Torricelli | `Torricelli/toricelli.py` (`toricelli-m.py`, `torricelli-sm.py`) | height trajectories, `torricelli_coefficients.csv`, trained model |
**PySINDy baselines.** Each experiment folder has `pysindy_results/pysindy.py`; run it from that folder (after the main pipeline has written the EMMA-formatted CSVs) for sparse-regression baselines.
**Ablations.** From `Baseline/`: `python3 architecture_ablation.py` and `python3 run_additional_ablations.py` (require pendulum datasets under `Baseline/Pendulum-EMMA/<angle>_v*/data/`).
### Rover
```bash
cd Rover
# set video_path and weights_path in run.py (see the CONFIGURATION SECTION)
python3 run.py
```
Outputs: `rover_coefficients.csv`, `rover_EMMA_final_model.pth`, plots, GIF. Ablations: `python3 rover-ablation.py`, `python3 rover_multimodal_ablation.py`, or `bash run_rover_ablation.sh` (edit variables first). If you already have processed `data/*.txt`, set `video_path = ""` to skip detection.
### Drone
```bash
cd Drone
EMMA_RUN_ORCHESTRATOR=1 python3 new_run.py --video /path/to/DroneVideo.mp4 --weights /path/to/yolo11m.pt
```
> **Note:** Full orchestration also needs an external `Dronepipeline/` folder containing `droneExtract.py`, `droneExtractAudio.py`, and `EMMA_drone_torch_ltc_optimized.py`. These are not bundled here; without them, `new_run.py` falls back to the local vision-only pipeline.
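
To see in advance which mode a run will use, you can check for the three external files yourself. The helper below is a hypothetical convenience, not part of `new_run.py`, and it assumes `Dronepipeline/` is a plain directory whose location you pass in; where `new_run.py` actually looks for it is not stated here.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = (
    "droneExtract.py",
    "droneExtractAudio.py",
    "EMMA_drone_torch_ltc_optimized.py",
)


def missing_orchestrator_files(pipeline_dir):
    """Return the required orchestrator files absent from pipeline_dir."""
    root = Path(pipeline_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demonstration on temporary directories instead of a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    empty_missing = missing_orchestrator_files(tmp)
    for name in REQUIRED_FILES:
        (Path(tmp) / name).touch()
    complete_missing = missing_orchestrator_files(tmp)

mode = "full orchestration" if not complete_missing else "vision-only fallback"
print(empty_missing, complete_missing, mode)
```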
### CGM chart digitizer
```bash
cd CGM
python3 extract_cgm_data.py # reads CGMData.png, writes cgm_data.txt + a visualization
```
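Chart digitization ultimately rests on mapping pixel coordinates back to data units. The sketch below shows that step in isolation, assuming linear axes and two known tick marks per axis; the pixel positions, tick values, and function name are made up for illustration and do not describe the internals of `extract_cgm_data.py`.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Linear pixel-to-value map defined by two reference ticks on one axis."""
    if pixel_a == pixel_b:
        raise ValueError("reference ticks must be at different pixels")
    slope = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + slope * (pixel - pixel_a)


# Hypothetical calibration ticks read off the chart image.
x_map = make_axis_map(100, 0.0, 900, 24.0)    # time axis, hours
y_map = make_axis_map(500, 40.0, 100, 400.0)  # glucose axis, mg/dL

# Image rows grow downward, so a smaller y pixel means a larger value.
curve_pixels = [(100, 400), (500, 300), (900, 350)]
series = [(x_map(px), y_map(py)) for px, py in curve_pixels]
for t, value in series:
    print(f"t = {t:.1f} h, glucose = {value:.1f} mg/dL")
```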
## Troubleshooting
- **Module not found:** re-run `pip install -r requirements.txt` in the active virtual environment. For `torch`/`torchvision`, use the [PyTorch selector](https://pytorch.org/get-started/locally/).
- **YOLO weights missing:** download `yolo11m.pt` and point `weights_path` to it.
- **FFmpeg errors:** install FFmpeg (`brew install ffmpeg` / `sudo apt install ffmpeg`).
## Citation
```bibtex
@InProceedings{Shaikh_2026_CVPR,
author = {Shaikh, Farhat and Banerjee, Ayan and Gupta, Sandeep},
title = {EMMA: Extracting Multiple physical parameters from Multimodal Data},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2026},
pages = {1716-1725}
}
```
Also on [arXiv](https://arxiv.org/abs/2605.24047).