Imitation Learning Performance Evaluation Package

A GUI-based package that evaluates the performance of imitation-learning-based autonomous manipulation algorithms using 7 quantitative metrics.
It supports the full flow: configuration → test loop (keyboard/mouse) → per-trial record table → final evaluation summary.


Imitation Learning Test

Above: an imitation learning example, a peg-in-hole task on the NIST task board. This package lets you evaluate such imitation-learning results with the 7 metrics (SR, DE, TL, AD, T_avg, TR, SaR).


Evaluation Program

GUI during operation: Main window and Test Control window while running trials.

Program GUI

Evaluation result (7 metrics): Final evaluation result window with SR, DE, TL, AD, T_avg, TR, SaR.

Result


7 Evaluation Metrics

  • SR (Task Success Rate): fraction of trials that succeeded (%)
  • DE (Data Efficiency): efficiency based on the success rate and the number of demonstrations used
  • TL (Learning Efficiency): minutes per successful task, read from a learning-time log (txt); TL = learning time (min) / (SR/100); N/A if the target success rate is not met
  • AD (Adaptability): difference in success rate between the baseline and Out-of-Distribution environments
  • T_avg (Task Time): average task duration per trial (seconds)
  • TR (Real-time Performance): average inference time collected during trials (ms)
  • SaR (Operational Safety): fraction of trials with no safety incident (%)
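The trial-level metrics can be sketched in plain Python. These helpers are illustrative only (they are not the package's actual `compute_*` functions, whose signatures may differ), but they follow the definitions above:

```python
from dataclasses import dataclass

@dataclass
class Trial:
    ood: bool              # True for an Out-of-Distribution trial
    success: bool
    task_time_s: float
    safety_incident: bool

def success_rate(trials):
    """SR: fraction of successful trials, in percent."""
    return 100.0 * sum(t.success for t in trials) / len(trials)

def task_time_avg(trials):
    """T_avg: mean task duration in seconds."""
    return sum(t.task_time_s for t in trials) / len(trials)

def safety_rate(trials):
    """SaR: fraction of trials with no safety incident, in percent."""
    return 100.0 * sum(not t.safety_incident for t in trials) / len(trials)

def adaptability(trials):
    """AD: baseline SR minus Out-of-Distribution SR (percentage points)."""
    base = [t for t in trials if not t.ood]
    ood = [t for t in trials if t.ood]
    return success_rate(base) - success_rate(ood)

def learning_efficiency(minutes, sr, target_sr):
    """TL: learning minutes / (SR/100); None (N/A) when SR is below target."""
    return minutes / (sr / 100.0) if sr >= target_sr else None

trials = [
    Trial(False, True,  12.0, False),
    Trial(False, True,  10.0, False),
    Trial(True,  True,  14.0, False),
    Trial(True,  False, 15.0, True),
]
print(success_rate(trials))   # 75.0
print(adaptability(trials))   # 50.0 (baseline 100% vs OOD 50%)
```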

Quick Start

# From project root (metrics/)
python run_evaluation.py

Or run as a module:

python -m metrics_eval.runner

When the GUI opens, enter the evaluation settings and click Start Test.
For installation and CJK font setup, see INSTALL_ENG.md.


GUI Overview

1. Main Window

  • Top: Evaluation flow diagram image (picture2.png; scale is controlled by IMG_DISPLAY_SCALE in runner.py)
  • Evaluation Settings panel
    • Total test trials, Adaptability trials (Out of Distribution)
    • Number of demo data (for learning), Target task success rate (%)
    • Out of Distribution range (%)
    • Learning time log (txt, minutes): path to a txt file containing a single number (training time in minutes) for TL
  • Start Test button
  • Status message and safety-incident indicator (when F key is pressed)

2. Test Control Window

Opened when you start a test. It provides mouse buttons that mirror the keyboard shortcuts.

  • [S] Start Task: record the trial start time and start inference-time collection
  • [E] End Task: record the trial end time and stop inference-time collection
  • [Y] Success / [N] Fail: mark the trial as a success or a failure
  • [F] Safety Incident: mark a safety incident for this trial (red notice on the main window)

Button style matches the Start Test button on the main window.

3. Per-Trial Results Window (Record Table)

A table that adds one row per trial as the test runs.

Columns: Trial | Adapt. | Success | Task time (s) | Safety issue | Avg real-time (ms)
  • Adapt.: "-" for baseline trials, OOD for Out-of-Distribution trials
  • Safety issue: No / Yes (incident)
  • For many trials, the table grows vertically and can be scrolled.
  • Window width is aligned with the Test Control window.

4. Evaluation Result Window

When all trials are done, an Evaluation Results toplevel opens with a summary of the 7 metrics.

--- Evaluation Results ---
Task success rate (SR): ...
Data efficiency (DE): ...
Learning efficiency (TL): ...
Adaptability (AD): ...
Task time average: ...
Real-time performance (TR, avg inference time): ... ms
Task safety rate (SaR): ...

Input: Keyboard / Mouse

During a test you can use either the keyboard or the Test Control window buttons.

Action            Key   Mouse button
Start task        S     [S] Start Task
End task          E     [E] End Task
Success           Y     [Y] Success
Failure           N     [N] Fail
Safety incident   F     [F] Safety Incident

Typical flow per trial: S → (run task) → E → Y or N. Press F during the task if a safety incident occurs.
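The per-trial key handling can be pictured as a small state machine. This is an illustrative sketch, not the package's actual runner code:

```python
import time

class TrialRecorder:
    """Tracks one trial: S starts it, E stops it, Y/N sets the outcome, F flags safety."""
    def __init__(self):
        self.start = None
        self.duration = None
        self.success = None
        self.safety_incident = False

    def handle_key(self, key):
        key = key.upper()
        if key == "S":                               # start task, reset safety flag
            self.start = time.perf_counter()
            self.safety_incident = False
        elif key == "E" and self.start is not None:  # end task, record duration
            self.duration = time.perf_counter() - self.start
        elif key == "Y":
            self.success = True
        elif key == "N":
            self.success = False
        elif key == "F":                             # safety incident during the task
            self.safety_incident = True

rec = TrialRecorder()
for k in "SEY":  # one successful trial: start, end, mark success
    rec.handle_key(k)
print(rec.success, rec.safety_incident)  # True False
```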


User Code Integration

Learning Efficiency (TL): Learning Time Log (txt)

TL is computed as learning time (minutes) / (SR/100), i.e. minutes per successful task, and only when the current SR meets or exceeds the target success rate.
In the GUI, set Learning time log (txt, minutes) to the path of a txt file that contains a single number: the training time in minutes. Your training script should write this file (e.g. one line, 120.5, for 120.5 minutes). If the file is missing or SR is below the target, TL is shown as N/A.
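For example, a training script could time itself and write the log when it finishes (the filename `learning_time.txt` here is just an illustration; point the GUI field at whatever path you use):

```python
import time

start = time.time()
# ... run your training loop here ...
elapsed_min = (time.time() - start) / 60.0

# The GUI's "Learning time log" field expects a txt file
# containing a single number: the training time in minutes.
with open("learning_time.txt", "w") as f:
    f.write(f"{elapsed_min:.1f}\n")
```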

Real-time Performance (TR): Inference Time Collection

If you call record_inference_time after each inference in your policy loop, that trial’s inference times are collected and their average is used for TR and for the Avg real-time (ms) column in the per-trial table.

import time
from metrics_eval.inference_recorder import record_inference_time

# Per trial: call between [S] and [E]
for step in range(...):
    t0 = time.perf_counter()
    action = policy(obs)
    record_inference_time(time.perf_counter() - t0)

The runner clears the buffer on [S] and uses the average for that trial when [E] is pressed.
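If you prefer, the timing boilerplate can be wrapped in a small context manager. This wrapper is not part of the package; a plain list stands in for `record_inference_time` so the sketch runs standalone:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_inference(record):
    """Time the wrapped block and pass its duration (seconds) to `record`."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        record(time.perf_counter() - t0)

# In the real loop you would pass metrics_eval's record_inference_time;
# a list stands in here so the example is self-contained.
samples = []
with timed_inference(samples.append):
    time.sleep(0.01)  # stand-in for: action = policy(obs)

print(len(samples))  # 1
```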


Project Structure

metrics/
├── run_evaluation.py            # GUI entry point
├── README.md                    # This file (Hugging Face)
├── README_ENG.md                # English (with formulas)
├── README_KOR.md                # Korean (with formulas)
├── INSTALL_ENG.md               # Installation guide (English)
├── INSTALL_KOR.md               # Installation guide (Korean)
├── pictures/
│   ├── video1.mp4               # Video preview
│   ├── program.png              # GUI during operation
│   ├── result.png               # Evaluation result screenshot
│   └── picture2.png             # Flow diagram
├── metrics_eval/
│   ├── __init__.py              # Exports 7 metric functions
│   ├── task_success_rate.py     # SR
│   ├── data_efficiency.py       # DE
│   ├── learning_efficiency.py   # TL
│   ├── adaptability.py          # AD
│   ├── task_time.py             # T_avg
│   ├── realtime_performance.py  # TR
│   ├── operational_safety.py    # SaR
│   ├── inference_recorder.py    # record_inference_time (TR)
│   └── runner.py                # GUI, test loop, result display
└── docs/
    └── design_metrics_evaluation.md

To use only the metric functions, import from metrics_eval:

from metrics_eval import (
    compute_sr,
    compute_de,
    compute_tl_display,
    compute_ad,
    compute_task_time_avg,
    compute_tr,
    compute_sar,
)

Summary

  • Run: python run_evaluation.py
  • Input: Evaluation settings → Start Test → per trial: S → E → Y/N (and F if needed)
  • Display: Test Control window (mouse), per-trial table (live), evaluation result window (7 metrics)
  • TL: Learning time log (txt with one number, minutes) → TL = time / (SR/100) when SR ≥ target
  • TR: record_inference_time (between S and E) → TR and per-trial average real-time

For installation and environment setup, see INSTALL_ENG.md.

For detailed explanations of the 7 quantitative metrics (formulas and symbol definitions), see README_ENG.md.
