# Imitation Learning Performance Evaluation Package
A GUI-based package that evaluates the performance of imitation-learning-based autonomous manipulation algorithms using 7 quantitative metrics.
It supports the full flow: configuration → test loop (keyboard/mouse) → per-trial record table → final evaluation summary.
Imitation Learning Test

Above: imitation-learning example, a peg-in-hole task on the NIST task board. This package lets you evaluate such imitation-learning results with the 7 metrics (SR, DE, TL, AD, T_avg, TR, SaR).
## Evaluation Program

GUI during operation: Main window and Test Control window while running trials.

Evaluation result (7 metrics): final evaluation result window with SR, DE, TL, AD, T_avg, TR, SaR.
## 7 Evaluation Metrics
| Metric | Description |
|---|---|
| SR (Task Success Rate) | Fraction of trials that succeeded (%) |
| DE (Data Efficiency) | Success rate relative to the number of demonstration data used for learning |
| TL (Learning Efficiency) | Learning time (minutes) per unit success rate, read from a learning-time log (txt): TL = learning time (min) / (SR/100); N/A if the target success rate is not met |
| AD (Adaptability) | Difference in success rate between baseline and Out-of-Distribution environments |
| T_avg (Task Time) | Average task duration per trial (seconds) |
| TR (Real-time Performance) | Average inference time collected during trials (ms) |
| SaR (Operational Safety) | Fraction of trials with no safety incident (%) |
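As a reference for how the percentage-style metrics in the table combine, here is a minimal sketch with hypothetical helper names; the package's actual implementations live in `metrics_eval/` and may differ:

```python
# Hypothetical sketches of the simpler metrics above; the package's real
# implementations live in metrics_eval/ and may differ in detail.

def sr(successes: int, trials: int) -> float:
    """SR: fraction of trials that succeeded, as a percentage."""
    return 100.0 * successes / trials

def sar(incidents: int, trials: int) -> float:
    """SaR: fraction of trials with no safety incident, as a percentage."""
    return 100.0 * (trials - incidents) / trials

def t_avg(durations_s: list[float]) -> float:
    """T_avg: average task duration per trial, in seconds."""
    return sum(durations_s) / len(durations_s)

def ad(sr_baseline: float, sr_ood: float) -> float:
    """AD: success-rate difference between baseline and OOD trials
    (sign convention assumed here: baseline minus OOD)."""
    return sr_baseline - sr_ood
```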
## Quick Start
```bash
# From project root (metrics/)
python run_evaluation.py
```

Or run as a module:

```bash
python -m metrics_eval.runner
```
When the GUI opens, enter the evaluation settings and click Start Test.
For installation and CJK font setup, see INSTALL_ENG.md.
## GUI Overview

### 1. Main Window
- Top: evaluation flow diagram image (`picture2.png`; scale is controlled by `IMG_DISPLAY_SCALE` in `runner.py`)
- Evaluation Settings panel
  - Total test trials, Adaptability trials (Out of Distribution)
  - Number of demo data (for learning), Target task success rate (%)
  - Out of Distribution range (%)
  - Learning time log (txt, minutes): path to a txt file containing a single number (training time in minutes) for TL
- Start Test button
- Status message and safety-incident indicator (shown when the F key is pressed)
### 2. Test Control Window
Opened when you start a test. It provides mouse buttons that mirror the keyboard.
- [S] Start Task → record trial start time, start inference-time collection
- [E] End Task → record trial end time, stop inference-time collection
- [Y] Success / [N] Fail → mark the trial as success or failure
- [F] Safety Incident → mark a safety incident for this trial (red notice on main window)
Button style matches the Start Test button on the main window.
### 3. Per-Trial Results Window (Record Table)
A table that adds one row per trial as the test runs.
| Trial | Adapt. | Success | Task time (s) | Safety issue | Avg real-time (ms) |
|---|---|---|---|---|---|
- Adapt.: `–` for baseline trials, `OOD` for Out-of-Distribution trials
- Safety issue: `No` / `Yes` (incident)
- For many trials, the table grows vertically and can be scrolled.
- Window width is aligned with the Test Control window.
### 4. Evaluation Result Window
When all trials are done, an Evaluation Results toplevel opens with a summary of the 7 metrics.
```
─── Evaluation Results ───
Task success rate (SR): ...
Data efficiency (DE): ...
Learning efficiency (TL): ...
Adaptability (AD): ...
Task time average: ...
Real-time performance (TR, avg inference time): ... ms
Task safety rate (SaR): ...
```
## Input: Keyboard / Mouse
During a test you can use either the keyboard or the Test Control window buttons.
| Action | Key | Mouse |
|---|---|---|
| Start task | S | [S] Start Task |
| End task | E | [E] End Task |
| Success | Y | [Y] Success |
| Failure | N | [N] Fail |
| Safety incident | F | [F] Safety Incident |
Typical flow per trial: S → (run task) → E → Y or N. Press F during the task if a safety incident occurs.
## User Code Integration

### Learning Efficiency (TL): Learning Time Log (txt)
TL is computed as learning time (minutes) / (task success rate as a decimal), i.e. minutes of training per unit success rate, and only when the current SR meets or exceeds the target success rate.
In the GUI, set Learning time log (txt, minutes) to the path of a txt file that contains a single number (the training time in minutes). Your training script should write this file (e.g. one line: 120.5 for 120.5 minutes). If the file is missing or SR is below target, TL is shown as N/A.
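The TL rule above (divide by SR as a decimal, fall back to N/A) can be sketched as follows — the function name here is hypothetical; the package's own TL code lives in `metrics_eval/learning_efficiency.py`:

```python
def tl_minutes_per_success(log_path, sr_percent, target_sr_percent):
    """Hypothetical sketch of the TL rule: learning time (min) / (SR/100),
    or None (displayed as N/A) when the log is missing/unreadable or the
    current SR is below the target success rate."""
    try:
        with open(log_path) as f:
            learning_minutes = float(f.read().strip())
    except (OSError, ValueError):
        return None  # missing or malformed log file -> N/A
    if sr_percent < target_sr_percent or sr_percent <= 0:
        return None  # target success rate not met -> N/A
    return learning_minutes / (sr_percent / 100.0)
```

For example, a log containing `120.5` with SR = 80% and a 70% target yields 120.5 / 0.8 ≈ 150.6 minutes per unit success rate.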
### Real-time Performance (TR): Inference Time Collection
If you call `record_inference_time` after each inference in your policy loop, that trial's inference times are collected and their average is used for TR and for the Avg real-time (ms) column in the per-trial table.
```python
import time

from metrics_eval.inference_recorder import record_inference_time

# Per trial: call between [S] and [E]
for step in range(...):
    t0 = time.perf_counter()
    action = policy(obs)
    record_inference_time(time.perf_counter() - t0)
```
The runner clears the buffer on [S] and uses the average for that trial when [E] is pressed.
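The clear-on-[S] / average-on-[E] pattern can be sketched as a small buffer — a hypothetical stand-in; the shipped implementation is `metrics_eval/inference_recorder.py`:

```python
# Hypothetical sketch of the recorder pattern described above; the package
# ships its own implementation in metrics_eval/inference_recorder.py.
class InferenceRecorder:
    def __init__(self):
        self._times_ms = []

    def clear(self):
        """Runner calls this on [S]: drop times from the previous trial."""
        self._times_ms.clear()

    def record(self, seconds):
        """Policy loop calls this once per inference, in seconds."""
        self._times_ms.append(seconds * 1000.0)

    def average_ms(self):
        """Runner reads this on [E]: mean inference time for the trial."""
        return sum(self._times_ms) / len(self._times_ms) if self._times_ms else 0.0
```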
## Project Structure
```
metrics/
├── run_evaluation.py            # GUI entry point
├── README.md                    # This file (Hugging Face)
├── README_ENG.md                # English (with formulas)
├── README_KOR.md                # Korean (with formulas)
├── INSTALL_ENG.md               # Installation guide (English)
├── INSTALL_KOR.md               # Installation guide (Korean)
├── pictures/
│   ├── video1.mp4               # Video preview
│   ├── program.png              # GUI during operation
│   ├── result.png               # Evaluation result screenshot
│   └── picture2.png             # Flow diagram
├── metrics_eval/
│   ├── __init__.py              # Exports 7 metric functions
│   ├── task_success_rate.py     # SR
│   ├── data_efficiency.py       # DE
│   ├── learning_efficiency.py   # TL
│   ├── adaptability.py          # AD
│   ├── task_time.py             # T_avg
│   ├── realtime_performance.py  # TR
│   ├── operational_safety.py    # SaR
│   ├── inference_recorder.py    # record_inference_time (TR)
│   └── runner.py                # GUI, test loop, result display
└── docs/
    └── design_metrics_evaluation.md
```
To use only the metric functions, import from `metrics_eval`:
```python
from metrics_eval import (
    compute_sr,
    compute_de,
    compute_tl_display,
    compute_ad,
    compute_task_time_avg,
    compute_tr,
    compute_sar,
)
```
## Summary

- Run: `python run_evaluation.py`
- Input: evaluation settings → Start Test → per trial: S → E → Y/N (and F if needed)
- Display: Test Control window (mouse), per-trial table (live), evaluation result window (7 metrics)
- TL: learning time log (txt with one number, minutes) → TL = time / (SR/100) when SR ≥ target
- TR: `record_inference_time` (between S and E) → TR and per-trial average real-time
For installation and environment setup, see INSTALL_ENG.md.
For detailed explanations of the 7 quantitative metrics (formulas and symbol definitions), see README_ENG.md.

