# Imitation Learning Performance Evaluation Package
A GUI-based package that evaluates the performance of imitation-learning-based autonomous manipulation algorithms using 7 quantitative metrics.
It supports the full flow: configuration → test loop (keyboard/mouse) → per-trial record table → final evaluation summary.
Imitation Learning Test

Above: imitation-learning example, a peg-in-hole task on the NIST task board. This package lets you evaluate such imitation-learning results with the 7 metrics (SR, DE, TL, AD, T_avg, TR, SaR).
## Evaluation Program

GUI during operation: Main window and Test Control window while running trials.

Evaluation result (7 metrics): final evaluation result window with SR, DE, TL, AD, T_avg, TR, SaR.
## 7 Evaluation Metrics
| Metric | Description |
|---|---|
| SR (Task Success Rate) | Fraction of trials that succeeded (%) |
| DE (Data Efficiency) | Success rate relative to the number of demonstration data used for learning |
| TL (Learning Efficiency) | Learning time (minutes) per unit success rate, read from a learning-time log (txt): TL = learning time (min) / (SR/100); N/A if the target success rate is not met |
| AD (Adaptability) | Difference in success rate between baseline and Out-of-Distribution environments |
| T_avg (Task Time) | Average task duration per trial (seconds) |
| TR (Real-time Performance) | Average inference time collected during trials (ms) |
| SaR (Operational Safety) | Fraction of trials with no safety incident (%) |
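As a reference for how the percentage-style metrics in the table combine, here is a minimal sketch with hypothetical helper names; the package's actual implementations live in `metrics_eval/` and may differ:

```python
# Hypothetical sketches of the simpler metrics above; the package's real
# implementations live in metrics_eval/ and may differ in detail.

def sr(successes: int, trials: int) -> float:
    """SR: fraction of trials that succeeded, as a percentage."""
    return 100.0 * successes / trials

def sar(incidents: int, trials: int) -> float:
    """SaR: fraction of trials with no safety incident, as a percentage."""
    return 100.0 * (trials - incidents) / trials

def t_avg(durations_s: list[float]) -> float:
    """T_avg: average task duration per trial, in seconds."""
    return sum(durations_s) / len(durations_s)

def ad(sr_baseline: float, sr_ood: float) -> float:
    """AD: success-rate difference between baseline and OOD trials
    (sign convention assumed here: baseline minus OOD)."""
    return sr_baseline - sr_ood
```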
## Quick Start
```bash
# From project root (metrics/)
python run_evaluation.py
```

Or run as a module:

```bash
python -m metrics_eval.runner
```
When the GUI opens, enter the evaluation settings and click Start Test.
For installation and CJK font setup, see INSTALL_ENG.md.
## GUI Overview

### 1. Main Window
- Top: evaluation flow diagram image (`picture2.png`; scale is controlled by `IMG_DISPLAY_SCALE` in `runner.py`)
- Evaluation Settings panel
  - Total test trials, Adaptability trials (Out of Distribution)
  - Number of demo data (for learning), Target task success rate (%)
  - Out of Distribution range (%)
  - Learning time log (txt, minutes): path to a txt file containing a single number (training time in minutes) for TL
- Start Test button
- Status message and safety-incident indicator (shown when the F key is pressed)
### 2. Test Control Window
Opened when you start a test. It provides mouse buttons that mirror the keyboard.
- [S] Start Task → record trial start time, start inference-time collection
- [E] End Task → record trial end time, stop inference-time collection
- [Y] Success / [N] Fail → mark the trial as success or failure
- [F] Safety Incident → mark a safety incident for this trial (red notice on main window)
Button style matches the Start Test button on the main window.
### 3. Per-Trial Results Window (Record Table)
A table that adds one row per trial as the test runs.
| Trial | Adapt. | Success | Task time (s) | Safety issue | Avg real-time (ms) |
|---|---|---|---|---|---|
- Adapt.: `–` for baseline trials, `OOD` for Out-of-Distribution trials
- Safety issue: `No` / `Yes` (incident)
- For many trials, the table grows vertically and can be scrolled.
- Window width is aligned with the Test Control window.
### 4. Evaluation Result Window
When all trials are done, an Evaluation Results toplevel opens with a summary of the 7 metrics.
```
─── Evaluation Results ───
Task success rate (SR): ...
Data efficiency (DE): ...
Learning efficiency (TL): ...
Adaptability (AD): ...
Task time average: ...
Real-time performance (TR, avg inference time): ... ms
Task safety rate (SaR): ...
```
## Input: Keyboard / Mouse
During a test you can use either the keyboard or the Test Control window buttons.
| Action | Key | Mouse |
|---|---|---|
| Start task | S | [S] Start Task |
| End task | E | [E] End Task |
| Success | Y | [Y] Success |
| Failure | N | [N] Fail |
| Safety incident | F | [F] Safety Incident |
Typical flow per trial: S → (run task) → E → Y or N. Press F during the task if a safety incident occurs.
## User Code Integration

### Learning Efficiency (TL): Learning Time Log (txt)
TL is computed as learning time (minutes) / (task success rate as a decimal), i.e. minutes of training per unit success rate, and only when the current SR meets or exceeds the target success rate.
In the GUI, set Learning time log (txt, minutes) to the path of a txt file that contains a single number (the training time in minutes). Your training script should write this file (e.g. one line: 120.5 for 120.5 minutes). If the file is missing or SR is below target, TL is shown as N/A.
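The TL rule above (divide by SR as a decimal, fall back to N/A) can be sketched as follows — the function name here is hypothetical; the package's own TL code lives in `metrics_eval/learning_efficiency.py`:

```python
def tl_minutes_per_success(log_path, sr_percent, target_sr_percent):
    """Hypothetical sketch of the TL rule: learning time (min) / (SR/100),
    or None (displayed as N/A) when the log is missing/unreadable or the
    current SR is below the target success rate."""
    try:
        with open(log_path) as f:
            learning_minutes = float(f.read().strip())
    except (OSError, ValueError):
        return None  # missing or malformed log file -> N/A
    if sr_percent < target_sr_percent or sr_percent <= 0:
        return None  # target success rate not met -> N/A
    return learning_minutes / (sr_percent / 100.0)
```

For example, a log containing `120.5` with SR = 80% and a 70% target yields 120.5 / 0.8 ≈ 150.6 minutes per unit success rate.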
### Real-time Performance (TR): Inference Time Collection
If you call `record_inference_time` after each inference in your policy loop, that trial's inference times are collected and their average is used for TR and for the Avg real-time (ms) column in the per-trial table.
```python
import time

from metrics_eval.inference_recorder import record_inference_time

# Per trial: call between [S] and [E]
for step in range(...):
    t0 = time.perf_counter()
    action = policy(obs)
    record_inference_time(time.perf_counter() - t0)
```
The runner clears the buffer on [S] and uses the average for that trial when [E] is pressed.
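The clear-on-[S] / average-on-[E] pattern can be sketched as a small buffer — a hypothetical stand-in; the shipped implementation is `metrics_eval/inference_recorder.py`:

```python
# Hypothetical sketch of the recorder pattern described above; the package
# ships its own implementation in metrics_eval/inference_recorder.py.
class InferenceRecorder:
    def __init__(self):
        self._times_ms = []

    def clear(self):
        """Runner calls this on [S]: drop times from the previous trial."""
        self._times_ms.clear()

    def record(self, seconds):
        """Policy loop calls this once per inference, in seconds."""
        self._times_ms.append(seconds * 1000.0)

    def average_ms(self):
        """Runner reads this on [E]: mean inference time for the trial."""
        return sum(self._times_ms) / len(self._times_ms) if self._times_ms else 0.0
```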
## Project Structure
```
metrics/
├── run_evaluation.py            # GUI entry point
├── README.md                    # This file (Hugging Face)
├── README_ENG.md                # English (with formulas)
├── README_KOR.md                # Korean (with formulas)
├── INSTALL_ENG.md               # Installation guide (English)
├── INSTALL_KOR.md               # Installation guide (Korean)
├── pictures/
│   ├── video1.mp4               # Video preview
│   ├── program.png              # GUI during operation
│   ├── result.png               # Evaluation result screenshot
│   └── picture2.png             # Flow diagram
├── metrics_eval/
│   ├── __init__.py              # Exports 7 metric functions
│   ├── task_success_rate.py     # SR
│   ├── data_efficiency.py       # DE
│   ├── learning_efficiency.py   # TL
│   ├── adaptability.py          # AD
│   ├── task_time.py             # T_avg
│   ├── realtime_performance.py  # TR
│   ├── operational_safety.py    # SaR
│   ├── inference_recorder.py    # record_inference_time (TR)
│   └── runner.py                # GUI, test loop, result display
└── docs/
    └── design_metrics_evaluation.md
```
To use only the metric functions, import from `metrics_eval`:
```python
from metrics_eval import (
    compute_sr,
    compute_de,
    compute_tl_display,
    compute_ad,
    compute_task_time_avg,
    compute_tr,
    compute_sar,
)
```
## Summary

- Run: `python run_evaluation.py`
- Input: evaluation settings → Start Test → per trial: S → E → Y/N (and F if needed)
- Display: Test Control window (mouse), per-trial table (live), evaluation result window (7 metrics)
- TL: learning time log (txt with one number, minutes) → TL = time / (SR/100) when SR ≥ target
- TR: `record_inference_time` (between S and E) → TR and per-trial average real-time
For installation and environment setup, see INSTALL_ENG.md.
For detailed explanations of the 7 quantitative metrics (formulas and symbol definitions), see README_ENG.md.

