
LLM Inference Evaluation Guide

Overview

This guide explains how a large language model (LLM) is evaluated when it interacts with the Energy & Memory RAM Optimization environment: which actions it produces, how much reward it earns, and how the task graders score the final state.


What Gets Evaluated

When you run LLM inference, the following components are evaluated:

1. Action Quality

  • What: Whether the LLM produces valid actions in the format action_type,intensity
  • How: The parser checks that action_type ∈ ["reduce_ram", "optimize_energy", "balance_resources", "monitor_system"] and intensity ∈ [0.0, 1.0]
  • Grade: Action validity rate (valid actions ÷ total actions)
  • Example Good: reduce_ram,0.8 (valid)
  • Example Bad: invalid_action,2.5 (invalid - gets converted to monitor_system,0.5)
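
The validation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual parser: the names `parse_action` and `validity_rate` are hypothetical, and only the action list, the intensity range, and the monitor_system,0.5 fallback come from this guide.

```python
from typing import Tuple

# Action names and the fallback come from the guide; function names are hypothetical.
VALID_ACTIONS = ("reduce_ram", "optimize_energy", "balance_resources", "monitor_system")
FALLBACK = ("monitor_system", 0.5)


def parse_action(text: str) -> Tuple[str, float, bool]:
    """Parse 'action_type,intensity'; return (action_type, intensity, was_valid)."""
    parts = [part.strip() for part in text.strip().split(",")]
    if len(parts) != 2:
        return FALLBACK + (False,)
    action_type, raw_intensity = parts
    try:
        intensity = float(raw_intensity)
    except ValueError:
        return FALLBACK + (False,)
    # NaN fails the range comparison, so it is rejected as well.
    if action_type not in VALID_ACTIONS or not (0.0 <= intensity <= 1.0):
        return FALLBACK + (False,)
    return (action_type, intensity, True)


def validity_rate(replies):
    """Fraction of replies that parse as valid actions (0.0 for no replies)."""
    if not replies:
        return 0.0
    return sum(parse_action(reply)[2] for reply in replies) / len(replies)


print(parse_action("reduce_ram,0.8"))      # ('reduce_ram', 0.8, True)
print(parse_action("invalid_action,2.5"))  # ('monitor_system', 0.5, False)
```

Returning the validity flag alongside the fallback action lets the caller keep the episode running while still counting the reply as invalid.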

2. Reward Accumulation

  • What: Total reward achieved across all steps
  • How: Each step earns reward = intensity × 0.1
    • Plus a task completion bonus (difficulty × 0.5) when the task is completed
    • The result is clamped to the 0-1 range
  • Grade Metrics:
    • Total reward (sum of all step rewards)
    • Average reward per step
    • Reward trend (increasing/decreasing)
    • Max and min rewards achieved
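
As a rough sketch of the reward rule and the summary metrics listed above, assuming the bonus is added in the step that completes the task and the clamp applies to that step's total (the guide does not state the order). The function names and the trend definition (later half versus earlier half) are illustrative assumptions.

```python
from statistics import mean


def step_reward(intensity, difficulty, task_completed):
    """Base reward plus optional completion bonus, clamped to [0, 1]."""
    reward = intensity * 0.1
    if task_completed:
        reward += difficulty * 0.5
    return max(0.0, min(1.0, reward))


def summarize_rewards(rewards):
    """Total, average, min, max, and a coarse trend for a non-empty reward list."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    half = len(rewards) // 2
    if half == 0:
        trend = "flat"  # a single step has no trend
    else:
        # Assumed trend definition: mean of the later half minus mean of the earlier half.
        delta = mean(rewards[half:]) - mean(rewards[:half])
        if delta > 1e-9:
            trend = "increasing"
        elif delta < -1e-9:
            trend = "decreasing"
        else:
            trend = "flat"
    return {
        "total": sum(rewards),
        "average": mean(rewards),
        "min": min(rewards),
        "max": max(rewards),
        "trend": trend,
    }
```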

3. Resource Optimization

  • What: How much RAM and Energy the LLM actually reduced
  • How: Compare initial vs final state
  • Grade Metrics:
    • RAM reduction: Initial 80% → Final X%
    • Energy reduction: Initial 8.0 kWh → Final X kWh
    • Efficiency: Resources saved ÷ Steps taken
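
A small sketch of these metrics follows. The guide does not say how RAM and energy savings are combined into one efficiency number, so this version reports them separately per step; the function name and the returned keys are hypothetical.

```python
def resource_efficiency(initial_ram, final_ram, initial_energy, final_energy, steps):
    """Resources saved and savings per step (RAM in percentage points, energy in kWh)."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    ram_saved = initial_ram - final_ram
    energy_saved = initial_energy - final_energy
    return {
        "ram_saved": ram_saved,
        "energy_saved": energy_saved,
        "ram_saved_per_step": ram_saved / steps,
        "energy_saved_per_step": energy_saved / steps,
    }


# Illustrative values: 80% -> 50% RAM and 8.0 -> 4.5 kWh over 5 steps.
print(resource_efficiency(80.0, 50.0, 8.0, 4.5, 5))
```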

4. Task Completion

  • What: Whether the LLM completed the assigned task
  • How: Environment checks if current state meets task targets within max_steps
  • Example Task: Task 1 (basic_ram_reduction)
    • Target: RAM < 70%, Energy < 7.5 kWh
    • Deadline: 10 steps max
    • Success: If both targets met within 10 steps → Reward bonus + task marked complete
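
The completion check can be pictured as follows. This is a minimal sketch that takes the success rule above literally (both targets met, within the step limit); `TaskTarget` and `is_task_complete` are hypothetical names, not the environment's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTarget:
    """Targets for one task; field names are illustrative."""
    name: str
    ram_target: float      # RAM usage must be strictly below this (%)
    energy_target: float   # energy must be strictly below this (kWh)
    max_steps: int


def is_task_complete(task, ram_usage, energy_kwh, steps_taken):
    """True when both targets are met within the step limit."""
    return (
        ram_usage < task.ram_target
        and energy_kwh < task.energy_target
        and steps_taken <= task.max_steps
    )


TASK_1 = TaskTarget("basic_ram_reduction", ram_target=70.0, energy_target=7.5, max_steps=10)
print(is_task_complete(TASK_1, ram_usage=67.0, energy_kwh=7.2, steps_taken=3))  # True
```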

5. Grader Score (0.001 - 0.999)

  • What: Task-specific evaluation score from the grader
  • How: Grader function evaluates final observation against task targets
  • Formula Example (Task 1):
    RAM score = (100 - final_RAM) / (100 - 70)  [0-1]
    Energy score = (10 - final_energy) / (10 - 7.5)  [0-1]
    Efficiency = bonus for steps taken within limit
    Final = (RAM_score × 0.4) + (Energy_score × 0.4) + (Efficiency × 0.2)
    Clamped to [0.001, 0.999]
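
To make the arithmetic concrete, here is a worked example of the Task 1 formula for a hypothetical final state (RAM 76%, energy 8.5 kWh, finished within the step limit so the efficiency term is 1.0). The helper names are illustrative, and the sub-scores are clamped to [0, 1] as the formula's range annotations suggest.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def task_1_score(final_ram, final_energy, efficiency):
    """Weighted Task 1 score from the formula in this guide."""
    ram_score = clamp((100 - final_ram) / (100 - 70), 0.0, 1.0)
    energy_score = clamp((10 - final_energy) / (10 - 7.5), 0.0, 1.0)
    final = ram_score * 0.4 + energy_score * 0.4 + efficiency * 0.2
    return clamp(final, 0.001, 0.999)


# ram_score    = (100 - 76) / 30   = 0.8
# energy_score = (10 - 8.5) / 2.5  = 0.6
# final        = 0.8 * 0.4 + 0.6 * 0.4 + 1.0 * 0.2 = 0.76
print(round(task_1_score(76.0, 8.5, 1.0), 3))  # 0.76
```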
    

6. Decision-Making Efficiency

  • What: How quickly and thoughtfully the LLM makes good decisions
  • How: Track history of actions and rewards
  • Grade Metrics:
    • Time-to-first-good-action
    • Convergence speed (steps to reach target)
    • Backtracking frequency (bad decision reversals)
    • Adaptive behavior (does agent improve over time?)
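
The guide does not define these metrics precisely, so the sketch below fixes one possible reading: a "good" action is a step whose reward reaches a hypothetical threshold, convergence is the first step at which both targets are met, and backtracking counts steps where RAM or energy got worse than in the previous step. The history format (a list of dicts with `reward`, `ram`, and `energy` keys) is also an assumption.

```python
def decision_metrics(history, ram_target, energy_target, good_reward=0.05):
    """Derive simple decision-quality metrics from a per-step history."""
    # 1-based index of the first step whose reward reaches the threshold.
    first_good = next(
        (i for i, step in enumerate(history, 1) if step["reward"] >= good_reward),
        None,
    )
    # 1-based index of the first step at which both targets are met.
    converged_at = next(
        (
            i
            for i, step in enumerate(history, 1)
            if step["ram"] < ram_target and step["energy"] < energy_target
        ),
        None,
    )
    # Steps where either resource moved in the wrong direction.
    regressions = sum(
        1
        for prev, cur in zip(history, history[1:])
        if cur["ram"] > prev["ram"] or cur["energy"] > prev["energy"]
    )
    return {
        "time_to_first_good_action": first_good,
        "convergence_step": converged_at,
        "regressions": regressions,
    }
```

A `None` value means the event never happened during the episode, which keeps "never converged" distinct from "converged at step 0".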

Understanding the Tasks & Difficulty

Each task has explicit graders that measure performance:

| Task | Difficulty | RAM Target | Energy Target | Max Steps | Grader Name |
|---|---|---|---|---|---|
| basic_ram_reduction | 1 | < 70% | < 7.5 kWh | 10 | task_1_basic_ram_reduction_grader |
| energy_optimization | 2 | < 75% | < 6.0 kWh | 15 | task_2_energy_optimization_grader |
| balanced_optimization | 3 | < 60% | < 5.0 kWh | 20 | task_3_balanced_optimization_grader |
| advanced_efficiency | 4 | < 50% | < 4.0 kWh | 25 | task_4_advanced_efficiency_grader |
| expert_optimization | 5 | < 40% | < 3.0 kWh | 30 | task_5_expert_optimization_grader |
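
For scripting, the same table can be held as plain data. The dictionary layout and helper names below are illustrative only (the real project may store tasks differently); the values are copied from the table, and the grader names follow the task_<difficulty>_<task>_grader pattern visible there.

```python
# Task parameters copied from the table above; the structure itself is an assumption.
TASKS = {
    "basic_ram_reduction": {"difficulty": 1, "ram_target": 70.0, "energy_target": 7.5, "max_steps": 10},
    "energy_optimization": {"difficulty": 2, "ram_target": 75.0, "energy_target": 6.0, "max_steps": 15},
    "balanced_optimization": {"difficulty": 3, "ram_target": 60.0, "energy_target": 5.0, "max_steps": 20},
    "advanced_efficiency": {"difficulty": 4, "ram_target": 50.0, "energy_target": 4.0, "max_steps": 25},
    "expert_optimization": {"difficulty": 5, "ram_target": 40.0, "energy_target": 3.0, "max_steps": 30},
}


def grader_name(task_name):
    """Build the grader name listed in the table for a task."""
    return "task_{}_{}_grader".format(TASKS[task_name]["difficulty"], task_name)


def tasks_by_difficulty():
    """Task names ordered from easiest to hardest."""
    return sorted(TASKS, key=lambda name: TASKS[name]["difficulty"])


print(grader_name("basic_ram_reduction"))  # task_1_basic_ram_reduction_grader
```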

How to Run Evaluation

Step 1: Start the Environment Server

cd "d:\Projects\Pytorch x hugging face\he_demo"
python -m uvicorn he_demo.server.app:app --host 0.0.0.0 --port 8000

Step 2A: Quick Baseline Test

Run the random-action baseline to establish a reference point:

python evaluate_inference.py

Example output (values vary from run to run because the actions are random):

BASELINE TEST: Random Actions
Initial RAM: 80.0%, Energy: 8.0 kWh

Step 1: optimize_energy, Intensity: 0.65      | Reward: +0.065 | RAM: 80.0% | Energy:  6.7 kWh
Step 2: balance_resources, Intensity: 0.42    | Reward: +0.042 | RAM: 77.9% | Energy:  6.3 kWh
...
Baseline Total Reward: 0.341
Baseline Avg Reward: 0.068

Step 2B: Run Full LLM Evaluation

# Set your LLM credentials
$env:HF_TOKEN = "your_hf_token_here"
$env:MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"  # or your preferred model

# Run inference on Task 1
$env:ENERGY_TASK = "basic_ram_reduction"
python inference.py

Example output:

[CONFIG] Task-specific grader configured: task=basic_ram_reduction difficulty=1 description='Reduce RAM usage below 70%'

Step 1: LLM chooses → reduce_ram,0.7
  Reward: +0.070
  RAM: 80% → 73%
  Energy: 8.0 kWh

Step 2: LLM chooses → reduce_ram,0.6
  Reward: +0.060
  RAM: 73% → 67.0%
  Energy: 8.0 kWh

Step 3: LLM chooses → monitor_system,0.5
  Reward: +0.050
  RAM: 67% (already optimized)
  Energy: 8.0 kWh

[Task Completed!]
Task Completion Bonus: +0.5 reward

GRADER EVALUATION:
  Task: basic_ram_reduction
  Final State: RAM=67.0%, Energy=8.0kWh, Steps=3/10
  Grader Score: 0.782 ✓

EPISODE SUMMARY:
  Total Reward: 0.680
  Average Reward/Step: 0.227
  Task Completed: YES
  Grader Score: 0.782

Evaluating Model Performance

Quality Levels (by Grader Score)

| Score Range | Rating | Interpretation |
|---|---|---|
| 0.9 - 0.999 | ★★★★★ EXCELLENT | Model solved task optimally, excellent resource optimization |
| 0.7 - 0.9 | ★★★★ GOOD | Model completed task efficiently, minor suboptimality |
| 0.5 - 0.7 | ★★★ FAIR | Model made good progress, some inefficiency |
| 0.3 - 0.5 | ★★ POOR | Model struggled, significant suboptimality |
| 0.001 - 0.3 | ★ VERY POOR | Model barely optimized or failed task |
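
A lookup for these bands might look like the sketch below. Adjacent ranges share their boundary values (0.9, 0.7, 0.5, 0.3), so this version assumes the lower bound of each band is inclusive; that choice, and the name `rate_score`, are not specified by the guide.

```python
# (lower bound, label) pairs, highest band first; lower bounds assumed inclusive.
RATING_BANDS = (
    (0.9, "EXCELLENT"),
    (0.7, "GOOD"),
    (0.5, "FAIR"),
    (0.3, "POOR"),
    (0.001, "VERY POOR"),
)


def rate_score(score):
    """Map a grader score in [0.001, 0.999] to its rating label."""
    if not (0.001 <= score <= 0.999):
        raise ValueError("grader scores lie in [0.001, 0.999]")
    for lower_bound, label in RATING_BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: the last band starts at 0.001")


print(rate_score(0.782))  # GOOD
```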

Expected Performance Baselines

Based on environment design:

Random Agent:

  • Avg Reward: ~0.05 per step
  • Task Completion: ~10% chance
  • Grader Score: ~0.2-0.3
  • Insight: No structure, just luck

Heuristic Agent (simple if-else rules):

  • Avg Reward: ~0.1 per step
  • Task Completion: ~60% chance (easy tasks only)
  • Grader Score: ~0.5-0.6
  • Insight: Follows simple logic, decent for easy tasks

Competent LLM Agent (Qwen, GPT, etc.):

  • Avg Reward: ~0.12-0.15 per step
  • Task Completion: ~70-80% (easy-medium tasks)
  • Grader Score: ~0.65-0.80
  • Insight: Understands environment, makes reasonable decisions

Expert LLM Agent (with few-shot examples):

  • Avg Reward: ~0.16-0.20 per step
  • Task Completion: ~85-95% (even hard tasks)
  • Grader Score: ~0.80-0.95
  • Insight: Optimized strategy, efficient resource management

Detailed Metrics Collected

The evaluate_inference.py script tracks:

{
    "task_name": "basic_ram_reduction",
    "difficulty": 1,
    "total_steps": 5,
    "total_reward": 0.482,           # Sum of all step rewards
    "avg_reward": 0.0964,             # Average per step
    "reward_range": [0.05, 0.12],    # Min to max reward
    "valid_actions": 5,               # Actions that parsed correctly
    "invalid_actions": 0,             # Actions that failed parsing
    "action_validity_rate": 1.0,      # Fraction of valid actions (1.0 = 100%)
    "initial_ram": 80.0,
    "final_ram": 50.0,                # Resource improvement
    "initial_energy": 8.0,
    "final_energy": 4.5,              # Resource improvement
    "task_completed": True,            # Did it hit targets?
    "final_task_progress": 1.0,       # 0.0 = no progress, 1.0 = complete
    "grader_score": 0.782             # Task-specific grader evaluation
}

What Makes a Good LLM Agent

  1. Prompt Understanding: Parses the observation correctly
  2. Action Validity: Produces valid actions in correct format
  3. Resource Awareness: Tracks RAM/Energy trade-offs
  4. Goal Orientation: Works toward task targets, not random
  5. Efficiency: Achieves targets in fewer steps when possible
  6. Adaptability: Adjusts strategy if initial approach fails

How Task Graders Work

Each task has a task_N_..._grader() function that:

  1. Takes the final observation
  2. Calculates how close you are to targets
  3. Considers step efficiency
  4. Returns score in [0.001, 0.999]

Example: Task 1 Grader Logic

def task_1_basic_ram_reduction_grader(observation):
    """Score a final observation for Task 1; returns a value in [0.001, 0.999]."""
    # RAM score: 0 at 100% usage, 1 at or below the 70% target (clamped to [0, 1])
    ram_score = (100 - observation.ram_usage) / (100 - 70)
    ram_score = max(0.0, min(1.0, ram_score))

    # Energy score: 0 at 10 kWh, 1 at or below the 7.5 kWh target (clamped to [0, 1])
    energy_score = (10 - observation.energy_consumption) / (10 - 7.5)
    energy_score = max(0.0, min(1.0, energy_score))

    # Step efficiency: full credit within the 10-step limit,
    # then 0.08 lost for every additional step
    if observation.steps_taken <= 10:
        efficiency = 1.0
    else:
        efficiency = max(0.0, 1.0 - (observation.steps_taken - 10) * 0.08)

    # Combine with weights
    composite = (ram_score * 0.4) + (energy_score * 0.4) + (efficiency * 0.2)

    # Clamp to valid range and return
    return max(0.001, min(0.999, composite))

Commands Reference

# View available graders
python -c "from he_demo import TASK_GRADERS; print(list(TASK_GRADERS.keys()))"

# Run evaluation on specific task
$env:ENERGY_TASK="basic_ram_reduction"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN="hf_xxx"
python inference.py

# Check grader metadata
python -c "from he_demo import get_grader_metadata; import json; print(json.dumps(get_grader_metadata(), indent=2))"

# Test environment directly
python test_env_direct.py

# Run HTTP endpoint tests
curl http://localhost:8000/graders
curl http://localhost:8000/validate
curl http://localhost:8000/tasks
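
The three curl checks can also be scripted with Python's standard library. This is a minimal sketch that assumes the server from Step 1 is running on localhost:8000; the response format is not documented here, so the bodies are returned as raw text. The helper names are hypothetical.

```python
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"
ENDPOINTS = ("graders", "validate", "tasks")  # paths taken from the curl commands


def endpoint_url(name, base_url=BASE_URL):
    """Join the base URL and an endpoint name."""
    return "{}/{}".format(base_url.rstrip("/"), name)


def fetch_all(base_url=BASE_URL, timeout=5.0):
    """GET each endpoint; return {name: (status_code, body_text)}.

    Needs a running server; urlopen raises on connection errors and on
    4xx/5xx responses.
    """
    results = {}
    for name in ENDPOINTS:
        with urlopen(endpoint_url(name, base_url), timeout=timeout) as response:
            results[name] = (response.status, response.read().decode("utf-8"))
    return results
```

Once the server is up, calling `fetch_all()` returns one entry per endpoint, which is easier to assert on in a test script than curl output.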

Interpreting Results

Good Sign:

  • ✅ Grader score > 0.6
  • ✅ Action validity rate = 100%
  • ✅ Task completed = True
  • ✅ Avg reward increasing over steps
  • ✅ Resource reduction matches task targets

Bad Sign:

  • ❌ Grader score < 0.3
  • ❌ Many invalid actions
  • ❌ Task not completed
  • ❌ Random action patterns
  • ❌ Resources not improving
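
These checks are easy to automate against the metrics dictionary shown under "Detailed Metrics Collected". The sketch below is one possible encoding, not part of the project: "many invalid actions" is interpreted as a validity rate below a hypothetical 0.8 threshold, and only the signs that can be read directly from that dictionary are covered.

```python
def interpret(metrics, many_invalid_threshold=0.8):
    """Return (good_signs, bad_signs) as lists of short descriptions."""
    good, bad = [], []

    score = metrics["grader_score"]
    if score > 0.6:
        good.append("grader score > 0.6")
    elif score < 0.3:
        bad.append("grader score < 0.3")

    rate = metrics["action_validity_rate"]
    if rate == 1.0:
        good.append("action validity rate = 100%")
    elif rate < many_invalid_threshold:  # threshold is an assumption
        bad.append("many invalid actions")

    if metrics["task_completed"]:
        good.append("task completed")
    else:
        bad.append("task not completed")

    ram_improved = metrics["final_ram"] < metrics["initial_ram"]
    energy_improved = metrics["final_energy"] < metrics["initial_energy"]
    if not (ram_improved or energy_improved):
        bad.append("resources not improving")

    return good, bad
```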

Summary

The LLM is evaluated on:

  1. Can it parse observations? (Action validity)
  2. Can it make good decisions? (Reward accumulation)
  3. Can it complete tasks? (Task targets met)
  4. How efficiently? (Steps taken, resource optimization)
  5. What's the final score? (Grader evaluation)

An excellent LLM agent should exceed baseline performance and consistently achieve grader scores > 0.75.