LLM Inference Evaluation Guide

Overview

This project evaluates how well a large language model (LLM) can interact with the Energy & Memory RAM Optimization environment and maximize the reward it earns.

What Gets Evaluated

When you run the LLM inference, these components are judged:

1. Action Quality

What: Whether the LLM produces valid actions in the format action_type,intensity
How: The parser checks that action_type ∈ ["reduce_ram", "optimize_energy", "balance_resources", "monitor_system"] and intensity ∈ [0.0, 1.0]
Grade: Action validity rate (percentage of actions that parse as valid)
Example Good: reduce_ram,0.8 (valid)
Example Bad: invalid_action,2.5 (invalid - gets converted to monitor_system,0.5)
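The validation rule above can be sketched as a small parser. This is a minimal sketch, not the project's actual parser: the names `parse_action`, `VALID_ACTIONS`, and `FALLBACK` are hypothetical, and it assumes any malformed reply falls back to monitor_system,0.5 as in the bad example.

```python
VALID_ACTIONS = ("reduce_ram", "optimize_energy", "balance_resources", "monitor_system")
FALLBACK = ("monitor_system", 0.5)  # fallback action named in the guide


def parse_action(text: str) -> tuple[str, float, bool]:
    """Return (action_type, intensity, was_valid) for a raw LLM reply."""
    parts = [p.strip() for p in text.strip().split(",")]
    if len(parts) != 2 or parts[0] not in VALID_ACTIONS:
        return (*FALLBACK, False)
    try:
        intensity = float(parts[1])
    except ValueError:
        return (*FALLBACK, False)
    if not 0.0 <= intensity <= 1.0:  # also rejects NaN and infinity
        return (*FALLBACK, False)
    return (parts[0], intensity, True)


def validity_rate(replies: list[str]) -> float:
    """Fraction of replies that parsed as valid actions."""
    if not replies:
        return 0.0
    return sum(parse_action(r)[2] for r in replies) / len(replies)


print(parse_action("reduce_ram,0.8"))      # ('reduce_ram', 0.8, True)
print(parse_action("invalid_action,2.5"))  # ('monitor_system', 0.5, False)
```

Returning the validity flag alongside the fallback action keeps the episode running while still letting the invalid reply count against the validity rate.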

2. Reward Accumulation

What: Total reward achieved across all steps
How: Each step earns reward = intensity × 0.1
- Plus a task completion bonus (difficulty × 0.5) when the task targets are met
- The result is clamped to the [0, 1] range
Grade Metrics:
- Total reward (sum of all step rewards)
- Average reward per step
- Reward trend (increasing/decreasing)
- Max and min rewards achieved
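The reward rule can be written out as follows. This is a sketch under two assumptions that the guide does not state explicitly: the completion bonus is added on the step that completes the task, and the clamp applies to each step's reward. The function name `step_reward` is hypothetical.

```python
def step_reward(intensity: float, difficulty: int, task_completed: bool) -> float:
    """Per-step reward: intensity x 0.1, plus difficulty x 0.5 on completion, clamped to [0, 1]."""
    reward = intensity * 0.1
    if task_completed:
        # Assumption: the bonus is granted on the completing step only.
        reward += difficulty * 0.5
    return max(0.0, min(1.0, reward))


# An ordinary step at intensity 0.7, then a completing step on a difficulty-1 task.
print(round(step_reward(0.7, difficulty=1, task_completed=False), 3))  # 0.07
print(round(step_reward(0.5, difficulty=1, task_completed=True), 3))   # 0.55
```

Under this reading, the bonus for difficulty 2 and above already reaches the upper clamp on its own, so the completing step of a harder task always scores 1.0.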

3. Resource Optimization

What: How much RAM and Energy the LLM actually reduced
How: Compare initial vs final state
Grade Metrics:
- RAM reduction: Initial 80% → Final X%
- Energy reduction: Initial 8.0 kWh → Final X kWh
- Efficiency: Resources saved ÷ Steps taken
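A minimal sketch of these resource metrics follows. Because RAM (percentage points) and energy (kWh) have different units, it reports the per-step efficiency for each separately; that split, and the name `resource_savings`, are assumptions rather than the project's definitions.

```python
def resource_savings(initial_ram, final_ram, initial_energy, final_energy, steps):
    """Return absolute and per-step savings for RAM (percentage points) and energy (kWh)."""
    ram_saved = initial_ram - final_ram
    energy_saved = initial_energy - final_energy
    divisor = max(steps, 1)  # avoid division by zero for an empty episode
    return {
        "ram_saved_pct_points": ram_saved,
        "energy_saved_kwh": energy_saved,
        "ram_saved_per_step": ram_saved / divisor,
        "energy_saved_per_step": energy_saved / divisor,
    }


# Example episode: RAM 80% -> 50%, energy 8.0 -> 4.5 kWh, in 5 steps.
print(resource_savings(80.0, 50.0, 8.0, 4.5, steps=5))
```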

4. Task Completion

What: Whether the LLM completed the assigned task
How: Environment checks if current state meets task targets within max_steps
Example Task: Task 1 (basic_ram_reduction)
- Target: RAM < 70%, Energy < 7.5 kWh
- Deadline: 10 steps max
- Success: If both targets met within 10 steps → Reward bonus + task marked complete
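The completion check for this example task might look like the sketch below. The function name `targets_met` is hypothetical, and the defaults are simply the Task 1 values listed above; the environment's real check may differ in detail.

```python
def targets_met(ram: float, energy: float, steps: int,
                ram_target: float = 70.0, energy_target: float = 7.5,
                max_steps: int = 10) -> bool:
    """True when both targets are strictly met within the step deadline."""
    return ram < ram_target and energy < energy_target and steps <= max_steps


print(targets_met(67.0, 7.2, steps=3))   # True
print(targets_met(67.0, 7.2, steps=11))  # False: past the deadline
```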

5. Grader Score (0.001 - 0.999)

What: Task-specific evaluation score from the grader
How: Grader function evaluates final observation against task targets

Formula Example (Task 1):

RAM score = (100 - final_RAM) / (100 - 70)  (range 0-1)
Energy score = (10 - final_energy) / (10 - 7.5)  (range 0-1)
Efficiency = 1.0 if steps taken ≤ 10, otherwise reduced by 0.08 per extra step (floored at 0)
Final = (RAM_score × 0.4) + (Energy_score × 0.4) + (Efficiency × 0.2)
Final is clamped to [0.001, 0.999]
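To get a feel for how the Task 1 formula responds to different final states, it can be evaluated for a few hypothetical outcomes. This is a sketch only: `task_1_score` and `clamp` are illustrative names, the component scores are clamped to [0, 1] on the assumption that this is what the stated 0-1 range means, and efficiency is fixed at 1.0 (episode finished within the step limit).

```python
def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def task_1_score(final_ram: float, final_energy: float, efficiency: float = 1.0) -> float:
    """Weighted Task 1 score from the formula above (component clamping is an assumption)."""
    ram_score = clamp((100 - final_ram) / (100 - 70), 0.0, 1.0)
    energy_score = clamp((10 - final_energy) / (10 - 7.5), 0.0, 1.0)
    final = ram_score * 0.4 + energy_score * 0.4 + efficiency * 0.2
    return clamp(final, 0.001, 0.999)


# Hypothetical final states: no change, partial progress, both targets met.
for ram, energy in [(80.0, 8.0), (76.0, 7.8), (69.0, 7.4)]:
    print(f"RAM={ram}%, Energy={energy} kWh -> score {task_1_score(ram, energy):.3f}")
```

Note that an agent that changes nothing still scores well above zero under this formula, because the scores are measured from 100% RAM and 10 kWh rather than from the 80% / 8.0 kWh starting state.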

6. Decision-Making Efficiency

What: How quickly and thoughtfully the LLM makes good decisions
How: Track history of actions and rewards
Grade Metrics:
- Time-to-first-good-action
- Convergence speed (steps to reach target)
- Backtracking frequency (bad decision reversals)
- Adaptive behavior (does agent improve over time?)
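The guide does not define these metrics precisely, so the sketch below picks simple working definitions, all of which are assumptions: a "good" action is one whose step reward reaches a chosen threshold, convergence is the first step at which both targets are met, and the agent "improves" if the mean reward of the second half of the episode exceeds that of the first half. All function names and the default threshold are hypothetical.

```python
from statistics import mean


def time_to_first_good_action(rewards, threshold=0.06):
    """1-based step of the first reward >= threshold, or None if there is none."""
    for step, reward in enumerate(rewards, start=1):
        if reward >= threshold:
            return step
    return None


def convergence_step(ram_history, energy_history, ram_target, energy_target):
    """1-based step at which both targets are first met, or None if never."""
    for step, (ram, energy) in enumerate(zip(ram_history, energy_history), start=1):
        if ram < ram_target and energy < energy_target:
            return step
    return None


def improves_over_time(rewards):
    """True if the second half of the episode has a higher mean reward than the first."""
    if len(rewards) < 2:
        return False
    half = len(rewards) // 2
    return mean(rewards[half:]) > mean(rewards[:half])


rewards = [0.03, 0.05, 0.07, 0.08]
print(time_to_first_good_action(rewards))                             # 3
print(convergence_step([76, 71, 68], [7.8, 7.6, 7.3], 70.0, 7.5))     # 3
print(improves_over_time(rewards))                                    # True
```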

Understanding the Tasks & Difficulty

Each task has an explicit grader that measures performance:

| Task | Difficulty | RAM Target | Energy Target | Max Steps | Grader Name |
|---|---|---|---|---|---|
| basic_ram_reduction | 1 | < 70% | < 7.5 kWh | 10 | task_1_basic_ram_reduction_grader |
| energy_optimization | 2 | < 75% | < 6.0 kWh | 15 | task_2_energy_optimization_grader |
| balanced_optimization | 3 | < 60% | < 5.0 kWh | 20 | task_3_balanced_optimization_grader |
| advanced_efficiency | 4 | < 50% | < 4.0 kWh | 25 | task_4_advanced_efficiency_grader |
| expert_optimization | 5 | < 40% | < 3.0 kWh | 30 | task_5_expert_optimization_grader |
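For scripting, the task table can be mirrored as a small lookup structure. This is a sketch, not the project's own data model: `TaskSpec`, `TASKS`, and `select_task` are hypothetical names, and falling back to basic_ram_reduction when ENERGY_TASK is unset is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    name: str
    difficulty: int
    ram_target: float     # percent, strict upper bound
    energy_target: float  # kWh, strict upper bound
    max_steps: int

    @property
    def grader_name(self) -> str:
        # Matches the naming pattern in the Grader Name column.
        return f"task_{self.difficulty}_{self.name}_grader"


TASKS = {t.name: t for t in (
    TaskSpec("basic_ram_reduction", 1, 70.0, 7.5, 10),
    TaskSpec("energy_optimization", 2, 75.0, 6.0, 15),
    TaskSpec("balanced_optimization", 3, 60.0, 5.0, 20),
    TaskSpec("advanced_efficiency", 4, 50.0, 4.0, 25),
    TaskSpec("expert_optimization", 5, 40.0, 3.0, 30),
)}


def select_task(name=None) -> TaskSpec:
    """Look up a task by name, defaulting to the ENERGY_TASK environment variable."""
    name = name or os.environ.get("ENERGY_TASK", "basic_ram_reduction")
    if name not in TASKS:
        raise ValueError(f"Unknown task {name!r}; expected one of {sorted(TASKS)}")
    return TASKS[name]


print(select_task("energy_optimization").grader_name)  # task_2_energy_optimization_grader
```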

How to Run Evaluation

Step 1: Start the Environment Server

cd "d:\Projects\Pytorch x hugging face\he_demo"  # replace with the path to your local checkout
python -m uvicorn he_demo.server.app:app --host 0.0.0.0 --port 8000

Step 2A: Quick Baseline Test

Run the baseline agent (random actions) to establish a reference point:

python evaluate_inference.py

Example output (exact values vary between runs):

BASELINE TEST: Random Actions
Initial RAM: 80.0%, Energy: 8.0 kWh

Step 1: optimize_energy, Intensity: 0.65      | Reward: +0.065 | RAM: 80.0% | Energy:  6.7 kWh
Step 2: balance_resources, Intensity: 0.42    | Reward: +0.042 | RAM: 77.9% | Energy:  6.3 kWh
...
Baseline Total Reward: 0.341
Baseline Avg Reward: 0.068

Step 2B: Run Full LLM Evaluation

# Set your LLM credentials
$env:HF_TOKEN = "your_hf_token_here"
$env:MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"  # or your preferred model

# Run inference on Task 1
$env:ENERGY_TASK = "basic_ram_reduction"
python inference.py

Example output (illustrative; actual actions and scores depend on the model):

[CONFIG] Task-specific grader configured: task=basic_ram_reduction difficulty=1 description='Reduce RAM usage below 70%'

Step 1: LLM chooses → reduce_ram,0.7
  Reward: +0.070
  RAM: 80% → 73%
  Energy: 8.0 kWh

Step 2: LLM chooses → reduce_ram,0.6
  Reward: +0.060
  RAM: 73% → 67.0%
  Energy: 8.0 kWh

Step 3: LLM chooses → monitor_system,0.5
  Reward: +0.050
  RAM: 67% (already optimized)
  Energy: 8.0 kWh

[Task Completed!]
Task Completion Bonus: +0.5 reward

GRADER EVALUATION:
  Task: basic_ram_reduction
  Final State: RAM=67.0%, Energy=8.0kWh, Steps=3/10
  Grader Score: 0.782 ✓

EPISODE SUMMARY:
  Total Reward: 0.680
  Average Reward/Step: 0.227
  Task Completed: YES
  Grader Score: 0.782

Evaluating Model Performance

Quality Levels (by Grader Score)

| Score Range | Rating | Interpretation |
|---|---|---|
| 0.9 - 0.999 | ★★★★★ EXCELLENT | Model solved task optimally, excellent resource optimization |
| 0.7 - 0.9 | ★★★★ GOOD | Model completed task efficiently, minor suboptimality |
| 0.5 - 0.7 | ★★★ FAIR | Model made good progress, some inefficiency |
| 0.3 - 0.5 | ★★ POOR | Model struggled, significant suboptimality |
| 0.001 - 0.3 | ★ VERY POOR | Model barely optimized or failed task |
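The rating bands can be applied programmatically with a simple lookup. The table's ranges share their boundary values, so this sketch assumes each band includes its lower bound (a score of exactly 0.7 counts as GOOD); the function name `rating` is hypothetical.

```python
def rating(score: float) -> str:
    """Map a grader score to its quality label (lower bounds inclusive, by assumption)."""
    bands = [(0.9, "EXCELLENT"), (0.7, "GOOD"), (0.5, "FAIR"), (0.3, "POOR")]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "VERY POOR"


print(rating(0.782))  # GOOD
print(rating(0.25))   # VERY POOR
```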

Expected Performance Baselines

Approximate expectations based on the environment design:

Random Agent:

Avg Reward: ~0.05 per step
Task Completion: ~10% chance
Grader Score: ~0.2-0.3
Insight: No structure, just luck

Heuristic Agent (simple if-else rules):

Avg Reward: ~0.1 per step
Task Completion: ~60% chance (easy tasks only)
Grader Score: ~0.5-0.6
Insight: Follows simple logic, decent for easy tasks

Competent LLM Agent (Qwen, GPT, etc.):

Avg Reward: ~0.12-0.15 per step
Task Completion: ~70-80% (easy-medium tasks)
Grader Score: ~0.65-0.80
Insight: Understands environment, makes reasonable decisions

Expert LLM Agent (with few-shot examples):

Avg Reward: ~0.16-0.20 per step
Task Completion: ~85-95% (even hard tasks)
Grader Score: ~0.80-0.95
Insight: Optimized strategy, efficient resource management

Detailed Metrics Collected

The evaluate_inference.py script tracks:

{
    "task_name": "basic_ram_reduction",
    "difficulty": 1,
    "total_steps": 5,
    "total_reward": 0.482,            # Sum of all step rewards
    "avg_reward": 0.0964,             # Average per step (total_reward / total_steps)
    "reward_range": [0.05, 0.12],     # Min to max step reward
    "valid_actions": 5,               # Actions that parsed correctly
    "invalid_actions": 0,             # Actions that failed parsing
    "action_validity_rate": 1.0,      # Fraction of valid actions (1.0 = 100%)
    "initial_ram": 80.0,
    "final_ram": 50.0,                # Resource improvement
    "initial_energy": 8.0,
    "final_energy": 4.5,              # Resource improvement
    "task_completed": True,           # Did it hit targets?
    "final_task_progress": 1.0,       # 0.0 = no progress, 1.0 = complete
    "grader_score": 0.782             # Task-specific grader evaluation
}
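The reward and validity fields in such a record can be derived from the per-step history. The sketch below is not taken from evaluate_inference.py; the function name `summarize` and its inputs (a list of step rewards and a parallel list of validity flags) are assumptions made for illustration.

```python
def summarize(rewards: list[float], valid_flags: list[bool]) -> dict:
    """Aggregate per-step rewards and validity flags into summary metrics."""
    steps = len(rewards)
    valid = sum(valid_flags)
    total = sum(rewards)
    return {
        "total_steps": steps,
        "total_reward": round(total, 3),
        "avg_reward": round(total / steps, 3) if steps else 0.0,
        "reward_range": [min(rewards), max(rewards)] if steps else [0.0, 0.0],
        "valid_actions": valid,
        "invalid_actions": len(valid_flags) - valid,
        "action_validity_rate": valid / len(valid_flags) if valid_flags else 0.0,
    }


# Three steps, with a completion bonus of 0.5 included in the last step's reward.
print(summarize([0.07, 0.06, 0.55], [True, True, True]))
```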

What Makes a Good LLM Agent

Prompt Understanding: Parses the observation correctly
Action Validity: Produces valid actions in correct format
Resource Awareness: Tracks RAM/Energy trade-offs
Goal Orientation: Works toward task targets, not random
Efficiency: Achieves targets in fewer steps when possible
Adaptability: Adjusts strategy if initial approach fails

How Task Graders Work

Each task has a task_N_..._grader() function that:

Takes the final observation
Calculates how close you are to targets
Considers step efficiency
Returns score in [0.001, 0.999]

Example: Task 1 Grader Logic

def task_1_basic_ram_reduction_grader(observation):
    # RAM score: 0 at 100% usage, 1 at the 70% target (baseline is 80%)
    ram_score = (100 - observation.ram_usage) / (100 - 70)
    ram_score = max(0.0, min(1.0, ram_score))  # keep within the 0-1 range

    # Energy score: 0 at 10 kWh, 1 at the 7.5 kWh target (baseline is 8.0 kWh)
    energy_score = (10 - observation.energy_consumption) / (10 - 7.5)
    energy_score = max(0.0, min(1.0, energy_score))  # keep within the 0-1 range

    # Step efficiency: full credit within the 10-step limit,
    # then 0.08 lost per extra step
    if observation.steps_taken <= 10:
        efficiency = 1.0
    else:
        efficiency = max(0.0, 1.0 - (observation.steps_taken - 10) * 0.08)

    # Combine with weights
    composite = (ram_score * 0.4) + (energy_score * 0.4) + (efficiency * 0.2)

    # Clamp to the valid range and return
    return max(0.001, min(0.999, composite))

Commands Reference

# View available graders
python -c "from he_demo import TASK_GRADERS; print(list(TASK_GRADERS.keys()))"

# Run evaluation on specific task
$env:ENERGY_TASK="basic_ram_reduction"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN="hf_xxx"
python inference.py

# Check grader metadata
python -c "from he_demo import get_grader_metadata; import json; print(json.dumps(get_grader_metadata(), indent=2))"

# Test environment directly
python test_env_direct.py

# Run HTTP endpoint tests
curl http://localhost:8000/graders
curl http://localhost:8000/validate
curl http://localhost:8000/tasks
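The same endpoint checks can be run from Python with only the standard library. This is a sketch under assumptions: the server from Step 1 is running on port 8000, and the three endpoints return JSON (the guide does not show their response format). The names `ENDPOINTS`, `endpoint_url`, and `fetch_all` are hypothetical.

```python
import json
from urllib.request import urlopen

ENDPOINTS = ("graders", "validate", "tasks")  # paths listed in the commands above


def endpoint_url(base_url: str, name: str) -> str:
    """Join the server base URL and an endpoint name."""
    return f"{base_url.rstrip('/')}/{name}"


def fetch_all(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> dict:
    """GET every endpoint and decode the body as JSON (assumed response format)."""
    results = {}
    for name in ENDPOINTS:
        with urlopen(endpoint_url(base_url, name), timeout=timeout) as response:
            results[name] = json.loads(response.read().decode("utf-8"))
    return results
```

With the server running, `fetch_all()` returns a dictionary keyed by endpoint name; a connection error or a non-JSON body raises an exception, which is itself a useful signal that the server is not set up as expected.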

Interpreting Results

Good Sign:

✅ Grader score > 0.6
✅ Action validity rate = 100%
✅ Task completed = True
✅ Avg reward increasing over steps
✅ Resource reduction matches task targets

Bad Sign:

❌ Grader score < 0.3
❌ Many invalid actions
❌ Task not completed
❌ Random action patterns
❌ Resources not improving

Summary

The LLM is evaluated on:

Can it parse observations? (Action validity)
Can it make good decisions? (Reward accumulation)
Can it complete tasks? (Task targets met)
How efficiently? (Steps taken, resource optimization)
What's the final score? (Grader evaluation)

A strong LLM agent should outperform the baseline agents and consistently achieve grader scores above 0.75, which places it in the GOOD band or better.