# LLM Inference Evaluation Guide
## Overview
This project evaluates how well a Large Language Model (LLM) can interact with the Energy & Memory RAM Optimization environment to maximize reward.
---
## What Gets Evaluated
When you run the LLM inference, these components are judged:
### 1. **Action Quality**
- **What**: Whether the LLM produces valid actions in format: `action_type,intensity`
- **How**: Parser validates action_type ∈ ["reduce_ram", "optimize_energy", "balance_resources", "monitor_system"] and intensity ∈ [0.0, 1.0]
- **Grade**: Action validity rate (% of valid vs invalid actions)
- **Example Good**: `reduce_ram,0.8` (valid)
- **Example Bad**: `invalid_action,2.5` (invalid - gets converted to `monitor_system,0.5`)
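The validation rule above can be sketched as follows. This is an illustrative helper, not the environment's actual parser; `parse_action` and `VALID_ACTIONS` are hypothetical names:

```python
VALID_ACTIONS = {"reduce_ram", "optimize_energy", "balance_resources", "monitor_system"}

def parse_action(raw: str) -> tuple[str, float]:
    """Parse 'action_type,intensity'; fall back to the safe default on any error."""
    try:
        action_type, intensity_str = raw.strip().split(",")
        intensity = float(intensity_str)
        if action_type in VALID_ACTIONS and 0.0 <= intensity <= 1.0:
            return action_type, intensity
    except ValueError:
        pass
    # Anything invalid is converted to the safe default described above
    return "monitor_system", 0.5
```

For example, `parse_action("reduce_ram,0.8")` passes through unchanged, while `parse_action("invalid_action,2.5")` falls back to `("monitor_system", 0.5)`.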
### 2. **Reward Accumulation**
- **What**: Total reward achieved across all steps
- **How**: Each action generates reward = intensity × 0.1
- Plus task completion bonus (difficulty × 0.5)
- Clamped to 0-1 range
- **Grade Metrics**:
- Total reward (sum of all step rewards)
- Average reward per step
- Reward trend (increasing/decreasing)
- Max and min rewards achieved
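The per-step reward rule above can be written out as a small sketch (the function name and signature are illustrative, not the environment's actual API):

```python
def step_reward(intensity: float, task_completed: bool = False, difficulty: int = 1) -> float:
    """Base reward = intensity x 0.1, plus a completion bonus, clamped to [0, 1]."""
    reward = intensity * 0.1
    if task_completed:
        reward += difficulty * 0.5  # one-time task completion bonus
    return max(0.0, min(1.0, reward))
```

Note that on a difficulty-2 task, a completing step's raw reward (e.g. 0.07 + 1.0) exceeds 1, so the clamp caps it at 1.0.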
### 3. **Resource Optimization**
- **What**: How much RAM and Energy the LLM actually reduced
- **How**: Compare initial vs final state
- **Grade Metrics**:
- RAM reduction: Initial 80% → Final X%
- Energy reduction: Initial 8.0 kWh → Final X kWh
- Efficiency: Resources saved ÷ Steps taken
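One plausible reading of the efficiency metric above, as a sketch: combine the RAM percentage points and kWh saved, divided by steps taken. The equal weighting of the two units is an assumption, not the environment's documented formula:

```python
def resource_efficiency(initial_ram: float, final_ram: float,
                        initial_energy: float, final_energy: float,
                        steps: int) -> float:
    """Resources saved per step (RAM in percentage points, energy in kWh)."""
    ram_saved = initial_ram - final_ram
    energy_saved = initial_energy - final_energy
    return (ram_saved + energy_saved) / max(steps, 1)
```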
### 4. **Task Completion**
- **What**: Whether the LLM completed the assigned task
- **How**: Environment checks if current state meets task targets within max_steps
- **Example Task**: Task 1 (basic_ram_reduction)
- Target: RAM < 70%, Energy < 7.5 kWh
- Deadline: 10 steps max
- Success: If both targets met within 10 steps → Reward bonus + task marked complete
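The completion check for Task 1 can be sketched like this (a hypothetical helper mirroring the stated targets, not the environment's actual code):

```python
def task_completed(ram: float, energy: float, steps: int,
                   ram_target: float = 70.0, energy_target: float = 7.5,
                   max_steps: int = 10) -> bool:
    """Task 1: both targets must be met within the step budget."""
    return ram < ram_target and energy < energy_target and steps <= max_steps
```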
### 5. **Grader Score (0.001 - 0.999)**
- **What**: Task-specific evaluation score from the grader
- **How**: Grader function evaluates final observation against task targets
- **Formula Example** (Task 1):
```
RAM_score    = (100 - final_RAM) / (100 - 70)      # clamped to [0, 1]
Energy_score = (10 - final_energy) / (10 - 7.5)    # clamped to [0, 1]
Efficiency   = bonus for finishing within the step limit
Final = (RAM_score × 0.4) + (Energy_score × 0.4) + (Efficiency × 0.2)
Final is clamped to [0.001, 0.999]
```
### 6. **Decision-Making Efficiency**
- **What**: How quickly and thoughtfully the LLM makes good decisions
- **How**: Track history of actions and rewards
- **Grade Metrics**:
- Time-to-first-good-action
- Convergence speed (steps to reach target)
- Backtracking frequency (bad decision reversals)
- Adaptive behavior (does agent improve over time?)
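One of these metrics, the reward trend, can be computed from the action/reward history with a simple half-split comparison. This is an illustrative sketch; the actual trend heuristic used by the evaluation script may differ:

```python
def reward_trend(rewards: list[float]) -> str:
    """Classify the trend by comparing the mean of the second half to the first."""
    half = len(rewards) // 2
    first = rewards[:half] or [0.0]
    second = rewards[half:] or [0.0]
    avg_first = sum(first) / len(first)
    avg_second = sum(second) / len(second)
    if avg_second > avg_first:
        return "increasing"
    if avg_second < avg_first:
        return "decreasing"
    return "flat"
```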
---
## Understanding the Tasks & Difficulty
Each task has explicit graders that measure performance:
| Task | Difficulty | RAM Target | Energy Target | Max Steps | Grader Name |
|------|-----------|-----------|---------------|-----------|------------|
| basic_ram_reduction | 1 | < 70% | < 7.5 kWh | 10 | task_1_basic_ram_reduction_grader |
| energy_optimization | 2 | < 75% | < 6.0 kWh | 15 | task_2_energy_optimization_grader |
| balanced_optimization | 3 | < 60% | < 5.0 kWh | 20 | task_3_balanced_optimization_grader |
| advanced_efficiency | 4 | < 50% | < 4.0 kWh | 25 | task_4_advanced_efficiency_grader |
| expert_optimization | 5 | < 40% | < 3.0 kWh | 30 | task_5_expert_optimization_grader |
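The table above maps naturally onto a configuration dict, which can be handy when scripting evaluations across tasks. The `TASKS` name and key layout here are illustrative, not the structure used inside `he_demo`:

```python
TASKS = {
    "basic_ram_reduction":   {"difficulty": 1, "ram_target": 70, "energy_target": 7.5, "max_steps": 10},
    "energy_optimization":   {"difficulty": 2, "ram_target": 75, "energy_target": 6.0, "max_steps": 15},
    "balanced_optimization": {"difficulty": 3, "ram_target": 60, "energy_target": 5.0, "max_steps": 20},
    "advanced_efficiency":   {"difficulty": 4, "ram_target": 50, "energy_target": 4.0, "max_steps": 25},
    "expert_optimization":   {"difficulty": 5, "ram_target": 40, "energy_target": 3.0, "max_steps": 30},
}
```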
---
## How to Run Evaluation
### Step 1: Start the Environment Server
```powershell
cd "d:\Projects\Pytorch x hugging face\he_demo"
python -m uvicorn he_demo.server.app:app --host 0.0.0.0 --port 8000
```
### Step 2A: Quick Baseline Test
Run the heuristic agent to establish a baseline:
```bash
python evaluate_inference.py
```
**Output you'll see:**
```
BASELINE TEST: Random Actions
Initial RAM: 80.0%, Energy: 8.0 kWh
Step 1: optimize_energy, Intensity: 0.65 | Reward: +0.065 | RAM: 80.0% | Energy: 6.7 kWh
Step 2: balance_resources, Intensity: 0.42 | Reward: +0.042 | RAM: 77.9% | Energy: 6.3 kWh
...
Baseline Total Reward: 0.341
Baseline Avg Reward: 0.068
```
### Step 2B: Run Full LLM Evaluation
```powershell
# Set your LLM credentials
$env:HF_TOKEN = "your_hf_token_here"
$env:MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct" # or your preferred model
# Run inference on Task 1
$env:ENERGY_TASK = "basic_ram_reduction"
python inference.py
```
**Output you'll see:**
```
[CONFIG] Task-specific grader configured: task=basic_ram_reduction difficulty=1 description='Reduce RAM usage below 70%'
Step 1: LLM chooses → reduce_ram,0.7
Reward: +0.070
RAM: 80% → 73%
Energy: 8.0 kWh
Step 2: LLM chooses → reduce_ram,0.6
Reward: +0.060
RAM: 73% → 67.0%
Energy: 8.0 kWh
Step 3: LLM chooses → monitor_system,0.5
Reward: +0.050
RAM: 67% (already optimized)
Energy: 8.0 kWh
[Task Completed!]
Task Completion Bonus: +0.5 reward
GRADER EVALUATION:
Task: basic_ram_reduction
Final State: RAM=67.0%, Energy=8.0kWh, Steps=3/10
Grader Score: 0.782 ✓
EPISODE SUMMARY:
Total Reward: 0.680
Average Reward/Step: 0.227
Task Completed: YES
Grader Score: 0.782
```
---
## Evaluating Model Performance
### Quality Levels (by Grader Score)
| Score Range | Rating | Interpretation |
|------------|--------|----------------|
| 0.9 - 0.999 | ★★★★★ EXCELLENT | Model solved task optimally, excellent resource optimization |
| 0.7 - 0.9 | ★★★★ GOOD | Model completed task efficiently, minor suboptimality |
| 0.5 - 0.7 | ★★★ FAIR | Model made good progress, some inefficiency |
| 0.3 - 0.5 | ★★ POOR | Model struggled, significant suboptimality |
| 0.001 - 0.3 | ★ VERY POOR | Model barely optimized or failed task |
### Expected Performance Baselines
Based on environment design:
**Random Agent:**
- Avg Reward: ~0.05 per step
- Task Completion: ~10% chance
- Grader Score: ~0.2-0.3
- Insight: No structure, just luck
**Heuristic Agent** (simple if-else rules):
- Avg Reward: ~0.1 per step
- Task Completion: ~60% chance (easy tasks only)
- Grader Score: ~0.5-0.6
- Insight: Follows simple logic, decent for easy tasks
**Competent LLM Agent** (Qwen, GPT, etc.):
- Avg Reward: ~0.12-0.15 per step
- Task Completion: ~70-80% (easy-medium tasks)
- Grader Score: ~0.65-0.80
- Insight: Understands environment, makes reasonable decisions
**Expert LLM Agent** (with few-shot examples):
- Avg Reward: ~0.16-0.20 per step
- Task Completion: ~85-95% (even hard tasks)
- Grader Score: ~0.80-0.95
- Insight: Optimized strategy, efficient resource management
---
## Detailed Metrics Collected
The `evaluate_inference.py` script tracks:
```python
{
"task_name": "basic_ram_reduction",
"difficulty": 1,
"total_steps": 5,
"total_reward": 0.482, # Sum of all step rewards
"avg_reward": 0.0964, # Average per step
"reward_range": [0.05, 0.12], # Min to max reward
"valid_actions": 5, # Actions that parsed correctly
"invalid_actions": 0, # Actions that failed parsing
"action_validity_rate": 1.0, # % of valid actions
"initial_ram": 80.0,
"final_ram": 50.0, # Resource improvement
"initial_energy": 8.0,
"final_energy": 4.5, # Resource improvement
"task_completed": True, # Did it hit targets?
"final_task_progress": 1.0, # 0.0 = no progress, 1.0 = complete
"grader_score": 0.782 # Task-specific grader evaluation
}
```
---
## What Makes a Good LLM Agent
1. **Prompt Understanding**: Parses the observation correctly
2. **Action Validity**: Produces valid actions in correct format
3. **Resource Awareness**: Tracks RAM/Energy trade-offs
4. **Goal Orientation**: Works toward task targets, not random
5. **Efficiency**: Achieves targets in fewer steps when possible
6. **Adaptability**: Adjusts strategy if initial approach fails
---
## How Task Graders Work
Each task has a **task_N_..._grader()** function that:
1. Takes the final observation
2. Calculates how close you are to targets
3. Considers step efficiency
4. Returns score in [0.001, 0.999]
**Example: Task 1 Grader Logic**
```python
def task_1_basic_ram_reduction_grader(observation):
    # RAM reduction from baseline 80% toward target < 70%, clamped to [0, 1]
    ram_score = min(1.0, max(0.0, (100 - observation.ram_usage) / (100 - 70)))
    # Energy reduction from baseline 8.0 toward target < 7.5, clamped to [0, 1]
    energy_score = min(1.0, max(0.0, (10 - observation.energy_consumption) / (10 - 7.5)))
    # Step efficiency: full credit within the limit, decaying penalty beyond it
    if observation.steps_taken <= 10:
        efficiency = 1.0
    else:
        efficiency = max(0.0, 1.0 - (observation.steps_taken - 10) * 0.08)
    # Combine with weights
    composite = (ram_score * 0.4) + (energy_score * 0.4) + (efficiency * 0.2)
    # Clamp to valid range and return
    return max(0.001, min(0.999, composite))
```
---
## Commands Reference
```powershell
# View available graders
python -c "from he_demo import TASK_GRADERS; print(list(TASK_GRADERS.keys()))"
# Run evaluation on specific task
$env:ENERGY_TASK="basic_ram_reduction"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN="hf_xxx"
python inference.py
# Check grader metadata
python -c "from he_demo import get_grader_metadata; import json; print(json.dumps(get_grader_metadata(), indent=2))"
# Test environment directly
python test_env_direct.py
# Run HTTP endpoint tests
curl http://localhost:8000/graders
curl http://localhost:8000/validate
curl http://localhost:8000/tasks
```
---
## Interpreting Results
**Good Sign:**
- ✅ Grader score > 0.6
- ✅ Action validity rate = 100%
- ✅ Task completed = True
- ✅ Avg reward increasing over steps
- ✅ Resource reduction matches task targets
**Bad Sign:**
- ❌ Grader score < 0.3
- ❌ Many invalid actions
- ❌ Task not completed
- ❌ Random action patterns
- ❌ Resources not improving
---
## Summary
The LLM is evaluated on:
1. **Can it parse observations?** (Action validity)
2. **Can it make good decisions?** (Reward accumulation)
3. **Can it complete tasks?** (Task targets met)
4. **How efficiently?** (Steps taken, resource optimization)
5. **What's the final score?** (Grader evaluation)
An excellent LLM agent should exceed baseline performance and consistently achieve grader scores > 0.75.