# LLM Inference Evaluation Guide

## Overview

This project evaluates how well a Language Model (LLM) can interact with the Energy & Memory RAM Optimization environment to achieve reward signals.

---

## What Gets Evaluated

When you run the LLM inference, these components are judged:

### 1. **Action Quality**
- **What**: Whether the LLM produces valid actions in format: `action_type,intensity`
- **How**: Parser validates action_type ∈ ["reduce_ram", "optimize_energy", "balance_resources", "monitor_system"] and intensity ∈ [0.0, 1.0]
- **Grade**: Action validity rate (% of valid vs invalid actions)
- **Example Good**: `reduce_ram,0.8` (valid)
- **Example Bad**: `invalid_action,2.5` (invalid - gets converted to `monitor_system,0.5`)

### 2. **Reward Accumulation**
- **What**: Total reward achieved across all steps
- **How**: Each action generates reward = intensity × 0.1
  - Plus task completion bonus (difficulty × 0.5)
  - Clamped to 0-1 range
- **Grade Metrics**:
  - Total reward (sum of all step rewards)
  - Average reward per step
  - Reward trend (increasing/decreasing)
  - Max and min rewards achieved

### 3. **Resource Optimization**
- **What**: How much RAM and Energy the LLM actually reduced
- **How**: Compare initial vs final state
- **Grade Metrics**:
  - RAM reduction: Initial 80% → Final X%
  - Energy reduction: Initial 8.0 kWh → Final X kWh
  - Efficiency: Resources saved ÷ Steps taken

### 4. **Task Completion**
- **What**: Whether the LLM completed the assigned task
- **How**: Environment checks if current state meets task targets within max_steps
- **Example Task**: Task 1 (basic_ram_reduction)
  - Target: RAM < 70%, Energy < 7.5 kWh
  - Deadline: 10 steps max
  - Success: If both targets met within 10 steps → Reward bonus + task marked complete

### 5. **Grader Score (0.001 - 0.999)**
- **What**: Task-specific evaluation score from the grader
- **How**: Grader function evaluates final observation against task targets
- **Formula Example** (Task 1):
  ```
  RAM score = (100 - final_RAM) / (100 - 70)  [0-1]
  Energy score = (10 - final_energy) / (10 - 7.5)  [0-1]
  Efficiency = bonus for steps taken within limit
  Final = (RAM_score × 0.4) + (Energy_score × 0.4) + (Efficiency × 0.2)
  Clamped to [0.001, 0.999]
  ```

### 6. **Decision-Making Efficiency**
- **What**: How quickly and thoughtfully the LLM makes good decisions
- **How**: Track history of actions and rewards
- **Grade Metrics**:
  - Time-to-first-good-action
  - Convergence speed (steps to reach target)
  - Backtracking frequency (bad decision reversals)
  - Adaptive behavior (does agent improve over time?)

---

## Understanding the Tasks & Difficulty

Each task has explicit graders that measure performance:

| Task | Difficulty | RAM Target | Energy Target | Max Steps | Grader Name |
|------|-----------|-----------|---------------|-----------|------------|
| basic_ram_reduction | 1 | < 70% | < 7.5 kWh | 10 | task_1_basic_ram_reduction_grader |
| energy_optimization | 2 | < 75% | < 6.0 kWh | 15 | task_2_energy_optimization_grader |
| balanced_optimization | 3 | < 60% | < 5.0 kWh | 20 | task_3_balanced_optimization_grader |
| advanced_efficiency | 4 | < 50% | < 4.0 kWh | 25 | task_4_advanced_efficiency_grader |
| expert_optimization | 5 | < 40% | < 3.0 kWh | 30 | task_5_expert_optimization_grader |

---

## How to Run Evaluation

### Step 1: Start the Environment Server
```bash
cd d:\Projects\Pytorch\ x\ hugging\ face\he_demo
python -m uvicorn he_demo.server.app:app --host 0.0.0.0 --port 8000
```

### Step 2A: Quick Baseline Test
Run the heuristic agent to establish a baseline:
```bash
python evaluate_inference.py
```

**Output you'll see:**
```
BASELINE TEST: Random Actions
Initial RAM: 80.0%, Energy: 8.0 kWh

Step 1: optimize_energy, Intensity: 0.65      | Reward: +0.065 | RAM: 80.0% | Energy:  6.7 kWh
Step 2: balance_resources, Intensity: 0.42    | Reward: +0.042 | RAM: 77.9% | Energy:  6.3 kWh
...
Baseline Total Reward: 0.341
Baseline Avg Reward: 0.068
```

### Step 2B: Run Full LLM Evaluation
```bash
# Set your LLM credentials
$env:HF_TOKEN = "your_hf_token_here"
$env:MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"  # or your preferred model

# Run inference on Task 1
$env:ENERGY_TASK = "basic_ram_reduction"
python inference.py
```

**Output you'll see:**
```
[CONFIG] Task-specific grader configured: task=basic_ram_reduction difficulty=1 description='Reduce RAM usage below 70%'

Step 1: LLM chooses → reduce_ram,0.7
  Reward: +0.070
  RAM: 80% → 73%
  Energy: 8.0 kWh

Step 2: LLM chooses → reduce_ram,0.6
  Reward: +0.060
  RAM: 73% → 67.0%
  Energy: 8.0 kWh

Step 3: LLM chooses → monitor_system,0.5
  Reward: +0.050
  RAM: 67% (already optimized)
  Energy: 8.0 kWh

[Task Completed!]
Task Completion Bonus: +0.5 reward

GRADER EVALUATION:
  Task: basic_ram_reduction
  Final State: RAM=67.0%, Energy=8.0kWh, Steps=3/10
  Grader Score: 0.782 ✓

EPISODE SUMMARY:
  Total Reward: 0.680
  Average Reward/Step: 0.227
  Task Completed: YES
  Grader Score: 0.782
```

---

## Evaluating Model Performance

### Quality Levels (by Grader Score)

| Score Range | Rating | Interpretation |
|------------|--------|----------------|
| 0.9 - 0.999 | ★★★★★ EXCELLENT | Model solved task optimally, excellent resource optimization |
| 0.7 - 0.9 | ★★★★ GOOD | Model completed task efficiently, minor suboptimality |
| 0.5 - 0.7 | ★★★ FAIR | Model made good progress, some inefficiency |
| 0.3 - 0.5 | ★★ POOR | Model struggled, significant suboptimality |
| 0.001 - 0.3 | ★ VERY POOR | Model barely optimized or failed task |

### Expected Performance Baselines

Based on environment design:

**Random Agent:**
- Avg Reward: ~0.05 per step
- Task Completion: ~10% chance
- Grader Score: ~0.2-0.3
- Insight: No structure, just luck

**Heuristic Agent** (simple if-else rules):
- Avg Reward: ~0.1 per step
- Task Completion: ~60% chance (easy tasks only)
- Grader Score: ~0.5-0.6
- Insight: Follows simple logic, decent for easy tasks

**Competent LLM Agent** (Qwen, GPT, etc.):
- Avg Reward: ~0.12-0.15 per step
- Task Completion: ~70-80% (easy-medium tasks)
- Grader Score: ~0.65-0.80
- Insight: Understands environment, makes reasonable decisions

**Expert LLM Agent** (with few-shot examples):
- Avg Reward: ~0.16-0.20 per step
- Task Completion: ~85-95% (even hard tasks)
- Grader Score: ~0.80-0.95
- Insight: Optimized strategy, efficient resource management

---

## Detailed Metrics Collected

The `evaluate_inference.py` script tracks:

```python
{
    "task_name": "basic_ram_reduction",
    "difficulty": 1,
    "total_steps": 5,
    "total_reward": 0.482,           # Sum of all step rewards
    "avg_reward": 0.0964,             # Average per step
    "reward_range": [0.05, 0.12],    # Min to max reward
    "valid_actions": 5,               # Actions that parsed correctly
    "invalid_actions": 0,             # Actions that failed parsing
    "action_validity_rate": 1.0,      # % of valid actions
    "initial_ram": 80.0,
    "final_ram": 50.0,                # Resource improvement
    "initial_energy": 8.0,
    "final_energy": 4.5,              # Resource improvement
    "task_completed": True,            # Did it hit targets?
    "final_task_progress": 1.0,       # 0.0 = no progress, 1.0 = complete
    "grader_score": 0.782             # Task-specific grader evaluation
}
```

---

## What Makes a Good LLM Agent

1. **Prompt Understanding**: Parses the observation correctly
2. **Action Validity**: Produces valid actions in correct format
3. **Resource Awareness**: Tracks RAM/Energy trade-offs
4. **Goal Orientation**: Works toward task targets, not random
5. **Efficiency**: Achieves targets in fewer steps when possible
6. **Adaptability**: Adjusts strategy if initial approach fails

---

## How Task Graders Work

Each task has a **task_N_..._grader()** function that:

1. Takes the final observation
2. Calculates how close you are to targets
3. Considers step efficiency
4. Returns score in [0.001, 0.999]

**Example: Task 1 Grader Logic**

```python
def task_1_basic_ram_reduction_grader(observation):
    # RAM reduction from baseline 80% to target < 70%
    ram_score = (100 - observation.ram_usage) / (100 - 70)  # 0=bad, 1=good
    
    # Energy reduction from baseline 8.0 to target < 7.5
    energy_score = (10 - observation.energy_consumption) / (10 - 7.5)
    
    # Step efficiency (bonus for using few steps)
    if observation.steps_taken <= 10:
        efficiency = 1.0
    else:
        efficiency = max(0, 1.0 - (steps - 10) * 0.08)
    
    # Combine with weights
    composite = (ram_score × 0.4) + (energy_score × 0.4) + (efficiency × 0.2)
    
    # Clamp to valid range and return
    return max(0.001, min(0.999, composite))
```

---

## Commands Reference

```bash
# View available graders
python -c "from he_demo import TASK_GRADERS; print(list(TASK_GRADERS.keys()))"

# Run evaluation on specific task
$env:ENERGY_TASK="basic_ram_reduction"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN="hf_xxx"
python inference.py

# Check grader metadata
python -c "from he_demo import get_grader_metadata; import json; print(json.dumps(get_grader_metadata(), indent=2))"

# Test environment directly
python test_env_direct.py

# Run HTTP endpoint tests
curl http://localhost:8000/graders
curl http://localhost:8000/validate
curl http://localhost:8000/tasks
```

---

## Interpreting Results

**Good Sign:**
- ✅ Grader score > 0.6
- ✅ Action validity rate = 100%
- ✅ Task completed = True
- ✅ Avg reward increasing over steps
- ✅ Resource reduction matches task targets

**Bad Sign:**
- ❌ Grader score < 0.3
- ❌ Many invalid actions
- ❌ Task not completed
- ❌ Random action patterns
- ❌ Resources not improving

---

## Summary

The LLM is evaluated on:
1. **Can it parse observations?** (Action validity)
2. **Can it make good decisions?** (Reward accumulation)
3. **Can it complete tasks?** (Task targets met)
4. **How efficiently?** (Steps taken, resource optimization)
5. **What's the final score?** (Grader evaluation)

An excellent LLM agent should exceed baseline performance and consistently achieve grader scores > 0.75.