# LLM Inference Evaluation Guide

## Overview

This project evaluates how well a large language model (LLM) can interact with the Energy & Memory RAM Optimization environment to achieve reward signals.

---

## What Gets Evaluated

When you run the LLM inference, these components are judged:

### 1. **Action Quality**
- **What**: Whether the LLM produces valid actions in format: `action_type,intensity`
- **How**: Parser validates action_type ∈ ["reduce_ram", "optimize_energy", "balance_resources", "monitor_system"] and intensity ∈ [0.0, 1.0]
- **Grade**: Action validity rate (% of valid vs invalid actions)
- **Example Good**: `reduce_ram,0.8` (valid)
- **Example Bad**: `invalid_action,2.5` (invalid - gets converted to `monitor_system,0.5`)
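
The validation rule above can be sketched as a small parser. This is a hypothetical sketch, not the project's actual parser; the function name `parse_action` and the fallback behavior are assumptions based on the description above.

```python
# Sketch of the action validation described above (names are assumptions).
VALID_ACTIONS = ["reduce_ram", "optimize_energy", "balance_resources", "monitor_system"]
FALLBACK = ("monitor_system", 0.5)  # invalid actions get converted to this

def parse_action(raw: str) -> tuple[str, float]:
    """Parse 'action_type,intensity'; fall back on any malformed input."""
    try:
        action_type, intensity_str = raw.strip().split(",")
        intensity = float(intensity_str)
        if action_type in VALID_ACTIONS and 0.0 <= intensity <= 1.0:
            return action_type, intensity
    except ValueError:
        pass  # wrong number of fields or non-numeric intensity
    return FALLBACK
```

With this sketch, `reduce_ram,0.8` parses cleanly, while `invalid_action,2.5` falls through to the `monitor_system,0.5` fallback.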

### 2. **Reward Accumulation**
- **What**: Total reward achieved across all steps
- **How**: Each action generates reward = intensity × 0.1
  - Plus task completion bonus (difficulty × 0.5)
  - Clamped to 0-1 range
- **Grade Metrics**:
  - Total reward (sum of all step rewards)
  - Average reward per step
  - Reward trend (increasing/decreasing)
  - Max and min rewards achieved
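
The reward rule above (base reward plus completion bonus, clamped) can be written out directly. A minimal sketch, assuming the clamp applies to the combined value as described:

```python
def step_reward(intensity: float, task_completed: bool, difficulty: int) -> float:
    """Reward = intensity * 0.1, plus difficulty * 0.5 on completion, clamped to [0, 1]."""
    reward = intensity * 0.1
    if task_completed:
        reward += difficulty * 0.5
    return max(0.0, min(1.0, reward))
```

For example, a non-completing `reduce_ram,0.7` step yields 0.07, while a completing step on a difficulty-1 task adds a 0.5 bonus.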

### 3. **Resource Optimization**
- **What**: How much RAM and Energy the LLM actually reduced
- **How**: Compare initial vs final state
- **Grade Metrics**:
  - RAM reduction: Initial 80% → Final X%
  - Energy reduction: Initial 8.0 kWh → Final X kWh
  - Efficiency: Resources saved ÷ Steps taken

### 4. **Task Completion**
- **What**: Whether the LLM completed the assigned task
- **How**: Environment checks if current state meets task targets within max_steps
- **Example Task**: Task 1 (basic_ram_reduction)
  - Target: RAM < 70%, Energy < 7.5 kWh
  - Deadline: 10 steps max
  - Success: If both targets met within 10 steps → Reward bonus + task marked complete
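
The Task 1 success condition above reduces to a simple predicate. A sketch under the stated targets (the parameter names are assumptions):

```python
def is_task_complete(ram_usage: float, energy_kwh: float, steps_taken: int,
                     ram_target: float = 70.0, energy_target: float = 7.5,
                     max_steps: int = 10) -> bool:
    """Task 1 check: both resource targets met within the step budget."""
    return (ram_usage < ram_target
            and energy_kwh < energy_target
            and steps_taken <= max_steps)
```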

### 5. **Grader Score (0.001 - 0.999)**
- **What**: Task-specific evaluation score from the grader
- **How**: Grader function evaluates final observation against task targets
- **Formula Example** (Task 1):
  ```
  RAM score = (100 - final_RAM) / (100 - 70)  [0-1]
  Energy score = (10 - final_energy) / (10 - 7.5)  [0-1]
  Efficiency = bonus for steps taken within limit
  Final = (RAM_score × 0.4) + (Energy_score × 0.4) + (Efficiency × 0.2)
  Clamped to [0.001, 0.999]
  ```

### 6. **Decision-Making Efficiency**
- **What**: How quickly and thoughtfully the LLM makes good decisions
- **How**: Track history of actions and rewards
- **Grade Metrics**:
  - Time-to-first-good-action
  - Convergence speed (steps to reach target)
  - Backtracking frequency (bad decision reversals)
  - Adaptive behavior (does agent improve over time?)

---

## Understanding the Tasks & Difficulty

Each task has explicit graders that measure performance:

| Task | Difficulty | RAM Target | Energy Target | Max Steps | Grader Name |
|------|-----------|-----------|---------------|-----------|------------|
| basic_ram_reduction | 1 | < 70% | < 7.5 kWh | 10 | task_1_basic_ram_reduction_grader |
| energy_optimization | 2 | < 75% | < 6.0 kWh | 15 | task_2_energy_optimization_grader |
| balanced_optimization | 3 | < 60% | < 5.0 kWh | 20 | task_3_balanced_optimization_grader |
| advanced_efficiency | 4 | < 50% | < 4.0 kWh | 25 | task_4_advanced_efficiency_grader |
| expert_optimization | 5 | < 40% | < 3.0 kWh | 30 | task_5_expert_optimization_grader |
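
The table above can be restated as a config mapping, which is convenient when scripting evaluation runs. This is a sketch; the dict shape and field names are assumptions, not the project's actual task schema:

```python
# The task table above as a Python dict (field names are assumptions).
TASKS = {
    "basic_ram_reduction":   {"difficulty": 1, "ram_target": 70, "energy_target": 7.5, "max_steps": 10},
    "energy_optimization":   {"difficulty": 2, "ram_target": 75, "energy_target": 6.0, "max_steps": 15},
    "balanced_optimization": {"difficulty": 3, "ram_target": 60, "energy_target": 5.0, "max_steps": 20},
    "advanced_efficiency":   {"difficulty": 4, "ram_target": 50, "energy_target": 4.0, "max_steps": 25},
    "expert_optimization":   {"difficulty": 5, "ram_target": 40, "energy_target": 3.0, "max_steps": 30},
}
```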

---

## How to Run Evaluation

### Step 1: Start the Environment Server
```bash
cd "d:\Projects\Pytorch x hugging face\he_demo"
python -m uvicorn he_demo.server.app:app --host 0.0.0.0 --port 8000
```

### Step 2A: Quick Baseline Test
Run the heuristic agent to establish a baseline:
```bash
python evaluate_inference.py
```

**Output you'll see:**
```
BASELINE TEST: Random Actions
Initial RAM: 80.0%, Energy: 8.0 kWh

Step 1: optimize_energy, Intensity: 0.65      | Reward: +0.065 | RAM: 80.0% | Energy:  6.7 kWh
Step 2: balance_resources, Intensity: 0.42    | Reward: +0.042 | RAM: 77.9% | Energy:  6.3 kWh
...
Baseline Total Reward: 0.341
Baseline Avg Reward: 0.068
```

### Step 2B: Run Full LLM Evaluation
```bash
# Set your LLM credentials
$env:HF_TOKEN = "your_hf_token_here"
$env:MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"  # or your preferred model

# Run inference on Task 1
$env:ENERGY_TASK = "basic_ram_reduction"
python inference.py
```

**Output you'll see:**
```
[CONFIG] Task-specific grader configured: task=basic_ram_reduction difficulty=1 description='Reduce RAM usage below 70%'

Step 1: LLM chooses → reduce_ram,0.7
  Reward: +0.070
  RAM: 80% → 73%
  Energy: 8.0 kWh

Step 2: LLM chooses → reduce_ram,0.6
  Reward: +0.060
  RAM: 73% → 67.0%
  Energy: 8.0 kWh

Step 3: LLM chooses → monitor_system,0.5
  Reward: +0.050
  RAM: 67% (already optimized)
  Energy: 8.0 kWh

[Task Completed!]
Task Completion Bonus: +0.5 reward

GRADER EVALUATION:
  Task: basic_ram_reduction
  Final State: RAM=67.0%, Energy=8.0kWh, Steps=3/10
  Grader Score: 0.782 ✓

EPISODE SUMMARY:
  Total Reward: 0.680
  Average Reward/Step: 0.227
  Task Completed: YES
  Grader Score: 0.782
```

---

## Evaluating Model Performance

### Quality Levels (by Grader Score)

| Score Range | Rating | Interpretation |
|------------|--------|----------------|
| 0.9 - 0.999 | ★★★★★ EXCELLENT | Model solved task optimally, excellent resource optimization |
| 0.7 - 0.9 | ★★★★ GOOD | Model completed task efficiently, minor suboptimality |
| 0.5 - 0.7 | ★★★ FAIR | Model made good progress, some inefficiency |
| 0.3 - 0.5 | ★★ POOR | Model struggled, significant suboptimality |
| 0.001 - 0.3 | ★ VERY POOR | Model barely optimized or failed task |
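
The score bands above map directly to a small helper, which can be handy when tabulating batch results. A sketch; the function is illustrative, not part of the project:

```python
def rating(score: float) -> str:
    """Map a grader score to the rating bands in the table above."""
    if score >= 0.9:
        return "EXCELLENT"
    if score >= 0.7:
        return "GOOD"
    if score >= 0.5:
        return "FAIR"
    if score >= 0.3:
        return "POOR"
    return "VERY POOR"
```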

### Expected Performance Baselines

Based on environment design:

**Random Agent:**
- Avg Reward: ~0.05 per step
- Task Completion: ~10% chance
- Grader Score: ~0.2-0.3
- Insight: No structure, just luck

**Heuristic Agent** (simple if-else rules):
- Avg Reward: ~0.1 per step
- Task Completion: ~60% chance (easy tasks only)
- Grader Score: ~0.5-0.6
- Insight: Follows simple logic, decent for easy tasks

**Competent LLM Agent** (Qwen, GPT, etc.):
- Avg Reward: ~0.12-0.15 per step
- Task Completion: ~70-80% (easy-medium tasks)
- Grader Score: ~0.65-0.80
- Insight: Understands environment, makes reasonable decisions

**Expert LLM Agent** (with few-shot examples):
- Avg Reward: ~0.16-0.20 per step
- Task Completion: ~85-95% (even hard tasks)
- Grader Score: ~0.80-0.95
- Insight: Optimized strategy, efficient resource management

---

## Detailed Metrics Collected

The `evaluate_inference.py` script tracks:

```python
{
    "task_name": "basic_ram_reduction",
    "difficulty": 1,
    "total_steps": 5,
    "total_reward": 0.482,           # Sum of all step rewards
    "avg_reward": 0.0964,             # Average per step
    "reward_range": [0.05, 0.12],    # Min to max reward
    "valid_actions": 5,               # Actions that parsed correctly
    "invalid_actions": 0,             # Actions that failed parsing
    "action_validity_rate": 1.0,      # % of valid actions
    "initial_ram": 80.0,
    "final_ram": 50.0,                # Resource improvement
    "initial_energy": 8.0,
    "final_energy": 4.5,              # Resource improvement
    "task_completed": True,            # Did it hit targets?
    "final_task_progress": 1.0,       # 0.0 = no progress, 1.0 = complete
    "grader_score": 0.782             # Task-specific grader evaluation
}
```
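
Several of the fields above are simple aggregates over the episode history. A sketch of how they could be derived; the history representation (a list of per-step rewards plus valid/invalid counts) is an assumption:

```python
def summarize(rewards: list[float], valid: int, invalid: int) -> dict:
    """Derive the aggregate metrics above from an episode's reward history."""
    total_actions = valid + invalid
    return {
        "total_steps": len(rewards),
        "total_reward": round(sum(rewards), 4),
        "avg_reward": round(sum(rewards) / len(rewards), 4) if rewards else 0.0,
        "reward_range": [min(rewards), max(rewards)] if rewards else [0.0, 0.0],
        "action_validity_rate": valid / total_actions if total_actions else 0.0,
    }
```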

---

## What Makes a Good LLM Agent

1. **Prompt Understanding**: Parses the observation correctly
2. **Action Validity**: Produces valid actions in correct format
3. **Resource Awareness**: Tracks RAM/Energy trade-offs
4. **Goal Orientation**: Works toward task targets, not random
5. **Efficiency**: Achieves targets in fewer steps when possible
6. **Adaptability**: Adjusts strategy if initial approach fails

---

## How Task Graders Work

Each task has a **task_N_..._grader()** function that:

1. Takes the final observation
2. Calculates how close you are to targets
3. Considers step efficiency
4. Returns score in [0.001, 0.999]

**Example: Task 1 Grader Logic**

```python
def task_1_basic_ram_reduction_grader(observation):
    # RAM reduction from baseline 80% toward target < 70%
    ram_score = (100 - observation.ram_usage) / (100 - 70)  # 0 = bad, 1 = good

    # Energy reduction from baseline 8.0 toward target < 7.5
    energy_score = (10 - observation.energy_consumption) / (10 - 7.5)

    # Step efficiency (bonus for staying within the 10-step budget)
    if observation.steps_taken <= 10:
        efficiency = 1.0
    else:
        efficiency = max(0.0, 1.0 - (observation.steps_taken - 10) * 0.08)

    # Combine with weights
    composite = (ram_score * 0.4) + (energy_score * 0.4) + (efficiency * 0.2)

    # Clamp to valid range and return
    return max(0.001, min(0.999, composite))
```

---

## Commands Reference

```bash
# View available graders
python -c "from he_demo import TASK_GRADERS; print(list(TASK_GRADERS.keys()))"

# Run evaluation on specific task
$env:ENERGY_TASK="basic_ram_reduction"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN="hf_xxx"
python inference.py

# Check grader metadata
python -c "from he_demo import get_grader_metadata; import json; print(json.dumps(get_grader_metadata(), indent=2))"

# Test environment directly
python test_env_direct.py

# Run HTTP endpoint tests
curl http://localhost:8000/graders
curl http://localhost:8000/validate
curl http://localhost:8000/tasks
```

---

## Interpreting Results

**Good Sign:**
- ✅ Grader score > 0.6
- ✅ Action validity rate = 100%
- ✅ Task completed = True
- ✅ Avg reward increasing over steps
- ✅ Resource reduction matches task targets

**Bad Sign:**
- ❌ Grader score < 0.3
- ❌ Many invalid actions
- ❌ Task not completed
- ❌ Random action patterns
- ❌ Resources not improving

---

## Summary

The LLM is evaluated on:
1. **Can it parse observations?** (Action validity)
2. **Can it make good decisions?** (Reward accumulation)
3. **Can it complete tasks?** (Task targets met)
4. **How efficiently?** (Steps taken, resource optimization)
5. **What's the final score?** (Grader evaluation)

An excellent LLM agent should exceed baseline performance and consistently achieve grader scores > 0.75.