# LLM Inference Evaluation Guide
## Overview
This project evaluates how well a Large Language Model (LLM) can interact with the Energy & Memory RAM Optimization environment to maximize reward.
---
## What Gets Evaluated
When you run the LLM inference, these components are judged:
### 1. **Action Quality**
- **What**: Whether the LLM produces valid actions in format: `action_type,intensity`
- **How**: Parser validates action_type ∈ ["reduce_ram", "optimize_energy", "balance_resources", "monitor_system"] and intensity ∈ [0.0, 1.0]
- **Grade**: Action validity rate (% of valid vs invalid actions)
- **Example Good**: `reduce_ram,0.8` (valid)
- **Example Bad**: `invalid_action,2.5` (invalid - gets converted to `monitor_system,0.5`)
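The validation rule above can be sketched as follows. This is an illustrative helper, not the environment's actual parser; `parse_action` and `VALID_ACTIONS` are hypothetical names:

```python
VALID_ACTIONS = {"reduce_ram", "optimize_energy", "balance_resources", "monitor_system"}

def parse_action(raw: str) -> tuple[str, float]:
    """Parse 'action_type,intensity'; fall back to the safe default on any error."""
    try:
        action_type, intensity_str = raw.strip().split(",")
        intensity = float(intensity_str)
        if action_type in VALID_ACTIONS and 0.0 <= intensity <= 1.0:
            return action_type, intensity
    except ValueError:
        pass
    # Anything invalid is converted to the safe default described above
    return "monitor_system", 0.5
```

For example, `parse_action("reduce_ram,0.8")` passes through unchanged, while `parse_action("invalid_action,2.5")` falls back to `("monitor_system", 0.5)`.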
### 2. **Reward Accumulation**
- **What**: Total reward achieved across all steps
- **How**: Each action generates reward = intensity × 0.1
- Plus task completion bonus (difficulty × 0.5)
- Clamped to 0-1 range
- **Grade Metrics**:
- Total reward (sum of all step rewards)
- Average reward per step
- Reward trend (increasing/decreasing)
- Max and min rewards achieved
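The per-step reward rule above can be written out as a small sketch (the function name and signature are illustrative, not the environment's actual API):

```python
def step_reward(intensity: float, task_completed: bool = False, difficulty: int = 1) -> float:
    """Base reward = intensity x 0.1, plus a completion bonus, clamped to [0, 1]."""
    reward = intensity * 0.1
    if task_completed:
        reward += difficulty * 0.5  # one-time task completion bonus
    return max(0.0, min(1.0, reward))
```

Note that on a difficulty-2 task, a completing step's raw reward (e.g. 0.07 + 1.0) exceeds 1, so the clamp caps it at 1.0.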
### 3. **Resource Optimization**
- **What**: How much RAM and Energy the LLM actually reduced
- **How**: Compare initial vs final state
- **Grade Metrics**:
- RAM reduction: Initial 80% → Final X%
- Energy reduction: Initial 8.0 kWh → Final X kWh
- Efficiency: Resources saved ÷ Steps taken
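One plausible reading of the efficiency metric above, as a sketch: combine the RAM percentage points and kWh saved, divided by steps taken. The equal weighting of the two units is an assumption, not the environment's documented formula:

```python
def resource_efficiency(initial_ram: float, final_ram: float,
                        initial_energy: float, final_energy: float,
                        steps: int) -> float:
    """Resources saved per step (RAM in percentage points, energy in kWh)."""
    ram_saved = initial_ram - final_ram
    energy_saved = initial_energy - final_energy
    return (ram_saved + energy_saved) / max(steps, 1)
```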
### 4. **Task Completion**
- **What**: Whether the LLM completed the assigned task
- **How**: Environment checks if current state meets task targets within max_steps
- **Example Task**: Task 1 (basic_ram_reduction)
- Target: RAM < 70%, Energy < 7.5 kWh
- Deadline: 10 steps max
- Success: If both targets met within 10 steps → Reward bonus + task marked complete
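The completion check for Task 1 can be sketched like this (a hypothetical helper mirroring the stated targets, not the environment's actual code):

```python
def task_completed(ram: float, energy: float, steps: int,
                   ram_target: float = 70.0, energy_target: float = 7.5,
                   max_steps: int = 10) -> bool:
    """Task 1: both targets must be met within the step budget."""
    return ram < ram_target and energy < energy_target and steps <= max_steps
```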
### 5. **Grader Score (0.001 - 0.999)**
- **What**: Task-specific evaluation score from the grader
- **How**: Grader function evaluates final observation against task targets
- **Formula Example** (Task 1):
```
RAM_score    = (100 - final_RAM) / (100 - 70)      # clamped to [0, 1]
Energy_score = (10 - final_energy) / (10 - 7.5)    # clamped to [0, 1]
Efficiency   = bonus for finishing within the step limit
Final = (RAM_score × 0.4) + (Energy_score × 0.4) + (Efficiency × 0.2)
Final is clamped to [0.001, 0.999]
```
### 6. **Decision-Making Efficiency**
- **What**: How quickly and thoughtfully the LLM makes good decisions
- **How**: Track history of actions and rewards
- **Grade Metrics**:
- Time-to-first-good-action
- Convergence speed (steps to reach target)
- Backtracking frequency (bad decision reversals)
- Adaptive behavior (does agent improve over time?)
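One of these metrics, the reward trend, can be computed from the action/reward history with a simple half-split comparison. This is an illustrative sketch; the actual trend heuristic used by the evaluation script may differ:

```python
def reward_trend(rewards: list[float]) -> str:
    """Classify the trend by comparing the mean of the second half to the first."""
    half = len(rewards) // 2
    first = rewards[:half] or [0.0]
    second = rewards[half:] or [0.0]
    avg_first = sum(first) / len(first)
    avg_second = sum(second) / len(second)
    if avg_second > avg_first:
        return "increasing"
    if avg_second < avg_first:
        return "decreasing"
    return "flat"
```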
---
## Understanding the Tasks & Difficulty
Each task has explicit graders that measure performance:
| Task | Difficulty | RAM Target | Energy Target | Max Steps | Grader Name |
|------|-----------|-----------|---------------|-----------|------------|
| basic_ram_reduction | 1 | < 70% | < 7.5 kWh | 10 | task_1_basic_ram_reduction_grader |
| energy_optimization | 2 | < 75% | < 6.0 kWh | 15 | task_2_energy_optimization_grader |
| balanced_optimization | 3 | < 60% | < 5.0 kWh | 20 | task_3_balanced_optimization_grader |
| advanced_efficiency | 4 | < 50% | < 4.0 kWh | 25 | task_4_advanced_efficiency_grader |
| expert_optimization | 5 | < 40% | < 3.0 kWh | 30 | task_5_expert_optimization_grader |
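The table above maps naturally onto a configuration dict, which can be handy when scripting evaluations across tasks. The `TASKS` name and key layout here are illustrative, not the structure used inside `he_demo`:

```python
TASKS = {
    "basic_ram_reduction":   {"difficulty": 1, "ram_target": 70, "energy_target": 7.5, "max_steps": 10},
    "energy_optimization":   {"difficulty": 2, "ram_target": 75, "energy_target": 6.0, "max_steps": 15},
    "balanced_optimization": {"difficulty": 3, "ram_target": 60, "energy_target": 5.0, "max_steps": 20},
    "advanced_efficiency":   {"difficulty": 4, "ram_target": 50, "energy_target": 4.0, "max_steps": 25},
    "expert_optimization":   {"difficulty": 5, "ram_target": 40, "energy_target": 3.0, "max_steps": 30},
}
```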
---
## How to Run Evaluation
### Step 1: Start the Environment Server
```powershell
cd "d:\Projects\Pytorch x hugging face\he_demo"
python -m uvicorn he_demo.server.app:app --host 0.0.0.0 --port 8000
```
### Step 2A: Quick Baseline Test
Run the heuristic agent to establish a baseline:
```bash
python evaluate_inference.py
```
**Output you'll see:**
```
BASELINE TEST: Random Actions
Initial RAM: 80.0%, Energy: 8.0 kWh
Step 1: optimize_energy, Intensity: 0.65 | Reward: +0.065 | RAM: 80.0% | Energy: 6.7 kWh
Step 2: balance_resources, Intensity: 0.42 | Reward: +0.042 | RAM: 77.9% | Energy: 6.3 kWh
...
Baseline Total Reward: 0.341
Baseline Avg Reward: 0.068
```
### Step 2B: Run Full LLM Evaluation
```powershell
# Set your LLM credentials
$env:HF_TOKEN = "your_hf_token_here"
$env:MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct" # or your preferred model
# Run inference on Task 1
$env:ENERGY_TASK = "basic_ram_reduction"
python inference.py
```
**Output you'll see:**
```
[CONFIG] Task-specific grader configured: task=basic_ram_reduction difficulty=1 description='Reduce RAM usage below 70%'
Step 1: LLM chooses → reduce_ram,0.7
Reward: +0.070
RAM: 80% → 73%
Energy: 8.0 kWh
Step 2: LLM chooses → reduce_ram,0.6
Reward: +0.060
RAM: 73% → 67.0%
Energy: 8.0 kWh
Step 3: LLM chooses → monitor_system,0.5
Reward: +0.050
RAM: 67% (already optimized)
Energy: 8.0 kWh
[Task Completed!]
Task Completion Bonus: +0.5 reward
GRADER EVALUATION:
Task: basic_ram_reduction
Final State: RAM=67.0%, Energy=8.0kWh, Steps=3/10
Grader Score: 0.782 ✓
EPISODE SUMMARY:
Total Reward: 0.680
Average Reward/Step: 0.227
Task Completed: YES
Grader Score: 0.782
```
---
## Evaluating Model Performance
### Quality Levels (by Grader Score)
| Score Range | Rating | Interpretation |
|------------|--------|----------------|
| 0.9 - 0.999 | ★★★★★ EXCELLENT | Model solved task optimally, excellent resource optimization |
| 0.7 - 0.9 | ★★★★ GOOD | Model completed task efficiently, minor suboptimality |
| 0.5 - 0.7 | ★★★ FAIR | Model made good progress, some inefficiency |
| 0.3 - 0.5 | ★★ POOR | Model struggled, significant suboptimality |
| 0.001 - 0.3 | ★ VERY POOR | Model barely optimized or failed task |
### Expected Performance Baselines
Based on environment design:
**Random Agent:**
- Avg Reward: ~0.05 per step
- Task Completion: ~10% chance
- Grader Score: ~0.2-0.3
- Insight: No structure, just luck
**Heuristic Agent** (simple if-else rules):
- Avg Reward: ~0.1 per step
- Task Completion: ~60% chance (easy tasks only)
- Grader Score: ~0.5-0.6
- Insight: Follows simple logic, decent for easy tasks
**Competent LLM Agent** (Qwen, GPT, etc.):
- Avg Reward: ~0.12-0.15 per step
- Task Completion: ~70-80% (easy-medium tasks)
- Grader Score: ~0.65-0.80
- Insight: Understands environment, makes reasonable decisions
**Expert LLM Agent** (with few-shot examples):
- Avg Reward: ~0.16-0.20 per step
- Task Completion: ~85-95% (even hard tasks)
- Grader Score: ~0.80-0.95
- Insight: Optimized strategy, efficient resource management
---
## Detailed Metrics Collected
The `evaluate_inference.py` script tracks:
```python
{
"task_name": "basic_ram_reduction",
"difficulty": 1,
"total_steps": 5,
"total_reward": 0.482, # Sum of all step rewards
"avg_reward": 0.0964, # Average per step
"reward_range": [0.05, 0.12], # Min to max reward
"valid_actions": 5, # Actions that parsed correctly
"invalid_actions": 0, # Actions that failed parsing
"action_validity_rate": 1.0, # % of valid actions
"initial_ram": 80.0,
"final_ram": 50.0, # Resource improvement
"initial_energy": 8.0,
"final_energy": 4.5, # Resource improvement
"task_completed": True, # Did it hit targets?
"final_task_progress": 1.0, # 0.0 = no progress, 1.0 = complete
"grader_score": 0.782 # Task-specific grader evaluation
}
```
---
## What Makes a Good LLM Agent
1. **Prompt Understanding**: Parses the observation correctly
2. **Action Validity**: Produces valid actions in correct format
3. **Resource Awareness**: Tracks RAM/Energy trade-offs
4. **Goal Orientation**: Works toward task targets, not random
5. **Efficiency**: Achieves targets in fewer steps when possible
6. **Adaptability**: Adjusts strategy if initial approach fails
---
## How Task Graders Work
Each task has a **task_N_..._grader()** function that:
1. Takes the final observation
2. Calculates how close you are to targets
3. Considers step efficiency
4. Returns score in [0.001, 0.999]
**Example: Task 1 Grader Logic**
```python
def task_1_basic_ram_reduction_grader(observation):
    # RAM reduction from baseline 80% toward target < 70%, clamped to [0, 1]
    ram_score = min(1.0, max(0.0, (100 - observation.ram_usage) / (100 - 70)))
    # Energy reduction from baseline 8.0 toward target < 7.5, clamped to [0, 1]
    energy_score = min(1.0, max(0.0, (10 - observation.energy_consumption) / (10 - 7.5)))
    # Step efficiency: full credit within the limit, decaying penalty beyond it
    if observation.steps_taken <= 10:
        efficiency = 1.0
    else:
        efficiency = max(0.0, 1.0 - (observation.steps_taken - 10) * 0.08)
    # Combine with weights
    composite = (ram_score * 0.4) + (energy_score * 0.4) + (efficiency * 0.2)
    # Clamp to valid range and return
    return max(0.001, min(0.999, composite))
```
---
## Commands Reference
```powershell
# View available graders
python -c "from he_demo import TASK_GRADERS; print(list(TASK_GRADERS.keys()))"
# Run evaluation on specific task
$env:ENERGY_TASK="basic_ram_reduction"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN="hf_xxx"
python inference.py
# Check grader metadata
python -c "from he_demo import get_grader_metadata; import json; print(json.dumps(get_grader_metadata(), indent=2))"
# Test environment directly
python test_env_direct.py
# Run HTTP endpoint tests
curl http://localhost:8000/graders
curl http://localhost:8000/validate
curl http://localhost:8000/tasks
```
---
## Interpreting Results
**Good Sign:**
- ✅ Grader score > 0.6
- ✅ Action validity rate = 100%
- ✅ Task completed = True
- ✅ Avg reward increasing over steps
- ✅ Resource reduction matches task targets
**Bad Sign:**
- ❌ Grader score < 0.3
- ❌ Many invalid actions
- ❌ Task not completed
- ❌ Random action patterns
- ❌ Resources not improving
---
## Summary
The LLM is evaluated on:
1. **Can it parse observations?** (Action validity)
2. **Can it make good decisions?** (Reward accumulation)
3. **Can it complete tasks?** (Task targets met)
4. **How efficiently?** (Steps taken, resource optimization)
5. **What's the final score?** (Grader evaluation)
An excellent LLM agent should exceed baseline performance and consistently achieve grader scores > 0.75.