LLM Inference Evaluation Guide
Overview
This project evaluates how well a Large Language Model (LLM) can interact with the Energy & Memory RAM Optimization environment to maximize its reward signal.
What Gets Evaluated
When you run the LLM inference, these components are judged:
1. Action Quality
- What: Whether the LLM produces valid actions in the format action_type,intensity
- How: Parser validates action_type ∈ ["reduce_ram", "optimize_energy", "balance_resources", "monitor_system"] and intensity ∈ [0.0, 1.0]
- Grade: Action validity rate (% of valid vs. invalid actions)
- Example (good): reduce_ram,0.8 (valid)
- Example (bad): invalid_action,2.5 (invalid; converted to the fallback monitor_system,0.5)
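The validation rules above can be sketched as a small parser. This is an illustrative helper (`parse_action` is a hypothetical name, not necessarily the project's real parser), but it implements the same rules: a valid action type, an intensity in [0.0, 1.0], and a fallback of monitor_system,0.5 on any failure.

```python
# Minimal sketch of the action-parsing rules described above.
# parse_action is a hypothetical helper name; the project's parser may differ.
VALID_ACTIONS = {"reduce_ram", "optimize_energy", "balance_resources", "monitor_system"}
FALLBACK = ("monitor_system", 0.5)

def parse_action(raw: str) -> tuple[str, float]:
    """Parse 'action_type,intensity'; fall back to monitor_system,0.5 on any error."""
    try:
        action_type, intensity_str = raw.strip().split(",")
        intensity = float(intensity_str)
        if action_type in VALID_ACTIONS and 0.0 <= intensity <= 1.0:
            return action_type, intensity
    except ValueError:
        pass  # malformed string: wrong field count or non-numeric intensity
    return FALLBACK

print(parse_action("reduce_ram,0.8"))      # ('reduce_ram', 0.8)
print(parse_action("invalid_action,2.5"))  # ('monitor_system', 0.5)
```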
2. Reward Accumulation
- What: Total reward achieved across all steps
- How: Each action earns reward = intensity × 0.1
- Plus a task-completion bonus (difficulty × 0.5)
- The step reward (including the bonus) is clamped to the [0, 1] range
- Grade Metrics:
- Total reward (sum of all step rewards)
- Average reward per step
- Reward trend (increasing/decreasing)
- Max and min rewards achieved
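The reward rule above can be written out directly. The sketch below assumes the completion bonus is added to the step on which the task completes and that clamping happens after the bonus; `step_reward` is an illustrative name.

```python
# Sketch of the reward rule: reward = intensity * 0.1, plus a one-time
# completion bonus of difficulty * 0.5, with each step clamped to [0, 1].
def step_reward(intensity: float, completed: bool = False, difficulty: int = 1) -> float:
    reward = intensity * 0.1
    if completed:
        reward += difficulty * 0.5  # one-time task-completion bonus
    return max(0.0, min(1.0, reward))

rewards = [step_reward(0.7), step_reward(0.6), step_reward(0.5, completed=True)]
total = sum(rewards)
print(f"total={total:.3f} avg={total / len(rewards):.3f}")  # total=0.680 avg=0.227
```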
3. Resource Optimization
- What: How much RAM and Energy the LLM actually reduced
- How: Compare initial vs final state
- Grade Metrics:
- RAM reduction: Initial 80% → Final X%
- Energy reduction: Initial 8.0 kWh → Final X kWh
- Efficiency: Resources saved ÷ Steps taken
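One illustrative way to compute these metrics from the initial and final state (summing RAM percentage points and kWh into a single "resources saved" number is an assumption of this sketch, since the two quantities have different units):

```python
# Illustrative resource-optimization metrics from initial vs. final state.
initial_ram, final_ram = 80.0, 67.0        # percent
initial_energy, final_energy = 8.0, 6.5    # kWh
steps_taken = 3

ram_reduction = initial_ram - final_ram           # percentage points saved
energy_reduction = initial_energy - final_energy  # kWh saved
efficiency = (ram_reduction + energy_reduction) / steps_taken  # resources saved per step
print(f"RAM -{ram_reduction:.1f}%, Energy -{energy_reduction:.1f} kWh, "
      f"efficiency={efficiency:.2f} per step")
```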
4. Task Completion
- What: Whether the LLM completed the assigned task
- How: Environment checks if current state meets task targets within max_steps
- Example Task: Task 1 (basic_ram_reduction)
- Target: RAM < 70%, Energy < 7.5 kWh
- Deadline: 10 steps max
- Success: If both targets met within 10 steps → Reward bonus + task marked complete
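The completion check for Task 1 reduces to a conjunction of the two targets plus the step budget. The function below is a sketch with illustrative names, not the environment's actual implementation:

```python
# Sketch of the Task 1 (basic_ram_reduction) completion check:
# both targets must be met within max_steps.
def task_completed(ram: float, energy: float, steps: int,
                   ram_target: float = 70.0, energy_target: float = 7.5,
                   max_steps: int = 10) -> bool:
    """Task succeeds only if both targets are met within the step budget."""
    return ram < ram_target and energy < energy_target and steps <= max_steps

print(task_completed(ram=67.0, energy=7.2, steps=3))  # True
print(task_completed(ram=67.0, energy=8.0, steps=3))  # False - energy target missed
```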
5. Grader Score (0.001 - 0.999)
- What: Task-specific evaluation score from the grader
- How: Grader function evaluates final observation against task targets
- Formula Example (Task 1):
RAM_score = (100 - final_RAM) / (100 - 70)        # in [0, 1]
Energy_score = (10 - final_energy) / (10 - 7.5)   # in [0, 1]
Efficiency = bonus for finishing within the step limit
Final = (RAM_score × 0.4) + (Energy_score × 0.4) + (Efficiency × 0.2), clamped to [0.001, 0.999]
6. Decision-Making Efficiency
- What: How quickly and thoughtfully the LLM makes good decisions
- How: Track history of actions and rewards
- Grade Metrics:
- Time-to-first-good-action
- Convergence speed (steps to reach target)
- Backtracking frequency (bad decision reversals)
- Adaptive behavior (does agent improve over time?)
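The first two of these metrics can be computed from the action/reward history alone. The sketch below is illustrative; the "good action" threshold of 0.05 and the halves-comparison trend test are assumptions, not the project's exact definitions.

```python
# Illustrative decision-efficiency metrics from a reward history.
rewards = [0.02, 0.05, 0.07, 0.07, 0.08]

GOOD = 0.05  # assumed threshold for a "good" action
first_good = next((i + 1 for i, r in enumerate(rewards) if r >= GOOD), None)

# Reward trend: compare the mean of the early and late halves of the episode
half = len(rewards) // 2
early = sum(rewards[:half]) / half
late = sum(rewards[half:]) / (len(rewards) - half)
trend = "increasing" if late > early else "decreasing"

print(f"first good action: step {first_good}, trend: {trend}")
```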
Understanding the Tasks & Difficulty
Each task has explicit graders that measure performance:
| Task | Difficulty | RAM Target | Energy Target | Max Steps | Grader Name |
|---|---|---|---|---|---|
| basic_ram_reduction | 1 | < 70% | < 7.5 kWh | 10 | task_1_basic_ram_reduction_grader |
| energy_optimization | 2 | < 75% | < 6.0 kWh | 15 | task_2_energy_optimization_grader |
| balanced_optimization | 3 | < 60% | < 5.0 kWh | 20 | task_3_balanced_optimization_grader |
| advanced_efficiency | 4 | < 50% | < 4.0 kWh | 25 | task_4_advanced_efficiency_grader |
| expert_optimization | 5 | < 40% | < 3.0 kWh | 30 | task_5_expert_optimization_grader |
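The task table above can be expressed as a config dict. This is an illustrative structure for reference, not necessarily how he_demo stores its tasks internally:

```python
# The task table as a config dict (illustrative structure).
TASKS = {
    "basic_ram_reduction":   {"difficulty": 1, "ram_target": 70.0, "energy_target": 7.5, "max_steps": 10},
    "energy_optimization":   {"difficulty": 2, "ram_target": 75.0, "energy_target": 6.0, "max_steps": 15},
    "balanced_optimization": {"difficulty": 3, "ram_target": 60.0, "energy_target": 5.0, "max_steps": 20},
    "advanced_efficiency":   {"difficulty": 4, "ram_target": 50.0, "energy_target": 4.0, "max_steps": 25},
    "expert_optimization":   {"difficulty": 5, "ram_target": 40.0, "energy_target": 3.0, "max_steps": 30},
}

task = TASKS["basic_ram_reduction"]
print(f"Task 1 targets: RAM < {task['ram_target']}%, "
      f"Energy < {task['energy_target']} kWh, within {task['max_steps']} steps")
```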
How to Run Evaluation
Step 1: Start the Environment Server
cd "d:\Projects\Pytorch x hugging face\he_demo"
python -m uvicorn he_demo.server.app:app --host 0.0.0.0 --port 8000
Step 2A: Quick Baseline Test
Run the heuristic agent to establish a baseline:
python evaluate_inference.py
Output you'll see:
BASELINE TEST: Random Actions
Initial RAM: 80.0%, Energy: 8.0 kWh
Step 1: optimize_energy, Intensity: 0.65 | Reward: +0.065 | RAM: 80.0% | Energy: 6.7 kWh
Step 2: balance_resources, Intensity: 0.42 | Reward: +0.042 | RAM: 77.9% | Energy: 6.3 kWh
...
Baseline Total Reward: 0.341
Baseline Avg Reward: 0.068
Step 2B: Run Full LLM Evaluation
# Set your LLM credentials
$env:HF_TOKEN = "your_hf_token_here"
$env:MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct" # or your preferred model
# Run inference on Task 1
$env:ENERGY_TASK = "basic_ram_reduction"
python inference.py
Output you'll see:
[CONFIG] Task-specific grader configured: task=basic_ram_reduction difficulty=1 description='Reduce RAM usage below 70%'
Step 1: LLM chooses → reduce_ram,0.7
Reward: +0.070
RAM: 80% → 73%
Energy: 8.0 kWh
Step 2: LLM chooses → reduce_ram,0.6
Reward: +0.060
RAM: 73% → 67.0%
Energy: 8.0 kWh
Step 3: LLM chooses → monitor_system,0.5
Reward: +0.050
RAM: 67% (already optimized)
Energy: 8.0 kWh
[Task Completed!]
Task Completion Bonus: +0.5 reward
GRADER EVALUATION:
Task: basic_ram_reduction
Final State: RAM=67.0%, Energy=8.0kWh, Steps=3/10
Grader Score: 0.782 ✓
EPISODE SUMMARY:
Total Reward: 0.680
Average Reward/Step: 0.227
Task Completed: YES
Grader Score: 0.782
Evaluating Model Performance
Quality Levels (by Grader Score)
| Score Range | Rating | Interpretation |
|---|---|---|
| 0.90 - 0.999 | ★★★★★ EXCELLENT | Model solved the task optimally, with excellent resource optimization |
| 0.70 - 0.89 | ★★★★ GOOD | Model completed the task efficiently, with minor suboptimality |
| 0.50 - 0.69 | ★★★ FAIR | Model made good progress, with some inefficiency |
| 0.30 - 0.49 | ★★ POOR | Model struggled, with significant suboptimality |
| 0.001 - 0.29 | ★ VERY POOR | Model barely optimized or failed the task |
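The rating bands above map directly to a small helper (illustrative; the band boundaries come straight from the table):

```python
# Map a grader score to the rating bands in the table above.
def rating(score: float) -> str:
    if score >= 0.9:
        return "EXCELLENT"
    if score >= 0.7:
        return "GOOD"
    if score >= 0.5:
        return "FAIR"
    if score >= 0.3:
        return "POOR"
    return "VERY POOR"

print(rating(0.782))  # GOOD
```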
Expected Performance Baselines
Based on environment design:
Random Agent:
- Avg Reward: ~0.05 per step
- Task Completion: ~10% chance
- Grader Score: ~0.2-0.3
- Insight: No structure, just luck
Heuristic Agent (simple if-else rules):
- Avg Reward: ~0.1 per step
- Task Completion: ~60% chance (easy tasks only)
- Grader Score: ~0.5-0.6
- Insight: Follows simple logic, decent for easy tasks
Competent LLM Agent (Qwen, GPT, etc.):
- Avg Reward: ~0.12-0.15 per step
- Task Completion: ~70-80% (easy-medium tasks)
- Grader Score: ~0.65-0.80
- Insight: Understands environment, makes reasonable decisions
Expert LLM Agent (with few-shot examples):
- Avg Reward: ~0.16-0.20 per step
- Task Completion: ~85-95% (even hard tasks)
- Grader Score: ~0.80-0.95
- Insight: Optimized strategy, efficient resource management
Detailed Metrics Collected
The evaluate_inference.py script tracks:
{
    "task_name": "basic_ram_reduction",
    "difficulty": 1,
    "total_steps": 5,
    "total_reward": 0.482,           # Sum of all step rewards
    "avg_reward": 0.0964,            # Average reward per step
    "reward_range": [0.05, 0.12],    # Min to max step reward
    "valid_actions": 5,              # Actions that parsed correctly
    "invalid_actions": 0,            # Actions that failed parsing
    "action_validity_rate": 1.0,     # Fraction of valid actions (1.0 = 100%)
    "initial_ram": 80.0,
    "final_ram": 50.0,               # Resource improvement
    "initial_energy": 8.0,
    "final_energy": 4.5,             # Resource improvement
    "task_completed": True,          # Did it hit the targets?
    "final_task_progress": 1.0,      # 0.0 = no progress, 1.0 = complete
    "grader_score": 0.782            # Task-specific grader evaluation
}
What Makes a Good LLM Agent
- Prompt Understanding: Parses the observation correctly
- Action Validity: Produces valid actions in correct format
- Resource Awareness: Tracks RAM/Energy trade-offs
- Goal Orientation: Works toward task targets, not random
- Efficiency: Achieves targets in fewer steps when possible
- Adaptability: Adjusts strategy if initial approach fails
How Task Graders Work
Each task has a task_N_..._grader() function that:
- Takes the final observation
- Calculates how close you are to targets
- Considers step efficiency
- Returns score in [0.001, 0.999]
Example: Task 1 Grader Logic
def task_1_basic_ram_reduction_grader(observation):
    # RAM reduction from baseline 80% toward target < 70%, clamped to [0, 1]
    ram_score = min(1.0, max(0.0, (100 - observation.ram_usage) / (100 - 70)))
    # Energy reduction from baseline 8.0 toward target < 7.5, clamped to [0, 1]
    energy_score = min(1.0, max(0.0, (10 - observation.energy_consumption) / (10 - 7.5)))
    # Step efficiency (bonus for using few steps)
    if observation.steps_taken <= 10:
        efficiency = 1.0
    else:
        efficiency = max(0.0, 1.0 - (observation.steps_taken - 10) * 0.08)
    # Combine with weights
    composite = (ram_score * 0.4) + (energy_score * 0.4) + (efficiency * 0.2)
    # Clamp to valid range and return
    return max(0.001, min(0.999, composite))
Commands Reference
# View available graders
python -c "from he_demo import TASK_GRADERS; print(list(TASK_GRADERS.keys()))"
# Run evaluation on specific task
$env:ENERGY_TASK="basic_ram_reduction"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN="hf_xxx"
python inference.py
# Check grader metadata
python -c "from he_demo import get_grader_metadata; import json; print(json.dumps(get_grader_metadata(), indent=2))"
# Test environment directly
python test_env_direct.py
# Run HTTP endpoint tests
curl http://localhost:8000/graders
curl http://localhost:8000/validate
curl http://localhost:8000/tasks
Interpreting Results
Good Sign:
- ✅ Grader score > 0.6
- ✅ Action validity rate = 100%
- ✅ Task completed = True
- ✅ Avg reward increasing over steps
- ✅ Resource reduction matches task targets
Bad Sign:
- ❌ Grader score < 0.3
- ❌ Many invalid actions
- ❌ Task not completed
- ❌ Random action patterns
- ❌ Resources not improving
Summary
The LLM is evaluated on:
- Can it parse observations? (Action validity)
- Can it make good decisions? (Reward accumulation)
- Can it complete tasks? (Task targets met)
- How efficiently? (Steps taken, resource optimization)
- What's the final score? (Grader evaluation)
An excellent LLM agent should exceed baseline performance and consistently achieve grader scores > 0.75.