# LLM Inference Evaluation Guide

## Overview

This project evaluates how well a Large Language Model (LLM) can interact with the Energy & Memory RAM Optimization environment to achieve reward signals.

---

## What Gets Evaluated

When you run the LLM inference, these components are judged:

### 1. **Action Quality**

- **What**: Whether the LLM produces valid actions in the format `action_type,intensity`
- **How**: Parser validates action_type ∈ ["reduce_ram", "optimize_energy", "balance_resources", "monitor_system"] and intensity ∈ [0.0, 1.0]
- **Grade**: Action validity rate (percentage of actions that are valid)
- **Example Good**: `reduce_ram,0.8` (valid)
- **Example Bad**: `invalid_action,2.5` (invalid; gets converted to `monitor_system,0.5`)

### 2. **Reward Accumulation**

- **What**: Total reward achieved across all steps
- **How**: Each action generates reward = intensity × 0.1
  - Plus task completion bonus (difficulty × 0.5)
  - Clamped to 0-1 range
- **Grade Metrics**:
  - Total reward (sum of all step rewards)
  - Average reward per step
  - Reward trend (increasing/decreasing)
  - Max and min rewards achieved

### 3. **Resource Optimization**

- **What**: How much RAM and Energy the LLM actually reduced
- **How**: Compare initial vs final state
- **Grade Metrics**:
  - RAM reduction: Initial 80% → Final X%
  - Energy reduction: Initial 8.0 kWh → Final X kWh
  - Efficiency: Resources saved ÷ Steps taken

### 4. **Task Completion**

- **What**: Whether the LLM completed the assigned task
- **How**: Environment checks if current state meets task targets within max_steps
- **Example Task**: Task 1 (basic_ram_reduction)
  - Target: RAM < 70%, Energy < 7.5 kWh
  - Deadline: 10 steps max
  - Success: If both targets met within 10 steps → Reward bonus + task marked complete

### 5. **Grader Score (0.001 - 0.999)**

- **What**: Task-specific evaluation score from the grader
- **How**: Grader function evaluates final observation against task targets
- **Formula Example** (Task 1):

```
RAM score    = (100 - final_RAM) / (100 - 70)       [0-1]
Energy score = (10 - final_energy) / (10 - 7.5)     [0-1]
Efficiency   = bonus for steps taken within limit
Final        = (RAM_score × 0.4) + (Energy_score × 0.4) + (Efficiency × 0.2)
Clamped to [0.001, 0.999]
```

### 6. **Decision-Making Efficiency**

- **What**: How quickly and thoughtfully the LLM makes good decisions
- **How**: Track history of actions and rewards
- **Grade Metrics**:
  - Time-to-first-good-action
  - Convergence speed (steps to reach target)
  - Backtracking frequency (bad decision reversals)
  - Adaptive behavior (does the agent improve over time?)

---

## Understanding the Tasks & Difficulty

Each task has an explicit grader that measures performance:

| Task | Difficulty | RAM Target | Energy Target | Max Steps | Grader Name |
|------|-----------|-----------|---------------|-----------|------------|
| basic_ram_reduction | 1 | < 70% | < 7.5 kWh | 10 | task_1_basic_ram_reduction_grader |
| energy_optimization | 2 | < 75% | < 6.0 kWh | 15 | task_2_energy_optimization_grader |
| balanced_optimization | 3 | < 60% | < 5.0 kWh | 20 | task_3_balanced_optimization_grader |
| advanced_efficiency | 4 | < 50% | < 4.0 kWh | 25 | task_4_advanced_efficiency_grader |
| expert_optimization | 5 | < 40% | < 3.0 kWh | 30 | task_5_expert_optimization_grader |

---

## How to Run Evaluation

### Step 1: Start the Environment Server

```powershell
cd "d:\Projects\Pytorch x hugging face\he_demo"
python -m uvicorn he_demo.server.app:app --host 0.0.0.0 --port 8000
```

### Step 2A: Quick Baseline Test

Run the baseline script to establish a reference point:

```bash
python evaluate_inference.py
```

**Example output** (exact values will vary by run):

```
BASELINE TEST: Random Actions
Initial RAM: 80.0%, Energy: 8.0 kWh
Step 1: optimize_energy, Intensity: 0.65 | Reward: +0.065 | RAM: 80.0% | Energy: 6.7 kWh
Step 2: balance_resources, Intensity: 0.42 | Reward: +0.042 | RAM: 77.9% | Energy: 6.3 kWh
...
Baseline Total Reward: 0.341
Baseline Avg Reward: 0.068
```

### Step 2B: Run Full LLM Evaluation

```powershell
# Set your LLM credentials
$env:HF_TOKEN = "your_hf_token_here"
$env:MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"  # or your preferred model

# Run inference on Task 1
$env:ENERGY_TASK = "basic_ram_reduction"
python inference.py
```

**Example output** (exact values will vary by run):

```
[CONFIG] Task-specific grader configured: task=basic_ram_reduction difficulty=1 description='Reduce RAM usage below 70%'

Step 1: LLM chooses → reduce_ram,0.7
  Reward: +0.070
  RAM: 80% → 73%
  Energy: 8.0 kWh

Step 2: LLM chooses → reduce_ram,0.6
  Reward: +0.060
  RAM: 73% → 67.0%
  Energy: 8.0 kWh

Step 3: LLM chooses → monitor_system,0.5
  Reward: +0.050
  RAM: 67% (already optimized)
  Energy: 8.0 kWh

[Task Completed!]
Task Completion Bonus: +0.5 reward

GRADER EVALUATION:
  Task: basic_ram_reduction
  Final State: RAM=67.0%, Energy=8.0kWh, Steps=3/10
  Grader Score: 0.782 ✓

EPISODE SUMMARY:
  Total Reward: 0.680
  Average Reward/Step: 0.227
  Task Completed: YES
  Grader Score: 0.782
```

---

## Evaluating Model Performance

### Quality Levels (by Grader Score)

| Score Range | Rating | Interpretation |
|------------|--------|----------------|
| 0.9 - 0.999 | ★★★★★ EXCELLENT | Model solved task optimally, excellent resource optimization |
| 0.7 - 0.9 | ★★★★ GOOD | Model completed task efficiently, minor suboptimality |
| 0.5 - 0.7 | ★★★ FAIR | Model made good progress, some inefficiency |
| 0.3 - 0.5 | ★★ POOR | Model struggled, significant suboptimality |
| 0.001 - 0.3 | ★ VERY POOR | Model barely optimized or failed task |

### Expected Performance Baselines

Based on environment design:

**Random Agent:**
- Avg Reward: ~0.05 per step
- Task Completion: ~10% chance
- Grader Score: ~0.2-0.3
- Insight: No structure, just luck

**Heuristic Agent** (simple if-else rules):
- Avg Reward: ~0.1 per step
- Task Completion: ~60% chance (easy tasks only)
- Grader Score: ~0.5-0.6
- Insight: Follows simple logic, decent for easy tasks

**Competent LLM Agent** (Qwen, GPT, etc.):
- Avg Reward: ~0.12-0.15 per step
- Task Completion: ~70-80% (easy-medium tasks)
- Grader Score: ~0.65-0.80
- Insight: Understands environment, makes reasonable decisions

**Expert LLM Agent** (with few-shot examples):
- Avg Reward: ~0.16-0.20 per step
- Task Completion: ~85-95% (even hard tasks)
- Grader Score: ~0.80-0.95
- Insight: Optimized strategy, efficient resource management

---

## Detailed Metrics Collected

The `evaluate_inference.py` script tracks:

```python
{
    "task_name": "basic_ram_reduction",
    "difficulty": 1,
    "total_steps": 5,
    "total_reward": 0.482,          # Sum of all step rewards
    "avg_reward": 0.0964,           # Average per step
    "reward_range": [0.05, 0.12],   # Min to max reward
    "valid_actions": 5,             # Actions that parsed correctly
    "invalid_actions": 0,           # Actions that failed parsing
    "action_validity_rate": 1.0,    # Fraction of valid actions (0.0-1.0)
    "initial_ram": 80.0,
    "final_ram": 50.0,              # Resource improvement
    "initial_energy": 8.0,
    "final_energy": 4.5,            # Resource improvement
    "task_completed": True,         # Did it hit targets?
    "final_task_progress": 1.0,     # 0.0 = no progress, 1.0 = complete
    "grader_score": 0.782           # Task-specific grader evaluation
}
```

---

## What Makes a Good LLM Agent

1. **Prompt Understanding**: Parses the observation correctly
2. **Action Validity**: Produces valid actions in correct format
3. **Resource Awareness**: Tracks RAM/Energy trade-offs
4. **Goal Orientation**: Works toward task targets, not random
5. **Efficiency**: Achieves targets in fewer steps when possible
6. **Adaptability**: Adjusts strategy if initial approach fails

---

## How Task Graders Work

Each task has a **task_N_..._grader()** function that:

1. Takes the final observation
2. Calculates how close you are to targets
3. Considers step efficiency
4. Returns score in [0.001, 0.999]

**Example: Task 1 Grader Logic**

```python
def task_1_basic_ram_reduction_grader(observation):
    # RAM score: 0.0 at 100% usage, 1.0 at or below the 70% target
    # (clamped to [0, 1] as in the formula example)
    ram_score = (100 - observation.ram_usage) / (100 - 70)
    ram_score = max(0.0, min(1.0, ram_score))

    # Energy score: 0.0 at 10 kWh, 1.0 at or below the 7.5 kWh target
    energy_score = (10 - observation.energy_consumption) / (10 - 7.5)
    energy_score = max(0.0, min(1.0, energy_score))

    # Step efficiency: full credit within the 10-step limit,
    # then 0.08 lost for each extra step
    if observation.steps_taken <= 10:
        efficiency = 1.0
    else:
        efficiency = max(0.0, 1.0 - (observation.steps_taken - 10) * 0.08)

    # Combine with weights
    composite = (ram_score * 0.4) + (energy_score * 0.4) + (efficiency * 0.2)

    # Clamp to valid range and return
    return max(0.001, min(0.999, composite))
```

---

## Commands Reference

```powershell
# View available graders
python -c "from he_demo import TASK_GRADERS; print(list(TASK_GRADERS.keys()))"

# Run evaluation on specific task
$env:ENERGY_TASK="basic_ram_reduction"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN="hf_xxx"
python inference.py

# Check grader metadata
python -c "from he_demo import get_grader_metadata; import json; print(json.dumps(get_grader_metadata(), indent=2))"

# Test environment directly
python test_env_direct.py

# Run HTTP endpoint tests
curl http://localhost:8000/graders
curl http://localhost:8000/validate
curl http://localhost:8000/tasks
```

---

## Interpreting Results

**Good Sign:**
- ✅ Grader score > 0.6
- ✅ Action validity rate = 100%
- ✅ Task completed = True
- ✅ Avg reward increasing over steps
- ✅ Resource reduction matches task targets

**Bad Sign:**
- ❌ Grader score < 0.3
- ❌ Many invalid actions
- ❌ Task not completed
- ❌ Random action patterns
- ❌ Resources not improving

---

## Summary

The LLM is evaluated on:

1. **Can it parse observations?** (Action validity)
2. **Can it make good decisions?** (Reward accumulation)
3. **Can it complete tasks?** (Task targets met)
4. **How efficiently?** (Steps taken, resource optimization)
5. **What's the final score?** (Grader evaluation)

A strong LLM agent should exceed baseline performance and consistently achieve grader scores above 0.75.
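
To make the action-validity check and the score ratings concrete, here is a minimal sketch in Python. It is not the project's actual parser: the names `parse_action`, `validity_rate`, and `rating_for_score` are hypothetical, whitespace handling is an assumption, and the rating thresholds treat each lower bound of the Quality Levels table as inclusive, which the table itself leaves open.

```python
# Hypothetical helpers for illustration; they are not part of he_demo.
VALID_ACTIONS = ("reduce_ram", "optimize_energy", "balance_resources", "monitor_system")
FALLBACK = ("monitor_system", 0.5)  # what the guide says invalid actions become


def parse_action(text):
    """Parse 'action_type,intensity' and return (action_type, intensity, was_valid)."""
    parts = [p.strip() for p in text.strip().split(",")]
    if len(parts) != 2:
        return (*FALLBACK, False)
    action_type, raw_intensity = parts
    try:
        intensity = float(raw_intensity)
    except ValueError:
        return (*FALLBACK, False)
    # Reject unknown action types and intensities outside [0.0, 1.0] (also NaN).
    if action_type not in VALID_ACTIONS or not (0.0 <= intensity <= 1.0):
        return (*FALLBACK, False)
    return (action_type, intensity, True)


def validity_rate(raw_actions):
    """Fraction of raw LLM outputs that parse as valid actions."""
    if not raw_actions:
        return 0.0
    valid = sum(1 for raw in raw_actions if parse_action(raw)[2])
    return valid / len(raw_actions)


def rating_for_score(score):
    """Map a grader score to a rating (assumes inclusive lower bounds)."""
    if score >= 0.9:
        return "EXCELLENT"
    if score >= 0.7:
        return "GOOD"
    if score >= 0.5:
        return "FAIR"
    if score >= 0.3:
        return "POOR"
    return "VERY POOR"


if __name__ == "__main__":
    print(parse_action("reduce_ram,0.8"))
    print(parse_action("invalid_action,2.5"))
    print(validity_rate(["reduce_ram,0.8", "invalid_action,2.5"]))
    print(rating_for_score(0.782))
```

Returning the validity flag alongside the fallback action lets an evaluation loop keep stepping the environment while still counting the invalid output against the validity rate.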