V1 without filtered:
{'easy': 100, 'intermediate': 100, 'hard': 100}
{'easy': 94, 'intermediate': 52, 'hard': 29}
easy: 94.00%, intermediate: 52.00%, hard: 29.00%

V2 with filtered:
{'easy': 100, 'intermediate': 100, 'hard': 100}
{'easy': 93, 'intermediate': 67, 'hard': 25}
easy: 93.00%, intermediate: 67.00%, hard: 25.00%

V2 without filtered (add new data):
{'easy': 100, 'intermediate': 100, 'hard': 100}
{'easy': 88, 'intermediate': 71, 'hard': 28}
easy: 88.00%, intermediate: 71.00%, hard: 28.00%

Without context - inferenceV2.py results (without context):

| Config | Easy | Intermediate | Hard | Std Dev | Balanced Ranking |
|--------|------|---------------|------|---------|------------------|
| **temp1.1_qwen3-14B_finetuned.json** | 88% | 64% | 44% | **18.23** | 🥇 Most balanced |
| **temp1.0_qwen3-14B_finetuned.json** | 86% | 66% | 42% | **18.71** | 🥈 |
| **temp0.7_qwen3-14B_finetuned.json** | 92% | 68% | 28% | 26.42 | 🥉 |
| **temp0.5_qwen3-14B_finetuned.json** | 92% | 62% | 30% | 25.50 | 4th |
| **temp0.3_qwen3-14B_finetuned.json** | 94% | 54% | 22% | 30.06 | 5th |
| **temp0.1_qwen3-14B_finetuned.json** | 90% | 62% | 22% | 28.12 | 6th |
| **temp0.3_qwen3-14B_base.json** | 94% | 46% | 8% | 38.14 | 7th |
| **temp1.0_qwen3-14B_base.json** | 96% | 52% | 8% | 39.44 | 8th |
| **temp0.5_qwen3-14B_base.json** | 96% | 48% | 6% | 41.45 | 9th |
| **temp0.1_qwen3-14B_base.json** | 96% | 46% | 6% | 41.76 | 10th |
| **temp0.7_qwen3-14B_base.json** | 94% | 38% | 6% | 43.39 | 11th |
| **temp1.1_qwen3-14B_base.json** | 94.44% | 44.44% | 5.56% | 39.96 | 12th |

With context - inferenceV3.py results (with context):

| Model/Temp | Easy | Intermediate | Hard | **Average Accuracy** |
| ----------------------------------------- | ------ | ------------ | ------ | -------------------- |
| **temp1.1_qwen3-14B_finetuned_with_defs** | 74.00% | 70.00% | 46.00% | **63.33%** |
| **temp1.0_qwen3-14B_finetuned_with_defs** | 88.00% | 66.00% | 44.00% | **66.00%** |
| **temp0.7_qwen3-14B_finetuned_with_defs** | 94.00% | 74.00% | 32.00% | **66.67%** |
| **temp0.5_qwen3-14B_finetuned_with_defs** | 86.00% | 76.00% | 24.00% | **62.00%** |
| **temp0.3_qwen3-14B_finetuned_with_defs** | 86.00% | 70.00% | 24.00% | **60.00%** |
| **temp0.1_qwen3-14B_finetuned_with_defs** | 90.00% | 64.00% | 28.00% | **60.67%** |
| **temp1.1_qwen3-14B_base_with_defs** | 96.00% | 50.00% | 14.00% | **53.33%** |
| **temp1.0_qwen3-14B_base_with_defs** | 96.00% | 58.00% | 12.00% | **55.33%** |
| **temp0.7_qwen3-14B_base_with_defs** | 96.00% | 58.00% | 10.00% | **54.67%** |
| **temp0.5_qwen3-14B_base_with_defs** | 95.56% | 62.22% | 6.67% | **54.82%** |
| **temp0.3_qwen3-14B_base_with_defs** | 96.00% | 58.00% | 10.00% | **54.67%** |
| **temp0.1_qwen3-14B_base_with_defs** | 96.00% | 58.00% | 8.00% | **54.00%** |

---
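The per-difficulty accuracy lines and the Average Accuracy column above are plain arithmetic over the totals and correct counts. A minimal sketch of that calculation follows. The function names `accuracy_by_level` and `summarize` are hypothetical and do not come from inferenceV2.py or inferenceV3.py. The notes do not say which standard deviation variant produced the Std Dev column, so the population form used here is an assumption and may not reproduce the listed values exactly.

```python
from statistics import mean, pstdev


def accuracy_by_level(totals, correct):
    """Return percent correct for each difficulty level."""
    return {level: 100.0 * correct[level] / totals[level] for level in totals}


def summarize(acc):
    """Return the unweighted mean across levels and the spread between them."""
    values = list(acc.values())
    return {
        "average": round(mean(values), 2),
        # Assumption: population standard deviation; the source does not
        # state which variant the Std Dev column uses.
        "spread": round(pstdev(values), 2),
    }


# Counts from the "V1 without filtered" run.
totals = {"easy": 100, "intermediate": 100, "hard": 100}
correct = {"easy": 94, "intermediate": 52, "hard": 29}

acc = accuracy_by_level(totals, correct)
print(", ".join(f"{level}: {value:.2f}%" for level, value in acc.items()))

# Percentages from the temp0.7_qwen3-14B_finetuned_with_defs row.
row = {"easy": 94.0, "intermediate": 74.0, "hard": 32.0}
print(summarize(row))
```

A lower spread means accuracy is more even across easy, intermediate, and hard, which is the idea behind the Balanced Ranking column.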