
V1 (without filtering): total {'easy': 100, 'intermediate': 100, 'hard': 100}, correct {'easy': 94, 'intermediate': 52, 'hard': 29}, accuracy easy: 94.00%, intermediate: 52.00%, hard: 29.00%

V2 (with filtering): total {'easy': 100, 'intermediate': 100, 'hard': 100}, correct {'easy': 93, 'intermediate': 67, 'hard': 25}, accuracy easy: 93.00%, intermediate: 67.00%, hard: 25.00%

V2 (without filtering, new data added): total {'easy': 100, 'intermediate': 100, 'hard': 100}, correct {'easy': 88, 'intermediate': 71, 'hard': 28}, accuracy easy: 88.00%, intermediate: 71.00%, hard: 28.00%
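
The per-level percentages in the runs above follow from dividing the second dictionary by the first. A minimal sketch of that arithmetic, assuming the first dictionary holds the number of examples per level and the second the number answered correctly; the function name `accuracy_by_level` is hypothetical and not taken from the repository:

```python
def accuracy_by_level(total: dict, correct: dict) -> dict:
    """Return accuracy in percent for each difficulty level."""
    return {level: 100.0 * correct[level] / total[level] for level in total}


# Counts copied from the V1 run; the meaning of each dict is an assumption.
total = {'easy': 100, 'intermediate': 100, 'hard': 100}
correct = {'easy': 94, 'intermediate': 52, 'hard': 29}

acc = accuracy_by_level(total, correct)
print(", ".join(f"{level}: {value:.2f}%" for level, value in acc.items()))
# prints: easy: 94.00%, intermediate: 52.00%, hard: 29.00%
```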

Without context (inferenceV2.py results):

| Config | Easy | Intermediate | Hard | Std Dev | Balanced Ranking |
|---|---|---|---|---|---|
| temp1.1_qwen3-14B_finetuned.json | 88% | 64% | 44% | 18.23 | 🥇 Most balanced |
| temp1.0_qwen3-14B_finetuned.json | 86% | 66% | 42% | 18.71 | 🥈 |
| temp0.7_qwen3-14B_finetuned.json | 92% | 68% | 28% | 26.42 | 🥉 |
| temp0.5_qwen3-14B_finetuned.json | 92% | 62% | 30% | 25.50 | 4th |
| temp0.3_qwen3-14B_finetuned.json | 94% | 54% | 22% | 30.06 | 5th |
| temp0.1_qwen3-14B_finetuned.json | 90% | 62% | 22% | 28.12 | 6th |
| temp0.3_qwen3-14B_base.json | 94% | 46% | 8% | 38.14 | 7th |
| temp1.0_qwen3-14B_base.json | 96% | 52% | 8% | 39.44 | 8th |
| temp0.5_qwen3-14B_base.json | 96% | 48% | 6% | 41.45 | 9th |
| temp0.1_qwen3-14B_base.json | 96% | 46% | 6% | 41.76 | 10th |
| temp0.7_qwen3-14B_base.json | 94% | 38% | 6% | 43.39 | 11th |
| temp1.1_qwen3-14B_base.json | 94.44% | 44.44% | 5.56% | 39.96 | 12th |
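
The config names above appear to encode a temperature, a model name, and a variant. A minimal sketch for splitting them apart, assuming every name follows the pattern `temp<T>_<model>_<variant>` with an optional `.json` suffix; the `Config` type and `parse_config` helper are hypothetical and not part of inferenceV2.py:

```python
import re
from typing import NamedTuple


class Config(NamedTuple):
    temperature: float
    model: str
    variant: str


# Assumed naming pattern: temp<T>_<model>_<variant>[.json]
_PATTERN = re.compile(
    r"^temp(?P<temp>\d+(?:\.\d+)?)_(?P<model>[^_]+)_(?P<variant>.+?)(?:\.json)?$"
)


def parse_config(name: str) -> Config:
    """Split a result file name into temperature, model, and variant."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized config name: {name!r}")
    return Config(float(match["temp"]), match["model"], match["variant"])


cfg = parse_config("temp1.1_qwen3-14B_finetuned.json")
print(cfg.temperature, cfg.model, cfg.variant)
# prints: 1.1 qwen3-14B finetuned
```

Parsed this way, rows can be grouped by variant or sorted by temperature without relying on string order.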

With context (inferenceV3.py results):

| Model/Temp | Easy | Intermediate | Hard | Average Accuracy |
|---|---|---|---|---|
| temp1.1_qwen3-14B_finetuned_with_defs | 74.00% | 70.00% | 46.00% | 63.33% |
| temp1.0_qwen3-14B_finetuned_with_defs | 88.00% | 66.00% | 44.00% | 66.00% |
| temp0.7_qwen3-14B_finetuned_with_defs | 94.00% | 74.00% | 32.00% | 66.67% |
| temp0.5_qwen3-14B_finetuned_with_defs | 86.00% | 76.00% | 24.00% | 62.00% |
| temp0.3_qwen3-14B_finetuned_with_defs | 86.00% | 70.00% | 24.00% | 60.00% |
| temp0.1_qwen3-14B_finetuned_with_defs | 90.00% | 64.00% | 28.00% | 60.67% |
| temp1.1_qwen3-14B_base_with_defs | 96.00% | 50.00% | 14.00% | 53.33% |
| temp1.0_qwen3-14B_base_with_defs | 96.00% | 58.00% | 12.00% | 55.33% |
| temp0.7_qwen3-14B_base_with_defs | 96.00% | 58.00% | 10.00% | 54.67% |
| temp0.5_qwen3-14B_base_with_defs | 95.56% | 62.22% | 6.67% | 54.82% |
| temp0.3_qwen3-14B_base_with_defs | 96.00% | 58.00% | 10.00% | 54.67% |
| temp0.1_qwen3-14B_base_with_defs | 96.00% | 58.00% | 8.00% | 54.00% |
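
The Average Accuracy values above are consistent with an unweighted mean of the three per-level accuracies. A minimal sketch of that calculation, assuming equal weight per level; the function name `average_accuracy` is hypothetical and not taken from inferenceV3.py:

```python
def average_accuracy(easy: float, intermediate: float, hard: float) -> float:
    """Unweighted mean of the three per-level accuracies, in percent."""
    return round((easy + intermediate + hard) / 3, 2)


# Rows copied from the with-context results.
print(average_accuracy(74.00, 70.00, 46.00))  # temp1.1 finetuned, prints 63.33
print(average_accuracy(94.00, 74.00, 32.00))  # temp0.7 finetuned, prints 66.67
print(average_accuracy(96.00, 58.00, 8.00))   # temp0.1 base, prints 54.0
```

Because each level counts equally, a gain on the hard set moves the average as much as the same gain on the easy set.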