V1, without filtered data:
{'easy': 100, 'intermediate': 100, 'hard': 100}
{'easy': 94, 'intermediate': 52, 'hard': 29}
easy: 94.00%, intermediate: 52.00%, hard: 29.00%

V2, with filtered data:
{'easy': 100, 'intermediate': 100, 'hard': 100}
{'easy': 93, 'intermediate': 67, 'hard': 25}
easy: 93.00%, intermediate: 67.00%, hard: 25.00%

V2, without filtered data (new data added):
{'easy': 100, 'intermediate': 100, 'hard': 100}
{'easy': 88, 'intermediate': 71, 'hard': 28}
easy: 88.00%, intermediate: 71.00%, hard: 28.00%
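
Each block above appears to pair a per-difficulty question count with a per-difficulty correct count, followed by the resulting percentages. A minimal sketch of that calculation, assuming the first dictionary holds totals and the second holds correct answers (the function names `accuracy_by_level` and `format_report` are hypothetical, not taken from the inference scripts):

```python
def accuracy_by_level(totals, correct):
    """Return the percentage of correct answers per difficulty level.

    Assumption: `totals` maps level -> number of questions asked and
    `correct` maps level -> number answered correctly.
    """
    return {
        level: 100.0 * correct.get(level, 0) / n
        for level, n in totals.items()
        if n  # skip empty levels to avoid division by zero
    }


def format_report(acc):
    """Render percentages in the same style as the log lines above."""
    return ", ".join(f"{level}: {pct:.2f}%" for level, pct in acc.items())


# Counts copied from the V1 block.
totals = {'easy': 100, 'intermediate': 100, 'hard': 100}
correct = {'easy': 94, 'intermediate': 52, 'hard': 29}

report = format_report(accuracy_by_level(totals, correct))
print(report)  # easy: 94.00%, intermediate: 52.00%, hard: 29.00%
```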


Without context (inferenceV2.py results):

| Config | Easy | Intermediate | Hard | Std Dev | Balanced Ranking |
|--------|------|---------------|------|---------|------------------|
| **temp1.1_qwen3-14B_finetuned.json** | 88% | 64% | 44% | **18.23** | 🥇 Most balanced |
| **temp1.0_qwen3-14B_finetuned.json** | 86% | 66% | 42% | **18.71** | 🥈 |
| **temp0.7_qwen3-14B_finetuned.json** | 92% | 68% | 28% | 26.42 | 🥉 |
| **temp0.5_qwen3-14B_finetuned.json** | 92% | 62% | 30% | 25.50 | 4th |
| **temp0.3_qwen3-14B_finetuned.json** | 94% | 54% | 22% | 30.06 | 5th |
| **temp0.1_qwen3-14B_finetuned.json** | 90% | 62% | 22% | 28.12 | 6th |
| **temp0.3_qwen3-14B_base.json** | 94% | 46% | 8% | 38.14 | 7th |
| **temp1.0_qwen3-14B_base.json** | 96% | 52% | 8% | 39.44 | 8th |
| **temp0.5_qwen3-14B_base.json** | 96% | 48% | 6% | 41.45 | 9th |
| **temp0.1_qwen3-14B_base.json** | 96% | 46% | 6% | 41.76 | 10th |
| **temp0.7_qwen3-14B_base.json** | 94% | 38% | 6% | 43.39 | 11th |
| **temp1.1_qwen3-14B_base.json** | 94.44% | 44.44% | 5.56% | 39.96 | 12th |
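
The Std Dev column measures how evenly a configuration performs across the three difficulty levels: a smaller spread means more balanced accuracy. The source does not state which formula was used, so the sketch below assumes the population standard deviation (`statistics.pstdev`); recomputed values may differ somewhat from the figures in the table. The helper names `spread` and `rank_by_balance` are hypothetical.

```python
from statistics import pstdev


def spread(accuracies):
    """Population standard deviation of the per-level accuracies.

    Assumption: the table's formula is not documented; pstdev is one
    plausible choice, not necessarily the one the author used.
    """
    return pstdev(accuracies)


def rank_by_balance(rows):
    """Return config names ordered from most to least balanced."""
    return sorted(rows, key=lambda name: spread(rows[name]))


# (easy, intermediate, hard) accuracies copied from two table rows.
rows = {
    "temp0.1_qwen3-14B_finetuned.json": (90, 62, 22),
    "temp0.7_qwen3-14B_finetuned.json": (92, 68, 28),
}

ranking = rank_by_balance(rows)
for name in ranking:
    print(f"{name}: {spread(rows[name]):.2f}")
```

A lower value only indicates a flatter profile; it says nothing about overall accuracy, so it is best read alongside the per-level numbers.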


With context (inferenceV3.py results):

| Model/Temp                                | Easy   | Intermediate | Hard   | **Average Accuracy** |
| ----------------------------------------- | ------ | ------------ | ------ | -------------------- |
| **temp1.1_qwen3-14B_finetuned_with_defs** | 74.00% | 70.00%       | 46.00% | **63.33%**           |
| **temp1.0_qwen3-14B_finetuned_with_defs** | 88.00% | 66.00%       | 44.00% | **66.00%**           |
| **temp0.7_qwen3-14B_finetuned_with_defs** | 94.00% | 74.00%       | 32.00% | **66.67%**           |
| **temp0.5_qwen3-14B_finetuned_with_defs** | 86.00% | 76.00%       | 24.00% | **62.00%**           |
| **temp0.3_qwen3-14B_finetuned_with_defs** | 86.00% | 70.00%       | 24.00% | **60.00%**           |
| **temp0.1_qwen3-14B_finetuned_with_defs** | 90.00% | 64.00%       | 28.00% | **60.67%**           |
| **temp1.1_qwen3-14B_base_with_defs**      | 96.00% | 50.00%       | 14.00% | **53.33%**           |
| **temp1.0_qwen3-14B_base_with_defs**      | 96.00% | 58.00%       | 12.00% | **55.33%**           |
| **temp0.7_qwen3-14B_base_with_defs**      | 96.00% | 58.00%       | 10.00% | **54.67%**           |
| **temp0.5_qwen3-14B_base_with_defs**      | 95.56% | 62.22%       | 6.67%  | **54.82%**           |
| **temp0.3_qwen3-14B_base_with_defs**      | 96.00% | 58.00%       | 10.00% | **54.67%**           |
| **temp0.1_qwen3-14B_base_with_defs**      | 96.00% | 58.00%       | 8.00%  | **54.00%**           |
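
The Average Accuracy column matches the unweighted mean of the three per-level accuracies. A short sketch of that arithmetic, using three rows from the table; the names `average_accuracy` and `best_config` are hypothetical, and the equal weighting assumes each difficulty level has the same number of questions:

```python
def average_accuracy(easy, intermediate, hard):
    """Unweighted mean of the three per-level accuracies (in percent)."""
    return (easy + intermediate + hard) / 3


# (easy, intermediate, hard) accuracies copied from the table above.
with_defs = {
    "temp1.1_qwen3-14B_finetuned_with_defs": (74.0, 70.0, 46.0),
    "temp1.0_qwen3-14B_finetuned_with_defs": (88.0, 66.0, 44.0),
    "temp0.7_qwen3-14B_finetuned_with_defs": (94.0, 74.0, 32.0),
}

averages = {name: average_accuracy(*acc) for name, acc in with_defs.items()}
best_config = max(averages, key=averages.get)

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}%")
print("highest average:", best_config)
```

If the level sizes differ, a mean weighted by question count would be the more faithful summary.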

---