Accuracy by difficulty level (correct out of 100 per level):

| Run | Easy | Intermediate | Hard |
|-----|------|--------------|------|
| V1, unfiltered | 94 (94.00%) | 52 (52.00%) | 29 (29.00%) |
| V2, filtered | 93 (93.00%) | 67 (67.00%) | 25 (25.00%) |
| V2, unfiltered (new data added) | 88 (88.00%) | 71 (71.00%) | 28 (28.00%) |

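
The percentages above follow directly from the per-level correct and total counts. A minimal sketch of that calculation, assuming the counts are held in plain dicts keyed by difficulty level; `accuracy_by_level` and `format_accuracy` are hypothetical names used for illustration and are not taken from the inference scripts.

```python
def accuracy_by_level(correct, total):
    """Return percentage accuracy for each difficulty level."""
    return {
        level: 100.0 * correct[level] / total[level]
        for level in total
        if total[level] > 0  # skip empty levels to avoid division by zero
    }


def format_accuracy(acc):
    """Render accuracies as 'level: xx.xx%' pairs on one line."""
    return ", ".join(f"{level}: {pct:.2f}%" for level, pct in acc.items())


# Counts copied from the V1 run reported above.
total = {'easy': 100, 'intermediate': 100, 'hard': 100}
correct = {'easy': 94, 'intermediate': 52, 'hard': 29}

print(format_accuracy(accuracy_by_level(correct, total)))
# prints: easy: 94.00%, intermediate: 52.00%, hard: 29.00%
```
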
Without context (inferenceV2.py results):

| Config | Easy | Intermediate | Hard | Std Dev | Balanced Ranking |
|--------|------|---------------|------|---------|------------------|
| **temp1.1_qwen3-14B_finetuned.json** | 88% | 64% | 44% | **18.23** | 🥇 Most balanced |
| **temp1.0_qwen3-14B_finetuned.json** | 86% | 66% | 42% | **18.71** | 🥈 |
| **temp0.7_qwen3-14B_finetuned.json** | 92% | 68% | 28% | 26.42 | 🥉 |
| **temp0.5_qwen3-14B_finetuned.json** | 92% | 62% | 30% | 25.50 | 4th |
| **temp0.3_qwen3-14B_finetuned.json** | 94% | 54% | 22% | 30.06 | 5th |
| **temp0.1_qwen3-14B_finetuned.json** | 90% | 62% | 22% | 28.12 | 6th |
| **temp0.3_qwen3-14B_base.json** | 94% | 46% | 8% | 38.14 | 7th |
| **temp1.0_qwen3-14B_base.json** | 96% | 52% | 8% | 39.44 | 8th |
| **temp0.5_qwen3-14B_base.json** | 96% | 48% | 6% | 41.45 | 9th |
| **temp0.1_qwen3-14B_base.json** | 96% | 46% | 6% | 41.76 | 10th |
| **temp0.7_qwen3-14B_base.json** | 94% | 38% | 6% | 43.39 | 11th |
| **temp1.1_qwen3-14B_base.json** | 94.44% | 44.44% | 5.56% | 39.96 | 12th |
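
The Std Dev and Balanced Ranking columns describe how evenly a configuration performs across the three difficulty levels: a smaller spread means more balanced accuracy. Below is a minimal sketch of one way to compute such a spread and rank by it. It assumes the population standard deviation (`statistics.pstdev`); the source does not state which estimator produced the table's values, so recomputed numbers may differ from those listed. `balance_ranking` is a hypothetical helper name.

```python
from statistics import pstdev


def balance_ranking(results):
    """Return config names ordered from most to least balanced.

    `results` maps a config name to its (easy, intermediate, hard)
    accuracies in percent. A lower spread across levels ranks first.
    """
    return sorted(results, key=lambda name: pstdev(results[name]))


# Two rows copied from the table above, for illustration only.
results = {
    "temp0.7_qwen3-14B_base.json": (94, 38, 6),
    "temp0.7_qwen3-14B_finetuned.json": (92, 68, 28),
}

for rank, name in enumerate(balance_ranking(results), start=1):
    print(rank, name, round(pstdev(results[name]), 2))
```

Ranking by spread alone ignores overall accuracy, so a configuration that is uniformly poor would still rank as balanced. It is worth reading this column alongside the per-level numbers.
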

With context (inferenceV3.py results):

| Model/Temp | Easy | Intermediate | Hard | **Average Accuracy** |
| ----------------------------------------- | ------ | ------------ | ------ | -------------------- |
| **temp1.1_qwen3-14B_finetuned_with_defs** | 74.00% | 70.00% | 46.00% | **63.33%** |
| **temp1.0_qwen3-14B_finetuned_with_defs** | 88.00% | 66.00% | 44.00% | **66.00%** |
| **temp0.7_qwen3-14B_finetuned_with_defs** | 94.00% | 74.00% | 32.00% | **66.67%** |
| **temp0.5_qwen3-14B_finetuned_with_defs** | 86.00% | 76.00% | 24.00% | **62.00%** |
| **temp0.3_qwen3-14B_finetuned_with_defs** | 86.00% | 70.00% | 24.00% | **60.00%** |
| **temp0.1_qwen3-14B_finetuned_with_defs** | 90.00% | 64.00% | 28.00% | **60.67%** |
| **temp1.1_qwen3-14B_base_with_defs** | 96.00% | 50.00% | 14.00% | **53.33%** |
| **temp1.0_qwen3-14B_base_with_defs** | 96.00% | 58.00% | 12.00% | **55.33%** |
| **temp0.7_qwen3-14B_base_with_defs** | 96.00% | 58.00% | 10.00% | **54.67%** |
| **temp0.5_qwen3-14B_base_with_defs** | 95.56% | 62.22% | 6.67% | **54.82%** |
| **temp0.3_qwen3-14B_base_with_defs** | 96.00% | 58.00% | 10.00% | **54.67%** |
| **temp0.1_qwen3-14B_base_with_defs** | 96.00% | 58.00% | 8.00% | **54.00%** |
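
The Average Accuracy column is consistent with an unweighted mean of the three per-level accuracies, for example (74 + 70 + 46) / 3 ≈ 63.33. A short sketch under that assumption; `macro_average` is a hypothetical helper name, and the rows are copied from the table above.

```python
def macro_average(easy, intermediate, hard):
    """Unweighted mean of the three per-level accuracies (in percent)."""
    return (easy + intermediate + hard) / 3


# A few rows from the with-context table: (easy, intermediate, hard).
rows = {
    "temp1.1_qwen3-14B_finetuned_with_defs": (74.00, 70.00, 46.00),
    "temp0.7_qwen3-14B_finetuned_with_defs": (94.00, 74.00, 32.00),
    "temp0.5_qwen3-14B_base_with_defs": (95.56, 62.22, 6.67),
}

for name, levels in rows.items():
    print(f"{name}: {macro_average(*levels):.2f}%")

# Pick the configuration with the highest average among these rows.
best = max(rows, key=lambda name: macro_average(*rows[name]))
print("best:", best)
```
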
---