	V1, without filtering:
	{'easy': 100, 'intermediate': 100, 'hard': 100}
	{'easy': 94, 'intermediate': 52, 'hard': 29}
	easy: 94.00%, intermediate: 52.00%, hard: 29.00%

	V2, with filtering:
	{'easy': 100, 'intermediate': 100, 'hard': 100}
	{'easy': 93, 'intermediate': 67, 'hard': 25}
	easy: 93.00%, intermediate: 67.00%, hard: 25.00%

	V2, without filtering (new data added):
	{'easy': 100, 'intermediate': 100, 'hard': 100}
	{'easy': 88, 'intermediate': 71, 'hard': 28}
	easy: 88.00%, intermediate: 71.00%, hard: 28.00%
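
The three-line blocks above appear to follow one pattern: a dict of item counts per difficulty level, a dict of correct answers per level, and the resulting percentages. The following is a minimal sketch of that calculation, assuming the first dict holds totals and the second holds correct counts. The function name `accuracy_by_level` is hypothetical and does not come from the original scripts.

```python
def accuracy_by_level(total, correct):
    """Return the percentage of correct answers for each difficulty level.

    Assumes `total` and `correct` share the same keys (an assumption about
    the log format, not something the original scripts confirm).
    """
    return {level: 100.0 * correct[level] / total[level] for level in total}


# Counts copied from the first block of the log above.
total = {'easy': 100, 'intermediate': 100, 'hard': 100}
correct = {'easy': 94, 'intermediate': 52, 'hard': 29}

acc = accuracy_by_level(total, correct)
line = ", ".join(f"{level}: {pct:.2f}%" for level, pct in acc.items())
print(line)  # easy: 94.00%, intermediate: 52.00%, hard: 29.00%
```

With 100 items per level, each percentage equals the raw correct count, which matches the logged lines.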


Without context (inferenceV2.py results):

| Config | Easy | Intermediate | Hard | Std Dev | Balanced Ranking |
|--------|------|--------------|------|---------|------------------|
| temp1.1_qwen3-14B_finetuned.json | 88% | 64% | 44% | 18.23 | 🥇 Most balanced |
| temp1.0_qwen3-14B_finetuned.json | 86% | 66% | 42% | 18.71 | 🥈 |
| temp0.7_qwen3-14B_finetuned.json | 92% | 68% | 28% | 26.42 | 🥉 |
| temp0.5_qwen3-14B_finetuned.json | 92% | 62% | 30% | 25.50 | 4th |
| temp0.3_qwen3-14B_finetuned.json | 94% | 54% | 22% | 30.06 | 5th |
| temp0.1_qwen3-14B_finetuned.json | 90% | 62% | 22% | 28.12 | 6th |
| temp0.3_qwen3-14B_base.json | 94% | 46% | 8% | 38.14 | 7th |
| temp1.0_qwen3-14B_base.json | 96% | 52% | 8% | 39.44 | 8th |
| temp0.5_qwen3-14B_base.json | 96% | 48% | 6% | 41.45 | 9th |
| temp0.1_qwen3-14B_base.json | 96% | 46% | 6% | 41.76 | 10th |
| temp0.7_qwen3-14B_base.json | 94% | 38% | 6% | 43.39 | 11th |
| temp1.1_qwen3-14B_base.json | 94.44% | 44.44% | 5.56% | 39.96 | 12th |


With context (inferenceV3.py results):

| Model/Temp | Easy | Intermediate | Hard | Average Accuracy |
|------------|------|--------------|------|------------------|
| temp1.1_qwen3-14B_finetuned_with_defs | 74.00% | 70.00% | 46.00% | 63.33% |
| temp1.0_qwen3-14B_finetuned_with_defs | 88.00% | 66.00% | 44.00% | 66.00% |
| temp0.7_qwen3-14B_finetuned_with_defs | 94.00% | 74.00% | 32.00% | 66.67% |
| temp0.5_qwen3-14B_finetuned_with_defs | 86.00% | 76.00% | 24.00% | 62.00% |
| temp0.3_qwen3-14B_finetuned_with_defs | 86.00% | 70.00% | 24.00% | 60.00% |
| temp0.1_qwen3-14B_finetuned_with_defs | 90.00% | 64.00% | 28.00% | 60.67% |
| temp1.1_qwen3-14B_base_with_defs | 96.00% | 50.00% | 14.00% | 53.33% |
| temp1.0_qwen3-14B_base_with_defs | 96.00% | 58.00% | 12.00% | 55.33% |
| temp0.7_qwen3-14B_base_with_defs | 96.00% | 58.00% | 10.00% | 54.67% |
| temp0.5_qwen3-14B_base_with_defs | 95.56% | 62.22% | 6.67% | 54.82% |
| temp0.3_qwen3-14B_base_with_defs | 96.00% | 58.00% | 10.00% | 54.67% |
| temp0.1_qwen3-14B_base_with_defs | 96.00% | 58.00% | 8.00% | 54.00% |
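
The Average Accuracy column in the with-context table is consistent with an unweighted mean of the three per-level accuracies, rounded to two decimals. The sketch below reproduces two rows under that assumption. The helper name `average_accuracy` is hypothetical, and `inferenceV3.py` may compute the figure differently.

```python
def average_accuracy(easy, intermediate, hard):
    """Unweighted (macro) mean of the three per-level accuracies, in percent."""
    return round((easy + intermediate + hard) / 3, 2)


# Per-level accuracies copied from two rows of the with-context table.
rows = {
    "temp0.7_qwen3-14B_finetuned_with_defs": (94.00, 74.00, 32.00),
    "temp0.1_qwen3-14B_base_with_defs": (96.00, 58.00, 8.00),
}

averages = {name: average_accuracy(*scores) for name, scores in rows.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}%")
# temp0.7_qwen3-14B_finetuned_with_defs: 66.67%
# temp0.1_qwen3-14B_base_with_defs: 54.00%
```

Because the mean is unweighted, each difficulty level counts equally regardless of how many items it contains.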

	---
