himanshunakrani9 committed 13 days ago

Commit

070e175

verified ·

1 Parent(s): 9e34e99

Upload folder using huggingface_hub


This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +4 -0
eval_results/arc_challenge/outputs__grpo-full__final/results_2026-05-26T11-44-23.900648.json +141 -0
eval_results/arc_challenge/outputs__grpo-full__final/samples_arc_challenge_2026-05-26T11-44-23.900648.jsonl +3 -0
eval_results/arc_easy/outputs__grpo-full__final/results_2026-05-26T11-37-35.592842.json +141 -0
eval_results/arc_easy/outputs__grpo-full__final/samples_arc_easy_2026-05-26T11-37-35.592842.jsonl +0 -0
eval_results/benchmark_summary.md +8 -0
eval_results/combined_results.json +6 -0
eval_results/custom_eval.md +159 -0
eval_results/gsm8k/outputs__grpo-full__final/results_2026-05-26T09-33-14.995899.json +175 -0
eval_results/gsm8k/outputs__grpo-full__final/samples_gsm8k_2026-05-26T09-33-14.995899.jsonl +3 -0
eval_results/hellaswag/outputs__grpo-full__final/results_2026-05-26T12-29-19.954733.json +139 -0
eval_results/hellaswag/outputs__grpo-full__final/samples_hellaswag_2026-05-26T12-29-19.954733.jsonl +3 -0
eval_results/minerva_math_algebra/outputs__grpo-full__final/results_2026-05-26T10-07-08.234107.json +145 -0
eval_results/minerva_math_algebra/outputs__grpo-full__final/samples_minerva_math_algebra_2026-05-26T10-07-08.234107.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/results_2026-05-26T12-51-35.680076.json +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_abstract_algebra_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_anatomy_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_astronomy_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_business_ethics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_clinical_knowledge_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_biology_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_chemistry_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_computer_science_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_mathematics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_medicine_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_physics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_computer_security_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_conceptual_physics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_econometrics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_electrical_engineering_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_elementary_mathematics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_formal_logic_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_global_facts_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_biology_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_chemistry_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_computer_science_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_european_history_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_geography_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_government_and_politics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_macroeconomics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_mathematics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_microeconomics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_physics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_psychology_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_statistics_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_us_history_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_world_history_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_human_aging_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_human_sexuality_2026-05-26T12-51-35.680076.jsonl +0 -0
eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_international_law_2026-05-26T12-51-35.680076.jsonl +0 -0

.gitattributes CHANGED

@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+eval_results/arc_challenge/outputs__grpo-full__final/samples_arc_challenge_2026-05-26T11-44-23.900648.jsonl filter=lfs diff=lfs merge=lfs -text
+eval_results/gsm8k/outputs__grpo-full__final/samples_gsm8k_2026-05-26T09-33-14.995899.jsonl filter=lfs diff=lfs merge=lfs -text
+eval_results/hellaswag/outputs__grpo-full__final/samples_hellaswag_2026-05-26T12-29-19.954733.jsonl filter=lfs diff=lfs merge=lfs -text
+eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_professional_law_2026-05-26T12-51-35.680076.jsonl filter=lfs diff=lfs merge=lfs -text

eval_results/arc_challenge/outputs__grpo-full__final/results_2026-05-26T11-44-23.900648.json ADDED

	@@ -0,0 +1,141 @@

+{
+  "results": {
+    "arc_challenge": {
+      "name": "arc_challenge",
+      "alias": "arc_challenge",
+      "sample_len": 1172,
+      "acc,none": 0.19197952218430034,
+      "acc_stderr,none": 0.01150959890659822,
+      "acc_norm,none": 0.22781569965870307,
+      "acc_norm_stderr,none": 0.012256708602326964
+    }
+  },
+  "group_subtasks": {},
+  "configs": {
+    "arc_challenge": {
+      "task": "arc_challenge",
+      "dataset_path": "allenai/ai2_arc",
+      "dataset_name": "ARC-Challenge",
+      "training_split": "train",
+      "validation_split": "validation",
+      "test_split": "test",
+      "doc_to_text": "Question: {{question}}\nAnswer:",
+      "doc_to_target": "{{choices.label.index(answerKey)}}",
+      "unsafe_code": false,
+      "doc_to_choice": "{{choices.text}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "fewshot_config": {
+        "sampler": "default",
+        "split": null,
+        "process_docs": null,
+        "fewshot_indices": null,
+        "samples": null,
+        "doc_to_text": "Question: {{question}}\nAnswer:",
+        "doc_to_choice": "{{choices.text}}",
+        "doc_to_target": "{{choices.label.index(answerKey)}}",
+        "gen_prefix": null,
+        "fewshot_delimiter": "\n\n",
+        "target_delimiter": " "
+      },
+      "num_fewshot": 25,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        },
+        {
+          "metric": "acc_norm",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "Question: {{question}}\nAnswer:",
+      "metadata": {
+        "version": 1.0,
+        "pretrained": "outputs/grpo-full/final",
+        "dtype": "bfloat16",
+        "config_source": "/home/himanshu/TinyMathReason-1B/venv/lib/python3.12/site-packages/lm_eval/tasks/arc/arc_challenge.yaml"
+      }
+    }
+  },
+  "versions": {
+    "arc_challenge": 1.0
+  },
+  "n-shot": {
+    "arc_challenge": 25
+  },
+  "higher_is_better": {
+    "arc_challenge": {
+      "acc": true,
+      "acc_norm": true
+    }
+  },
+  "n-samples": {
+    "arc_challenge": {
+      "original": 1172,
+      "effective": 1172
+    }
+  },
+  "config": {
+    "model": "hf",
+    "model_args": {
+      "pretrained": "outputs/grpo-full/final",
+      "dtype": "bfloat16"
+    },
+    "model_num_parameters": 1123125248,
+    "model_dtype": "torch.bfloat16",
+    "model_revision": "main",
+    "model_sha": "",
+    "batch_size": "auto",
+    "batch_sizes": [
+      64
+    ],
+    "device": "cuda",
+    "use_cache": null,
+    "limit": null,
+    "bootstrap_iters": 100000,
+    "gen_kwargs": {},
+    "random_seed": 0,
+    "numpy_seed": 1234,
+    "torch_seed": 1234,
+    "fewshot_seed": 1234
+  },
+  "git_hash": "03fcf23",
+  "date": 1779795467.4101148,
+  "pretty_env_info": "PyTorch version: 2.12.0+cu130\nIs debug build: False\nCUDA used to build PyTorch: 13.0\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 24.04.4 LTS (x86_64)\nGCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0\nClang version: Could not collect\nCMake version: Could not collect\nLibc version: glibc-2.39\n\nPython version: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)\nPython platform: Linux-6.17.0-1016-gcp-x86_64-with-glibc2.39\nIs CUDA available: True\nCUDA runtime version: Could not collect\nCUDA_MODULE_LOADING set to: \nGPU models and configuration: GPU 0: NVIDIA L4\nNvidia driver version: 580.126.20\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_adv.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_cnn.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_engines_precompiled.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_engines_runtime_compiled.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_graph.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_heuristic.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_ops.so.9.13.0\nIs XPU available: False\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\nCaching allocator config: N/A\n\nCPU:\nArchitecture:                            x86_64\nCPU op-mode(s):                          32-bit, 64-bit\nAddress sizes:                           46 bits physical, 48 bits virtual\nByte Order:                              Little Endian\nCPU(s):                                  16\nOn-line CPU(s) list:                     0-15\nVendor ID:                               GenuineIntel\nModel name:                              Intel(R) Xeon(R) CPU @ 2.20GHz\nCPU family:                              6\nModel:                                   85\nThread(s) per core:                      2\nCore(s) per socket:                      8\nSocket(s):                               1\nStepping:                                7\nBogoMIPS:                                4400.43\nFlags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities\nHypervisor vendor:                       KVM\nVirtualization type:                     full\nL1d cache:                               256 KiB (8 instances)\nL1i cache:                               256 KiB (8 instances)\nL2 cache:                                8 MiB (8 instances)\nL3 cache:                                38.5 MiB (1 instance)\nNUMA node(s):                            1\nNUMA node0 CPU(s):                       0-15\nVulnerability Gather data sampling:      Not affected\nVulnerability Ghostwrite:                Not affected\nVulnerability Indirect target selection: Mitigation; Aligned branch/return thunks\nVulnerability Itlb multihit:             Not affected\nVulnerability L1tf:                      Not affected\nVulnerability Mds:                       Not affected\nVulnerability Meltdown:                  Not affected\nVulnerability Mmio stale data:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Old microcode:             Not affected\nVulnerability Reg file data sampling:    Not affected\nVulnerability Retbleed:                  Mitigation; Enhanced IBRS\nVulnerability Spec rstack overflow:      Not affected\nVulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop\nVulnerability Srbds:                     Not affected\nVulnerability Tsa:                       Not affected\nVulnerability Tsx async abort:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Vmscape:                   Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==2.4.6\n[pip3] nvidia-cublas==13.1.1.3\n[pip3] nvidia-cuda-cupti==13.0.85\n[pip3] nvidia-cuda-nvrtc==13.0.88\n[pip3] nvidia-cuda-runtime==13.0.96\n[pip3] nvidia-cudnn-cu13==9.20.0.48\n[pip3] nvidia-cufft==12.0.0.61\n[pip3] nvidia-curand==10.4.0.35\n[pip3] nvidia-cusolver==12.0.4.66\n[pip3] nvidia-cusparse==12.6.3.3\n[pip3] nvidia-cusparselt-cu13==0.8.1\n[pip3] nvidia-nccl-cu13==2.29.7\n[pip3] nvidia-nvjitlink==13.0.88\n[pip3] nvidia-nvtx==13.0.85\n[pip3] torch==2.12.0\n[pip3] triton==3.7.0\n[conda] Could not collect",
+  "transformers_version": "5.9.0",
+  "lm_eval_version": "0.4.12",
+  "upper_git_hash": null,
+  "tokenizer_pad_token": [
+    "<|pad|>",
+    "3"
+  ],
+  "tokenizer_eos_token": [
+    "<|eos|>",
+    "1"
+  ],
+  "tokenizer_bos_token": [
+    "<|bos|>",
+    "0"
+  ],
+  "eot_token_id": 1,
+  "max_length": 4096,
+  "task_hashes": {
+    "arc_challenge": "55e883475b5650b20d8d9dc1e9cdf59ef645a257fcd74bf43a9dbb5c632c529c"
+  },
+  "model_source": "hf",
+  "model_name": "outputs/grpo-full/final",
+  "model_name_sanitized": "outputs__grpo-full__final",
+  "system_instruction": null,
+  "system_instruction_sha": null,
+  "fewshot_as_multiturn": null,
+  "chat_template": null,
+  "chat_template_sha": null,
+  "total_evaluation_time_seconds": "405.71644051300245"
+}
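
The `acc_stderr,none` value reported for arc_challenge above is consistent with the standard error of a mean of 0/1 outcomes computed with the sample (n - 1) variance. That reading is inferred from the numbers in this file, not from the evaluation harness source, so treat the sketch below as an assumption; `mean_stderr` is a hypothetical helper name.

```python
import math


def mean_stderr(p: float, n: int) -> float:
    """Standard error of the mean of n binary outcomes whose mean is p.

    Assumption: the sample variance uses the n - 1 denominator, which
    gives sqrt(p * (1 - p) / (n - 1)).
    """
    return math.sqrt(p * (1.0 - p) / (n - 1))


# Values copied from the arc_challenge results block above.
acc = 0.19197952218430034
n_samples = 1172
reported_stderr = 0.01150959890659822

estimated = mean_stderr(acc, n_samples)
n_correct = round(acc * n_samples)  # number of correctly answered items

print(f"estimated stderr: {estimated:.6f}")
print(f"reported stderr:  {reported_stderr:.6f}")
print(f"correct answers:  {n_correct} of {n_samples}")
```

The same check can be repeated for `acc_norm,none` by swapping in its value.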

eval_results/arc_challenge/outputs__grpo-full__final/samples_arc_challenge_2026-05-26T11-44-23.900648.jsonl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dd9bfe557b60e908f99194a2f751c31995efbca3ef3d37873954b352b3cceb07
+size 23316200
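
The three lines above are a Git LFS pointer, not the sample data itself: the repository stores only the spec version, the SHA-256 object id, and the byte size. A minimal sketch of reading such a pointer, assuming the simple `key value` line layout shown here (`parse_lfs_pointer` is a hypothetical helper, not part of any library):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its fields.

    Assumes each non-empty line is "<key> <value>", as in the pointer above.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:dd9bfe557b60e908f99194a2f751c31995efbca3ef3d37873954b352b3cceb07
size 23316200
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"], "bytes")
```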

eval_results/arc_easy/outputs__grpo-full__final/results_2026-05-26T11-37-35.592842.json ADDED

	@@ -0,0 +1,141 @@

+{
+  "results": {
+    "arc_easy": {
+      "name": "arc_easy",
+      "alias": "arc_easy",
+      "sample_len": 2376,
+      "acc,none": 0.2760942760942761,
+      "acc_stderr,none": 0.00917355987383544,
+      "acc_norm,none": 0.2878787878787879,
+      "acc_norm_stderr,none": 0.009290733161670239
+    }
+  },
+  "group_subtasks": {},
+  "configs": {
+    "arc_easy": {
+      "task": "arc_easy",
+      "dataset_path": "allenai/ai2_arc",
+      "dataset_name": "ARC-Easy",
+      "training_split": "train",
+      "validation_split": "validation",
+      "test_split": "test",
+      "doc_to_text": "Question: {{question}}\nAnswer:",
+      "doc_to_target": "{{choices.label.index(answerKey)}}",
+      "unsafe_code": false,
+      "doc_to_choice": "{{choices.text}}",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "fewshot_config": {
+        "sampler": "default",
+        "split": null,
+        "process_docs": null,
+        "fewshot_indices": null,
+        "samples": null,
+        "doc_to_text": "Question: {{question}}\nAnswer:",
+        "doc_to_choice": "{{choices.text}}",
+        "doc_to_target": "{{choices.label.index(answerKey)}}",
+        "gen_prefix": null,
+        "fewshot_delimiter": "\n\n",
+        "target_delimiter": " "
+      },
+      "num_fewshot": 0,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        },
+        {
+          "metric": "acc_norm",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": true,
+      "doc_to_decontamination_query": "Question: {{question}}\nAnswer:",
+      "metadata": {
+        "version": 1.0,
+        "pretrained": "outputs/grpo-full/final",
+        "dtype": "bfloat16",
+        "config_source": "/home/himanshu/TinyMathReason-1B/venv/lib/python3.12/site-packages/lm_eval/tasks/arc/arc_easy.yaml"
+      }
+    }
+  },
+  "versions": {
+    "arc_easy": 1.0
+  },
+  "n-shot": {
+    "arc_easy": 0
+  },
+  "higher_is_better": {
+    "arc_easy": {
+      "acc": true,
+      "acc_norm": true
+    }
+  },
+  "n-samples": {
+    "arc_easy": {
+      "original": 2376,
+      "effective": 2376
+    }
+  },
+  "config": {
+    "model": "hf",
+    "model_args": {
+      "pretrained": "outputs/grpo-full/final",
+      "dtype": "bfloat16"
+    },
+    "model_num_parameters": 1123125248,
+    "model_dtype": "torch.bfloat16",
+    "model_revision": "main",
+    "model_sha": "",
+    "batch_size": "auto",
+    "batch_sizes": [
+      64
+    ],
+    "device": "cuda",
+    "use_cache": null,
+    "limit": null,
+    "bootstrap_iters": 100000,
+    "gen_kwargs": {},
+    "random_seed": 0,
+    "numpy_seed": 1234,
+    "torch_seed": 1234,
+    "fewshot_seed": 1234
+  },
+  "git_hash": "03fcf23",
+  "date": 1779795408.3004034,
+  "pretty_env_info": "PyTorch version: 2.12.0+cu130\nIs debug build: False\nCUDA used to build PyTorch: 13.0\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 24.04.4 LTS (x86_64)\nGCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0\nClang version: Could not collect\nCMake version: Could not collect\nLibc version: glibc-2.39\n\nPython version: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)\nPython platform: Linux-6.17.0-1016-gcp-x86_64-with-glibc2.39\nIs CUDA available: True\nCUDA runtime version: Could not collect\nCUDA_MODULE_LOADING set to: \nGPU models and configuration: GPU 0: NVIDIA L4\nNvidia driver version: 580.126.20\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_adv.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_cnn.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_engines_precompiled.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_engines_runtime_compiled.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_graph.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_heuristic.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_ops.so.9.13.0\nIs XPU available: False\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\nCaching allocator config: N/A\n\nCPU:\nArchitecture:                            x86_64\nCPU op-mode(s):                          32-bit, 64-bit\nAddress sizes:                           46 bits physical, 48 bits virtual\nByte Order:                              Little Endian\nCPU(s):                                  16\nOn-line CPU(s) list:                     0-15\nVendor ID:                               GenuineIntel\nModel name:                              Intel(R) Xeon(R) CPU @ 2.20GHz\nCPU family:                              6\nModel:                                   85\nThread(s) per core:                      2\nCore(s) per socket:                      8\nSocket(s):                               1\nStepping:                                7\nBogoMIPS:                                4400.43\nFlags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities\nHypervisor vendor:                       KVM\nVirtualization type:                     full\nL1d cache:                               256 KiB (8 instances)\nL1i cache:                               256 KiB (8 instances)\nL2 cache:                                8 MiB (8 instances)\nL3 cache:                                38.5 MiB (1 instance)\nNUMA node(s):                            1\nNUMA node0 CPU(s):                       0-15\nVulnerability Gather data sampling:      Not affected\nVulnerability Ghostwrite:                Not affected\nVulnerability Indirect target selection: Mitigation; Aligned branch/return thunks\nVulnerability Itlb multihit:             Not affected\nVulnerability L1tf:                      Not affected\nVulnerability Mds:                       Not affected\nVulnerability Meltdown:                  Not affected\nVulnerability Mmio stale data:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Old microcode:             Not affected\nVulnerability Reg file data sampling:    Not affected\nVulnerability Retbleed:                  Mitigation; Enhanced IBRS\nVulnerability Spec rstack overflow:      Not affected\nVulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop\nVulnerability Srbds:                     Not affected\nVulnerability Tsa:                       Not affected\nVulnerability Tsx async abort:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Vmscape:                   Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==2.4.6\n[pip3] nvidia-cublas==13.1.1.3\n[pip3] nvidia-cuda-cupti==13.0.85\n[pip3] nvidia-cuda-nvrtc==13.0.88\n[pip3] nvidia-cuda-runtime==13.0.96\n[pip3] nvidia-cudnn-cu13==9.20.0.48\n[pip3] nvidia-cufft==12.0.0.61\n[pip3] nvidia-curand==10.4.0.35\n[pip3] nvidia-cusolver==12.0.4.66\n[pip3] nvidia-cusparse==12.6.3.3\n[pip3] nvidia-cusparselt-cu13==0.8.1\n[pip3] nvidia-nccl-cu13==2.29.7\n[pip3] nvidia-nvjitlink==13.0.88\n[pip3] nvidia-nvtx==13.0.85\n[pip3] torch==2.12.0\n[pip3] triton==3.7.0\n[conda] Could not collect",
+  "transformers_version": "5.9.0",
+  "lm_eval_version": "0.4.12",
+  "upper_git_hash": null,
+  "tokenizer_pad_token": [
+    "<|pad|>",
+    "3"
+  ],
+  "tokenizer_eos_token": [
+    "<|eos|>",
+    "1"
+  ],
+  "tokenizer_bos_token": [
+    "<|bos|>",
+    "0"
+  ],
+  "eot_token_id": 1,
+  "max_length": 4096,
+  "task_hashes": {
+    "arc_easy": "dce0d9b0f0cecd55bf2ac264042c5e45487df708d13123af3ae9e67bbbefdeb1"
+  },
+  "model_source": "hf",
+  "model_name": "outputs/grpo-full/final",
+  "model_name_sanitized": "outputs__grpo-full__final",
+  "system_instruction": null,
+  "system_instruction_sha": null,
+  "fewshot_as_multiturn": null,
+  "chat_template": null,
+  "chat_template_sha": null,
+  "total_evaluation_time_seconds": "56.55086992699944"
+}
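
One way to read the arc_easy scores above is to compare them with a random-guess baseline. The sketch below assumes a 25% chance rate, i.e. four answer options per question; that is an approximation introduced here, not something stated in the results file, and `z_vs_chance` is a hypothetical helper.

```python
def z_vs_chance(score: float, stderr: float, chance: float = 0.25) -> float:
    """Distance of a score from the chance baseline, in standard errors.

    Assumption: chance = 0.25 (four answer options per question).
    """
    return (score - chance) / stderr


# (score, stderr) pairs copied from the arc_easy results block above.
reported = {
    "acc": (0.2760942760942761, 0.00917355987383544),
    "acc_norm": (0.2878787878787879, 0.009290733161670239),
}

z_scores = {name: z_vs_chance(s, se) for name, (s, se) in reported.items()}
for name, z in z_scores.items():
    print(f"{name}: {z:.2f} standard errors above chance")
```

Under that assumption both metrics sit a few standard errors above guessing, which is a small but measurable margin.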

eval_results/arc_easy/outputs__grpo-full__final/samples_arc_easy_2026-05-26T11-37-35.592842.jsonl ADDED

The diff for this file is too large to render.

eval_results/benchmark_summary.md ADDED

	@@ -0,0 +1,8 @@

+# Benchmark Results for outputs/grpo-full/final
+| Benchmark | Shots | Score |
+| :--- | :---: | :---: |
+| arc_easy | 0 | No results file |
+| arc_challenge | 25 | No results file |
+| hellaswag | 10 | No results file |
+| mmlu | 5 | No results file |

eval_results/combined_results.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "arc_easy": "No results file",
+  "arc_challenge": "No results file",
+  "hellaswag": "No results file",
+  "mmlu": "No results file"
+}
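
Both summary files above report "No results file" for every benchmark, although this commit also adds per-task results under `eval_results/<task>/outputs__grpo-full__final/results_<timestamp>.json`. The cause is not visible in this diff. As a rough sketch, a summary could be rebuilt from that directory layout as follows; `collect_scores` is a hypothetical helper and not the script that produced these files.

```python
import json
from pathlib import Path


def collect_scores(root, model_dir="outputs__grpo-full__final"):
    """Gather headline metrics from <root>/<task>/<model_dir>/results_*.json.

    Assumption: the directory layout matches the file list in this commit.
    """
    root = Path(root)
    summary = {}
    if not root.is_dir():
        return summary
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        run_dir = task_dir / model_dir
        candidates = sorted(run_dir.glob("results_*.json")) if run_dir.is_dir() else []
        if not candidates:
            summary[task_dir.name] = "No results file"
            continue
        # Timestamps in the file names sort lexicographically, so the
        # last candidate is the most recent run.
        with candidates[-1].open() as fh:
            data = json.load(fh)
        task_result = data.get("results", {}).get(task_dir.name, {})
        # Keep numeric metrics only; drop stderr entries and the sample count.
        summary[task_dir.name] = {
            key: value
            for key, value in task_result.items()
            if isinstance(value, (int, float))
            and not isinstance(value, bool)
            and "stderr" not in key
            and key != "sample_len"
        }
    return summary
```

Calling `collect_scores("eval_results")` from the repository root would return one entry per task directory, falling back to the same "No results file" string when nothing is found.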

eval_results/custom_eval.md ADDED

	@@ -0,0 +1,159 @@

+# Custom Evaluation: outputs/grpo-full/final
+### Level: Easy
+**Q: What is 15% of 200?**
+**A:**
+There are 5% of the population of unknown families, which is 3% of the population.
+In the case of the multi-choice, the population is 2%.
+The proportion of the population has an impact on the population of 500 people.
+The proportion of individuals living in rural areas is 4.
+The population of the people living in rural areas in urban areas has 4% of the population.
+The population of the population of West Africa is 2/100, which is the number of people living in rural areas:
+The proportion of the proportion of people in the proportion of people living in the rural regions is 0.6/3,
+The poverty in the rural area is 2/4.
+If the number of children living in the population is 25/2. The population of the population is 2.5.6.
+The population of people in rural areas is the population that is 2/3.6.
+The population of the proportion of the people living in rural areas is 10.6.
+The population of the population of the population is 4.
+The population of the proportion of individuals who live in poverty are 65 years older than the poverty-free population, and in poverty.
+The population of
+---
+### Level: Easy
+**Q: Solve for x: 3x + 5 = 20**
+**A:**
+There are only x 2 = 8
+The value of x is 2x + 3 = 2.
+The value of x is 2.
+The value of x is 2.
+#### 2.
+---
+### Level: Easy
+**Q: If a train travels 60 miles per hour, how far does it go in 2.5 hours?**
+**A:**
+The train tracks can be used to determine the speed of the train that is currently 10 km long, and the train is 10 meters.
+If the train was in a train of 6 km, then the train would be 4 hours in the train is in the train, then the train would be a total of 12.5 km/hr, and a train to drive the train to the train.
+If the train was running, how many miles traveled?
+If the train passed through the train, then the train would be able to take it at the train station.
+If it ran 50 mph, what is the distance from the train was to the train?
+If we know the answer to the above question is 3, what is the average speed of the train after the train is 3 feet, what is the average speed of the train?
+---
+### Level: Medium
+**Q: Find the roots of the quadratic equation: x^2 - 5x + 6 = 0**
+**A:**
+To solve this system of equations, we need to determine the value of $x^2 - 10 = 10$.
+The value of $x^2$ is $x^2 = 0.1$.
+The value of $x^2$ is given by:
+$x^2 - 5 = 5$.
+The value of $x^2$ is $2$ and the roots of the roots of $x^2$.
+The value of x is $0$, which is $x = \frac{1}{2} + 2 = 0$.
+Solving for $x = \frac{1}{2}$.
+---
+### Level: Medium
+**Q: What is the derivative of f(x) = 3x^4 - 2x^2 + x?**
+**A:**
+The derivative of the derivative of f(x) = 2(x^2 - 4x + 2x^2 + 2x + 4x + 3x + 2x^2 + 2x + 2x + 2x^2 + 4x^2 + 2x + 2x^2 + 2x^2) = 4.
+The answer is: 8
+---
+### Level: Medium
+**Q: If A = {1, 2, 3} and B = {3, 4, 5}, what is the union of sets A and B?**
+**A:**
+Let's solve the multi-choice question.
+Then, we have:
+A = {1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 7, 3, 6, 6, 8, 9, 7, 9, 9, 10, 9, 6, 9, 7, 8, 7, 7, 8, 7, 8, 8, 7, 7, 9, 9, 8, 8, 9, 7, 9}.
+We can conclude that the feasible difference in this case is not possible in the multi-choice question is also the smallest integer.
+Since there are two possible cases, the multi-choice question is equivalent to the multi-choice question.
+Answer Choices:
+Number of arrangements: (3, 6, 8, 13, 12), (2, 12), (3, 15), (3, 9, 8, 9) and (3, 15), (4, 8, 9), (2, 6, 13), (5, 9), (3, 9), (4, 9), (3, 9), (5, 5), (4, 1), (3, 2), (6, (3, 3), (4,
+---
+### Level: Hard
+**Q: Prove that the sum of the first n positive integers is n(n+1)/2.**
+**A:**
+n(n+1) = n(n-1)
+n(n+1) = n(n-1) + (n-1) = n(n-1)n(n+1) + n(n+1)n(n-1)n+1
+n1 = n(n-1)
+n2 = n2
+n2 = n2
+n3 = n2(n3)
+n3 = n2 + n3 + 1(n3)
+n4 = n3
+n4 = n4
+n2 = n3 = n1 + n4 + n2 = n2
+where n1 = n1 = n3 = 1.
+n3 = n1(n2) + n3(n3) = n1 + n1(n3) = n2(n1(n3) + n3(n3) = n2(n3)
+Now, we can write:
+n1 = n1(n1) = n3 + n3(n3) + n3 + n3(n3)
+n2 = n1 + (
+---
+### Level: Hard
+**Q: Evaluate the integral of x * e^x dx from 0 to 1.**
+**A:**
+import numpy as np
+# Define a value of x, which represents the value of x, y, and y, and y, which represents the value of x, y, and y, and y
+print(num_x)
+---
+### Level: Hard
+**Q: Find the eigenvalues of the matrix [[2, 1], [1, 2]].**
+**A:**
+import numpy as np
+# define the eigenvalues of the eigenvalues, and to get the eigenvalues of the eigenvalues.
+# Define the eigenvalues
+2.0 = (0.0,0)
+# compute the eigenvalues of the eigenvalues of the eigenvalues of the eigenvalues.
+# calculate the eigenvalues
+# The eigenvalues of the eigenvalues are given by the eigenvalues and eigenvectors of the eigenvalues.fill_1 and 0.1
+# calculate the eigenvalues of the eigenvalues of the eigenvalues
+# calculate the eigenvalues
+# the eigenvalues of the eigenvalues
+print(total_n) #print(# print the eigenvalues
+print(row_1)
+print(total_n)
+print(answer_n)
+---
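
For reference when reading the generations above, the expected answers to the custom prompts follow from standard arithmetic and calculus. The sketch below checks them with the Python standard library only; it is not part of the commit, and the variable names are illustrative.

```python
import math

# Easy
fifteen_percent = 15 * 200 / 100          # 15% of 200
x_solution = (20 - 5) / 3                 # 3x + 5 = 20
distance_miles = 60 * 2.5                 # 60 mph for 2.5 hours

# Medium: roots of x^2 - 5x + 6 = 0 via the quadratic formula
a, b, c = 1, -5, 6
disc = b * b - 4 * a * c
roots = sorted([(-b - math.sqrt(disc)) / (2 * a), (-b + math.sqrt(disc)) / (2 * a)])


def f(x):
    return 3 * x**4 - 2 * x**2 + x


def f_prime(x):
    # Analytic derivative: 12x^3 - 4x + 1
    return 12 * x**3 - 4 * x + 1


# Central-difference check of the derivative at x = 1.5
h = 1e-5
numeric_slope = (f(1.5 + h) - f(1.5 - h)) / (2 * h)

union = {1, 2, 3} | {3, 4, 5}

# Hard: spot check (not a proof) of 1 + 2 + ... + n = n(n+1)/2
n = 100
sum_matches = sum(range(1, n + 1)) == n * (n + 1) // 2


def simpson(g, lo, hi, steps=1000):
    """Composite Simpson's rule with an even number of steps."""
    width = (hi - lo) / steps
    total = g(lo) + g(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(lo + i * width)
    return total * width / 3


# Integral of x * e^x on [0, 1]; the antiderivative is (x - 1) * e^x
integral = simpson(lambda x: x * math.exp(x), 0.0, 1.0)

# Eigenvalues of [[2, 1], [1, 2]] from lambda^2 - trace*lambda + det = 0
trace, det = 2 + 2, 2 * 2 - 1 * 1
gap = math.sqrt(trace * trace - 4 * det)
eigenvalues = sorted([(trace - gap) / 2, (trace + gap) / 2])

print(fifteen_percent, x_solution, distance_miles, roots)
print(f_prime(1.5), numeric_slope, sorted(union))
print(sum_matches, integral, eigenvalues)
```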

eval_results/gsm8k/outputs__grpo-full__final/results_2026-05-26T09-33-14.995899.json ADDED

	@@ -0,0 +1,175 @@

+{
+  "results": {
+    "gsm8k": {
+      "name": "gsm8k",
+      "alias": "gsm8k",
+      "sample_len": 1319,
+      "exact_match,strict-match": 0.009855951478392721,
+      "exact_match_stderr,strict-match": 0.0027210765770416616,
+      "exact_match,flexible-extract": 0.021986353297952996,
+      "exact_match_stderr,flexible-extract": 0.0040391627581100546
+    }
+  },
+  "group_subtasks": {},
+  "configs": {
+    "gsm8k": {
+      "task": "gsm8k",
+      "dataset_path": "openai/gsm8k",
+      "dataset_name": "main",
+      "training_split": "train",
+      "test_split": "test",
+      "fewshot_split": "train",
+      "doc_to_text": "Question: {{question}}\nAnswer:",
+      "doc_to_target": "{{answer}}",
+      "unsafe_code": false,
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "fewshot_config": {
+        "sampler": "default",
+        "split": "train",
+        "process_docs": null,
+        "fewshot_indices": null,
+        "samples": null,
+        "doc_to_text": "Question: {{question}}\nAnswer:",
+        "doc_to_choice": null,
+        "doc_to_target": "{{answer}}",
+        "gen_prefix": null,
+        "fewshot_delimiter": "\n\n",
+        "target_delimiter": " "
+      },
+      "num_fewshot": 8,
+      "metric_list": [
+        {
+          "metric": "exact_match",
+          "aggregation": "mean",
+          "higher_is_better": true,
+          "ignore_case": true,
+          "ignore_punctuation": false,
+          "regexes_to_ignore": [
+            ",",
+            "\\$",
+            "(?s).*#### ",
+            "\\.$"
+          ]
+        }
+      ],
+      "output_type": "generate_until",
+      "generation_kwargs": {
+        "until": [
+          "Question:",
+          "</s>",
+          "<|im_end|>"
+        ],
+        "do_sample": false,
+        "temperature": 0.0
+      },
+      "repeats": 1,
+      "filter_list": [
+        {
+          "name": "strict-match",
+          "filter": [
+            {
+              "function": "regex",
+              "regex_pattern": "#### (\\-?[0-9\\.\\,]+)"
+            },
+            {
+              "function": "take_first"
+            }
+          ]
+        },
+        {
+          "name": "flexible-extract",
+          "filter": [
+            {
+              "function": "regex",
+              "group_select": -1,
+              "regex_pattern": "(-?[$0-9.,]{2,})|(-?[0-9]+)"
+            },
+            {
+              "function": "take_first"
+            }
+          ]
+        }
+      ],
+      "should_decontaminate": false,
+      "metadata": {
+        "version": 3.0,
+        "pretrained": "outputs/grpo-full/final",
+        "dtype": "bfloat16",
+        "config_source": "/home/himanshu/TinyMathReason-1B/venv/lib/python3.12/site-packages/lm_eval/tasks/gsm8k/gsm8k.yaml"
+      }
+    }
+  },
+  "versions": {
+    "gsm8k": 3.0
+  },
+  "n-shot": {
+    "gsm8k": 8
+  },
+  "higher_is_better": {
+    "gsm8k": {
+      "exact_match": true
+    }
+  },
+  "n-samples": {
+    "gsm8k": {
+      "original": 1319,
+      "effective": 1319
+    }
+  },
+  "config": {
+    "model": "hf",
+    "model_args": {
+      "pretrained": "outputs/grpo-full/final",
+      "dtype": "bfloat16"
+    },
+    "model_num_parameters": 1123125248,
+    "model_dtype": "torch.bfloat16",
+    "model_revision": "main",
+    "model_sha": "",
+    "batch_size": "auto",
+    "batch_sizes": [],
+    "device": "cuda",
+    "use_cache": null,
+    "limit": null,
+    "bootstrap_iters": 100000,
+    "gen_kwargs": {},
+    "random_seed": 0,
+    "numpy_seed": 1234,
+    "torch_seed": 1234,
+    "fewshot_seed": 1234
+  },
+  "git_hash": "03fcf23",
+  "date": 1779784898.324978,
+  "pretty_env_info": "PyTorch version: 2.12.0+cu130\nIs debug build: False\nCUDA used to build PyTorch: 13.0\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 24.04.4 LTS (x86_64)\nGCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0\nClang version: Could not collect\nCMake version: Could not collect\nLibc version: glibc-2.39\n\nPython version: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)\nPython platform: Linux-6.17.0-1016-gcp-x86_64-with-glibc2.39\nIs CUDA available: True\nCUDA runtime version: Could not collect\nCUDA_MODULE_LOADING set to: \nGPU models and configuration: GPU 0: NVIDIA L4\nNvidia driver version: 580.126.20\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_adv.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_cnn.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_engines_precompiled.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_engines_runtime_compiled.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_graph.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_heuristic.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_ops.so.9.13.0\nIs XPU available: False\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\nCaching allocator config: N/A\n\nCPU:\nArchitecture:                            x86_64\nCPU op-mode(s):                          32-bit, 
64-bit\nAddress sizes:                           46 bits physical, 48 bits virtual\nByte Order:                              Little Endian\nCPU(s):                                  16\nOn-line CPU(s) list:                     0-15\nVendor ID:                               GenuineIntel\nModel name:                              Intel(R) Xeon(R) CPU @ 2.20GHz\nCPU family:                              6\nModel:                                   85\nThread(s) per core:                      2\nCore(s) per socket:                      8\nSocket(s):                               1\nStepping:                                7\nBogoMIPS:                                4400.43\nFlags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities\nHypervisor vendor:                       KVM\nVirtualization type:                     full\nL1d cache:                               256 KiB (8 instances)\nL1i cache:                               256 KiB (8 instances)\nL2 cache:                                8 MiB (8 instances)\nL3 cache:                                38.5 MiB (1 instance)\nNUMA node(s):                            1\nNUMA node0 CPU(s):                       0-15\nVulnerability Gather data sampling:      Not affected\nVulnerability Ghostwrite:                Not affected\nVulnerability Indirect target selection: Mitigation; Aligned branch/return thunks\nVulnerability Itlb multihit:             Not 
affected\nVulnerability L1tf:                      Not affected\nVulnerability Mds:                       Not affected\nVulnerability Meltdown:                  Not affected\nVulnerability Mmio stale data:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Old microcode:             Not affected\nVulnerability Reg file data sampling:    Not affected\nVulnerability Retbleed:                  Mitigation; Enhanced IBRS\nVulnerability Spec rstack overflow:      Not affected\nVulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop\nVulnerability Srbds:                     Not affected\nVulnerability Tsa:                       Not affected\nVulnerability Tsx async abort:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Vmscape:                   Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==2.4.6\n[pip3] nvidia-cublas==13.1.1.3\n[pip3] nvidia-cuda-cupti==13.0.85\n[pip3] nvidia-cuda-nvrtc==13.0.88\n[pip3] nvidia-cuda-runtime==13.0.96\n[pip3] nvidia-cudnn-cu13==9.20.0.48\n[pip3] nvidia-cufft==12.0.0.61\n[pip3] nvidia-curand==10.4.0.35\n[pip3] nvidia-cusolver==12.0.4.66\n[pip3] nvidia-cusparse==12.6.3.3\n[pip3] nvidia-cusparselt-cu13==0.8.1\n[pip3] nvidia-nccl-cu13==2.29.7\n[pip3] nvidia-nvjitlink==13.0.88\n[pip3] nvidia-nvtx==13.0.85\n[pip3] torch==2.12.0\n[pip3] triton==3.7.0\n[conda] Could not collect",
+  "transformers_version": "5.9.0",
+  "lm_eval_version": "0.4.12",
+  "upper_git_hash": null,
+  "tokenizer_pad_token": [
+    "<|pad|>",
+    "3"
+  ],
+  "tokenizer_eos_token": [
+    "<|eos|>",
+    "1"
+  ],
+  "tokenizer_bos_token": [
+    "<|bos|>",
+    "0"
+  ],
+  "eot_token_id": 1,
+  "max_length": 4096,
+  "task_hashes": {
+    "gsm8k": "5edaa24ff4f3d939c3e1c5fd65a53cead84d4a52171818c453ec47099bd2a422"
+  },
+  "model_source": "hf",
+  "model_name": "outputs/grpo-full/final",
+  "model_name_sanitized": "outputs__grpo-full__final",
+  "system_instruction": null,
+  "system_instruction_sha": null,
+  "fewshot_as_multiturn": null,
+  "chat_template": null,
+  "chat_template_sha": null,
+  "total_evaluation_time_seconds": "3105.920454603998"
+}
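
Reviewer note on the two `filter_list` entries above: the gap between `strict-match` (about 0.0099) and `flexible-extract` (about 0.0220) follows from how each regex pulls an answer out of a completion. The sketch below is a simplified approximation of that behaviour, not the harness's own code. The helper `extract` and both sample completions are hypothetical; only the two regex patterns and `group_select: -1` are taken from the config.

```python
import re

# Patterns copied from the gsm8k config above.
STRICT = re.compile(r"#### (\-?[0-9\.\,]+)")
FLEXIBLE = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")


def extract(pattern, text, group_select=0, fallback="[invalid]"):
    """Hypothetical helper: pick one regex match from a completion.

    Assumption: mirrors the general idea of a regex filter followed by
    take_first, simplified for illustration.
    """
    matches = pattern.findall(text)
    if not matches:
        return fallback
    match = matches[group_select]
    # With several capture groups, findall returns tuples; keep the
    # first non-empty group.
    if isinstance(match, tuple):
        non_empty = [g for g in match if g]
        return non_empty[0].strip() if non_empty else fallback
    return match.strip()


# Hypothetical completions, one with the "#### " marker and one without.
with_marker = "She sells 9 eggs at $2 each, so 9 * 2 = 18.\n#### 18"
without_marker = "She sells 9 eggs at $2 each, so she makes 18 dollars"

print(extract(STRICT, with_marker))                      # strict, marker present
print(extract(STRICT, without_marker))                   # strict, marker missing
print(extract(FLEXIBLE, without_marker, group_select=-1))  # last number in text
```

A completion that reaches the right number but never emits the `#### ` marker scores zero under `strict-match` and can still score under `flexible-extract`, which takes the last number-like token.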

eval_results/gsm8k/outputs__grpo-full__final/samples_gsm8k_2026-05-26T09-33-14.995899.jsonl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:19d5589c2ca07cc8c977a4a341647694d37a6f6e76925e364b283957fb5313fa
+size 17686475

eval_results/hellaswag/outputs__grpo-full__final/results_2026-05-26T12-29-19.954733.json ADDED

	@@ -0,0 +1,139 @@

+{
+  "results": {
+    "hellaswag": {
+      "name": "hellaswag",
+      "alias": "hellaswag",
+      "sample_len": 10042,
+      "acc,none": 0.2590121489743079,
+      "acc_stderr,none": 0.004371969542814812,
+      "acc_norm,none": 0.2629954192391954,
+      "acc_norm_stderr,none": 0.004393601887506574
+    }
+  },
+  "group_subtasks": {},
+  "configs": {
+    "hellaswag": {
+      "task": "hellaswag",
+      "dataset_path": "Rowan/hellaswag",
+      "training_split": "train",
+      "validation_split": "validation",
+      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n    def _process_doc(doc):\n        ctx = doc[\"ctx_a\"] + \" \" + doc[\"ctx_b\"].capitalize()\n        out_doc = {\n            \"query\": preprocess(doc[\"activity_label\"] + \": \" + ctx),\n            \"choices\": [preprocess(ending) for ending in doc[\"endings\"]],\n            \"gold\": int(doc[\"label\"]),\n        }\n        return out_doc\n\n    return dataset.map(_process_doc)\n",
+      "doc_to_text": "{{query}}",
+      "doc_to_target": "{{label}}",
+      "unsafe_code": false,
+      "doc_to_choice": "choices",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "fewshot_config": {
+        "sampler": "default",
+        "split": null,
+        "process_docs": "<function process_docs at 0x74288899f7e0>",
+        "fewshot_indices": null,
+        "samples": null,
+        "doc_to_text": "{{query}}",
+        "doc_to_choice": "choices",
+        "doc_to_target": "{{label}}",
+        "gen_prefix": null,
+        "fewshot_delimiter": "\n\n",
+        "target_delimiter": " "
+      },
+      "num_fewshot": 10,
+      "metric_list": [
+        {
+          "metric": "acc",
+          "aggregation": "mean",
+          "higher_is_better": true
+        },
+        {
+          "metric": "acc_norm",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "multiple_choice",
+      "repeats": 1,
+      "should_decontaminate": false,
+      "metadata": {
+        "version": 1.0,
+        "pretrained": "outputs/grpo-full/final",
+        "dtype": "bfloat16",
+        "config_source": "/home/himanshu/TinyMathReason-1B/venv/lib/python3.12/site-packages/lm_eval/tasks/hellaswag/hellaswag.yaml"
+      }
+    }
+  },
+  "versions": {
+    "hellaswag": 1.0
+  },
+  "n-shot": {
+    "hellaswag": 10
+  },
+  "higher_is_better": {
+    "hellaswag": {
+      "acc": true,
+      "acc_norm": true
+    }
+  },
+  "n-samples": {
+    "hellaswag": {
+      "original": 10042,
+      "effective": 10042
+    }
+  },
+  "config": {
+    "model": "hf",
+    "model_args": {
+      "pretrained": "outputs/grpo-full/final",
+      "dtype": "bfloat16"
+    },
+    "model_num_parameters": 1123125248,
+    "model_dtype": "torch.bfloat16",
+    "model_revision": "main",
+    "model_sha": "",
+    "batch_size": "auto",
+    "batch_sizes": [
+      64
+    ],
+    "device": "cuda",
+    "use_cache": null,
+    "limit": null,
+    "bootstrap_iters": 100000,
+    "gen_kwargs": {},
+    "random_seed": 0,
+    "numpy_seed": 1234,
+    "torch_seed": 1234,
+    "fewshot_seed": 1234
+  },
+  "git_hash": "03fcf23",
+  "date": 1779795876.0037582,
+  "pretty_env_info": "PyTorch version: 2.12.0+cu130\nIs debug build: False\nCUDA used to build PyTorch: 13.0\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 24.04.4 LTS (x86_64)\nGCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0\nClang version: Could not collect\nCMake version: Could not collect\nLibc version: glibc-2.39\n\nPython version: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)\nPython platform: Linux-6.17.0-1016-gcp-x86_64-with-glibc2.39\nIs CUDA available: True\nCUDA runtime version: Could not collect\nCUDA_MODULE_LOADING set to: \nGPU models and configuration: GPU 0: NVIDIA L4\nNvidia driver version: 580.126.20\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_adv.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_cnn.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_engines_precompiled.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_engines_runtime_compiled.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_graph.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_heuristic.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_ops.so.9.13.0\nIs XPU available: False\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\nCaching allocator config: N/A\n\nCPU:\nArchitecture:                            x86_64\nCPU op-mode(s):                          32-bit, 
64-bit\nAddress sizes:                           46 bits physical, 48 bits virtual\nByte Order:                              Little Endian\nCPU(s):                                  16\nOn-line CPU(s) list:                     0-15\nVendor ID:                               GenuineIntel\nModel name:                              Intel(R) Xeon(R) CPU @ 2.20GHz\nCPU family:                              6\nModel:                                   85\nThread(s) per core:                      2\nCore(s) per socket:                      8\nSocket(s):                               1\nStepping:                                7\nBogoMIPS:                                4400.43\nFlags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities\nHypervisor vendor:                       KVM\nVirtualization type:                     full\nL1d cache:                               256 KiB (8 instances)\nL1i cache:                               256 KiB (8 instances)\nL2 cache:                                8 MiB (8 instances)\nL3 cache:                                38.5 MiB (1 instance)\nNUMA node(s):                            1\nNUMA node0 CPU(s):                       0-15\nVulnerability Gather data sampling:      Not affected\nVulnerability Ghostwrite:                Not affected\nVulnerability Indirect target selection: Mitigation; Aligned branch/return thunks\nVulnerability Itlb multihit:             Not 
affected\nVulnerability L1tf:                      Not affected\nVulnerability Mds:                       Not affected\nVulnerability Meltdown:                  Not affected\nVulnerability Mmio stale data:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Old microcode:             Not affected\nVulnerability Reg file data sampling:    Not affected\nVulnerability Retbleed:                  Mitigation; Enhanced IBRS\nVulnerability Spec rstack overflow:      Not affected\nVulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop\nVulnerability Srbds:                     Not affected\nVulnerability Tsa:                       Not affected\nVulnerability Tsx async abort:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Vmscape:                   Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==2.4.6\n[pip3] nvidia-cublas==13.1.1.3\n[pip3] nvidia-cuda-cupti==13.0.85\n[pip3] nvidia-cuda-nvrtc==13.0.88\n[pip3] nvidia-cuda-runtime==13.0.96\n[pip3] nvidia-cudnn-cu13==9.20.0.48\n[pip3] nvidia-cufft==12.0.0.61\n[pip3] nvidia-curand==10.4.0.35\n[pip3] nvidia-cusolver==12.0.4.66\n[pip3] nvidia-cusparse==12.6.3.3\n[pip3] nvidia-cusparselt-cu13==0.8.1\n[pip3] nvidia-nccl-cu13==2.29.7\n[pip3] nvidia-nvjitlink==13.0.88\n[pip3] nvidia-nvtx==13.0.85\n[pip3] torch==2.12.0\n[pip3] triton==3.7.0\n[conda] Could not collect",
+  "transformers_version": "5.9.0",
+  "lm_eval_version": "0.4.12",
+  "upper_git_hash": null,
+  "tokenizer_pad_token": [
+    "<|pad|>",
+    "3"
+  ],
+  "tokenizer_eos_token": [
+    "<|eos|>",
+    "1"
+  ],
+  "tokenizer_bos_token": [
+    "<|bos|>",
+    "0"
+  ],
+  "eot_token_id": 1,
+  "max_length": 4096,
+  "task_hashes": {
+    "hellaswag": "d4bcb44ec68db2b8a65f050c3c64c48454179b48fd8aee3e73b55e2ec51e6d82"
+  },
+  "model_source": "hf",
+  "model_name": "outputs/grpo-full/final",
+  "model_name_sanitized": "outputs__grpo-full__final",
+  "system_instruction": null,
+  "system_instruction_sha": null,
+  "fewshot_as_multiturn": null,
+  "chat_template": null,
+  "chat_template_sha": null,
+  "total_evaluation_time_seconds": "2693.2404373019963"
+}
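
Reviewer note on the `*_stderr` fields: the reported values appear consistent with the sample standard error of a mean over 0/1 scores, `sqrt(p * (1 - p) / (n - 1))`. The check below is a minimal sketch under that assumption. The function name `mean_stderr` is hypothetical; the accuracies and sample counts are copied from the gsm8k and hellaswag result files in this diff.

```python
import math


def mean_stderr(p, n):
    """Standard error of the mean of n binary scores with mean p.

    Assumption: uses the sample (n - 1) variance, which is what the
    reported numbers appear to match.
    """
    return math.sqrt(p * (1.0 - p) / (n - 1))


# Values copied from the results files above.
gsm8k_strict = mean_stderr(0.009855951478392721, 1319)
hellaswag_acc = mean_stderr(0.2590121489743079, 10042)

print(f"gsm8k strict-match stderr: {gsm8k_strict:.6f}")
print(f"hellaswag acc stderr:      {hellaswag_acc:.6f}")
```

With accuracies this close to the floor, the error bars are a useful reminder that differences of a fraction of a percentage point between runs may not be meaningful.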

eval_results/hellaswag/outputs__grpo-full__final/samples_hellaswag_2026-05-26T12-29-19.954733.jsonl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bf9e82964443ca811ee5103dca5c32ec42a4ef3fe4391929b5a6148d22e18613
+size 186310702

eval_results/minerva_math_algebra/outputs__grpo-full__final/results_2026-05-26T10-07-08.234107.json ADDED

	@@ -0,0 +1,145 @@

+{
+  "results": {
+    "minerva_math_algebra": {
+      "name": "minerva_math_algebra",
+      "alias": "minerva_math_algebra",
+      "sample_len": 1187,
+      "exact_match,none": 0.0,
+      "exact_match_stderr,none": 0.0,
+      "math_verify,none": 0.020219039595619208,
+      "math_verify_stderr,none": 0.00408697908051843
+    }
+  },
+  "group_subtasks": {},
+  "configs": {
+    "minerva_math_algebra": {
+      "task": "minerva_math_algebra",
+      "dataset_path": "EleutherAI/hendrycks_math",
+      "dataset_name": "algebra",
+      "training_split": "train",
+      "test_split": "test",
+      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n    def _process_doc(doc: dict) -> dict:\n        out_doc = {\n            \"problem\": doc[\"problem\"],\n            \"solution\": doc[\"solution\"],\n            \"answer\": normalize_final_answer(\n                remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n            ),\n        }\n        if getattr(doc, \"few_shot\", None) is not None:\n            out_doc[\"few_shot\"] = True\n        return out_doc\n\n    return dataset.map(_process_doc)\n",
+      "doc_to_text": "def doc_to_text(doc: dict) -> str:\n    return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+      "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+      "unsafe_code": false,
+      "process_results": "def process_results(doc: dict, results: list[str]) -> dict[str, int]:\n    candidates = results[0]\n\n    unnormalized_answer = get_unnormalized_answer(candidates)\n    answer = normalize_final_answer(unnormalized_answer)\n\n    if is_equiv(answer, doc[\"answer\"]):\n        retval = 1\n    else:\n        retval = 0\n\n    # math_verify\n    _mvres = verify(\n        gold=parse(doc[\"solution\"]),\n        target=parse(candidates),\n    )\n    mathval = 1 if _mvres else 0\n\n    res = {\n        \"exact_match\": retval,\n        \"math_verify\": mathval,\n    }\n    return res\n",
+      "description": "",
+      "target_delimiter": " ",
+      "fewshot_delimiter": "\n\n",
+      "fewshot_config": {
+        "sampler": "first_n",
+        "split": null,
+        "process_docs": "<function process_docs at 0x7dbd24753740>",
+        "fewshot_indices": null,
+        "samples": "<function list_fewshot_samples at 0x7dbd1874ec00>",
+        "doc_to_text": "<function doc_to_text at 0x7dbd246d1580>",
+        "doc_to_choice": null,
+        "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+        "gen_prefix": null,
+        "fewshot_delimiter": "\n\n",
+        "target_delimiter": " "
+      },
+      "num_fewshot": 4,
+      "metric_list": [
+        {
+          "metric": "exact_match",
+          "aggregation": "mean",
+          "higher_is_better": true
+        },
+        {
+          "metric": "math_verify",
+          "aggregation": "mean",
+          "higher_is_better": true
+        }
+      ],
+      "output_type": "generate_until",
+      "generation_kwargs": {
+        "until": [
+          "Problem:"
+        ],
+        "do_sample": false,
+        "temperature": 0.0
+      },
+      "repeats": 1,
+      "should_decontaminate": false,
+      "metadata": {
+        "version": 3.0,
+        "pretrained": "outputs/grpo-full/final",
+        "dtype": "bfloat16",
+        "config_source": "/home/himanshu/TinyMathReason-1B/venv/lib/python3.12/site-packages/lm_eval/tasks/minerva_math/minerva_math_algebra.yaml"
+      }
+    }
+  },
+  "versions": {
+    "minerva_math_algebra": 3.0
+  },
+  "n-shot": {
+    "minerva_math_algebra": 4
+  },
+  "higher_is_better": {
+    "minerva_math_algebra": {
+      "exact_match": true,
+      "math_verify": true
+    }
+  },
+  "n-samples": {
+    "minerva_math_algebra": {
+      "original": 1187,
+      "effective": 1187
+    }
+  },
+  "config": {
+    "model": "hf",
+    "model_args": {
+      "pretrained": "outputs/grpo-full/final",
+      "dtype": "bfloat16"
+    },
+    "model_num_parameters": 1123125248,
+    "model_dtype": "torch.bfloat16",
+    "model_revision": "main",
+    "model_sha": "",
+    "batch_size": "auto",
+    "batch_sizes": [],
+    "device": "cuda",
+    "use_cache": null,
+    "limit": null,
+    "bootstrap_iters": 100000,
+    "gen_kwargs": {},
+    "random_seed": 0,
+    "numpy_seed": 1234,
+    "torch_seed": 1234,
+    "fewshot_seed": 1234
+  },
+  "git_hash": "03fcf23",
+  "date": 1779789023.103882,
+  "pretty_env_info": "PyTorch version: 2.12.0+cu130\nIs debug build: False\nCUDA used to build PyTorch: 13.0\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 24.04.4 LTS (x86_64)\nGCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0\nClang version: Could not collect\nCMake version: Could not collect\nLibc version: glibc-2.39\n\nPython version: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)\nPython platform: Linux-6.17.0-1016-gcp-x86_64-with-glibc2.39\nIs CUDA available: True\nCUDA runtime version: Could not collect\nCUDA_MODULE_LOADING set to: \nGPU models and configuration: GPU 0: NVIDIA L4\nNvidia driver version: 580.126.20\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.13.0\n/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_adv.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_cnn.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_engines_precompiled.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_engines_runtime_compiled.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_graph.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_heuristic.so.9.13.0\n/usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudnn_ops.so.9.13.0\nIs XPU available: False\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\nCaching allocator config: N/A\n\nCPU:\nArchitecture:                            x86_64\nCPU op-mode(s):                          32-bit, 
64-bit\nAddress sizes:                           46 bits physical, 48 bits virtual\nByte Order:                              Little Endian\nCPU(s):                                  16\nOn-line CPU(s) list:                     0-15\nVendor ID:                               GenuineIntel\nModel name:                              Intel(R) Xeon(R) CPU @ 2.20GHz\nCPU family:                              6\nModel:                                   85\nThread(s) per core:                      2\nCore(s) per socket:                      8\nSocket(s):                               1\nStepping:                                7\nBogoMIPS:                                4400.43\nFlags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities\nHypervisor vendor:                       KVM\nVirtualization type:                     full\nL1d cache:                               256 KiB (8 instances)\nL1i cache:                               256 KiB (8 instances)\nL2 cache:                                8 MiB (8 instances)\nL3 cache:                                38.5 MiB (1 instance)\nNUMA node(s):                            1\nNUMA node0 CPU(s):                       0-15\nVulnerability Gather data sampling:      Not affected\nVulnerability Ghostwrite:                Not affected\nVulnerability Indirect target selection: Mitigation; Aligned branch/return thunks\nVulnerability Itlb multihit:             Not 
affected\nVulnerability L1tf:                      Not affected\nVulnerability Mds:                       Not affected\nVulnerability Meltdown:                  Not affected\nVulnerability Mmio stale data:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Old microcode:             Not affected\nVulnerability Reg file data sampling:    Not affected\nVulnerability Retbleed:                  Mitigation; Enhanced IBRS\nVulnerability Spec rstack overflow:      Not affected\nVulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop\nVulnerability Srbds:                     Not affected\nVulnerability Tsa:                       Not affected\nVulnerability Tsx async abort:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\nVulnerability Vmscape:                   Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==2.4.6\n[pip3] nvidia-cublas==13.1.1.3\n[pip3] nvidia-cuda-cupti==13.0.85\n[pip3] nvidia-cuda-nvrtc==13.0.88\n[pip3] nvidia-cuda-runtime==13.0.96\n[pip3] nvidia-cudnn-cu13==9.20.0.48\n[pip3] nvidia-cufft==12.0.0.61\n[pip3] nvidia-curand==10.4.0.35\n[pip3] nvidia-cusolver==12.0.4.66\n[pip3] nvidia-cusparse==12.6.3.3\n[pip3] nvidia-cusparselt-cu13==0.8.1\n[pip3] nvidia-nccl-cu13==2.29.7\n[pip3] nvidia-nvjitlink==13.0.88\n[pip3] nvidia-nvtx==13.0.85\n[pip3] torch==2.12.0\n[pip3] triton==3.7.0\n[conda] Could not collect",
+  "transformers_version": "5.9.0",
+  "lm_eval_version": "0.4.12",
+  "upper_git_hash": null,
+  "tokenizer_pad_token": [
+    "<|pad|>",
+    "3"
+  ],
+  "tokenizer_eos_token": [
+    "<|eos|>",
+    "1"
+  ],
+  "tokenizer_bos_token": [
+    "<|bos|>",
+    "0"
+  ],
+  "eot_token_id": 1,
+  "max_length": 4096,
+  "task_hashes": {
+    "minerva_math_algebra": "5c955bbc89ad645142d61b1594b7c36b552b722edf416ae40fcc71a4c50bd24b"
+  },
+  "model_source": "hf",
+  "model_name": "outputs/grpo-full/final",
+  "model_name_sanitized": "outputs__grpo-full__final",
+  "system_instruction": null,
+  "system_instruction_sha": null,
+  "fewshot_as_multiturn": null,
+  "chat_template": null,
+  "chat_template_sha": null,
+  "total_evaluation_time_seconds": "1014.535830836001"
+}

eval_results/minerva_math_algebra/outputs__grpo-full__final/samples_minerva_math_algebra_2026-05-26T10-07-08.234107.jsonl ADDED

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/results_2026-05-26T12-51-35.680076.json ADDED

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_abstract_algebra_2026-05-26T12-51-35.680076.jsonl ADDED

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_anatomy_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_astronomy_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_business_ethics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_clinical_knowledge_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_biology_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_chemistry_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_computer_science_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_mathematics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_medicine_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_college_physics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_computer_security_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_conceptual_physics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_econometrics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_electrical_engineering_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_elementary_mathematics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_formal_logic_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_global_facts_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_biology_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_chemistry_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_computer_science_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_european_history_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_geography_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_government_and_politics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_macroeconomics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_mathematics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_microeconomics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_physics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_psychology_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_statistics_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_us_history_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_high_school_world_history_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_human_aging_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_human_sexuality_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

eval_results/mmlu/outputs__grpo-full__final/samples_mmlu_international_law_2026-05-26T12-51-35.680076.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff