Text Generation
PEFT
Safetensors
English
code-generation
grpo
lora
qlora
spark
co-evolution
python
conversational
Instructions to use amarsaikhan/spark-code-A-3b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use amarsaikhan/spark-code-A-3b with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")
model = PeftModel.from_pretrained(base_model, "amarsaikhan/spark-code-A-3b")
- Notebooks
- Google Colab
- Kaggle
Initial capstone artifact upload
- .gitattributes +1 -0
- README.md +118 -0
- adapter_config.json +48 -0
- adapter_model.safetensors +3 -0
- chat_template.jinja +54 -0
- metrics.json +100 -0
- tokenizer.json +3 -0
- tokenizer_config.json +29 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
|
@@ -0,0 +1,118 @@
---
base_model: Qwen/Qwen2.5-Coder-3B-Instruct
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- code-generation
- grpo
- lora
- qlora
- spark
- co-evolution
- python
datasets:
- google-research-datasets/mbpp
- openai/openai_humaneval
---

# SPARK-Code · Condition A (Exec-Only GRPO) · Qwen2.5-Coder-3B QLoRA

**QLoRA adapter trained with execution-grounded GRPO. The strongest and most stable cross-benchmark performer in the SPARK-Code study.**

## TL;DR

`spark-code-A-3b` is a LoRA adapter for `Qwen/Qwen2.5-Coder-3B-Instruct` produced by 3 iterations of Group Relative Policy Optimization (GRPO) on 200 MBPP problems, using partial per-test execution feedback as the only reward signal. HumanEval pass@1 rises monotonically from 0.796 to 0.805 (+0.85 pp) while the mean per-token KL to the frozen reference stays below 1.1e-3. On held-out MBPP, pass@1 dips to 0.624 at iteration 1 and ends at 0.636 (baseline 0.634); pass@5 moves from 0.68 to 0.69, with an intermediate peak of 0.71 at iteration 2. In the three-arm capstone comparison, Condition A is the only run that improves on both benchmarks without policy drift.

## Training Setup

- **Base model:** `Qwen/Qwen2.5-Coder-3B-Instruct`
- **Method:** Execution-grounded GRPO. For each MBPP problem we generate a group of rollouts, score each rollout by the fraction of unit tests it passes (with explicit penalties for syntax/runtime/timeout errors), normalize rewards within the group, and apply a clipped PPO-style policy-gradient update. No auxiliary SFT objective is used in this condition; it is the exec-only baseline.
- **Training data:** MBPP-sanitized, 200 problems, 3 iterations, K=4 rollouts per problem, adaptively extended to as many as 8 when the group has zero advantage variance. Partial per-test rewards with `syntax_penalty=-0.2`, `runtime_penalty=-0.1`, `timeout_penalty=-0.3`, `wrong_answer_floor=0.0`.
- **LoRA:** `r=16`, `alpha=32`, `dropout=0.05`, target modules `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
- **Quantization:** 4-bit NF4 with double quantization, bf16 compute.
- **Optimizer:** AdamW, `lr=5e-6`, `grad_accum=4`, `clip_ratio=0.2`, `max_grad_norm=1.0`.
- **KL regularization:** `kl_coeff=0.01` against a frozen-reference policy (k=3 estimator, log-probs cached at rollout time).
- **Auxiliary objective:** none (this is Condition A).
- **Seed:** 42.

Training script: `run_experiment_with_mbpp_heldout.py` in the capstone repo.
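
The reward, group normalization, and clipped update described in the Training Setup list can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in `run_experiment_with_mbpp_heldout.py`: the function names are hypothetical, the way penalties replace the pass fraction is a guess, rewards are z-scored within the group, and "k=3 estimator" is read as the k3 KL estimator.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional

# Penalty values copied from the Training Setup list. How they combine with the
# pass fraction is an assumption made for this sketch.
PENALTIES = {"syntax": -0.2, "runtime": -0.1, "timeout": -0.3}
WRONG_ANSWER_FLOOR = 0.0


def rollout_reward(tests_passed: int, tests_total: int, error: Optional[str] = None) -> float:
    """Partial per-test reward: a fixed penalty on errors, else the fraction of tests passed."""
    if error in PENALTIES:
        return PENALTIES[error]
    frac = tests_passed / tests_total if tests_total else 0.0
    return max(frac, WRONG_ANSWER_FLOOR)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Z-score rewards within one rollout group; a zero-variance group gives all zeros."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma < eps:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def token_loss(logp_new: float, logp_old: float, logp_ref: float, advantage: float,
               clip_ratio: float = 0.2, kl_coeff: float = 0.01) -> float:
    """Clipped PPO-style surrogate for one token plus a k3 KL penalty to the reference."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    surrogate = min(ratio * advantage, clipped * advantage)
    log_r = logp_ref - logp_new
    k3 = math.exp(log_r) - log_r - 1.0  # non-negative estimate of KL(new || ref)
    return -surrogate + kl_coeff * k3
```

A group whose rollouts all receive the same reward produces zero advantages and therefore no gradient signal, which is consistent with the card's choice to sample extra rollouts for such groups.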

## Evaluation Results

HumanEval is evaluated with 5 samples per problem at `temperature=0.2`, `top_p=0.95`. Held-out MBPP uses 100 problems disjoint from the training pool with the same sampling settings. GRPO KL is the mean per-token KL from the frozen reference policy on training rollouts.
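
For readers who want to recompute pass@k from per-problem sample counts, the standard unbiased estimator is sketched below. Whether the capstone evaluation uses exactly this estimator is an assumption; with 5 samples per problem it reduces to `c / 5` for pass@1 and to "at least one sample correct" for pass@5. The function names are illustrative.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples, so every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average the per-problem estimate over all problems in a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

For example, `benchmark_pass_at_k(counts, n=5, k=1)` takes a list with one correct-sample count per problem and returns the benchmark-level pass@1.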

| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL |
|-----:|-----------------:|-----------------:|-----------------:|-----------------:|----------------:|--------:|
| 0 | 0.796 | 0.854 | 0.634 | 0.680 | — | — |
| 1 | 0.798 | 0.860 | 0.624 | 0.690 | 0.603 | 0.0002 |
| 2 | 0.799 | 0.848 | 0.632 | 0.710 | 0.640 | 0.0005 |
| 3 | **0.805** | 0.854 | **0.636** | 0.690 | 0.639 | 0.0011 |

**Trajectory.** HumanEval pass@1 climbs monotonically across all three iterations (+0.85 pp end-to-end), and KL stays bounded below 1.1e-3, indicating that the policy is improving without drifting from the base distribution. MBPP held-out pass@5 peaks at iter 2 (0.71) and settles to 0.69 at iter 3, while pass@1 ends slightly above baseline (+0.2 pp). Train pass rate rises from 0.603 to 0.639, consistent with the eval gains. Mean tokens per GRPO sequence stays in the 177–182 range across iterations — no completion-length collapse.
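
The tokens-per-sequence range and the end-to-end HumanEval delta quoted above can be reproduced from the raw counts. The snippet below is a small worked check; the constants are copied from `metrics.json` in this commit, and the helper names are illustrative.

```python
# (grpo/n_tokens, grpo/n_seq) per training iteration, copied from metrics.json.
GRPO_COUNTS = {1: (108403, 596), 2: (99371, 560), 3: (100376, 559)}

# HumanEval pass@1 (eval/pass@1) at iteration 0 and iteration 3, from metrics.json.
HE_PASS1_START = 0.7963414634146343
HE_PASS1_END = 0.8048780487804879


def mean_tokens_per_sequence(counts):
    """Average completion length per GRPO sequence for each iteration."""
    return {it: n_tokens / n_seq for it, (n_tokens, n_seq) in counts.items()}


def delta_pp(start, end):
    """Difference between two rates, expressed in percentage points."""
    return (end - start) * 100.0


if __name__ == "__main__":
    for it, value in mean_tokens_per_sequence(GRPO_COUNTS).items():
        print(f"iter {it}: {value:.1f} tokens per sequence")
    print(f"HumanEval pass@1 delta: {delta_pp(HE_PASS1_START, HE_PASS1_END):+.2f} pp")
```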

## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Adapter repo id as listed on this model page.
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-A-3b")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Render the conversation with the chat template and append the assistant prefix.
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
        {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Same sampling settings as the evaluation runs.
out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True, top_p=0.95)
# Decode only the newly generated tokens, dropping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
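
Instruct-tuned models sometimes wrap their answer in a Markdown fence even when asked for code only, so the decoded text may need light post-processing before it is executed or tested. The helper below is a hypothetical sketch of that step; it is not part of the capstone repository, and the regular expression only handles the first fenced block.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this example readable

# Match the first fenced block, optionally tagged as Python, and capture its body.
_FENCED_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code(completion: str) -> str:
    """Return the body of the first fenced code block, or the stripped text if none is found."""
    match = _FENCED_BLOCK.search(completion)
    if match:
        return match.group(1).strip()
    return completion.strip()
```

Passing the decoded generation through `extract_code` yields plain source text whether or not the model added a fence.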

## Comparison to Other Conditions

All three adapters share the same base model, training pool, seed, and rollout budget. They differ only in the auxiliary objective and KL strength.

| Condition | aux_loss_scale | kl_coeff | HumanEval pass@1 (iter 3) | MBPP-held pass@5 (iter 3) | GRPO KL (iter 3) |
|---|---:|---:|---:|---:|---:|
| **A (exec-only)** (this card) | 0.00 | 0.01 | **0.805** | 0.690 | **0.0011** |
| [C-light (naive co-evolve)](https://huggingface.co/abatjarg/spark-code-Clight-3b) | 0.10 | 0.01 | 0.773 | 0.680 | 0.0941 |
| [C-reg (regularized co-evolve)](https://huggingface.co/abatjarg/spark-code-Creg-3b) | 0.03 | 0.02 | 0.800 | **0.720** | 0.0136 |

Condition A delivers the highest HumanEval pass@1 and the lowest reference-policy drift; C-reg is the only condition that beats it on MBPP pass@5 (+3 pp), and C-light demonstrates the policy-drift failure mode.

## Findings Summary

- **Simplest method wins on the primary cross-benchmark metric.** Exec-only GRPO produced the largest, most stable HumanEval pass@1 gain in the study; no auxiliary SFT was required.
- **Drift control is essentially free here.** With `kl_coeff=0.01` and no auxiliary loss pulling the policy off-distribution, KL stays ≤1.1e-3 and completion lengths stay flat across iterations.
- **Sample efficiency is modest but real.** 200 MBPP problems × 3 iterations on a single 3B-parameter base was enough to produce a small but monotonic HumanEval improvement and a peaked MBPP pass@5 gain.

## Related Artifacts

- Sibling adapters: [spark-code-Clight-3b](https://huggingface.co/abatjarg/spark-code-Clight-3b) · [spark-code-Creg-3b](https://huggingface.co/abatjarg/spark-code-Creg-3b)
- Capstone code repository: [GITHUB_URL]
- Full per-problem eval data (HumanEval and held-out MBPP JSONs per iteration) lives under `condition_A/eval/` in the repository
- Interactive demo Space: [SPACES_URL]

## Citation

```bibtex
@misc{batjargal2026sparkcode,
  title = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
  author = {Amarsaikhan Batjargal},
  year = {2026},
}
```

## License

The LoRA adapter weights in this repository are released under the **Apache 2.0** license. The base model, `Qwen/Qwen2.5-Coder-3B-Instruct`, is distributed under the [Tongyi Qianwen LICENSE](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct/blob/main/LICENSE); any downstream use must comply with its terms.
|
adapter_config.json
ADDED
|
@@ -0,0 +1,48 @@
{
  "alora_invocation_tokens": null,
  "alpha_pattern": {},
  "arrow_config": null,
  "auto_mapping": null,
  "base_model_name_or_path": "Qwen/Qwen2.5-Coder-3B-Instruct",
  "bias": "none",
  "corda_config": null,
  "ensure_weight_tying": false,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "lora_ga_config": null,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "peft_version": "0.19.0",
  "qalora_group_size": 16,
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "up_proj",
    "o_proj",
    "q_proj",
    "down_proj",
    "k_proj",
    "v_proj",
    "gate_proj"
  ],
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
  "trainable_token_indices": null,
  "use_bdlora": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
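
As a rough sanity check on this config, the number of trainable LoRA parameters can be estimated from `r` and the shapes of the seven target modules. The layer dimensions below (hidden size 2048, MLP size 11008, 36 layers, 2 key/value heads of size 128) are assumptions about the base model and should be verified against its `config.json`; only `r`, `lora_alpha`, and the module names come from the file above.

```python
R = 16        # "r" in adapter_config.json
ALPHA = 32    # "lora_alpha" in adapter_config.json

# Assumed base-model dimensions; verify against the base model's config.json.
HIDDEN, INTERMEDIATE, LAYERS = 2048, 11008, 36
KV_DIM = 2 * 128  # assumed: 2 key/value heads with head size 128

# module name -> (in_features, out_features)
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_layer(r: int = R) -> int:
    """Each adapted linear layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    return sum(r * (fan_in + fan_out) for fan_in, fan_out in SHAPES.values())


def total_lora_params(r: int = R, layers: int = LAYERS) -> int:
    """Total trainable LoRA weights across all transformer layers."""
    return layers * lora_params_per_layer(r)


# With use_rslora set to false, the LoRA update is scaled by lora_alpha / r.
SCALING = ALPHA / R
```

Under these assumptions the adapter has 29,933,568 parameters, which at 4 bytes each is about 119.7 MB. That is close to the 119,801,528-byte size recorded in the `adapter_model.safetensors` pointer, which suggests (but does not prove) that the weights are stored in 32-bit floats.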
|
adapter_model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:57ed2507133abe8c6588876d33f57fc051e9160274cab53992ddb0e45a68c3d6
size 119801528
chat_template.jinja
ADDED
|
@@ -0,0 +1,54 @@
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
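
To make the template's output concrete, the sketch below mirrors only its simplest branch in plain Python: no tools, no tool calls, and messages limited to the `system`, `user`, and `assistant` roles. The function name is hypothetical and the code is an illustration of the format, not a replacement for `tokenizer.apply_chat_template`.

```python
from typing import Dict, List

DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render a plain conversation the way the template's no-tools branch does."""
    parts = []
    if messages and messages[0]["role"] == "system":
        # A leading system message replaces the default system prompt.
        parts.append("<|im_start|>system\n" + messages[0]["content"] + "<|im_end|>\n")
        rest = messages[1:]
    else:
        parts.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")
        rest = messages
    for message in rest:
        parts.append("<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so generation continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

Each turn is delimited by `<|im_start|>` and `<|im_end|>`, and the latter is also the `eos_token` in `tokenizer_config.json`, so generation stops at the end of the assistant turn.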
|
metrics.json
ADDED
|
@@ -0,0 +1,100 @@
[
  {
    "iteration": 0,
    "condition": "A",
    "eval/pass@1": 0.7963414634146343,
    "eval/pass@5": 0.8536585365853658,
    "eval_mbpp/pass@1": 0.6339999999999999,
    "eval_mbpp/pass@5": 0.68
  },
  {
    "iteration": 1,
    "condition": "A",
    "time_min": 793.5286037087441,
    "train/pass_rate": 0.6029668411867365,
    "train/mean_reward": 0.630802792321117,
    "train/reward_std": 0.4778792950119141,
    "train/informative_groups": 132,
    "train/num_groups": 200,
    "train/num_rollouts": 1146,
    "train/mean_group_size": 5.73,
    "train/error_counts": {
      "none": 691,
      "wrong_answer": 335,
      "runtime": 112,
      "syntax": 6,
      "timeout": 2
    },
    "train/mean_test_pass_frac": 0.6421465968586387,
    "grpo/loss": -6.980106434566063e-05,
    "grpo/policy_loss": -7.18922148014875e-05,
    "grpo/kl": 0.0002088015155534978,
    "grpo/n_seq": 596,
    "grpo/n_tokens": 108403,
    "grpo/mean_abs_adv": 0.8494695751934044,
    "eval/pass@1": 0.7975609756097561,
    "eval/pass@5": 0.8597560975609756,
    "eval_mbpp/pass@1": 0.624,
    "eval_mbpp/pass@5": 0.69
  },
  {
    "iteration": 2,
    "condition": "A",
    "time_min": 813.6403118530909,
    "train/pass_rate": 0.6399317406143344,
    "train/mean_reward": 0.6740045506257111,
    "train/reward_std": 0.45920328522941584,
    "train/informative_groups": 124,
    "train/num_groups": 200,
    "train/num_rollouts": 1172,
    "train/mean_group_size": 5.86,
    "train/error_counts": {
      "none": 750,
      "runtime": 99,
      "wrong_answer": 315,
      "syntax": 3,
      "timeout": 5
    },
    "train/mean_test_pass_frac": 0.6842434584755405,
    "grpo/loss": -0.00011221848882273987,
    "grpo/policy_loss": -0.00011718302083088311,
    "grpo/kl": 0.0004965064841258027,
    "grpo/n_seq": 560,
    "grpo/n_tokens": 99371,
    "grpo/mean_abs_adv": 0.8516633146460711,
    "eval/pass@1": 0.7987804878048781,
    "eval/pass@5": 0.8475609756097561,
    "eval_mbpp/pass@1": 0.632,
    "eval_mbpp/pass@5": 0.71
  },
  {
    "iteration": 3,
    "condition": "A",
    "time_min": 787.4754820307096,
    "train/pass_rate": 0.6385135135135135,
    "train/mean_reward": 0.6712978603603603,
    "train/reward_std": 0.4618140711956283,
    "train/informative_groups": 123,
    "train/num_groups": 200,
    "train/num_rollouts": 1184,
    "train/mean_group_size": 5.92,
    "train/error_counts": {
      "none": 756,
      "wrong_answer": 317,
      "runtime": 107,
      "timeout": 2,
      "syntax": 2
    },
    "train/mean_test_pass_frac": 0.681179617117117,
    "grpo/loss": -2.1914203446325122e-05,
    "grpo/policy_loss": -3.2540650668425725e-05,
    "grpo/kl": 0.001062764957962002,
    "grpo/n_seq": 559,
    "grpo/n_tokens": 100376,
    "grpo/mean_abs_adv": 0.8552671855178949,
    "eval/pass@1": 0.8048780487804879,
    "eval/pass@5": 0.8536585365853658,
    "eval_mbpp/pass@1": 0.636,
    "eval_mbpp/pass@5": 0.69
  }
]
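
Several fields in `metrics.json` are derived from others, which makes the file easy to cross-check. The sketch below recomputes the mean group size, the share of informative groups, and the error-free rollout share for each training iteration. The `summarize` helper is illustrative, and `SAMPLE` is an excerpt of the iteration 0 and iteration 1 records above rather than the full file.

```python
def summarize(records):
    """Recompute derived training statistics for each record that has training fields."""
    rows = []
    for rec in records:
        if "train/num_rollouts" not in rec:
            continue  # iteration 0 holds baseline evaluation numbers only
        groups = rec["train/num_groups"]
        errors = rec["train/error_counts"]
        rows.append({
            "iteration": rec["iteration"],
            "mean_group_size": rec["train/num_rollouts"] / groups,
            "informative_frac": rec["train/informative_groups"] / groups,
            "error_free_frac": errors.get("none", 0) / sum(errors.values()),
        })
    return rows


# Excerpt of metrics.json (iterations 0 and 1), with values copied from the file.
SAMPLE = [
    {"iteration": 0, "condition": "A"},
    {
        "iteration": 1,
        "train/pass_rate": 0.6029668411867365,
        "train/informative_groups": 132,
        "train/num_groups": 200,
        "train/num_rollouts": 1146,
        "train/mean_group_size": 5.73,
        "train/error_counts": {
            "none": 691, "wrong_answer": 335, "runtime": 112, "syntax": 6, "timeout": 2,
        },
    },
]

if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(row)
```

For iteration 1 the error counts sum to the 1146 rollouts, and the error-free share matches `train/pass_rate`. To run the same check on the full file, load it with `json.load` and pass the resulting list to `summarize`.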
|
tokenizer.json
ADDED
|
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
size 11421892
tokenizer_config.json
ADDED
|
@@ -0,0 +1,29 @@
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "is_local": false,
  "model_max_length": 32768,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}