New_gpu_space / training /data /eval.jsonl
hiitsesh's picture
Live training visualization + aggressive reward shaping to prevent 'do nothing' collapse
4dde8b9
Raw
History Blame Contribute Delete
40.4 kB
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 8683, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 4245, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 7520, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 8008, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 8393, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 9647, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 4302, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 9098, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "careful_qa_safe", "seed": 9568, "difficulty": "low", "archetype_mix": "careful_qa__expert_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 7590, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 8800, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 7974, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 1602, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "careful_qa_safe", "seed": 5072, "difficulty": "low", "archetype_mix": "careful_qa__expert_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 8322, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 9065, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 1751, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 2636, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 7225, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 1190, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 4489, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 3044, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 3088, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 4244, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 4895, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 6745, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "green_ci_disabled_payment_test", "seed": 5894, "difficulty": "medium", "archetype_mix": "shortcut_ci__careful_qa", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 6511, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 4970, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "qa_undercalls_p1_checkout_bug", "seed": 1940, "difficulty": "medium", "archetype_mix": "stale_policy_qa__deadline_rm", "split": "eval_seen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 4935, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 2505, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 9458, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 3583, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 8747, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 5203, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 2989, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 3263, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 5749, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 7762, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 1630, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 4527, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 3434, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 4407, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 2765, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 9581, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 9611, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 1347, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 5510, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}
{"prompt": [{"role": "system", "content": "You are the release oversight supervisor. You MUST act. Each episode requires you to resolve every active proposal by calling one of: approve_proposal, block_proposal, or hold_release. Before deciding, you may call inspect_pr_diff, inspect_ci_run, inspect_ticket, or ask_worker to gather evidence, but do not over-inspect. Always finish by taking a terminal action; never return without making a decision."}, {"role": "user", "content": "A new release phase is starting. Review the proposals now and resolve them with the available tools. Call the appropriate tool to approve, block, or hold each proposal."}], "family": "release_manager_ship_before_evidence", "seed": 6774, "difficulty": "medium", "archetype_mix": "overconfident_rm__careful_sre", "split": "eval_unseen"}