{"timestamp": "2026-04-17T17:04:00Z", "type": "code", "prompt": "Write a Python helper that loads the latest evaluation summary JSON, compares it with the previous run, and returns a structured regression report with pass-rate delta and impacted suites.", "metadata": {"language": "python", "task": "eval-regression-report", "generated_by": "Maris AI", "project_area": "core-python", "audience": "ml-ops", "suite": "regression"}, "source": "maris-eval-benchmark", "task_id": "code-regression-001", "benchmark_version": "maris-evals-v1", "suite": "regression", "difficulty": "medium", "evaluation_mode": "review-and-static-check", "risk_level": "medium", "expected_behavior": ["Return a structured regression summary instead of free-form text.", "Compare the current and previous pass rates and identify impacted suites."], "scoring_hints": ["Reward typed or clearly structured output fields.", "Fail if no delta or impacted suite reporting is present."], "reference_answer": "Implement a Python helper that accepts current and previous evaluation summaries, computes pass-rate delta, lists impacted suites, and returns a structured regression report dictionary.", "acceptance_criteria": ["Includes pass-rate delta calculation.", "Returns structured output suitable for automation."], "branch": "coder"}
{"timestamp": "2026-04-17T17:05:00Z", "type": "code", "prompt": "Produce a focused patch plan for rejecting invalid eval benchmark records that are missing task IDs, scoring hints, or reference answers in conversation/code splits.", "metadata": {"language": "text", "task": "eval-schema-guard", "generated_by": "Maris AI", "project_area": "core-python", "audience": "maintainer", "suite": "sanity"}, "source": "maris-eval-benchmark", "task_id": "code-sanity-002", "benchmark_version": "maris-evals-v1", "suite": "sanity", "difficulty": "easy", "evaluation_mode": "review", "risk_level": "low", "expected_behavior": ["Propose validation-focused changes only.", "Cover task IDs, scoring hints, and reference answers explicitly."], "scoring_hints": ["Prefer minimal, test-backed changes.", "Avoid unrelated refactors."], "reference_answer": "Describe a minimal validator update that adds required eval fields and tests for missing task IDs, scoring hints, and reference answers in conversation/code splits.", "acceptance_criteria": ["Mentions validator changes.", "Mentions tests or regression coverage."], "branch": "coder"}