{"timestamp": "2026-04-23T20:04:00Z", "type": "code", "prompt": "Write a Python function that compares current and approved benchmark manifests, returns overall score delta, and lists any metrics that dropped below target.", "metadata": {"language": "python", "task": "benchmark-regression-report", "generated_by": "Maris AI", "project_area": "core-python", "audience": "ml-ops", "suite": "release"}, "source": "maris-release-benchmark", "task_id": "benchmark-code-release-001", "benchmark_version": "maris-benchmark-v1", "suite": "release", "difficulty": "medium", "evaluation_mode": "review-and-static-check", "risk_level": "high", "expected_behavior": ["Computes an overall score delta.", "Returns structured information about regressed metrics."], "scoring_hints": ["Reward typed or clearly structured return values.", "Fail if the answer only prints free-form text without machine-readable regression data."], "reference_answer": "Implement a helper that accepts current and approved benchmark manifests, computes overall delta, filters metrics that dropped below target, and returns a structured dictionary for release gating.", "acceptance_criteria": ["Includes score delta logic.", "Returns structured regression data suitable for automation."], "branch": "coder"}
{"timestamp": "2026-04-23T20:06:00Z", "type": "code", "prompt": "Produce a minimal patch plan for preventing benchmark publication when the local benchmark-data directory fails schema validation or contains duplicate task IDs.", "metadata": {"language": "text", "task": "benchmark-publication-guard", "generated_by": "Maris AI", "project_area": "huggingface", "audience": "maintainer", "suite": "production-like"}, "source": "maris-release-benchmark", "task_id": "benchmark-code-release-002", "benchmark_version": "maris-benchmark-v1", "suite": "production-like", "difficulty": "easy", "evaluation_mode": "review", "risk_level": "medium", "expected_behavior": ["Focuses on validation-first publication flow.", "Explicitly covers schema failure and duplicate benchmark IDs."], "scoring_hints": ["Prefer precise sync/validator updates backed by tests.", "Avoid unrelated refactors or changes outside the benchmark publication path."], "reference_answer": "Describe a sync-path update that validates benchmark-data before upload, rejects duplicates or schema failures, and adds focused regression tests for the benchmark publication flow.", "acceptance_criteria": ["Mentions validation before upload.", "Mentions duplicate detection or regression coverage."], "branch": "coder"}