Nomearod Claude Opus 4.6 (1M context) committed on Apr 14

Commit 125dac0

1 Parent(s): 2439025

feat(eval): K8s refusal_threshold 0.02 → 0.015 empirical calibration

Diagnostic instrumentation of k8s_pilot_005 captured the retrieval
gate firing at max_score = 0.01639344262295082 — exactly 1/(60+1),
the algebraic floor for a single rank-1 BM25 hit under RRF with
rrf_k = 60 and zero dense contribution. The 0.015 value is the
nearest round value below that measured floor (1/61 ≈ 0.01639), so
a lone rank-1 BM25 hit no longer trips the gate; it is an
empirically grounded safety value, not a tuning guess. FastAPI
threshold unchanged; K8s-only change.

results/k8s_preedit.json captures the validating 6-pilot run at
0.015 — all questions receive retrieval, no gate-fire short-
circuits, mean tool_calls_made 1.167. pilot_005 still refuses as a
separate downstream agent-behavior issue (counterfactual-query
finding), addressed in a later commit; that is not a threshold
problem.

Committed separately from the prompt revision because the two
changes address independent failure modes. If the prompt revision
fails its regression gate and is reverted, this calibration stands
on its own empirical merit. feedback_fix_before_sweep applies
recursively: fix measurement-affecting bugs at every layer before
combining fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
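
The 1/(60+1) floor quoted in the commit message can be checked with a few lines of Python. This is a minimal sketch, not the project's retrieval code: `rrf_score` and `gate_fires` are hypothetical helpers, standard Reciprocal Rank Fusion with 1-based ranks is assumed, and the gate is assumed to fire when the best fused score falls strictly below `refusal_threshold`.

```python
def rrf_score(ranks, rrf_k=60):
    """Fused score for one document: sum of 1 / (rrf_k + rank) over the
    retrievers that returned it. Ranks are 1-based."""
    return sum(1.0 / (rrf_k + rank) for rank in ranks)


def gate_fires(max_score, refusal_threshold):
    """Assumed gate semantics: refuse when the best fused score is below the threshold."""
    return max_score < refusal_threshold


# Single rank-1 BM25 hit, no contribution from the dense retriever.
floor = rrf_score([1])

print(floor)                     # 1/61, the value reported for k8s_pilot_005
print(gate_fires(floor, 0.02))   # old threshold
print(gate_fires(floor, 0.015))  # new threshold
```

Under these assumptions the old threshold refuses a lone rank-1 hit while the new one lets it through, which matches the behavior the commit message describes.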

Files changed (3)

DECISIONS.md +46 -0
configs/default.yaml +3 -3
results/k8s_preedit.json +187 -0

DECISIONS.md CHANGED

@@ -672,3 +672,49 @@ deferrals were deliberate, not forgetting. Not scheduled until
 post-launch; marker only. Post-launch scope: modify `ingest.py` to
 `rglob` + relative-path source IDs, re-ingest FastAPI, rewrite both
 golden datasets' `expected_sources` to path-style. Estimated 3h.

+## K8s refusal_threshold empirical calibration — 0.02 → 0.015
+**Change.** `configs/default.yaml`, `corpora.k8s.refusal_threshold`:
+`0.02` → `0.015`. Single-line config change, pilot-corpus only.
+FastAPI threshold unchanged.
+**Empirical evidence.** Diagnostic instrumentation of `k8s_pilot_005`
+(*"How do I configure a Kubernetes NetworkPolicy to enforce mutual
+TLS (mTLS) between Pods in the same namespace?"*) captured the
+retrieval gate firing at `max_score = 0.01639344262295082` — exactly
+`1 / (60 + 1)`, the algebraic floor for a single rank-1 BM25 hit
+under RRF with `rrf_k = 60`, dense contribution zero. At
+`refusal_threshold = 0.02`, pilot_005 tripped the gate and short-
+circuited before retrieval chunks reached the agent. At
+`refusal_threshold = 0.015` (one tick below the measured floor), the
+gate releases and retrieval proceeds. The 0.015 value is not a
+tuning guess — it is the nearest round-number floor below the
+observed gate-fire value for the single worst pilot in the set.
+**Validation.** `results/k8s_preedit.json` captures the full 6-pilot
+run at 0.015. Aggregate: P@5 0.80, R@5 1.00, KHR 0.78, mean
+`tool_calls_made` 1.167. All six questions receive retrieval; no
+gate-fire short-circuits. pilot_005 still refuses as a separate
+downstream issue (see next entry when the counterfactual-query fix
+lands); that is not a threshold problem.
+**Scope of this commit.** K8s only. FastAPI `refusal_threshold`
+(0.02) is not affected and FastAPI baseline is not re-measured.
+Launch-intent `0.30` placeholder for K8s remains as a comment
+marker; the full threshold sweep against the 25-question golden set
+replaces 0.015 with a properly-tuned value in a later commit. 0.015
+is the pilot-floor safety value, not the production-target value.
+**Why this is a separate commit from the prompt revision.** The
+threshold calibration is empirically grounded on its own — it
+removes the 0.01639 gate-fire blocker, which is the precondition for
+any downstream evaluation of pilot_005's actual agent behavior. The
+prompt revision addresses a *different* failure mode surfaced once
+the gate releases (agent search strategy is monotone positive-
+framing). Two independent changes must not entangle in one commit;
+if the prompt revision fails its regression gate and is reverted,
+the threshold calibration should stand on its own empirical merit.
+Feedback memory `feedback_fix_before_sweep.md` applies recursively:
+fix measurement-affecting bugs at every layer before combining
+fixes into single experiments.

configs/default.yaml CHANGED

@@ -103,9 +103,9 @@ corpora:
     label: "Kubernetes"
     store_path: .cache/store_k8s
     data_path: data/k8s_docs
-    refusal_threshold: 0.02  # PILOT: matches fastapi working value for 6-pilot smoke test.
-                             # 0.30 placeholder remains the launch-intent; full tuning sweep
-                             # lands with the 25-question golden set (see DECISIONS.md).
+    refusal_threshold: 0.015  # Empirically calibrated one tick below the 0.01639 gate-fire
+                              # observed on pilot_005 (see DECISIONS.md). 0.30 launch-intent
+                              # still holds; full sweep lands with the 25-question golden set.
     top_k: 5
     max_iterations: 3
     golden_dataset: agent_bench/evaluation/datasets/k8s_golden_pilot.json

results/k8s_preedit.json ADDED

@@ -0,0 +1,187 @@
+[
+  {
+    "question_id": "k8s_pilot_001",
+    "question": "In Kubernetes, does each Pod receive its own IP address, and how do containers inside the same Pod talk to each other?",
+    "category": "retrieval",
+    "difficulty": "easy",
+    "retrieval_precision": 0.4,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 1.0,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 2,
+    "latency_ms": 14846.163394977339,
+    "tokens_used": {
+      "input_tokens": 1607,
+      "output_tokens": 150,
+      "estimated_cost_usd": 0.00033105
+    },
+    "answer": "In Kubernetes, each Pod is assigned a unique IP address for each address family. Every container within a Pod shares the same network namespace, which includes the IP address and network ports. This allows containers in the same Pod to communicate with each other using `localhost` [source: k8s_pods.md]. \n\nAdditionally, containers within a Pod can also communicate using standard inter-process communications like SystemV semaphores or POSIX shared memory [source: k8s_pods.md].",
+    "retrieved_sources": [
+      "k8s_network_policies.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_pods.md",
+      "k8s_pods.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_002",
+    "question": "When you update a Deployment's pod template, what mechanism does Kubernetes use to transition Pods from the old version to the new one, and what role does the ReplicaSet play?",
+    "category": "retrieval",
+    "difficulty": "hard",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.8333333333333334,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 10439.759872970171,
+    "tokens_used": {
+      "input_tokens": 1096,
+      "output_tokens": 180,
+      "estimated_cost_usd": 0.00027239999999999995
+    },
+    "answer": "When you update a Deployment's pod template, Kubernetes uses a rolling update mechanism to transition Pods from the old version to the new one. This process involves creating a new ReplicaSet for the updated Deployment. The Deployment controller gradually scales up the new ReplicaSet while simultaneously scaling down the old ReplicaSet. This ensures that Pods are replaced at a controlled rate, allowing for a smooth transition without downtime [source: k8s_deployment.md].\n\nThe ReplicaSet plays a crucial role in this process by managing the Pods that match the Deployment's specifications. When the Deployment is updated, the existing ReplicaSet that controls the old Pods is scaled down, and the new ReplicaSet is scaled up to the desired number of replicas specified in the Deployment [source: k8s_deployment.md].",
+    "retrieved_sources": [
+      "k8s_deployment.md",
+      "k8s_replicaset.md",
+      "k8s_deployment.md",
+      "k8s_deployment.md",
+      "k8s_deployment.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_003",
+    "question": "What is the key difference between a ConfigMap and a Secret when deciding where to store sensitive application data like database passwords?",
+    "category": "retrieval",
+    "difficulty": "medium",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.6666666666666666,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 7017.4159810412675,
+    "tokens_used": {
+      "input_tokens": 981,
+      "output_tokens": 109,
+      "estimated_cost_usd": 0.00021255
+    },
+    "answer": "The key difference between a ConfigMap and a Secret when deciding where to store sensitive application data, such as database passwords, is that a ConfigMap does not provide secrecy or encryption, while a Secret is specifically intended to hold confidential data and includes additional protections for sensitive information. Therefore, for storing sensitive data, it is recommended to use a Secret rather than a ConfigMap [source: k8s_configmap.md][source: k8s_secret.md].",
+    "retrieved_sources": [
+      "k8s_configmap.md",
+      "k8s_secret.md",
+      "k8s_secret.md",
+      "k8s_secret.md",
+      "k8s_configmap.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_004",
+    "question": "If I set a custom value for one hard eviction threshold on the kubelet (e.g., memory.available) but leave the other thresholds unset, what happens to the defaults for the thresholds I didn't override?",
+    "category": "retrieval",
+    "difficulty": "hard",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.5,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 6551.91807099618,
+    "tokens_used": {
+      "input_tokens": 1065,
+      "output_tokens": 138,
+      "estimated_cost_usd": 0.00024255000000000001
+    },
+    "answer": "If you set a custom value for one hard eviction threshold on the kubelet (e.g., `memory.available`) but leave the other thresholds unset, the default values for the thresholds you didn't override will not be inherited. Instead, the values of the other parameters will be set to zero. To provide custom values, you should specify all the thresholds respectively. Alternatively, you can set the kubelet config `MergeDefaultEvictionSettings` to true in the kubelet configuration file to retain the default values for the other thresholds [source: k8s_node_pressure_eviction.md].",
+    "retrieved_sources": [
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_node_pressure_eviction.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_005",
+    "question": "How do I configure a Kubernetes NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace?",
+    "category": "retrieval",
+    "difficulty": "medium",
+    "retrieval_precision": 1.0,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 0.6666666666666666,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 5478.1018840149045,
+    "tokens_used": {
+      "input_tokens": 1120,
+      "output_tokens": 66,
+      "estimated_cost_usd": 0.0002076
+    },
+    "answer": "The Kubernetes documentation does not provide specific instructions on configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace. Therefore, I cannot answer your question based on the available documentation.",
+    "retrieved_sources": [
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md",
+      "k8s_network_policies.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  },
+  {
+    "question_id": "k8s_pilot_006",
+    "question": "As of the Kubernetes v1.31 snapshot, what is the feature state (alpha, beta, or stable) of the built-in Pod Security admission controller, and in which version did it reach that state?",
+    "category": "retrieval",
+    "difficulty": "easy",
+    "retrieval_precision": 0.4,
+    "retrieval_recall": 1.0,
+    "keyword_hit_rate": 1.0,
+    "has_source_citation": true,
+    "grounded_refusal": true,
+    "citation_accuracy": 1.0,
+    "calculator_used_correctly": true,
+    "tool_calls_made": 1,
+    "latency_ms": 5607.383974012919,
+    "tokens_used": {
+      "input_tokens": 1117,
+      "output_tokens": 55,
+      "estimated_cost_usd": 0.00020054999999999997
+    },
+    "answer": "The built-in Pod Security admission controller reached a stable feature state in Kubernetes v1.25 [source: k8s_pod_security_admission.md].",
+    "retrieved_sources": [
+      "k8s_pod_security_admission.md",
+      "k8s_replicaset.md",
+      "k8s_deployment.md",
+      "k8s_node_pressure_eviction.md",
+      "k8s_pod_security_admission.md"
+    ],
+    "faithfulness": null,
+    "correctness": null
+  }
+]
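
The aggregate figures quoted in DECISIONS.md (P@5 0.80, R@5 1.00, KHR 0.78, mean `tool_calls_made` 1.167) can be re-derived from the per-question records above. The sketch below is illustrative only: `aggregate` is a hypothetical helper, and it assumes P@5, R@5, and KHR are plain means of `retrieval_precision`, `retrieval_recall`, and `keyword_hit_rate`. The four fields per pilot are copied from the JSON so the block runs on its own.

```python
from statistics import mean

# Values copied from results/k8s_preedit.json, pilots 001 through 006.
rows = [
    {"retrieval_precision": 0.4, "retrieval_recall": 1.0,
     "keyword_hit_rate": 1.0, "tool_calls_made": 2},
    {"retrieval_precision": 1.0, "retrieval_recall": 1.0,
     "keyword_hit_rate": 0.8333333333333334, "tool_calls_made": 1},
    {"retrieval_precision": 1.0, "retrieval_recall": 1.0,
     "keyword_hit_rate": 0.6666666666666666, "tool_calls_made": 1},
    {"retrieval_precision": 1.0, "retrieval_recall": 1.0,
     "keyword_hit_rate": 0.5, "tool_calls_made": 1},
    {"retrieval_precision": 1.0, "retrieval_recall": 1.0,
     "keyword_hit_rate": 0.6666666666666666, "tool_calls_made": 1},
    {"retrieval_precision": 0.4, "retrieval_recall": 1.0,
     "keyword_hit_rate": 1.0, "tool_calls_made": 1},
]


def aggregate(results):
    """Unweighted means over all questions, rounded as in DECISIONS.md."""
    return {
        "P@5": round(mean(r["retrieval_precision"] for r in results), 2),
        "R@5": round(mean(r["retrieval_recall"] for r in results), 2),
        "KHR": round(mean(r["keyword_hit_rate"] for r in results), 2),
        "mean_tool_calls": round(mean(r["tool_calls_made"] for r in results), 3),
    }


summary = aggregate(rows)
print(summary)
```

To run the same check against the committed file, the inline `rows` list could be replaced with `json.load(open("results/k8s_preedit.json"))`, since `aggregate` only reads the four fields shown.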