Nomearod Claude Opus 4.6 (1M context) committed on
Commit 125dac0 · 1 Parent(s): 2439025

feat(eval): K8s refusal_threshold 0.02 → 0.015 empirical calibration

Diagnostic instrumentation of k8s_pilot_005 captured the retrieval
gate firing at max_score = 0.01639344262295082 — exactly 1/(60+1),
the algebraic floor for a single rank-1 BM25 hit under RRF with
rrf_k = 60 and zero dense contribution. The 0.015 value is one tick
below the measured gate-fire floor; it is an empirically grounded
safety value, not a tuning guess. FastAPI threshold unchanged;
K8s-only change.
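
The floor quoted above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the standard reciprocal rank fusion form in which each retriever that returns a document contributes 1 / (rrf_k + rank); the function name `rrf_score` and the use of `None` for "retriever did not return the document" are illustrative and not taken from the repository.

```python
from typing import Optional, Sequence


def rrf_score(ranks: Sequence[Optional[int]], rrf_k: int = 60) -> float:
    """Sum 1 / (rrf_k + rank) over the retrievers that ranked the document.

    Ranks are 1-based; None means the retriever contributed nothing.
    """
    return sum(1.0 / (rrf_k + rank) for rank in ranks if rank is not None)


# Single rank-1 BM25 hit, zero dense contribution: the floor from the message.
bm25_only = rrf_score([1, None])

# Any dense hit for the same document can only raise the fused score.
with_dense = rrf_score([1, 10])

print(bm25_only)
print(with_dense > bm25_only)
```

Under this assumption, a document that appears at rank 1 in exactly one retriever cannot score lower than 1 / 61, which is why that value acts as a floor for any query with at least one BM25 hit.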

results/k8s_preedit.json captures the validating 6-pilot run at
0.015 — all questions receive retrieval, no gate-fire short-
circuits, mean tool_calls_made 1.167. pilot_005 still refuses as a
separate downstream agent-behavior issue (counterfactual-query
finding), addressed in a later commit; that is not a threshold
problem.

Committed separately from the prompt revision because the two
changes address independent failure modes. If the prompt revision
fails its regression gate and is reverted, this calibration stands
on its own empirical merit. feedback_fix_before_sweep applies
recursively: fix measurement-affecting bugs at every layer before
combining fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (3)
  1. DECISIONS.md +46 -0
  2. configs/default.yaml +3 -3
  3. results/k8s_preedit.json +187 -0
DECISIONS.md CHANGED
@@ -672,3 +672,49 @@ deferrals were deliberate, not forgetting. Not scheduled until
  post-launch; marker only. Post-launch scope: modify `ingest.py` to
  `rglob` + relative-path source IDs, re-ingest FastAPI, rewrite both
  golden datasets' `expected_sources` to path-style. Estimated 3h.
+
+ ## K8s refusal_threshold empirical calibration — 0.02 → 0.015
+
+ **Change.** `configs/default.yaml`, `corpora.k8s.refusal_threshold`:
+ `0.02` → `0.015`. Single-line config change, pilot-corpus only.
+ FastAPI threshold unchanged.
+
+ **Empirical evidence.** Diagnostic instrumentation of `k8s_pilot_005`
+ (*"How do I configure a Kubernetes NetworkPolicy to enforce mutual
+ TLS (mTLS) between Pods in the same namespace?"*) captured the
+ retrieval gate firing at `max_score = 0.01639344262295082` — exactly
+ `1 / (60 + 1)`, the algebraic floor for a single rank-1 BM25 hit
+ under RRF with `rrf_k = 60`, dense contribution zero. At
+ `refusal_threshold = 0.02`, pilot_005 tripped the gate and short-
+ circuited before retrieval chunks reached the agent. At
+ `refusal_threshold = 0.015` (one tick below the measured floor), the
+ gate releases and retrieval proceeds. The 0.015 value is not a
+ tuning guess — it is the nearest round-number floor below the
+ observed gate-fire value for the single worst pilot in the set.
+
+ **Validation.** `results/k8s_preedit.json` captures the full 6-pilot
+ run at 0.015. Aggregate: P@5 0.80, R@5 1.00, KHR 0.78, mean
+ `tool_calls_made` 1.167. All six questions receive retrieval; no
+ gate-fire short-circuits. pilot_005 still refuses as a separate
+ downstream issue (see next entry when the counterfactual-query fix
+ lands); that is not a threshold problem.
+
+ **Scope of this commit.** K8s only. FastAPI `refusal_threshold`
+ (0.02) is not affected and FastAPI baseline is not re-measured.
+ Launch-intent `0.30` placeholder for K8s remains as a comment
+ marker; the full threshold sweep against the 25-question golden set
+ replaces 0.015 with a properly-tuned value in a later commit. 0.015
+ is the pilot-floor safety value, not the production-target value.
+
+ **Why this is a separate commit from the prompt revision.** The
+ threshold calibration is empirically grounded on its own — it
+ removes the 0.01639 gate-fire blocker, which is the precondition for
+ any downstream evaluation of pilot_005's actual agent behavior. The
+ prompt revision addresses a *different* failure mode surfaced once
+ the gate releases (agent search strategy is monotone positive-
+ framing). Two independent changes must not entangle in one commit;
+ if the prompt revision fails its regression gate and is reverted,
+ the threshold calibration should stand on its own empirical merit.
+ Feedback memory `feedback_fix_before_sweep.md` applies recursively:
+ fix measurement-affecting bugs at every layer before combining
+ fixes into single experiments.
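
The gate behaviour described in the DECISIONS.md entry can be illustrated with a small sketch. It assumes the gate refuses when the best fused score is strictly below `refusal_threshold`; the function name `gate_fires` and the strict `<` comparison are assumptions, since the gate implementation itself is not part of this diff.

```python
from typing import Iterable


def gate_fires(scores: Iterable[float], refusal_threshold: float) -> bool:
    """Hypothetical refusal gate: True means retrieval is short-circuited."""
    max_score = max(scores, default=0.0)
    return max_score < refusal_threshold


# max_score captured for k8s_pilot_005 by the diagnostic instrumentation.
observed = 0.01639344262295082

print(gate_fires([observed], 0.02))   # old K8s threshold
print(gate_fires([observed], 0.015))  # calibrated K8s threshold
```

Under these assumptions the first call reports that the gate fires and the second reports that it releases, matching the before/after behaviour recorded for pilot_005.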
configs/default.yaml CHANGED
@@ -103,9 +103,9 @@ corpora:
  label: "Kubernetes"
  store_path: .cache/store_k8s
  data_path: data/k8s_docs
- refusal_threshold: 0.02 # PILOT: matches fastapi working value for 6-pilot smoke test.
- # 0.30 placeholder remains the launch-intent; full tuning sweep
- # lands with the 25-question golden set (see DECISIONS.md).
+ refusal_threshold: 0.015 # Empirically calibrated one tick below the 0.01639 gate-fire
+ # observed on pilot_005 (see DECISIONS.md). 0.30 launch-intent
+ # still holds; full sweep lands with the 25-question golden set.
  top_k: 5
  max_iterations: 3
  golden_dataset: agent_bench/evaluation/datasets/k8s_golden_pilot.json
results/k8s_preedit.json ADDED
@@ -0,0 +1,187 @@
+ [
+   {
+     "question_id": "k8s_pilot_001",
+     "question": "In Kubernetes, does each Pod receive its own IP address, and how do containers inside the same Pod talk to each other?",
+     "category": "retrieval",
+     "difficulty": "easy",
+     "retrieval_precision": 0.4,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 1.0,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 2,
+     "latency_ms": 14846.163394977339,
+     "tokens_used": {
+       "input_tokens": 1607,
+       "output_tokens": 150,
+       "estimated_cost_usd": 0.00033105
+     },
+     "answer": "In Kubernetes, each Pod is assigned a unique IP address for each address family. Every container within a Pod shares the same network namespace, which includes the IP address and network ports. This allows containers in the same Pod to communicate with each other using `localhost` [source: k8s_pods.md]. \n\nAdditionally, containers within a Pod can also communicate using standard inter-process communications like SystemV semaphores or POSIX shared memory [source: k8s_pods.md].",
+     "retrieved_sources": [
+       "k8s_network_policies.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_network_policies.md",
+       "k8s_network_policies.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_pods.md",
+       "k8s_pods.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_002",
+     "question": "When you update a Deployment's pod template, what mechanism does Kubernetes use to transition Pods from the old version to the new one, and what role does the ReplicaSet play?",
+     "category": "retrieval",
+     "difficulty": "hard",
+     "retrieval_precision": 1.0,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 0.8333333333333334,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 10439.759872970171,
+     "tokens_used": {
+       "input_tokens": 1096,
+       "output_tokens": 180,
+       "estimated_cost_usd": 0.00027239999999999995
+     },
+     "answer": "When you update a Deployment's pod template, Kubernetes uses a rolling update mechanism to transition Pods from the old version to the new one. This process involves creating a new ReplicaSet for the updated Deployment. The Deployment controller gradually scales up the new ReplicaSet while simultaneously scaling down the old ReplicaSet. This ensures that Pods are replaced at a controlled rate, allowing for a smooth transition without downtime [source: k8s_deployment.md].\n\nThe ReplicaSet plays a crucial role in this process by managing the Pods that match the Deployment's specifications. When the Deployment is updated, the existing ReplicaSet that controls the old Pods is scaled down, and the new ReplicaSet is scaled up to the desired number of replicas specified in the Deployment [source: k8s_deployment.md].",
+     "retrieved_sources": [
+       "k8s_deployment.md",
+       "k8s_replicaset.md",
+       "k8s_deployment.md",
+       "k8s_deployment.md",
+       "k8s_deployment.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_003",
+     "question": "What is the key difference between a ConfigMap and a Secret when deciding where to store sensitive application data like database passwords?",
+     "category": "retrieval",
+     "difficulty": "medium",
+     "retrieval_precision": 1.0,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 0.6666666666666666,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 7017.4159810412675,
+     "tokens_used": {
+       "input_tokens": 981,
+       "output_tokens": 109,
+       "estimated_cost_usd": 0.00021255
+     },
+     "answer": "The key difference between a ConfigMap and a Secret when deciding where to store sensitive application data, such as database passwords, is that a ConfigMap does not provide secrecy or encryption, while a Secret is specifically intended to hold confidential data and includes additional protections for sensitive information. Therefore, for storing sensitive data, it is recommended to use a Secret rather than a ConfigMap [source: k8s_configmap.md][source: k8s_secret.md].",
+     "retrieved_sources": [
+       "k8s_configmap.md",
+       "k8s_secret.md",
+       "k8s_secret.md",
+       "k8s_secret.md",
+       "k8s_configmap.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_004",
+     "question": "If I set a custom value for one hard eviction threshold on the kubelet (e.g., memory.available) but leave the other thresholds unset, what happens to the defaults for the thresholds I didn't override?",
+     "category": "retrieval",
+     "difficulty": "hard",
+     "retrieval_precision": 1.0,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 0.5,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 6551.91807099618,
+     "tokens_used": {
+       "input_tokens": 1065,
+       "output_tokens": 138,
+       "estimated_cost_usd": 0.00024255000000000001
+     },
+     "answer": "If you set a custom value for one hard eviction threshold on the kubelet (e.g., `memory.available`) but leave the other thresholds unset, the default values for the thresholds you didn't override will not be inherited. Instead, the values of the other parameters will be set to zero. To provide custom values, you should specify all the thresholds respectively. Alternatively, you can set the kubelet config `MergeDefaultEvictionSettings` to true in the kubelet configuration file to retain the default values for the other thresholds [source: k8s_node_pressure_eviction.md].",
+     "retrieved_sources": [
+       "k8s_node_pressure_eviction.md",
+       "k8s_node_pressure_eviction.md",
+       "k8s_node_pressure_eviction.md",
+       "k8s_node_pressure_eviction.md",
+       "k8s_node_pressure_eviction.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_005",
+     "question": "How do I configure a Kubernetes NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace?",
+     "category": "retrieval",
+     "difficulty": "medium",
+     "retrieval_precision": 1.0,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 0.6666666666666666,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 5478.1018840149045,
+     "tokens_used": {
+       "input_tokens": 1120,
+       "output_tokens": 66,
+       "estimated_cost_usd": 0.0002076
+     },
+     "answer": "The Kubernetes documentation does not provide specific instructions on configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace. Therefore, I cannot answer your question based on the available documentation.",
+     "retrieved_sources": [
+       "k8s_network_policies.md",
+       "k8s_network_policies.md",
+       "k8s_network_policies.md",
+       "k8s_network_policies.md",
+       "k8s_network_policies.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   },
+   {
+     "question_id": "k8s_pilot_006",
+     "question": "As of the Kubernetes v1.31 snapshot, what is the feature state (alpha, beta, or stable) of the built-in Pod Security admission controller, and in which version did it reach that state?",
+     "category": "retrieval",
+     "difficulty": "easy",
+     "retrieval_precision": 0.4,
+     "retrieval_recall": 1.0,
+     "keyword_hit_rate": 1.0,
+     "has_source_citation": true,
+     "grounded_refusal": true,
+     "citation_accuracy": 1.0,
+     "calculator_used_correctly": true,
+     "tool_calls_made": 1,
+     "latency_ms": 5607.383974012919,
+     "tokens_used": {
+       "input_tokens": 1117,
+       "output_tokens": 55,
+       "estimated_cost_usd": 0.00020054999999999997
+     },
+     "answer": "The built-in Pod Security admission controller reached a stable feature state in Kubernetes v1.25 [source: k8s_pod_security_admission.md].",
+     "retrieved_sources": [
+       "k8s_pod_security_admission.md",
+       "k8s_replicaset.md",
+       "k8s_deployment.md",
+       "k8s_node_pressure_eviction.md",
+       "k8s_pod_security_admission.md"
+     ],
+     "faithfulness": null,
+     "correctness": null
+   }
+ ]
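
The aggregates quoted in the DECISIONS.md entry (P@5 0.80, R@5 1.00, KHR 0.78, mean `tool_calls_made` 1.167) can be re-derived from the six records in this file. The following is a minimal sketch, assuming the aggregates are plain unweighted means; the inline `records` list copies only the four relevant fields from the JSON, and the helper name `aggregate` is illustrative rather than part of the repository.

```python
# Per-question values copied from results/k8s_preedit.json (pilots 001-006).
records = [
    {"retrieval_precision": 0.4, "retrieval_recall": 1.0,
     "keyword_hit_rate": 1.0, "tool_calls_made": 2},
    {"retrieval_precision": 1.0, "retrieval_recall": 1.0,
     "keyword_hit_rate": 0.8333333333333334, "tool_calls_made": 1},
    {"retrieval_precision": 1.0, "retrieval_recall": 1.0,
     "keyword_hit_rate": 0.6666666666666666, "tool_calls_made": 1},
    {"retrieval_precision": 1.0, "retrieval_recall": 1.0,
     "keyword_hit_rate": 0.5, "tool_calls_made": 1},
    {"retrieval_precision": 1.0, "retrieval_recall": 1.0,
     "keyword_hit_rate": 0.6666666666666666, "tool_calls_made": 1},
    {"retrieval_precision": 0.4, "retrieval_recall": 1.0,
     "keyword_hit_rate": 1.0, "tool_calls_made": 1},
]

FIELDS = ("retrieval_precision", "retrieval_recall",
          "keyword_hit_rate", "tool_calls_made")


def aggregate(rows):
    """Unweighted mean of each metric across all rows."""
    return {field: sum(row[field] for row in rows) / len(rows)
            for field in FIELDS}


summary = aggregate(records)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

To run the same check against the real file, the inline list could be replaced with `json.load(open("results/k8s_preedit.json"))`, assuming the file is read from the repository root.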