# K8s Golden Dataset: Question Plan

**Status:** Structural guide for Week 1 step 5 authoring (v1.1 plan).
This document defines the 25-question target distribution, per-type
source-page mapping, and authoring constraints. It does NOT contain
the 25 specific question texts; those are authored during step 5 in
a fresh session, per cross-cutting #8 pilot-first discipline.

**Upstream contracts:**
- Taxonomy: CRAG 8-type (Yang et al., NeurIPS 2024); see DECISIONS.md
  "K8s golden dataset uses CRAG's 8-type taxonomy as the schema".
- Source pages: see `SOURCES.md` (28 pages, category-locked; 8 already
  pulled, 20 to pull at step 4).
- Schema: see `agent_bench/evaluation/harness.py` `GoldenQuestion`
  plus the v1.1 plan's methodology #3 source-attribution fields.
- Flavor A/B for `false_premise`: see DECISIONS.md "False-premise
  questions come in two flavors".

---

## Target distribution (25 questions total)

| CRAG type | Count | Schema field | Notes |
|---|---|---|---|
| `simple` | 5–6 | `question_type: "simple"` | Baseline retrieval: direct lookup in 1 page, 1–2 sentence answer. |
| `simple_w_condition` | 3–4 | `question_type: "simple_w_condition"` | Answer depends on a condition stated in the question (enforcement level, volume type, Pod phase). |
| `comparison` | 3–4 | `question_type: "comparison"` | Answer compares two concepts across 2 pages; reranker stress. |
| `multi_hop` | 5–6 | `question_type: "multi_hop"` | Answer synthesizes 2–4 pages; reranker-stressing by construction. |
| `false_premise` | 3–4 | `question_type: "false_premise"` | Grounded refusal stress. Flavor A (pure refusal) + flavor B (documented negative). |
| `set` / `aggregation` / `post_processing_heavy` | 0–3 | respective values | Optional. Include only if natural from corpus content. |
| **Total** | **25** | | |

**Orthogonal flag:** `time_sensitive: bool` on 2–3 questions. Does
NOT replace `question_type`; it is an independent property for
version-bounded content (feature state, API version migration,
deprecations).
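
The distribution table can be checked mechanically once questions exist. The following is a minimal sketch, assuming each question is a plain `dict` with a `question_type` key; `check_distribution` and its constants are hypothetical names for illustration, not part of `harness.py`.

```python
from collections import Counter

# Target ranges copied from the table above: (min, max) per CRAG type.
TARGET_RANGES = {
    "simple": (5, 6),
    "simple_w_condition": (3, 4),
    "comparison": (3, 4),
    "multi_hop": (5, 6),
    "false_premise": (3, 4),
}
# The three optional types share one combined 0-3 budget.
OPTIONAL_TYPES = ("set", "aggregation", "post_processing_heavy")
TOTAL = 25


def check_distribution(questions):
    """Return a list of human-readable violations (empty list means OK)."""
    counts = Counter(q["question_type"] for q in questions)
    problems = []
    for qtype, (lo, hi) in TARGET_RANGES.items():
        if not lo <= counts[qtype] <= hi:
            problems.append(f"{qtype}: {counts[qtype]} not in {lo}-{hi}")
    optional = sum(counts[t] for t in OPTIONAL_TYPES)
    if optional > 3:
        problems.append(f"optional types: {optional} exceeds 3")
    unknown = set(counts) - set(TARGET_RANGES) - set(OPTIONAL_TYPES)
    if unknown:
        problems.append(f"unknown question_type values: {sorted(unknown)}")
    if len(questions) != TOTAL:
        problems.append(f"total: {len(questions)} != {TOTAL}")
    return problems
```

Returning a list of violations rather than raising on the first one lets an author see every distribution problem in a single pass.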

---

## Per-type source-page mapping

Each row identifies the K8s concept pages a question of that type
should draw from. Multi-hop and comparison questions list multiple
pages intentionally.

### simple (5–6 slots)

Pool questions where a 1–2 sentence answer lives inside a single page.

| Candidate source | CRAG slot justification |
|---|---|
| `k8s_pods.md` | Pod IP semantics, container sharing, ephemeral containers |
| `k8s_deployment.md` | What a Deployment is, declarative update mechanic |
| `k8s_configmap.md` | What a ConfigMap is, immutable field |
| `k8s_secret.md` | What a Secret is, volume mount modes |
| RBAC Authorization *(step 4 page)* | RBAC primitive definitions (Role, RoleBinding, ClusterRole) |
| StatefulSet *(step 4 page)* | StatefulSet identity guarantees |
| DaemonSet *(step 4 page)* | One-per-node scheduling contract |
| Namespaces *(step 4 page)* | Namespace scoping for resources |

**Authoring rule:** Each `simple` question must have exactly one
expected source page and 1–2 source snippets. KHR
(`keyword_hit_rate`) target ≥ 0.60 on the authored keywords.
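
To make the KHR target concrete, here is a minimal sketch of a keyword hit rate, assuming it is the fraction of expected keywords that appear in the answer as case-insensitive substrings. The real metric in `harness.py` may match differently; the sample answer and keywords below are made up for illustration.

```python
KHR_TARGET = 0.60  # threshold from the authoring rule above


def keyword_hit_rate(answer: str, keywords: list[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive).

    Assumption: plain substring matching; the actual harness may differ.
    """
    if not keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


# Illustrative (made-up) answer and keyword set.
answer = "Each Pod gets its own IP address, shared by all containers in the Pod."
keywords = ["IP address", "containers", "network namespace"]
rate = keyword_hit_rate(answer, keywords)
print(f"{rate:.2f}", rate >= KHR_TARGET)  # prints: 0.67 True
```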

### simple_w_condition (3–4 slots)

Pool questions where the answer explicitly depends on a condition
named in the question.

| Candidate source | Condition that shapes the answer |
|---|---|
| `k8s_pod_security_admission.md` | enforcement level: `enforce` / `audit` / `warn` |
| `k8s_secret.md` | mount mode: environment variable vs file in volume |
| Liveness/Readiness/Startup Probes *(step 4)* | probe type: liveness vs readiness vs startup |
| Volumes *(step 4)* | volume type: emptyDir vs configMap vs persistentVolumeClaim |
| Node-pressure Eviction (`k8s_node_pressure_eviction.md`) | resource under pressure: memory vs disk vs inodes |

**Authoring rule:** The condition must be named in the question
stem, not implied. The expected answer must change materially if the
condition flips. Example: "How is a Secret mounted as a volume
versus consumed as an environment variable?" is a valid
`simple_w_condition`; "How is a Secret mounted?" is `simple`.

### comparison (3–4 slots)

Pool questions where the answer explicitly compares two K8s concepts
that span 2 pages.

| Page pair | Concept compared |
|---|---|
| Deployment vs StatefulSet *(step 4)* | stateless vs stateful workload semantics |
| Deployment vs DaemonSet *(step 4)* | replica-count vs one-per-node scheduling |
| ConfigMap vs Secret | non-confidential vs confidential data, mount parity |
| Service vs Ingress *(step 4)* | L4 vs L7 exposure |
| Taints/Tolerations vs Node Affinity *(step 4)* | opt-out vs opt-in placement |
| Liveness vs Readiness probes *(step 4)* | restart vs traffic-routing semantics |

**Authoring rule:** The question must force retrieval from both
pages. Reranker stress is intentional: questions where BM25 would
find one side but miss the other are the target. Expected sources:
2 pages minimum.

### multi_hop (5–6 slots)

Pool questions where the answer synthesizes 2–4 pages. These are
the primary reranker stressors.

| Page set (example) | Hop path |
|---|---|
| Pod + Service + Ingress *(step 4)* | How external traffic reaches a Pod through Ingress → Service |
| Deployment + ReplicaSet + Pod | How a Deployment rollout changes the underlying ReplicaSet and Pod set |
| ConfigMap + Deployment | How a ConfigMap update propagates to Pods via env vars or mounted volume |
| HPA + Deployment + Metrics Server *(partial step 4)* | How HPA reads metrics and scales a Deployment |
| NetworkPolicy + Pod + Namespace *(partial step 4)* | How NetworkPolicy selectors resolve across namespaces |
| Job + Pod + Container lifecycle *(partial step 4)* | How a Job's completions and parallelism interact with Pod restart policy |

**Authoring rule:** Expected sources ≥ 2 pages. The question must
not be answerable from any single page alone. `source_chunk_ids`
must list at least one chunk from each expected page; partial
credit is granted in the evaluator if at least one expected chunk is
cited (see `agent_bench/evaluation/harness.py`).
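
The per-page coverage rule and the partial-credit condition can be sketched as below. This is an illustration under assumptions, not the harness's code: `chunk_to_page` is a hypothetical lookup from chunk ID to source filename, and both function names are invented here.

```python
def uncovered_pages(expected_sources, source_chunk_ids, chunk_to_page):
    """Return expected pages with no chunk listed in source_chunk_ids.

    chunk_to_page is a hypothetical {chunk_id: filename} mapping.
    An empty result means the authoring rule is satisfied.
    """
    covered = {
        chunk_to_page[cid] for cid in source_chunk_ids if cid in chunk_to_page
    }
    return [page for page in expected_sources if page not in covered]


def earns_partial_credit(cited_chunk_ids, source_chunk_ids):
    """True if the answer cites at least one expected chunk."""
    return bool(set(cited_chunk_ids) & set(source_chunk_ids))


# Placeholder chunk IDs; real IDs are content hashes.
chunk_to_page = {"c1": "k8s_configmap.md", "c2": "k8s_deployment.md"}
expected = ["k8s_configmap.md", "k8s_deployment.md"]
print(uncovered_pages(expected, ["c1"], chunk_to_page))  # ['k8s_deployment.md']
```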

### false_premise (3–4 slots)

Pool questions whose premise is wrong. Split across two flavors:

**Flavor A (pure refusal)**, at least 1 slot:
- Premise targets a capability that does not exist in the K8s corpus
  (not in any pulled page).
- Example seed: "How do I configure Claude API rate limits in a
  Kubernetes Deployment?" (wrong domain: Claude API is not a K8s
  concept)
- Schema: `category: "out_of_scope"`, `expected_sources: []`,
  `source_snippets: []`.
- Evaluator expectation: answer contains refusal phrasing AND cites
  zero sources.

**Flavor B (documented negative)**, at least 1 slot, ideally 2:
- Corpus contains an explicit negative statement (e.g.
  NetworkPolicy "Anything TLS related" limitation at chunk 63 of
  `k8s_network_policies.md`).
- Example already in pilot: `k8s_pilot_005` (NetworkPolicy mTLS).
- Schema: `category: "retrieval"`, `question_type: "false_premise"`,
  `expected_sources: [<negative-answer page>]`,
  `source_snippets: [<verbatim negative statement>]`.
- Evaluator expectation: answer reports the documented negative
  with citation, does NOT open with "the documentation does not
  provide instructions" phrasing (per pilot_005 Fix 1 + Fix 2
  revert analysis).
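
The schema side of the two flavors can be expressed as a small check. This is a sketch assuming questions are plain dicts; `check_false_premise` is a hypothetical helper, and it covers only the schema fields listed above, not the evaluator's answer-phrasing expectations.

```python
def check_false_premise(q):
    """Return violations of the flavor A / flavor B schema rules above."""
    if q.get("question_type") != "false_premise":
        return ["question_type is not 'false_premise'"]
    problems = []
    category = q.get("category")
    if category == "out_of_scope":  # flavor A: pure refusal
        for field in ("expected_sources", "source_snippets"):
            if q.get(field) != []:
                problems.append(f"flavor A requires {field} == []")
    elif category == "retrieval":  # flavor B: documented negative
        for field in ("expected_sources", "source_snippets"):
            if not q.get(field):
                problems.append(f"flavor B requires non-empty {field}")
    else:
        problems.append("category must be 'out_of_scope' or 'retrieval'")
    return problems


flavor_b = {
    "question_type": "false_premise",
    "category": "retrieval",
    "expected_sources": ["k8s_network_policies.md"],
    "source_snippets": ["<verbatim negative statement>"],  # placeholder
}
print(check_false_premise(flavor_b))  # []
```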

**Other flavor-B candidate pages for authoring:**
- Pod Security Standards: explicit statements about what each
  profile does NOT permit
- RBAC Authorization: explicit statements about what RBAC does NOT
  provide (e.g. no deny rules)
- NetworkPolicy: additional negative clauses beyond the pilot_005
  mTLS one

### set / aggregation / post_processing_heavy (0–3 slots)

Include only if a K8s page naturally supports the pattern:

- `set`: "Which Kubernetes resources can expose a Service?" (answer
  is a set drawn from the Service page). Include 0–1 of this type
  if a clean example emerges; otherwise leave slot empty.
- `aggregation`: Unlikely to fit K8s docs (docs describe concepts,
  not tabular data). Likely leave empty.
- `post_processing_heavy`: Unlikely to fit K8s docs. Likely leave
  empty.

**Default:** Leave 0–3 as **0**. Only author these if a question
emerges organically during step 5. Do not force-author to hit a
target count; the plan explicitly says "0–3, included only where
corpus content naturally supports".

---

## `time_sensitive` flag placement (2–3 questions)

Flag questions whose correct answer depends on K8s version state:

| Candidate | Why time-sensitive |
|---|---|
| HPA API version | `autoscaling/v1` vs `autoscaling/v2`; v2 stable since 1.23 |
| Pod Security Admission stability | "stable as of v1.25"; feature state in the page |
| PodSecurityPolicy removal | PSP removed in 1.25; migration path to PSA |

**Authoring rule:** Set `time_sensitive: true` on exactly 2–3
questions. Distribute across ≥2 different CRAG types (e.g. one
`simple`, one `simple_w_condition`) so the flag is not concentrated
in a single type. Each `time_sensitive` question must cite a
specific K8s version or feature state in the source snippet,
otherwise the flag is not load-bearing.
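
A rough sketch of how the three parts of this rule could be checked follows. The regular expression is a heuristic assumption about what "cites a version or feature state" looks like in a snippet (a `1.x` version or a word such as "stable"); `check_time_sensitive` is a hypothetical name, and a human review of the snippets is still needed.

```python
import re

# Heuristic assumption: a version such as "v1.25" / "1.23", or a
# feature-state word. Adjust to the wording actually used in the corpus.
VERSION_OR_STATE = re.compile(
    r"\bv?1\.\d{1,2}\b|\b(?:alpha|beta|stable|deprecated|removed)\b",
    re.IGNORECASE,
)


def check_time_sensitive(questions):
    """Return violations of the time_sensitive authoring rule."""
    flagged = [q for q in questions if q.get("time_sensitive")]
    problems = []
    if not 2 <= len(flagged) <= 3:
        problems.append(f"{len(flagged)} flagged questions, expected 2-3")
    if len({q["question_type"] for q in flagged}) < 2:
        problems.append("flag must span at least 2 CRAG types")
    for q in flagged:
        snippets = q.get("source_snippets", [])
        if not any(VERSION_OR_STATE.search(s) for s in snippets):
            problems.append(f"{q['id']}: snippet cites no version or feature state")
    return problems
```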

---

## Difficulty distribution

Loose guidance, not a hard constraint:

- `easy`: 8–10 questions, mostly `simple` and single-page
  `simple_w_condition`
- `medium`: 10–12 questions, covering `comparison`, most `multi_hop`,
  straightforward `false_premise`
- `hard`: 4–6 questions, covering deep `multi_hop`, flavor-B
  `false_premise`, `time_sensitive` + `multi_hop` combinations

The pilot's 6-question set is all `easy`/`medium`. Step 5 should add
the `hard` tier.

---

## Authoring checklist (per question)

For each of the 25 questions, the step 5 author must fill:

| Field | Required | Notes |
|---|---|---|
| `id` | yes | `k8s_<NNN>` zero-padded (e.g. `k8s_001`) |
| `question` | yes | Natural-language question in the voice of a recruiter or developer |
| `expected_answer_keywords` | yes | 3–6 keywords that MUST appear in a correct answer; drives `keyword_hit_rate` |
| `expected_sources` | yes | List of `.md` filenames from `SOURCES.md`; ≥1 for scoped questions, `[]` for flavor-A false-premise |
| `category` | yes | `retrieval` / `calculation` / `out_of_scope` |
| `difficulty` | yes | `easy` / `medium` / `hard` |
| `requires_calculator` | yes | `false` for all K8s questions (no calc tool use expected) |
| `reference_answer` | yes | 1–3 sentence answer used by the optional LLM judge |
| `question_type` | yes | CRAG taxonomy value (exactly one of the 8 canonical strings) |
| `time_sensitive` | yes | `bool`; `true` on exactly 2–3 questions |
| `source_chunk_ids` | yes | Content-hashed chunk IDs (stable across reindex); must be `[]` for flavor-A false-premise |
| `source_snippets` | yes | ~20 words verbatim per chunk; drift-detection field |
| `source_pages` | yes | Human-readable page anchor (e.g. `"concepts/workloads/pods"`) |
| `source_sections` | yes | Deepest heading containing the snippet |
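
A per-record lint for the checklist above might look like the following. It is a sketch, not the `GoldenQuestion` definition from `harness.py`: `check_record` is a hypothetical helper operating on plain dicts, and the three-digit ID pattern is inferred from the `k8s_001` example.

```python
import re

REQUIRED_FIELDS = (
    "id", "question", "expected_answer_keywords", "expected_sources",
    "category", "difficulty", "requires_calculator", "reference_answer",
    "question_type", "time_sensitive", "source_chunk_ids",
    "source_snippets", "source_pages", "source_sections",
)
CATEGORIES = {"retrieval", "calculation", "out_of_scope"}
DIFFICULTIES = {"easy", "medium", "hard"}
ID_RE = re.compile(r"k8s_\d{3}")  # assumes three digits, as in `k8s_001`


def check_record(q):
    """Return checklist violations for one question dict."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in q]
    if problems:
        return problems
    if not ID_RE.fullmatch(q["id"]):
        problems.append("id does not match k8s_<NNN>")
    if not 3 <= len(q["expected_answer_keywords"]) <= 6:
        problems.append("expected_answer_keywords must have 3-6 entries")
    if q["category"] not in CATEGORIES:
        problems.append(f"unknown category: {q['category']}")
    if q["difficulty"] not in DIFFICULTIES:
        problems.append(f"unknown difficulty: {q['difficulty']}")
    if q["requires_calculator"] is not False:
        problems.append("requires_calculator must be false")
    if not isinstance(q["time_sensitive"], bool):
        problems.append("time_sensitive must be a bool")
    return problems
```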

**Deprecation note:** The pilot schema has `is_multi_hop: bool`.
Step 5 may retire this field in favor of `question_type == "multi_hop"`,
but only after confirming the evaluator's partial-credit logic
(`agent_bench/evaluation/harness.py:38`) is updated to read from
`question_type`. Do NOT remove `is_multi_hop` without the
corresponding harness update, or existing pilot questions will
break partial-credit scoring.
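
One way to make that migration safe is a small compatibility accessor that both old and new records can pass through. The sketch below assumes questions are plain dicts; `is_multi_hop_question` is a hypothetical helper and does not reflect how `harness.py` currently reads the field.

```python
def is_multi_hop_question(question: dict) -> bool:
    """Prefer `question_type`; fall back to the pilot's `is_multi_hop` flag.

    Pilot records without `question_type` keep working, while step 5
    records no longer need the legacy boolean.
    """
    qtype = question.get("question_type")
    if qtype is not None:
        return qtype == "multi_hop"
    return bool(question.get("is_multi_hop", False))


print(is_multi_hop_question({"is_multi_hop": True}))          # True
print(is_multi_hop_question({"question_type": "multi_hop"}))  # True
print(is_multi_hop_question({"question_type": "simple"}))     # False
```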

---

## Pilot-first validation before step 5 authoring

Before writing the 25 questions, step 5 author must:

1. Confirm the 20 new pages from step 4 are ingested and reachable
   via the pipeline (smoke-query test per `SOURCES.md`'s post-ingest
   validation).
2. Re-run `make evaluate` on the existing 6-question pilot dataset
   against the newly-expanded corpus. The pilot's existing questions
   must still pass their per-question gates β€” if adding 20 new
   pages drops pilot P@5 materially, investigate before adding more
   questions on top.
3. Hand-draft 2–3 questions first, run them through the pipeline,
   and confirm retrieval surfaces the expected chunks. This is the
   final pilot-first checkpoint before bulk authoring.

Only after these three checks pass does the step 5 author proceed
to the full 25-question authoring session.

## Post-authoring observations (step 5 shipped 2026-04-14)

Pilot→full generalization numbers: pilot (6Q) P@5=0.80, R@5=1.00,
KHR=0.81 → full (25Q post-fix) P@5=0.83, R@5=0.96, KHR=0.90. R@5
movement is within expected variance when corpus breadth expands
from 8 → 28 pages. The KHR jump from 0.81 → 0.90 is an open question:
the 25Q distribution may skew toward questions where the golden
keyword set is more readily satisfied (simple + simple_w_condition
+ set together = 11/25 questions with short, high-precision expected
answers), versus the pilot's retrieval-heavy mix. Worth revisiting if
KHR drifts on future corpora. If the jump is consistent across
datasets, it is an authoring signal that the keyword set should be
tightened for CRAG type parity.

Flavor-B reproducibility finding: k8s_022 (RBAC deny rules) and
pilot_005 (NetworkPolicy mTLS) both produce refusal-phrased answers
when the documented negative is in retrieved context. Two independent
reproductions confirm the LLM-hedges-on-documented-negative pattern
is a class of failure mode, not a one-off, which strengthens the case
for the deferred Fix 2 + targeted prompt guidance stacked experiment.
Authoring itself is clean on both: retrieval surfaces the expected
chunks, citation accuracy 1.00, snippets verify against chunk IDs.