# K8s Golden Dataset: Question Plan
**Status:** Structural guide for Week 1 step 5 authoring (v1.1 plan).
This document defines the 25-question target distribution, per-type
source-page mapping, and authoring constraints. It does NOT contain
the 25 specific question texts; those are authored during step 5 in
a fresh session, per cross-cutting #8 pilot-first discipline.
**Upstream contracts:**
- Taxonomy: CRAG 8-type (Yang et al., NeurIPS 2024); see DECISIONS.md
"K8s golden dataset uses CRAG's 8-type taxonomy as the schema".
- Source pages: see `SOURCES.md` (28 pages, category-locked; 8 already
pulled, 20 to pull at step 4).
- Schema: see `agent_bench/evaluation/harness.py` `GoldenQuestion`
plus the v1.1 plan's methodology #3 source-attribution fields.
- Flavor A/B for `false_premise`: see DECISIONS.md "False-premise
questions come in two flavors".
---
## Target distribution (25 questions total)
| CRAG type | Count | Schema field | Notes |
|---|---|---|---|
| `simple` | 5–6 | `question_type: "simple"` | Baseline retrieval: direct lookup in 1 page, 1–2 sentence answer. |
| `simple_w_condition` | 3–4 | `question_type: "simple_w_condition"` | Answer depends on a condition stated in the question (enforcement level, volume type, Pod phase). |
| `comparison` | 3–4 | `question_type: "comparison"` | Answer compares two concepts across 2 pages; reranker stress. |
| `multi_hop` | 5–6 | `question_type: "multi_hop"` | Answer synthesizes 2–4 pages; reranker-stressing by construction. |
| `false_premise` | 3–4 | `question_type: "false_premise"` | Grounded refusal stress. Flavor A (pure refusal) + flavor B (documented negative). |
| `set` / `aggregation` / `post_processing_heavy` | 0–3 | respective values | Optional. Include only if natural from corpus content. |
| **Total** | **25** | | |
**Orthogonal flag:** `time_sensitive: bool` on 2–3 questions. Does
NOT replace `question_type`; it's an independent property for
version-bounded content (feature state, API version migration,
deprecations).
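The target distribution table can be checked mechanically once questions are drafted. The following is a minimal sketch, assuming each question is a plain dict with a `question_type` key and that the 0–3 allowance for the optional types is a combined budget; `check_distribution` is a hypothetical helper, not part of the harness.

```python
from collections import Counter

# (min, max) per core type, copied from the target distribution table.
TARGET_RANGES = {
    "simple": (5, 6),
    "simple_w_condition": (3, 4),
    "comparison": (3, 4),
    "multi_hop": (5, 6),
    "false_premise": (3, 4),
}
# Assumption: the 0-3 allowance is shared across these three types.
OPTIONAL_TYPES = {"set", "aggregation", "post_processing_heavy"}


def check_distribution(questions: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means the plan is met."""
    problems = []
    counts = Counter(q["question_type"] for q in questions)
    if len(questions) != 25:
        problems.append(f"expected 25 questions, got {len(questions)}")
    for qtype, (lo, hi) in TARGET_RANGES.items():
        if not lo <= counts[qtype] <= hi:
            problems.append(f"{qtype}: {counts[qtype]} outside {lo}-{hi}")
    optional = sum(counts[t] for t in OPTIONAL_TYPES)
    if optional > 3:
        problems.append(f"optional types: {optional} exceeds 3")
    unknown = set(counts) - set(TARGET_RANGES) - OPTIONAL_TYPES
    if unknown:
        problems.append(f"unknown question_type values: {sorted(unknown)}")
    return problems
```

Returning a list of messages rather than raising on the first violation lets an author see every count that is off in a single pass.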
---
## Per-type source-page mapping
Each row identifies the K8s concept pages a question of that type
should draw from. Multi-hop and comparison questions list multiple
pages intentionally.
### simple (5–6 slots)
Pool questions where a 1–2 sentence answer lives inside a single page.
| Candidate source | CRAG slot justification |
|---|---|
| `k8s_pods.md` | Pod IP semantics, container sharing, ephemeral containers |
| `k8s_deployment.md` | What a Deployment is, declarative update mechanic |
| `k8s_configmap.md` | What a ConfigMap is, immutable field |
| `k8s_secret.md` | What a Secret is, volume mount modes |
| RBAC Authorization *(step 4 page)* | RBAC primitive definitions (Role, RoleBinding, ClusterRole) |
| StatefulSet *(step 4 page)* | StatefulSet identity guarantees |
| DaemonSet *(step 4 page)* | One-per-node scheduling contract |
| Namespaces *(step 4 page)* | Namespace scoping for resources |
**Authoring rule:** Each `simple` question must have exactly one
expected source page and 1–2 source snippets. KHR (`keyword_hit_rate`)
target ≥ 0.60 on the authored keywords.
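To sanity-check the 0.60 target while drafting keywords, a keyword hit rate can be approximated as the fraction of expected keywords found in an answer. This is a sketch under the assumption of case-insensitive substring matching; the actual matching logic in `agent_bench/evaluation/harness.py` may differ (for example in tokenization or stemming).

```python
def keyword_hit_rate(answer: str, keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the answer.

    Assumption: case-insensitive substring match; the harness may differ.
    """
    if not keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)
```

With three keywords, a single miss already drops the rate to about 0.67, so short keyword lists leave little headroom above the 0.60 target.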
### simple_w_condition (3–4 slots)
Pool questions where the answer explicitly depends on a condition
named in the question.
| Candidate source | Condition that shapes the answer |
|---|---|
| `k8s_pod_security_admission.md` | enforcement level: `enforce` / `audit` / `warn` |
| `k8s_secret.md` | mount mode: environment variable vs file in volume |
| Liveness/Readiness/Startup Probes *(step 4)* | probe type: liveness vs readiness vs startup |
| Volumes *(step 4)* | volume type: emptyDir vs configMap vs persistentVolumeClaim |
| Node-pressure Eviction (`k8s_node_pressure_eviction.md`) | resource under pressure: memory vs disk vs inodes |
**Authoring rule:** The condition must be named in the question
stem, not implied. The expected answer must change materially if the
condition flips. Example: "How is a Secret mounted as a volume
versus consumed as an environment variable?" is a valid
`simple_w_condition`; "How is a Secret mounted?" is `simple`.
### comparison (3–4 slots)
Pool questions where the answer explicitly compares two K8s concepts
that span 2 pages.
| Page pair | Concept compared |
|---|---|
| Deployment vs StatefulSet *(step 4)* | stateless vs stateful workload semantics |
| Deployment vs DaemonSet *(step 4)* | replica-count vs one-per-node scheduling |
| ConfigMap vs Secret | non-confidential vs confidential data, mount parity |
| Service vs Ingress *(step 4)* | L4 vs L7 exposure |
| Taints/Tolerations vs Node Affinity *(step 4)* | opt-out vs opt-in placement |
| Liveness vs Readiness probes *(step 4)* | restart vs traffic-routing semantics |
**Authoring rule:** The question must force retrieval from both
pages. Reranker stress is intentional: questions where BM25 would
find one side but miss the other are the target. Expected sources:
2 pages minimum.
### multi_hop (5–6 slots)
Pool questions where the answer synthesizes 2–4 pages. These are
the primary reranker stressors.
| Page set (example) | Hop path |
|---|---|
| Pod + Service + Ingress *(step 4)* | How external traffic reaches a Pod via Ingress → Service |
| Deployment + ReplicaSet + Pod | How a Deployment rollout changes the underlying ReplicaSet and Pod set |
| ConfigMap + Deployment | How a ConfigMap update propagates to Pods via env vars or mounted volume |
| HPA + Deployment + Metrics Server *(partial step 4)* | How HPA reads metrics and scales a Deployment |
| NetworkPolicy + Pod + Namespace *(partial step 4)* | How NetworkPolicy selectors resolve across namespaces |
| Job + Pod + Container lifecycle *(partial step 4)* | How a Job's completions and parallelism interact with Pod restart policy |
**Authoring rule:** Expected sources ≥ 2 pages. The question must
not be answerable from any single page alone. `source_chunk_ids`
must list at least one chunk from each expected page; partial
credit is granted in the evaluator if at least one expected chunk is
cited (see `agent_bench/evaluation/harness.py`).
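The two conditions in this rule (every expected page is represented in `source_chunk_ids`, and partial credit when at least one expected chunk is cited) can be sketched as follows. Both helpers and the `chunk_to_page` lookup are hypothetical; the real partial-credit logic lives in the harness and may score differently.

```python
def pages_without_chunks(
    expected_sources: list[str],
    source_chunk_ids: list[str],
    chunk_to_page: dict[str, str],
) -> list[str]:
    """Expected pages that no listed chunk ID belongs to (should be empty)."""
    covered = {chunk_to_page.get(cid) for cid in source_chunk_ids}
    return [page for page in expected_sources if page not in covered]


def citation_outcome(source_chunk_ids: list[str], cited_chunk_ids: list[str]) -> str:
    """Classify an answer's citations as 'full', 'partial', or 'none'."""
    expected, cited = set(source_chunk_ids), set(cited_chunk_ids)
    if expected and expected <= cited:
        return "full"      # every expected chunk was cited
    if expected & cited:
        return "partial"   # at least one expected chunk was cited
    return "none"
```

The outcome is returned as a label rather than a score because this document does not state how much credit a partial citation earns.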
### false_premise (3–4 slots)
Pool questions whose premise is wrong. Split across two flavors:
**Flavor A: pure refusal** (at least 1 slot):
- Premise targets a capability that does not exist in the K8s corpus
(not in any pulled page).
- Example seed: "How do I configure Claude API rate limits in a
Kubernetes Deployment?" (wrong domain: Claude API is not a K8s
concept)
- Schema: `category: "out_of_scope"`, `expected_sources: []`,
`source_snippets: []`.
- Evaluator expectation: answer contains refusal phrasing AND cites
zero sources.
**Flavor B: documented negative** (at least 1 slot, ideally 2):
- Corpus contains an explicit negative statement (e.g.
NetworkPolicy "Anything TLS related" limitation at chunk 63 of
`k8s_network_policies.md`).
- Example already in pilot: `k8s_pilot_005` (NetworkPolicy mTLS).
- Schema: `category: "retrieval"`, `question_type: "false_premise"`,
`expected_sources: [<negative-answer page>]`,
`source_snippets: [<verbatim negative statement>]`.
- Evaluator expectation: answer reports the documented negative
with citation and does NOT open with "the documentation does not
provide instructions" phrasing (per pilot_005 Fix 1 + Fix 2
revert analysis).
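The schema difference between the two flavors can be expressed as a small check. This is a sketch assuming questions are plain dicts keyed by the field names above; `check_false_premise` and `FLAVOR_A_EXAMPLE` are illustrative names, and the empty `source_chunk_ids` requirement for flavor A follows the authoring checklist.

```python
def check_false_premise(q: dict) -> list[str]:
    """Return schema violations for a false_premise question (empty list = OK)."""
    if q.get("question_type") != "false_premise":
        return ["not a false_premise question"]
    problems = []
    category = q.get("category")
    if category == "out_of_scope":
        # Flavor A: pure refusal, nothing in the corpus to cite.
        for name in ("expected_sources", "source_snippets", "source_chunk_ids"):
            if q.get(name):
                problems.append(f"flavor A requires {name} == []")
    elif category == "retrieval":
        # Flavor B: the corpus states the negative, so it must be cited.
        for name in ("expected_sources", "source_snippets"):
            if not q.get(name):
                problems.append(f"flavor B requires non-empty {name}")
    else:
        problems.append(f"unexpected category for false_premise: {category!r}")
    return problems


# Flavor-A record built from the example seed above.
FLAVOR_A_EXAMPLE = {
    "question": "How do I configure Claude API rate limits in a Kubernetes Deployment?",
    "question_type": "false_premise",
    "category": "out_of_scope",
    "expected_sources": [],
    "source_snippets": [],
    "source_chunk_ids": [],
}
```

Keying the flavor off `category` keeps the check aligned with the schema as written: `out_of_scope` implies flavor A and `retrieval` implies flavor B.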
**Other flavor-B candidate pages for authoring:**
- Pod Security Standards: explicit statements about what each
profile does NOT permit
- RBAC Authorization: explicit statements about what RBAC does NOT
provide (e.g. no deny rules)
- NetworkPolicy: additional negative clauses beyond the pilot_005
mTLS one
### set / aggregation / post_processing_heavy (0–3 slots)
Include only if a K8s page naturally supports the pattern:
- `set`: "Which Kubernetes resources can expose a Service?" (answer
is a set drawn from the Service page). Include 0–1 of this type
if a clean example emerges; otherwise leave slot empty.
- `aggregation`: Unlikely to fit K8s docs (docs describe concepts,
not tabular data). Likely leave empty.
- `post_processing_heavy`: Unlikely to fit K8s docs. Likely leave
empty.
**Default:** Treat the 0–3 range as **0**. Only author these if a question
emerges organically during step 5. Do not force-author to hit a
target count; the plan explicitly says "0–3, included only where
corpus content naturally supports".
---
## `time_sensitive` flag placement (2–3 questions)
Flag questions whose correct answer depends on K8s version state:
| Candidate | Why time-sensitive |
|---|---|
| HPA API version | `autoscaling/v1` vs `autoscaling/v2`; v2 stable since 1.23 |
| Pod Security Admission stability | "stable as of v1.25"; feature state in the page |
| PodSecurityPolicy removal | PSP removed in 1.25; migration path to PSA |
**Authoring rule:** Set `time_sensitive: true` on 2–3 questions, no
more and no fewer. Distribute across ≥2 different CRAG types (e.g. one
`simple`, one `simple_w_condition`) so the flag is not concentrated
in a single type. Each `time_sensitive` question must cite a
specific K8s version or feature state in the source snippet,
otherwise the flag is not load-bearing.
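The three parts of this rule (2–3 flagged questions, spread over at least two CRAG types, each snippet naming a version or feature state) can be checked together. This is a rough sketch: `check_time_sensitive` is a hypothetical helper, and the regular expression is a heuristic assumption about what "cites a specific K8s version or feature state" looks like in a snippet.

```python
import re

# Heuristic: a version such as "1.23" or "v1.25", or a feature-state word.
VERSION_OR_STATE = re.compile(
    r"\bv?\d+\.\d+\b|\b(?:alpha|beta|stable|deprecated|removed)\b",
    re.IGNORECASE,
)


def check_time_sensitive(questions: list[dict]) -> list[str]:
    """Return violations of the time_sensitive authoring rule."""
    problems = []
    flagged = [q for q in questions if q.get("time_sensitive")]
    if not 2 <= len(flagged) <= 3:
        problems.append(f"{len(flagged)} flagged questions, expected 2-3")
    if len({q["question_type"] for q in flagged}) < 2:
        problems.append("flag is concentrated in fewer than 2 CRAG types")
    for q in flagged:
        snippets = q.get("source_snippets", [])
        if not any(VERSION_OR_STATE.search(s) for s in snippets):
            problems.append(f"{q['id']}: no version or feature state in snippets")
    return problems
```

A heuristic match is only a first filter; a flagged question still needs a human read to confirm the version reference is what makes the answer time-bound.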
---
## Difficulty distribution
Loose guidance, not a hard constraint:
- `easy`: 8–10 questions: mostly `simple` and single-page
`simple_w_condition`
- `medium`: 10–12 questions: `comparison`, most `multi_hop`,
straightforward `false_premise`
- `hard`: 4–6 questions: deep `multi_hop`, flavor-B `false_premise`,
`time_sensitive` + `multi_hop` combinations
The pilot's 6-question set is all `easy`/`medium`. Step 5 should add
the `hard` tier.
---
## Authoring checklist (per question)
For each of the 25 questions, the step 5 author must fill:
| Field | Required | Notes |
|---|---|---|
| `id` | yes | `k8s_<NNN>` zero-padded (e.g. `k8s_001`) |
| `question` | yes | Natural-language question in the voice of a recruiter or developer |
| `expected_answer_keywords` | yes | 3–6 keywords that MUST appear in a correct answer; drives `keyword_hit_rate` |
| `expected_sources` | yes | List of `.md` filenames from `SOURCES.md`; ≥1 for scoped questions, `[]` for flavor-A false-premise |
| `category` | yes | `retrieval` / `calculation` / `out_of_scope` |
| `difficulty` | yes | `easy` / `medium` / `hard` |
| `requires_calculator` | yes | `false` for all K8s questions (no calc tool use expected) |
| `reference_answer` | yes | 1–3 sentence answer used by the optional LLM judge |
| `question_type` | yes | CRAG taxonomy value (exactly one of the 8 canonical strings) |
| `time_sensitive` | yes | `bool`; `true` on 2–3 questions across the dataset |
| `source_chunk_ids` | yes | Content-hashed chunk IDs (stable across reindex); must be `[]` for flavor-A false-premise |
| `source_snippets` | yes | ~20 words verbatim per chunk; drift-detection field |
| `source_pages` | yes | Human-readable page anchor (e.g. `"concepts/workloads/pods"`) |
| `source_sections` | yes | Deepest heading containing the snippet |
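
The checklist can be mirrored as a typed record with a self-check, which catches missing or malformed fields before a question reaches the harness. The sketch below is illustrative only: `QuestionDraft` is a hypothetical stand-in, not the real `GoldenQuestion` class, and the list types chosen for `source_pages` and `source_sections` are assumptions.

```python
import re
from dataclasses import dataclass

QUESTION_TYPES = {
    "simple", "simple_w_condition", "comparison", "multi_hop",
    "false_premise", "set", "aggregation", "post_processing_heavy",
}
CATEGORIES = {"retrieval", "calculation", "out_of_scope"}
DIFFICULTIES = {"easy", "medium", "hard"}


@dataclass
class QuestionDraft:
    """Hypothetical stand-in for the real GoldenQuestion schema."""
    id: str
    question: str
    expected_answer_keywords: list[str]
    expected_sources: list[str]
    category: str
    difficulty: str
    reference_answer: str
    question_type: str
    source_chunk_ids: list[str]
    source_snippets: list[str]
    source_pages: list[str]      # assumed to be a list, one anchor per source
    source_sections: list[str]   # assumed to be a list, one heading per snippet
    time_sensitive: bool = False
    requires_calculator: bool = False

    def problems(self) -> list[str]:
        """Return checklist violations; an empty list means the draft passes."""
        found = []
        if not re.fullmatch(r"k8s_\d{3}", self.id):
            found.append("id must look like k8s_001")
        if not 3 <= len(self.expected_answer_keywords) <= 6:
            found.append("expected_answer_keywords needs 3-6 entries")
        if self.category not in CATEGORIES:
            found.append(f"unknown category: {self.category}")
        if self.difficulty not in DIFFICULTIES:
            found.append(f"unknown difficulty: {self.difficulty}")
        if self.question_type not in QUESTION_TYPES:
            found.append(f"unknown question_type: {self.question_type}")
        if self.requires_calculator:
            found.append("requires_calculator must be false for K8s questions")
        if self.category == "out_of_scope":
            if self.expected_sources or self.source_chunk_ids:
                found.append("flavor-A questions need empty sources and chunk IDs")
        elif not self.expected_sources:
            found.append("scoped questions need at least one expected source")
        return found
```

Calling `problems()` on each draft before committing it gives a per-question report without depending on the evaluation pipeline being available.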
**Deprecation note:** The pilot schema has `is_multi_hop: bool`.
Step 5 may retire this field in favor of `question_type == "multi_hop"`,
but only after confirming the evaluator's partial-credit logic
(`agent_bench/evaluation/harness.py:38`) is updated to read from
`question_type`. Do NOT remove `is_multi_hop` without the
corresponding harness update, or existing pilot questions will
break partial-credit scoring.
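One way to make that migration safe is a transitional accessor that honors both fields. The helper below is a hypothetical sketch, assuming questions are available as dicts; it is not the harness's current code, which this document does not show.

```python
def is_multi_hop(q: dict) -> bool:
    """True if either the new question_type or the legacy flag marks multi-hop.

    Reading both fields keeps pilot records (which carry is_multi_hop)
    scoring the same way while new records rely on question_type alone.
    """
    return q.get("question_type") == "multi_hop" or bool(q.get("is_multi_hop", False))
```

Once every record carries `question_type` and the harness reads it, the legacy branch can be dropped along with the field.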
---
## Pilot-first validation before step 5 authoring
Before writing the 25 questions, the step 5 author must:
1. Confirm the 20 new pages from step 4 are ingested and reachable
via the pipeline (smoke-query test per `SOURCES.md`'s post-ingest
validation).
2. Re-run `make evaluate` on the existing 6-question pilot dataset
against the newly-expanded corpus. The pilot's existing questions
must still pass their per-question gates; if adding 20 new
pages drops pilot P@5 materially, investigate before adding more
questions on top.
3. Hand-draft 2–3 questions first, run them through the pipeline,
and confirm retrieval surfaces the expected chunks. This is the
final pilot-first checkpoint before bulk authoring.
Only after these three checks pass does the step 5 author proceed
to the full 25-question authoring session.
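For the hand-drafted questions in check 3, retrieval quality can be spot-checked without the full evaluator. The sketch below uses the textbook definitions of precision@k and recall@k over chunk IDs; whether `make evaluate` computes P@5 and R@5 at chunk or page level, or handles empty expected sets the same way, is an assumption.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the top-k retrieved IDs that are relevant (denominator is k)."""
    top_k = retrieved[:k]
    return sum(1 for item in top_k if item in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        # Assumption: questions with no expected chunks score 0.0 here.
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```

Recall is the more useful signal for this checkpoint, since the goal is to confirm that the expected chunks surface at all before bulk authoring begins.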
## Post-authoring observations (step 5 shipped 2026-04-14)
Pilot→full generalization numbers: pilot (6Q) P@5=0.80, R@5=1.00,
KHR=0.81 → full (25Q post-fix) P@5=0.83, R@5=0.96, KHR=0.90. R@5
movement is within expected variance when corpus breadth expands
from 8 → 28 pages. The KHR jump from 0.81 → 0.90 is an open question:
the 25Q distribution may skew toward questions where the golden
keyword set is more readily satisfied (simple + simple_w_condition
+ set together = 11/25 questions with short, high-precision expected
answers), vs the pilot's retrieval-heavy mix. Worth revisiting if
KHR drifts on future corpora; if the gap is consistent across datasets,
it's an authoring signal that the keyword set should be tightened for
CRAG type parity.
Flavor-B reproducibility finding: k8s_022 (RBAC deny rules) and
pilot_005 (NetworkPolicy mTLS) both produce refusal-phrased answers
when the documented negative is in retrieved context. Two independent
reproductions confirm the LLM-hedges-on-documented-negative pattern
is a class of failure mode, not a one-off, which strengthens the case
for the deferred Fix 2 + targeted prompt guidance stacked experiment.
Authoring itself is clean on both: retrieval surfaces the expected
chunks, citation accuracy 1.00, snippets verify against chunk IDs.