K8s Golden Dataset: Question Plan
Status: Structural guide for Week 1 step 5 authoring (v1.1 plan). This document defines the 25-question target distribution, per-type source-page mapping, and authoring constraints. It does NOT contain the 25 specific question texts; those are authored during step 5 in a fresh session, per cross-cutting #8 pilot-first discipline.
Upstream contracts:
- Taxonomy: CRAG 8-type (Yang et al., NeurIPS 2024); see DECISIONS.md "K8s golden dataset uses CRAG's 8-type taxonomy as the schema".
- Source pages: see SOURCES.md (28 pages, category-locked; 8 already pulled, 20 to pull at step 4).
- Schema: see GoldenQuestion in agent_bench/evaluation/harness.py, plus the v1.1 plan's methodology #3 source-attribution fields.
- Flavor A/B for false_premise: see DECISIONS.md "False-premise questions come in two flavors".
Target distribution (25 questions total)
| CRAG type | Count | Schema field | Notes |
|---|---|---|---|
| simple | 5–6 | question_type: "simple" | Baseline retrieval: direct lookup in 1 page, 1–2 sentence answer. |
| simple_w_condition | 3–4 | question_type: "simple_w_condition" | Answer depends on a condition stated in the question (enforcement level, volume type, Pod phase). |
| comparison | 3–4 | question_type: "comparison" | Answer compares two concepts across 2 pages; reranker stress. |
| multi_hop | 5–6 | question_type: "multi_hop" | Answer synthesizes 2–4 pages; reranker-stressing by construction. |
| false_premise | 3–4 | question_type: "false_premise" | Grounded refusal stress. Flavor A (pure refusal) + flavor B (documented negative). |
| set / aggregation / post_processing_heavy | 0–3 | respective values | Optional. Include only if natural from corpus content. |
| Total | 25 | | |
Orthogonal flag: time_sensitive: bool on 2–3 questions. It does
NOT replace question_type; it is an independent property for
version-bounded content (feature state, API version migration,
deprecations).
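A minimal sketch of how the target distribution could be checked once the questions are authored. It assumes each question is a plain dict with a question_type key, and it reads the 0–3 range as a combined cap across the three optional types; check_distribution and the constants are illustrative names, not part of harness.py.

```python
from collections import Counter

# Target ranges per CRAG type, copied from the distribution table.
TARGET_RANGES = {
    "simple": (5, 6),
    "simple_w_condition": (3, 4),
    "comparison": (3, 4),
    "multi_hop": (5, 6),
    "false_premise": (3, 4),
}
# Assumption: the 0-3 range is a combined cap across these three types.
OPTIONAL_TYPES = ("set", "aggregation", "post_processing_heavy")
TOTAL = 25


def check_distribution(questions):
    """Return a list of violations; an empty list means the plan is met."""
    counts = Counter(q["question_type"] for q in questions)
    problems = []
    for qtype, (lo, hi) in TARGET_RANGES.items():
        if not lo <= counts[qtype] <= hi:
            problems.append(f"{qtype}: {counts[qtype]} not in {lo}-{hi}")
    optional = sum(counts[t] for t in OPTIONAL_TYPES)
    if optional > 3:
        problems.append(f"optional types: {optional} exceeds 3")
    unknown = set(counts) - set(TARGET_RANGES) - set(OPTIONAL_TYPES)
    if unknown:
        problems.append(f"unknown question_type values: {sorted(unknown)}")
    if len(questions) != TOTAL:
        problems.append(f"total: {len(questions)} != {TOTAL}")
    return problems
```

Returning a list of messages rather than raising on the first failure lets an author see every imbalance in one pass.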
Per-type source-page mapping
Each row identifies the K8s concept pages a question of that type should draw from. Multi-hop and comparison questions list multiple pages intentionally.
simple (5–6 slots)
Pool questions where a 1–2 sentence answer lives inside a single page.
| Candidate source | CRAG slot justification |
|---|---|
| k8s_pods.md | Pod IP semantics, container sharing, ephemeral containers |
| k8s_deployment.md | What a Deployment is, declarative update mechanic |
| k8s_configmap.md | What a ConfigMap is, immutable field |
| k8s_secret.md | What a Secret is, volume mount modes |
| RBAC Authorization (step 4 page) | RBAC primitive definitions (Role, RoleBinding, ClusterRole) |
| StatefulSet (step 4 page) | StatefulSet identity guarantees |
| DaemonSet (step 4 page) | One-per-node scheduling contract |
| Namespaces (step 4 page) | Namespace scoping for resources |
Authoring rule: Each simple question must have exactly one
expected source page and 1–2 source snippets. KHR (keyword hit rate)
target ≥ 0.60 on the authored keywords.
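To make the 0.60 threshold concrete, here is a rough sketch of a keyword hit rate. It assumes the metric is the fraction of expected keywords found in the answer by case-insensitive substring match; the real computation in harness.py may differ, and keyword_hit_rate as written here is a hypothetical helper.

```python
def keyword_hit_rate(answer, expected_keywords):
    """Fraction of expected keywords present in the answer.

    Assumption: case-insensitive substring matching. The evaluator's
    actual matching rule is not specified in this plan.
    """
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)
```

Under this reading, an answer that contains 2 of 3 authored keywords scores about 0.67 and clears the target, while 1 of 3 does not.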
simple_w_condition (3–4 slots)
Pool questions where the answer explicitly depends on a condition named in the question.
| Candidate source | Condition that shapes the answer |
|---|---|
| k8s_pod_security_admission.md | enforcement level: enforce / audit / warn |
| k8s_secret.md | mount mode: environment variable vs file in volume |
| Liveness/Readiness/Startup Probes (step 4) | probe type: liveness vs readiness vs startup |
| Volumes (step 4) | volume type: emptyDir vs configMap vs persistentVolumeClaim |
| Node-pressure Eviction (k8s_node_pressure_eviction.md) | resource under pressure: memory vs disk vs inodes |
Authoring rule: The condition must be named in the question
stem, not implied. The expected answer must change materially if the
condition flips. Example: "How is a Secret mounted as a volume
versus consumed as an environment variable?" is a valid
simple_w_condition; "How is a Secret mounted?" is simple.
comparison (3–4 slots)
Pool questions where the answer explicitly compares two K8s concepts that span 2 pages.
| Page pair | Concept compared |
|---|---|
| Deployment vs StatefulSet (step 4) | stateless vs stateful workload semantics |
| Deployment vs DaemonSet (step 4) | replica-count vs one-per-node scheduling |
| ConfigMap vs Secret | non-confidential vs confidential data, mount parity |
| Service vs Ingress (step 4) | L4 vs L7 exposure |
| Taints/Tolerations vs Node Affinity (step 4) | opt-out vs opt-in placement |
| Liveness vs Readiness probes (step 4) | restart vs traffic-routing semantics |
Authoring rule: The question must force retrieval from both pages. Reranker stress is intentional: questions where BM25 would find one side but miss the other are the target. Expected sources: 2 pages minimum.
multi_hop (5–6 slots)
Pool questions where the answer synthesizes 2–4 pages. These are the primary reranker stressors.
| Page set (example) | Hop path |
|---|---|
| Pod + Service + Ingress (step 4) | How external traffic reaches a Pod: Ingress → Service → Pod |
| Deployment + ReplicaSet + Pod | How a Deployment rollout changes the underlying ReplicaSet and Pod set |
| ConfigMap + Deployment | How a ConfigMap update propagates to Pods via env vars or mounted volume |
| HPA + Deployment + Metrics Server (partial step 4) | How HPA reads metrics and scales a Deployment |
| NetworkPolicy + Pod + Namespace (partial step 4) | How NetworkPolicy selectors resolve across namespaces |
| Job + Pod + Container lifecycle (partial step 4) | How a Job's completions and parallelism interact with Pod restart policy |
Authoring rule: Expected sources ≥ 2 pages. The question must
not be answerable from any single page alone. source_chunk_ids
must list at least one chunk from each expected page; partial
credit is granted in the evaluator if at least one expected chunk is
cited (see agent_bench/evaluation/harness.py).
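The multi_hop authoring rule can be sketched as a pre-flight check. This assumes a chunk_to_page lookup (chunk ID to source filename) is available from the index; that mapping and the multi_hop_problems helper are hypothetical and are not taken from harness.py.

```python
def multi_hop_problems(question, chunk_to_page):
    """Check the multi_hop authoring rule for one question dict.

    chunk_to_page is an assumed lookup from chunk ID to .md filename.
    """
    expected = set(question["expected_sources"])
    problems = []
    if len(expected) < 2:
        problems.append("multi_hop needs at least 2 expected source pages")
    # Pages that have at least one listed chunk.
    covered = {
        chunk_to_page[c]
        for c in question["source_chunk_ids"]
        if c in chunk_to_page
    }
    missing = sorted(expected - covered)
    if missing:
        problems.append(f"no chunk listed for: {missing}")
    return problems
```

This only validates the authored record; it does not reproduce the evaluator's partial-credit scoring.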
false_premise (3–4 slots)
Pool questions whose premise is wrong. Split across two flavors:
Flavor A: pure refusal (at least 1 slot):
- Premise targets a capability that does not exist in the K8s corpus (not in any pulled page).
- Example seed: "How do I configure Claude API rate limits in a Kubernetes Deployment?" (wrong domain: Claude API is not a K8s concept)
- Schema: category: "out_of_scope", expected_sources: [], source_snippets: [].
- Evaluator expectation: answer contains refusal phrasing AND cites zero sources.
Flavor B: documented negative (at least 1 slot, ideally 2):
- Corpus contains an explicit negative statement (e.g. NetworkPolicy "Anything TLS related" limitation at chunk 63 of k8s_network_policies.md).
- Example already in pilot: k8s_pilot_005 (NetworkPolicy mTLS).
- Schema: category: "retrieval", question_type: "false_premise", expected_sources: [<negative-answer page>], source_snippets: [<verbatim negative statement>].
- Evaluator expectation: answer reports the documented negative with citation and does NOT open with "the documentation does not provide instructions" phrasing (per pilot_005 Fix 1 + Fix 2 revert analysis).
Other flavor-B candidate pages for authoring:
- Pod Security Standards: explicit statements about what each profile does NOT permit
- RBAC Authorization: explicit statements about what RBAC does NOT provide (e.g. no deny rules)
- NetworkPolicy: additional negative clauses beyond the pilot_005 mTLS one
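The two flavors differ only in a few schema fields, which a small classifier can make explicit. This is a sketch under assumptions: questions are plain dicts, flavor A also carries question_type "false_premise" (the plan states this only for flavor B), and false_premise_flavor is an illustrative name.

```python
def false_premise_flavor(question):
    """Return "A" or "B" for a false_premise question, else raise ValueError."""
    if question.get("question_type") != "false_premise":
        raise ValueError("not a false_premise question")
    sources = question.get("expected_sources", [])
    snippets = question.get("source_snippets", [])
    category = question.get("category")
    # Flavor A: out of scope, nothing to cite.
    if category == "out_of_scope" and not sources and not snippets:
        return "A"
    # Flavor B: the corpus documents the negative, so it must be cited.
    if category == "retrieval" and sources and snippets:
        return "B"
    raise ValueError("fields match neither flavor A nor flavor B")
```

A record that mixes the two, for example out_of_scope with a non-empty expected_sources list, is rejected rather than silently assigned to a flavor.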
set / aggregation / post_processing_heavy (0–3 slots)
Include only if a K8s page naturally supports the pattern:
- set: "Which Kubernetes resources can expose a Service?" (answer is a set drawn from the Service page). Include 0–1 of this type if a clean example emerges; otherwise leave slot empty.
- aggregation: Unlikely to fit K8s docs (docs describe concepts, not tabular data). Likely leave empty.
- post_processing_heavy: Unlikely to fit K8s docs. Likely leave empty.
Default: Leave all of the 0–3 optional slots empty. Only author these if a question emerges organically during step 5. Do not force-author to hit a target count; the plan explicitly says "0–3, included only where corpus content naturally supports".
time_sensitive flag placement (2–3 questions)
Flag questions whose correct answer depends on K8s version state:
| Candidate | Why time-sensitive |
|---|---|
| HPA API version | autoscaling/v1 vs autoscaling/v2; v2 stable since 1.23 |
| Pod Security Admission stability | "stable as of v1.25": feature state in the page |
| PodSecurityPolicy removal | PSP removed in 1.25; migration path to PSA |
Authoring rule: Set time_sensitive: true on 2–3
questions. Distribute across ≥2 different CRAG types (e.g. one
simple, one simple_w_condition) so the flag is not concentrated
in a single type. Each time_sensitive question must cite a
specific K8s version or feature state in the source snippet,
otherwise the flag is not load-bearing.
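The three parts of this rule (count, spread across types, version evidence in the snippet) can be sketched as one check. The regular expression is a heuristic assumption for what counts as "a specific K8s version or feature state", and time_sensitive_problems is a hypothetical helper operating on plain dicts.

```python
import re

# Heuristic assumption: a version like "1.23" or "v1.25", or the
# "FEATURE STATE" banner used on Kubernetes documentation pages.
VERSION_RE = re.compile(r"\bv?1\.\d+\b|FEATURE STATE")


def time_sensitive_problems(questions):
    """Return violations of the time_sensitive authoring rule."""
    flagged = [q for q in questions if q.get("time_sensitive")]
    problems = []
    if not 2 <= len(flagged) <= 3:
        problems.append(f"{len(flagged)} flagged questions, expected 2-3")
    if len({q["question_type"] for q in flagged}) < 2:
        problems.append("flag is concentrated in fewer than 2 CRAG types")
    for q in flagged:
        snippets = q.get("source_snippets", [])
        if not any(VERSION_RE.search(s) for s in snippets):
            problems.append(
                f"{q['id']}: no version or feature state in any source snippet"
            )
    return problems
```

The snippet check is deliberately loose; a human review is still needed to confirm the cited version is the one the answer depends on.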
Difficulty distribution
Loose guidance, not a hard constraint:
- easy: 8–10 questions: mostly simple and single-page simple_w_condition
- medium: 10–12 questions: comparison, most multi_hop, straightforward false_premise
- hard: 4–6 questions: deep multi_hop, flavor-B false_premise, time_sensitive + multi_hop combinations
The pilot's 6-question set is all easy/medium. Step 5 should add
the hard tier.
Authoring checklist (per question)
For each of the 25 questions, the step 5 author must fill:
| Field | Required | Notes |
|---|---|---|
| id | yes | k8s_<NNN> zero-padded (e.g. k8s_001) |
| question | yes | Natural-language question in the voice of a recruiter or developer |
| expected_answer_keywords | yes | 3–6 keywords that MUST appear in a correct answer; drives keyword_hit_rate |
| expected_sources | yes | List of .md filenames from SOURCES.md; ≥1 for scoped questions, [] for flavor-A false-premise |
| category | yes | retrieval / calculation / out_of_scope |
| difficulty | yes | easy / medium / hard |
| requires_calculator | yes | false for all K8s questions (no calc tool use expected) |
| reference_answer | yes | 1–3 sentence answer used by the optional LLM judge |
| question_type | yes | CRAG taxonomy value (exactly one of the 8 canonical strings) |
| time_sensitive | yes | bool; true on 2–3 questions |
| source_chunk_ids | yes | Content-hashed chunk IDs (stable across reindex); must be [] for flavor-A false-premise |
| source_snippets | yes | ~20 words verbatim per chunk; drift-detection field |
| source_pages | yes | Human-readable page anchor (e.g. "concepts/workloads/pods") |
| source_sections | yes | Deepest heading containing the snippet |
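As a sketch, the checklist can be turned into a per-question lint. It assumes questions are loaded as plain dicts rather than GoldenQuestion instances; checklist_problems and the constants are illustrative names, and only the constraints stated in the table are checked.

```python
import re

REQUIRED_FIELDS = (
    "id", "question", "expected_answer_keywords", "expected_sources",
    "category", "difficulty", "requires_calculator", "reference_answer",
    "question_type", "time_sensitive", "source_chunk_ids",
    "source_snippets", "source_pages", "source_sections",
)
CATEGORIES = {"retrieval", "calculation", "out_of_scope"}
DIFFICULTIES = {"easy", "medium", "hard"}
ID_RE = re.compile(r"k8s_\d{3}")  # k8s_<NNN>, zero-padded


def checklist_problems(q):
    """Return checklist violations for one question dict."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in q]
    if problems:
        return problems  # value checks need every field present
    if not ID_RE.fullmatch(q["id"]):
        problems.append("id must look like k8s_001")
    if q["category"] not in CATEGORIES:
        problems.append(f"unknown category: {q['category']}")
    if q["difficulty"] not in DIFFICULTIES:
        problems.append(f"unknown difficulty: {q['difficulty']}")
    if q["requires_calculator"] is not False:
        problems.append("requires_calculator must be false")
    if not 3 <= len(q["expected_answer_keywords"]) <= 6:
        problems.append("expected_answer_keywords must have 3-6 entries")
    if not isinstance(q["time_sensitive"], bool):
        problems.append("time_sensitive must be a bool")
    return problems
```

Running this over every record before the first evaluation pass catches missing fields early, when they are cheapest to fix.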
Deprecation note: The pilot schema has is_multi_hop: bool.
Step 5 may retire this field in favor of question_type == "multi_hop",
but only after confirming the evaluator's partial-credit logic
(agent_bench/evaluation/harness.py:38) is updated to read from
question_type. Do NOT remove is_multi_hop without the
corresponding harness update, or existing pilot questions will
break partial-credit scoring.
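One way to stage this migration is a helper that prefers question_type and falls back to the pilot-era flag. This is a hypothetical transition shim, not what harness.py currently does at the referenced line; it assumes questions are accessible as dicts.

```python
def is_multi_hop(question):
    """Prefer question_type; fall back to the pilot-era is_multi_hop bool."""
    qtype = question.get("question_type")
    if qtype is not None:
        return qtype == "multi_hop"
    # Pilot records authored before question_type existed.
    return bool(question.get("is_multi_hop", False))
```

With a shim like this in the partial-credit path, pilot questions that only carry is_multi_hop keep scoring as before while new questions rely on question_type alone.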
Pilot-first validation before step 5 authoring
Before writing the 25 questions, the step 5 author must:
- Confirm the 20 new pages from step 4 are ingested and reachable via the pipeline (smoke-query test per SOURCES.md's post-ingest validation).
- Re-run make evaluate on the existing 6-question pilot dataset against the newly expanded corpus. The pilot's existing questions must still pass their per-question gates; if adding 20 new pages drops pilot P@5 materially, investigate before adding more questions on top.
- Hand-draft 2–3 questions first, run them through the pipeline, and confirm retrieval surfaces the expected chunks. This is the final pilot-first checkpoint before bulk authoring.
Only after these three checks pass does the step 5 author proceed to the full 25-question authoring session.
Post-authoring observations (step 5 shipped 2026-04-14)
Pilot → full generalization numbers: pilot (6Q) P@5=0.80, R@5=1.00, KHR=0.81 → full (25Q post-fix) P@5=0.83, R@5=0.96, KHR=0.90. R@5 movement is within expected variance when corpus breadth expands from 8 → 28 pages; the KHR jump from 0.81 → 0.90 is an open question. The 25Q distribution may skew toward questions where the golden keyword set is more readily satisfied (simple + simple_w_condition + set together = 11/25 questions with short, high-precision expected answers), versus the pilot's retrieval-heavy mix. Worth revisiting if KHR drifts on future corpora: if the jump is consistent across datasets, it is an authoring signal that the keyword set should be tightened for CRAG type parity.
Flavor-B reproducibility finding: k8s_022 (RBAC deny rules) and pilot_005 (NetworkPolicy mTLS) both produce refusal-phrased answers when the documented negative is in retrieved context. Two independent reproductions confirm the LLM-hedges-on-documented-negative pattern is a class of failure mode, not a one-off, which strengthens the case for the deferred Fix 2 + targeted prompt guidance stacked experiment. Authoring itself is clean on both: retrieval surfaces the expected chunks, citation accuracy 1.00, snippets verify against chunk IDs.