agentbench / data /k8s_docs /QUESTION_PLAN.md

Nomearod

docs: step 5 follow-up — parallel-tracks list + post-authoring observations

05bf702 about 1 month ago

preview code

raw

history blame contribute delete

13.5 kB

K8s Golden Dataset — Question Plan

Status: Structural guide for Week 1 step 5 authoring (v1.1 plan). This document defines the 25-question target distribution, per-type source-page mapping, and authoring constraints. It does NOT contain the 25 specific question texts — those are authored during step 5 in a fresh session, per cross-cutting #8 pilot-first discipline.

Upstream contracts:

Taxonomy: CRAG 8-type (Yang et al., NeurIPS 2024) — see DECISIONS.md "K8s golden dataset uses CRAG's 8-type taxonomy as the schema".
Source pages: see SOURCES.md (28 pages, category-locked; 8 already pulled, 20 to pull at step 4).
Schema: see agent_bench/evaluation/harness.py GoldenQuestion plus the v1.1 plan's methodology #3 source-attribution fields.
Flavor A/B for false_premise: see DECISIONS.md "False-premise questions come in two flavors".

Target distribution (25 questions total)

CRAG type	Count	Schema field	Notes
`simple`	5–6	`question_type: "simple"`	Baseline retrieval: direct lookup in 1 page, 1–2 sentence answer.
`simple_w_condition`	3–4	`question_type: "simple_w_condition"`	Answer depends on a condition stated in the question (enforcement level, volume type, Pod phase).
`comparison`	3–4	`question_type: "comparison"`	Answer compares two concepts across 2 pages; reranker stress.
`multi_hop`	5–6	`question_type: "multi_hop"`	Answer synthesizes 2–4 pages; reranker-stressing by construction.
`false_premise`	3–4	`question_type: "false_premise"`	Grounded refusal stress. Flavor A (pure refusal) + flavor B (documented negative).
`set` / `aggregation` / `post_processing_heavy`	0–3	respective values	Optional. Include only if natural from corpus content.
Total	25
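A minimal sketch of how the target ranges in the table above could be checked mechanically once the 25 questions exist. The function name `check_distribution` and the dict-per-question shape are assumptions for illustration, not part of the harness; the optional row is read here as a combined 0 to 3 across the three types, which is one interpretation of the table.

```python
from collections import Counter

# Ranges copied from the target distribution table.
TYPE_RANGES = {
    "simple": (5, 6),
    "simple_w_condition": (3, 4),
    "comparison": (3, 4),
    "multi_hop": (5, 6),
    "false_premise": (3, 4),
}
# Assumption: the 0-3 budget is shared by the three optional types.
OPTIONAL_TYPES = {"set", "aggregation", "post_processing_heavy"}
OPTIONAL_MAX = 3
TOTAL = 25


def check_distribution(questions):
    """Return a list of violations; an empty list means the plan is met."""
    problems = []
    counts = Counter(q["question_type"] for q in questions)

    if len(questions) != TOTAL:
        problems.append(f"total is {len(questions)}, expected {TOTAL}")

    for qtype, (low, high) in TYPE_RANGES.items():
        if not low <= counts[qtype] <= high:
            problems.append(f"{qtype}: {counts[qtype]} outside {low}-{high}")

    optional = sum(counts[t] for t in OPTIONAL_TYPES)
    if optional > OPTIONAL_MAX:
        problems.append(f"optional types: {optional} exceeds {OPTIONAL_MAX}")

    for qtype in sorted(set(counts) - set(TYPE_RANGES) - OPTIONAL_TYPES):
        problems.append(f"unknown question_type: {qtype}")

    return problems
```

One arithmetic consequence of the table as written: the upper bounds of the five required types sum to 24, so a set that stays inside every range and totals 25 needs at least one question from the optional row.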

Orthogonal flag: time_sensitive: bool on 2–3 questions. Does NOT replace question_type — it's an independent property for version-bounded content (feature state, API version migration, deprecations).

Per-type source-page mapping

Each row identifies the K8s concept pages a question of that type should draw from. Multi-hop and comparison questions list multiple pages intentionally.

simple (5–6 slots)

Pool questions where a 1–2 sentence answer lives inside a single page.

Candidate source	CRAG slot justification
`k8s_pods.md`	Pod IP semantics, container sharing, ephemeral containers
`k8s_deployment.md`	What a Deployment is, declarative update mechanic
`k8s_configmap.md`	What a ConfigMap is, immutable field
`k8s_secret.md`	What a Secret is, volume mount modes
RBAC Authorization (step 4 page)	RBAC primitive definitions (Role, RoleBinding, ClusterRole)
StatefulSet (step 4 page)	StatefulSet identity guarantees
DaemonSet (step 4 page)	One-per-node scheduling contract
Namespaces (step 4 page)	Namespace scoping for resources

Authoring rule: Each simple question must have exactly one expected source page and 1–2 source snippets. KHR target ≥ 0.60 on the authored keywords.
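To make the KHR target concrete, here is a sketch of one plausible keyword hit rate: the fraction of authored keywords that appear in the answer as case-insensitive substrings. This definition is an assumption; the metric in agent_bench/evaluation/harness.py may tokenize or normalize differently, so treat the function below as illustrative only.

```python
def keyword_hit_rate(answer: str, keywords: list[str]) -> float:
    """Fraction of keywords found in the answer (case-insensitive substring).

    Assumed definition for illustration; not taken from the harness.
    """
    if not keywords:
        return 0.0
    text = answer.casefold()
    hits = sum(1 for kw in keywords if kw.casefold() in text)
    return hits / len(keywords)


def meets_khr_target(answer: str, keywords: list[str], target: float = 0.60) -> bool:
    """True when the answer reaches the authoring target for simple questions."""
    return keyword_hit_rate(answer, keywords) >= target
```

With 3 to 6 keywords per question, the 0.60 target means an answer can miss at most one keyword out of 3 or 4, and at most two out of 5 or 6.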

simple_w_condition (3–4 slots)

Pool questions where the answer explicitly depends on a condition named in the question.

Candidate source	Condition that shapes the answer
`k8s_pod_security_admission.md`	enforcement level: `enforce` / `audit` / `warn`
`k8s_secret.md`	mount mode: environment variable vs file in volume
Liveness/Readiness/Startup Probes (step 4)	probe type: liveness vs readiness vs startup
Volumes (step 4)	volume type: emptyDir vs configMap vs persistentVolumeClaim
Node-pressure Eviction (`k8s_node_pressure_eviction.md`)	resource under pressure: memory vs disk vs inodes

Authoring rule: The condition must be named in the question stem, not implied. The expected answer must change materially if the condition flips. Example: "How is a Secret mounted as a volume versus consumed as an environment variable?" is a valid simple_w_condition; "How is a Secret mounted?" is simple.

comparison (3–4 slots)

Pool questions where the answer explicitly compares two K8s concepts that span 2 pages.

Page pair	Concept compared
Deployment vs StatefulSet (step 4)	stateless vs stateful workload semantics
Deployment vs DaemonSet (step 4)	replica-count vs one-per-node scheduling
ConfigMap vs Secret	non-confidential vs confidential data, mount parity
Service vs Ingress (step 4)	L4 vs L7 exposure
Taints/Tolerations vs Node Affinity (step 4)	opt-out vs opt-in placement
Liveness vs Readiness probes (step 4)	restart vs traffic-routing semantics

Authoring rule: The question must force retrieval from both pages. Reranker stress is intentional — questions where BM25 would find one side but miss the other are the target. Expected sources: 2 pages minimum.

multi_hop (5–6 slots)

Pool questions where the answer synthesizes 2–4 pages. These are the primary reranker stressors.

Page set (example)	Hop path
Pod + Service + Ingress (step 4)	How external traffic reaches a Pod through Service → Ingress
Deployment + ReplicaSet + Pod	How a Deployment rollout changes the underlying ReplicaSet and Pod set
ConfigMap + Deployment	How a ConfigMap update propagates to Pods via env vars or mounted volume
HPA + Deployment + Metrics Server (partial step 4)	How HPA reads metrics and scales a Deployment
NetworkPolicy + Pod + Namespace (partial step 4)	How NetworkPolicy selectors resolve across namespaces
Job + Pod + Container lifecycle (partial step 4)	How a Job's completions and parallelism interact with Pod restart policy

Authoring rule: Expected sources ≥ 2 pages. The question must not be answerable from any single page alone. source_chunk_ids must list at least one chunk from each expected page; partial credit is granted in the evaluator if at least one expected chunk is cited (see agent_bench/evaluation/harness.py).
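The coverage requirement in this rule can be checked at authoring time. The sketch below assumes a hypothetical `chunk_to_page` mapping from chunk ID to source filename (the plan does not say how chunk IDs resolve to pages), and the function names are illustrative rather than harness APIs.

```python
def uncovered_pages(expected_sources, source_chunk_ids, chunk_to_page):
    """Return expected pages that have no chunk listed in source_chunk_ids."""
    covered = {chunk_to_page[c] for c in source_chunk_ids if c in chunk_to_page}
    return [page for page in expected_sources if page not in covered]


def check_multi_hop(question, chunk_to_page):
    """Return authoring-rule violations for one multi_hop question."""
    problems = []
    sources = question["expected_sources"]
    if len(sources) < 2:
        problems.append("multi_hop needs at least 2 expected sources")
    for page in uncovered_pages(sources, question["source_chunk_ids"], chunk_to_page):
        problems.append(f"no chunk listed for {page}")
    return problems
```

This is stricter than the evaluator's partial-credit behavior described above: authoring demands one chunk per expected page, while scoring only needs one expected chunk to be cited.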

false_premise (3–4 slots)

Pool questions whose premise is wrong. Split across two flavors:

Flavor A — pure refusal (at least 1 slot):

Premise targets a capability that does not exist in the K8s corpus (not in any pulled page).
Example seed: "How do I configure Claude API rate limits in a Kubernetes Deployment?" (wrong domain — Claude API is not a K8s concept)
Schema: category: "out_of_scope", expected_sources: [], source_snippets: [].
Evaluator expectation: answer contains refusal phrasing AND cites zero sources.

Flavor B — documented negative (at least 1 slot, ideally 2):

Corpus contains an explicit negative statement (e.g. NetworkPolicy "Anything TLS related" limitation at chunk 63 of k8s_network_policies.md).
Example already in pilot: k8s_pilot_005 (NetworkPolicy mTLS).
Schema: category: "retrieval", question_type: "false_premise", expected_sources: [<negative-answer page>], source_snippets: [<verbatim negative statement>].
Evaluator expectation: answer reports the documented negative with citation, does NOT open with "the documentation does not provide instructions" phrasing (per pilot_005 Fix 1 + Fix 2 revert analysis).

Other flavor-B candidate pages for authoring:

Pod Security Standards — explicit statements about what each profile does NOT permit
RBAC Authorization — explicit statements about what RBAC does NOT provide (e.g. no deny rules)
NetworkPolicy — additional negative clauses beyond the pilot_005 mTLS one
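The two flavor schemas above differ only in a few fields, so a small validator can catch mix-ups. This is a sketch under the assumption that each question is a plain dict and that flavor A records also carry question_type "false_premise"; `check_false_premise` is a hypothetical helper, not a harness function.

```python
def check_false_premise(q):
    """Return schema violations for a false_premise question (flavor A or B)."""
    if q.get("question_type") != "false_premise":
        return ["question_type is not false_premise"]

    problems = []
    category = q.get("category")
    if category == "out_of_scope":
        # Flavor A: pure refusal, nothing in the corpus to cite.
        for field in ("expected_sources", "source_snippets", "source_chunk_ids"):
            if q.get(field):
                problems.append(f"flavor A requires empty {field}")
    elif category == "retrieval":
        # Flavor B: documented negative, must point at the negative statement.
        for field in ("expected_sources", "source_snippets"):
            if not q.get(field):
                problems.append(f"flavor B requires non-empty {field}")
    else:
        problems.append(f"unexpected category for false_premise: {category}")
    return problems
```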

set / aggregation / post_processing_heavy (0–3 slots)

Include only if a K8s page naturally supports the pattern:

set: "Which Kubernetes resources can expose a Service?" (answer is a set drawn from the Service page). Include 0–1 of this type if a clean example emerges; otherwise leave slot empty.
aggregation: Unlikely to fit K8s docs (docs describe concepts, not tabular data). Likely leave empty.
post_processing_heavy: Unlikely to fit K8s docs. Likely leave empty.

Default: author none of these optional types (treat the 0–3 range as 0). Only author these if a question emerges organically during step 5. Do not force-author to hit a target count; the plan explicitly says "0–3, included only where corpus content naturally supports".

`time_sensitive` flag placement (2–3 questions)

Flag questions whose correct answer depends on K8s version state:

Candidate	Why time-sensitive
HPA API version	`autoscaling/v1` vs `autoscaling/v2` — v2 stable since 1.23
Pod Security Admission stability	"stable as of v1.25" — feature state in the page
PodSecurityPolicy removal	PSP removed in 1.25; migration path to PSA

Authoring rule: Set time_sensitive: true on exactly 2–3 questions. Distribute across ≥2 different CRAG types (e.g. one simple, one simple_w_condition) so the flag is not concentrated in a single type. Each time_sensitive question must cite a specific K8s version or feature state in the source snippet, otherwise the flag is not load-bearing.
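The three parts of this rule (count, spread across types, version evidence in the snippet) can be sketched as a single check. The regular expression is a rough heuristic chosen for illustration: it accepts version strings such as "v1.25" or "1.23" and a handful of feature-state words, and it will not recognize every valid phrasing. `check_time_sensitive` is a hypothetical helper.

```python
import re

# Heuristic only: a dotted version number or a common feature-state word.
VERSION_OR_STATE = re.compile(
    r"\bv?\d+\.\d+\b|\b(?:alpha|beta|stable|deprecated|removed)\b",
    re.IGNORECASE,
)


def check_time_sensitive(questions):
    """Return violations of the time_sensitive authoring rule."""
    flagged = [q for q in questions if q.get("time_sensitive")]
    problems = []

    if not 2 <= len(flagged) <= 3:
        problems.append(f"{len(flagged)} questions flagged, expected 2-3")

    if len({q["question_type"] for q in flagged}) < 2:
        problems.append("flag must span at least 2 question types")

    for q in flagged:
        snippets = q.get("source_snippets", [])
        if not any(VERSION_OR_STATE.search(s) for s in snippets):
            problems.append(f"{q['id']}: no version or feature state in source_snippets")

    return problems
```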

Difficulty distribution

Loose guidance, not a hard constraint:

easy: 8–10 questions — mostly simple and single-page simple_w_condition
medium: 10–12 questions — comparison, most multi_hop, straightforward false_premise
hard: 4–6 questions — deep multi_hop, flavor-B false_premise, time_sensitive + multi_hop combinations

The pilot's 6-question set is all easy/medium. Step 5 should add the hard tier.

Authoring checklist (per question)

For each of the 25 questions, the step 5 author must fill:

Field	Required	Notes
`id`	yes	`k8s_<NNN>` zero-padded (e.g. `k8s_001`)
`question`	yes	Natural-language question in the voice of a recruiter or developer
`expected_answer_keywords`	yes	3–6 keywords that MUST appear in a correct answer; drives `keyword_hit_rate`
`expected_sources`	yes	List of `.md` filenames from `SOURCES.md`; ≥1 for scoped questions, `[]` for flavor-A false-premise
`category`	yes	`retrieval` / `calculation` / `out_of_scope`
`difficulty`	yes	`easy` / `medium` / `hard`
`requires_calculator`	yes	`false` for all K8s questions (no calc tool use expected)
`reference_answer`	yes	1–3 sentence answer used by the optional LLM judge
`question_type`	yes	CRAG taxonomy value (exactly one of the 8 canonical strings)
`time_sensitive`	yes	`bool`; `true` on exactly 2–3 questions
`source_chunk_ids`	yes	Content-hashed chunk IDs (stable across reindex); must be `[]` for flavor-A false-premise
`source_snippets`	yes	~20 words verbatim per chunk; drift-detection field
`source_pages`	yes	Human-readable page anchor (e.g. `"concepts/workloads/pods"`)
`source_sections`	yes	Deepest heading containing the snippet
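The checklist above lends itself to a per-record lint. The sketch below covers only the constraints the table states outright; it assumes records are plain dicts and that `k8s_<NNN>` means exactly three digits, as in the `k8s_001` example. It is not the `GoldenQuestion` definition from the harness, and pilot IDs such as k8s_pilot_005 are intentionally outside its ID pattern.

```python
import re

REQUIRED_FIELDS = (
    "id", "question", "expected_answer_keywords", "expected_sources",
    "category", "difficulty", "requires_calculator", "reference_answer",
    "question_type", "time_sensitive", "source_chunk_ids",
    "source_snippets", "source_pages", "source_sections",
)
QUESTION_TYPES = {
    "simple", "simple_w_condition", "comparison", "multi_hop",
    "false_premise", "set", "aggregation", "post_processing_heavy",
}
CATEGORIES = {"retrieval", "calculation", "out_of_scope"}
DIFFICULTIES = {"easy", "medium", "hard"}
ID_PATTERN = re.compile(r"k8s_\d{3}")  # assumption: three zero-padded digits


def check_record(q: dict) -> list[str]:
    """Return checklist violations for one step 5 question record."""
    missing = [f for f in REQUIRED_FIELDS if f not in q]
    if missing:
        return [f"missing field: {f}" for f in missing]

    problems = []
    if not ID_PATTERN.fullmatch(q["id"]):
        problems.append("id must look like k8s_001")
    if q["question_type"] not in QUESTION_TYPES:
        problems.append(f"unknown question_type: {q['question_type']}")
    if q["category"] not in CATEGORIES:
        problems.append(f"unknown category: {q['category']}")
    if q["difficulty"] not in DIFFICULTIES:
        problems.append(f"unknown difficulty: {q['difficulty']}")
    if q["requires_calculator"] is not False:
        problems.append("requires_calculator must be false for K8s questions")
    if not 3 <= len(q["expected_answer_keywords"]) <= 6:
        problems.append("expected_answer_keywords must have 3-6 entries")
    if not isinstance(q["time_sensitive"], bool):
        problems.append("time_sensitive must be a bool")
    return problems
```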

Deprecation note: The pilot schema has is_multi_hop: bool. Step 5 may retire this field in favor of question_type == "multi_hop", but only after confirming the evaluator's partial-credit logic (agent_bench/evaluation/harness.py:38) is updated to read from question_type. Do NOT remove is_multi_hop without the corresponding harness update, or existing pilot questions will break partial-credit scoring.
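One way to stage this migration without breaking pilot records is a compatibility read that prefers question_type and falls back to the legacy flag. The sketch below is illustrative only; it does not reproduce the partial-credit logic in harness.py, and `is_multi_hop_question` is a hypothetical name.

```python
def is_multi_hop_question(q: dict) -> bool:
    """Decide multi-hop status during the is_multi_hop -> question_type migration.

    Prefers the new question_type field; falls back to the legacy
    is_multi_hop bool for pilot records that predate the taxonomy.
    """
    question_type = q.get("question_type")
    if question_type is not None:
        return question_type == "multi_hop"
    return bool(q.get("is_multi_hop", False))
```

With a read like this in place, existing pilot questions keep their partial-credit treatment while new records rely on question_type alone; the legacy field can be dropped once no record depends on the fallback.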

Pilot-first validation before step 5 authoring

Before writing the 25 questions, the step 5 author must:

Confirm the 20 new pages from step 4 are ingested and reachable via the pipeline (smoke-query test per SOURCES.md's post-ingest validation).
Re-run make evaluate on the existing 6-question pilot dataset against the newly-expanded corpus. The pilot's existing questions must still pass their per-question gates — if adding 20 new pages drops pilot P@5 materially, investigate before adding more questions on top.
Hand-draft 2–3 questions first, run them through the pipeline, and confirm retrieval surfaces the expected chunks. This is the final pilot-first checkpoint before bulk authoring.

Only after these three checks pass does the step 5 author proceed to the full 25-question authoring session.
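For the hand-drafted questions in the third check, "retrieval surfaces the expected chunks" can be reduced to a simple top-k membership test. The sketch assumes the pipeline can return an ordered list of retrieved chunk IDs per question; how that list is obtained is outside this plan, and `missing_expected_chunks` is a hypothetical helper.

```python
def missing_expected_chunks(retrieved_ids, expected_ids, k=5):
    """Return expected chunk IDs that do not appear in the top-k retrieved IDs."""
    top_k = set(retrieved_ids[:k])
    return [chunk_id for chunk_id in expected_ids if chunk_id not in top_k]
```

An empty result for each hand-drafted question is the signal to proceed; a non-empty result names the chunks to investigate before bulk authoring.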

Post-authoring observations (step 5 shipped 2026-04-14)

Pilot→full generalization numbers: pilot (6Q) P@5=0.80, R@5=1.00, KHR=0.81 → full (25Q post-fix) P@5=0.83, R@5=0.96, KHR=0.90. The R@5 drop from 1.00 to 0.96 is within expected variance when corpus breadth expands from 8 → 28 pages. The KHR jump from 0.81 → 0.90 is an open question: the 25Q distribution may skew toward questions where the golden keyword set is more readily satisfied (simple + simple_w_condition + set together = 11/25 questions with short, high-precision expected answers), versus the pilot's retrieval-heavy mix. Worth revisiting if KHR drifts on future corpora: if the gap is consistent across datasets, it is an authoring signal that the keyword set should be tightened for CRAG type parity.

Flavor-B reproducibility finding: k8s_022 (RBAC deny rules) and pilot_005 (NetworkPolicy mTLS) both produce refusal-phrased answers when the documented negative is in retrieved context. Two independent reproductions confirm that the LLM-hedges-on-documented-negative pattern is a recurring failure class, not a one-off, which strengthens the case for the deferred Fix 2 + targeted prompt guidance stacked experiment. Authoring itself is clean on both: retrieval surfaces the expected chunks, citation accuracy is 1.00, and snippets verify against chunk IDs.