# Kubernetes Corpus Sources
**Status:** Locked. 28 pages pulled via `defuddle parse` and verified
against the 25-question `QUESTION_PLAN.md` mapping. Pilot-first
smoke-query validation on the rebuilt store confirmed retrieval returns
expected chunks for 5 representative queries (StatefulSet, HPA,
node-pressure eviction, Service routing, Pod Security enforcement).
**Target:** ~25–30 markdown files from kubernetes.io/docs, achieved
at 28 pages. Supports 25 golden questions at ~1 question per page
with 3 pages of headroom for multi-hop fan-out.
**Content license:** All kubernetes.io/docs content is licensed under
[CC BY 4.0](https://git.k8s.io/website/LICENSE). All 28 pulled pages
fall under the site default license; no per-page exceptions encountered.
## Scope
**Include:**
- Core workload concepts: Pod, Deployment, StatefulSet, DaemonSet,
Job, CronJob, ReplicaSet, Init Containers, Pod Lifecycle
- Networking: Service, Ingress, NetworkPolicy, EndpointSlice, DNS
- Config + state: ConfigMap, Secret, Volumes, PersistentVolumes,
Namespaces
- Scheduling + resources: Resource Management, Node Assignment,
Taints and Tolerations, Node-pressure Eviction
- Access control: RBAC Authorization
- Health + autoscaling: Liveness/Readiness/Startup Probes,
Horizontal Pod Autoscaling
- Security: Pod Security Admission, Pod Security Standards
**Exclude:**
- Cluster administration deep-dives (etcd, kubelet, kube-apiserver
internals): wrong audience for a recruiter-facing demo
- Tutorials (long-form, chunk poorly, hurt retrieval precision)
- kubectl command reference and API reference: wrong shape for RAG,
better served by `--help`
- Release notes and version history: no lasting value for Q&A
## Curation policy
This corpus targets **recruiter-likely questions**, not coverage. A
question about etcd Raft internals will be correctly refused; the
refusal mechanism is part of the demo story, not a failure mode.
Each ingested page has:
- A canonical kubernetes.io/docs URL (source of truth, for re-scraping
if content drifts)
- A date pulled (provenance, for audit)
- A one-line rationale (why this page is in scope)
- License confirmation (default CC BY 4.0)
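
The per-page metadata listed above could be captured in a small record type. The following is a minimal sketch rather than the project's actual schema: the `SourceRecord` class, its field names, and the example rationale string are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical record type; the class and field names are illustrative and
# are not taken from the repository.
@dataclass(frozen=True)
class SourceRecord:
    file: str                   # local markdown filename
    url: str                    # canonical kubernetes.io/docs URL
    pulled: date                # date the page was pulled
    rationale: str              # one-line reason the page is in scope
    license: str = "CC BY 4.0"  # site default license

    def is_canonical(self) -> bool:
        """Return True when the URL points at kubernetes.io/docs."""
        return self.url.startswith("https://kubernetes.io/docs/")


# Example values: file, URL, and date come from the pulled-pages table;
# the rationale text is made up for illustration.
record = SourceRecord(
    file="k8s_configmap.md",
    url="https://kubernetes.io/docs/concepts/configuration/configmap/",
    pulled=date(2026, 3, 24),
    rationale="Config + state: ConfigMap",
)
```

Freezing the dataclass keeps a record from being mutated after it is logged, which suits an audit trail.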
## Locked category breakdown
| Category | Pages | Rationale |
|---|---|---|
| Core workloads | 9 | Pod, Pod Lifecycle, Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Init Containers. Reranker-stressing multi-hop questions draw on 2–4 of these per question. |
| Networking | 5 | Service, Ingress, NetworkPolicy, EndpointSlice, DNS for Services and Pods. NetworkPolicy is the pilot_005 flavor-B false_premise target. |
| Config + state | 5 | ConfigMap, Secret, Volumes, Persistent Volumes, Namespaces. Supports `simple_w_condition` questions where the answer depends on configuration context. |
| Scheduling + resources | 4 | Resource Management, Assigning Pods to Nodes, Taints and Tolerations, Node-pressure Eviction. Good source for `comparison` and `time_sensitive` questions. |
| Access control | 1 | RBAC Authorization. Supports 1–2 `simple` questions about RBAC primitives. |
| Health + autoscaling | 2 | Probes, Horizontal Pod Autoscaling. HPA is a `time_sensitive` candidate (autoscaling/v2 stable state). |
| Security | 2 | Pod Security Admission, Pod Security Standards. PSA is the `simple_w_condition` stressor where the answer depends on enforcement level. |
| **Total** | **28** | Supports 25 questions with 3 pages of headroom. |
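
As a quick arithmetic check on the table above, the per-category counts can be summed and compared against the question budget. This is only a sketch; the variable names are illustrative.

```python
# Page counts copied from the category table above.
CATEGORY_PAGES = {
    "Core workloads": 9,
    "Networking": 5,
    "Config + state": 5,
    "Scheduling + resources": 4,
    "Access control": 1,
    "Health + autoscaling": 2,
    "Security": 2,
}
GOLDEN_QUESTIONS = 25  # size of the golden question set

total_pages = sum(CATEGORY_PAGES.values())
headroom = total_pages - GOLDEN_QUESTIONS  # pages left over for multi-hop fan-out
```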
## Pulled pages (all 28)
All pages pulled via `defuddle parse <url> --md -o data/k8s_docs/<file>.md`.
| File | Category | URL | Date pulled | Pilot evidence |
|---|---|---|---|---|
| `k8s_configmap.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/configmap/` | 2026-03-24 (pilot) | - |
| `k8s_deployment.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/deployment/` | 2026-03-24 (pilot) | - |
| `k8s_network_policies.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/network-policies/` | 2026-03-24 (pilot) | **pilot_005 flavor-B target**: chunk_index 63 contains "Anything TLS related (use a service mesh or ingress controller for this)" |
| `k8s_node_pressure_eviction.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/` | 2026-03-24 (pilot) | - |
| `k8s_pod_security_admission.md` | Security | `https://kubernetes.io/docs/concepts/security/pod-security-admission/` | 2026-03-24 (pilot) | - |
| `k8s_pods.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/` | 2026-03-24 (pilot) | pilot_001 target (Pod IP + localhost communication) |
| `k8s_replicaset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/` | 2026-03-24 (pilot) | - |
| `k8s_secret.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/secret/` | 2026-03-24 (pilot) | - |
| `k8s_pod_lifecycle.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/` | 2026-04-14 | step 4 |
| `k8s_statefulset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/` | 2026-04-14 | step 4 |
| `k8s_daemonset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/` | 2026-04-14 | step 4 |
| `k8s_job.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/job/` | 2026-04-14 | step 4 |
| `k8s_cronjob.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/` | 2026-04-14 | step 4 |
| `k8s_init_containers.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/init-containers/` | 2026-04-14 | step 4 |
| `k8s_service.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/service/` | 2026-04-14 | step 4 |
| `k8s_ingress.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/ingress/` | 2026-04-14 | step 4 |
| `k8s_endpoint_slices.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/` | 2026-04-14 | step 4 |
| `k8s_dns.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/` | 2026-04-14 | step 4 |
| `k8s_volumes.md` | Config + state | `https://kubernetes.io/docs/concepts/storage/volumes/` | 2026-04-14 | step 4 |
| `k8s_persistent_volumes.md` | Config + state | `https://kubernetes.io/docs/concepts/storage/persistent-volumes/` | 2026-04-14 | step 4 |
| `k8s_namespaces.md` | Config + state | `https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/` | 2026-04-14 | step 4 |
| `k8s_resource_management.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/` | 2026-04-14 | step 4 |
| `k8s_assign_pod_node.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/` | 2026-04-14 | step 4 |
| `k8s_taints_tolerations.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/` | 2026-04-14 | step 4 |
| `k8s_rbac.md` | Access control | `https://kubernetes.io/docs/reference/access-authn-authz/rbac/` | 2026-04-14 | step 4 |
| `k8s_probes.md` | Health + autoscaling | `https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/` | 2026-04-14 | step 4 |
| `k8s_hpa.md` | Health + autoscaling | `https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/` | 2026-04-14 | step 4 |
| `k8s_pod_security_standards.md` | Security | `https://kubernetes.io/docs/concepts/security/pod-security-standards/` | 2026-04-14 | step 4 |
**Pull tool:** [defuddle](https://github.com/kepano/defuddle) CLI v0.16.0
(`defuddle parse <url> --md -o <file>`). Defuddle extracts the main
content region of kubernetes.io/docs pages and renders clean markdown
with inline links preserved. The output format matches the 8 pilot pages
exactly, so no per-file normalization was needed.
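
For re-pulling pages, the documented command line can be assembled programmatically. The sketch below uses only the `defuddle parse <url> --md -o <file>` form shown above; the helper names `build_pull_command` and `pull_page` are hypothetical and not part of the repository.

```python
import subprocess
from pathlib import Path

CORPUS_DIR = Path("data/k8s_docs")  # output directory used in the pull command above


def build_pull_command(url: str, filename: str) -> list[str]:
    """Return the argv for the documented defuddle invocation (hypothetical helper)."""
    return ["defuddle", "parse", url, "--md", "-o", (CORPUS_DIR / filename).as_posix()]


def pull_page(url: str, filename: str) -> None:
    """Run defuddle for one page; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_pull_command(url, filename), check=True)


# Building the command does not require defuddle to be installed.
cmd = build_pull_command(
    "https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/",
    "k8s_statefulset.md",
)
```

Passing the arguments as a list avoids shell quoting problems if a URL ever contains special characters.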
**URL verification:** All 20 step-4 URLs resolved without redirect
(defuddle followed the URL as given and produced non-empty output;
any 404 or redirect would have produced a 0-byte file, which none
did; file lengths range from 115 to 917 lines).
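
The non-empty-output check described above could be automated along these lines. This is a sketch under the stated assumption that a failed pull leaves a 0-byte file; `verify_pulled_files` is a hypothetical helper, and the demo runs against a temporary directory rather than the real corpus.

```python
import tempfile
from pathlib import Path


def verify_pulled_files(corpus_dir: Path, filenames: list[str]) -> dict[str, int]:
    """Return the line count of each file; raise if any is missing or 0 bytes."""
    counts = {}
    for name in filenames:
        path = corpus_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            raise ValueError(f"{name}: missing or 0-byte output")
        counts[name] = len(path.read_text(encoding="utf-8").splitlines())
    return counts


# Demo on a throwaway directory with one small, made-up file.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "k8s_demo.md").write_text("# Title\n\nBody\n", encoding="utf-8")
    demo_counts = verify_pulled_files(demo_dir, ["k8s_demo.md"])
```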
## Ingestion
```bash
make ingest-k8s
```
This populates `.cache/store_k8s/` with embeddings + BM25 index
matching the FastAPI corpus's chunker settings (recursive, 512-token
chunks, 64-token overlap). Current state: **2447 chunks across 28
unique sources**.
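
To illustrate what a 512-token chunk size with 64-token overlap means, here is a simplified fixed-window sketch. It is not the recursive chunker the pipeline uses, and the `chunk_tokens` function and the synthetic token list are assumptions for illustration only.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new window starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


# Synthetic input: 1000 placeholder tokens.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

With these settings, consecutive windows start 448 tokens apart, so the last 64 tokens of one chunk reappear at the start of the next.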
**Ingest hygiene:** `scripts/ingest.py` excludes `SOURCES.md`,
`QUESTION_PLAN.md`, and `README.md` from the corpus; these are
version-controlled curation artifacts, not content.
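
A filename-based exclusion like the one described above might look as follows. This is a hedged sketch and not the actual code in `scripts/ingest.py`; the `select_corpus_files` helper and the `EXCLUDED_NAMES` constant are hypothetical.

```python
from pathlib import Path

# Curation artifacts named in the text above; they live beside the corpus
# files but should not be ingested.
EXCLUDED_NAMES = {"SOURCES.md", "QUESTION_PLAN.md", "README.md"}


def select_corpus_files(paths: list[Path]) -> list[Path]:
    """Keep markdown files whose name is not a curation artifact, sorted by path."""
    return sorted(p for p in paths if p.suffix == ".md" and p.name not in EXCLUDED_NAMES)


candidates = [
    Path("data/k8s_docs") / name
    for name in ("SOURCES.md", "k8s_pods.md", "README.md", "k8s_hpa.md", "QUESTION_PLAN.md")
]
selected = [p.name for p in select_corpus_files(candidates)]
```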
## Post-ingest smoke-query validation
Per cross-cutting #8 pilot-first discipline, 5 representative queries
were run against the rebuilt store to confirm retrieval works before
step 5 golden-set authoring:
| Query | Top-1 source | Expected | Verdict |
|---|---|---|---|
| "what is a StatefulSet" | `k8s_statefulset.md` | `k8s_statefulset.md` | ✓ |
| "how does HPA scale replicas" | `k8s_hpa.md` | `k8s_hpa.md` | ✓ |
| "Pod evicted node pressure" | `k8s_pod_lifecycle.md` | `k8s_node_pressure_eviction.md` or `k8s_pod_lifecycle.md` | ✓ (either acceptable; eviction is covered in both) |
| "Service route traffic to Pods" | `k8s_service.md` | `k8s_service.md` | ✓ |
| "enforce Pod Security Standards" | `k8s_pod_security_admission.md` | `k8s_pod_security_admission.md` or `k8s_pod_security_standards.md` | ✓ (PSA is the enforcement mechanism and PSS defines the levels, so both are valid hits) |
All 5 queries return a top-1 hit from an expected page, with no
unexpected refusals and no noise from irrelevant pages. The store is
ready for step 5.
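
The pass/fail logic of the smoke check can be expressed as a small comparison of observed top-1 sources against the set of acceptable sources per query. The sketch below hard-codes the values from the table above instead of calling the retriever, and `failed_queries` is a hypothetical helper name.

```python
# Acceptable sources per query, copied from the "Expected" column above.
EXPECTED = {
    "what is a StatefulSet": {"k8s_statefulset.md"},
    "how does HPA scale replicas": {"k8s_hpa.md"},
    "Pod evicted node pressure": {"k8s_node_pressure_eviction.md", "k8s_pod_lifecycle.md"},
    "Service route traffic to Pods": {"k8s_service.md"},
    "enforce Pod Security Standards": {
        "k8s_pod_security_admission.md",
        "k8s_pod_security_standards.md",
    },
}

# Top-1 sources copied from the "Top-1 source" column above.
OBSERVED_TOP1 = {
    "what is a StatefulSet": "k8s_statefulset.md",
    "how does HPA scale replicas": "k8s_hpa.md",
    "Pod evicted node pressure": "k8s_pod_lifecycle.md",
    "Service route traffic to Pods": "k8s_service.md",
    "enforce Pod Security Standards": "k8s_pod_security_admission.md",
}


def failed_queries(observed: dict[str, str], expected: dict[str, set[str]]) -> list[str]:
    """Return the queries whose top-1 source is missing or not in the acceptable set."""
    return [q for q, allowed in expected.items() if observed.get(q) not in allowed]


failures = failed_queries(OBSERVED_TOP1, EXPECTED)
```

Modeling the expected value as a set handles the two rows where either of two pages counts as a valid hit.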