Kubernetes Corpus Sources
Status: Locked. 28 pages pulled via defuddle parse and verified
against the 25-question QUESTION_PLAN.md mapping. Pilot-first
smoke-query validation on the rebuilt store confirmed retrieval returns
expected chunks for 5 representative queries (StatefulSet, HPA,
node-pressure eviction, Service routing, Pod Security enforcement).
Target: ~25-30 markdown files from kubernetes.io/docs; achieved at 28 pages. Supports 25 golden questions at ~1 question per page with 3 pages of headroom for multi-hop fan-out.
Content license: All kubernetes.io/docs content is licensed under CC BY 4.0. All 28 pulled pages fall under the site default license; no per-page exceptions encountered.
Scope
Include:
- Core workload concepts: Pod, Deployment, StatefulSet, DaemonSet, Job, CronJob, ReplicaSet, Init Containers, Pod Lifecycle
- Networking: Service, Ingress, NetworkPolicy, EndpointSlice, DNS
- Config + state: ConfigMap, Secret, Volumes, PersistentVolumes, Namespaces
- Scheduling + resources: Resource Management, Node Assignment, Taints and Tolerations, Node-pressure Eviction
- Access control: RBAC Authorization
- Health + autoscaling: Liveness/Readiness/Startup Probes, Horizontal Pod Autoscaling
- Security: Pod Security Admission, Pod Security Standards
Exclude:
- Cluster administration deep-dives (etcd, kubelet, kube-apiserver internals): wrong audience for a recruiter-facing demo
- Tutorials (long-form, chunk poorly, hurt retrieval precision)
- kubectl command reference and API reference: wrong shape for RAG, better served by `--help`
- Release notes and version history: no lasting value for Q&A
Curation policy
This corpus targets recruiter-likely questions, not coverage. A question about etcd raft internals will be correctly refused; the refusal mechanism is part of the demo story, not a failure mode.
Each ingested page has:
- A canonical kubernetes.io/docs URL (source of truth, for re-scraping if content drifts)
- A date pulled (provenance, for audit)
- A one-line rationale (why this page is in scope)
- License confirmation (default CC BY 4.0)
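
The four per-page fields above map naturally onto a small record type. The sketch below is illustrative only: the class name, field names, and the `problems` check are assumptions rather than code from this repository, and the rationale string is a placeholder.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass(frozen=True)
class SourceRecord:
    """One ingested page. Names here are hypothetical, not from the repo."""

    file: str
    url: str
    date_pulled: date
    rationale: str
    license: str = "CC BY 4.0"  # site default stated in this document

    def problems(self) -> list[str]:
        """Return curation-policy violations for this record (empty if none)."""
        issues = []
        parsed = urlparse(self.url)
        if parsed.netloc != "kubernetes.io" or not parsed.path.startswith("/docs/"):
            issues.append("URL is not a canonical kubernetes.io/docs URL")
        if not self.rationale.strip():
            issues.append("missing one-line rationale")
        if "\n" in self.rationale.strip():
            issues.append("rationale spans more than one line")
        return issues


record = SourceRecord(
    file="k8s_configmap.md",
    url="https://kubernetes.io/docs/concepts/configuration/configmap/",
    date_pulled=date(2026, 3, 24),
    rationale="Placeholder rationale for illustration only.",
)
print(record.problems())  # []
```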
Locked category breakdown
| Category | Pages | Rationale |
|---|---|---|
| Core workloads | 9 | Pod, Pod Lifecycle, Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Init Containers. Reranker-stressing multi-hop questions draw on 2-4 of these per question. |
| Networking | 5 | Service, Ingress, NetworkPolicy, EndpointSlice, DNS for Services and Pods. NetworkPolicy is the pilot_005 flavor-B false_premise target. |
| Config + state | 5 | ConfigMap, Secret, Volumes, Persistent Volumes, Namespaces. Supports simple_w_condition questions where the answer depends on configuration context. |
| Scheduling + resources | 4 | Resource Management, Assigning Pods to Nodes, Taints and Tolerations, Node-pressure Eviction. Good source for comparison and time_sensitive questions. |
| Access control | 1 | RBAC Authorization. Supports 1-2 simple questions about RBAC primitives. |
| Health + autoscaling | 2 | Probes, Horizontal Pod Autoscaling. HPA is a time_sensitive candidate (autoscaling/v2 stable state). |
| Security | 2 | Pod Security Admission, Pod Security Standards. PSA is the simple_w_condition stressor where the answer depends on enforcement level. |
| Total | 28 | Supports 25 questions with 3 pages of headroom. |
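
As a quick arithmetic check on the table above, the per-category counts can be summed and compared against the 25-question plan. This is a throwaway sketch, not part of the repo; the variable names are made up for illustration.

```python
# Page counts copied from the locked category breakdown table.
category_pages = {
    "Core workloads": 9,
    "Networking": 5,
    "Config + state": 5,
    "Scheduling + resources": 4,
    "Access control": 1,
    "Health + autoscaling": 2,
    "Security": 2,
}

golden_questions = 25
total_pages = sum(category_pages.values())
headroom = total_pages - golden_questions

print(total_pages, headroom)  # 28 3
```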
Pulled pages (all 28)
All pages pulled via `defuddle parse <url> --md -o data/k8s_docs/<file>.md`.
| File | Category | URL | Date pulled | Pilot evidence |
|---|---|---|---|---|
| `k8s_configmap.md` | Config + state | https://kubernetes.io/docs/concepts/configuration/configmap/ | 2026-03-24 (pilot) | - |
| `k8s_deployment.md` | Core workloads | https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ | 2026-03-24 (pilot) | - |
| `k8s_network_policies.md` | Networking | https://kubernetes.io/docs/concepts/services-networking/network-policies/ | 2026-03-24 (pilot) | pilot_005 flavor-B target: chunk_index 63 contains "Anything TLS related (use a service mesh or ingress controller for this)" |
| `k8s_node_pressure_eviction.md` | Scheduling + resources | https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/ | 2026-03-24 (pilot) | - |
| `k8s_pod_security_admission.md` | Security | https://kubernetes.io/docs/concepts/security/pod-security-admission/ | 2026-03-24 (pilot) | - |
| `k8s_pods.md` | Core workloads | https://kubernetes.io/docs/concepts/workloads/pods/ | 2026-03-24 (pilot) | pilot_001 target (Pod IP + localhost communication) |
| `k8s_replicaset.md` | Core workloads | https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/ | 2026-03-24 (pilot) | - |
| `k8s_secret.md` | Config + state | https://kubernetes.io/docs/concepts/configuration/secret/ | 2026-03-24 (pilot) | - |
| `k8s_pod_lifecycle.md` | Core workloads | https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/ | 2026-04-14 | step 4 |
| `k8s_statefulset.md` | Core workloads | https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/ | 2026-04-14 | step 4 |
| `k8s_daemonset.md` | Core workloads | https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/ | 2026-04-14 | step 4 |
| `k8s_job.md` | Core workloads | https://kubernetes.io/docs/concepts/workloads/controllers/job/ | 2026-04-14 | step 4 |
| `k8s_cronjob.md` | Core workloads | https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/ | 2026-04-14 | step 4 |
| `k8s_init_containers.md` | Core workloads | https://kubernetes.io/docs/concepts/workloads/pods/init-containers/ | 2026-04-14 | step 4 |
| `k8s_service.md` | Networking | https://kubernetes.io/docs/concepts/services-networking/service/ | 2026-04-14 | step 4 |
| `k8s_ingress.md` | Networking | https://kubernetes.io/docs/concepts/services-networking/ingress/ | 2026-04-14 | step 4 |
| `k8s_endpoint_slices.md` | Networking | https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/ | 2026-04-14 | step 4 |
| `k8s_dns.md` | Networking | https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/ | 2026-04-14 | step 4 |
| `k8s_volumes.md` | Config + state | https://kubernetes.io/docs/concepts/storage/volumes/ | 2026-04-14 | step 4 |
| `k8s_persistent_volumes.md` | Config + state | https://kubernetes.io/docs/concepts/storage/persistent-volumes/ | 2026-04-14 | step 4 |
| `k8s_namespaces.md` | Config + state | https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/ | 2026-04-14 | step 4 |
| `k8s_resource_management.md` | Scheduling + resources | https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ | 2026-04-14 | step 4 |
| `k8s_assign_pod_node.md` | Scheduling + resources | https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ | 2026-04-14 | step 4 |
| `k8s_taints_tolerations.md` | Scheduling + resources | https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ | 2026-04-14 | step 4 |
| `k8s_rbac.md` | Access control | https://kubernetes.io/docs/reference/access-authn-authz/rbac/ | 2026-04-14 | step 4 |
| `k8s_probes.md` | Health + autoscaling | https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/ | 2026-04-14 | step 4 |
| `k8s_hpa.md` | Health + autoscaling | https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ | 2026-04-14 | step 4 |
| `k8s_pod_security_standards.md` | Security | https://kubernetes.io/docs/concepts/security/pod-security-standards/ | 2026-04-14 | step 4 |
Pull tool: defuddle CLI v0.16.0
(`defuddle parse <url> --md -o <file>`). Defuddle extracts the main
content region of kubernetes.io/docs pages and renders clean markdown
with inline links preserved. The output format matches the 8 pilot pages
exactly, so no per-file normalization was needed.
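
For re-pulling pages in bulk, the command can be assembled programmatically. The helper below is a minimal sketch and not part of this repo: `defuddle_command` is a hypothetical name, and it uses only the subcommand and flags quoted in this document.

```python
from pathlib import Path


def defuddle_command(url: str, out_dir: Path, filename: str) -> list[str]:
    """Build the argv for one pull, using only the flags shown in this file."""
    out_path = (out_dir / filename).as_posix()
    return ["defuddle", "parse", url, "--md", "-o", out_path]


cmd = defuddle_command(
    "https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/",
    Path("data/k8s_docs"),
    "k8s_statefulset.md",
)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the pull, assuming the defuddle CLI is installed and on `PATH`.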
URL verification: All 20 step-4 URLs resolved without redirect. Defuddle followed each URL as given and produced non-empty output; any 404 or redirect would have produced a 0-byte file, and none did. File lengths range from 115 to 917 lines.
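
The 0-byte check described above could be automated along the following lines. This is a sketch under stated assumptions: the function name and the `k8s_*.md` glob are illustrative, and the demo runs against a temporary directory rather than the real `data/k8s_docs/`.

```python
import tempfile
from pathlib import Path


def check_pulled_files(docs_dir: Path, expected_count: int) -> list[str]:
    """Flag a wrong file count and any 0-byte file (treated as a failed pull)."""
    problems = []
    files = sorted(docs_dir.glob("k8s_*.md"))
    if len(files) != expected_count:
        problems.append(f"expected {expected_count} files, found {len(files)}")
    for path in files:
        if path.stat().st_size == 0:
            problems.append(f"{path.name}: 0-byte file, pull likely failed")
    return problems


# Demo on a throwaway directory with one good file and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    (docs / "k8s_pods.md").write_text("# Pods\n", encoding="utf-8")
    (docs / "k8s_broken.md").write_text("", encoding="utf-8")
    demo_problems = check_pulled_files(docs, expected_count=2)

print(demo_problems)  # ['k8s_broken.md: 0-byte file, pull likely failed']
```

Against the real corpus the call would be `check_pulled_files(Path("data/k8s_docs"), expected_count=28)`.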
Ingestion
make ingest-k8s
This populates .cache/store_k8s/ with embeddings + BM25 index
matching the FastAPI corpus's chunker settings (recursive, 512-token
chunks, 64-token overlap). Current state: 2447 chunks across 28
unique sources.
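
To make the 512/64 setting concrete, the sketch below shows how a window of 512 tokens with a 64-token overlap advances through a document. It is deliberately simplified: it uses a plain fixed-size sliding window over a pre-tokenized list, not the recursive splitter the pipeline actually uses, and `window_chunks` is a hypothetical name.

```python
def window_chunks(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts 448 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(1000)]  # stand-in for a tokenized page
chunks = window_chunks(tokens)
print([len(c) for c in chunks])          # [512, 512, 104]
print(chunks[0][-64:] == chunks[1][:64])  # True
```

A recursive splitter would additionally prefer to break on structural boundaries such as headings and paragraphs before falling back to a hard token limit, so real chunk boundaries will differ from this sketch.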
Ingest hygiene: scripts/ingest.py excludes SOURCES.md,
QUESTION_PLAN.md, and README.md from the corpus because these are
version-controlled curation artifacts, not content.
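
One plausible shape for that exclusion is a filename filter applied before chunking. The snippet below is an assumption about how such a filter might look, not the actual contents of scripts/ingest.py; `EXCLUDED_NAMES` and `corpus_files` are hypothetical names.

```python
from pathlib import Path

# Curation artifacts that live alongside the corpus but are not content.
EXCLUDED_NAMES = {"SOURCES.md", "QUESTION_PLAN.md", "README.md"}


def corpus_files(paths: list[Path]) -> list[Path]:
    """Keep markdown files whose names are not on the exclusion list."""
    return [p for p in paths if p.suffix == ".md" and p.name not in EXCLUDED_NAMES]


candidates = [
    Path("data/k8s_docs/k8s_pods.md"),
    Path("data/k8s_docs/SOURCES.md"),
    Path("data/k8s_docs/QUESTION_PLAN.md"),
    Path("data/k8s_docs/README.md"),
]
kept = corpus_files(candidates)
print([p.name for p in kept])  # ['k8s_pods.md']
```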
Post-ingest smoke-query validation
Per cross-cutting #8 pilot-first discipline, 5 representative queries were run against the rebuilt store to confirm retrieval works before step 5 golden-set authoring:
| Query | Top-1 source | Expected | Verdict |
|---|---|---|---|
| "what is a StatefulSet" | `k8s_statefulset.md` | `k8s_statefulset.md` | ✓ |
| "how does HPA scale replicas" | `k8s_hpa.md` | `k8s_hpa.md` | ✓ |
| "Pod evicted node pressure" | `k8s_pod_lifecycle.md` | `k8s_node_pressure_eviction.md` or `k8s_pod_lifecycle.md` | ✓ (either acceptable; eviction is covered in both) |
| "Service route traffic to Pods" | `k8s_service.md` | `k8s_service.md` | ✓ |
| "enforce Pod Security Standards" | `k8s_pod_security_admission.md` | `k8s_pod_security_admission.md` or `k8s_pod_security_standards.md` | ✓ (PSA is the enforcement mechanism; PSS defines the levels, so both are valid hits) |
All 5 return top-1 from an expected page. No unexpected refusals. No noise from irrelevant pages. The store is ready for step 5.
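
The smoke check is easy to keep as a repeatable script. The sketch below encodes the five queries and their accepted sources from the table; the retriever is a stub that replays the recorded top-1 results, since the real store's query interface is not shown in this document. `run_smoke_queries` and the dictionary names are hypothetical.

```python
from typing import Callable

# Query -> set of sources accepted as a correct top-1 hit (from the table).
SMOKE_CASES = {
    "what is a StatefulSet": {"k8s_statefulset.md"},
    "how does HPA scale replicas": {"k8s_hpa.md"},
    "Pod evicted node pressure": {
        "k8s_node_pressure_eviction.md",
        "k8s_pod_lifecycle.md",
    },
    "Service route traffic to Pods": {"k8s_service.md"},
    "enforce Pod Security Standards": {
        "k8s_pod_security_admission.md",
        "k8s_pod_security_standards.md",
    },
}


def run_smoke_queries(
    top1_source: Callable[[str], str], cases: dict[str, set[str]]
) -> list[str]:
    """Return one message per query whose top-1 source is not accepted."""
    failures = []
    for query, accepted in cases.items():
        got = top1_source(query)
        if got not in accepted:
            failures.append(f"{query!r}: got {got}, expected one of {sorted(accepted)}")
    return failures


# Stub retriever: replays the top-1 sources recorded in the table above.
recorded_top1 = {
    "what is a StatefulSet": "k8s_statefulset.md",
    "how does HPA scale replicas": "k8s_hpa.md",
    "Pod evicted node pressure": "k8s_pod_lifecycle.md",
    "Service route traffic to Pods": "k8s_service.md",
    "enforce Pod Security Standards": "k8s_pod_security_admission.md",
}
failures = run_smoke_queries(recorded_top1.__getitem__, SMOKE_CASES)
print(failures)  # []
```

Swapping the stub for a function that queries `.cache/store_k8s/` and returns the top-ranked chunk's source file would turn this into a regression check to rerun after each re-ingest.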