Spaces:
Running
Running
| # Kubernetes Corpus Sources | |
| **Status:** Locked. 28 pages pulled via `defuddle parse` and verified | |
| against the 25-question `QUESTION_PLAN.md` mapping. Pilot-first | |
| smoke-query validation on the rebuilt store confirmed retrieval returns | |
| expected chunks for 5 representative queries (StatefulSet, HPA, | |
| node-pressure eviction, Service routing, Pod Security enforcement). | |
| **Target:** ~25β30 markdown files from kubernetes.io/docs β achieved | |
| at 28 pages. Supports 25 golden questions at ~1 question per page | |
| with 3 pages of headroom for multi-hop fan-out. | |
| **Content license:** All kubernetes.io/docs content is licensed under | |
| [CC BY 4.0](https://git.k8s.io/website/LICENSE). All 28 pulled pages | |
| fall under the site default license; no per-page exceptions encountered. | |
| ## Scope | |
| **Include:** | |
| - Core workload concepts: Pod, Deployment, StatefulSet, DaemonSet, | |
| Job, CronJob, ReplicaSet, Init Containers, Pod Lifecycle | |
| - Networking: Service, Ingress, NetworkPolicy, EndpointSlice, DNS | |
| - Config + state: ConfigMap, Secret, Volumes, PersistentVolumes, | |
| Namespaces | |
| - Scheduling + resources: Resource Management, Node Assignment, | |
| Taints and Tolerations, Node-pressure Eviction | |
| - Access control: RBAC Authorization | |
| - Health + autoscaling: Liveness/Readiness/Startup Probes, | |
| Horizontal Pod Autoscaling | |
| - Security: Pod Security Admission, Pod Security Standards | |
| **Exclude:** | |
| - Cluster administration deep-dives (etcd, kubelet, kube-apiserver | |
| internals) β wrong audience for a recruiter-facing demo | |
| - Tutorials (long-form, chunk poorly, hurt retrieval precision) | |
| - kubectl command reference and API reference β wrong shape for RAG, | |
| better served by `--help` | |
| - Release notes and version history β no lasting value for Q&A | |
| ## Curation policy | |
| This corpus targets **recruiter-likely questions**, not coverage. A | |
| question about etcd raft internals will be correctly refused β the | |
| refusal mechanism is part of the demo story, not a failure mode. | |
| Each ingested page has: | |
| - A canonical kubernetes.io/docs URL (source of truth, for re-scraping | |
| if content drifts) | |
| - A date pulled (provenance, for audit) | |
| - A one-line rationale (why this page is in scope) | |
| - License confirmation (default CC BY 4.0) | |
| ## Locked category breakdown | |
| | Category | Pages | Rationale | | |
| |---|---|---| | |
| | Core workloads | 9 | Pod, Pod Lifecycle, Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Init Containers. Reranker-stressing multi-hop questions draw on 2β4 of these per question. | | |
| | Networking | 5 | Service, Ingress, NetworkPolicy, EndpointSlice, DNS for Services and Pods. NetworkPolicy is the pilot_005 flavor-B false_premise target. | | |
| | Config + state | 5 | ConfigMap, Secret, Volumes, Persistent Volumes, Namespaces. Supports `simple_w_condition` questions where the answer depends on configuration context. | | |
| | Scheduling + resources | 4 | Resource Management, Assigning Pods to Nodes, Taints and Tolerations, Node-pressure Eviction. Good source for `comparison` and `time_sensitive` questions. | | |
| | Access control | 1 | RBAC Authorization. Supports 1β2 `simple` questions about RBAC primitives. | | |
| | Health + autoscaling | 2 | Probes, Horizontal Pod Autoscaling. HPA is a `time_sensitive` candidate (autoscaling/v2 stable state). | | |
| | Security | 2 | Pod Security Admission, Pod Security Standards. PSA is the `simple_w_condition` stressor where the answer depends on enforcement level. | | |
| | **Total** | **28** | Supports 25 questions with 3 pages of headroom. | | |
| ## Pulled pages (all 28) | |
| All pages pulled via `defuddle parse <url> --md -o data/k8s_docs/<file>.md`. | |
| | File | Category | URL | Date pulled | Pilot evidence | | |
| |---|---|---|---|---| | |
| | `k8s_configmap.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/configmap/` | 2026-03-24 (pilot) | β | | |
| | `k8s_deployment.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/deployment/` | 2026-03-24 (pilot) | β | | |
| | `k8s_network_policies.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/network-policies/` | 2026-03-24 (pilot) | **pilot_005 flavor-B target** β chunk_index 63 contains "Anything TLS related (use a service mesh or ingress controller for this)" | | |
| | `k8s_node_pressure_eviction.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/` | 2026-03-24 (pilot) | β | | |
| | `k8s_pod_security_admission.md` | Security | `https://kubernetes.io/docs/concepts/security/pod-security-admission/` | 2026-03-24 (pilot) | β | | |
| | `k8s_pods.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/` | 2026-03-24 (pilot) | pilot_001 target (Pod IP + localhost communication) | | |
| | `k8s_replicaset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/` | 2026-03-24 (pilot) | β | | |
| | `k8s_secret.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/secret/` | 2026-03-24 (pilot) | β | | |
| | `k8s_pod_lifecycle.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/` | 2026-04-14 | step 4 | | |
| | `k8s_statefulset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/` | 2026-04-14 | step 4 | | |
| | `k8s_daemonset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/` | 2026-04-14 | step 4 | | |
| | `k8s_job.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/job/` | 2026-04-14 | step 4 | | |
| | `k8s_cronjob.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/` | 2026-04-14 | step 4 | | |
| | `k8s_init_containers.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/init-containers/` | 2026-04-14 | step 4 | | |
| | `k8s_service.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/service/` | 2026-04-14 | step 4 | | |
| | `k8s_ingress.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/ingress/` | 2026-04-14 | step 4 | | |
| | `k8s_endpoint_slices.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/` | 2026-04-14 | step 4 | | |
| | `k8s_dns.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/` | 2026-04-14 | step 4 | | |
| | `k8s_volumes.md` | Config + state | `https://kubernetes.io/docs/concepts/storage/volumes/` | 2026-04-14 | step 4 | | |
| | `k8s_persistent_volumes.md` | Config + state | `https://kubernetes.io/docs/concepts/storage/persistent-volumes/` | 2026-04-14 | step 4 | | |
| | `k8s_namespaces.md` | Config + state | `https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/` | 2026-04-14 | step 4 | | |
| | `k8s_resource_management.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/` | 2026-04-14 | step 4 | | |
| | `k8s_assign_pod_node.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/` | 2026-04-14 | step 4 | | |
| | `k8s_taints_tolerations.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/` | 2026-04-14 | step 4 | | |
| | `k8s_rbac.md` | Access control | `https://kubernetes.io/docs/reference/access-authn-authz/rbac/` | 2026-04-14 | step 4 | | |
| | `k8s_probes.md` | Health + autoscaling | `https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/` | 2026-04-14 | step 4 | | |
| | `k8s_hpa.md` | Health + autoscaling | `https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/` | 2026-04-14 | step 4 | | |
| | `k8s_pod_security_standards.md` | Security | `https://kubernetes.io/docs/concepts/security/pod-security-standards/` | 2026-04-14 | step 4 | | |
| **Pull tool:** [defuddle](https://github.com/kepano/defuddle) CLI v0.16.0 | |
| (`defuddle parse <url> --md -o <file>`). Defuddle extracts the main | |
| content region of kubernetes.io/docs pages and renders clean markdown | |
| with inline links preserved β output format matches the pilot 8 pages | |
| exactly, so no per-file normalization was needed. | |
| **URL verification:** All 20 step-4 URLs resolved without redirect | |
| (defuddle followed the URL as given and produced non-empty output; | |
| any 404 or redirect would have produced a 0-byte file, which none | |
| did β file sizes range 115β917 lines). | |
| ## Ingestion | |
| ```bash | |
| make ingest-k8s | |
| ``` | |
| This populates `.cache/store_k8s/` with embeddings + BM25 index | |
| matching the FastAPI corpus's chunker settings (recursive, 512-token | |
| chunks, 64-token overlap). Current state: **2447 chunks across 28 | |
| unique sources**. | |
| **Ingest hygiene:** `scripts/ingest.py` excludes `SOURCES.md`, | |
| `QUESTION_PLAN.md`, and `README.md` from the corpus β these are | |
| version-controlled curation artifacts, not content. | |
| ## Post-ingest smoke-query validation | |
| Per cross-cutting #8 pilot-first discipline, 5 representative queries | |
| were run against the rebuilt store to confirm retrieval works before | |
| step 5 golden-set authoring: | |
| | Query | Top-1 source | Expected | Verdict | | |
| |---|---|---|---| | |
| | "what is a StatefulSet" | `k8s_statefulset.md` | `k8s_statefulset.md` | β | | |
| | "how does HPA scale replicas" | `k8s_hpa.md` | `k8s_hpa.md` | β | | |
| | "Pod evicted node pressure" | `k8s_pod_lifecycle.md` | `k8s_node_pressure_eviction.md` or `k8s_pod_lifecycle.md` | β (either acceptable β eviction is covered in both) | | |
| | "Service route traffic to Pods" | `k8s_service.md` | `k8s_service.md` | β | | |
| | "enforce Pod Security Standards" | `k8s_pod_security_admission.md` | `k8s_pod_security_admission.md` or `k8s_pod_security_standards.md` | β (PSA is the enforcement mechanism; PSS defines the levels β both valid hits) | | |
| All 5 return top-1 from an expected page. No unexpected refusals. | |
| No noise from irrelevant pages. The store is ready for step 5. | |