# Kubernetes Corpus Sources
**Status:** Locked. 28 pages pulled via `defuddle parse` and verified
against the 25-question `QUESTION_PLAN.md` mapping. Pilot-first
smoke-query validation on the rebuilt store confirmed retrieval returns
expected chunks for 5 representative queries (StatefulSet, HPA,
node-pressure eviction, Service routing, Pod Security enforcement).
**Target:** ~25–30 markdown files from kubernetes.io/docs, achieved
at 28 pages. Supports 25 golden questions at ~1 question per page
with 3 pages of headroom for multi-hop fan-out.
**Content license:** All kubernetes.io/docs content is licensed under
[CC BY 4.0](https://git.k8s.io/website/LICENSE). All 28 pulled pages
fall under the site default license; no per-page exceptions encountered.
## Scope
**Include:**
- Core workload concepts: Pod, Deployment, StatefulSet, DaemonSet,
Job, CronJob, ReplicaSet, Init Containers, Pod Lifecycle
- Networking: Service, Ingress, NetworkPolicy, EndpointSlice, DNS
- Config + state: ConfigMap, Secret, Volumes, PersistentVolumes,
Namespaces
- Scheduling + resources: Resource Management, Node Assignment,
Taints and Tolerations, Node-pressure Eviction
- Access control: RBAC Authorization
- Health + autoscaling: Liveness/Readiness/Startup Probes,
Horizontal Pod Autoscaling
- Security: Pod Security Admission, Pod Security Standards
**Exclude:**
- Cluster administration deep-dives (etcd, kubelet, kube-apiserver
internals): wrong audience for a recruiter-facing demo
- Tutorials (long-form, chunk poorly, hurt retrieval precision)
- kubectl command reference and API reference: wrong shape for RAG,
better served by `--help`
- Release notes and version history: no lasting value for Q&A
## Curation policy
This corpus targets **recruiter-likely questions**, not coverage. A
question about etcd raft internals will be correctly refused; the
refusal mechanism is part of the demo story, not a failure mode.
Each ingested page has:
- A canonical kubernetes.io/docs URL (source of truth, for re-scraping
if content drifts)
- A date pulled (provenance, for audit)
- A one-line rationale (why this page is in scope)
- License confirmation (default CC BY 4.0)
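
The per-page metadata above could be captured as a small record type. This is a minimal sketch, not the project's actual schema: the class name `SourceRecord`, its field names, and the example rationale string are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceRecord:
    """One ingested page. Class and field names are hypothetical."""

    file: str
    url: str            # canonical kubernetes.io/docs URL (source of truth)
    date_pulled: date   # provenance, for audit
    rationale: str      # one line: why this page is in scope
    license: str = "CC BY 4.0"  # site default license

    def __post_init__(self) -> None:
        # Reject anything that is not a canonical docs URL.
        if not self.url.startswith("https://kubernetes.io/docs/"):
            raise ValueError(f"not a kubernetes.io/docs URL: {self.url}")
        # The rationale must be present and fit on one line.
        if not self.rationale.strip() or "\n" in self.rationale:
            raise ValueError("rationale must be a non-empty one-liner")


example = SourceRecord(
    file="k8s_statefulset.md",
    url="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/",
    date_pulled=date(2026, 4, 14),
    rationale="Core workload concept (illustrative wording)",
)
```

Validating at construction time means a malformed row fails loudly when the record is built rather than surfacing later during re-scraping.
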
## Locked category breakdown
| Category | Pages | Rationale |
|---|---|---|
| Core workloads | 9 | Pod, Pod Lifecycle, Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Init Containers. Reranker-stressing multi-hop questions draw on 2–4 of these per question. |
| Networking | 5 | Service, Ingress, NetworkPolicy, EndpointSlice, DNS for Services and Pods. NetworkPolicy is the pilot_005 flavor-B false_premise target. |
| Config + state | 5 | ConfigMap, Secret, Volumes, Persistent Volumes, Namespaces. Supports `simple_w_condition` questions where the answer depends on configuration context. |
| Scheduling + resources | 4 | Resource Management, Assigning Pods to Nodes, Taints and Tolerations, Node-pressure Eviction. Good source for `comparison` and `time_sensitive` questions. |
| Access control | 1 | RBAC Authorization. Supports 1–2 `simple` questions about RBAC primitives. |
| Health + autoscaling | 2 | Probes, Horizontal Pod Autoscaling. HPA is a `time_sensitive` candidate (autoscaling/v2 stable state). |
| Security | 2 | Pod Security Admission, Pod Security Standards. PSA is the `simple_w_condition` stressor where the answer depends on enforcement level. |
| **Total** | **28** | Supports 25 questions with 3 pages of headroom. |
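
The totals in the table can be re-derived with a few lines of arithmetic. This is only a sanity-check sketch; the dictionary name and constant are hypothetical, and the counts are copied from the table above.

```python
# Page counts per category, copied from the breakdown table.
CATEGORY_PAGES = {
    "Core workloads": 9,
    "Networking": 5,
    "Config + state": 5,
    "Scheduling + resources": 4,
    "Access control": 1,
    "Health + autoscaling": 2,
    "Security": 2,
}
GOLDEN_QUESTIONS = 25

total_pages = sum(CATEGORY_PAGES.values())
headroom = total_pages - GOLDEN_QUESTIONS  # pages left over for multi-hop fan-out

print(f"{total_pages} pages, {headroom} pages of headroom")
```
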
## Pulled pages (all 28)
All pages pulled via `defuddle parse <url> --md -o data/k8s_docs/<file>.md`.
| File | Category | URL | Date pulled | Pilot evidence |
|---|---|---|---|---|
| `k8s_configmap.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/configmap/` | 2026-03-24 (pilot) | - |
| `k8s_deployment.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/deployment/` | 2026-03-24 (pilot) | - |
| `k8s_network_policies.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/network-policies/` | 2026-03-24 (pilot) | **pilot_005 flavor-B target**: chunk_index 63 contains "Anything TLS related (use a service mesh or ingress controller for this)" |
| `k8s_node_pressure_eviction.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/` | 2026-03-24 (pilot) | - |
| `k8s_pod_security_admission.md` | Security | `https://kubernetes.io/docs/concepts/security/pod-security-admission/` | 2026-03-24 (pilot) | - |
| `k8s_pods.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/` | 2026-03-24 (pilot) | pilot_001 target (Pod IP + localhost communication) |
| `k8s_replicaset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/` | 2026-03-24 (pilot) | - |
| `k8s_secret.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/secret/` | 2026-03-24 (pilot) | - |
| `k8s_pod_lifecycle.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/` | 2026-04-14 | step 4 |
| `k8s_statefulset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/` | 2026-04-14 | step 4 |
| `k8s_daemonset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/` | 2026-04-14 | step 4 |
| `k8s_job.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/job/` | 2026-04-14 | step 4 |
| `k8s_cronjob.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/` | 2026-04-14 | step 4 |
| `k8s_init_containers.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/init-containers/` | 2026-04-14 | step 4 |
| `k8s_service.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/service/` | 2026-04-14 | step 4 |
| `k8s_ingress.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/ingress/` | 2026-04-14 | step 4 |
| `k8s_endpoint_slices.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/` | 2026-04-14 | step 4 |
| `k8s_dns.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/` | 2026-04-14 | step 4 |
| `k8s_volumes.md` | Config + state | `https://kubernetes.io/docs/concepts/storage/volumes/` | 2026-04-14 | step 4 |
| `k8s_persistent_volumes.md` | Config + state | `https://kubernetes.io/docs/concepts/storage/persistent-volumes/` | 2026-04-14 | step 4 |
| `k8s_namespaces.md` | Config + state | `https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/` | 2026-04-14 | step 4 |
| `k8s_resource_management.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/` | 2026-04-14 | step 4 |
| `k8s_assign_pod_node.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/` | 2026-04-14 | step 4 |
| `k8s_taints_tolerations.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/` | 2026-04-14 | step 4 |
| `k8s_rbac.md` | Access control | `https://kubernetes.io/docs/reference/access-authn-authz/rbac/` | 2026-04-14 | step 4 |
| `k8s_probes.md` | Health + autoscaling | `https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/` | 2026-04-14 | step 4 |
| `k8s_hpa.md` | Health + autoscaling | `https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/` | 2026-04-14 | step 4 |
| `k8s_pod_security_standards.md` | Security | `https://kubernetes.io/docs/concepts/security/pod-security-standards/` | 2026-04-14 | step 4 |
**Pull tool:** [defuddle](https://github.com/kepano/defuddle) CLI v0.16.0
(`defuddle parse <url> --md -o <file>`). Defuddle extracts the main
content region of kubernetes.io/docs pages and renders clean markdown
with inline links preserved. The output format matches the 8 pilot pages
exactly, so no per-file normalization was needed.
**URL verification:** All 20 step-4 URLs resolved without redirect:
defuddle followed each URL as given and produced non-empty output.
Any 404 or redirect would have produced a 0-byte file, and none did
(file lengths range 115–917 lines).
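
The 0-byte check described above can be scripted. The sketch below is an assumption about how such an audit might look, not the project's tooling: the function names `find_empty_pulls` and `line_counts` are hypothetical, and the directory passed in would be wherever the pulled markdown lives (for example `data/k8s_docs/`).

```python
from pathlib import Path


def find_empty_pulls(docs_dir: Path) -> list[str]:
    """Return the names of pulled markdown files that are 0 bytes."""
    return sorted(p.name for p in docs_dir.glob("*.md") if p.stat().st_size == 0)


def line_counts(docs_dir: Path) -> dict[str, int]:
    """Map each pulled markdown file name to its number of lines."""
    return {
        p.name: len(p.read_text(encoding="utf-8").splitlines())
        for p in sorted(docs_dir.glob("*.md"))
    }
```

An empty list from `find_empty_pulls` corresponds to the "none did" result reported here, and the minimum and maximum of `line_counts(...).values()` give the line range.
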
## Ingestion
```bash
make ingest-k8s
```
This populates `.cache/store_k8s/` with embeddings + BM25 index
matching the FastAPI corpus's chunker settings (recursive, 512-token
chunks, 64-token overlap). Current state: **2447 chunks across 28
unique sources**.
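
To make the 512/64 settings concrete, the sketch below shows how a 64-token overlap produces a stride of 448 tokens between chunk starts. It is a plain sliding window, not the recursive splitter the pipeline uses, and it operates on an already-tokenized list; the function name `window_chunks` is hypothetical.

```python
def window_chunks(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens.

    Simplified stand-in for a recursive chunker: it ignores headings,
    paragraphs, and sentence boundaries entirely.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap  # 448 with the settings above
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Usage: 1000 placeholder tokens yield windows starting at 0, 448, and 896.
demo = window_chunks([f"t{i}" for i in range(1000)])
```

Each window repeats the last 64 tokens of its predecessor, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.
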
**Ingest hygiene:** `scripts/ingest.py` excludes `SOURCES.md`,
`QUESTION_PLAN.md`, and `README.md` from the corpus; these are
version-controlled curation artifacts, not content.
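
The exclusion rule can be illustrated with a short filter. This is a sketch of the idea only, not the contents of `scripts/ingest.py`; the names `EXCLUDED` and `corpus_files` are hypothetical.

```python
from pathlib import Path

# Curation artifacts that live beside the corpus but are not content.
EXCLUDED = frozenset({"SOURCES.md", "QUESTION_PLAN.md", "README.md"})


def corpus_files(docs_dir: Path) -> list[Path]:
    """Return the markdown files to ingest, skipping curation artifacts."""
    return sorted(p for p in docs_dir.glob("*.md") if p.name not in EXCLUDED)
```

Matching on the exact file name keeps the rule narrow, so a content page whose name merely contains "README" would still be ingested.
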
## Post-ingest smoke-query validation
Per the pilot-first discipline (cross-cutting #8), 5 representative
queries were run against the rebuilt store to confirm that retrieval
works before step 5 golden-set authoring:
| Query | Top-1 source | Expected | Verdict |
|---|---|---|---|
| "what is a StatefulSet" | `k8s_statefulset.md` | `k8s_statefulset.md` | ✓ |
| "how does HPA scale replicas" | `k8s_hpa.md` | `k8s_hpa.md` | ✓ |
| "Pod evicted node pressure" | `k8s_pod_lifecycle.md` | `k8s_node_pressure_eviction.md` or `k8s_pod_lifecycle.md` | ✓ (either acceptable: eviction is covered in both) |
| "Service route traffic to Pods" | `k8s_service.md` | `k8s_service.md` | ✓ |
| "enforce Pod Security Standards" | `k8s_pod_security_admission.md` | `k8s_pod_security_admission.md` or `k8s_pod_security_standards.md` | ✓ (PSA is the enforcement mechanism; PSS defines the levels, so both are valid hits) |
All 5 return top-1 from an expected page. No unexpected refusals.
No noise from irrelevant pages. The store is ready for step 5.
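
The smoke check above amounts to comparing each query's top-1 source against a set of acceptable pages. The sketch below assumes a hypothetical `retrieve_top1` callable that maps a query to the file name of its top-ranked chunk; the real retrieval interface is not shown in this document. `RECORDED_TOP1` simply replays the results from the table.

```python
from typing import Callable, Optional

# Query -> set of acceptable top-1 sources, copied from the table above.
SMOKE_CASES: dict[str, set[str]] = {
    "what is a StatefulSet": {"k8s_statefulset.md"},
    "how does HPA scale replicas": {"k8s_hpa.md"},
    "Pod evicted node pressure": {"k8s_node_pressure_eviction.md", "k8s_pod_lifecycle.md"},
    "Service route traffic to Pods": {"k8s_service.md"},
    "enforce Pod Security Standards": {
        "k8s_pod_security_admission.md",
        "k8s_pod_security_standards.md",
    },
}

# Top-1 sources recorded in the table, used here in place of a live store.
RECORDED_TOP1: dict[str, str] = {
    "what is a StatefulSet": "k8s_statefulset.md",
    "how does HPA scale replicas": "k8s_hpa.md",
    "Pod evicted node pressure": "k8s_pod_lifecycle.md",
    "Service route traffic to Pods": "k8s_service.md",
    "enforce Pod Security Standards": "k8s_pod_security_admission.md",
}


def check_smoke(retrieve_top1: Callable[[str], Optional[str]]) -> dict[str, bool]:
    """Return, per query, whether the top-1 source is an acceptable page."""
    return {q: retrieve_top1(q) in expected for q, expected in SMOKE_CASES.items()}


results = check_smoke(RECORDED_TOP1.get)  # replay the recorded results
```

Keeping the acceptable pages as a set per query encodes the "either acceptable" rows directly, so the check does not need special cases.
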