# Kubernetes Corpus Sources

**Status:** Locked. 28 pages pulled via `defuddle parse` and verified
against the 25-question `QUESTION_PLAN.md` mapping. Pilot-first
smoke-query validation on the rebuilt store confirmed retrieval returns
expected chunks for 5 representative queries (StatefulSet, HPA,
node-pressure eviction, Service routing, Pod Security enforcement).

**Target:** ~25–30 markdown files from kubernetes.io/docs (achieved
at 28 pages). Supports 25 golden questions at ~1 question per page
with 3 pages of headroom for multi-hop fan-out.

**Content license:** All kubernetes.io/docs content is licensed under
[CC BY 4.0](https://git.k8s.io/website/LICENSE). All 28 pulled pages
fall under the site default license; no per-page exceptions encountered.

## Scope

**Include:**

- Core workload concepts: Pod, Deployment, StatefulSet, DaemonSet,
  Job, CronJob, ReplicaSet, Init Containers, Pod Lifecycle
- Networking: Service, Ingress, NetworkPolicy, EndpointSlice, DNS
- Config + state: ConfigMap, Secret, Volumes, PersistentVolumes,
  Namespaces
- Scheduling + resources: Resource Management, Node Assignment,
  Taints and Tolerations, Node-pressure Eviction
- Access control: RBAC Authorization
- Health + autoscaling: Liveness/Readiness/Startup Probes,
  Horizontal Pod Autoscaling
- Security: Pod Security Admission, Pod Security Standards

**Exclude:**

- Cluster administration deep-dives (etcd, kubelet, kube-apiserver
  internals): wrong audience for a recruiter-facing demo
- Tutorials (long-form, chunk poorly, hurt retrieval precision)
- kubectl command reference and API reference: wrong shape for RAG,
  better served by `--help`
- Release notes and version history: no lasting value for Q&A

## Curation policy

This corpus targets **recruiter-likely questions**, not exhaustive
coverage. A question about etcd Raft internals will be correctly
refused; the refusal mechanism is part of the demo story, not a
failure mode.

Each ingested page has:

- A canonical kubernetes.io/docs URL (source of truth, for re-scraping
  if content drifts)
- A date pulled (provenance, for audit)
- A one-line rationale (why this page is in scope)
- License confirmation (default CC BY 4.0)

## Locked category breakdown

| Category | Pages | Rationale |
|---|---|---|
| Core workloads | 9 | Pod, Pod Lifecycle, Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Init Containers. Reranker-stressing multi-hop questions draw on 2–4 of these per question. |
| Networking | 5 | Service, Ingress, NetworkPolicy, EndpointSlice, DNS for Services and Pods. NetworkPolicy is the pilot_005 flavor-B false_premise target. |
| Config + state | 5 | ConfigMap, Secret, Volumes, Persistent Volumes, Namespaces. Supports `simple_w_condition` questions where the answer depends on configuration context. |
| Scheduling + resources | 4 | Resource Management, Assigning Pods to Nodes, Taints and Tolerations, Node-pressure Eviction. Good source for `comparison` and `time_sensitive` questions. |
| Access control | 1 | RBAC Authorization. Supports 1–2 `simple` questions about RBAC primitives. |
| Health + autoscaling | 2 | Probes, Horizontal Pod Autoscaling. HPA is a `time_sensitive` candidate (autoscaling/v2 stable state). |
| Security | 2 | Pod Security Admission, Pod Security Standards. PSA is the `simple_w_condition` stressor where the answer depends on enforcement level. |
| **Total** | **28** | Supports 25 questions with 3 pages of headroom. |

## Pulled pages (all 28)

All pages pulled via `defuddle parse <url> --md -o data/k8s_docs/<file>.md`.

| File | Category | URL | Date pulled | Pilot evidence |
|---|---|---|---|---|
| `k8s_configmap.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/configmap/` | 2026-03-24 (pilot) | - |
| `k8s_deployment.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/deployment/` | 2026-03-24 (pilot) | - |
| `k8s_network_policies.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/network-policies/` | 2026-03-24 (pilot) | **pilot_005 flavor-B target**: chunk_index 63 contains "Anything TLS related (use a service mesh or ingress controller for this)" |
| `k8s_node_pressure_eviction.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/` | 2026-03-24 (pilot) | - |
| `k8s_pod_security_admission.md` | Security | `https://kubernetes.io/docs/concepts/security/pod-security-admission/` | 2026-03-24 (pilot) | - |
| `k8s_pods.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/` | 2026-03-24 (pilot) | pilot_001 target (Pod IP + localhost communication) |
| `k8s_replicaset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/` | 2026-03-24 (pilot) | - |
| `k8s_secret.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/secret/` | 2026-03-24 (pilot) | - |
| `k8s_pod_lifecycle.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/` | 2026-04-14 | step 4 |
| `k8s_statefulset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/` | 2026-04-14 | step 4 |
| `k8s_daemonset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/` | 2026-04-14 | step 4 |
| `k8s_job.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/job/` | 2026-04-14 | step 4 |
| `k8s_cronjob.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/` | 2026-04-14 | step 4 |
| `k8s_init_containers.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/init-containers/` | 2026-04-14 | step 4 |
| `k8s_service.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/service/` | 2026-04-14 | step 4 |
| `k8s_ingress.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/ingress/` | 2026-04-14 | step 4 |
| `k8s_endpoint_slices.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/` | 2026-04-14 | step 4 |
| `k8s_dns.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/` | 2026-04-14 | step 4 |
| `k8s_volumes.md` | Config + state | `https://kubernetes.io/docs/concepts/storage/volumes/` | 2026-04-14 | step 4 |
| `k8s_persistent_volumes.md` | Config + state | `https://kubernetes.io/docs/concepts/storage/persistent-volumes/` | 2026-04-14 | step 4 |
| `k8s_namespaces.md` | Config + state | `https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/` | 2026-04-14 | step 4 |
| `k8s_resource_management.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/` | 2026-04-14 | step 4 |
| `k8s_assign_pod_node.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/` | 2026-04-14 | step 4 |
| `k8s_taints_tolerations.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/` | 2026-04-14 | step 4 |
| `k8s_rbac.md` | Access control | `https://kubernetes.io/docs/reference/access-authn-authz/rbac/` | 2026-04-14 | step 4 |
| `k8s_probes.md` | Health + autoscaling | `https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/` | 2026-04-14 | step 4 |
| `k8s_hpa.md` | Health + autoscaling | `https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/` | 2026-04-14 | step 4 |
| `k8s_pod_security_standards.md` | Security | `https://kubernetes.io/docs/concepts/security/pod-security-standards/` | 2026-04-14 | step 4 |

**Pull tool:** [defuddle](https://github.com/kepano/defuddle) CLI v0.16.0
(`defuddle parse <url> --md -o <file>`). Defuddle extracts the main
content region of kubernetes.io/docs pages and renders clean markdown
with inline links preserved. The output format matches the 8 pilot
pages exactly, so no per-file normalization was needed.
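As a rough illustration of how the pull step could be scripted, the sketch below builds one `defuddle parse <url> --md -o <file>` argument list per page. The `PAGES` mapping is a two-row excerpt of the table above, and the helper names (`defuddle_command`, `pull_commands`) are hypothetical; the repository may drive the CLI differently.

```python
import shlex
from pathlib import PurePosixPath

# Two rows copied from the "Pulled pages" table; the full mapping would have 28.
PAGES = {
    "k8s_service.md": "https://kubernetes.io/docs/concepts/services-networking/service/",
    "k8s_hpa.md": "https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/",
}


def defuddle_command(url: str, filename: str, out_dir: str = "data/k8s_docs") -> list[str]:
    """Build the argv for one pull, mirroring `defuddle parse <url> --md -o <file>`."""
    return ["defuddle", "parse", url, "--md", "-o", str(PurePosixPath(out_dir) / filename)]


def pull_commands(pages: dict[str, str]) -> list[list[str]]:
    """One argv per (file, URL) pair, in the mapping's insertion order."""
    return [defuddle_command(url, name) for name, url in pages.items()]


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch has no side effects.
    for argv in pull_commands(PAGES):
        print(shlex.join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` to perform the actual pull; printing first makes it easy to review the commands before touching the network.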

**URL verification:** All 20 step-4 URLs resolved without redirect
(defuddle followed the URL as given and produced non-empty output;
any 404 or redirect would have produced a 0-byte file, which none
did: file lengths range from 115 to 917 lines).
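The non-empty check described above could be automated along these lines. This is a minimal sketch, assuming the pulled pages live in one directory and all share the `k8s_` filename prefix used in the table; the function names are illustrative and not part of the repository.

```python
from pathlib import Path


def empty_pulls(doc_dir: Path) -> list[str]:
    """Names of pulled pages that are 0 bytes, i.e. candidates for a failed pull."""
    return sorted(p.name for p in doc_dir.glob("k8s_*.md") if p.stat().st_size == 0)


def line_count_range(doc_dir: Path) -> tuple[int, int]:
    """(min, max) line count across the non-empty pulled pages."""
    counts = [
        len(p.read_text(encoding="utf-8").splitlines())
        for p in doc_dir.glob("k8s_*.md")
        if p.stat().st_size > 0
    ]
    if not counts:
        raise ValueError(f"no non-empty k8s_*.md files found in {doc_dir}")
    return min(counts), max(counts)


if __name__ == "__main__":
    docs = Path("data/k8s_docs")  # output directory from the defuddle command above
    print("empty files:", empty_pulls(docs))
    print("line range:", line_count_range(docs))
```

An empty list from `empty_pulls` corresponds to the "none did" observation; the line range gives a quick sanity check against truncated pages.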

## Ingestion

```bash
make ingest-k8s
```

This populates `.cache/store_k8s/` with embeddings and a BM25 index,
using the same chunker settings as the FastAPI corpus (recursive,
512-token chunks, 64-token overlap). Current state: **2447 chunks
across 28 unique sources**.
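To make the size and overlap numbers concrete, here is a simplified sliding-window sketch. It is not the project's chunker: it treats whitespace-separated words as stand-ins for tokens and skips the recursive separator logic, so it only illustrates how a 512-token window with a 64-token overlap advances through a document.

```python
def sliding_chunks(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new window starts this many tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    words = "a placeholder page body used only for this demonstration".split() * 150
    for i, chunk in enumerate(sliding_chunks(words)):
        print(i, len(chunk))
```

With these settings each window starts 448 tokens after the previous one, so the last 64 tokens of one chunk reappear as the first 64 of the next.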

**Ingest hygiene:** `scripts/ingest.py` excludes `SOURCES.md`,
`QUESTION_PLAN.md`, and `README.md` from the corpus; these are
version-controlled curation artifacts, not content.
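The exclusion rule can be pictured as a simple filename filter. The sketch below is an assumption about its shape, not a copy of `scripts/ingest.py`; the `EXCLUDED` set and `corpus_files` helper are illustrative names.

```python
from pathlib import Path

# Curation artifacts named in the hygiene note above; they are tracked, not ingested.
EXCLUDED = {"SOURCES.md", "QUESTION_PLAN.md", "README.md"}


def corpus_files(doc_dir: Path) -> list[Path]:
    """Markdown files in doc_dir that should be ingested, sorted for stable ordering."""
    return sorted(p for p in doc_dir.glob("*.md") if p.name not in EXCLUDED)


if __name__ == "__main__":
    for path in corpus_files(Path("data/k8s_docs")):
        print(path.name)
```

Matching on exact filenames keeps the rule easy to audit: adding another curation file means adding one entry to the set.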

## Post-ingest smoke-query validation

Per cross-cutting #8 pilot-first discipline, 5 representative queries
were run against the rebuilt store to confirm retrieval works before
step 5 golden-set authoring:

| Query | Top-1 source | Expected | Verdict |
|---|---|---|---|
| "what is a StatefulSet" | `k8s_statefulset.md` | `k8s_statefulset.md` | ✓ |
| "how does HPA scale replicas" | `k8s_hpa.md` | `k8s_hpa.md` | ✓ |
| "Pod evicted node pressure" | `k8s_pod_lifecycle.md` | `k8s_node_pressure_eviction.md` or `k8s_pod_lifecycle.md` | ✓ (either acceptable; eviction is covered in both) |
| "Service route traffic to Pods" | `k8s_service.md` | `k8s_service.md` | ✓ |
| "enforce Pod Security Standards" | `k8s_pod_security_admission.md` | `k8s_pod_security_admission.md` or `k8s_pod_security_standards.md` | ✓ (PSA is the enforcement mechanism, PSS defines the levels; both are valid hits) |

All 5 return top-1 from an expected page. No unexpected refusals.
No noise from irrelevant pages. The store is ready for step 5.
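The verdict column above amounts to a set-membership check per query, which could be sketched as follows. The `top1_source` callable is a hypothetical stand-in for the real retriever, whose interface is not described here; the demo stub simply replays the top-1 sources recorded in the table.

```python
from typing import Callable

# Queries and acceptable top-1 sources, copied from the smoke-query table.
SMOKE_CASES: dict[str, set[str]] = {
    "what is a StatefulSet": {"k8s_statefulset.md"},
    "how does HPA scale replicas": {"k8s_hpa.md"},
    "Pod evicted node pressure": {"k8s_node_pressure_eviction.md", "k8s_pod_lifecycle.md"},
    "Service route traffic to Pods": {"k8s_service.md"},
    "enforce Pod Security Standards": {
        "k8s_pod_security_admission.md",
        "k8s_pod_security_standards.md",
    },
}


def run_smoke(
    top1_source: Callable[[str], str],
    cases: dict[str, set[str]] = SMOKE_CASES,
) -> dict[str, bool]:
    """Map each query to True when the retriever's top-1 source is an accepted page."""
    return {query: top1_source(query) in expected for query, expected in cases.items()}


if __name__ == "__main__":
    # Stub that replays the "Top-1 source" column instead of querying a real store.
    recorded = {
        "what is a StatefulSet": "k8s_statefulset.md",
        "how does HPA scale replicas": "k8s_hpa.md",
        "Pod evicted node pressure": "k8s_pod_lifecycle.md",
        "Service route traffic to Pods": "k8s_service.md",
        "enforce Pod Security Standards": "k8s_pod_security_admission.md",
    }
    for query, ok in run_smoke(recorded.__getitem__).items():
        print("PASS" if ok else "FAIL", query)
```

Keeping the accepted sources as sets makes the two "either acceptable" rows explicit rather than special-cased.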