Nomearod Claude Opus 4.6 (1M context) committed on
Commit 9dfd3f0 · 1 Parent(s): 3241b7c

fix(ingest): exclude QUESTION_PLAN.md from corpus ingestion

scripts/ingest.py already excluded SOURCES.md and README.md as
version-controlled curation artifacts. QUESTION_PLAN.md (new at
3241b7c) is the same class of artifact — it belongs next to
SOURCES.md as authoring guidance, not in the RAG corpus.

Caught during Week 1 step 4 K8s ingestion: the first make ingest-k8s
run indexed 29 unique sources instead of the expected 28, and the
store contents showed QUESTION_PLAN.md as an ingested source. This
would have surfaced QUESTION_PLAN.md chunks in retrieval results at
query time — the k8s_docs/QUESTION_PLAN.md filename would appear in
citations, which is the wrong shape for the corpus.

Post-fix re-ingest: 28 unique sources, 2447 chunks. Matches the
locked SOURCES.md breakdown.
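
The count check described above (29 sources seen, 28 expected) could be automated along these lines. This is a minimal sketch, not the project's actual tooling: `leaked_curation_files` and `check_ingest` are hypothetical helpers, and the list of source paths is assumed to come from whatever the store exposes, which this commit does not show.

```python
from pathlib import PurePosixPath

# Mirrors the _EXCLUDED set in scripts/ingest.py after this commit.
EXCLUDED = {"SOURCES.md", "QUESTION_PLAN.md", "README.md"}


def leaked_curation_files(sources):
    """Return the sorted, de-duplicated sources whose filename is a curation artifact."""
    return sorted({s for s in sources if PurePosixPath(s).name in EXCLUDED})


def check_ingest(sources, expected_unique):
    """Raise ValueError if a curation file was ingested or the unique count is off."""
    unique = set(sources)
    leaked = leaked_curation_files(unique)
    if leaked:
        raise ValueError(f"curation files ingested: {leaked}")
    if len(unique) != expected_unique:
        raise ValueError(
            f"expected {expected_unique} unique sources, found {len(unique)}"
        )
    return len(unique)
```

For the K8s corpus the call would be something like `check_ingest(sources, 28)`, with `sources` holding one path per stored chunk.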

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. scripts/ingest.py +3 -2
scripts/ingest.py CHANGED
@@ -35,8 +35,9 @@ def ingest(
         sys.exit(1)
 
     # Exclude curation metadata files that live alongside corpus content.
-    # SOURCES.md is a version-controlled curation artifact, not corpus content.
-    _EXCLUDED = {"SOURCES.md", "README.md"}
+    # SOURCES.md and QUESTION_PLAN.md are version-controlled curation
+    # artifacts, not corpus content.
+    _EXCLUDED = {"SOURCES.md", "QUESTION_PLAN.md", "README.md"}
     md_files = sorted(f for f in doc_path.glob("*.md") if f.name not in _EXCLUDED)
     if not md_files:
         print(f"Error: no markdown files found in {doc_dir}")
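
For readers who want to try the filter in isolation, here is a minimal standalone sketch of the same exclusion logic. The function name `corpus_files` and its `excluded` parameter are hypothetical; in the actual script the filter is inline inside `ingest()`.

```python
from pathlib import Path

# Same names as the _EXCLUDED set in the diff above.
_EXCLUDED = frozenset({"SOURCES.md", "QUESTION_PLAN.md", "README.md"})


def corpus_files(doc_dir, excluded=_EXCLUDED):
    """Return sorted *.md paths directly under doc_dir, skipping excluded filenames."""
    return sorted(f for f in Path(doc_dir).glob("*.md") if f.name not in excluded)
```

The match is on the exact filename, so it is case-sensitive: a file named `readme.md` would not be excluded by this set.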