fix(ingest): exclude QUESTION_PLAN.md from corpus ingestion
scripts/ingest.py already excluded SOURCES.md and README.md as
version-controlled curation artifacts. QUESTION_PLAN.md (new at
3241b7c) is the same class of artifact — it belongs next to
SOURCES.md as authoring guidance, not in the RAG corpus.
Caught during Week 1 step 4 K8s ingestion: the first make ingest-k8s
run indexed 29 unique sources instead of the expected 28, and the
store contents showed QUESTION_PLAN.md as an ingested source. This
would have surfaced QUESTION_PLAN.md chunks in retrieval results at
query time — the k8s_docs/QUESTION_PLAN.md filename would appear in
citations, which is the wrong shape for the corpus.
Post-fix re-ingest: 28 unique sources, 2447 chunks. Matches the
locked SOURCES.md breakdown.
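
The count check described above could be automated along these lines. This is a minimal sketch, not code from the repository: check_sources, the chunk record shape, and the "source" metadata key are all hypothetical, and only the excluded filenames come from this commit.

```python
from collections import Counter

# Mirrors the exclusion set in this commit. Everything else here is assumed:
# the store is taken to expose chunk records with a "source" path.
EXCLUDED = {"SOURCES.md", "QUESTION_PLAN.md", "README.md"}


def check_sources(chunks, expected_sources):
    """Return a list of problems found in the ingested chunk metadata."""
    per_source = Counter(chunk["source"] for chunk in chunks)
    problems = []

    # A curation file that shows up as a source has leaked into the corpus.
    leaked = sorted(
        name for name in per_source if name.rsplit("/", 1)[-1] in EXCLUDED
    )
    if leaked:
        problems.append(f"curation files ingested: {leaked}")

    # The unique-source count should match the locked breakdown.
    if len(per_source) != expected_sources:
        problems.append(
            f"expected {expected_sources} unique sources, found {len(per_source)}"
        )
    return problems
```

Calling check_sources(chunks, 28) after a re-ingest would return an empty list when the store matches the expected breakdown, and a non-empty list in the situation that exposed this bug.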
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- scripts/ingest.py +3 -2
@@ -35,8 +35,9 @@ def ingest(
         sys.exit(1)
 
     # Exclude curation metadata files that live alongside corpus content.
-    # SOURCES.md
-
+    # SOURCES.md and QUESTION_PLAN.md are version-controlled curation
+    # artifacts, not corpus content.
+    _EXCLUDED = {"SOURCES.md", "QUESTION_PLAN.md", "README.md"}
     md_files = sorted(f for f in doc_path.glob("*.md") if f.name not in _EXCLUDED)
     if not md_files:
         print(f"Error: no markdown files found in {doc_dir}")
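
To see the filter's effect in isolation, the following sketch runs the same glob-and-exclude expression against a throwaway directory. The corpus_files and demo helpers and the placeholder filenames (pods.md, services.md) are hypothetical; only _EXCLUDED and the sorted(...) expression are taken from the diff.

```python
import tempfile
from pathlib import Path

# Same exclusion set as in the diff above.
_EXCLUDED = {"SOURCES.md", "QUESTION_PLAN.md", "README.md"}


def corpus_files(doc_dir):
    """Return the sorted markdown files in doc_dir, minus curation artifacts."""
    doc_path = Path(doc_dir)
    return sorted(f for f in doc_path.glob("*.md") if f.name not in _EXCLUDED)


def demo():
    """Build a temporary corpus directory and list what would be ingested."""
    with tempfile.TemporaryDirectory() as tmp:
        names = ("pods.md", "services.md", "SOURCES.md", "QUESTION_PLAN.md", "README.md")
        for name in names:
            (Path(tmp) / name).write_text("# placeholder\n")
        return [f.name for f in corpus_files(tmp)]


if __name__ == "__main__":
    print(demo())
```

The match is on the exact filename, so it is case-sensitive: a file named readme.md would not be excluded by this set.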