Nomearod Claude Opus 4.6 (1M context) committed on
Commit 5fa0933 · 1 Parent(s): 55afe8a

docs: cold-start measurement + falsified-assumption finding + v1.1 contingency

HF Spaces cold-start measured at N=3 samples on 2026-04-15: median
113 s, range 89-129 s. Both tails fire the 60 s pre-committed
contingency gate from the preceding "Cold-start contingency" entry.

Key finding: the pre-committed fix (lazy-load K8s corpus) was written
under an assumption about cost attribution: that corpus loading is the
dominant cold-start cost. Measurement falsified this. K8s corpus load
contributes <1 s of the 113 s median; the dominant cost is ~90 s of
silent Python init (interpreter start + module imports + initial
model weight loading). Executing the original fix would save ~1 % of
the overshoot.

Action taken: document the falsified assumption, accept the measured
baseline, re-pre-commit a new contingency targeting the actual
dominant cost (lazy-load cross-encoder + embedder + injection
classifier to first-relevant-request, traffic-justified trigger) so
the decision is not relitigated at review time. The contingency's
trigger threshold is deliberately left unnamed in advance: the entire
point of this entry is that estimate-justified triggers invite the
same falsification pattern.

Methodology lesson: honoring a pre-committed gate means honoring its
intent (prevent motivated reasoning about pass/fail), not mechanically
executing the pre-committed recipe regardless of whether its
empirical premise has survived measurement.

Raw measurement logs preserved under measurements/ so the specific
quantitative claims in the DECISIONS.md entry can be cross-checked
against the underlying evidence without re-running.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

DECISIONS.md CHANGED
@@ -608,6 +608,127 @@ numbers needed to make the decision; the decision is documented here
  so future-me remembers the threshold and doesn't optimize prematurely
  on a hunch.
 
+ ## Cold-start gate fired: assumption falsified, fix deferred to v1.1 at the right cause
+
+ The preceding "Cold-start contingency" entry pre-committed a lazy-load
+ fix (FastAPI eager, K8s lazy on first request) if the measured cold
+ start exceeded 60 seconds. Measurement falsified the entry's core
+ assumption that corpus loading dominates cold start: **corpus loading
+ is not the dominant cold-start cost**. The committed fix addresses
+ ~1 % of the observed overshoot. Executing it verbatim would honor the
+ gate's letter but not its intent: theater dressed as discipline. This
+ entry documents the measurement, the falsified assumption, and the new
+ contingency pre-committed at the actual cause.
+
+ **Measurement (N=3, 2026-04-15, HF Spaces target deployment):**
+
+ | Sample | Cold start | Silent Python init | Visible phase |
+ |---|---|---|---|
+ | N=1 | 113 s | ~101 s | ~12 s |
+ | N=2 | 89 s | ~70 s | ~19 s |
+ | N=3 | 129 s | ~115 s | ~14 s |
+
+ - Median 113 s, mean ~110 s, range 89–129 s (spread ~40 s)
+ - **Gate fire is unambiguous at both tails.** Even the fastest sample
+   (89 s) is ~48 % over the 60 s threshold; the slowest (129 s) is
+   ~115 % over. No boundary ambiguity.
+ - **Sample-size justification.** N=3 is acknowledged as a small sample.
+   It is adequate here because (a) the gate-fire conclusion is stable
+   across both tails, (b) the "silent Python init dominates variance"
+   finding is stable across all three samples (silent phase varies
+   70 → 115 s across runs; visible phase varies only 12 → 19 s), and
+   (c) the cost of additional samples (manual HF Space restart + ~2 min
+   wait + log extraction per sample) exceeds the marginal information
+   gain once both tails fire the gate and the variance pattern is stable.
+   A fourth sample would tighten the confidence interval on the median
+   but would not change either the gate-fire conclusion or the
+   falsified-assumption finding.
+ - **Variance source named.** Most likely HF Spaces shared-infrastructure
+   CPU / IO contention during Python module imports. The silent-init
+   phase varies 45 s across samples (70 → 115 s); the visible phase is
+   stable (12–19 s). That pattern is consistent with host-level
+   contention on a shared physical node rather than code-level
+   variability. An exclusively-owned container would plausibly show a
+   tighter bound.
+ - **Raw log captures** (preserved so this entry can be cross-checked
+   against the underlying evidence without re-running the measurement):
+   `measurements/2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log`.
+
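The summary figures in the measurement list can be re-derived from the three cold-start totals in the table. The following is a minimal sketch of that arithmetic; the names `cold_starts`, `GATE_S`, and `pct_over` are illustrative and do not come from the repository.

```python
from statistics import mean, median

# Cold-start totals in seconds, copied from the measurement table.
cold_starts = [113, 89, 129]
GATE_S = 60  # pre-committed contingency threshold, in seconds


def pct_over(sample_s: float, gate_s: float = GATE_S) -> float:
    """Return how far a sample exceeds the gate, as a percentage of the gate."""
    return (sample_s - gate_s) / gate_s * 100


summary = {
    "median_s": median(cold_starts),
    "mean_s": round(mean(cold_starts), 1),
    "spread_s": max(cold_starts) - min(cold_starts),
    "fastest_pct_over": round(pct_over(min(cold_starts))),
    "slowest_pct_over": round(pct_over(max(cold_starts))),
    "gate_fired_on_all": all(s > GATE_S for s in cold_starts),
}
print(summary)
```

Because every sample exceeds the gate, the pass/fail conclusion does not depend on which summary statistic is chosen.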
+ **Where the cost lives.** At the median (113 s):
+
+ - **Silent Python init phase, ~90 s (≈ 80 % of total):** interpreter
+   start, module imports (`torch`, `transformers`, `langchain`, `faiss`,
+   `fastapi`, `httpx`, the full dependency closure), and initial model
+   weight loading (`all-MiniLM-L6-v2` embedder, cross-encoder
+   reranker). Not logged: there is no observability inside the import
+   chain.
+ - **Visible startup phase, ~15 s (≈ 15 % of total):** injection
+   classifier init (~10 s, includes the "classifier skipped" warning),
+   FastAPI corpus load (< 1 s, +0.9 MB RSS), K8s corpus load (< 1 s,
+   +25.8 MB RSS), reranker warmup (~2 s).
+
+ **The K8s corpus load, which the pre-committed fix was designed to
+ defer, contributes under 1 second of the 113-second median.**
+ Deferring it saves roughly 1 % of the overshoot. FastAPI corpus load
+ is the same order of magnitude. Corpus loading is simply not where the
+ cost lives on this deployment.
+
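Since the silent phase is unlogged, one way to get per-module attribution is to time the imports directly. The sketch below is an assumption about how that could be done, not something the repository is known to contain: `time_imports` is a hypothetical helper, and the standard-library modules are stand-ins for the heavy dependencies named in the entry. CPython's built-in `python -X importtime` option gives a finer-grained, per-module breakdown without any code changes.

```python
import importlib
import sys
import time


def time_imports(module_names):
    """Import each module in order and return {name: (seconds, was_cached)}.

    Only first-time imports are informative: a module already present in
    sys.modules returns almost immediately. Time spent in transitive
    imports is attributed to whichever listed module triggers them first.
    """
    timings = {}
    for name in module_names:
        was_cached = name in sys.modules
        start = time.perf_counter()
        importlib.import_module(name)
        timings[name] = (time.perf_counter() - start, was_cached)
    return timings


# Stand-in module list so the sketch runs anywhere; on the deployment the
# list would name the heavy dependencies instead.
results = time_imports(["json", "decimal", "csv"])
for name, (seconds, was_cached) in results.items():
    print(f"{name}: {seconds * 1000:.2f} ms (already loaded: {was_cached})")
```

Logging these timings at startup would turn the silent phase into a visible one, which is what any later lazy-loading decision would need to be measured against.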
+ **Why we are not executing the pre-committed fix.** The preceding
+ contingency was written under an empirical assumption about cost
+ attribution (corpus loading is the dominant cost). Measurement
+ falsified the assumption. Implementing the fix anyway would be a
+ mechanical execution of a recipe whose premise has been disproven:
+ it checks the gate-honoring box while failing to address the cause.
+ That is structurally identical to relaxing-by-redefinition ("60 s was
+ too tight"), just in the opposite direction: **relaxing by execution**.
+ The pre-commitment rule's purpose is to prevent motivated reasoning
+ about the gate, not to mandate mechanical compliance with a recipe
+ whose empirical foundation has collapsed.
+
+ The honest action is (1) accept the measurement as the v1 baseline,
+ (2) document the falsified assumption explicitly (this entry),
+ (3) re-pre-commit a new contingency at the actual dominant cost with
+ an explicit trigger condition so the decision is not relitigated at
+ review time, and (4) update the user-facing README surface to reflect
+ the measured cold-wake number rather than the optimistic pre-deploy
+ estimate.
+
+ **v1.1 contingency (pre-committed):**
+
+ > **If HF Spaces traffic produces more than N cold wakes per day**
+ > (N to be determined from observed usage patterns after launch, **not
+ > estimated in advance**), defer eager loading of (a) the cross-encoder
+ > reranker, (b) the sentence-transformers embedder, and (c) the
+ > injection classifier tier to first-relevant-request.
+ >
+ > **Estimated work:** 4–6 hours (lazy-init wrappers + first-request
+ > caching + integration tests for the warm/cold transition).
+ >
+ > **Expected tradeoff:** cold wake ~113 s → ~50–60 s (approaches the
+ > original 60 s target); **first request after any cold wake incurs
+ > +8–15 s** additional latency (model weights load synchronously in
+ > the request path), after which subsequent warm requests return to
+ > normal ~5 s latency.
+ >
+ > **Trigger is usage-justified, not estimate-justified.** Until real
+ > traffic data justifies the work, there is nothing to optimize: a
+ > recruiter demo that gets one cold wake per day does not pay for
+ > 4–6 hours of engineering plus the new first-request-latency failure
+ > mode. The trigger threshold N is left unnamed deliberately: naming a
+ > number in advance would invite the same falsification pattern this
+ > entry is documenting.
+
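The "lazy-init wrappers + first-request caching" item in the contingency could take roughly the following shape. This is a sketch under stated assumptions, not the project's implementation: `LazyResource` and `load_reranker_stub` are hypothetical names, and the stub stands in for the real model-weight load so the example stays self-contained.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Defer an expensive load until first use, then cache the result."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._value: Optional[T] = None
        self._loaded = False

    def get(self) -> T:
        # Double-checked locking: once loaded, callers skip the lock.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]


load_calls = []


def load_reranker_stub() -> str:
    """Hypothetical stand-in for loading the cross-encoder weights."""
    load_calls.append("reranker")
    return "reranker-ready"


# Nothing is loaded at startup; the first get() pays the load cost.
reranker = LazyResource(load_reranker_stub)
first = reranker.get()
second = reranker.get()
print(first, second, load_calls)
```

The lock matters because concurrent first requests after a cold wake would otherwise each trigger their own load; with it, one request pays the latency and the rest wait for the cached value.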
+ **Methodology lesson.** When a pre-committed contingency is written
+ under an empirical assumption, the contingency only holds if the
+ assumption survives measurement. If measurement falsifies the
+ assumption, the correct action is to document the falsification,
+ accept the observed baseline, and re-pre-commit at the actual cause.
+ The wrong action is to execute the original recipe anyway, which
+ trades one form of motivated reasoning (threshold relaxation) for
+ another (recipe compliance). The underlying discipline, "pre-commit
+ your gates and honor them," does not mean "mechanically run the
+ pre-committed fix regardless of what it addresses." It means "honor
+ the gate's *intent*, which is to prevent motivated reasoning about
+ pass/fail."
+
  ## False-premise questions come in two flavors
 
  When authoring golden-dataset questions whose premise is wrong, the
measurements/2026-04-15-coldstart-n1.log ADDED
@@ -0,0 +1,26 @@
+ # HF Spaces cold-start measurement N=1
+ # Source: HF Spaces runtime log, container startup trace
+ # Container: huggingface.co/spaces/Nomearod/agentbench
+ # Build: first v1 deploy (commit 6955d72 on hf/main, main at 8974e47 + frontmatter injection)
+ # Cold-start delta: 09:40:34 − 09:38:41 = 113 seconds
+
+ ===== Application Startup at 2026-04-15 09:38:41 =====
+
+ 2026-04-15 09:40:22 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
+ 2026-04-15 09:40:22 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
+ 2026-04-15 09:40:31 [warning ] audit_hmac_key_missing msg='No HMAC key provided; using random per-process key. IP hashes will not be stable across restarts or instances. Set AUDIT_HMAC_KEY env var or pass hmac_key for stable audit correlation.'
+ 2026-04-15 09:40:31 [info ] corpus_loaded label='FastAPI Docs' name=fastapi providers=['openai', 'anthropic'] rss_delta_mb=0.9 rss_mb=810.8 store_path=.cache/store
+ 2026-04-15 09:40:31 [info ] corpus_loaded label=Kubernetes name=k8s providers=['openai', 'anthropic'] rss_delta_mb=25.8 rss_mb=835.6 store_path=.cache/store_k8s
+ 2026-04-15 09:40:31 [info ] multi_corpus_mode corpora=['fastapi', 'k8s'] default=fastapi providers=['openai', 'anthropic']
+ INFO: Started server process [1]
+ INFO: Waiting for application startup.
+ 2026-04-15 09:40:31 [info ] warmup_start
+ 2026-04-15 09:40:32 [info ] reranker_loading model=cross-encoder/ms-marco-MiniLM-L-6-v2
+ 2026-04-15 09:40:34 [info ] warmup_complete
+ INFO: Application startup complete.
+ INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)
+
+ # Phase breakdown:
+ # Silent Python init: 09:38:41 → 09:40:22 = 101s (interpreter start, module imports, initial model weights)
+ # Visible phase: 09:40:22 → 09:40:34 = 12s (injection classifier warnings + corpus loads + reranker warmup)
+ # Cold-start total: 113s
measurements/2026-04-15-coldstart-n2.log ADDED
@@ -0,0 +1,26 @@
+ # HF Spaces cold-start measurement N=2
+ # Source: HF Spaces runtime log, container startup trace
+ # Container: huggingface.co/spaces/Nomearod/agentbench
+ # Build: post-audit-path fix rebuild (commit 55afe8a on hf/main)
+ # Cold-start delta: 10:37:59 − 10:36:30 = 89 seconds
+
+ ===== Application Startup at 2026-04-15 10:36:30 =====
+
+ 2026-04-15 10:37:40 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
+ 2026-04-15 10:37:40 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
+ 2026-04-15 10:37:57 [warning ] audit_hmac_key_missing msg='No HMAC key provided; using random per-process key. IP hashes will not be stable across restarts or instances. Set AUDIT_HMAC_KEY env var or pass hmac_key for stable audit correlation.'
+ 2026-04-15 10:37:57 [info ] corpus_loaded label='FastAPI Docs' name=fastapi providers=['openai', 'anthropic'] rss_delta_mb=0.9 rss_mb=811.5 store_path=.cache/store
+ 2026-04-15 10:37:57 [info ] corpus_loaded label=Kubernetes name=k8s providers=['openai', 'anthropic'] rss_delta_mb=25.8 rss_mb=836.3 store_path=.cache/store_k8s
+ 2026-04-15 10:37:57 [info ] multi_corpus_mode corpora=['fastapi', 'k8s'] default=fastapi providers=['openai', 'anthropic']
+ INFO: Started server process [1]
+ INFO: Waiting for application startup.
+ 2026-04-15 10:37:57 [info ] warmup_start
+ 2026-04-15 10:37:57 [info ] reranker_loading model=cross-encoder/ms-marco-MiniLM-L-6-v2
+ 2026-04-15 10:37:59 [info ] warmup_complete
+ INFO: Application startup complete.
+ INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)
+
+ # Phase breakdown:
+ # Silent Python init: 10:36:30 → 10:37:40 = 70s (interpreter start, module imports, initial model weights)
+ # Visible phase: 10:37:40 → 10:37:59 = 19s (injection classifier warnings + corpus loads + reranker warmup)
+ # Cold-start total: 89s
measurements/2026-04-15-coldstart-n3.log ADDED
@@ -0,0 +1,26 @@
+ # HF Spaces cold-start measurement N=3
+ # Source: HF Spaces runtime log, container startup trace
+ # Container: huggingface.co/spaces/Nomearod/agentbench
+ # Build: same image as N=2 (commit 55afe8a on hf/main); manual restart via HF Space settings
+ # Cold-start delta: 10:49:07 − 10:46:58 = 129 seconds
+
+ ===== Application Startup at 2026-04-15 10:46:58 =====
+
+ 2026-04-15 10:48:53 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
+ 2026-04-15 10:48:53 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
+ 2026-04-15 10:49:05 [warning ] audit_hmac_key_missing msg='No HMAC key provided; using random per-process key. IP hashes will not be stable across restarts or instances. Set AUDIT_HMAC_KEY env var or pass hmac_key for stable audit correlation.'
+ 2026-04-15 10:49:05 [info ] corpus_loaded label='FastAPI Docs' name=fastapi providers=['openai', 'anthropic'] rss_delta_mb=1.0 rss_mb=811.7 store_path=.cache/store
+ 2026-04-15 10:49:05 [info ] corpus_loaded label=Kubernetes name=k8s providers=['openai', 'anthropic'] rss_delta_mb=25.8 rss_mb=836.5 store_path=.cache/store_k8s
+ 2026-04-15 10:49:05 [info ] multi_corpus_mode corpora=['fastapi', 'k8s'] default=fastapi providers=['openai', 'anthropic']
+ INFO: Started server process [1]
+ INFO: Waiting for application startup.
+ 2026-04-15 10:49:05 [info ] warmup_start
+ 2026-04-15 10:49:06 [info ] reranker_loading model=cross-encoder/ms-marco-MiniLM-L-6-v2
+ 2026-04-15 10:49:07 [info ] warmup_complete
+ INFO: Application startup complete.
+ INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)
+
+ # Phase breakdown:
+ # Silent Python init: 10:46:58 → 10:48:53 = 115s (interpreter start, module imports, initial model weights)
+ # Visible phase: 10:48:53 → 10:49:07 = 14s (injection classifier warnings + corpus loads + reranker warmup)
+ # Cold-start total: 129s
measurements/README.md ADDED
@@ -0,0 +1,14 @@
+ # measurements/
+
+ Raw measurement artifacts referenced from DECISIONS.md entries.
+
+ Each file is the raw observation (log snippet, trace, or metric dump)
+ that backs a specific quantitative claim in DECISIONS.md. Keeping the
+ raw data here lets a future reader cross-check any DECISIONS.md number
+ against its underlying evidence without having to re-run the
+ measurement or trust the narrative summary.
+
+ Naming: `YYYY-MM-DD-<topic>-<variant>.log`
+
+ Current entries:
+ - `2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log`: HF Spaces cold-start samples N=1..3. Backs the DECISIONS.md entry "Cold-start gate fired: assumption falsified, fix deferred to v1.1 at the right cause."
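The cross-check this README describes can be done mechanically from the log timestamps. The sketch below is illustrative only: `phase_breakdown` is a hypothetical helper, and `SAMPLE_N1` is an abridged excerpt of the N=1 log (the startup banner, the first visible event, and the warmup events) embedded so the example runs without the file.

```python
import re
from datetime import datetime

TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
FMT = "%Y-%m-%d %H:%M:%S"


def phase_breakdown(log_text: str) -> dict:
    """Derive silent, visible, and total seconds from one startup log."""
    banner = re.search(rf"Application Startup at ({TS})", log_text)
    # Tolerate an optional leading "+" so diff-formatted logs also parse.
    events = re.findall(
        rf"^\+?[ \t]*({TS}) \[\w+\s*\] (\w+)", log_text, flags=re.MULTILINE
    )
    done_ts = next((ts for ts, ev in events if ev == "warmup_complete"), None)
    if banner is None or not events or done_ts is None:
        raise ValueError("log lacks a startup banner, events, or warmup_complete")
    start = datetime.strptime(banner.group(1), FMT)
    first_visible = datetime.strptime(events[0][0], FMT)
    done = datetime.strptime(done_ts, FMT)
    return {
        "silent_s": int((first_visible - start).total_seconds()),
        "visible_s": int((done - first_visible).total_seconds()),
        "total_s": int((done - start).total_seconds()),
    }


# Abridged excerpt of measurements/2026-04-15-coldstart-n1.log.
SAMPLE_N1 = """\
===== Application Startup at 2026-04-15 09:38:41 =====

2026-04-15 09:40:22 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
2026-04-15 09:40:31 [info ] warmup_start
2026-04-15 09:40:32 [info ] reranker_loading model=cross-encoder/ms-marco-MiniLM-L-6-v2
2026-04-15 09:40:34 [info ] warmup_complete
"""

breakdown = phase_breakdown(SAMPLE_N1)
print(breakdown)
```

Running the same function over the full contents of each log file should reproduce the phase-breakdown comments at the bottom of that file.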