docs: cold-start measurement + falsified-assumption finding + v1.1 contingency
HF Spaces cold-start measured at N=3 samples on 2026-04-15: median
113 s, range 89-129 s. Both tails fire the 60 s pre-committed
contingency gate from the preceding "Cold-start contingency" entry.
Key finding: the pre-committed fix (lazy-load K8s corpus) was written
under an assumption about cost attribution: that corpus loading is the
dominant cold-start cost. Measurement falsified this. K8s corpus load
contributes <1 s of the 113 s median; the dominant cost is ~100 s of
silent Python init (interpreter start + module imports + initial
model weight loading). Executing the original fix would save ~1 % of
the overshoot.
Action taken: document the falsified assumption, accept the measured
baseline, re-pre-commit a new contingency targeting the actual
dominant cost (lazy-load cross-encoder + embedder + injection
classifier to first-relevant-request, traffic-justified trigger) so
the decision is not relitigated at review time. The contingency
trigger is deliberately left unnamed in advance: the entire point
of this entry is that estimate-justified triggers invite the same
falsification pattern.
Methodology lesson: honoring a pre-committed gate means honoring its
intent (prevent motivated reasoning about pass/fail), not mechanically
executing the pre-committed recipe regardless of whether its
empirical premise has survived measurement.
Raw measurement logs preserved under measurements/ so the specific
quantitative claims in the DECISIONS.md entry can be cross-checked
against the underlying evidence without re-running.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DECISIONS.md +121 -0
- measurements/2026-04-15-coldstart-n1.log +26 -0
- measurements/2026-04-15-coldstart-n2.log +26 -0
- measurements/2026-04-15-coldstart-n3.log +26 -0
- measurements/README.md +14 -0
@@ -608,6 +608,127 @@ numbers needed to make the decision; the decision is documented here
so future-me remembers the threshold and doesn't optimize prematurely
on a hunch.

## Cold-start gate fired: assumption falsified, fix deferred to v1.1 at the right cause

The preceding "Cold-start contingency" entry pre-committed a lazy-load
fix (FastAPI eager, K8s lazy on first request) if the measured cold
start exceeded 60 seconds. Measurement falsified the entry's core
assumption: **corpus loading is not the dominant cold-start cost**.
The committed fix addresses ~1 % of the observed overshoot. Executing
it verbatim would honor the gate's letter but not its intent: theater
dressed as discipline. This entry documents the measurement, the
falsified assumption, and the new contingency pre-committed at the
actual cause.

**Measurement (N=3, 2026-04-15, HF Spaces target deployment):**

| Sample | Cold start | Silent Python init | Visible phase |
|---|---|---|---|
| N=1 | 113 s | ~101 s | ~12 s |
| N=2 | 89 s | ~70 s | ~19 s |
| N=3 | 129 s | ~115 s | ~14 s |

- Median 113 s, mean ~110 s, range 89-129 s (spread ~40 s)
- **Gate fire is unambiguous at both tails.** Even the fastest sample
  (89 s) is ~48 % over the 60 s threshold; the slowest (129 s) is
  ~115 % over. No boundary ambiguity.
- **Sample-size justification.** N=3 is acknowledged as a small sample.
  It is adequate here because (a) the gate-fire conclusion is stable
  across both tails, (b) the "silent Python init dominates variance"
  finding is stable across all three samples (silent phase varies
  70 to 115 s across runs; visible phase varies only 12 to 19 s), and
  (c) the cost of additional samples (manual HF Space restart + ~2 min
  wait + log extraction per sample) exceeds the marginal information
  gain once both tails fire the gate and the variance pattern is stable.
  N=4 would tighten the confidence interval on the median but does not
  change either the gate-fire conclusion or the falsified-assumption
  finding.
- **Variance source named.** HF Spaces shared-infrastructure CPU / IO
  contention during Python module imports. The silent-init phase
  varies 45 s across samples (70 to 115 s); the visible phase is stable
  (12-19 s). That is the signature of host-level contention on a
  shared physical node, not code-level variability. An
  exclusively-owned container would plausibly show a tighter bound.
- **Raw log captures** (preserved so this entry can be cross-checked
  against the underlying evidence without re-running the measurement):
  `measurements/2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log`.
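
The summary figures in the bullets above can be re-derived from the three cold-start totals. The following is a minimal sketch, assuming only the totals from the table and the 60 s gate; the names `GATE_S`, `cold_starts_s`, and `overshoot_pct` are illustrative and not part of the project code.

```python
from statistics import mean, median

GATE_S = 60  # pre-committed cold-start threshold, in seconds
cold_starts_s = [113, 89, 129]  # N=1, N=2, N=3 totals from the table


def overshoot_pct(sample_s: float, gate_s: float = GATE_S) -> float:
    """Percentage by which a sample exceeds the gate."""
    return (sample_s - gate_s) / gate_s * 100


summary = {
    "median_s": median(cold_starts_s),
    "mean_s": round(mean(cold_starts_s), 1),
    "range_s": (min(cold_starts_s), max(cold_starts_s)),
    "spread_s": max(cold_starts_s) - min(cold_starts_s),
    # The gate fires only if every sample, including the fastest, exceeds it.
    "gate_fired_all": all(s > GATE_S for s in cold_starts_s),
    "fastest_overshoot_pct": round(overshoot_pct(min(cold_starts_s))),
    "slowest_overshoot_pct": round(overshoot_pct(max(cold_starts_s))),
}
print(summary)
```

Checking the fastest sample against the gate is what makes the "both tails" claim independent of the small sample size: if the minimum is over the threshold, so is every other sample.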

**Where the cost lives.** In the median sample (N=1, 113 s):

- **Silent Python init phase: ~101 s (≈ 89 % of total):** interpreter
  start, module imports (`torch`, `transformers`, `langchain`, `faiss`,
  `fastapi`, `httpx`, the full dependency closure), and initial model
  weight loading (`all-MiniLM-L6-v2` embedder, cross-encoder
  reranker). Not logged: there is no observability inside the import chain.
- **Visible startup phase: ~12 s (≈ 11 % of total):** injection
  classifier init (~10 s, includes the "classifier skipped" warning),
  FastAPI corpus load (< 1 s, +0.9 MB RSS), K8s corpus load (< 1 s,
  +25.8 MB RSS), reranker warmup (~2 s).
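
Because the silent phase is unlogged, its internal split between imports and weight loading is not visible in these measurements. One way to get that visibility is to time each import explicitly. The sketch below is an assumption about how that could be done, not something the project currently does: `timed_import` is a hypothetical helper, and the module list uses standard-library stand-ins so the block runs without the heavy dependencies. CPython's built-in `python -X importtime` flag gives a similar per-module breakdown without code changes.

```python
import importlib
import time


def timed_import(module_name: str) -> tuple[object, float]:
    """Import a module and return it with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    module = importlib.import_module(module_name)
    return module, time.perf_counter() - start


# Stand-in module names so the sketch runs anywhere; on the real
# deployment these would be the heavy imports, e.g. "torch".
MODULES = ["json", "decimal", "csv"]

timings = {}
for name in MODULES:
    # A module that is already imported returns almost instantly, so
    # the numbers are only meaningful in a fresh interpreter.
    _, elapsed = timed_import(name)
    timings[name] = elapsed

# Emit slowest-first so the dominant import is the first log line.
for name, elapsed in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"import_timing module={name} seconds={elapsed:.3f}")
```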

**The K8s corpus load, which the pre-committed fix was designed to
defer, contributes under 1 second of the 113-second median.**
Deferring it saves roughly 1 % of the overshoot. FastAPI corpus load
is the same order of magnitude. Corpus loading is simply not where the
cost lives on this deployment.

**Why we are not executing the pre-committed fix.** The preceding
contingency was written under an empirical assumption about cost
attribution (corpus loading is the dominant cost). Measurement
falsified the assumption. Implementing the fix anyway would be a
mechanical execution of a recipe whose premise has been disproven:
it checks the gate-honoring box while failing to address the cause.
That is structurally identical to relaxing-by-redefinition ("60 s was
too tight"), just in the opposite direction: **relaxing by execution**.
The pre-commitment rule's purpose is to prevent motivated reasoning
about the gate, not to mandate mechanical compliance with a recipe
whose empirical foundation has collapsed.

The honest action is (1) accept the measurement as the v1 baseline,
(2) document the falsified assumption explicitly (this entry),
(3) re-pre-commit a new contingency at the actual dominant cost with
an explicit trigger condition so the decision is not relitigated at
review time, and (4) update the user-facing README surface to reflect
the measured cold-wake number rather than the optimistic pre-deploy
estimate.

**v1.1 contingency (pre-committed):**

> **If HF Spaces traffic produces more than N cold wakes per day**
> (N to be determined from observed usage patterns after launch, **not
> estimated in advance**), defer eager loading of (a) the cross-encoder
> reranker, (b) the sentence-transformers embedder, and (c) the
> injection classifier tier to first-relevant-request.
>
> **Estimated work:** 4-6 hours (lazy-init wrappers + first-request
> caching + integration tests for the warm/cold transition).
>
> **Expected tradeoff:** cold wake ~113 s → ~50-60 s (approaches the
> original 60 s target); **first request after any cold wake incurs
> +8-15 s** additional latency (model weights load synchronously in
> the request path), after which subsequent warm requests return to
> normal ~5 s latency.
>
> **Trigger is usage-justified, not estimate-justified.** Until real
> traffic data justifies the work, there is nothing to optimize: a
> recruiter demo that gets one cold wake per day does not pay for
> 4-6 hours of engineering plus the new first-request-latency failure
> mode. The trigger threshold N is left unnamed deliberately: naming a
> number in advance would invite the same falsification pattern this
> entry is documenting.
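
The "lazy-init wrappers + first-request caching" named in the contingency could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: `LazyResource` is a hypothetical class, and `load_reranker` is a stand-in that returns a string instead of loading real cross-encoder weights.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Defer an expensive load until first use, then cache the result."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        # Fast path: already loaded, no lock needed.
        if self._loaded:
            return self._value  # type: ignore[return-value]
        # Slow path: one thread runs the loader; others wait on the lock
        # and then see the cached value.
        with self._lock:
            if not self._loaded:
                self._value = self._loader()
                self._loaded = True
        return self._value  # type: ignore[return-value]


load_calls = []


def load_reranker() -> str:
    # Stand-in for loading the cross-encoder weights (hypothetical).
    load_calls.append("reranker")
    return "reranker-model"


reranker = LazyResource(load_reranker)

# Nothing is loaded at startup; the first request pays the load cost.
calls_at_startup = list(load_calls)
first = reranker.get()
second = reranker.get()
print(calls_at_startup, first, second, load_calls)
```

The first `get()` call is where the extra first-request latency described in the tradeoff would land; every later call returns the cached object.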

**Methodology lesson.** When a pre-committed contingency is written
under an empirical assumption, the contingency only holds if the
assumption survives measurement. If measurement falsifies the
assumption, the correct action is to document the falsification,
accept the observed baseline, and re-pre-commit at the actual cause.
The wrong action is to execute the original recipe anyway, which
trades one form of motivated reasoning (threshold relaxation) for
another (recipe compliance). The underlying discipline, "pre-commit
your gates and honor them", does not mean "mechanically run the
pre-committed fix regardless of what it addresses." It means "honor
the gate's *intent*, which is to prevent motivated reasoning about
pass/fail."

## False-premise questions come in two flavors

When authoring golden-dataset questions whose premise is wrong, the
@@ -0,0 +1,26 @@
# HF Spaces cold-start measurement N=1
# Source: HF Spaces runtime log, container startup trace
# Container: huggingface.co/spaces/Nomearod/agentbench
# Build: first v1 deploy (commit 6955d72 on hf/main, main at 8974e47 + frontmatter injection)
# Cold-start delta: 09:40:34 - 09:38:41 = 113 seconds

===== Application Startup at 2026-04-15 09:38:41 =====

2026-04-15 09:40:22 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
2026-04-15 09:40:22 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
2026-04-15 09:40:31 [warning ] audit_hmac_key_missing msg='No HMAC key provided; using random per-process key. IP hashes will not be stable across restarts or instances. Set AUDIT_HMAC_KEY env var or pass hmac_key for stable audit correlation.'
2026-04-15 09:40:31 [info ] corpus_loaded label='FastAPI Docs' name=fastapi providers=['openai', 'anthropic'] rss_delta_mb=0.9 rss_mb=810.8 store_path=.cache/store
2026-04-15 09:40:31 [info ] corpus_loaded label=Kubernetes name=k8s providers=['openai', 'anthropic'] rss_delta_mb=25.8 rss_mb=835.6 store_path=.cache/store_k8s
2026-04-15 09:40:31 [info ] multi_corpus_mode corpora=['fastapi', 'k8s'] default=fastapi providers=['openai', 'anthropic']
INFO: Started server process [1]
INFO: Waiting for application startup.
2026-04-15 09:40:31 [info ] warmup_start
2026-04-15 09:40:32 [info ] reranker_loading model=cross-encoder/ms-marco-MiniLM-L-6-v2
2026-04-15 09:40:34 [info ] warmup_complete
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)

# Phase breakdown:
# Silent Python init: 09:38:41 → 09:40:22 = 101s (interpreter start, module imports, initial model weights)
# Visible phase: 09:40:22 → 09:40:34 = 12s (injection classifier warnings + corpus loads + reranker warmup)
# Cold-start total: 113s
@@ -0,0 +1,26 @@
# HF Spaces cold-start measurement N=2
# Source: HF Spaces runtime log, container startup trace
# Container: huggingface.co/spaces/Nomearod/agentbench
# Build: post-audit-path fix rebuild (commit 55afe8a on hf/main)
# Cold-start delta: 10:37:59 - 10:36:30 = 89 seconds

===== Application Startup at 2026-04-15 10:36:30 =====

2026-04-15 10:37:40 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
2026-04-15 10:37:40 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
2026-04-15 10:37:57 [warning ] audit_hmac_key_missing msg='No HMAC key provided; using random per-process key. IP hashes will not be stable across restarts or instances. Set AUDIT_HMAC_KEY env var or pass hmac_key for stable audit correlation.'
2026-04-15 10:37:57 [info ] corpus_loaded label='FastAPI Docs' name=fastapi providers=['openai', 'anthropic'] rss_delta_mb=0.9 rss_mb=811.5 store_path=.cache/store
2026-04-15 10:37:57 [info ] corpus_loaded label=Kubernetes name=k8s providers=['openai', 'anthropic'] rss_delta_mb=25.8 rss_mb=836.3 store_path=.cache/store_k8s
2026-04-15 10:37:57 [info ] multi_corpus_mode corpora=['fastapi', 'k8s'] default=fastapi providers=['openai', 'anthropic']
INFO: Started server process [1]
INFO: Waiting for application startup.
2026-04-15 10:37:57 [info ] warmup_start
2026-04-15 10:37:57 [info ] reranker_loading model=cross-encoder/ms-marco-MiniLM-L-6-v2
2026-04-15 10:37:59 [info ] warmup_complete
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)

# Phase breakdown:
# Silent Python init: 10:36:30 → 10:37:40 = 70s (interpreter start, module imports, initial model weights)
# Visible phase: 10:37:40 → 10:37:59 = 19s (injection classifier warnings + corpus loads + reranker warmup)
# Cold-start total: 89s
@@ -0,0 +1,26 @@
# HF Spaces cold-start measurement N=3
# Source: HF Spaces runtime log, container startup trace
# Container: huggingface.co/spaces/Nomearod/agentbench
# Build: same image as N=2 (commit 55afe8a on hf/main); manual restart via HF Space settings
# Cold-start delta: 10:49:07 - 10:46:58 = 129 seconds

===== Application Startup at 2026-04-15 10:46:58 =====

2026-04-15 10:48:53 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
2026-04-15 10:48:53 [warning ] injection_classifier_no_url msg="Tier 'classifier' configured but classifier_url is empty; classifier tier will be skipped at runtime."
2026-04-15 10:49:05 [warning ] audit_hmac_key_missing msg='No HMAC key provided; using random per-process key. IP hashes will not be stable across restarts or instances. Set AUDIT_HMAC_KEY env var or pass hmac_key for stable audit correlation.'
2026-04-15 10:49:05 [info ] corpus_loaded label='FastAPI Docs' name=fastapi providers=['openai', 'anthropic'] rss_delta_mb=1.0 rss_mb=811.7 store_path=.cache/store
2026-04-15 10:49:05 [info ] corpus_loaded label=Kubernetes name=k8s providers=['openai', 'anthropic'] rss_delta_mb=25.8 rss_mb=836.5 store_path=.cache/store_k8s
2026-04-15 10:49:05 [info ] multi_corpus_mode corpora=['fastapi', 'k8s'] default=fastapi providers=['openai', 'anthropic']
INFO: Started server process [1]
INFO: Waiting for application startup.
2026-04-15 10:49:05 [info ] warmup_start
2026-04-15 10:49:06 [info ] reranker_loading model=cross-encoder/ms-marco-MiniLM-L-6-v2
2026-04-15 10:49:07 [info ] warmup_complete
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)

# Phase breakdown:
# Silent Python init: 10:46:58 → 10:48:53 = 115s (interpreter start, module imports, initial model weights)
# Visible phase: 10:48:53 → 10:49:07 = 14s (injection classifier warnings + corpus loads + reranker warmup)
# Cold-start total: 129s
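
The phase-breakdown comments at the end of each log can be recomputed from the timestamps in the log itself. The sketch below is a hedged illustration, assuming the log format shown in the three captures: `phase_breakdown` is a hypothetical helper that treats the first timestamped log line as the end of the silent phase and the `warmup_complete` line as the end of the cold start. The sample text is an abbreviated copy of the N=3 log.

```python
import re
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"
STARTUP_RE = re.compile(r"Application Startup at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[")


def phase_breakdown(log_text: str) -> dict:
    """Return silent, visible, and total cold-start seconds for one log."""
    start = first_line = complete = None
    for line in log_text.splitlines():
        line = line.strip()
        startup = STARTUP_RE.search(line)
        if startup:
            start = datetime.strptime(startup.group(1), TS_FORMAT)
            continue
        stamped = LINE_RE.match(line)
        if not stamped:
            continue  # comment lines and uvicorn "INFO:" lines carry no timestamp
        ts = datetime.strptime(stamped.group(1), TS_FORMAT)
        if first_line is None:
            first_line = ts  # first visible log line ends the silent phase
        if "warmup_complete" in line:
            complete = ts
    if start is None or first_line is None or complete is None:
        raise ValueError("log is missing a startup, first-line, or warmup_complete timestamp")
    return {
        "silent_s": int((first_line - start).total_seconds()),
        "visible_s": int((complete - first_line).total_seconds()),
        "total_s": int((complete - start).total_seconds()),
    }


# Abbreviated copy of the N=3 log (only the lines the parser needs).
sample = """
===== Application Startup at 2026-04-15 10:46:58 =====
2026-04-15 10:48:53 [warning ] injection_classifier_no_url
2026-04-15 10:49:05 [info ] warmup_start
2026-04-15 10:49:07 [info ] warmup_complete
"""
print(phase_breakdown(sample))
# → {'silent_s': 115, 'visible_s': 14, 'total_s': 129}
```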
@@ -0,0 +1,14 @@
# measurements/

Raw measurement artifacts referenced from DECISIONS.md entries.

Each file is the raw observation (log snippet, trace, or metric dump)
that backs a specific quantitative claim in DECISIONS.md. Keeping the
raw data here lets a future reader cross-check any DECISIONS.md number
against its underlying evidence without having to re-run the
measurement or trust the narrative summary.

Naming: `YYYY-MM-DD-<topic>-<variant>.log`

Current entries:
- `2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log`: HF Spaces cold-start samples N=1..3. Backs the DECISIONS.md entry "Cold-start gate fired: assumption falsified, fix deferred to v1.1 at the right cause."