evalstate (HF Staff) committed 30dc122 (verified) · 1 parent: bba4fab

docs: add continue-testing guide and hf CLI deployment helper updates

Files changed (1): continue-testing.md ADDED (+113 −0)

# Continue Testing Guide

This file documents the test approach and harness used to improve and validate the `hf_hub_community` prompt.

## Objectives

We optimized for:

1. **Quality/Correctness** on realistic user tasks
2. **Functional Coverage** of the API surface
3. **Efficiency** (token and tool-call cost)
4. **Safe behavior** on destructive/unsupported actions

## Prompt versions

- **v1** = original long prompt (high-quality baseline)
  - Reference card: `.fast-agent/evals/hf_hub_only/hf_hub_community.md`
- **v2** = compact prompt (major efficiency gains, minor regressions)
  - Reference card: `.fast-agent/evals/hf_hub_prompt_compact/cards/hf_hub_community.md`
- **v3** = compact + targeted anti-regression rules
  - Reference card: `.fast-agent/evals/hf_hub_prompt_v3/cards/hf_hub_community.md`
- **Current production card**: `.fast-agent/tool-cards/hf_hub_community.md`

Variant registry:
- `scripts/hf_hub_prompt_variants.json`

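The registry's schema is not reproduced in this guide. As a purely illustrative sketch, an entry might map a variant id to its cards directory (field names here are hypothetical, not the actual schema; the paths match the A/B command later in this guide):

```json
{
  "variants": [
    {"id": "v1", "cards_dir": ".fast-agent/evals/hf_hub_only"},
    {"id": "v2", "cards_dir": ".fast-agent/evals/hf_hub_prompt_compact/cards"},
    {"id": "v3", "cards_dir": ".fast-agent/tool-cards"}
  ]
}
```
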
## Harness components

### 1) Quality pack (challenge prompts)
- Prompts: `scripts/hf_hub_community_challenges.txt`
- Runner/scorer: `scripts/score_hf_hub_community_challenges.py`
  - Scores endpoint/efficiency/reasoning/safety/clarity (/10 per case)
  - Also records usage metrics:
    - tool-call count
    - input/output/total tokens

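The scorer's exact output format is not shown here; as a hedged sketch, per-case results like the above (five /10 rubric scores plus usage metrics) could be aggregated along these lines. The record shape and field names are illustrative assumptions, not the real scorer's schema:

```python
from statistics import mean

# Illustrative per-case records: five rubric scores (/10) plus usage metrics.
# Field names are hypothetical; the real scorer's schema may differ.
cases = [
    {"endpoint": 9, "efficiency": 8, "reasoning": 9, "safety": 10, "clarity": 8,
     "tool_calls": 3, "total_tokens": 4200},
    {"endpoint": 7, "efficiency": 9, "reasoning": 8, "safety": 10, "clarity": 9,
     "tool_calls": 2, "total_tokens": 3100},
]

RUBRIC = ("endpoint", "efficiency", "reasoning", "safety", "clarity")

def summarize(cases):
    """Average each rubric dimension and total the usage metrics."""
    return {
        "quality": {dim: mean(c[dim] for c in cases) for dim in RUBRIC},
        "tool_calls": sum(c["tool_calls"] for c in cases),
        "total_tokens": sum(c["total_tokens"] for c in cases),
    }

summary = summarize(cases)
```

Averaging quality while summing usage reflects the decision rule below: quality is compared per case, while token/tool cost matters in aggregate.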
### 2) Coverage pack (non-overlapping API coverage)
- Cases: `scripts/hf_hub_community_coverage_prompts.json`
- Runner/scorer: `scripts/score_hf_hub_community_coverage.py`
- Targets endpoint/method correctness for capabilities not fully stressed by the challenge pack

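Endpoint and method match rates are simple proportions over the case set. A minimal sketch of how they might be computed (the record shape and field names are assumptions for illustration, not taken from the actual scorer):

```python
def match_rates(cases):
    """Return (endpoint_rate, method_rate) over a list of coverage cases.

    Each case records the expected vs. actual API call; this record shape
    is an illustrative assumption, not the scorer's real schema.
    """
    n = len(cases)
    endpoint = sum(c["expected_endpoint"] == c["actual_endpoint"] for c in cases) / n
    method = sum(c["expected_method"] == c["actual_method"] for c in cases) / n
    return endpoint, method

cases = [
    {"expected_endpoint": "/api/models", "actual_endpoint": "/api/models",
     "expected_method": "GET", "actual_method": "GET"},
    # Right endpoint, wrong HTTP method: counts for endpoint but not method.
    {"expected_endpoint": "/api/models/{repo_id}/discussions",
     "actual_endpoint": "/api/models/{repo_id}/discussions",
     "expected_method": "POST", "actual_method": "GET"},
]
```

Scoring endpoint and method separately, as here, lets the harness distinguish "called the wrong API" from "called the right API the wrong way".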
### 3) Prompt A/B runner
- Script: `scripts/eval_hf_hub_prompt_ab.py`
- Runs **both packs** per variant/model
- Produces combined summary + plots:
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.{md,json,csv}`
  - plots under `docs/hf_hub_prompt_ab/`

### 4) Single-variant runner (for follow-up iterations)
- Script: `scripts/run_hf_hub_prompt_variant.py`
- Useful when testing only one new prompt version (e.g. v4)

## Decision rule used

We evaluate with a balanced view:

1. Challenge quality score (primary)
2. Coverage endpoint/method match rates
3. Total tool calls and tokens (efficiency tie-breakers)

Composite used in the harness summary:
- `0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method`

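The composite above is a plain weighted sum; as a sketch (assuming all three inputs are already on a common 0-1 scale; whether the harness normalizes the /10 challenge score first is not stated here):

```python
def composite(challenge_quality, coverage_endpoint, coverage_method):
    """Weighted composite from the harness summary (weights 0.6 / 0.3 / 0.1).

    Assumes all three inputs share a common 0-1 scale; how the harness
    normalizes the /10 challenge score is not specified in this guide.
    """
    return 0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method
```

With these weights, challenge quality dominates, endpoint coverage acts as a secondary signal, and method coverage breaks near-ties, which matches the priority order of the decision rule above.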
## Why v3 was promoted

Observed trend:
- v1: best raw quality, very high token/tool cost
- v2: huge efficiency gains, small functional regressions
- v3: recovered those regressions while retaining v2-like efficiency

Result: v3 is the best quality/efficiency tradeoff and is now in production.

## Re-running tests

### Full A/B across variants

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants v1=.fast-agent/evals/hf_hub_only,v2=.fast-agent/evals/hf_hub_prompt_compact/cards,v3=.fast-agent/tool-cards \
  --models gpt-oss \
  --timeout 240
```

### Run just current production (v3)

```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/tool-cards \
  --model gpt-oss
```

## Next recommended loop (for v4+)

1. Duplicate the v3 card into `.fast-agent/evals/hf_hub_prompt_v4/cards/`
2. Make one focused prompt change
3. Run `run_hf_hub_prompt_variant.py` for v4
4. If promising, run `eval_hf_hub_prompt_ab.py` with v3 vs v4
5. Promote only if quality is maintained/improved with acceptable cost

## Deployment workflow

Space target:
- `spaces/evalstate/hf-papers`
- `https://huggingface.co/spaces/evalstate/hf-papers/`

Use the `hf` CLI deployment helper:

```bash
scripts/publish_space.sh
```

This script uploads changed files (excluding noisy local artifacts) to the Space via `hf upload`.