docs: add continue-testing guide and hf CLI deployment helper updates
# Continue Testing Guide

This file documents the test approach and harness used to improve and validate the `hf_hub_community` prompt.

## Objectives

We optimized for:

1. **Quality/Correctness** on realistic user tasks
2. **Functional Coverage** of the API surface
3. **Efficiency** (token and tool-call cost)
4. **Safe behavior** on destructive/unsupported actions

## Prompt versions

- **v1** = original long prompt (high-quality baseline)
  - Reference card: `.fast-agent/evals/hf_hub_only/hf_hub_community.md`
- **v2** = compact prompt (major efficiency gains, minor regressions)
  - Reference card: `.fast-agent/evals/hf_hub_prompt_compact/cards/hf_hub_community.md`
- **v3** = compact prompt + targeted anti-regression rules
  - Reference card: `.fast-agent/evals/hf_hub_prompt_v3/cards/hf_hub_community.md`
- **Current production card**: `.fast-agent/tool-cards/hf_hub_community.md`

Variant registry:
- `scripts/hf_hub_prompt_variants.json`

## Harness components

### 1) Quality pack (challenge prompts)
- Prompts: `scripts/hf_hub_community_challenges.txt`
- Runner/scorer: `scripts/score_hf_hub_community_challenges.py`
  - Scores endpoint/efficiency/reasoning/safety/clarity (/10 per case)
  - Also records usage metrics:
    - tool-call count
    - input/output/total tokens

### 2) Coverage pack (non-overlapping API coverage)
- Cases: `scripts/hf_hub_community_coverage_prompts.json`
- Runner/scorer: `scripts/score_hf_hub_community_coverage.py`
- Targets endpoint/method correctness for capabilities not fully stressed by the challenge pack

### 3) Prompt A/B runner
- Script: `scripts/eval_hf_hub_prompt_ab.py`
- Runs **both packs** per variant/model
- Produces combined summary + plots:
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.{md,json,csv}`
  - plots under `docs/hf_hub_prompt_ab/`

### 4) Single-variant runner (for follow-up iterations)
- Script: `scripts/run_hf_hub_prompt_variant.py`
- Useful when testing only one new prompt version (e.g. v4)

## Decision rule used

We evaluate with a balanced view:

1. Challenge quality score (primary)
2. Coverage endpoint/method match rates
3. Total tool calls and tokens (efficiency tie-breakers)

Composite used in the harness summary:
- `0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method`
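
As a concrete sketch, the composite can be computed from one run's metrics like this. The metric values below are made-up example inputs, not taken from any real run:

```bash
# Illustrative composite computation using the harness weights above;
# the three metric values are example inputs only.
challenge_quality=8.4   # challenge quality score (/10)
coverage_endpoint=0.92  # coverage endpoint match rate
coverage_method=0.88    # coverage method match rate

composite=$(awk -v q="$challenge_quality" -v e="$coverage_endpoint" -v m="$coverage_method" \
  'BEGIN { printf "%.3f", 0.6 * q + 0.3 * e + 0.1 * m }')
echo "composite=$composite"  # prints composite=5.404
```

Note that challenge quality is on a /10 scale while the coverage rates are 0–1 here; whether the real harness normalizes these to a common scale is not specified in this guide.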

## Why v3 was promoted

Observed trend:
- v1: best raw quality, very high token/tool cost
- v2: huge efficiency gains, small functional regressions
- v3: recovered those regressions while retaining v2-like efficiency

Result: v3 is the best quality/efficiency tradeoff and is now in production.

## Re-running tests

### Full A/B across variants

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants v1=.fast-agent/evals/hf_hub_only,v2=.fast-agent/evals/hf_hub_prompt_compact/cards,v3=.fast-agent/tool-cards \
  --models gpt-oss \
  --timeout 240
```

### Run just the current production prompt (v3)

```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/tool-cards \
  --model gpt-oss
```

## Next recommended loop (for v4+)

1. Duplicate the v3 card into `.fast-agent/evals/hf_hub_prompt_v4/cards/`
2. Make one focused prompt change
3. Run `run_hf_hub_prompt_variant.py` for v4
4. If promising, run `eval_hf_hub_prompt_ab.py` with v3 vs v4
5. Promote only if quality is maintained/improved with acceptable cost
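
The steps above can be sketched as shell commands. The v4 card path follows the directory layout used for earlier variants and is an assumption, not an existing path:

```bash
# Hypothetical v4 iteration; paths mirror the v3 layout described earlier.
mkdir -p .fast-agent/evals/hf_hub_prompt_v4/cards
cp .fast-agent/tool-cards/hf_hub_community.md .fast-agent/evals/hf_hub_prompt_v4/cards/

# ...make one focused edit to the v4 card, then run it in isolation:
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v4 \
  --cards-dir .fast-agent/evals/hf_hub_prompt_v4/cards \
  --model gpt-oss

# If promising, run the A/B harness head-to-head against production v3:
python scripts/eval_hf_hub_prompt_ab.py \
  --variants v3=.fast-agent/tool-cards,v4=.fast-agent/evals/hf_hub_prompt_v4/cards \
  --models gpt-oss \
  --timeout 240
```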

## Deployment workflow

Space target:
- `spaces/evalstate/hf-papers`
- `https://huggingface.co/spaces/evalstate/hf-papers/`

Use the `hf` CLI deployment helper:

```bash
scripts/publish_space.sh
```

This script uploads changed files (excluding noisy local artifacts) to the Space via `hf upload`.
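
For reference, the core of such a helper might look like the following. This is a sketch, not the actual contents of `scripts/publish_space.sh`: the exclude patterns and commit message are illustrative assumptions about what counts as a "noisy local artifact".

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a publish helper built on `hf upload`.
# The exclude patterns below are examples, not the script's real list.
set -euo pipefail

hf upload evalstate/hf-papers . . \
  --repo-type space \
  --exclude "*.log" \
  --exclude ".venv/*" \
  --exclude "__pycache__/*" \
  --commit-message "Deploy updated files"
```

`hf upload` only transfers files whose content differs from what is already in the Space repo, which is what makes a simple "upload everything except artifacts" helper practical.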