docs: add continue-testing guide and hf CLI deployment helper updates
# Continue Testing Guide

This file documents the test approach and harness used to improve and validate the `hf_hub_community` prompt.

## Objectives

We optimized for:

1. **Quality/Correctness** on realistic user tasks
2. **Functional Coverage** of the API surface
3. **Efficiency** (token and tool-call cost)
4. **Safe behavior** on destructive/unsupported actions

## Prompt versions

- **v1** = original long prompt (high-quality baseline)
  - Reference card: `.fast-agent/evals/hf_hub_only/hf_hub_community.md`
- **v2** = compact prompt (major efficiency gains, minor regressions)
  - Reference card: `.fast-agent/evals/hf_hub_prompt_compact/cards/hf_hub_community.md`
- **v3** = compact prompt + targeted anti-regression rules
  - Reference card: `.fast-agent/evals/hf_hub_prompt_v3/cards/hf_hub_community.md`
- **Current production card**: `.fast-agent/tool-cards/hf_hub_community.md`

Variant registry:
- `scripts/hf_hub_prompt_variants.json`

## Harness components

### 1) Quality pack (challenge prompts)
- Prompts: `scripts/hf_hub_community_challenges.txt`
- Runner/scorer: `scripts/score_hf_hub_community_challenges.py`
  - Scores endpoint/efficiency/reasoning/safety/clarity (/10 per case)
  - Also records usage metrics:
    - tool-call count
    - input/output/total tokens

### 2) Coverage pack (non-overlapping API coverage)
- Cases: `scripts/hf_hub_community_coverage_prompts.json`
- Runner/scorer: `scripts/score_hf_hub_community_coverage.py`
- Targets endpoint/method correctness for capabilities not fully stressed by the challenge pack

### 3) Prompt A/B runner
- Script: `scripts/eval_hf_hub_prompt_ab.py`
- Runs **both packs** per variant/model
- Produces combined summary + plots:
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.{md,json,csv}`
  - plots under `docs/hf_hub_prompt_ab/`

### 4) Single-variant runner (for follow-up iterations)
- Script: `scripts/run_hf_hub_prompt_variant.py`
- Useful when testing only one new prompt version (e.g. v4)

## Decision rule used

We evaluate with a balanced view:

1. Challenge quality score (primary)
2. Coverage endpoint/method match rates
3. Total tool calls and tokens (efficiency tie-breakers)

Composite used in the harness summary:
- `0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method`
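
As a concrete sketch, the composite can be computed from one run's metrics like this. The metric values below are made-up example inputs, not taken from any real run:

```bash
# Illustrative composite computation using the harness weights above;
# the three metric values are example inputs only.
challenge_quality=8.4   # challenge quality score (/10)
coverage_endpoint=0.92  # coverage endpoint match rate
coverage_method=0.88    # coverage method match rate

composite=$(awk -v q="$challenge_quality" -v e="$coverage_endpoint" -v m="$coverage_method" \
  'BEGIN { printf "%.3f", 0.6 * q + 0.3 * e + 0.1 * m }')
echo "composite=$composite"  # prints composite=5.404
```

Note that challenge quality is on a /10 scale while the coverage rates are 0–1 here; whether the real harness normalizes these to a common scale is not specified in this guide.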

## Why v3 was promoted

Observed trend:
- v1: best raw quality, very high token/tool cost
- v2: huge efficiency gains, small functional regressions
- v3: recovered those regressions while retaining v2-like efficiency

Result: v3 is the best quality/efficiency tradeoff and is now in production.

## Re-running tests

### Full A/B across variants

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants v1=.fast-agent/evals/hf_hub_only,v2=.fast-agent/evals/hf_hub_prompt_compact/cards,v3=.fast-agent/tool-cards \
  --models gpt-oss \
  --timeout 240
```

### Run just the current production prompt (v3)

```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/tool-cards \
  --model gpt-oss
```

## Next recommended loop (for v4+)

1. Duplicate the v3 card into `.fast-agent/evals/hf_hub_prompt_v4/cards/`
2. Make one focused prompt change
3. Run `run_hf_hub_prompt_variant.py` for v4
4. If promising, run `eval_hf_hub_prompt_ab.py` with v3 vs v4
5. Promote only if quality is maintained/improved with acceptable cost
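
The steps above can be sketched as shell commands. The v4 card path follows the directory layout used for earlier variants and is an assumption, not an existing path:

```bash
# Hypothetical v4 iteration; paths mirror the v3 layout described earlier.
mkdir -p .fast-agent/evals/hf_hub_prompt_v4/cards
cp .fast-agent/tool-cards/hf_hub_community.md .fast-agent/evals/hf_hub_prompt_v4/cards/

# ...make one focused edit to the v4 card, then run it in isolation:
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v4 \
  --cards-dir .fast-agent/evals/hf_hub_prompt_v4/cards \
  --model gpt-oss

# If promising, run the A/B harness head-to-head against production v3:
python scripts/eval_hf_hub_prompt_ab.py \
  --variants v3=.fast-agent/tool-cards,v4=.fast-agent/evals/hf_hub_prompt_v4/cards \
  --models gpt-oss \
  --timeout 240
```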

## Deployment workflow

Space target:
- `spaces/evalstate/hf-papers`
- `https://huggingface.co/spaces/evalstate/hf-papers/`

Use the `hf` CLI deployment helper:

```bash
scripts/publish_space.sh
```

This script uploads changed files (excluding noisy local artifacts) to the Space via `hf upload`.
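
For reference, the core of such a helper might look like the following. This is a sketch, not the actual contents of `scripts/publish_space.sh`: the exclude patterns and commit message are illustrative assumptions about what counts as a "noisy local artifact".

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a publish helper built on `hf upload`.
# The exclude patterns below are examples, not the script's real list.
set -euo pipefail

hf upload evalstate/hf-papers . . \
  --repo-type space \
  --exclude "*.log" \
  --exclude ".venv/*" \
  --exclude "__pycache__/*" \
  --commit-message "Deploy updated files"
```

`hf upload` only transfers files whose content differs from what is already in the Space repo, which is what makes a simple "upload everything except artifacts" helper practical.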