Below is a practical, lightweight recipe you can adapt to measure **exact-match accuracy** (the metric GAIA uses) on your new evaluation file.

---
### 1 Define a thin wrapper around your agent

```python
# agent_wrapper.py
from typing import Dict


class MyAgent:
    """
    Replace the `answer` method with however you call your own agent
    (API call, local model .predict(), etc.).
    """

    def answer(self, record: Dict) -> str:
        prompt = record["question"]
        # ► ► your code here ◄ ◄
        response = ...  # the raw answer string
        return response.strip()
```
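For example, a minimal `answer` implementation against an OpenAI-compatible chat endpoint could look like the sketch below. The `openai` client calls are real, but the model name and the bare-answer system prompt are assumptions you should adapt to your own stack:

```python
# agent_wrapper.py -- example implementation, adjust to your own stack
from typing import Dict

from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY set


class MyAgent:
    def __init__(self, model: str = "gpt-4o-mini"):  # hypothetical model choice
        self.client = OpenAI()
        self.model = model

    def answer(self, record: Dict) -> str:
        # Ask for the bare answer only, so exact-match scoring has a chance.
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "Reply with the final answer only."},
                {"role": "user", "content": record["question"]},
            ],
        )
        return resp.choices[0].message.content.strip()
```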
---

### 2 Normalization helpers (GAIA style)

```python
# normalize.py
import re


def normalize(ans: str) -> str:
    """
    GAIA scoring ≈ quasi-exact match after:
      • trim / collapse whitespace
      • lowercase (safe for numbers, too)
    Extend if you need custom rules (e.g. strip trailing $ or %).
    """
    ans = ans.strip().lower()
    ans = re.sub(r"\s+", " ", ans)  # collapse inner whitespace
    return ans
```
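If your answer set contains currency or percent values, an extended variant might look like this (a sketch only; the exact stripping rules depend on your data, and `normalize_extended` is a name invented here):

```python
import re


def normalize_extended(ans: str) -> str:
    """Like normalize(), but also strips $ / % and thousands separators."""
    ans = ans.strip().lower()
    ans = re.sub(r"\s+", " ", ans)                 # collapse inner whitespace
    ans = ans.replace("$", "").replace("%", "")    # drop units
    ans = re.sub(r"(?<=\d),(?=\d{3}\b)", "", ans)  # 1,234 -> 1234
    return ans.strip()
```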
---

### 3 Evaluation script

```python
# evaluate_agent.py
import argparse
import json
import pathlib
import statistics
import time
from typing import Dict, List

from agent_wrapper import MyAgent
from normalize import normalize


def load_records(path: pathlib.Path) -> List[Dict]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)  # your new file is a JSON array


def main(path_eval: str, limit: int | None = None):
    eval_path = pathlib.Path(path_eval)
    records = load_records(eval_path)
    if limit:
        records = records[:limit]

    agent = MyAgent()
    n_total = len(records)
    n_correct = 0
    latencies = []

    for rec in records:
        t0 = time.perf_counter()
        pred = agent.answer(rec)
        latencies.append(time.perf_counter() - t0)

        gold = rec.get("Final answer", "")  # the key GAIA metadata uses
        if normalize(pred) == normalize(gold):
            n_correct += 1

    acc = n_correct / n_total * 100
    print(f"Accuracy: {n_correct}/{n_total} ({acc:.2f}%)")
    print(f"Median latency: {statistics.median(latencies):.2f}s")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("eval_json", help="common_questions.json (or other)")
    parser.add_argument("--limit", type=int, help="debug with first N records")
    args = parser.parse_args()
    main(args.eval_json, args.limit)
```
*Run*:

```bash
python3 evaluate_agent.py question_set/common_questions.json
```
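Given the `print` calls above, the output will look roughly like this (the numbers here are illustrative, not real results):

```text
Accuracy: 42/50 (84.00%)
Median latency: 1.37s
```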
---

### 4 Customizing

| Need | Where to tweak |
| --- | --- |
| **Agent call** (local model vs. API with keys, tool use, etc.) | `MyAgent.answer()` |
| **More elaborate normalization** (e.g. strip `$` or `%`, round numbers) | `normalize()` |
| **Partial credit / numeric tolerance** | Replace the `==` comparison with your own logic (see the sketch below) |
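For numeric tolerance, one possible sketch (the `is_correct` helper is invented here; it applies a relative tolerance when both sides parse as floats and falls back to string comparison otherwise):

```python
import math

from normalize import normalize


def is_correct(pred: str, gold: str, rel_tol: float = 1e-4) -> bool:
    """Exact match, with a relative tolerance when both sides are numeric."""
    p, g = normalize(pred), normalize(gold)
    try:
        return math.isclose(float(p), float(g), rel_tol=rel_tol)
    except ValueError:
        return p == g
```

In `evaluate_agent.py`, the comparison then becomes `if is_correct(pred, gold):`.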
---
### 5 Interpreting results

* **Exact-match accuracy**: 100 % means your agent reproduced every gold answer verbatim (after normalization).
* **Latency** helps you spot run-time outliers (e.g. long tool chains).