# How to run auto evaluation
## Quick start
We have prepared some example queries for auto evaluation. You can run them by following the steps below.
- Complete the `evaluator_config.json` (referring to the schema in `evaluator_config_template.json`) under the `auto_eval` folder and the `taskweaver_config.json` under the `taskweaver` folder.
- `cd` to the `auto_eval` folder.
- Run the command below to start the auto evaluation for a single case:

```bash
python taskweaver_eval.py -m single -f cases/init_say_hello.yaml
```

- Run the command below to start the auto evaluation for multiple cases:

```bash
python taskweaver_eval.py -m batch -f ./cases
```
## Parameters
- `-m/--mode`: specifies the evaluation mode, which can be either `single` or `batch`.
- `-f/--file`: specifies the path to the test case file or the directory containing test case files.
- `-r/--result`: specifies the path to the result file for batch evaluation mode. This parameter is only valid in `batch` mode. The default value is `sample_case_results.csv`.
- `-t/--threshold`: specifies the interrupt threshold for multi-round chat evaluation. When the evaluation score of a round falls below this threshold, the evaluation is interrupted. The default value is `None`, meaning no interrupt threshold is used.
- `-flush/--flush`: specifies whether to flush the result file. This parameter is only valid in `batch` mode. The default value is `False`, meaning previously evaluated cases are not loaded and evaluated again. To re-evaluate them, set this parameter to `True`.
## How to create a test case

A test case is a YAML file that contains the following fields:
- `config_var` (optional): sets the config values for TaskWeaver if needed.
- `app_dir`: the path to the project directory for TaskWeaver.
- `eval_query`: a list that supports multiple queries.
  - `user_query`: the user query to be evaluated.
  - `scoring_points`:
    - `score_point`: describes a criterion for the agent's response.
    - `weight`: a value that determines how important the criterion is.
    - `eval_code` (optional): evaluation code that is run to determine whether the criterion is met. When provided, this scoring point is not evaluated using the LLM.
    - ...
  - `user_query`: the user query to be evaluated.
  - `scoring_points`:
    - ...
- `post_index`: the index of the post in the `post_list` of the response `round` that should be evaluated. If it is set to `null`, the entire `round` is evaluated.
Note: in the `eval_code` field, you can use the variable `agent_response` in your evaluation code snippet. It can be a `Round` or `Post` JSON object, determined by the `post_index` field.
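Putting these fields together, a minimal test case might look like the following sketch. The file name, config key, paths, queries, and scoring points are all hypothetical and only illustrate the schema described above:

```yaml
# cases/example_case.yaml  (hypothetical file name)
config_var:            # optional TaskWeaver config overrides
  llm.model: gpt-4     # hypothetical config key and value
app_dir: ../project    # hypothetical path to the TaskWeaver project directory
eval_query:
  - user_query: hello, what can you do?
    scoring_points:
      - score_point: The response introduces the agent's capabilities
        weight: 1
  - user_query: generate 10 random numbers
    scoring_points:
      - score_point: The response contains 10 numbers
        weight: 2
        # eval_code replaces the LLM judge for this scoring point; the
        # `agent_response` variable holds the Round (or Post) JSON object,
        # assumed here to contain a post_list.
        eval_code: |-
          assert len(agent_response["post_list"]) > 0
post_index: null       # evaluate the entire round
```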