# Evaluations API

This document outlines the API endpoints for managing evaluations in PySpur.

## List Available Evaluations

**Description**: Lists all available evaluations by scanning the tasks directory for YAML files. Returns metadata about each evaluation including name, description, type, and number of samples.

**URL**: `/evals/`

**Method**: GET

**Response Schema**:

```python
List[Dict[str, Any]]
```

Each dictionary in the list contains:

```python
{
    "name": str,          # Name of the evaluation
    "description": str,   # Description of the evaluation
    "type": str,          # Type of evaluation
    "num_samples": str,   # Number of samples in the evaluation
    "paper_link": str,    # Link to the paper describing the evaluation
    "file_name": str      # Name of the YAML file
}
```
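A minimal sketch of calling this endpoint with the `requests` library. The base URL (`http://localhost:8000`) is an assumption about a local PySpur deployment; adjust it to your own instance:

```python
import requests

# Assumed base URL for a local PySpur instance; adjust to your deployment.
BASE_URL = "http://localhost:8000"

# Fetch metadata for all available evaluations.
response = requests.get(f"{BASE_URL}/evals/")
response.raise_for_status()

for evaluation in response.json():
    print(f"{evaluation['name']}: {evaluation['description']} "
          f"({evaluation['num_samples']} samples)")
```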
## Launch Evaluation

**Description**: Launches an evaluation job by triggering the evaluator with the specified evaluation configuration. The evaluation is run asynchronously in the background.

**URL**: `/evals/launch/`

**Method**: POST

**Request Payload**:

```python
class EvalRunRequest:
    eval_name: str          # Name of the evaluation to run
    workflow_id: str        # ID of the workflow to evaluate
    output_variable: str    # Output variable to evaluate
    num_samples: int = 100  # Number of random samples to evaluate
```

**Response Schema**:

```python
class EvalRunResponse:
    run_id: str                        # ID of the evaluation run
    eval_name: str                     # Name of the evaluation
    workflow_id: str                   # ID of the workflow being evaluated
    status: EvalRunStatusEnum          # Status of the evaluation run
    start_time: datetime               # When the evaluation started
    end_time: Optional[datetime]       # When the evaluation ended (if completed)
    results: Optional[Dict[str, Any]]  # Results of the evaluation (if completed)
```
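A minimal sketch of launching an evaluation, again using `requests` and the same assumed local base URL. The `eval_name`, `workflow_id`, and `output_variable` values are hypothetical placeholders; use names that exist in your project:

```python
import requests

BASE_URL = "http://localhost:8000"  # Assumed local PySpur instance.

# Hypothetical payload values; replace with a real evaluation name,
# workflow ID, and output variable from your project.
payload = {
    "eval_name": "gsm8k",
    "workflow_id": "S1234",
    "output_variable": "answer",
    "num_samples": 50,
}

response = requests.post(f"{BASE_URL}/evals/launch/", json=payload)
response.raise_for_status()

run = response.json()
print(f"Launched run {run['run_id']} with status {run['status']}")
```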
## Get Evaluation Run Status

**Description**: Gets the status of a specific evaluation run, including results if the evaluation has completed.

**URL**: `/evals/runs/{eval_run_id}`

**Method**: GET

**Parameters**:

```python
eval_run_id: str  # ID of the evaluation run
```

**Response Schema**:

```python
class EvalRunResponse:
    run_id: str                        # ID of the evaluation run
    eval_name: str                     # Name of the evaluation
    workflow_id: str                   # ID of the workflow being evaluated
    status: EvalRunStatusEnum          # Status of the evaluation run
    start_time: datetime               # When the evaluation started
    end_time: Optional[datetime]       # When the evaluation ended (if completed)
    results: Optional[Dict[str, Any]]  # Results of the evaluation (if completed)
```
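A minimal polling sketch, assuming a run ID returned by the launch endpoint and the same local base URL. The terminal status strings checked below are assumptions about `EvalRunStatusEnum`, not confirmed values:

```python
import time
import requests

BASE_URL = "http://localhost:8000"  # Assumed local PySpur instance.
eval_run_id = "R1234"               # Hypothetical run ID from the launch response.

# Poll until the run reaches a terminal state. The exact status strings
# depend on EvalRunStatusEnum; "COMPLETED" and "FAILED" are assumptions.
while True:
    response = requests.get(f"{BASE_URL}/evals/runs/{eval_run_id}")
    response.raise_for_status()
    run = response.json()

    if run["status"] in ("COMPLETED", "FAILED"):
        print(f"Run {run['run_id']} finished with status {run['status']}")
        print(run.get("results"))
        break

    time.sleep(10)  # Wait before polling again.
```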
## List Evaluation Runs

**Description**: Lists all evaluation runs, ordered by start time descending.

**URL**: `/evals/runs/`

**Method**: GET

**Response Schema**:

```python
List[EvalRunResponse]
```

Where `EvalRunResponse` contains:

```python
class EvalRunResponse:
    run_id: str                        # ID of the evaluation run
    eval_name: str                     # Name of the evaluation
    workflow_id: str                   # ID of the workflow being evaluated
    status: EvalRunStatusEnum          # Status of the evaluation run
    start_time: datetime               # When the evaluation started
    end_time: Optional[datetime]       # When the evaluation ended (if completed)
    results: Optional[Dict[str, Any]]  # Results of the evaluation (if completed)
```
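A minimal sketch listing all runs, with the same assumed base URL as above:

```python
import requests

BASE_URL = "http://localhost:8000"  # Assumed local PySpur instance.

response = requests.get(f"{BASE_URL}/evals/runs/")
response.raise_for_status()

# Runs are returned newest first (ordered by start time descending).
for run in response.json():
    print(f"{run['run_id']}  {run['eval_name']}  {run['status']}  {run['start_time']}")
```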