Commit 7b7fd60 (parent: f46e99c): adding plots to readme

README.md (changed)
@@ -53,7 +53,7 @@ We design a supervisor–coder agent to carry out each task, as illustrated in F
The supervisor and coder roles are defined by their differing access to state, memory, and system instructions. In the reference configuration, both roles are implemented using the `gemini-pro-2.5` model [Google, 2025](#ref-google-2025); however, the same architecture is applied to a range of contemporary LLMs to evaluate model-dependent performance and variability.
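The role separation described above can be sketched as configuration data. The field names, prompts, and access flags below are illustrative assumptions, not the project's actual configuration:

```python
from dataclasses import dataclass


@dataclass
class AgentRole:
    """Illustrative role config; fields mirror the differing access described above."""
    name: str
    model: str               # e.g. "gemini-pro-2.5" in the reference configuration
    system_prompt: str       # role-specific system instructions (hypothetical text)
    sees_full_state: bool    # whether the role can read the full execution state
    has_long_term_memory: bool  # whether the role keeps memory across steps


# Hypothetical instantiation: both roles share the model but differ in access.
supervisor = AgentRole(
    name="supervisor",
    model="gemini-pro-2.5",
    system_prompt="Plan the task and review the coder's output.",
    sees_full_state=True,
    has_long_term_memory=True,
)
coder = AgentRole(
    name="coder",
    model="gemini-pro-2.5",
    system_prompt="Write code for the subtask you are given.",
    sees_full_state=False,
    has_long_term_memory=False,
)
```

Swapping the `model` field per role is all the reference architecture would need to evaluate other LLMs under the same role split.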
Each agent interaction is executed through API calls to the LLM. Although we set the temperature to 0, other sampling parameters (e.g., top-p, top-k) remained at their default values, and some thinking-oriented models internally raise the effective temperature. As a result, the outputs exhibit minor stochastic variation even under identical inputs. Each call includes a user instruction, a system prompt, and auxiliary metadata for tracking errors and execution records.
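A single agent call can be sketched as a request payload. The schema below follows a generic chat-completions shape and is an assumption for illustration, not the project's actual client code; the metadata keys are likewise hypothetical:

```python
def build_agent_request(role, user_instruction, system_prompt, trial_id, step):
    """Assemble one LLM API request dict; field names are illustrative."""
    return {
        "model": "gemini-pro-2.5",
        "temperature": 0,  # fixed, as in the text; top-p/top-k stay at defaults
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_instruction},
        ],
        # Auxiliary metadata for tracking errors and execution records.
        "metadata": {"agent_role": role, "trial_id": trial_id, "step": step},
    }


req = build_agent_request(
    role="coder",
    user_instruction="Convert the ntuple to a flat table.",
    system_prompt="You are the coder agent.",
    trial_id=7,
    step="ntuple_conversion",
)
```

Even with `temperature` pinned to 0 here, the residual stochasticity noted above would come from the provider side (default top-p/top-k, internal sampling of thinking models), not from this request structure.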
@@ -96,7 +96,14 @@ Over 98% of tokens originated from the model’s autonomous reasoning and self-c
Following the initial benchmark with `gemini-pro-2.5`, we expanded the study to additional models: `openai-gpt-5` [OpenAI, 2025](#ref-openai-2025), `claude-3.5` [Anthropic, 2024](#ref-anthropic-2024), `qwen-3` [Alibaba, 2025](#ref-alibaba-2025), and the open-weight `gpt-oss-120b` [OpenAI et al., 2025](#ref-openai-et-al-2025), all evaluated under the same agentic workflow. Based on early observations, the prompts for the **data preparation** stage were refined and split into three subtasks: **ROOT file inspection**, **ntuple conversion**, and **preprocessing** (signal and background region selection). Input file locations were also made explicit to the agents, ensuring deterministic resolution of data paths and reducing reliance on implicit context.
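The three-subtask split with explicit input locations can be sketched as a pipeline skeleton. The file paths, function names, and stub bodies below are invented for illustration; the actual ROOT I/O (e.g., via a library such as `uproot`) is deliberately stubbed out:

```python
# Hypothetical, explicitly listed input locations (assumed paths).
INPUT_FILES = {
    "signal": "/data/signal.root",
    "background": "/data/background.root",
}


def inspect_root_file(path):
    """Subtask 1: report the file's tree/branch structure (stubbed result)."""
    return {"path": path, "trees": ["Events"]}


def convert_to_ntuple(path):
    """Subtask 2: flatten the ROOT tree into an analysis ntuple (stubbed result)."""
    return {"path": path, "rows": 0}


def preprocess(ntuples):
    """Subtask 3: select signal and background regions (stubbed pass-through)."""
    return {name: nt for name, nt in ntuples.items()}


reports = {name: inspect_root_file(p) for name, p in INPUT_FILES.items()}
ntuples = {name: convert_to_ntuple(p) for name, p in INPUT_FILES.items()}
selected = preprocess(ntuples)
```

Giving each subtask its own prompt and a fixed `INPUT_FILES`-style mapping is what removes the agent's need to infer data paths from implicit context.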
Figure 1: Success fraction for each model–step pair.

Figure 2: Distribution of error magnitudes by model, summarizing the variability and characteristic failure patterns across models. Bars without numerical labels correspond to fractions below 3%, omitted for clarity.
The results across models, summarized in Figures 1 and 2, show behavior qualitatively consistent with the baseline while highlighting quantitative differences in reliability, efficiency, and error patterns. For the `gemini-pro-2.5` model, the large number of repeated trials (219 in total) provides a statistically robust characterization of performance across steps. For the other models, each tested with approximately ten trials per step, the smaller sample size limits statistical interpretation, and the results should therefore be regarded as qualitative indicators of behavior rather than precise performance estimates. Nonetheless, the consistent cross-model behavior and similar failure patterns suggest that the workflow and evaluation metrics are general enough to support larger-scale future benchmarks. This pilot-level comparison thus establishes both the feasibility and reproducibility of the agentic workflow across distinct LLM architectures.
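The sample-size caveat can be made concrete with a binomial confidence interval on the success fraction. The success counts below are invented for illustration; only the trial counts (roughly 10 per step versus 219) come from the text:

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial success fraction."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Invented counts: an 80% success rate observed at both sample sizes.
lo_small, hi_small = wilson_interval(8, 10)      # ~10 trials per step
lo_large, hi_large = wilson_interval(175, 219)   # gemini-pro-2.5 baseline
# The 10-trial interval is several times wider, which is why those
# results are read as qualitative indicators rather than estimates.
```

At ten trials the 95% interval spans tens of percentage points, so even identical observed fractions are not distinguishable across models at that scale.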