ho22joshua committed
Commit 7b7fd60 · 1 Parent(s): f46e99c

adding plots to readme

Files changed (1): README.md (+9, −2)
README.md CHANGED
@@ -53,7 +53,7 @@ We design a supervisor–coder agent to carry out each task, as illustrated in F
 
  The supervisor and coder roles are defined by their differing access to state, memory, and system instructions. In the reference configuration, both roles are implemented using the `gemini-pro-2.5` model [Google, 2025](#ref-google-2025); however, the same architecture is applied to a range of contemporary LLMs to evaluate model-dependent performance and variability.
 
- <!-- ![Illustration of internal workflow for the supervisor–coder agent.](supervisor_coder.pdf) -->
+ ![Illustration of internal workflow for the supervisor–coder agent.](agent.png)
 
  Each agent interaction is executed through API calls to the LLM. Although we set the temperature to 0, other sampling parameters (e.g., top-p, top-k) remained at their default values, and some thinking-oriented models internally raise the effective temperature. As a result, the outputs exhibit minor stochastic variation even under identical inputs. Each call includes a user instruction, a system prompt, and auxiliary metadata for tracking errors and execution records.
 
@@ -96,7 +96,14 @@ Over 98% of tokens originated from the model’s autonomous reasoning and self-c
 
  Following the initial benchmark with `gemini-pro-2.5`, we expanded the study to include additional models, such as `openai-gpt-5` [OpenAI, 2025](#ref-openai-2025), `claude-3.5` [Anthropic, 2024](#ref-anthropic-2024), `qwen-3` [Alibaba, 2025](#ref-alibaba-2025), and the open-weight `gpt-oss-120b` [OpenAI et al., 2025](#ref-openai-et-al-2025), evaluated under the same agentic workflow. Based on early observations, the prompts for the **data preparation** stage were refined and divided into three subtasks: **ROOT file inspection**, **ntuple conversion**, and **preprocessing** (signal and background region selection). Input file locations were also made explicit for an agent to ensure deterministic resolution of data paths and reduce reliance on implicit context.
 
- <!-- Insert figure here. -->
+ ![Success Rate.](success_rate.png)
+ Figure 1: Success fraction for each
+ model–step pair.
+
+ ![Error Distribution.](error_distribution.png)
+ Figure 2: Distribution of error magnitudes by model, summarizing the variability and
+ characteristic failure patterns across models. Bars without numerical labels correspond to fractions
+ below 3%, omitted for clarity.
 
  The results across models, summarized in Figures 1 and 2, show consistent qualitative behavior with the baseline while highlighting quantitative differences in reliability, efficiency, and error patterns. For the `gemini-pro-2.5` model, the large number of repeated trials (219 total) provides a statistically robust characterization of performance across steps. For the other models—each tested with approximately ten trials per step—the smaller sample size limits statistical interpretation, and the results should therefore be regarded as qualitative indicators of behavior rather than precise performance estimates. Nonetheless, the observed cross-model consistency and similar failure patterns suggest that the workflow and evaluation metrics are sufficiently general to support larger-scale future benchmarks. This pilot-level comparison thus establishes both the feasibility and reproducibility of the agentic workflow across distinct LLM architectures.
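The README text in this diff describes each agent interaction as an API call bundling a system prompt, a user instruction, and auxiliary metadata, with temperature pinned to 0 and top-p/top-k left at provider defaults. A minimal sketch of how such a request payload might be assembled is below; the function name, field names, and prompt strings are illustrative assumptions, not the repository's actual client code.

```python
import json
import uuid
from datetime import datetime, timezone

def build_agent_request(role: str, system_prompt: str, user_instruction: str,
                        model: str = "gemini-pro-2.5") -> dict:
    """Assemble one supervisor/coder call payload (illustrative sketch only).

    Temperature is pinned to 0; top-p and top-k are deliberately left unset
    so the provider defaults apply, mirroring the setup the README describes.
    """
    return {
        "model": model,
        "temperature": 0,  # other sampling parameters stay at their defaults
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_instruction},
        ],
        # Auxiliary metadata for tracking errors and execution records.
        "metadata": {
            "agent_role": role,  # "supervisor" or "coder"
            "call_id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }

# Hypothetical usage for the coder role in the data-preparation stage.
payload = build_agent_request(
    role="coder",
    system_prompt="You write and debug analysis code.",
    user_instruction="Convert the ROOT files into flat ntuples.",
)
print(json.dumps(payload, indent=2)[:80])
```

Keeping the metadata out of the `messages` list means it never reaches the model's context; it exists purely for the execution records mentioned in the text.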
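The closing paragraph notes that roughly ten trials per step limit statistical interpretation compared with the 219 trials for `gemini-pro-2.5`. A quick way to see why: a Wilson score interval on a success fraction from n = 10 is tens of percentage points wide. The sketch below uses only the standard library; the success counts are made-up examples, not results from this study.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial success fraction."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical counts, for illustration only.
lo10, hi10 = wilson_interval(8, 10)      # 8/10 successes -> roughly [0.49, 0.94]
lo219, hi219 = wilson_interval(175, 219)  # a larger sample narrows the interval
print(f"n=10:  [{lo10:.2f}, {hi10:.2f}]")
print(f"n=219: [{lo219:.2f}, {hi219:.2f}]")
```

With the interval spanning nearly half the unit range at n = 10, per-step success fractions for the smaller runs are best read as qualitative indicators, exactly as the text argues.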