arturo-fredes commited on
Commit
b96e111
·
verified ·
1 Parent(s): 57c7100

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +14 -2
README.md CHANGED
@@ -156,7 +156,7 @@ Benchmark scores were obtained with the following setups. Methodology varies by
156
  - **Inference library**: vLLM 0.13.0
157
  - **Hardware**: 1× NVIDIA H200 Tensor Core GPU
158
  - **Reasoning effort**: high
159
- - **Decoding**: temperature = 0.6, max_tokens = 131072, top_p = 1.0, top_k = 0
160
  - **Batch size**: 64
161
 
162
  #### IFBench, AA-LCR, SciCode
@@ -165,7 +165,7 @@ Benchmark scores were obtained with the following setups. Methodology varies by
165
  - **Inference library**: vLLM 0.13.0
166
  - **Hardware**: 1× NVIDIA H200 Tensor Core GPU
167
  - **Reasoning effort**: high
168
- - **Decoding**: temperature = 1.0, max_tokens = 131072, top_p = 1.0, top_k = 0
169
  - **Batch size**: 64
170
 
171
  #### Tau2-bench (Telecom)
@@ -188,6 +188,17 @@ Benchmark scores were obtained with the following setups. Methodology varies by
188
  - **Reproducibility**: subset from AA (https://artificialanalysis.ai/methodology/intelligence-benchmarking#terminal-bench-hard)
189
  - **Agent**: terminus-2, max episodes 100; repeats 3;
190
 
 
 
 
 
 
 
 
 
 
 
 
191
  ### Quantitative Results (Reported & Planned)
192
 
193
 
@@ -203,6 +214,7 @@ Benchmark scores were obtained with the following setups. Methodology varies by
203
  | LiveCodeBench | 62.75 | 51.53 | 68.68 |
204
  | Terminal Bench | 24.24 | 12.12 | 15.91 |
205
  | AA-LCR | 49.00 | 35.67 | 40.33 |
 
206
 
207
  ![Benchmarks](assets/benchmarks.png)
208
 
 
156
  - **Inference library**: vLLM 0.13.0
157
  - **Hardware**: 1× NVIDIA H200 Tensor Core GPU
158
  - **Reasoning effort**: high
159
+ - **Decoding**: temperature = 1.0, top_p = 1.0
160
  - **Batch size**: 64
161
 
162
  #### IFBench, AA-LCR, SciCode
 
165
  - **Inference library**: vLLM 0.13.0
166
  - **Hardware**: 1× NVIDIA H200 Tensor Core GPU
167
  - **Reasoning effort**: high
168
+ - **Decoding**: temperature = 1.0,top_p = 1.0
169
  - **Batch size**: 64
170
 
171
  #### Tau2-bench (Telecom)
 
188
  - **Reproducibility**: subset from AA (https://artificialanalysis.ai/methodology/intelligence-benchmarking#terminal-bench-hard)
189
  - **Agent**: terminus-2, max episodes 100; repeats 3;
190
 
191
+ #### Aider polyglot
192
+
193
+ - **Evaluation framework**: [Aider-AI/aider](https://github.com/Aider-AI/aider)
194
+ - **Hardware**: 2× NVIDIA H200 Tensor Core GPU (host with Docker)
195
+ - **Dataset**: `polyglot-benchmark` (225 exercises across multiple languages)
196
+ - **Reasoning effort**: high (passed via `--reasoning-effort`)
197
+ - **Decoding**: temperature = 1.0, top_p = 1.0 (configurable via `generation_config` / `--read-model-settings` YAML)
198
+ - **Edit format**: `whole` (also supports `diff | udiff | diff-fenced | architect`)
199
+ - **Reproducibility**: leaderboard-aligned; `--tries=2` (repeats)
200
+
201
+
202
  ### Quantitative Results (Reported & Planned)
203
 
204
 
 
214
  | LiveCodeBench | 62.75 | 51.53 | 68.68 |
215
  | Terminal Bench | 24.24 | 12.12 | 15.91 |
216
  | AA-LCR | 49.00 | 35.67 | 40.33 |
217
+ | AIDER | 43.60 | 26.2 | 34.2 |
218
 
219
  ![Benchmarks](assets/benchmarks.png)
220