Upload index.html

index.html (CHANGED, +9 −7)
@@ -86,8 +86,11 @@
   <!-- @PAN TODO: change links -->
   <a href="https://huggingface.co/spaces/Anonymous-COFFEE/Project-COFFEE/blob/main/static/ACL24__Code_Edit.pdf"
      class="external-link button is-normal is-rounded is-dark" target="_blank">
-    <span class="icon">
+    <!-- <span class="icon">
       <i class="fas fa-file-pdf"></i>
-    </span>
+    </span> -->
+    <span class="icon">
+      <p style="font-size:18px">📝</p>
+    </span>
     <span>Paper</span>
   </a>

@@ -189,7 +192,7 @@
   <!-- Abstract. -->
   <div class="columns is-centered has-text-centered">
     <div class="column is-four-fifths">
-      <h2 class="title is-3">🔔News</h2>
+      <h2 class="title is-3">🔔 News</h2>
       <div class="content has-text-justified">


@@ -219,7 +222,7 @@
       <h2 class="title is-3">Introduction</h2>
       <div class="content has-text-justified">
         <p>
-          This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code-edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available
+          This paper presents COFFEE-GYM, a comprehensive RL environment for training models that provide feedback on code editing. COFFEE-GYM includes two major components: (1) COFFEE, a dataset containing humans' code-edit traces for coding questions and machine-written feedback for editing erroneous code; (2) COFFEEEVAL, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, COFFEE-GYM addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying COFFEE-GYM, we elicit feedback models that outperform baselines in enhancing open-source code LLMs' code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available.
         </p>
       </div>
     </div>

@@ -313,16 +316,15 @@ The strong performance of our CoffeeEval validates its effectiveness in assessing

   <div class="content has-text-centered">
     <img src="static/images/main_results.png" alt="algebraic reasoning" width="100%"/>
-    <p> We
-      ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.</p>
+    <p> Code editing results of our feedback model trained with Coffee-Gym, i.e., PPO-COFFEEVAL, on HumanEvalFix and COFFEE-Test. We pair our feedback model with an open-source code LLM as the code editor.</p>
   </div>

   <div class="content has-text-justified">
-    <p>
+    <!-- <p>
       Table above reports the model performance in editing solutions generated from ChatGPT for problems in HumanEvalSynthesize, MBPP, and APPS. CoffeePots outperforms all open-source baselines, including Code Llama (13B), the previous SOTA among open-source code LLMs. Furthermore, CoffeePots shows better results than feedback-augmented Code Llama (13B), i.e., prompted with Self-Refine and Self-Debug, suggesting the effectiveness of our strategy on generating feedback.
       In addition, while some open-source code LLMs show almost no improvement in MBPP and APPS (i.e., 0% ERR), CoffeePots shows moderate improvements on these benchmarks (i.e., up to 7.5% ERR).
       Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competitive results particularly on HumanEvalSynthesize and MBPP, showing that our framework can serve as a strong alternative to closed-source LLMs while being publicly available and much smaller in size.
-    </p>
+    </p> -->
   </div>

 </div>