# CI Builds Repair Benchmark Integration

This module integrates the CI Builds Repair benchmark developed by [JetBrains-Research](https://github.com/JetBrains-Research/lca-baselines/tree/main/ci-builds-repair/ci-builds-repair-benchmark).

For more information, refer to the [GitHub repository](https://github.com/JetBrains-Research/lca-baselines/tree/main/ci-builds-repair/ci-builds-repair-benchmark) and the associated [research paper](https://arxiv.org/abs/2406.11612).

See notice below for details

## Setup

Before running any scripts, configure the benchmark by setting up `config.yaml`.

This benchmark pushes to JetBrains' private GitHub repository. To run it, you will need to request a `token_gh` from their team.

## Inference

To run inference with your model:

```bash
./evaluation/benchmarks/lca_ci_build_repair/scripts/run_infer.sh llm.yourmodel
```

## Evaluation

To evaluate the predictions:

```bash
./evaluation/benchmarks/lca_ci_build_repair/scripts/eval_infer.sh predictions_path_containing_output
```

## Results

The benchmark contains 68 instances. We skip instances #126 and #145 due to dockerization errors and run only the remaining 66.

Because the benchmark runs on live GitHub machines, its results are sensitive to the date on which it is run. Even the golden patches in the dataset may fail because of upstream updates.

For example, on 2025-04-09, running the benchmark against the golden patches gave 57/67 successes, with 1 job left in the waiting list.

On 2025-04-10, running the full benchmark with OH and no oracle, 37 instances succeeded. That is about 54% of the complete set of 68 instances and about 65% of the 57 that succeed with golden patches.
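
The two percentages above use different denominators, so it can help to recompute them whenever the benchmark is re-run on a new date. The snippet below is a minimal sketch of that arithmetic, not part of the benchmark scripts. The helper `success_rate` and the variable names are hypothetical; the counts are the ones reported above.

```python
def success_rate(successes: int, total: int) -> float:
    """Return the success rate as a percentage of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


# Counts reported in this README (2025-04-09 and 2025-04-10 runs).
resolved = 37          # instances solved without an oracle
all_instances = 68     # complete benchmark, including the two skipped instances
golden_passing = 57    # instances whose golden patch passed on 2025-04-09

rate_all = success_rate(resolved, all_instances)
rate_golden = success_rate(resolved, golden_passing)

print(f"{rate_all:.1f}% of all {all_instances} instances")
print(f"{rate_golden:.1f}% of the {golden_passing} golden-passing instances")
```

Reporting against the golden-passing subset separates model failures from instances that currently cannot pass at all because of changes on the live runners.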