Mercury7353 commited on
Commit
471cccd
Β·
verified Β·
1 Parent(s): 64d01aa

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +86 -86
README.md CHANGED
@@ -1,86 +1,86 @@
1
- <h1 align="center"> PyBench: Evaluate LLM Agent on Real World Tasks </h1>
2
-
3
- <p align="center">
4
- <a href="comming soon">πŸ“ƒ Paper</a>
5
- β€’
6
- <a href="https://huggingface.co/datasets/Mercury7353/PyInstruct" >πŸ€— Data (PyInstruct)</a>
7
- β€’
8
- <a href="https://huggingface.co/Mercury7353/PyLlama3" >πŸ€— Model (PyLlama3)</a>
9
- β€’
10
- </p>
11
-
12
-
13
- PyBench is a comprehensive benchmark evaluating LLM on real-world coding tasks including **chart analysis**, **text analysis**, **image/ audio editing**, **complex math** and **software/website development**.
14
- We collect files from Kaggle, arXiv, and other sources and automatically generate queries according to the type and content of each file.
15
-
16
- ![Overview](images/main.png)
17
-
18
-
19
-
20
-
21
- ## Why PyBench?
22
-
23
- The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image processing.
24
- %
25
- However, existing benchmarks primarily focus on either simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which are representative of various daily coding tasks.
26
- %
27
- To address this gap, we introduce **PyBench**, a benchmark that encompasses 6 main categories of real-world tasks, covering more than 10 types of files.
28
- ![How PyBench Works](images/generateTraj.png)
29
-
30
- ## πŸ“ PyInstruct
31
-
32
- To figure out a way to enhance the model's ability on PyBench, we generate a homologous dataset: **PyInstruct**. The PyInstruct contains multi-turn interaction between the model and files, stimulating the model's capability on coding, debugging and multi-turn complex task solving. Compare to other Datasets focus on multi-turn coding ability, PyInstruct has longer turns and tokens per trajectory.
33
-
34
- ![Data Statistics](images/data.png)
35
- *Dataset Statistics. Token statistics are computed using Llama-2 tokenizer.*
36
-
37
- ## πŸͺ„ PyLlama
38
-
39
- We trained Llama3-8B-base on PyInstruct, CodeActInstruct, CodeFeedback, and Jupyter Notebook Corpus to get PyLlama3, achieving an outstanding performance on PyBench
40
-
41
-
42
- ## πŸš€ Model Evaluation with PyBench!
43
- <video src="https://github.com/Mercury7353/PyBench/assets/103104011/fef85310-55a3-4ee8-a441-612e7dbbaaab"> </video>
44
- *Demonstration of the chat interface.*
45
- ### Environment Setup:
46
- Begin by establishing the required environment:
47
-
48
- ```bash
49
- conda env create -f environment.yml
50
- ```
51
-
52
- ### Model Configuration
53
- Initialize a local server using the vllm framework, which defaults to port "8001":
54
-
55
- ```bash
56
- bash SetUpModel.sh
57
- ```
58
-
59
-
60
- A Jinja template is necessary to launch a vllm server. Commonly used templates can be located in the `./jinja/` directory.
61
- Prior to starting the vllm server, specify the model path and Jinja template path in `SetUpModel.sh`.
62
- ### Configuration Adjustments
63
- Specify your model's path and the server port in `./config/model.yaml`. This configuration file also allows for customization of the system prompts.
64
- ### Execution on PyBench
65
- Ensure to update the output trajectory file path in the script before execution:
66
-
67
- ```bash
68
- python /data/zyl7353/codeinterpreterbenchmark/inference.py --config_path ./config/<your config>.yaml --task_path ./data/meta/task.json --output_path <your trajectory.jsonl path>
69
- ```
70
-
71
-
72
- ### Unit Testing Procedure
73
- - **Step 1:** Store the output files in `./output`.
74
- - **Step 2:** Define the trajectory file path in
75
- `./data/unit_test/enter_point.py`.
76
- - **Step 3:** Execute the unit test script:
77
- ```bash
78
- python data/unit_test/enter_point.py
79
- ```
80
-
81
- ## πŸ“Š LeaderBoard
82
- ![LLM Leaderboard](images/leaderboard.png)
83
- # πŸ“š Citation
84
- ```bibtex
85
- TBD
86
- ```
 
1
+ <h1 align="center"> PyBench: Evaluate LLM Agent on Real World Tasks </h1>
2
+
3
+ <p align="center">
4
+ <a href="https://arxiv.org/abs/2407.16732">πŸ“ƒ Paper</a>
5
+ β€’
6
+ <a href="https://huggingface.co/datasets/Mercury7353/PyInstruct" >πŸ€— Data (PyInstruct)</a>
7
+ β€’
8
+ <a href="https://huggingface.co/Mercury7353/PyLlama3" >πŸ€— Model (PyLlama3)</a>
9
+ β€’
10
+ </p>
11
+
12
+
13
+ PyBench is a comprehensive benchmark evaluating LLM on real-world coding tasks including **chart analysis**, **text analysis**, **image/ audio editing**, **complex math** and **software/website development**.
14
+ We collect files from Kaggle, arXiv, and other sources and automatically generate queries according to the type and content of each file.
15
+
16
+ ![Overview](images/main.png)
17
+
18
+
19
+
20
+
21
+ ## Why PyBench?
22
+
23
+ The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image processing.
24
+ %
25
+ However, existing benchmarks primarily focus on either simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which are representative of various daily coding tasks.
26
+ %
27
+ To address this gap, we introduce **PyBench**, a benchmark that encompasses 6 main categories of real-world tasks, covering more than 10 types of files.
28
+ ![How PyBench Works](images/generateTraj.png)
29
+
30
+ ## πŸ“ PyInstruct
31
+
32
+ To figure out a way to enhance the model's ability on PyBench, we generate a homologous dataset: **PyInstruct**. The PyInstruct contains multi-turn interaction between the model and files, stimulating the model's capability on coding, debugging and multi-turn complex task solving. Compare to other Datasets focus on multi-turn coding ability, PyInstruct has longer turns and tokens per trajectory.
33
+
34
+ ![Data Statistics](images/data.png)
35
+ *Dataset Statistics. Token statistics are computed using Llama-2 tokenizer.*
36
+
37
+ ## πŸͺ„ PyLlama
38
+
39
+ We trained Llama3-8B-base on PyInstruct, CodeActInstruct, CodeFeedback, and Jupyter Notebook Corpus to get PyLlama3, achieving an outstanding performance on PyBench
40
+
41
+
42
+ ## πŸš€ Model Evaluation with PyBench!
43
+ <video src="https://github.com/Mercury7353/PyBench/assets/103104011/fef85310-55a3-4ee8-a441-612e7dbbaaab"> </video>
44
+ *Demonstration of the chat interface.*
45
+ ### Environment Setup:
46
+ Begin by establishing the required environment:
47
+
48
+ ```bash
49
+ conda env create -f environment.yml
50
+ ```
51
+
52
+ ### Model Configuration
53
+ Initialize a local server using the vllm framework, which defaults to port "8001":
54
+
55
+ ```bash
56
+ bash SetUpModel.sh
57
+ ```
58
+
59
+
60
+ A Jinja template is necessary to launch a vllm server. Commonly used templates can be located in the `./jinja/` directory.
61
+ Prior to starting the vllm server, specify the model path and Jinja template path in `SetUpModel.sh`.
62
+ ### Configuration Adjustments
63
+ Specify your model's path and the server port in `./config/model.yaml`. This configuration file also allows for customization of the system prompts.
64
+ ### Execution on PyBench
65
+ Ensure to update the output trajectory file path in the script before execution:
66
+
67
+ ```bash
68
+ python /data/zyl7353/codeinterpreterbenchmark/inference.py --config_path ./config/<your config>.yaml --task_path ./data/meta/task.json --output_path <your trajectory.jsonl path>
69
+ ```
70
+
71
+
72
+ ### Unit Testing Procedure
73
+ - **Step 1:** Store the output files in `./output`.
74
+ - **Step 2:** Define the trajectory file path in
75
+ `./data/unit_test/enter_point.py`.
76
+ - **Step 3:** Execute the unit test script:
77
+ ```bash
78
+ python data/unit_test/enter_point.py
79
+ ```
80
+
81
+ ## πŸ“Š LeaderBoard
82
+ ![LLM Leaderboard](images/leaderboard.png)
83
+ # πŸ“š Citation
84
+ ```bibtex
85
+ TBD
86
+ ```