Add robotics pipeline tag and paper link

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +34 -33
README.md CHANGED
@@ -1,31 +1,34 @@
1
  ---
2
- license: apache-2.0
3
- license_link: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct/blob/main/LICENSE
4
- language:
5
- - en
6
  base_model:
7
  - Qwen/Qwen2.5-Coder-7B-Instruct
8
- pipeline_tag: text-generation
 
9
  library_name: transformers
 
 
 
10
  tags:
11
  - code
12
  - chat
13
  - qwen
14
  - qwen-coder
15
  - agent
 
16
  ---
17
 
18
  # Dria-Agent-α-7B
19
 
 
 
20
  ## Introduction
21
 
22
  ***Dria-Agent-α*** is a series of large language models trained on top of the [Qwen2.5-Coder](https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f) series, specifically on top of the [Qwen/Qwen2.5-Coder-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct) and [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) models, for use in agentic applications. These models are the first instalment of agent-focused LLMs (hence the **α** in the naming) that we hope to improve with better and more elaborate techniques in subsequent releases.
23
 
24
  Dria-Agent-α employs ***Pythonic function calling***, in which the LLM uses blocks of Python code to interact with provided tools and output actions. This method was inspired by much previous work, including but not limited to [DynaSaur](https://arxiv.org/pdf/2411.01747), [RLEF](https://arxiv.org/pdf/2410.02089), [ADAS](https://arxiv.org/pdf/2408.08435) and [CAMEL](https://arxiv.org/pdf/2303.17760). This way of function calling has a few advantages over traditional JSON-based function calling methods:
25
 
26
- 1. **One-shot Parallel Multiple Function Calls:** The model can utilise many synchronous processes in a single chat turn to arrive at a solution, something that would take other function-calling models multiple turns of conversation.
27
- 2. **Free-form Reasoning and Actions:** The model provides reasoning traces freely in natural language and its actions in between \`\`\`python \`\`\` blocks, as it already tends to do without special prompting or tuning. This aims to mitigate the possible performance loss caused by imposing specific formats on LLM outputs, as discussed in [Let Me Speak Freely?](https://arxiv.org/pdf/2408.02442).
28
- 3. **On-the-fly Complex Solution Generation:** The solution provided by the model is essentially a Python program, with the exclusion of some "risky" builtins like `exec`, `eval` and `compile` (see the full list in **Quickstart** below). This enables the model to implement custom complex logic with conditionals and synchronous pipelines (using the output of one function as an argument to the next), which would not be possible with current JSON-based function calling methods (as far as we know).
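The restricted execution mentioned in point 3 might be sketched as below; `run_model_code` and the exact `BLOCKED` set are assumptions for illustration (the model card's **Quickstart** has the actual exclusion list):

```python
import builtins

# Assumed exclusion list -- the README names exec, eval and compile;
# the remaining entries here are illustrative.
BLOCKED = {"exec", "eval", "compile", "open", "__import__"}

def run_model_code(code: str, tools: dict) -> dict:
    """Run a model-emitted Python block with risky builtins removed and
    the provided tool functions in scope; return the result namespace."""
    safe_builtins = {
        name: getattr(builtins, name)
        for name in dir(builtins)
        if name not in BLOCKED
    }
    namespace = {"__builtins__": safe_builtins, **tools}
    exec(code, namespace)  # the host runs exec; the model's code cannot
    return namespace

# Usage with a dummy tool: the model's "solution" is just a Python program.
ns = run_model_code("total = add(2, 3)", {"add": lambda a, b: a + b})
print(ns["total"])  # 5
```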
29
 
30
  ## Quickstart
31
 
@@ -197,38 +200,36 @@ This code will first determine if the specified time slot is available tomorrow.
197
 
198
  We evaluate the model on the following benchmarks:
199
 
200
- 1. Berkeley Function Calling Leaderboard (BFCL)
201
- 2. MMLU-Pro
202
- 3. **Dria-Pythonic-Agent-Benchmark (DPAB):** The benchmark we curated via synthetic data generation, model-based validation, filtering and manual selection to evaluate LLMs on their Pythonic function calling ability, spanning multiple scenarios and tasks. More detailed information about the benchmark and its GitHub repo will be released soon.
203
 
204
  Below are the BFCL evaluation results for ***Qwen2.5-Coder-3B-Instruct***, ***Dria-Agent-α-3B***, ***Dria-Agent-α-7B***, and ***gpt-4o-2024-11-20***:
205
 
206
  | Metric | Qwen2.5-Coder-3B-Instruct | Dria-Agent-α-3B | Dria-Agent-α-7B | gpt-4o-2024-11-20 (Prompt) |
207
- |---------------------------------------|----------------------------|-------------------|-------------------|---------------------------|
208
- | **Non-Live Simple AST** | 75.50% | 75.08% | 77.58% | 79.42% |
209
- | **Non-Live Multiple AST** | 90.00% | 93.00% | 94.00% | 95.50% |
210
- | **Non-Live Parallel AST** | 80.00% | 85.00% | 93.50% | 94.00% |
211
- | **Non-Live Parallel Multiple AST** | 78.50% | 79.00% | 89.50% | 83.50% |
212
- | **Non-Live Simple Exec** | 82.07% | 87.57% | 93.29% | 100.00% |
213
- | **Non-Live Multiple Exec** | 86.00% | 85.14% | 88.00% | 94.00% |
214
- | **Non-Live Parallel Exec** | 82.00% | 90.00% | 88.00% | 86.00% |
215
- | **Non-Live Parallel Multiple Exec** | 80.00% | 88.00% | 72.50% | 77.50% |
216
- | **Live Simple AST** | 68.22% | 70.16% | 81.40% | 83.72% |
217
- | **Live Multiple AST** | 66.00% | 67.14% | 78.73% | 79.77% |
218
- | **Live Parallel AST** | 62.50% | 50.00% | 75.00% | 87.50% |
219
- | **Live Parallel Multiple AST** | 66.67% | 70.83% | 62.50% | 70.83% |
220
- | **Relevance Detection** | 88.89% | 100.00% | 100.00% | 83.33% |
221
-
222
-
223
 
224
  and the MMLU-Pro and DPAB results:
225
 
226
- | Benchmark Name | Qwen2.5-Coder-7B-Instruct | Dria-Agent-α-7B |
227
- |----------------|---------------------------|-----------------|
228
- | MMLU-Pro | 45.6 ([Self Reported](https://arxiv.org/pdf/2409.12186)) | 42.54 |
229
- | DPAB (Pythonic, Strict) | 44.0 | 70.0 |
230
 
231
- **\*Note:** The model tends to use Pythonic function calling for many of the test cases in STEM-related fields (math, physics, chemistry, etc.) in the MMLU-Pro benchmark, which isn't captured by the evaluation framework and scripts provided in the benchmark's [GitHub repository](https://github.com/TIGER-AI-Lab/MMLU-Pro/tree/main). We haven't modified the evaluation script, and leave that for future iterations of this model. However, based on qualitative analysis of the model responses, we suspect that the model's score would increase rather than suffer a ~3% decrease.
232
 
233
  #### Citation
234
 
@@ -238,4 +239,4 @@ and the MMLU-Pro and DPAB results:
238
  title={Dria-Agent-a},
239
  author={andthattoo and Atakan Tekparmak}
240
  }
241
- ```
 
1
  ---
 
 
 
 
2
  base_model:
3
  - Qwen/Qwen2.5-Coder-7B-Instruct
4
+ language:
5
+ - en
6
  library_name: transformers
7
+ license: apache-2.0
8
+ license_link: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct/blob/main/LICENSE
9
+ pipeline_tag: text-generation
10
  tags:
11
  - code
12
  - chat
13
  - qwen
14
  - qwen-coder
15
  - agent
16
+ - robotics
17
  ---
18
 
19
  # Dria-Agent-α-7B
20
 
21
+ This repository hosts Dria-Agent-α-7B, whose Pythonic function-calling approach draws on the paper [DynaSaur: Large Language Agents Beyond Predefined Actions](https://huggingface.co/papers/2411.01747).
22
+
23
  ## Introduction
24
 
25
  ***Dria-Agent-α*** is a series of large language models trained on top of the [Qwen2.5-Coder](https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f) series, specifically on top of the [Qwen/Qwen2.5-Coder-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct) and [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) models, for use in agentic applications. These models are the first instalment of agent-focused LLMs (hence the **α** in the naming) that we hope to improve with better and more elaborate techniques in subsequent releases.
26
 
27
  Dria-Agent-α employs ***Pythonic function calling***, in which the LLM uses blocks of Python code to interact with provided tools and output actions. This method was inspired by much previous work, including but not limited to [DynaSaur](https://arxiv.org/pdf/2411.01747), [RLEF](https://arxiv.org/pdf/2410.02089), [ADAS](https://arxiv.org/pdf/2408.08435) and [CAMEL](https://arxiv.org/pdf/2303.17760). This way of function calling has a few advantages over traditional JSON-based function calling methods:
28
 
29
+ 1. **One-shot Parallel Multiple Function Calls:** The model can utilise many synchronous processes in a single chat turn to arrive at a solution, something that would take other function-calling models multiple turns of conversation.
30
+ 2. **Free-form Reasoning and Actions:** The model provides reasoning traces freely in natural language and its actions in between \`\`\`python \`\`\` blocks, as it already tends to do without special prompting or tuning. This aims to mitigate the possible performance loss caused by imposing specific formats on LLM outputs, as discussed in [Let Me Speak Freely?](https://arxiv.org/pdf/2408.02442).
31
+ 3. **On-the-fly Complex Solution Generation:** The solution provided by the model is essentially a Python program, with the exclusion of some "risky" builtins like `exec`, `eval` and `compile` (see the full list in **Quickstart** below). This enables the model to implement custom complex logic with conditionals and synchronous pipelines (using the output of one function as an argument to the next), which would not be possible with current JSON-based function calling methods (as far as we know).
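A minimal, self-contained sketch of what a single Pythonic function-calling turn enables; the tools `get_user_timezone` and `schedule_meeting` are hypothetical stand-ins, not part of any released tool set:

```python
# Hypothetical tools the model would be given (illustration only).
def get_user_timezone(user: str) -> str:
    return {"alice": "UTC+1", "bob": "UTC-5"}[user]

def schedule_meeting(participants: list, timezones: list) -> str:
    return f"meeting scheduled for {', '.join(participants)} across {sorted(set(timezones))}"

# The kind of program such a model can emit in ONE chat turn:
# multiple synchronous calls, outputs feeding later calls, and plain
# Python control flow in between.
users = ["alice", "bob"]
tzs = [get_user_timezone(u) for u in users]  # multiple calls in one turn
if len(set(tzs)) > 1:                        # on-the-fly conditional logic
    result = schedule_meeting(users, tzs)    # chained call using prior outputs
else:
    result = "everyone shares a timezone; pick any slot"
print(result)
```

A JSON-based function-calling model would typically need one conversation turn per call here, and could not express the conditional inside the call itself.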
32
 
33
  ## Quickstart
34
 
 
200
 
201
  We evaluate the model on the following benchmarks:
202
 
203
+ 1. Berkeley Function Calling Leaderboard (BFCL)
204
+ 2. MMLU-Pro
205
+ 3. **Dria-Pythonic-Agent-Benchmark (DPAB):** The benchmark we curated via synthetic data generation, model-based validation, filtering and manual selection to evaluate LLMs on their Pythonic function calling ability, spanning multiple scenarios and tasks. More detailed information about the benchmark and its GitHub repo will be released soon.
206
 
207
  Below are the BFCL evaluation results for ***Qwen2.5-Coder-3B-Instruct***, ***Dria-Agent-α-3B***, ***Dria-Agent-α-7B***, and ***gpt-4o-2024-11-20***:
208
 
209
  | Metric | Qwen2.5-Coder-3B-Instruct | Dria-Agent-α-3B | Dria-Agent-α-7B | gpt-4o-2024-11-20 (Prompt) |
210
+ | ------------------------------------- | -------------------------- | ----------------- | ----------------- | -------------------------- |
211
+ | **Non-Live Simple AST** | 75.50% | 75.08% | 77.58% | 79.42% |
212
+ | **Non-Live Multiple AST** | 90.00% | 93.00% | 94.00% | 95.50% |
213
+ | **Non-Live Parallel AST** | 80.00% | 85.00% | 93.50% | 94.00% |
214
+ | **Non-Live Parallel Multiple AST** | 78.50% | 79.00% | 89.50% | 83.50% |
215
+ | **Non-Live Simple Exec** | 82.07% | 87.57% | 93.29% | 100.00% |
216
+ | **Non-Live Multiple Exec** | 86.00% | 85.14% | 88.00% | 94.00% |
217
+ | **Non-Live Parallel Exec** | 82.00% | 90.00% | 88.00% | 86.00% |
218
+ | **Non-Live Parallel Multiple Exec** | 80.00% | 88.00% | 72.50% | 77.50% |
219
+ | **Live Simple AST** | 68.22% | 70.16% | 81.40% | 83.72% |
220
+ | **Live Multiple AST** | 66.00% | 67.14% | 78.73% | 79.77% |
221
+ | **Live Parallel AST** | 62.50% | 50.00% | 75.00% | 87.50% |
222
+ | **Live Parallel Multiple AST** | 66.67% | 70.83% | 62.50% | 70.83% |
223
+ | **Relevance Detection** | 88.89% | 100.00% | 100.00% | 83.33% |
 
 
224
 
225
  and the MMLU-Pro and DPAB results:
226
 
227
+ | Benchmark Name | Qwen2.5-Coder-7B-Instruct | Dria-Agent-α-7B |
228
+ | ------------------------- | -------------------------- | --------------- |
229
+ | MMLU-Pro | 45.6 ([Self Reported](https://arxiv.org/pdf/2409.12186)) | 42.54 |
230
+ | DPAB (Pythonic, Strict) | 44.0 | 70.0 |
231
 
232
+ **\*Note:** The model tends to use Pythonic function calling for many of the test cases in STEM-related fields (math, physics, chemistry, etc.) in the MMLU-Pro benchmark, which isn't captured by the evaluation framework and scripts provided in the benchmark's [GitHub repository](https://github.com/TIGER-AI-Lab/MMLU-Pro/tree/main). We haven't modified the evaluation script, and leave that for future iterations of this model. However, based on qualitative analysis of the model responses, we suspect that the model's score would increase rather than suffer a \~3% decrease.
233
 
234
  #### Citation
235
 
 
239
  title={Dria-Agent-a},
240
  author={andthattoo and Atakan Tekparmak}
241
  }
242
+ ```