Add robotics pipeline tag and paper link
#3
by nielsr HF Staff - opened

README.md CHANGED
---
base_model:
- Qwen/Qwen2.5-Coder-7B-Instruct
language:
- en
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- code
- chat
- qwen
- qwen-coder
- agent
- robotics
---

# Dria-Agent-α-7B

This repository hosts Dria-Agent-α-7B as presented in the paper [DynaSaur: Large Language Agents Beyond Predefined Actions](https://huggingface.co/papers/2411.01747).

## Introduction

***Dria-Agent-α*** is a series of large language models trained on top of the [Qwen2.5-Coder](https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f) series, specifically on top of the [Qwen/Qwen2.5-Coder-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct) and [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) models, intended for use in agentic applications. These models are the first instalment of agent-focused LLMs (hence the **α** in the naming), which we hope to improve with better and more elaborate techniques in subsequent releases.

Dria-Agent-α employs ***Pythonic function calling***: the LLM uses blocks of Python code to interact with provided tools and output actions. This method was inspired by many previous works, including but not limited to [DynaSaur](https://arxiv.org/pdf/2411.01747), [RLEF](https://arxiv.org/pdf/2410.02089), [ADAS](https://arxiv.org/pdf/2408.08435) and [CAMEL](https://arxiv.org/pdf/2303.17760). This way of function calling has a few advantages over traditional JSON-based function calling methods:

1. **One-shot Parallel Multiple Function Calls:** The model can utilise many synchronous processes in one chat turn to arrive at a solution, something that would take other function calling models multiple turns of conversation.
2. **Free-form Reasoning and Actions:** The model provides reasoning traces freely in natural language, and its actions in between \`\`\`python \`\`\` blocks, as it already tends to do without special prompting or tuning. This tries to mitigate the possible performance loss caused by imposing specific formats on LLM outputs, as discussed in [Let Me Speak Freely?](https://arxiv.org/pdf/2408.02442)
3. **On-the-fly Complex Solution Generation:** The solution provided by the model is essentially a Python program, excluding some "risky" builtins like `exec`, `eval` and `compile` (see the full list in **Quickstart** below). This enables the model to implement custom complex logic with conditionals and synchronous pipelines (using the output of one function in the next function's arguments), which would not be possible with current JSON-based function calling methods (as far as we know).
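To make these points concrete, here is a minimal sketch of what a single-turn Pythonic function call can look like. The tool names (`check_availability`, `book_meeting`) and their behavior are hypothetical stand-ins, not taken from the model's actual prompt:

```python
# Hypothetical tools -- stand-ins for whatever functions are passed to the
# model in its system prompt; names and signatures are illustrative only.
def check_availability(day: str, start: str, end: str) -> bool:
    """Stub: report whether the requested slot is free."""
    return (day, start) != ("2025-01-15", "10:00")  # pretend one slot is taken

def book_meeting(day: str, start: str, end: str, title: str) -> str:
    """Stub: book the slot and return a confirmation id."""
    return f"BOOK/{day}/{start}/{title}"

# The kind of block the model emits between its code fences: free-form
# control flow, with one tool's output feeding the next call, all
# resolved in a single chat turn.
slot_free = check_availability("2025-01-16", "10:00", "11:00")
confirmation = book_meeting("2025-01-16", "10:00", "11:00", "sync") if slot_free else None
print(confirmation)
```

The host-side executor would then run such a block (presumably with the risky builtins mentioned above stripped out) and feed the result back into the conversation.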

## Quickstart

[...]

We evaluate the model on the following benchmarks:

1. **Berkeley Function Calling Leaderboard (BFCL)**
2. **MMLU-Pro**
3. **Dria-Pythonic-Agent-Benchmark (DPAB):** The benchmark we curated with a synthetic data generation + model-based validation + filtering and manual selection pipeline to evaluate LLMs on their Pythonic function calling ability, spanning multiple scenarios and tasks. More detailed information about the benchmark and the Github repo will be released soon.
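DPAB's methodology has not been released yet, but the "Pythonic, Strict" setting suggests exact matching of the emitted calls against a ground truth. As a purely illustrative sketch of what such a check could involve (our assumption, not the benchmark's actual code), Python's `ast` module suffices to extract and compare calls without executing anything:

```python
import ast

def extract_calls(code: str) -> list:
    """Collect (function_name, keyword_arguments) for every named call in a snippet."""
    calls = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append((node.func.id, kwargs))
    return calls

# Strict check: function names and all arguments must match exactly.
predicted = extract_calls('book_meeting(day="2025-01-16", start="10:00")')
expected = [("book_meeting", {"day": "2025-01-16", "start": "10:00"})]
print(predicted == expected)
```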

Below are the BFCL evaluation results for ***Qwen2.5-Coder-3B-Instruct***, ***Dria-Agent-α-3B***, ***Dria-Agent-α-7B***, and ***gpt-4o-2024-11-20***:

| Metric                              | Qwen/Qwen2.5-3B-Instruct | Dria-Agent-α-3B | Dria-Agent-α-7B | gpt-4o-2024-11-20 (Prompt) |
|-------------------------------------|--------------------------|-----------------|-----------------|----------------------------|
| **Non-Live Simple AST**             | 75.50%                   | 75.08%          | 77.58%          | 79.42%                     |
| **Non-Live Multiple AST**           | 90.00%                   | 93.00%          | 94.00%          | 95.50%                     |
| **Non-Live Parallel AST**           | 80.00%                   | 85.00%          | 93.50%          | 94.00%                     |
| **Non-Live Parallel Multiple AST**  | 78.50%                   | 79.00%          | 89.50%          | 83.50%                     |
| **Non-Live Simple Exec**            | 82.07%                   | 87.57%          | 93.29%          | 100.00%                    |
| **Non-Live Multiple Exec**          | 86.00%                   | 85.14%          | 88.00%          | 94.00%                     |
| **Non-Live Parallel Exec**          | 82.00%                   | 90.00%          | 88.00%          | 86.00%                     |
| **Non-Live Parallel Multiple Exec** | 80.00%                   | 88.00%          | 72.50%          | 77.50%                     |
| **Live Simple AST**                 | 68.22%                   | 70.16%          | 81.40%          | 83.72%                     |
| **Live Multiple AST**               | 66.00%                   | 67.14%          | 78.73%          | 79.77%                     |
| **Live Parallel AST**               | 62.50%                   | 50.00%          | 75.00%          | 87.50%                     |
| **Live Parallel Multiple AST**      | 66.67%                   | 70.83%          | 62.50%          | 70.83%                     |
| **Relevance Detection**             | 88.89%                   | 100.00%         | 100.00%         | 83.33%                     |

and the MMLU-Pro and DPAB results:

| Benchmark Name          | Qwen2.5-Coder-7B-Instruct                                | Dria-Agent-α-7B |
|-------------------------|----------------------------------------------------------|-----------------|
| MMLU-Pro                | 45.6 ([Self Reported](https://arxiv.org/pdf/2409.12186)) | 42.54           |
| DPAB (Pythonic, Strict) | 44.0                                                     | 70.0            |

**\*Note:** The model tends to use Pythonic function calling for many of the test cases in STEM-related fields (math, physics, chemistry, etc.) in the MMLU-Pro benchmark, which isn't captured by the evaluation framework and scripts provided in the benchmark's [Github repository](https://github.com/TIGER-AI-Lab/MMLU-Pro/tree/main). We haven't modified the evaluation script, and leave that for future iterations of this model. However, from qualitative analysis of the model responses, we suspect that the model's score would increase rather than suffer a ~3% decrease.

#### Citation

```
[...]
title={Dria-Agent-a},
author={"andthattoo", "Atakan Tekparmak"}
}
```
|