Update README.md

README.md CHANGED

@@ -33,20 +33,6 @@ The code used to generate the dataset can be found [here](https://github.com/pre
 <img src="assets/line_plot.png" alt="Line Plot" width="80%">
 </div>
 
-
-## Inference
-
-- Given a conversation, we extract all `(context_messages, function_calls)` tuples and use them to generate predictions. We ignore the `content` field and evaluate only the `function_calls` generated by the LLM.
-- We use a vLLM deployment with `tool_choice="auto"`.
-
-## Metrics
-
-Given a list of predicted and reference function calls, we report two metrics:
-- **Function Call String Match (SR)**: We perform a greedy match and report the best-matched string ratio using `difflib.SequenceMatcher.ratio`. The reported number is the average string ratio.
-- **Exact Match (EM)**: Same as above, but with an exact string match instead. The reported number is the EM F1 score.
-
-EM is a strict metric: it penalizes string arguments in function calls that may be "okay", e.g. `"email_content": "This is an example."` vs. `"email_content": "This is an Example."`, which differ by only one letter.
-
 ## Results
 
 ### BFCL v3

@@ -483,6 +469,20 @@ EM is a strict metric, and penalizes string arguments in function calls that may
 </table>
 
 
+## Inference
+
+- Given a conversation, we extract all `(context_messages, function_calls)` tuples and use them to generate predictions. We ignore the `content` field and evaluate only the `function_calls` generated by the LLM.
+- We use a vLLM deployment with `tool_choice="auto"` (see the sketch below).
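
For illustration, a minimal sketch of this inference step against an OpenAI-compatible vLLM endpoint. The base URL, API key, model name, and the `context_messages`/`tools` variables are assumptions for the sketch, not values taken from this repository:

```python
# Sketch: one prediction step under the setup described above.
# Assumes vLLM is serving an OpenAI-compatible API at localhost:8000 and that
# `context_messages` (chat history) and `tools` (function schemas) were
# extracted from a conversation. All names here are hypothetical.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def predict_function_calls(context_messages, tools, model="placeholder-model"):
    """Return the model's predicted tool calls for one context."""
    response = client.chat.completions.create(
        model=model,
        messages=context_messages,
        tools=tools,
        tool_choice="auto",  # the model decides whether and what to call
    )
    message = response.choices[0].message
    # The `content` field is ignored; only the tool calls are evaluated.
    return [
        (call.function.name, call.function.arguments)
        for call in (message.tool_calls or [])
    ]
```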
+
+## Metrics
+
+Given a list of predicted and reference function calls, we report two metrics, sketched below:
+- **Function Call String Match (SR)**: We perform a greedy match and report the best-matched string ratio using `difflib.SequenceMatcher.ratio`. The reported number is the average string ratio.
+- **Exact Match (EM)**: Same as above, but with an exact string match instead. The reported number is the EM F1 score.
+
+EM is a strict metric: it penalizes string arguments in function calls that may be "okay", e.g. `"email_content": "This is an example."` vs. `"email_content": "This is an Example."`, which differ by only one letter.
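
Below is a rough sketch of both metrics. The greedy pairing and the choice to average over `max(len(predicted), len(reference))` are assumptions; the repository's exact normalization may differ:

```python
# Sketch of SR (average best-match string ratio) and EM F1 over serialized
# function-call strings. Pairing is greedy: best-scoring pairs are taken first.
from difflib import SequenceMatcher

def call_match_metrics(predicted: list[str], reference: list[str]) -> dict:
    # Score every predicted/reference pair, then pair greedily from the top.
    scored = sorted(
        ((SequenceMatcher(None, p, r).ratio(), i, j)
         for i, p in enumerate(predicted)
         for j, r in enumerate(reference)),
        reverse=True,
    )
    used_pred, used_ref, ratios, exact = set(), set(), [], 0
    for ratio, i, j in scored:
        if i in used_pred or j in used_ref:
            continue
        used_pred.add(i)
        used_ref.add(j)
        ratios.append(ratio)
        exact += int(predicted[i] == reference[j])
    # SR: unmatched calls contribute 0 to the average (assumption).
    denom = max(len(predicted), len(reference), 1)
    sr = sum(ratios) / denom
    # EM F1 from exact-match counts.
    precision = exact / len(predicted) if predicted else 0.0
    recall = exact / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"SR": sr, "EM_F1": f1}
```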
+
+
 # Quickstart
 
 ```python