ojus1 committed
Commit b4bb154 · verified · 1 Parent(s): fbccbfc

Update README.md

Files changed (1):
  1. README.md +14 -14
README.md CHANGED
@@ -33,20 +33,6 @@ The code used to generate the dataset can be found [here](https://github.com/pre
 <img src="assets/line_plot.png" alt="Line Plot" width="80%">
 </div>
 
-
-## Inference
-
-- Given a conversation, we extract all tuples `(context_messages, function_calls)` and use them to generate predictions. We ignore the `content` field and only evaluate the `function_calls` generated by an LLM.
-- We use a vLLM deployment with `tool_choice="auto"`.
-
-## Metrics
-
-Given a list of predicted and reference function calls, we report two metrics:
-- **Function Call String Match (SR)**: We perform a greedy match and report the best-matched string ratio using `difflib.SequenceMatcher.ratio`; the reported number is the average string ratio.
-- **Exact Match (EM)**: Same as above, but with exact string matching; the reported number is the EM F1 score.
-
-EM is a strict metric and penalizes string arguments in function calls that may be acceptable, e.g. `"email_content": "This is an example."` vs. `"email_content": "This is an Example."`, which differ by only one letter.
-
 ## Results
 
 ### BFCL v3
@@ -483,6 +469,20 @@ EM is a strict metric and penalizes string arguments in function calls that may
 </table>
 
 
+## Inference
+
+- Given a conversation, we extract all tuples `(context_messages, function_calls)` and use them to generate predictions. We ignore the `content` field and only evaluate the `function_calls` generated by an LLM.
+- We use a vLLM deployment with `tool_choice="auto"`.
+
+## Metrics
+
+Given a list of predicted and reference function calls, we report two metrics:
+- **Function Call String Match (SR)**: We perform a greedy match and report the best-matched string ratio using `difflib.SequenceMatcher.ratio`; the reported number is the average string ratio.
+- **Exact Match (EM)**: Same as above, but with exact string matching; the reported number is the EM F1 score.
+
+EM is a strict metric and penalizes string arguments in function calls that may be acceptable, e.g. `"email_content": "This is an example."` vs. `"email_content": "This is an Example."`, which differ by only one letter.
+
+
 # Quickstart
 
 ```python
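The Inference section moved by this commit mentions querying a vLLM deployment with `tool_choice="auto"`. Below is a minimal sketch of such a call against vLLM's OpenAI-compatible endpoint; the base URL, model name, and `send_email` tool schema are illustrative assumptions, not the dataset's released evaluation code.

```python
# Minimal sketch: function-call inference against a vLLM OpenAI-compatible
# server with tool_choice="auto". The base_url, model name, and tool schema
# are assumptions for illustration, not taken from the dataset's code.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool schema; a real evaluation would pass the tools defined
# in each conversation's context.
tools = [{
    "type": "function",
    "function": {
        "name": "send_email",
        "parameters": {
            "type": "object",
            "properties": {"email_content": {"type": "string"}},
            "required": ["email_content"],
        },
    },
}]

response = client.chat.completions.create(
    model="your-function-calling-model",  # placeholder model name
    messages=[{"role": "user", "content": "Send Bob a one-line example email."}],
    tools=tools,
    tool_choice="auto",  # the model decides whether to emit a tool call
)

# Per the Inference section, only the generated tool calls are evaluated;
# the assistant's content field is ignored.
print(response.choices[0].message.tool_calls)
```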
 
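To make the SR metric concrete, here is one plausible reading of the greedy match described in the Metrics section, using `difflib.SequenceMatcher.ratio`. The pairing logic and the `greedy_string_match` helper are assumptions for illustration, not the dataset's exact scoring code.

```python
# Sketch of the greedy string-match (SR) metric described above, assuming
# function calls are serialized to strings. The pairing strategy is one
# plausible reading of "greedy match", not the dataset's exact scoring code.
from difflib import SequenceMatcher

def greedy_string_match(predicted: list[str], reference: list[str]) -> float:
    """Greedily pair each reference call with its most similar unmatched
    prediction and return the average ratio over the reference calls."""
    remaining = list(predicted)
    ratios = []
    for ref in reference:
        if not remaining:
            ratios.append(0.0)  # no prediction left to match this reference
            continue
        best = max(remaining, key=lambda p: SequenceMatcher(None, ref, p).ratio())
        ratios.append(SequenceMatcher(None, ref, best).ratio())
        remaining.remove(best)  # each prediction is matched at most once
    return sum(ratios) / len(ratios) if ratios else 0.0

pred = ['send_email(email_content="This is an Example.")']
ref = ['send_email(email_content="This is an example.")']
print(greedy_string_match(pred, ref))  # high ratio, but exact match would be 0
```

This also illustrates why EM is described as strict: the single-letter casing difference above scores near 1.0 under SR but 0 under exact match.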