update about
src/display/about.py (+7 −7)
@@ -17,7 +17,7 @@ class Tasks(Enum):
     task0 = Task("finance_bench", "accuracy", "FinanceBench")
     task1 = Task("legal_confidentiality", "exact_match", "Legal Confidentiality")
     task2 = Task("writing_prompts", "engagingness", "Writing Prompts")
-
+    task3 = Task("customer_support_dialogue", "relevance", "Customer Support Dialogue")
     task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
     task5 = Task("enterprise_pii", "enterprise_pii", "Enterprise PII")
 
@@ -35,19 +35,19 @@ LLM_BENCHMARKS_TEXT = f"""
 ## How it works
 
 ## Tasks
-1.
+1. FinanceBench (Islam, Pranab, et al. "FinanceBench: A New Benchmark for Financial Question Answering."): This task measures the ability to answer financial questions given the retrieved context from a document and a question. We do not evaluate the retrieval capabilities for this task; we only evaluate the accuracy of the answers. The dataset can be
 found at https://huggingface.co/datasets/PatronusAI/financebench.
 
-2.
+2. Legal Confidentiality: We use a subset of 100 labeled prompts from LegalBench (Guha, et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in \
 Large Language Models) to measure the ability of LLMs to reason over legal clauses. The model is prompted to return yes/no as an answer to the question.
 
-3.
+3. Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM. We measure the engagingness of the text generated by the LLM. The dataset is a mix of human-annotated samples from r/WritingPrompts and red-teaming generations.
 
-4.
+4. Customer Support Dialogue: This task evaluates the ability of the LLM to answer a customer support question given some product information and conversational history. We measure the relevance of the generation given the conversational history, the product information, and the customer's question.
 
-5.
+5. Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information from LLMs. We measure whether the model generates toxic content.
 
-6.
+6. Enterprise PII: This task evaluates the business safety of the model by using prompts that elicit business-sensitive information from LLMs. We measure whether the model generates business-sensitive information.
 
 ## What is Patronus AI?
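For context on what the enum above encodes, here is a minimal, self-contained sketch of how the `Tasks` enum reads after this commit. The `Task` dataclass with `benchmark`, `metric`, and `col_name` fields is an assumption based on the standard Hugging Face leaderboard template; it is not shown in this diff.

```python
from dataclasses import dataclass
from enum import Enum


# Assumed shape of Task (from the common leaderboard template); not part of this diff.
@dataclass
class Task:
    benchmark: str  # dataset/benchmark identifier
    metric: str     # metric used to score the task
    col_name: str   # display name for the leaderboard column


class Tasks(Enum):
    task0 = Task("finance_bench", "accuracy", "FinanceBench")
    task1 = Task("legal_confidentiality", "exact_match", "Legal Confidentiality")
    task2 = Task("writing_prompts", "engagingness", "Writing Prompts")
    task3 = Task("customer_support_dialogue", "relevance", "Customer Support Dialogue")
    task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
    task5 = Task("enterprise_pii", "enterprise_pii", "Enterprise PII")


# Iterating the enum yields one leaderboard column per task,
# so adding task3 adds the "Customer Support Dialogue" column.
cols = [t.value.col_name for t in Tasks]
```

Because each task is an enum member wrapping a `Task` record, the new Customer Support Dialogue column appears in any code that iterates `Tasks` without further changes.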