lytang committed on
Commit d253869 · verified · 1 Parent(s): cc10590

Update README.md

Files changed (1): README.md (+69 -4)

README.md CHANGED
# Model Summary

This is a fact-checking model from our work:

📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf) ([GitHub Repo](https://github.com/Liyan06/MiniCheck))

The model is based on Flan-T5-Large and predicts a binary label: 1 for supported and 0 for unsupported.
It makes predictions at the *sentence level*: given a document and a sentence, it determines whether the sentence is supported by the document.
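Conceptually, each prediction is one call over a (document, sentence) pair. The sketch below is illustrative only -- the function name and signature are ours, and the actual entry point is `MiniCheck.score`, shown in the usage demo:

```python
# Illustrative contract only -- not the library's API.
# In practice, use MiniCheck.score from the GitHub repo (see the demo below).
def minicheck_predict(document: str, sentence: str) -> int:
    """Return 1 if `sentence` is supported by `document`, else 0."""
    raise NotImplementedError  # stands in for the Flan-T5-Large classifier
```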
 
We also have two other MiniCheck model variants, based on RoBERTa-Large and DeBERTa-v3-Large (see the `model_name` options in the demo below).

### Model Performance

The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
which aggregates 10 recent human-annotated datasets on fact-checking and grounding LLM generations. Our most capable model, MiniCheck-Flan-T5-Large, outperforms all
existing specialized fact-checkers of a similar scale by a large margin (a 4-10% absolute increase) and is on par with GPT-4. See the full results in our work.

Note: We only evaluated our models on real claims -- without any human intervention in
any form, such as injecting certain error types into model-generated claims. Such edited claims do not reflect
LLMs' actual behaviors.

  # Model Usage Demo

Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck).

### A simple use case

```python
from minicheck.minicheck import MiniCheck

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

print(pred_label)  # [1, 0]
print(raw_prob)    # [0.9805923700332642, 0.007121307775378227]
```
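
Because MiniCheck scores one sentence at a time, a multi-sentence LLM response is typically checked by splitting it into sentences and scoring each against the same document. Below is a minimal sketch of such a wrapper; the NLTK sentence splitter and the every-sentence-supported aggregation are our illustrative choices, not part of MiniCheck:

```python
# Hedged sketch: sentence-split a response, then score each sentence
# against the same grounding document with MiniCheck.
from nltk.tokenize import sent_tokenize  # may require nltk.download('punkt') once
from minicheck.minicheck import MiniCheck

document = "A group of students gather in the school library to study for their upcoming final exams."
response = "The students are preparing for an examination. They are on vacation."

sentences = sent_tokenize(response)
scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[document] * len(sentences), claims=sentences)

# One possible aggregation: the response is supported only if every sentence is.
fully_supported = all(label == 1 for label in pred_label)
print(fully_supported)  # False -- the vacation sentence is unsupported
```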

### Test on the LLM-AggreFact Benchmark

```python
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck

df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)  # ~20 mins, depending on hardware
```

To evaluate the results on the benchmark:
```python
from sklearn.metrics import balanced_accuracy_score

df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)
```
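
The `Average` row above is a macro average over the 10 datasets. Continuing from the snippets above, if you also want a single pooled score over all examples (an illustrative addition, not part of the original evaluation recipe), balanced accuracy can be computed on the full frame:

```python
# Pooled balanced accuracy over every example in the benchmark,
# complementing the per-dataset macro average above.
overall_bacc = balanced_accuracy_score(df.label, df.preds) * 100
print(f"Overall BAcc: {overall_bacc:.1f}")
```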

# Citation

```
@misc{tang2024minicheck,
      title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents},
      author={Liyan Tang and Philippe Laban and Greg Durrett},
      year={2024},
      eprint={2404.10774},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```