# Model Summary

This is a fact-checking model from our work:

📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf) ([GitHub Repo](https://github.com/Liyan06/MiniCheck))

The model is based on Flan-T5-Large and predicts a binary label: 1 for supported and 0 for unsupported.
The model makes predictions at the *sentence level*: it takes a document and a sentence as input and determines whether the sentence is supported by the document.
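
Because scoring is sentence-level, a multi-sentence LLM generation can be checked by splitting it into sentences and pairing each sentence with the grounding document. Below is a minimal sketch of that decomposition, assuming the `MiniCheck` wrapper from the GitHub repo (introduced in the usage demo below); the regex sentence splitter is a naive stand-in for a proper sentence tokenizer.

```python
# A minimal sketch (not part of the official demo): check a multi-sentence
# generation by scoring each sentence against the grounding document.
# Assumes the MiniCheck wrapper from the GitHub repo is installed; the regex
# splitter is a naive stand-in for a real sentence tokenizer.
import re

from minicheck.minicheck import MiniCheck

doc = "A group of students gather in the school library to study for their upcoming final exams."
generation = "The students are preparing for an examination. They are on vacation."

# naive punctuation-based sentence split (illustrative only)
sentences = [s for s in re.split(r'(?<=[.!?])\s+', generation.strip()) if s]

scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_labels, _, _, _ = scorer.score(docs=[doc] * len(sentences), claims=sentences)

# the generation is fully supported only if every sentence is supported
print(all(label == 1 for label in pred_labels))  # False: the vacation sentence is unsupported
```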

We also have two other MiniCheck model variants, MiniCheck-RoBERTa-Large and MiniCheck-DeBERTa-v3-Large.

### Model Performance

The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
aggregated from 10 recent human-annotated datasets on fact-checking and grounding LLM generations. Our most capable model, MiniCheck-Flan-T5-Large, outperforms all
existing specialized fact-checkers of similar scale by a large margin (a 4-10% absolute increase) and is on par with GPT-4. See the full results in our paper.

Note: We evaluated our models only on real claims, i.e., without any form of human intervention
such as injecting certain error types into model-generated claims. Such edited claims do not reflect
LLMs' actual behavior.
# Model Usage Demo

Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck).

### Below is a simple use case

```python
from minicheck.minicheck import MiniCheck

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

print(pred_label)  # [1, 0]
print(raw_prob)    # [0.9805923700332642, 0.007121307775378227]
```
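
The raw probabilities can also be thresholded directly, e.g., when ranking claims or trading precision for recall. A small sketch follows; the 0.5 cutoff is an illustrative assumption, not necessarily the wrapper's internal decision rule (use `pred_label` for the model's own labels).

```python
# Derive labels from the raw support probabilities with a custom threshold.
# NOTE: the 0.5 cutoff here is an illustrative assumption; pred_label above
# reflects the wrapper's actual decision rule.
threshold = 0.5
custom_labels = [int(p > threshold) for p in raw_prob]
print(custom_labels)  # [1, 0] for the two claims above
```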

### Test on the LLM-AggreFact Benchmark

```python
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck

# load the test split of the LLM-AggreFact benchmark
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)  # ~20 mins, depending on hardware
```

To evaluate the results on the benchmark:
```python
from sklearn.metrics import balanced_accuracy_score

df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])

# balanced accuracy (BAcc) per dataset, then the average across datasets
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)
```
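
For a single summary number over the pooled test set (as opposed to the per-dataset average above), balanced accuracy can also be computed over all examples at once. This is a complementary view, not necessarily the headline metric reported in the paper.

```python
# Balanced accuracy pooled over the entire test split, complementing the
# per-dataset table above.
overall_bacc = balanced_accuracy_score(df.label, df.preds) * 100
print(f"Pooled BAcc: {overall_bacc:.1f}")
```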

# Citation

```
@misc{tang2024minicheck,
    title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents},
    author={Liyan Tang and Philippe Laban and Greg Durrett},
    year={2024},
    eprint={2404.10774},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```