# Model Summary

This is a fact-checking model from our work:

📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf) ([GitHub Repo](https://github.com/Liyan06/MiniCheck))

The model is based on Flan-T5-Large and predicts a binary label: 1 for supported and 0 for unsupported.
The model makes predictions at the *sentence level*: it takes a document and a sentence as input and determines whether the sentence is supported by the document.
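
Because scoring is sentence-level, a multi-sentence LLM generation can be checked by splitting it into sentences and pairing each sentence with the grounding document. Below is a minimal sketch of that decomposition, assuming the `MiniCheck` wrapper from the GitHub repo (introduced in the usage demo below); the regex sentence splitter is a naive stand-in for a proper sentence tokenizer.

```python
# A minimal sketch (not part of the official demo): check a multi-sentence
# generation by scoring each sentence against the grounding document.
# Assumes the MiniCheck wrapper from the GitHub repo is installed; the regex
# splitter is a naive stand-in for a real sentence tokenizer.
import re

from minicheck.minicheck import MiniCheck

doc = "A group of students gather in the school library to study for their upcoming final exams."
generation = "The students are preparing for an examination. They are on vacation."

# naive punctuation-based sentence split (illustrative only)
sentences = [s for s in re.split(r'(?<=[.!?])\s+', generation.strip()) if s]

scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_labels, _, _, _ = scorer.score(docs=[doc] * len(sentences), claims=sentences)

# the generation is fully supported only if every sentence is supported
print(all(label == 1 for label in pred_labels))  # False: the vacation sentence is unsupported
```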

We also have two other MiniCheck model variants, MiniCheck-RoBERTa-Large and MiniCheck-DeBERTa-v3-Large.

### Model Performance

The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
aggregated from 10 recent human-annotated datasets on fact-checking and grounding LLM generations. Our most capable model, MiniCheck-Flan-T5-Large, outperforms all
existing specialized fact-checkers of similar scale by a large margin (a 4-10% absolute increase) and is on par with GPT-4. See the full results in our paper.

Note: We evaluated our models only on real claims, i.e., without any form of human intervention
such as injecting certain error types into model-generated claims. Such edited claims do not reflect
LLMs' actual behavior.
# Model Usage Demo

Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck).

### Below is a simple use case

```python
from minicheck.minicheck import MiniCheck

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

print(pred_label)  # [1, 0]
print(raw_prob)    # [0.9805923700332642, 0.007121307775378227]
```
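
The raw probabilities can also be thresholded directly, e.g., when ranking claims or trading precision for recall. A small sketch follows; the 0.5 cutoff is an illustrative assumption, not necessarily the wrapper's internal decision rule (use `pred_label` for the model's own labels).

```python
# Derive labels from the raw support probabilities with a custom threshold.
# NOTE: the 0.5 cutoff here is an illustrative assumption; pred_label above
# reflects the wrapper's actual decision rule.
threshold = 0.5
custom_labels = [int(p > threshold) for p in raw_prob]
print(custom_labels)  # [1, 0] for the two claims above
```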

### Test on the LLM-AggreFact Benchmark

```python
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck

# load the test split of the LLM-AggreFact benchmark
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)  # ~20 mins, depending on hardware
```

To evaluate the results on the benchmark:
```python
from sklearn.metrics import balanced_accuracy_score

df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])

# balanced accuracy (BAcc) per dataset, then the average across datasets
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)
```
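
For a single summary number over the pooled test set (as opposed to the per-dataset average above), balanced accuracy can also be computed over all examples at once. This is a complementary view, not necessarily the headline metric reported in the paper.

```python
# Balanced accuracy pooled over the entire test split, complementing the
# per-dataset table above.
overall_bacc = balanced_accuracy_score(df.label, df.preds) * 100
print(f"Pooled BAcc: {overall_bacc:.1f}")
```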

# Citation

```
@misc{tang2024minicheck,
    title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents},
    author={Liyan Tang and Philippe Laban and Greg Durrett},
    year={2024},
    eprint={2404.10774},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```