Update README.md
README.md
CHANGED
@@ -16,27 +16,29 @@ The model is doing predictions on the *sentence-level*. It takes as input a docu
 whether the sentence is supported by the document: **MiniCheck-Model(document, claim) -> {0, 1}**


-MiniCheck-Flan-T5-Large is fine-tuned from `google/flan-t5-large` ([Chung et al., 2022](https://arxiv.org/pdf/2210.11416.pdf))
+**MiniCheck-Flan-T5-Large is the best fact-checking model with size < 1B** and reaches GPT-4 performance. It is fine-tuned from `google/flan-t5-large` ([Chung et al., 2022](https://arxiv.org/pdf/2210.11416.pdf))
 on the combination of 35K data:
 - 21K ANLI data ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))
 - 14K synthetic data generated from scratch in a structured way (more details in the paper).


 ### Model Variants
-We also have other
-- [
-- [lytang/MiniCheck-
+We also have three other MiniCheck model variants:
+- [bespokelabs/Bespoke-Minicheck-7B](https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B) (Model Size: 7B)
+- [lytang/MiniCheck-RoBERTa-Large](https://huggingface.co/lytang/MiniCheck-RoBERTa-Large) (Model Size: 0.4B)
+- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large) (Model Size: 0.4B)


 ### Model Performance

 <p align="center">
-<img src="./
+<img src="./performance.png" width="550">
 </p>


+
 The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
-from
+from 11 recent human annotated datasets on fact-checking and grounding LLM generations. MiniCheck-Flan-T5-Large outperforms all
 existing specialized fact-checkers with a similar scale by a large margin (4-10% absolute increase) and is on par with GPT-4, but 400x cheaper. See full results in our work.

 Note: We only evaluated the performance of our models on real claims -- without any human intervention in
@@ -53,13 +55,15 @@ Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and i

 ```python
 from minicheck.minicheck import MiniCheck
+import os
+os.environ["CUDA_VISIBLE_DEVICES"] = "0"

 doc = "A group of students gather in the school library to study for their upcoming final exams."
 claim_1 = "The students are preparing for an examination."
 claim_2 = "The students are on vacation."

-# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
-scorer = MiniCheck(model_name='flan-t5-large',
+# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
+scorer = MiniCheck(model_name='flan-t5-large', cache_dir='./ckpts')
 pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

 print(pred_label) # [1, 0]
@@ -72,14 +76,16 @@ print(raw_prob) # [0.9805923700332642, 0.007121307775378227]
 import pandas as pd
 from datasets import load_dataset
 from minicheck.minicheck import MiniCheck
+import os
+os.environ["CUDA_VISIBLE_DEVICES"] = "0"

-# load
+# load 29K test data
 df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
 docs = df.doc.values
 claims = df.claim.values

-scorer = MiniCheck(model_name='flan-t5-large',
-pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~
+scorer = MiniCheck(model_name='flan-t5-large', cache_dir='./ckpts')
+pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~ 500 docs/min, depending on hardware
 ```

 To evaluate the result on the benchmark