### Model Variants

We also have three other MiniCheck model variants:

- [bespokelabs/Bespoke-MiniCheck-7B](https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B) (Model Size: 7B)
- [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large) (Model Size: 0.8B)
- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large) (Model Size: 0.4B)

### Model Performance

<p align="center">
<img src="./performance_focused.png" width="550">
</p>

The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact), built from 10 recent human-annotated datasets on fact-checking and grounding LLM generations. MiniCheck-RoBERTa-Large outperforms all existing specialized fact-checkers of a similar scale by a large margin. See full results in our work.

Note: We only evaluated the performance of our models on real claims -- without any human intervention in any format, such as injecting certain error types into model-generated claims. Those edited claims do not reflect

```python
|
| 52 |
from minicheck.minicheck import MiniCheck
|
| 53 |
+
import os
|
| 54 |
+
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
|
| 55 |
+
|
| 56 |
doc = "A group of students gather in the school library to study for their upcoming final exams."
|
| 57 |
claim_1 = "The students are preparing for an examination."
|
| 58 |
claim_2 = "The students are on vacation."
|
| 59 |
|
| 60 |
+
# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
|
| 61 |
+
scorer = MiniCheck(model_name='roberta-large', cache_dir='./ckpts')
|
| 62 |
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])
|
| 63 |
print(pred_label) # [1, 0]
|
| 64 |
print(raw_prob) # [0.9581979513168335, 0.031335990875959396]
|
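For intuition, the binary labels above are consistent with thresholding the raw probabilities at 0.5 — a minimal sketch of that relationship, where the 0.5 cutoff is our assumption for illustration, not a documented MiniCheck internal:

```python
# Sketch: reproduce the binary labels by thresholding the raw probabilities.
# The 0.5 cutoff is an assumption for illustration only.
raw_prob = [0.9581979513168335, 0.031335990875959396]
pred_label = [1 if p > 0.5 else 0 for p in raw_prob]
print(pred_label)  # [1, 0]
```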

```python
import pandas as pd
from datasets import load_dataset
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set before importing MiniCheck so only GPU 0 is visible

from minicheck.minicheck import MiniCheck

# load 13K test data
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='roberta-large', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)  # ~800 docs/min, depending on hardware
```
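The official evaluation code lives in the GitHub repo; as a rough sketch of per-dataset scoring, assuming the test split also carries `dataset` and `label` columns (a toy frame stands in for the real predictions here, and balanced accuracy is the assumed metric):

```python
import pandas as pd

# Toy stand-in for the benchmark frame; the real test split is assumed to
# carry 'dataset' and 'label' columns -- see the official eval script for
# the exact protocol. This is only an illustrative sketch.
df = pd.DataFrame({
    'dataset': ['A', 'A', 'B', 'B'],
    'label':   [1, 0, 1, 0],
    'pred':    [1, 0, 1, 1],
})

def balanced_accuracy(g):
    # mean of per-class recall over labels 0 and 1
    recalls = [(g.loc[g.label == c, 'pred'] == c).mean() for c in (0, 1)]
    return float(sum(recalls) / len(recalls))

per_dataset = {name: balanced_accuracy(g) for name, g in df.groupby('dataset')}
print(per_dataset)  # {'A': 1.0, 'B': 0.5}
```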

To evaluate the result on the benchmark