yevhenkost committed 3a26c4f (verified, parent 49927c6): Create README.md (+86 lines)
---
language:
- en
tags:
- evidence
- claim
- evidence alignment
---
# Claim-Evidence Alignment TinyBERT tuned classification model

This repo contains a fine-tuned [huawei-noah/TinyBERT_General_4L_312D](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D) model for sentence-pair classification: given a claim and a piece of evidence, the model predicts whether the evidence fits the claim. It was trained on the [copenlu/fever_gold_evidence](https://huggingface.co/datasets/copenlu/fever_gold_evidence) dataset.

Note that the original train and test splits were merged before training and a new train/eval split was made (see Dataset Processing below), so the original test split should not be used for unbiased evaluation of this model.
## Usage
```python
import transformers

model = transformers.AutoModelForSequenceClassification.from_pretrained("yevhenkost/claim_evidence_alignment_fever_gold_tuned_tinybert")
tokenizer = transformers.AutoTokenizer.from_pretrained("yevhenkost/claim_evidence_alignment_fever_gold_tuned_tinybert")

claim_evidence_pairs = [
    ["The water is wet", "The sky is blue"],
    ["The car crashed", "Driver could not see the road"]
]

# tokenize all claim/evidence pairs as one padded batch
tokenized_inputs = tokenizer.batch_encode_plus(
    claim_evidence_pairs,
    return_tensors="pt",
    padding=True,
    truncation=True
)
preds = model(**tokenized_inputs)

# logits: preds.logits
# label 0 - not aligned; label 1 - aligned
```
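
To turn the raw logits into per-pair labels, apply a softmax and argmax. A minimal sketch with hypothetical logit values (the tensor below is illustrative, not actual model output):

```python
import torch

# hypothetical logits of the shape the model returns for two pairs
logits = torch.tensor([[2.1, -1.3], [-0.4, 1.7]])

probs = torch.softmax(logits, dim=-1)          # per-pair class probabilities
labels = torch.argmax(probs, dim=-1).tolist()  # 0 - not aligned, 1 - aligned
print(labels)  # → [0, 1]
```

The argmax can equally be taken over the logits directly; softmax is only needed if you want calibrated-looking scores.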

## Dataset Processing
The dataset was processed in the following way:
```python
import os
import json

import pandas as pd
from sklearn.model_selection import train_test_split

claims, evidences, labels = [], [], []

# LOADED WITH THE HUGGINGFACE HUB INTO JSONL FORMAT
datadir = "copenlu_fever_gold_evidence/"
for filename in os.listdir(datadir):
    with open(os.path.join(datadir, filename), "r") as f:
        for line in f.read().split("\n"):
            if line:
                row_dict = json.loads(line)
                for evidence in row_dict["evidence"]:
                    evidences.append(evidence[-1])
                    claims.append(row_dict["claim"])
                    if row_dict["label"] != "NOT ENOUGH INFO":
                        labels.append(1)
                    else:
                        labels.append(0)

df = pd.DataFrame()
df["text_a"] = claims
df["text_b"] = evidences
df["labels"] = labels

df = df.drop_duplicates(subset=["text_a", "text_b"])

train_df, eval_df = train_test_split(df, random_state=2, test_size=0.2)
```
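
The parsing loop above can be checked on a single synthetic JSONL row. The field layout below is inferred from the code (each evidence entry is a list whose last element is the evidence text); the concrete values are made up for illustration:

```python
import json

# synthetic row mimicking the JSONL layout the loop above expects
line = json.dumps({
    "claim": "The car crashed",
    "label": "SUPPORTS",
    "evidence": [["page", 0, "Driver could not see the road"]],
})

row_dict = json.loads(line)
pairs = []
for evidence in row_dict["evidence"]:
    label = 1 if row_dict["label"] != "NOT ENOUGH INFO" else 0
    pairs.append((row_dict["claim"], evidence[-1], label))

print(pairs)  # → [('The car crashed', 'Driver could not see the road', 1)]
```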

### Metrics
Classification report on the held-out eval split:
```
              precision    recall  f1-score   support

           0       0.86      0.60      0.71     15958
           1       0.86      0.96      0.91     42327

    accuracy                           0.86     58285
   macro avg       0.86      0.78      0.81     58285
weighted avg       0.86      0.86      0.85     58285
```