---
language:
- en
tags:
- evidence
- claim
- evidence alignment
---

# Claim-Evidence Alignment: Tuned TinyBERT Classification Model

This repo contains a tuned [huawei-noah/TinyBERT_General_4L_312D](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D) model for classifying sentence pairs: whether a piece of evidence fits a given claim. It was trained on the [copenlu/fever_gold_evidence](https://huggingface.co/datasets/copenlu/fever_gold_evidence) dataset.

The original train and test splits were pooled, deduplicated, and re-split into new train and evaluation sets (see Dataset Processing below).

## Usage

```python
import transformers

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "yevhenkost/claim_evidence_alignment_fever_gold_tuned_tinybert"
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "yevhenkost/claim_evidence_alignment_fever_gold_tuned_tinybert"
)

# each pair is [claim, evidence]
claim_evidence_pairs = [
    ["The water is wet", "The sky is blue"],
    ["The car crashed", "Driver could not see the road"],
]

tokenized_inputs = tokenizer.batch_encode_plus(
    claim_evidence_pairs,
    return_tensors="pt",
    padding=True,
    truncation=True,
)
preds = model(**tokenized_inputs)

# logits: preds.logits
# label 0 - not aligned; label 1 - aligned
```
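
To turn the raw logits into hard labels and alignment probabilities, a softmax followed by an argmax is enough. A minimal sketch, reusing `preds` from the snippet above:

```python
import torch

# probabilities over the two classes (dim -1 runs over the labels)
probs = torch.softmax(preds.logits, dim=-1)

predicted_labels = probs.argmax(dim=-1).tolist()  # 0 = not aligned, 1 = aligned
alignment_scores = probs[:, 1].tolist()           # probability of the "aligned" class
```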

## Dataset Processing

The dataset was processed as follows:

```python
import json
import os

import pandas as pd
from sklearn.model_selection import train_test_split

claims, evidences, labels = [], [], []

# dataset files downloaded from the Hugging Face Hub in JSONL format
datadir = "copenlu_fever_gold_evidence/"
for filename in os.listdir(datadir):
    with open(os.path.join(datadir, filename), "r") as f:
        for line in f.read().split("\n"):
            if line:
                row_dict = json.loads(line)
                for evidence in row_dict["evidence"]:
                    # the evidence sentence is the last element of each evidence record
                    evidences.append(evidence[-1])
                    claims.append(row_dict["claim"])
                    # binary label: 0 for "NOT ENOUGH INFO", 1 otherwise
                    if row_dict["label"] != "NOT ENOUGH INFO":
                        labels.append(1)
                    else:
                        labels.append(0)

df = pd.DataFrame()
df["text_a"] = claims
df["text_b"] = evidences
df["labels"] = labels

df = df.drop_duplicates(subset=["text_a", "text_b"])

train_df, eval_df = train_test_split(df, random_state=2, test_size=0.2)
```
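
The training code itself is not shown in this README. The `text_a`/`text_b`/`labels` column layout matches the sentence-pair input format of the `simpletransformers` library, so a plausible training sketch, under that unconfirmed assumption, would be:

```python
from simpletransformers.classification import ClassificationModel

# Hypothetical sketch: the actual training setup is not given in this README.
# TinyBERT shares the BERT architecture, hence model_type "bert".
model = ClassificationModel(
    "bert",
    "huawei-noah/TinyBERT_General_4L_312D",
    num_labels=2,
)
model.train_model(train_df, eval_df=eval_df)
```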

### Metrics

```
              precision    recall  f1-score   support

           0       0.86      0.60      0.71     15958
           1       0.86      0.96      0.91     42327

    accuracy                           0.86     58285
   macro avg       0.86      0.78      0.81     58285
weighted avg       0.86      0.86      0.85     58285
```
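
A report in this format can be regenerated with scikit-learn; a sketch, assuming `predicted_labels` holds the model's hard predictions for every row of `eval_df` (as produced in the usage section):

```python
from sklearn.metrics import classification_report

# compare gold binary labels against the model's predictions
print(classification_report(eval_df["labels"], predicted_labels))
```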