---
language:
- en
tags:
- evidence
- claim
- evidence alignment
---

# Claim-Evidence Alignment: Fine-Tuned TinyBERT Classification Model

This repo contains a fine-tuned [huawei-noah/TinyBERT_General_4L_312D](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D) model for sentence-pair classification: given a claim and a piece of evidence, it predicts whether the evidence fits the claim. It was trained on the [copenlu/fever_gold_evidence](https://huggingface.co/datasets/copenlu/fever_gold_evidence) dataset.

The original train and test splits were pooled and re-split before training (see Dataset Processing below).

## Usage

```python
import transformers

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "yevhenkost/claim_evidence_alignment_fever_gold_tuned_tinybert"
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "yevhenkost/claim_evidence_alignment_fever_gold_tuned_tinybert"
)

claim_evidence_pairs = [
    ["The water is wet", "The sky is blue"],
    ["The car crashed", "Driver could not see the road"],
]

tokenized_inputs = tokenizer.batch_encode_plus(
    claim_evidence_pairs,
    return_tensors="pt",
    padding=True,
    truncation=True,
)
preds = model(**tokenized_inputs)

# logits: preds.logits
# 0 - not aligned; 1 - aligned
```
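
To turn the returned logits into hard labels, a standard softmax/argmax post-processing step can be applied. This is a minimal sketch on example logits; in practice the values come from `preds.logits` above:

```python
import torch

# Example logits for two claim-evidence pairs (assumed shape: [batch_size, 2];
# real values come from the model call above)
logits = torch.tensor([[1.2, -0.3], [-0.8, 2.1]])

# Softmax gives per-class probabilities; argmax picks the predicted label
probs = torch.softmax(logits, dim=-1)
labels = torch.argmax(logits, dim=-1).tolist()  # 0 = not aligned, 1 = aligned
```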

## Dataset Processing

The dataset was processed in the following way:

```python
import json
import os

import pandas as pd
from sklearn.model_selection import train_test_split

claims, evidences, labels = [], [], []

# loaded with the Hugging Face Hub into JSONL format
datadir = "copenlu_fever_gold_evidence/"
for filename in os.listdir(datadir):
    with open(os.path.join(datadir, filename), "r") as f:
        for line in f.read().split("\n"):
            if line:
                row_dict = json.loads(line)
                for evidence in row_dict["evidence"]:
                    evidences.append(evidence[-1])
                    claims.append(row_dict["claim"])
                    if row_dict["label"] != "NOT ENOUGH INFO":
                        labels.append(1)
                    else:
                        labels.append(0)

df = pd.DataFrame()
df["text_a"] = claims
df["text_b"] = evidences
df["labels"] = labels

df = df.drop_duplicates(subset=["text_a", "text_b"])

train_df, eval_df = train_test_split(df, random_state=2, test_size=0.2)
```
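
The loop above collapses FEVER's three labels into two: anything other than `NOT ENOUGH INFO` counts as aligned. Written as a standalone helper (the function name is hypothetical, not part of the original code):

```python
def to_alignment_label(fever_label: str) -> int:
    """Map a FEVER label to binary alignment: 1 = aligned, 0 = not aligned."""
    return 0 if fever_label == "NOT ENOUGH INFO" else 1
```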

### Metrics

```
              precision    recall  f1-score   support

           0       0.86      0.60      0.71     15958
           1       0.86      0.96      0.91     42327

    accuracy                           0.86     58285
   macro avg       0.86      0.78      0.81     58285
weighted avg       0.86      0.86      0.85     58285
```
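
A report in this layout can be produced with scikit-learn's `classification_report`. A minimal sketch on toy labels (the numbers above come from evaluating on the held-out eval split, not from this example):

```python
from sklearn.metrics import classification_report

# Toy labels for illustration; in practice y_true/y_pred come from eval_df
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0]

print(classification_report(y_true, y_pred, digits=2))
```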