---
language:
- multilingual
license: mit
library_name: sentence-transformers
tags:
- claim2vec
- embedding-model
- fact-checking
- claim-clustering
- semantic-search
- misinformation
- contrastive-learning
- multilingual-nlp
---

# 🧠 Claim2Vec

**Claim2Vec** is a multilingual embedding model designed specifically for **fact-checked claim representation and clustering** in misinformation detection systems.

It learns a semantic embedding space in which recurring, semantically equivalent claims are mapped close together, enabling better grouping and retrieval of fact-checkable information across languages.

---

## 🎯 Motivation

Recurring claims are a major challenge for automated fact-checking systems, especially in multilingual environments. Existing approaches focus on pairwise claim matching and often fail to capture the global structure of semantically equivalent claim groups.

Claim2Vec addresses this gap by learning embeddings optimized for **claim clustering**, enabling better organization of repeated misinformation narratives across datasets and languages.

---

## 🚀 Key Features

- 🌍 Multilingual claim representation  
- 🔗 Optimized for claim clustering tasks  
- 🧠 Contrastive learning with semantically similar claim pairs  
- 📊 Improved embedding geometry for cluster separation  
- 🔄 Strong cross-lingual knowledge transfer  
- ⚡ Designed for scalable fact-checking pipelines  

---

## 🧪 Training Objective

Claim2Vec is trained with a contrastive objective that pulls the embeddings of semantically similar claims together while pushing the embeddings of unrelated claims apart.
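
This model card does not specify the exact loss, so as a minimal, model-free illustration, here is an InfoNCE-style in-batch contrastive loss in plain NumPy: each claim's paired equivalent claim is its positive, and the other positives in the batch act as negatives. The function name, the temperature of 0.05, and the toy vectors are illustrative assumptions, not the released training code.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """InfoNCE-style in-batch contrastive loss.

    anchors[i] and positives[i] are embeddings of semantically
    equivalent claims; every other positive in the batch serves
    as an in-batch negative for anchor i.
    """
    # L2-normalise so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    # Softmax cross-entropy with the diagonal as the correct match
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
positives = anchors + 0.1 * rng.normal(size=(8, 16))   # near-identical pairs

loss_matched = info_nce(anchors, positives)        # low: pairs line up
loss_shuffled = info_nce(anchors, positives[::-1]) # high: pairs misaligned
```

Shuffling the positives breaks the anchor-positive alignment, so the loss rises sharply; that gap is exactly what the contrastive objective minimises during training.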

---

## 📊 Experimental Results

Evaluation across:
- 3 benchmark datasets  
- 14 embedding baselines  
- 7 clustering algorithms  

shows that Claim2Vec consistently improves:
- Cluster label alignment  
- Embedding space structure  
- Robustness across clustering configurations  

---

## 💡 Use Cases

- Fact-checking systems  
- Misinformation detection pipelines  
- Claim deduplication  
- Cross-lingual semantic clustering  

---

## 🧬 Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Rrubaa/claim2vec")

claims = [
    "COVID vaccines cause infertility",
    "Studies show no link between COVID vaccines and infertility"
]

embeddings = model.encode(claims)
print(embeddings.shape)
```
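
Once claims are embedded, near-duplicate claims can be grouped directly by cosine similarity. The sketch below is a model-free illustration of such deduplication: the toy 2-D vectors stand in for `model.encode(...)` output, and the 0.9 threshold is an arbitrary assumption you would tune per dataset.

```python
import numpy as np

def dedup_claims(embeddings, threshold=0.9):
    """Greedy deduplication: each claim joins the first-seen cluster
    whose representative exceeds the cosine-similarity threshold,
    otherwise it starts a new cluster."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    reps, labels = [], []          # cluster representatives and assignments
    for vec in normed:
        sims = [float(vec @ r) for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            labels.append(len(reps))
            reps.append(vec)
    return labels

# Toy vectors standing in for model.encode(claims) output:
toy = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(dedup_claims(toy))  # → [0, 0, 1]
```

The first two vectors collapse into one cluster while the third stays separate; with real Claim2Vec embeddings the same routine would merge paraphrased versions of a recurring claim.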

## 📄 Citation

If you use Claim2Vec in your work, please cite:

```bibtex
@misc{claim2vec2026,
  title={Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering},
  author={Panchendrarajan, Rrubaa and Zubiaga, Arkaitz},
  year={2026},
  eprint={2604.09812},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.09812}
}
```

📄 arXiv: https://arxiv.org/abs/2604.09812