---
language:
- multilingual
license: mit
library_name: sentence-transformers
tags:
- claim2vec
- embedding-model
- fact-checking
- claim-clustering
- semantic-search
- misinformation
- contrastive-learning
- multilingual-nlp
---

# 🧠 Claim2Vec

**Claim2Vec** is a multilingual embedding model designed specifically for **fact-checked claim representation and clustering** in misinformation detection systems.

It learns a semantic embedding space where recurrent and semantically equivalent claims are mapped close together, enabling improved grouping and retrieval of fact-checkable information across languages.

---

## 🎯 Motivation

Recurrent claims are a major challenge for automated fact-checking systems, especially in multilingual environments. Existing approaches focus on pairwise claim matching and often fail to capture the global structure formed by groups of semantically equivalent claims.

Claim2Vec addresses this gap by learning embeddings optimized for **claim clustering**, enabling better organization of repeated misinformation narratives across datasets and languages.

---

## Key Features

- Multilingual claim representation
- Optimized for claim clustering tasks
- Contrastive learning with semantically similar claim pairs
- Improved embedding geometry for cluster separation
- Strong cross-lingual knowledge transfer
- Designed for scalable fact-checking pipelines

---

## 🧪 Training Objective

Claim2Vec is trained using contrastive learning, encouraging semantically similar claims to have closer embeddings while pushing unrelated claims apart.
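
The exact contrastive loss used for Claim2Vec is not stated here. As a minimal sketch, assuming an in-batch-negatives InfoNCE objective (a common choice for sentence embeddings), the idea can be illustrated with toy vectors: each anchor should score higher against its own positive than against the positives of other pairs. The function names, the temperature of 0.05, and the 2-D vectors below are illustrative assumptions, not the authors' training code. The loss is near zero when each positive sits next to its anchor and large when the pairs are swapped.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch-negatives contrastive loss (InfoNCE), assumed for illustration.

    anchors[i] and positives[i] embed a pair of similar claims; every
    positives[j] with j != i acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching positive as the target.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy 2-D embeddings: hypothetical values, not model outputs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive lies near its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive lies near the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
print(loss_aligned < loss_swapped)  # True
```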

---

## Experimental Results

Evaluation across:
- 3 benchmark datasets
- 14 embedding baselines
- 7 clustering algorithms

shows that Claim2Vec consistently improves:
- Cluster label alignment
- Embedding space structure
- Robustness across clustering configurations
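
How cluster label alignment is measured is not specified here. One simple way to quantify it, assumed purely for illustration, is the Rand index: the fraction of claim pairs on which the predicted clustering and the gold labels agree about "same cluster" versus "different cluster". The label lists below are made up, and `rand_index` is a hypothetical helper rather than the evaluation code behind the reported results.

```python
from itertools import combinations


def rand_index(true_labels, predicted_labels):
    """Fraction of claim pairs on which two labelings agree.

    A pair agrees when both labelings put the two claims in the same
    cluster, or both put them in different clusters.
    """
    pairs = list(combinations(range(len(true_labels)), 2))
    agreements = sum(
        (true_labels[i] == true_labels[j])
        == (predicted_labels[i] == predicted_labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


gold = [0, 0, 1, 1, 2]
perfect = [5, 5, 7, 7, 9]  # same grouping, different cluster ids
noisy = [0, 1, 1, 1, 2]    # second claim assigned to the wrong cluster

score_perfect = rand_index(gold, perfect)
score_noisy = rand_index(gold, noisy)
print(score_perfect)  # 1.0
print(score_noisy)    # 0.7
```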

---

## 💡 Use Cases

- Fact-checking systems
- Misinformation detection pipelines
- Claim deduplication
- Cross-lingual semantic clustering

---

## 🧬 Usage

```python
from sentence_transformers import SentenceTransformer

# Load the Claim2Vec checkpoint from the Hugging Face Hub.
model = SentenceTransformer("Rrubaa/claim2vec")

claims = [
    "COVID vaccines cause infertility",
    "Studies show no link between COVID vaccines and infertility",
]

# encode() returns a NumPy array with one row per claim.
embeddings = model.encode(claims)
print(embeddings.shape)  # (2, embedding_dim)
```
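
Since the model targets claim clustering, a natural next step is to group the resulting embeddings. The sketch below is a deliberately simple single-pass clustering by cosine similarity, shown on toy vectors so that it runs without downloading the model. The `greedy_cluster` helper and the 0.8 threshold are hypothetical, and this is not necessarily one of the clustering algorithms used in the evaluation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_cluster(embeddings, threshold=0.8):
    """Assign each embedding to the first cluster whose seed is similar enough.

    Returns one integer cluster id per embedding. The seed of a cluster is
    the first embedding assigned to it.
    """
    seeds = []
    labels = []
    for vector in embeddings:
        for cluster_id, seed in enumerate(seeds):
            if cosine(vector, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing cluster is close enough: start a new one.
            seeds.append(vector)
            labels.append(len(seeds) - 1)
    return labels


# Toy vectors standing in for model.encode(claims); the values are made up.
toy_embeddings = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0], [0.1, 0.99]]
labels = greedy_cluster(toy_embeddings)
print(labels)  # [0, 0, 1, 1]
```

With real data, the array returned by `model.encode(claims)` can be passed in place of `toy_embeddings`; the threshold would need tuning for the dataset at hand.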

## Citation

If you use Claim2Vec in your work, please cite:

```bibtex
@misc{claim2vec2026,
  title={Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering},
  author={Panchendrarajan, Rrubaa and Zubiaga, Arkaitz},
  year={2026},
  eprint={2604.09812},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.09812}
}
```

arXiv: https://arxiv.org/abs/2604.09812