Rrubaa committed on
Commit
3f20bbb
·
verified ·
1 Parent(s): 7af0730

Upload README.md with huggingface_hub

  1. README.md +113 -113
README.md CHANGED
@@ -1,113 +1,113 @@
- ---
- language:
- - multilingual
- license: mit
- library_name: sentence-transformers
- tags:
- - claim2vec
- - embedding-model
- - fact-checking
- - claim-clustering
- - semantic-search
- - misinformation
- - contrastive-learning
- - multilingual-nlp
- ---
-
- # 🧠 Claim2Vec
-
- **Claim2Vec** is a multilingual embedding model designed specifically for **fact-checked claim representation and clustering** in misinformation detection systems.
-
- It learns a semantic embedding space where recurrent and semantically equivalent claims are mapped close together, enabling improved grouping and retrieval of fact-checkable information across languages.
-
- ---
-
- ## 🎯 Motivation
-
- Recurrent claims are a major challenge for automated fact-checking systems, especially in multilingual environments. While existing approaches focus on pairwise claim matching, they often fail to capture global structures of semantically equivalent claims.
-
- Claim2Vec addresses this gap by learning embeddings optimized for **claim clustering**, enabling better organization of repeated misinformation narratives across datasets and languages.
-
- ---
-
- ## 🚀 Key Features
-
- - 🌍 Multilingual claim representation
- - 🔗 Optimized for claim clustering tasks
- - 🧠 Contrastive learning with semantically similar claim pairs
- - 📊 Improved embedding geometry for cluster separation
- - 🔄 Strong cross-lingual knowledge transfer
- - ⚡ Designed for scalable fact-checking pipelines
-
- ---
-
- ## 🧪 Training Objective
-
- Claim2Vec is trained using contrastive learning, encouraging semantically similar claims to have closer embeddings while pushing unrelated claims apart.
-
- ---
-
- ## 📊 Experimental Results
-
- Evaluation across:
- - 3 benchmark datasets
- - 14 embedding baselines
- - 7 clustering algorithms
-
- shows that Claim2Vec consistently improves:
- - Cluster label alignment
- - Embedding space structure
- - Robustness across clustering configurations
-
- ---
-
- ## 🌐 Multilingual Performance
-
- Claim2Vec demonstrates strong performance in multilingual settings, where clusters containing multiple languages benefit significantly from fine-tuning, indicating effective cross-lingual semantic transfer.
-
- ---
-
- ## 💡 Use Cases
-
- - Fact-checking systems
- - Misinformation detection pipelines
- - Claim deduplication
- - Evidence grouping for RAG systems
- - News verification tools
- - Cross-lingual semantic clustering
-
- ---
-
- ## 🧬 Usage
-
- ```python
- from sentence_transformers import SentenceTransformer
-
- model = SentenceTransformer("your-username/claim2vec")
-
- claims = [
-     "COVID vaccines cause infertility",
-     "Studies show no link between COVID vaccines and infertility"
- ]
-
- embeddings = model.encode(claims)
- print(embeddings.shape)
- ```
-
- ## 📄 Citation
-
- If you use Claim2Vec in your work, please cite:
-
- ```bibtex
- @misc{claim2vec2026,
-   title={Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering},
-   author={Panchendrarajan, Rrubaa and Zubiaga, Arkaitz},
-   year={2026},
-   eprint={2604.09812},
-   archivePrefix={arXiv},
-   primaryClass={cs.CL},
-   url={https://arxiv.org/abs/2604.09812}
- }
- ```
-
- 📄 arXiv: https://arxiv.org/abs/2604.09812
 
+ ---
+ language:
+ - multilingual
+ license: mit
+ library_name: sentence-transformers
+ tags:
+ - claim2vec
+ - embedding-model
+ - fact-checking
+ - claim-clustering
+ - semantic-search
+ - misinformation
+ - contrastive-learning
+ - multilingual-nlp
+ ---
+
+ # 🧠 Claim2Vec
+
+ **Claim2Vec** is a multilingual embedding model designed specifically for **fact-checked claim representation and clustering** in misinformation detection systems.
+
+ It learns a semantic embedding space where recurrent and semantically equivalent claims are mapped close together, enabling improved grouping and retrieval of fact-checkable information across languages.
+
+ ---
+
+ ## 🎯 Motivation
+
+ Recurrent claims are a major challenge for automated fact-checking systems, especially in multilingual environments. While existing approaches focus on pairwise claim matching, they often fail to capture global structures of semantically equivalent claims.
+
+ Claim2Vec addresses this gap by learning embeddings optimized for **claim clustering**, enabling better organization of repeated misinformation narratives across datasets and languages.
+
+ ---
+
+ ## 🚀 Key Features
+
+ - 🌍 Multilingual claim representation
+ - 🔗 Optimized for claim clustering tasks
+ - 🧠 Contrastive learning with semantically similar claim pairs
+ - 📊 Improved embedding geometry for cluster separation
+ - 🔄 Strong cross-lingual knowledge transfer
+ - ⚡ Designed for scalable fact-checking pipelines
+
+ ---
+
+ ## 🧪 Training Objective
+
+ Claim2Vec is trained using contrastive learning, encouraging semantically similar claims to have closer embeddings while pushing unrelated claims apart.
+
+ ---
+
+ ## 📊 Experimental Results
+
+ Evaluation across:
+ - 3 benchmark datasets
+ - 14 embedding baselines
+ - 7 clustering algorithms
+
+ shows that Claim2Vec consistently improves:
+ - Cluster label alignment
+ - Embedding space structure
+ - Robustness across clustering configurations
+
+ ---
+
+ ## 🌐 Multilingual Performance
+
+ Claim2Vec demonstrates strong performance in multilingual settings, where clusters containing multiple languages benefit significantly from fine-tuning, indicating effective cross-lingual semantic transfer.
+
+ ---
+
+ ## 💡 Use Cases
+
+ - Fact-checking systems
+ - Misinformation detection pipelines
+ - Claim deduplication
+ - Evidence grouping for RAG systems
+ - News verification tools
+ - Cross-lingual semantic clustering
+
+ ---
+
+ ## 🧬 Usage
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("your-username/claim2vec")
+
+ claims = [
+     "COVID vaccines cause infertility",
+     "Studies show no link between COVID vaccines and infertility"
+ ]
+
+ embeddings = model.encode(claims)
+ print(embeddings.shape)
+ ```
+
+ ## 📄 Citation
+
+ If you use Claim2Vec in your work, please cite:
+
+ ```bibtex
+ @misc{claim2vec2026,
+   title={Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering},
+   author={Panchendrarajan, Rrubaa and Zubiaga, Arkaitz},
+   year={2026},
+   eprint={2604.09812},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2604.09812}
+ }
+ ```
+
+ 📄 arXiv: https://arxiv.org/abs/2604.09812
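
The README's usage snippet stops at printing the embedding shape. As a minimal sketch of the "claim deduplication" use case it lists, the following groups vectors by cosine similarity with a simple greedy pass. It is not the clustering setup evaluated for Claim2Vec: the function names `cosine_similarity` and `greedy_cluster`, the `0.8` threshold, and the three-dimensional toy vectors are all assumptions for illustration, with the toy vectors standing in for rows of `model.encode(claims)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose representative is similar enough.

    Each cluster is a list of indices into `vectors`; the first index in a
    cluster acts as its representative. The threshold is a hypothetical
    value and would need tuning on real embeddings.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([i])
    return clusters


# Toy vectors standing in for model.encode(...) output (assumed, not real embeddings).
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]

clusters = greedy_cluster(toy_embeddings)
print(clusters)
```

With real embeddings, each row returned by `model.encode` could be converted with `.tolist()` and passed in the same way. A greedy pass is order-dependent, so a library clustering algorithm is usually a better choice beyond quick inspection.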