arxiv:2601.13251

Beyond Cosine Similarity: Taming Semantic Drift and Antonym Intrusion in a 15-Million Node Turkish Synonym Graph

Published on Jan 19

· Submitted by

Mahmud ElHuseyni 🇵🇸 on Jan 21

Upvote

Authors:

Ebubekir Tosun ,

Mahmoud ElHussieni

Abstract

A large-scale semantic clustering system addresses the limitation of neural embeddings in distinguishing synonyms from antonyms through a specialized three-way discriminator and novel clustering algorithm.

AI-generated summary

Neural embeddings have a notorious blind spot: they can't reliably tell synonyms apart from antonyms. Consequently, increasing similarity thresholds often fails to prevent opposites from being grouped together. We've built a large-scale semantic clustering system specifically designed to tackle this problem head on. Our pipeline chews through 15 million lexical items, evaluates a massive 520 million potential relationships, and ultimately generates 2.9 million high-precision semantic clusters. The system makes three primary contributions. First, we introduce a labeled dataset of 843,000 concept pairs spanning synonymy, antonymy, and co-hyponymy, constructed via Gemini 2.5-Flash LLM augmentation and verified using human-curated dictionary resources. Second, we propose a specialized three-way semantic relation discriminator that achieves 90% macro-F1, enabling robust disambiguation beyond raw embedding similarity. Third, we introduce a novel soft-to-hard clustering algorithm that mitigates semantic drift preventing erroneous transitive chains (e.g., hot -> spicy -> pain -> depression) while simultaneously resolving polysemy. Our approach employs a topology-aware two-stage expansion-pruning procedure with topological voting, ensuring that each term is assigned to exactly one semantically coherent cluster. The resulting resource enables high-precision semantic search and retrieval-augmented generation, particularly for morphologically rich and low-resource languages where existing synonym databases remain sparse.

View arXiv page View PDF Add to collection

Community

MElHuseyni

Paper author Paper submitter about 13 hours ago

This paper addresses the inability of neural embeddings to distinguish synonyms from antonyms. The authors introduce a soft-to-hard clustering algorithm that prevents semantic drift and a 3-way relation discriminator (90% F1). Validated against a new dataset of 843k pairs generated via Gemini 2.5, the pipeline yields 2.9M semantic clusters, significantly enhancing precision for RAG and semantic search.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.13251 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.13251 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.13251 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.