TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
Abstract
TabEmbed is a generalist embedding model that unifies tabular classification and retrieval tasks within a shared embedding space, trained with large-scale contrastive learning and positive-aware hard negative mining.
Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
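The abstract's training recipe — contrastive learning with positive-aware hard negative mining — can be sketched in a generic form. This is a minimal illustration, not the paper's exact method: the `ceiling` threshold for filtering likely false negatives and the helper names (`mine_hard_negatives`, `info_nce`) are assumptions for the sake of the sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mine_hard_negatives(query_vec, pos_vec, cand_vecs, ceiling=0.95, top_k=4):
    """Positive-aware mining (illustrative heuristic): drop candidates
    whose similarity to the query approaches that of the positive — they
    are likely false negatives — then keep the hardest survivors."""
    pos_sim = cosine(query_vec, pos_vec)
    scored = [(cosine(query_vec, c), c) for c in cand_vecs]
    kept = [(s, c) for s, c in scored if s < ceiling * pos_sim]
    kept.sort(key=lambda t: -t[0])  # most similar (hardest) first
    return [c for _, c in kept[:top_k]]

def info_nce(query_vec, pos_vec, neg_vecs, temperature=0.05):
    """Standard InfoNCE loss for one (query, positive, negatives) group:
    negative log-softmax of the positive's scaled similarity."""
    logits = [cosine(query_vec, pos_vec) / temperature]
    logits += [cosine(query_vec, n) / temperature for n in neg_vecs]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

The mining step matters because naive nearest-neighbor negatives often include unlabeled positives; filtering by similarity relative to the known positive is one common way to make the mining "positive-aware."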
Community
TabEmbed: a generalist embedding model that unifies tabular classification and retrieval in a single shared space.
The cross-modal setup in TabEmbed, where natural language queries serve as anchors that bind a serialized tabular row to its constraints, is a clever departure from the usual row-to-row contrast. That design seems to be the lever that preserves numeric semantics and column-level meaning in the embedding space, something text-only or vanilla tabular encoders struggle with. One tight question: how well does this hold up under schema shifts, such as adding new columns or reinterpreting a column across domains, where the natural language constraint might drift? By the way, the arxivlens breakdown at https://arxivlens.com/PaperView/Details/tabembed-benchmarking-and-learning-generalist-embeddings-for-tabular-understanding-5676-ee07c573 helped me parse the method details, especially the triplet construction and training recipe. An ablation that removes the query anchor or the hard negative mining would really nail down which piece is driving the gains.
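The query-anchored triplet construction the comment describes can be sketched as follows. The serialization template and the helper names (`serialize_row`, `query_anchored_triplet`) are hypothetical — the paper's actual serialization and sampling scheme may differ.

```python
def serialize_row(row: dict) -> str:
    """Flatten one table row into 'column: value' text so a text
    encoder can consume it; the exact template is an assumption."""
    return " | ".join(f"{col}: {val}" for col, val in row.items())

def query_anchored_triplet(query: str, rows: list, satisfies):
    """Build (anchor, positive, negatives): the natural language query
    is the anchor, a row meeting its constraints is the positive, and
    rows violating them become candidate hard negatives."""
    pos = next(r for r in rows if satisfies(r))
    negs = [serialize_row(r) for r in rows if not satisfies(r)]
    return query, serialize_row(pos), negs

# Illustrative usage: rows that differ only in one cell yield the kind of
# numerically close hard negative that row-to-row contrast tends to miss.
rows = [
    {"age": 42, "survived": 1},
    {"age": 42, "survived": 0},  # same age, label flipped: a hard negative
    {"age": 19, "survived": 1},  # fails the numeric constraint
]
anchor, pos, negs = query_anchored_triplet(
    "passenger older than 30 who survived",
    rows,
    lambda r: r["age"] > 30 and r["survived"] == 1,
)
```

Anchoring on the query rather than another row is what ties the numeric constraint ("older than 30") directly to the contrastive signal.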