TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
Abstract
TabEmbed is a generalist embedding model that unifies tabular classification and retrieval tasks within a shared embedding space, trained with large-scale contrastive learning and positive-aware hard negative mining.
Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
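The abstract's training recipe — contrastive learning with positive-aware hard negative mining — can be sketched in a generic form. This is a minimal illustration, not the paper's exact method: the `ceiling` threshold for filtering likely false negatives and the helper names (`mine_hard_negatives`, `info_nce`) are assumptions for the sake of the sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mine_hard_negatives(query_vec, pos_vec, cand_vecs, ceiling=0.95, top_k=4):
    """Positive-aware mining (illustrative heuristic): drop candidates
    whose similarity to the query approaches that of the positive — they
    are likely false negatives — then keep the hardest survivors."""
    pos_sim = cosine(query_vec, pos_vec)
    scored = [(cosine(query_vec, c), c) for c in cand_vecs]
    kept = [(s, c) for s, c in scored if s < ceiling * pos_sim]
    kept.sort(key=lambda t: -t[0])  # most similar (hardest) first
    return [c for _, c in kept[:top_k]]

def info_nce(query_vec, pos_vec, neg_vecs, temperature=0.05):
    """Standard InfoNCE loss for one (query, positive, negatives) group:
    negative log-softmax of the positive's scaled similarity."""
    logits = [cosine(query_vec, pos_vec) / temperature]
    logits += [cosine(query_vec, n) / temperature for n in neg_vecs]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

The mining step matters because naive nearest-neighbor negatives often include unlabeled positives; filtering by similarity relative to the known positive is one common way to make the mining "positive-aware."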
Community
TabEmbed: a generalist embedding model that unifies tabular classification and retrieval in a single shared space.
The cross-modal setup in TabEmbed, where natural language queries serve as anchors that bind a serialized tabular row to its constraints, is a clever departure from the usual row-to-row contrast. That design seems to be the lever that preserves numeric semantics and column-level meaning in the embedding space, something text-only or vanilla tabular encoders struggle with. One tight question: how well does this hold up under schema shifts, such as adding new columns or reinterpreting a column across domains, where the natural language constraint might drift? By the way, the arxivlens breakdown at https://arxivlens.com/PaperView/Details/tabembed-benchmarking-and-learning-generalist-embeddings-for-tabular-understanding-5676-ee07c573 helped me parse the method details, especially the triplet construction and training recipe. An ablation that removes the query anchor or the hard negative mining would really nail down which piece is driving the gains.
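The query-anchored triplet construction the comment describes can be sketched as follows. The serialization template and the helper names (`serialize_row`, `query_anchored_triplet`) are hypothetical — the paper's actual serialization and sampling scheme may differ.

```python
def serialize_row(row: dict) -> str:
    """Flatten one table row into 'column: value' text so a text
    encoder can consume it; the exact template is an assumption."""
    return " | ".join(f"{col}: {val}" for col, val in row.items())

def query_anchored_triplet(query: str, rows: list, satisfies):
    """Build (anchor, positive, negatives): the natural language query
    is the anchor, a row meeting its constraints is the positive, and
    rows violating them become candidate hard negatives."""
    pos = next(r for r in rows if satisfies(r))
    negs = [serialize_row(r) for r in rows if not satisfies(r)]
    return query, serialize_row(pos), negs

# Illustrative usage: rows that differ only in one cell yield the kind of
# numerically close hard negative that row-to-row contrast tends to miss.
rows = [
    {"age": 42, "survived": 1},
    {"age": 42, "survived": 0},  # same age, label flipped: a hard negative
    {"age": 19, "survived": 1},  # fails the numeric constraint
]
anchor, pos, negs = query_anchored_triplet(
    "passenger older than 30 who survived",
    rows,
    lambda r: r["age"] > 30 and r["survived"] == 1,
)
```

Anchoring on the query rather than another row is what ties the numeric constraint ("older than 30") directly to the contrastive signal.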