How can I get dense, colbert embeddings with transformers?

#80

by Calvinnncy97 - opened Aug 29, 2024

Discussion

Calvinnncy97

Aug 29, 2024

edited Aug 29, 2024

Given the following code:

from transformers import AutoModel, AutoTokenizer
from torch import Tensor
import torch

model_path = 'BAAI/bge-m3'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

test_sentence = ["this is a test sentence"]

batch_dict = tokenizer(test_sentence, return_tensors='pt', max_length=128, padding=True, truncation=True)
outputs = model(**batch_dict)

I get BaseModelOutputWithPoolingAndCrossAttentions with pooler_output and last_hidden_state keys. Is pooler_output the CLS embedding and last_hidden_state all the token embeddings?

Kindly clarify. Thank you.

Shitao

Beijing Academy of Artificial Intelligence org Aug 30, 2024

@Calvinnncy97, the CLS embedding (dense) is the first embedding in last_hidden_state; the other embeddings (all except the first) are the ColBERT embeddings.
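As a rough sketch of the slicing described in the reply above: the helper below (the name `split_embeddings` is hypothetical, not part of transformers or FlagEmbedding) takes a `last_hidden_state` tensor and an attention mask and returns the first-position vector as the dense embedding and the remaining positions as token-level vectors. Masking the padding positions and L2-normalizing are assumptions added here for illustration; the reply does not mention them, and any extra projection layers the official pipeline may apply are not covered. Random tensors stand in for a real model output so the example runs without downloading weights.

```python
import torch
import torch.nn.functional as F


def split_embeddings(last_hidden_state, attention_mask, normalize=True):
    """Split encoder output into a dense (CLS) vector and per-token vectors.

    Hypothetical helper for illustration only.
    """
    # Dense embedding: hidden state at position 0 (the CLS token).
    dense = last_hidden_state[:, 0]
    # Token-level embeddings: every position after the first.
    token_vecs = last_hidden_state[:, 1:]
    # Assumption: zero out padding positions so they cannot affect later scoring.
    mask = attention_mask[:, 1:].unsqueeze(-1).to(token_vecs.dtype)
    token_vecs = token_vecs * mask
    if normalize:
        # Assumption: L2-normalize so dot products behave like cosine similarity.
        dense = F.normalize(dense, dim=-1)
        token_vecs = F.normalize(token_vecs, dim=-1)
    return dense, token_vecs


# Stand-in tensors shaped like a model output: batch=2, seq_len=5, hidden=8.
torch.manual_seed(0)
hidden = torch.randn(2, 5, 8)
attn = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])

dense, token_vecs = split_embeddings(hidden, attn)
print(dense.shape, token_vecs.shape)
```

With the code from the question, the same helper would be called as `split_embeddings(outputs.last_hidden_state, batch_dict['attention_mask'])`.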

frankgxy

Mar 14, 2025

@Shitao How do I get the sparse embedding? The sparse linear layer is not loaded into the model.
