Pooling method: mean vs last?

#25

by alexzhou689 - opened Aug 6, 2024

Discussion

alexzhou689

Aug 6, 2024

As the title says: which one should I choose for inference or training?

thenlper

Alibaba-NLP org Aug 14, 2024

We recommend using the last-token pooling method. Please refer to the example code in the model introduction.
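
For readers without the model introduction at hand, here is a minimal, dependency-free sketch of what last-token pooling does. It uses plain Python lists instead of tensors so the indexing is easy to follow; the function name `last_token_pool` and the toy data are illustrative assumptions, not a copy of the official example code.

```python
def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last real (non-padding) token per sequence.

    hidden_states: list of sequences, each a list of per-token vectors.
    attention_mask: list of sequences of 0/1 flags (1 = real token).
    """
    pooled = []
    for vectors, mask in zip(hidden_states, attention_mask):
        # Index of the last position whose mask is 1.
        # This works for both left-padded and right-padded sequences.
        last_index = max(i for i, flag in enumerate(mask) if flag == 1)
        pooled.append(vectors[last_index])
    return pooled


# Toy batch (hypothetical values): 2 sequences, 3 positions, hidden size 2.
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # right-padded, 2 real tokens
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],  # no padding
]
attention_mask = [[1, 1, 0], [1, 1, 1]]

print(last_token_pool(hidden_states, attention_mask))
# [[3.0, 4.0], [9.0, 10.0]]
```

The point to notice is that padding positions must be skipped: taking position `-1` blindly would return a padding vector for the first sequence.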

gu-qizheng

Oct 1, 2024

I noticed that the original GTE paper, "Towards General Text Embeddings with Multi-stage Contrastive Learning" (Section 3.1, Model Architecture), uses mean pooling. However, gte-Qwen2-7B-instruct uses last-token pooling, as shown in the example code and config file. Is there any literature reference or practical experience that could be shared on this design choice? It looks like bidirectional embedding models typically use mean pooling (as in the original GTE paper with BERT), while last-token pooling is more common for decoder-only, LLM-based models.
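
To make the comparison concrete, here is a small sketch of mean pooling, plus a toy illustration of one commonly cited intuition for the split: under causal (left-to-right) attention, only the final token has attended to the whole input, whereas under bidirectional attention every token has. This is a general property of the two attention patterns, not a confirmed explanation of the gte-Qwen2-7B-instruct design. The helper names `mean_pool` and `visible_tokens` are hypothetical.

```python
def mean_pool(vectors, mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    real = [v for v, flag in zip(vectors, mask) if flag == 1]
    dim = len(real[0])
    return [sum(v[d] for v in real) / len(real) for d in range(dim)]


def visible_tokens(position, length, causal):
    """Count the tokens that a 0-indexed position can attend to."""
    # Causal attention: a token sees itself and everything before it.
    # Bidirectional attention: a token sees the entire sequence.
    return position + 1 if causal else length


# Mean pooling over a right-padded toy sequence (hypothetical values).
print(mean_pool([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]], [1, 1, 0]))
# [2.0, 3.0]

length = 4
print([visible_tokens(i, length, causal=True) for i in range(length)])
# [1, 2, 3, 4]  -> only the last token has seen all 4 tokens
print([visible_tokens(i, length, causal=False) for i in range(length)])
# [4, 4, 4, 4]  -> every token has seen the full sequence
```

Under this view, averaging causal hidden states mixes in early positions that saw little context, which may be one reason last-token pooling tends to be preferred there. Which method works best for a given model is still an empirical question.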
