---
datasets:
  - tiennv/mmarco-passage-en
  - tiennv/mmarco-passage-vi
language:
  - en
  - vi
---

This is pythera/mbert-retrieve-qry-base: it maps paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.

Usage

import torch
from transformers import AutoModel, AutoTokenizer

# CLS Pooling - Take output from first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:,0]

# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings

# Prepare the queries to embed
query = [
'Tại sao bầu trời lại màu xanh?',
'Định nghĩa Generative AI là gì?'
]

# Load model from HuggingFace Hub
model = AutoModel.from_pretrained('pythera/mbert-retrieve-qry-base')
tokenizer = AutoTokenizer.from_pretrained('pythera/mbert-retrieve-qry-base')

# Encode queries
output_emb = encode(query)
print('Output embedding: ', output_emb)
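
Once queries and passages are embedded, retrieval reduces to scoring vector similarity, typically by dot product for dense retrievers. The sketch below uses toy low-dimensional vectors rather than real 768-dimensional model outputs; `rank_by_similarity` is an illustrative helper, not part of the model's API:

```python
import torch

def rank_by_similarity(query_emb, doc_embs):
    # Dot-product scores between one query vector and a matrix of doc vectors
    scores = doc_embs @ query_emb
    # Indices of documents, ordered from highest to lowest score
    ranked = torch.argsort(scores, descending=True)
    return ranked, scores

# Toy 4-dimensional embeddings (real embeddings from this model are 768-d)
q = torch.tensor([1.0, 0.0, 0.0, 0.0])
docs = torch.tensor([
    [0.9, 0.1, 0.0, 0.0],  # very similar to the query
    [0.0, 1.0, 0.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.0, 0.0],  # partially similar
])
ranked, scores = rank_by_similarity(q, docs)
print(ranked.tolist())  # most to least similar: [0, 2, 1]
```

In practice the passage side would be encoded with the matching passage encoder, and the scores above would be computed over a large pre-built index rather than a handful of tensors.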

Evaluation

We evaluate on mMARCO (vi) against several baselines:

| Model | Training Datasets | Recall@1000 | MRR@10 |
| --- | --- | --- | --- |
| vietnamese-bi-encoder | MS MARCO + SQuADv2.0 + 80% Zalo | 79.58 | 18.74 |
| mColB | MS MARCO | 71.90 | 18.0 |
| mbert (ours) | MS MARCO | 85.86 | 21.42 |
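
For reference, MRR@10 averages the reciprocal rank of the first relevant passage within each query's top-10 results (contributing 0 when no relevant passage appears). A minimal sketch with hypothetical ranked lists, assuming one relevant passage id per query:

```python
def mrr_at_k(ranked_lists, relevant_ids, k=10):
    # ranked_lists: one ranked list of doc ids per query
    # relevant_ids: the single relevant doc id for each query
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant_ids):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)

# Two toy queries: relevant doc found at rank 1 and rank 2
# -> (1/1 + 1/2) / 2 = 0.75
print(mrr_at_k([[3, 1, 2], [5, 4, 6]], [3, 4]))  # 0.75
```

Recall@1000 is computed analogously, but simply counts whether the relevant passage appears anywhere in the top 1000 retrieved results.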