---
datasets:
- tiennv/mmarco-passage-en
- tiennv/mmarco-passage-vi
language:
- en
- vi
---
This is the [pythera/mbert-retrieve-qry-base](https://huggingface.co/pythera/mbert-retrieve-qry-base) model: it maps paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

# CLS pooling: take the output of the first ([CLS]) token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# Encode a list of texts into dense embeddings
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    # Perform CLS pooling
    return cls_pooling(model_output)

# Queries to embed
query = [
    'Tại sao bầu trời lại màu xanh?',
    'Định nghĩa Generative AI là gì?'
]

# Load model and tokenizer from the Hugging Face Hub
model = AutoModel.from_pretrained('pythera/mbert-retrieve-qry-base')
tokenizer = AutoTokenizer.from_pretrained('pythera/mbert-retrieve-qry-base')

# Encode queries
output_emb = encode(query)
print('Output embedding:', output_emb)
```
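For retrieval, each query embedding is scored against passage embeddings, typically with a dot product. The sketch below uses random tensors as stand-ins for real embeddings (a real setup would encode passages with a matching passage encoder, which is not specified here):

```python
import torch

torch.manual_seed(0)
# Stand-ins for real embeddings: 2 queries and 4 passages, 768-dim each
query_emb = torch.randn(2, 768)
passage_emb = torch.randn(4, 768)

# Dot-product similarity: scores[i, j] = score of passage j for query i
scores = query_emb @ passage_emb.T  # shape (2, 4)

# Rank passages for each query, best first
ranking = scores.argsort(dim=1, descending=True)
print(ranking)
```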
## Evaluation
We evaluate on the Vietnamese split of mMARCO against several baselines:
| Model | Training Data | Recall@1000 | MRR@10 |
|-------------------------------|---------------------------------------|:------------:|:-------------:|
| [vietnamese-bi-encoder](https://huggingface.co/bkai-foundation-models/vietnamese-bi-encoder) | MS MARCO + SQuADv2.0 + 80% Zalo | 79.58 | 18.74 |
| mColBERT | MS MARCO | 71.90 | 18.00 |
| mbert (ours) | MS MARCO | 85.86 | 21.42 |
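MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage within the top 10 results (0 if none appears). A minimal sketch, with hypothetical passage ids:

```python
def mrr_at_10(rankings, relevant):
    """rankings: one ranked list of passage ids per query;
    relevant: one set of relevant passage ids per query."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, pid in enumerate(ranked[:10], start=1):
            if pid in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

# Query 1: first relevant passage at rank 2 -> 1/2; query 2: no hit -> 0
print(mrr_at_10([[3, 1, 2], [5, 4]], [{1}, {9}]))  # (0.5 + 0) / 2 = 0.25
```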