---
datasets:
- tiennv/mmarco-passage-en
- tiennv/mmarco-passage-vi
language:
- en
- vi
---
|
|
|
|
|
This is [pythera/mbert-retrieve-qry-base](https://huggingface.co/pythera/mbert-retrieve-qry-base): it maps text to a 768-dimensional dense vector space and is optimized for semantic search.
|
|
## Usage |
|
|
|
|
|
```python
import torch
from transformers import AutoModel, AutoTokenizer


# CLS pooling - take the output of the first ([CLS]) token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]


# Encode a list of texts into dense embeddings
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform CLS pooling
    embeddings = cls_pooling(model_output)

    return embeddings


# Queries to embed
queries = [
    'Tại sao bầu trời lại màu xanh?',       # "Why is the sky blue?"
    'Định nghĩa Generative AI là gì?'        # "What is the definition of Generative AI?"
]

# Load model from the Hugging Face Hub
model = AutoModel.from_pretrained('pythera/mbert-retrieve-qry-base')
tokenizer = AutoTokenizer.from_pretrained('pythera/mbert-retrieve-qry-base')

# Encode the queries
output_emb = encode(queries)
print('Output embedding: ', output_emb)
```
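
The query embeddings above can be used directly for dense retrieval by scoring candidate passages with a dot product (or cosine similarity). The sketch below is a minimal illustration only: it assumes passage embeddings already exist in the same 768-dimensional space (for example, produced by a matching passage encoder), and the `passage_emb` tensor is a hypothetical stand-in, not something defined by this model card.

```python
import torch

# Assume `output_emb` holds the query embeddings from the snippet above.
# `passage_emb` is a hypothetical placeholder for embeddings of candidate
# passages in the same 768-dimensional space.
passage_emb = torch.randn(100, 768)  # placeholder: 100 candidate passages

# Dot-product scores: one row of scores per query
scores = output_emb @ passage_emb.T  # shape: (num_queries, num_passages)

# Indices of the top-5 passages for each query
top_scores, top_idx = torch.topk(scores, k=5, dim=1)
print(top_idx)
```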
|
|
## Evaluation |
|
|
|
|
|
We evaluate on the Vietnamese subset of mMARCO (vi), reporting Recall@1000 and MRR@10 for several methods:
|
|
|
|
|
| Model | Trained Datasets | Recall@1000 | MRR@10 |
|-------|------------------|:-----------:|:------:|
| [vietnamese-bi-encoder](https://huggingface.co/bkai-foundation-models/vietnamese-bi-encoder) | MS MARCO + SQuADv2.0 + 80% Zalo | 79.58 | 18.74 |
| mColB | MS MARCO | 71.90 | 18.00 |
| mbert (ours) | MS MARCO | 85.86 | 21.42 |
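
For reference, MRR@10 considers only the rank of the first relevant passage within each query's top 10 results and averages the reciprocal of that rank over all queries. The snippet below is a minimal sketch of that computation; the `rankings` and `relevant` dictionaries are hypothetical toy data, not part of the evaluation above.

```python
# Minimal sketch of MRR@10: reciprocal rank of the first relevant hit
# within the top 10 retrieved passages, averaged over queries.
def mrr_at_10(rankings, relevant):
    total = 0.0
    for qid, ranked_ids in rankings.items():
        for rank, pid in enumerate(ranked_ids[:10], start=1):
            if pid in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Hypothetical toy data for illustration
rankings = {'q1': ['p3', 'p7', 'p1'], 'q2': ['p9', 'p2']}
relevant = {'q1': {'p7'}, 'q2': {'p5'}}
print(mrr_at_10(rankings, relevant))  # (1/2 + 0) / 2 = 0.25
```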