---
datasets:
- tiennv/mmarco-passage-en
- tiennv/mmarco-passage-vi
language:
- en
- vi
---

This is the [pythera/mbert-retrieve-ctx-base](https://huggingface.co/pythera/mbert-retrieve-ctx-base) model. It maps paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.

## Usage

```python
import torch
from transformers import AutoModel, AutoTokenizer


# CLS pooling: take the hidden state of the first ([CLS]) token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]


# Encode a list of texts into dense embeddings
def encode(texts):
    # Tokenize the texts, padding and truncating them to a common length
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Pool the token embeddings into one vector per text
    embeddings = cls_pooling(model_output)

    return embeddings


# Prepare the documents to embed
passages = [
    '2023 đánh dấu AI không còn bó hẹp trong cộng đồng nhỏ, mà ứng dụng rộng khắp để phục vụ hàng triệu người Việt, từ viết văn đến tạo ảnh avatar.',
    'According to industry reports, the global machine learning market is expected to reach a staggering $96.7 billion by 2025.'
]

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModel.from_pretrained('pythera/mbert-retrieve-ctx-base')
tokenizer = AutoTokenizer.from_pretrained('pythera/mbert-retrieve-ctx-base')

# Encode the documents (one 768-dimensional vector per passage)
output_emb = encode(passages)
print('Output embedding: ', output_emb)
```
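
Once passages are embedded, semantic search comes down to scoring each passage vector against a query vector and sorting by score. The sketch below illustrates that step only. It makes several assumptions that this card does not state: the query vector comes from a compatible query encoder, the score is a plain dot product, and `rank_passages` is a hypothetical helper name. Toy 3-dimensional lists stand in for the 768-dimensional embeddings (for example, the rows of `output_emb.tolist()`).

```python
def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (passage_index, score) pairs sorted by descending score.

    Assumption: relevance is scored with a dot product; the model card
    does not say which similarity function the model was trained with.
    """
    scores = [
        sum(q * p for q, p in zip(query_vec, vec))
        for vec in passage_vecs
    ]
    # Sort passage indices from highest to lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy vectors standing in for real 768-dimensional embeddings
query = [1.0, 0.0, 1.0]
passage_vectors = [
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
]

ranked = rank_passages(query, passage_vectors, top_k=2)
print(ranked)
```

For large collections, the same scoring is usually delegated to a vector index rather than computed in a Python loop.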

## Evaluation

We evaluate our model on mMARCO (vi) and compare it with several other methods:

| Model | Training Datasets | Recall@1000 | MRR@10 |
|-------------------------------|---------------------------------------|:------------:|:-------------:|
| [vietnamese-bi-encoder](https://huggingface.co/bkai-foundation-models/vietnamese-bi-encoder) | MSMARCO + SQuADv2.0 + 80% Zalo | 79.58 | 18.74 |
| mColB | MSMARCO | 71.90 | 18.0 |
| mbert (ours) | MSMARCO | 85.86 | 21.42 |
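
For readers unfamiliar with the two metrics in the table, here is a minimal per-query sketch using their standard definitions. The evaluation script behind the reported numbers is not shown here, so this illustrates the metrics rather than the exact procedure used. The function names and passage IDs are hypothetical. A reported score would typically be the mean over all queries, expressed as a percentage.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Hypothetical retrieval result for a single query
ranked = ["p7", "p3", "p9", "p1"]
relevant = {"p3", "p1"}

rr = mrr_at_k(ranked, relevant, k=10)      # first relevant hit is at rank 2
rec = recall_at_k(ranked, relevant, k=3)   # one of two relevant passages in the top 3
print(rr, rec)
```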