How to use pythera/mbert-retrieve-ctx-base with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("feature-extraction", model="pythera/mbert-retrieve-ctx-base")
# Load model directly
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("pythera/mbert-retrieve-ctx-base")
model = AutoModel.from_pretrained("pythera/mbert-retrieve-ctx-base")
This is the pythera/mbert-retrieve-ctx-base model. It maps paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.
import torch
from transformers import AutoModel, AutoTokenizer

# CLS pooling: take the hidden state of the first token ([CLS])
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# Encode a list of texts into dense embeddings
def encode(texts):
    # Tokenize the texts, padding and truncating to a common length
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    # Pool the token embeddings into one vector per text
    embeddings = cls_pooling(model_output)
    return embeddings

# Prepare the documents to embed
passage = [
    '2023 đánh dấu AI không còn bó hẹp trong cộng đồng nhỏ, mà ứng dụng rộng khắp để phục vụ hàng triệu người Việt, từ viết văn đến tạo ảnh avatar.',
    'According to industry reports, the global machine learning market is expected to reach a staggering $96.7 billion by 2025.'
]

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModel.from_pretrained('pythera/mbert-retrieve-ctx-base')
tokenizer = AutoTokenizer.from_pretrained('pythera/mbert-retrieve-ctx-base')

# Encode the documents; the result has shape (number of passages, 768)
output_emb = encode(passage)
print('Output embedding: ', output_emb)
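Once passages are embedded, semantic search comes down to scoring each passage vector against a query vector and sorting by score. The sketch below illustrates that step in plain Python. It assumes cosine similarity as the scoring function (this card does not state which similarity the model was trained with) and assumes the query vector comes from a compatible query encoder. `cosine_similarity` and `rank_passages` are hypothetical helper names, and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted by descending score."""
    scores = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption for illustration).
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_passages(query, passages)
print(ranking)
```

With real tensors, the same helper could be called as `rank_passages(query_emb[0].tolist(), output_emb.tolist())`, where `query_emb` is a hypothetical query embedding produced elsewhere.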
We evaluate our model on mMARCO (vi) and compare it with several baselines:
| Model | Training datasets | Recall@1000 | MRR@10 |
|---|---|---|---|
| vietnamese-bi-encoder | MS MARCO + SQuADv2.0 + 80% Zalo | 79.58 | 18.74 |
| mColB | MS MARCO | 71.90 | 18.0 |
| mbert (ours) | MS MARCO | 85.86 | 21.42 |
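For readers unfamiliar with the two metrics in the table, the sketch below shows how MRR@k and Recall@k are commonly computed per query and then averaged. It illustrates the standard definitions only and is not the evaluation script behind the numbers above. `mrr_at_k`, `recall_at_k`, and the toy run and relevance data are hypothetical.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Toy data: query id -> ranked passage ids, and query id -> relevant passage ids.
run = {"q1": ["p3", "p1", "p7"], "q2": ["p2", "p9", "p4"]}
qrels = {"q1": {"p1"}, "q2": {"p5"}}

# Average each per-query metric over all queries.
mean_mrr = sum(mrr_at_k(run[q], qrels[q]) for q in run) / len(run)
mean_recall = sum(recall_at_k(run[q], qrels[q]) for q in run) / len(run)
print(mean_mrr, mean_recall)
```

These functions return fractions between 0 and 1; the table values appear to be the same quantities expressed as percentages.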