Instructions for using NbAiLab/nb-sbert-base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use NbAiLab/nb-sbert-base with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NbAiLab/nb-sbert-base")

sentences = [
    "This is a Norwegian boy",
    "Dette er en norsk gutt",
    "This is an English boy",
    "This is a dog",
]

embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [4, 4]
```
- Transformers
How to use NbAiLab/nb-sbert-base with Transformers:
```python
# Load model directly
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("NbAiLab/nb-sbert-base")
model = AutoModel.from_pretrained("NbAiLab/nb-sbert-base")
```
- Inference
- Notebooks
- Google Colab
- Kaggle
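Note that `AutoModel` in the Transformers snippet above returns token-level hidden states, not sentence embeddings; to approximate what sentence-transformers produces, a mean-pooling step over the attention mask is typically applied. Below is a minimal sketch of that pooling step (shown with NumPy arrays so it runs without downloading the model; the function name and toy inputs are illustrative, not from the model card):

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    # Average token embeddings across the sequence axis,
    # ignoring padding positions marked 0 in the attention mask.
    mask = attention_mask[..., None].astype(float)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy example: batch of 1, 3 tokens (last one is padding), hidden size 2.
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
attn = np.array([[1, 1, 0]])
pooled = mean_pooling(emb, attn)
print(pooled)  # [[2. 3.]]
```

With the real model, `token_embeddings` would be `model(**tokenizer(sentences, return_tensors="pt", padding=True)).last_hidden_state` and `attention_mask` the tokenizer's mask.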
model_max_length and max_seq_length
Hi!
First of all great job on this sBERT model!
Secondly, it looks like something weird is going on with the model_max_length and max_seq_length attributes when instantiating the model via AutoModel / AutoTokenizer and SentenceTransformer, respectively.
The sentence-transformers implementation gives a max length of 75:
```python
model_st = SentenceTransformer('NbAiLab/nb-sbert-base')
model_st.max_seq_length
# 75
```
While loading the tokenizer through HF's AutoTokenizer gives a very different max length:
```python
tokenizer = AutoTokenizer.from_pretrained('NbAiLab/nb-sbert-base')
tokenizer.model_max_length
# 1000000000000000019884624838656
```
The second one is clearly incorrect, but is 75 the correct max sequence length for this model? If I remember correctly, BERT models have a sequence length of 512, or has that changed when finetuning this model?
This also means that, for sequences longer than 75 tokens, the two implementations will give different embeddings, which may be worth mentioning.
Hi.
The sequence length of 75 comes from the training script we use.
The other one seems to come from no max length being set; the nb-bert-base model has the same value.
The correct value would be 75, but I wouldn't be surprised if you could raise the max length and input sequences of up to 512 tokens with good success.
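Raising the limit as suggested is a one-line override in sentence-transformers; a minimal sketch (assuming the underlying BERT tokenizer's 512-token limit still applies, since the model's position embeddings only cover 512 positions):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NbAiLab/nb-sbert-base")
model.max_seq_length = 512  # override the 75 baked in by the training script
```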
Thanks for the quick reply and good to know! I'll experiment with input sequences of 512 to see how they compare with the 75-length sequences.