| language: | |
| - en | |
| library_name: sentence-transformers | |
| tags: | |
| - sentence-transformers | |
| - feature-extraction | |
| - sentence-similarity | |
| - transformers | |
| pipeline_tag: sentence-similarity | |
| # msmarco-MiniLM-L6-cos-v5 | |
| This is a [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 384-dimensional dense vector space and was designed for **semantic search**. It was trained on 500k (query, answer) pairs from the [MS MARCO Passages dataset](https://github.com/microsoft/MSMARCO-Passage-Ranking). For an introduction to semantic search, see [SBERT.net - Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html). | |
| ## Usage (Sentence-Transformers) | |
| Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed: | |
| ``` | |
| pip install -U sentence-transformers | |
| ``` | |
| Then you can use the model like this: | |
| ```python | |
| from sentence_transformers import SentenceTransformer, util | |
| query = "How many people live in London?" | |
| docs = ["Around 9 Million people live in London", "London is known for its financial district"] | |
| # Load the model | |
| model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L6-cos-v5') | |
| # Encode query and documents | |
| query_emb = model.encode(query) | |
| doc_emb = model.encode(docs) | |
| # Compute dot score between query and all document embeddings | |
| scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist() | |
| # Combine docs & scores | |
| doc_score_pairs = list(zip(docs, scores)) | |
| # Sort by decreasing score | |
| doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True) | |
| # Output passages & scores | |
| for doc, score in doc_score_pairs: | |
|     print(score, doc) | |
| ``` | |
| ## Usage (HuggingFace Transformers) | |
| Without [sentence-transformers](https://www.SBERT.net), you can use the model in two steps: first, pass your input through the transformer model; then apply the correct pooling operation (mean pooling, followed by normalization) on top of the contextualized word embeddings. | |
| ```python | |
| from transformers import AutoTokenizer, AutoModel | |
| import torch | |
| import torch.nn.functional as F | |
| # Mean pooling: average all token embeddings, ignoring padding positions | |
| def mean_pooling(model_output, attention_mask): | |
|     # last_hidden_state holds the embeddings of all tokens | |
|     token_embeddings = model_output.last_hidden_state | |
|     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() | |
|     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) | |
| # Encode text | |
| def encode(texts): | |
|     # Tokenize sentences | |
|     encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt') | |
|     # Compute token embeddings | |
|     with torch.no_grad(): | |
|         model_output = model(**encoded_input, return_dict=True) | |
|     # Perform pooling | |
|     embeddings = mean_pooling(model_output, encoded_input['attention_mask']) | |
|     # Normalize embeddings to unit length | |
|     embeddings = F.normalize(embeddings, p=2, dim=1) | |
|     return embeddings | |
| # Sentences we want sentence embeddings for | |
| query = "How many people live in London?" | |
| docs = ["Around 9 Million people live in London", "London is known for its financial district"] | |
| # Load model from HuggingFace Hub | |
| tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-MiniLM-L6-cos-v5") | |
| model = AutoModel.from_pretrained("sentence-transformers/msmarco-MiniLM-L6-cos-v5") | |
| # Encode query and docs | |
| query_emb = encode(query) | |
| doc_emb = encode(docs) | |
| # Compute dot score between query and all document embeddings | |
| scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist() | |
| # Combine docs & scores | |
| doc_score_pairs = list(zip(docs, scores)) | |
| # Sort by decreasing score | |
| doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True) | |
| # Output passages & scores | |
| for doc, score in doc_score_pairs: | |
|     print(score, doc) | |
| ``` | |
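To see why the attention mask matters in the pooling step above, here is a minimal, dependency-free sketch on made-up numbers. The helpers `masked_mean` and `l2_normalize` and the toy vectors are illustrative assumptions, not part of the model or of any library; they mirror the logic of `mean_pooling` and `F.normalize` for a single sequence using plain Python lists.

```python
import math

def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, ignoring padded positions.

    Hypothetical helper for illustration; mirrors mean_pooling for one sequence.
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 = real token, 0 = padding
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against division by zero, like torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

def l2_normalize(vector):
    """Scale a vector to unit length, as F.normalize(p=2) does per row."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, 1e-12) for v in vector]

# Toy input: two real tokens followed by one padding token (made-up values)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)  # the padding vector is excluded
unit = l2_normalize(pooled)         # unit-length sentence embedding
print(pooled)
print(unit)
```

Without the mask, the padding vector would dominate the average, which is why the pooling step multiplies by the expanded attention mask before summing.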
| ## Technical Details | |
| The following technical details describe how this model must be used: | |
| | Setting | Value | | |
| | --- | :---: | | |
| | Dimensions | 384 | | |
| | Produces normalized embeddings | Yes | | |
| | Pooling-Method | Mean pooling | | |
| | Suitable score functions | dot-product (`util.dot_score`), cosine-similarity (`util.cos_sim`), or Euclidean distance | | |
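| Note: When loaded with `sentence-transformers`, this model produces normalized embeddings of length 1. In that case, dot-product and cosine-similarity are equivalent; dot-product is preferred as it is faster. Euclidean distance can also be used: it is not proportional to dot-product, but for unit-length vectors the squared Euclidean distance equals 2 - 2 * dot-product, so ranking by smallest distance gives the same order as ranking by largest dot-product. | |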
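The relationship between the three score functions on unit-length vectors can be checked numerically. The following is a small sketch in plain Python, assuming made-up 3-dimensional vectors in place of real 384-dimensional embeddings; the helper names are hypothetical and not part of `sentence-transformers`.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity, computed without assuming unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors standing in for normalized query and document embeddings
query = normalize([1.0, 2.0, 2.0])
docs = [normalize(d) for d in ([2.0, 1.0, 2.0], [-1.0, 0.5, 3.0], [0.0, -2.0, 1.0])]

dot_scores = [dot(query, d) for d in docs]
cos_scores = [cosine(query, d) for d in docs]
sq_dists = [squared_euclidean(query, d) for d in docs]

# Highest dot-product first vs. smallest distance first
rank_by_dot = sorted(range(len(docs)), key=lambda i: dot_scores[i], reverse=True)
rank_by_dist = sorted(range(len(docs)), key=lambda i: sq_dists[i])

for d, c, s in zip(dot_scores, cos_scores, sq_dists):
    print(f"dot={d:.4f}  cos={c:.4f}  sq_dist={s:.4f}  2-2*dot={2 - 2 * d:.4f}")
print(rank_by_dot, rank_by_dist)
```

Each printed row should show matching `dot` and `cos` values and matching `sq_dist` and `2-2*dot` values, and the two rankings should agree. This only holds for normalized embeddings.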
| ## Citing & Authors | |
| This model was trained by [sentence-transformers](https://www.sbert.net/). | |
| If you find this model helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084): | |
| ```bibtex | |
| @inproceedings{reimers-2019-sentence-bert, | |
| title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", | |
| author = "Reimers, Nils and Gurevych, Iryna", | |
| booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", | |
| month = "11", | |
| year = "2019", | |
| publisher = "Association for Computational Linguistics", | |
| url = "http://arxiv.org/abs/1908.10084", | |
| } | |
| ``` |