--- language: - en license: apache-2.0 tags: - sentence-transformers - sentence-similarity - feature-extraction - dense - generated_from_trainer - dataset_size:900 - loss:MatryoshkaLoss - loss:MultipleNegativesRankingLoss base_model: microsoft/codebert-base widget: - source_sentence: Best practices for _invocation_params sentences: - "def after_model(self, state: StateT, runtime: Runtime[ContextT]) -> dict[str,\ \ Any] | None:\n \"\"\"Logic to run after the model is called.\n\n \ \ Args:\n state: The current agent state.\n runtime:\ \ The runtime context.\n\n Returns:\n Agent state updates to\ \ apply after model call.\n \"\"\"" - "def _get_trace_callbacks(\n project_name: str | None = None,\n example_id:\ \ str | UUID | None = None,\n callback_manager: CallbackManager | AsyncCallbackManager\ \ | None = None,\n) -> Callbacks:\n if _tracing_v2_is_enabled():\n project_name_\ \ = project_name or _get_tracer_project()\n tracer = tracing_v2_callback_var.get()\ \ or LangChainTracer(\n project_name=project_name_,\n example_id=example_id,\n\ \ )\n if callback_manager is None:\n cb = cast(\"Callbacks\"\ , [tracer])\n else:\n if not any(\n isinstance(handler,\ \ LangChainTracer)\n for handler in callback_manager.handlers\n\ \ ):\n callback_manager.add_handler(tracer)\n \ \ # If it already has a LangChainTracer, we don't need to add another\ \ one.\n # this would likely mess up the trace hierarchy.\n \ \ cb = callback_manager\n else:\n cb = None\n return cb" - "def _invocation_params(self) -> dict[str, Any]:\n params: dict = {\"model\"\ : self.model, **self.model_kwargs}\n if self.dimensions is not None:\n\ \ params[\"dimensions\"] = self.dimensions\n return params" - source_sentence: How does _approximate_token_counter work in Python? 
sentences: - "def _approximate_token_counter(messages: Sequence[BaseMessage]) -> int:\n \ \ \"\"\"Wrapper for `count_tokens_approximately` that matches expected signature.\"\ \"\"\n return count_tokens_approximately(messages)" - "def remove_request_headers(request: Any) -> Any:\n for k in request.headers:\n\ \ request.headers[k] = \"**REDACTED**\"\n return request" - "def get_format_instructions(self) -> str:\n \"\"\"Returns formatting instructions\ \ for the given output parser.\"\"\"\n return self.format_instructions" - source_sentence: How to implement _create_thread_and_run? sentences: - "async def on_retriever_end(\n self, documents: Sequence[Document], **kwargs:\ \ Any\n ) -> None:\n \"\"\"Run when the retriever ends running.\n\n\ \ Args:\n documents: The retrieved documents.\n **kwargs:\ \ Additional keyword arguments.\n\n \"\"\"\n if not self.handlers:\n\ \ return\n await ahandle_event(\n self.handlers,\n\ \ \"on_retriever_end\",\n \"ignore_retriever\",\n \ \ documents,\n run_id=self.run_id,\n parent_run_id=self.parent_run_id,\n\ \ tags=self.tags,\n **kwargs,\n )" - "def _create_thread_and_run(self, input_dict: dict, thread: dict) -> Any:\n \ \ params = {\n k: v\n for k, v in input_dict.items()\n\ \ if k\n in (\n \"instructions\",\n \ \ \"model\",\n \"tools\",\n \"parallel_tool_calls\"\ ,\n \"top_p\",\n \"temperature\",\n \ \ \"max_completion_tokens\",\n \"max_prompt_tokens\",\n \ \ \"run_metadata\",\n )\n }\n return self.client.beta.threads.create_and_run(\n\ \ assistant_id=self.assistant_id,\n thread=thread,\n \ \ **params,\n )" - "def test_pandas_output_parser_col_no_array() -> None:\n with pytest.raises(OutputParserException):\n\ \ parser.parse(\"column:num_legs\")" - source_sentence: Explain the get_token_ids logic sentences: - "def _runnable(inputs: dict[str, Any]) -> str:\n if inputs[\"text\"] == \"\ foo\":\n return \"first\"\n if \"exception\" not in inputs:\n \ \ msg = \"missing exception\"\n raise ValueError(msg)\n if inputs[\"\ text\"] == 
\"bar\":\n return \"second\"\n if isinstance(inputs[\"exception\"\ ], ValueError):\n raise RuntimeError # noqa: TRY004\n return \"third\"" - "def validate_params(cls, values: dict) -> dict:\n \"\"\"Validate similarity\ \ parameters.\"\"\"\n if values[\"k\"] is None and values[\"similarity_threshold\"\ ] is None:\n msg = \"Must specify one of `k` or `similarity_threshold`.\"\ \n raise ValueError(msg)\n return values" - "def get_token_ids(self, text: str) -> list[int]:\n \"\"\"Return the ordered\ \ IDs of the tokens in a text.\n\n Args:\n text: The string\ \ input to tokenize.\n\n Returns:\n A list of IDs corresponding\ \ to the tokens in the text, in order they occur\n in the text.\n\ \ \"\"\"\n if self.custom_get_token_ids is not None:\n \ \ return self.custom_get_token_ids(text)\n return _get_token_ids_default_method(text)" - source_sentence: How does __init__ work in Python? sentences: - "def test_loading_few_shot_prompt_from_json() -> None:\n \"\"\"Test loading\ \ few shot prompt from json.\"\"\"\n with change_directory(EXAMPLE_DIR):\n\ \ prompt = load_prompt(\"few_shot_prompt.json\")\n expected_prompt\ \ = FewShotPromptTemplate(\n input_variables=[\"adjective\"],\n \ \ prefix=\"Write antonyms for the following words.\",\n example_prompt=PromptTemplate(\n\ \ input_variables=[\"input\", \"output\"],\n template=\"\ Input: {input}\\nOutput: {output}\",\n ),\n examples=[\n\ \ {\"input\": \"happy\", \"output\": \"sad\"},\n \ \ {\"input\": \"tall\", \"output\": \"short\"},\n ],\n \ \ suffix=\"Input: {adjective}\\nOutput:\",\n )\n assert prompt ==\ \ expected_prompt" - "def __init__(\n self,\n encoding_name: str = \"gpt2\",\n \ \ model_name: str | None = None,\n allowed_special: Literal[\"all\"] |\ \ AbstractSet[str] = set(),\n disallowed_special: Literal[\"all\"] | Collection[str]\ \ = \"all\",\n **kwargs: Any,\n ) -> None:\n \"\"\"Create a new\ \ `TextSplitter`.\n\n Args:\n encoding_name: The name of the\ \ tiktoken encoding to use.\n model_name: The name of the 
model to\ \ use. If provided, this will\n override the `encoding_name`.\n\ \ allowed_special: Special tokens that are allowed during encoding.\n\ \ disallowed_special: Special tokens that are disallowed during encoding.\n\ \n Raises:\n ImportError: If the tiktoken package is not installed.\n\ \ \"\"\"\n super().__init__(**kwargs)\n if not _HAS_TIKTOKEN:\n\ \ msg = (\n \"Could not import tiktoken python package.\ \ \"\n \"This is needed in order to for TokenTextSplitter. \"\n\ \ \"Please install it with `pip install tiktoken`.\"\n \ \ )\n raise ImportError(msg)\n\n if model_name is not None:\n\ \ enc = tiktoken.encoding_for_model(model_name)\n else:\n \ \ enc = tiktoken.get_encoding(encoding_name)\n self._tokenizer\ \ = enc\n self._allowed_special = allowed_special\n self._disallowed_special\ \ = disallowed_special" - "def test_fixed_message_response_when_docs_found() -> None:\n fixed_resp =\ \ \"I don't know\"\n answer = \"I know the answer!\"\n llm = FakeListLLM(responses=[answer])\n\ \ retriever = SequentialRetriever(\n sequential_responses=[[Document(page_content=answer)]],\n\ \ )\n memory = ConversationBufferMemory(\n k=1,\n output_key=\"\ answer\",\n memory_key=\"chat_history\",\n return_messages=True,\n\ \ )\n qa_chain = ConversationalRetrievalChain.from_llm(\n llm=llm,\n\ \ memory=memory,\n retriever=retriever,\n return_source_documents=True,\n\ \ rephrase_question=False,\n response_if_no_docs_found=fixed_resp,\n\ \ verbose=True,\n )\n got = qa_chain(\"What is the answer?\")\n \ \ assert got[\"chat_history\"][1].content == answer\n assert got[\"answer\"\ ] == answer" pipeline_tag: sentence-similarity library_name: sentence-transformers metrics: - cosine_accuracy@1 - cosine_accuracy@3 - cosine_accuracy@5 - cosine_accuracy@10 - cosine_precision@1 - cosine_precision@3 - cosine_precision@5 - cosine_precision@10 - cosine_recall@1 - cosine_recall@3 - cosine_recall@5 - cosine_recall@10 - cosine_ndcg@10 - cosine_mrr@10 - cosine_map@100 model-index: - name: codeBert 
Base results: - task: type: information-retrieval name: Information Retrieval dataset: name: dim 768 type: dim_768 metrics: - type: cosine_accuracy@1 value: 0.83 name: Cosine Accuracy@1 - type: cosine_accuracy@3 value: 0.85 name: Cosine Accuracy@3 - type: cosine_accuracy@5 value: 0.86 name: Cosine Accuracy@5 - type: cosine_accuracy@10 value: 0.94 name: Cosine Accuracy@10 - type: cosine_precision@1 value: 0.83 name: Cosine Precision@1 - type: cosine_precision@3 value: 0.83 name: Cosine Precision@3 - type: cosine_precision@5 value: 0.83 name: Cosine Precision@5 - type: cosine_precision@10 value: 0.45299999999999996 name: Cosine Precision@10 - type: cosine_recall@1 value: 0.16599999999999998 name: Cosine Recall@1 - type: cosine_recall@3 value: 0.498 name: Cosine Recall@3 - type: cosine_recall@5 value: 0.83 name: Cosine Recall@5 - type: cosine_recall@10 value: 0.9059999999999999 name: Cosine Recall@10 - type: cosine_ndcg@10 value: 0.8712089918828809 name: Cosine Ndcg@10 - type: cosine_mrr@10 value: 0.8532738095238095 name: Cosine Mrr@10 - type: cosine_map@100 value: 0.861635686929646 name: Cosine Map@100 - task: type: information-retrieval name: Information Retrieval dataset: name: dim 512 type: dim_512 metrics: - type: cosine_accuracy@1 value: 0.85 name: Cosine Accuracy@1 - type: cosine_accuracy@3 value: 0.86 name: Cosine Accuracy@3 - type: cosine_accuracy@5 value: 0.87 name: Cosine Accuracy@5 - type: cosine_accuracy@10 value: 0.95 name: Cosine Accuracy@10 - type: cosine_precision@1 value: 0.85 name: Cosine Precision@1 - type: cosine_precision@3 value: 0.84 name: Cosine Precision@3 - type: cosine_precision@5 value: 0.8419999999999999 name: Cosine Precision@5 - type: cosine_precision@10 value: 0.45299999999999996 name: Cosine Precision@10 - type: cosine_recall@1 value: 0.16999999999999996 name: Cosine Recall@1 - type: cosine_recall@3 value: 0.504 name: Cosine Recall@3 - type: cosine_recall@5 value: 0.8419999999999999 name: Cosine Recall@5 - type: cosine_recall@10 
value: 0.9059999999999999 name: Cosine Recall@10 - type: cosine_ndcg@10 value: 0.8775797199885595 name: Cosine Ndcg@10 - type: cosine_mrr@10 value: 0.8699404761904762 name: Cosine Mrr@10 - type: cosine_map@100 value: 0.8692738075020783 name: Cosine Map@100 - task: type: information-retrieval name: Information Retrieval dataset: name: dim 256 type: dim_256 metrics: - type: cosine_accuracy@1 value: 0.86 name: Cosine Accuracy@1 - type: cosine_accuracy@3 value: 0.89 name: Cosine Accuracy@3 - type: cosine_accuracy@5 value: 0.9 name: Cosine Accuracy@5 - type: cosine_accuracy@10 value: 0.93 name: Cosine Accuracy@10 - type: cosine_precision@1 value: 0.86 name: Cosine Precision@1 - type: cosine_precision@3 value: 0.85 name: Cosine Precision@3 - type: cosine_precision@5 value: 0.85 name: Cosine Precision@5 - type: cosine_precision@10 value: 0.45 name: Cosine Precision@10 - type: cosine_recall@1 value: 0.17199999999999996 name: Cosine Recall@1 - type: cosine_recall@3 value: 0.51 name: Cosine Recall@3 - type: cosine_recall@5 value: 0.85 name: Cosine Recall@5 - type: cosine_recall@10 value: 0.9 name: Cosine Recall@10 - type: cosine_ndcg@10 value: 0.8789938349894767 name: Cosine Ndcg@10 - type: cosine_mrr@10 value: 0.8805952380952381 name: Cosine Mrr@10 - type: cosine_map@100 value: 0.8726611807317667 name: Cosine Map@100 - task: type: information-retrieval name: Information Retrieval dataset: name: dim 128 type: dim_128 metrics: - type: cosine_accuracy@1 value: 0.84 name: Cosine Accuracy@1 - type: cosine_accuracy@3 value: 0.87 name: Cosine Accuracy@3 - type: cosine_accuracy@5 value: 0.88 name: Cosine Accuracy@5 - type: cosine_accuracy@10 value: 0.93 name: Cosine Accuracy@10 - type: cosine_precision@1 value: 0.84 name: Cosine Precision@1 - type: cosine_precision@3 value: 0.8366666666666667 name: Cosine Precision@3 - type: cosine_precision@5 value: 0.8419999999999999 name: Cosine Precision@5 - type: cosine_precision@10 value: 0.455 name: Cosine Precision@10 - type: 
cosine_recall@1 value: 0.16799999999999998 name: Cosine Recall@1 - type: cosine_recall@3 value: 0.502 name: Cosine Recall@3 - type: cosine_recall@5 value: 0.8419999999999999 name: Cosine Recall@5 - type: cosine_recall@10 value: 0.91 name: Cosine Recall@10 - type: cosine_ndcg@10 value: 0.8777095006932575 name: Cosine Ndcg@10 - type: cosine_mrr@10 value: 0.8630000000000001 name: Cosine Mrr@10 - type: cosine_map@100 value: 0.8661619081282643 name: Cosine Map@100 - task: type: information-retrieval name: Information Retrieval dataset: name: dim 64 type: dim_64 metrics: - type: cosine_accuracy@1 value: 0.78 name: Cosine Accuracy@1 - type: cosine_accuracy@3 value: 0.81 name: Cosine Accuracy@3 - type: cosine_accuracy@5 value: 0.81 name: Cosine Accuracy@5 - type: cosine_accuracy@10 value: 0.93 name: Cosine Accuracy@10 - type: cosine_precision@1 value: 0.78 name: Cosine Precision@1 - type: cosine_precision@3 value: 0.7866666666666667 name: Cosine Precision@3 - type: cosine_precision@5 value: 0.7859999999999999 name: Cosine Precision@5 - type: cosine_precision@10 value: 0.44799999999999995 name: Cosine Precision@10 - type: cosine_recall@1 value: 0.15599999999999997 name: Cosine Recall@1 - type: cosine_recall@3 value: 0.472 name: Cosine Recall@3 - type: cosine_recall@5 value: 0.7859999999999999 name: Cosine Recall@5 - type: cosine_recall@10 value: 0.8959999999999999 name: Cosine Recall@10 - type: cosine_ndcg@10 value: 0.8445404597381452 name: Cosine Ndcg@10 - type: cosine_mrr@10 value: 0.8120634920634922 name: Cosine Mrr@10 - type: cosine_map@100 value: 0.8308457034802883 name: Cosine Map@100
---

# codeBert Base

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
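The metadata above reports retrieval metrics at 768, 512, 256, 128 and 64 dimensions, which matches the MatryoshkaLoss configuration listed under Training Details. One common way to use such embeddings at a smaller size is to keep the leading components and re-normalize before computing cosine similarity. The sketch below illustrates that idea in plain Python on made-up vectors. `truncate_and_normalize` and `cosine_similarity` are hypothetical helper names, not part of the Sentence Transformers API, and the toy values are not model outputs.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 8-dimensional "embeddings" (illustrative values, not model outputs).
query = [0.9, 0.1, 0.3, 0.0, 0.2, 0.1, 0.0, 0.4]
code_a = [0.8, 0.2, 0.4, 0.1, 0.1, 0.0, 0.1, 0.3]
code_b = [0.0, 0.9, 0.1, 0.7, 0.0, 0.5, 0.6, 0.0]

# Compare the ranking of the two candidates at several truncation sizes.
for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    a = truncate_and_normalize(code_a, dim)
    b = truncate_and_normalize(code_b, dim)
    print(dim, round(cosine_similarity(q, a), 4), round(cosine_similarity(q, b), 4))
```

The `truncate_dim` values listed in the evaluation sections appear to correspond to this kind of truncation, applied to the model's 768-dimensional output.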
## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** en
- **License:** apache-2.0

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("killdollar/codebert-embed-base-dense-retriever")
# Run inference
sentences = [
    'How does __init__ work in Python?',
    'def __init__(\n self,\n encoding_name: str = "gpt2",\n model_name: str | None = None,\n allowed_special: Literal["all"] | AbstractSet[str] = set(),\n disallowed_special: Literal["all"] | Collection[str] = "all",\n **kwargs: Any,\n ) -> None:\n """Create a new `TextSplitter`.\n\n Args:\n encoding_name: The name of the tiktoken encoding to use.\n model_name: The name of the model to use. If provided, this will\n override the `encoding_name`.\n allowed_special: Special tokens that are allowed during encoding.\n disallowed_special: Special tokens that are disallowed during encoding.\n\n Raises:\n ImportError: If the tiktoken package is not installed.\n """\n super().__init__(**kwargs)\n if not _HAS_TIKTOKEN:\n msg = (\n "Could not import tiktoken python package. "\n "This is needed in order to for TokenTextSplitter. "\n "Please install it with `pip install tiktoken`."\n )\n raise ImportError(msg)\n\n if model_name is not None:\n enc = tiktoken.encoding_for_model(model_name)\n else:\n enc = tiktoken.get_encoding(encoding_name)\n self._tokenizer = enc\n self._allowed_special = allowed_special\n self._disallowed_special = disallowed_special',
    'def test_fixed_message_response_when_docs_found() -> None:\n fixed_resp = "I don\'t know"\n answer = "I know the answer!"\n llm = FakeListLLM(responses=[answer])\n retriever = SequentialRetriever(\n sequential_responses=[[Document(page_content=answer)]],\n )\n memory = ConversationBufferMemory(\n k=1,\n output_key="answer",\n memory_key="chat_history",\n return_messages=True,\n )\n qa_chain = ConversationalRetrievalChain.from_llm(\n llm=llm,\n memory=memory,\n retriever=retriever,\n return_source_documents=True,\n rephrase_question=False,\n response_if_no_docs_found=fixed_resp,\n verbose=True,\n )\n got = qa_chain("What is the answer?")\n assert got["chat_history"][1].content == answer\n assert got["answer"] == answer',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.7336, 0.0979],
#         [0.7336, 1.0000, 0.1742],
#         [0.0979, 0.1742, 1.0000]])
```

## Evaluation

### Metrics

#### Information Retrieval

* Dataset: `dim_768`
* Evaluated with
[InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
  ```json
  {
      "truncate_dim": 768
  }
  ```

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.83       |
| cosine_accuracy@3   | 0.85       |
| cosine_accuracy@5   | 0.86       |
| cosine_accuracy@10  | 0.94       |
| cosine_precision@1  | 0.83       |
| cosine_precision@3  | 0.83       |
| cosine_precision@5  | 0.83       |
| cosine_precision@10 | 0.453      |
| cosine_recall@1     | 0.166      |
| cosine_recall@3     | 0.498      |
| cosine_recall@5     | 0.83       |
| cosine_recall@10    | 0.906      |
| **cosine_ndcg@10**  | **0.8712** |
| cosine_mrr@10       | 0.8533     |
| cosine_map@100      | 0.8616     |

#### Information Retrieval

* Dataset: `dim_512`
* Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
  ```json
  {
      "truncate_dim": 512
  }
  ```

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.85       |
| cosine_accuracy@3   | 0.86       |
| cosine_accuracy@5   | 0.87       |
| cosine_accuracy@10  | 0.95       |
| cosine_precision@1  | 0.85       |
| cosine_precision@3  | 0.84       |
| cosine_precision@5  | 0.842      |
| cosine_precision@10 | 0.453      |
| cosine_recall@1     | 0.17       |
| cosine_recall@3     | 0.504      |
| cosine_recall@5     | 0.842      |
| cosine_recall@10    | 0.906      |
| **cosine_ndcg@10**  | **0.8776** |
| cosine_mrr@10       | 0.8699     |
| cosine_map@100      | 0.8693     |

#### Information Retrieval

* Dataset: `dim_256`
* Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
  ```json
  {
      "truncate_dim": 256
  }
  ```

| Metric              | Value     |
|:--------------------|:----------|
| cosine_accuracy@1   | 0.86      |
| cosine_accuracy@3   | 0.89      |
| cosine_accuracy@5   | 0.9       |
| cosine_accuracy@10  | 0.93      |
| cosine_precision@1  | 0.86      |
| cosine_precision@3  | 0.85      |
| cosine_precision@5  | 0.85      |
| cosine_precision@10 | 0.45      |
| cosine_recall@1     | 0.172     |
| cosine_recall@3     | 0.51      |
| cosine_recall@5     | 0.85      |
| cosine_recall@10    | 0.9       |
| **cosine_ndcg@10**  | **0.879** |
| cosine_mrr@10       | 0.8806    |
| cosine_map@100      | 0.8727    |

#### Information Retrieval

* Dataset: `dim_128`
* Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
  ```json
  {
      "truncate_dim": 128
  }
  ```

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.84       |
| cosine_accuracy@3   | 0.87       |
| cosine_accuracy@5   | 0.88       |
| cosine_accuracy@10  | 0.93       |
| cosine_precision@1  | 0.84       |
| cosine_precision@3  | 0.8367     |
| cosine_precision@5  | 0.842      |
| cosine_precision@10 | 0.455      |
| cosine_recall@1     | 0.168      |
| cosine_recall@3     | 0.502      |
| cosine_recall@5     | 0.842      |
| cosine_recall@10    | 0.91       |
| **cosine_ndcg@10**  | **0.8777** |
| cosine_mrr@10       | 0.863      |
| cosine_map@100      | 0.8662     |

#### Information Retrieval

* Dataset: `dim_64`
* Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) with these parameters:
  ```json
  {
      "truncate_dim": 64
  }
  ```

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.78       |
| cosine_accuracy@3   | 0.81       |
| cosine_accuracy@5   | 0.81       |
| cosine_accuracy@10  | 0.93       |
| cosine_precision@1  | 0.78       |
| cosine_precision@3  | 0.7867     |
| cosine_precision@5  | 0.786      |
| cosine_precision@10 | 0.448      |
| cosine_recall@1     | 0.156      |
| cosine_recall@3     | 0.472      |
| cosine_recall@5     | 0.786      |
| cosine_recall@10    | 0.896      |
| **cosine_ndcg@10**  | **0.8445** |
| cosine_mrr@10       | 0.8121     |
| cosine_map@100      | 0.8308     |

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 900 training
samples
* Columns: `anchor` and `positive`
* Approximate statistics based on the first 900 samples:
  |         | anchor | positive |
  |:--------|:-------|:---------|
  | type    | string | string   |
  | details |        |          |
* Samples:
  | anchor | positive |
  |:-------|:---------|
  | Explain the test_qdrant_similarity_search_with_relevance_scores logic | <code>def test_qdrant_similarity_search_with_relevance_scores(<br>batch_size: int,<br>content_payload_key: str,<br>metadata_payload_key: str,<br>vector_name: str \| None,<br>) -> None:<br>"""Test end to end construction and search."""<br>texts = ["foo", "bar", "baz"]<br>docsearch = Qdrant.from_texts(<br>texts,<br>ConsistentFakeEmbeddings(),<br>location=":memory:",<br>content_payload_key=content_payload_key,<br>metadata_payload_key=metadata_payload_key,<br>batch_size=batch_size,<br>vector_name=vector_name,<br>)<br>output = docsearch.similarity_search_with_relevance_scores("foo", k=3)<br><br>assert all(<br>(score <= 1 or np.isclose(score, 1)) and score >= 0 for _, score in output<br>)</code> |
  | How to implement LangChainPendingDeprecationWarning? | <code>class LangChainPendingDeprecationWarning(PendingDeprecationWarning):<br>"""A class for issuing deprecation warnings for LangChain users."""</code> |
  | Example usage of random_name | <code>def random_name() -> str:<br>"""Generate a random name."""<br>adjective = random.choice(adjectives) # noqa: S311<br>noun = random.choice(nouns) # noqa: S311<br>number = random.randint(1, 100) # noqa: S311<br>return f"{adjective}-{noun}-{number}"</code> |
* Loss: [MatryoshkaLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [
          768,
          512,
          256,
          128,
          64
      ],
      "matryoshka_weights": [
          1,
          1,
          1,
          1,
          1
      ],
      "n_dims_per_step": -1
  }
  ```

### Training Hyperparameters

#### Non-Default Hyperparameters

- `eval_strategy`: epoch
- `per_device_train_batch_size`: 4
- `per_device_eval_batch_size`: 4
- `gradient_accumulation_steps`: 16
- `learning_rate`: 2e-05
- `num_train_epochs`: 4
- `lr_scheduler_type`: cosine
- `warmup_ratio`: 0.1
- `fp16`: True
- `load_best_model_at_end`: True
- `optim`: adamw_torch
- `batch_sampler`: no_duplicates

#### All Hyperparameters

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: epoch
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 4
- `per_device_eval_batch_size`: 4
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 16
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 4
- `max_steps`: -1
- `lr_scheduler_type`: cosine
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: True
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `project`: huggingface
- `trackio_space_id`: trackio
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: no
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: True
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {}
- `learning_rate_mapping`: {}

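The list above sets `learning_rate` to 2e-05, `lr_scheduler_type` to cosine and `warmup_ratio` to 0.1, and the training logs report 60 optimizer steps in total. As a rough sketch of what such a schedule looks like, the function below computes a linear warmup followed by a half-cosine decay. It assumes the warmup length is `ceil(warmup_ratio * total_steps)` and that the decay reaches zero at the final step. `cosine_lr_with_warmup` is a hypothetical helper written for illustration, not the trainer's own code, and the actual scheduler may differ in details.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at a given optimizer step (0-indexed).

    Assumption: linear warmup over ceil(warmup_ratio * total_steps) steps,
    then a half-cosine decay from base_lr down to 0 at total_steps.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed so far, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 60      # final step in the training logs
BASE_LR = 2e-05       # `learning_rate`
WARMUP_RATIO = 0.1    # `warmup_ratio`

for step in (0, 3, 6, 33, 60):
    lr = cosine_lr_with_warmup(step, TOTAL_STEPS, BASE_LR, WARMUP_RATIO)
    print(f"step {step:2d}: lr = {lr:.3e}")
```

Under these assumptions the warmup covers the first 6 optimizer steps, so the peak learning rate is reached before the first epoch (15 steps) is halfway through.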
### Training Logs

| Epoch   | Step   | Training Loss | dim_768_cosine_ndcg@10 | dim_512_cosine_ndcg@10 | dim_256_cosine_ndcg@10 | dim_128_cosine_ndcg@10 | dim_64_cosine_ndcg@10 |
|:-------:|:------:|:-------------:|:----------------------:|:----------------------:|:----------------------:|:----------------------:|:---------------------:|
| 0.7111  | 10     | 6.8447        | -                      | -                      | -                      | -                      | -                     |
| 1.0     | 15     | -             | 0.1025                 | 0.0367                 | 0.0548                 | 0.0502                 | 0.1185                |
| 0.7111  | 10     | 4.8545        | -                      | -                      | -                      | -                      | -                     |
| 1.0     | 15     | -             | 0.2250                 | 0.3047                 | 0.2895                 | 0.2892                 | 0.3178                |
| 0.7111  | 10     | 1.9011        | -                      | -                      | -                      | -                      | -                     |
| 1.0     | 15     | -             | 0.6530                 | 0.6393                 | 0.6269                 | 0.6631                 | 0.6658                |
| 1.3556  | 20     | 0.6349        | -                      | -                      | -                      | -                      | -                     |
| 2.0     | 30     | 0.1887        | 0.8480                 | 0.8643                 | 0.8641                 | 0.8532                 | 0.7974                |
| 2.7111  | 40     | 0.0959        | -                      | -                      | -                      | -                      | -                     |
| 3.0     | 45     | -             | 0.8688                 | 0.8774                 | 0.8754                 | 0.8725                 | 0.8457                |
| 3.3556  | 50     | 0.0359        | -                      | -                      | -                      | -                      | -                     |
| **4.0** | **60** | **0.0515**    | **0.8712**             | **0.8776**             | **0.879**              | **0.8777**             | **0.8445**            |

* The bold row denotes the saved checkpoint.

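The Epoch and Step columns above can be related through the hyperparameters: 900 training samples at `per_device_train_batch_size` 4 give 225 batches per epoch, and with `gradient_accumulation_steps` 16 that makes 15 optimizer steps per epoch, the last of which covers a single leftover batch. The sketch below reproduces the logged epoch values under these assumptions (a single device, no dropped last batch, and accumulation restarting at each epoch boundary). `epoch_at_step` is a hypothetical helper for illustration, not a library function.

```python
import math

NUM_SAMPLES = 900        # training set size
BATCH_SIZE = 4           # `per_device_train_batch_size`
GRAD_ACCUM_STEPS = 16    # `gradient_accumulation_steps`

# Assumption: one device and dataloader_drop_last=False.
BATCHES_PER_EPOCH = math.ceil(NUM_SAMPLES / BATCH_SIZE)
STEPS_PER_EPOCH = math.ceil(BATCHES_PER_EPOCH / GRAD_ACCUM_STEPS)


def epoch_at_step(step):
    """Fractional epoch reached after `step` optimizer steps.

    Assumption: every optimizer step consumes GRAD_ACCUM_STEPS batches,
    except the last step of an epoch, which takes whatever is left.
    """
    full_epochs, remainder = divmod(step, STEPS_PER_EPOCH)
    return full_epochs + remainder * GRAD_ACCUM_STEPS / BATCHES_PER_EPOCH


print("batches per epoch:", BATCHES_PER_EPOCH)
print("optimizer steps per epoch:", STEPS_PER_EPOCH)
for step in (10, 15, 20, 30, 40, 45, 50, 60):
    print(step, round(epoch_at_step(step), 4))
```

With an effective batch of 4 × 16 = 64 samples per optimizer step, four epochs come to 60 steps, which matches the final row of the table.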
### Framework Versions

- Python: 3.12.12
- Sentence Transformers: 5.2.0
- Transformers: 4.57.3
- PyTorch: 2.9.0+cu126
- Accelerate: 1.12.0
- Datasets: 4.0.0
- Tokenizers: 0.22.2

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```