language:
  - en
license: apache-2.0
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:180
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: shubharuidas/codebert-embed-base-dense-retriever
widget:
  - source_sentence: Explain the tool1 logic
    sentences:
      - |-
        def stream(
                self,
                thread_id: str,
                assistant_id: str,
                *,
                input: Input | None = None,
                command: Command | None = None,
                stream_mode: StreamMode | Sequence[StreamMode] = "values",
                stream_subgraphs: bool = False,
                stream_resumable: bool = False,
                metadata: Mapping[str, Any] | None = None,
                config: Config | None = None,
                context: Context | None = None,
                checkpoint: Checkpoint | None = None,
                checkpoint_id: str | None = None,
                checkpoint_during: bool | None = None,
                interrupt_before: All | Sequence[str] | None = None,
                interrupt_after: All | Sequence[str] | None = None,
                feedback_keys: Sequence[str] | None = None,
                on_disconnect: DisconnectMode | None = None,
                webhook: str | None = None,
                multitask_strategy: MultitaskStrategy | None = None,
                if_not_exists: IfNotExists | None = None,
                after_seconds: int | None = None,
                headers: Mapping[str, str] | None = None,
                params: QueryParamTypes | None = None,
                on_run_created: Callable[[RunCreateMetadata], None] | None = None,
            ) -> AsyncIterator[StreamPart]: ...
      - |-
        def tool1(some_val: int, some_other_val: str) -> str:
            """Tool 1 docstring."""
            if some_val == 0:
                msg = "Test error"
                raise ValueError(msg)
            return f"{some_val} - {some_other_val}"
      - |-
        class IndexConfig(TypedDict, total=False):
            """Configuration for indexing documents for semantic search in the store.

            If not provided to the store, the store will not support vector search.
            In that case, all `index` arguments to `put()` and `aput()` operations will be ignored.
            """

            dims: int
            """Number of dimensions in the embedding vectors.
            
            Common embedding models have the following dimensions:
                - `openai:text-embedding-3-large`: `3072`
                - `openai:text-embedding-3-small`: `1536`
                - `openai:text-embedding-ada-002`: `1536`
                - `cohere:embed-english-v3.0`: `1024`
                - `cohere:embed-english-light-v3.0`: `384`
                - `cohere:embed-multilingual-v3.0`: `1024`
                - `cohere:embed-multilingual-light-v3.0`: `384`
            """

            embed: Embeddings | EmbeddingsFunc | AEmbeddingsFunc | str
            """Optional function to generate embeddings from text.
            
            Can be specified in four ways:
                1. A LangChain `Embeddings` instance
                2. A synchronous embedding function (`EmbeddingsFunc`)
                3. An asynchronous embedding function (`AEmbeddingsFunc`)
                4. A provider string (e.g., `"openai:text-embedding-3-small"`)
            
            ???+ example "Examples"

                Using LangChain's initialization with `InMemoryStore`:

                ```python
                from langchain.embeddings import init_embeddings
                from langgraph.store.memory import InMemoryStore
                
                store = InMemoryStore(
                    index={
                        "dims": 1536,
                        "embed": init_embeddings("openai:text-embedding-3-small")
                    }
                )
                ```
                
                Using a custom embedding function with `InMemoryStore`:

                ```python
                from openai import OpenAI
                from langgraph.store.memory import InMemoryStore
                
                client = OpenAI()
                
                def embed_texts(texts: list[str]) -> list[list[float]]:
                    response = client.embeddings.create(
                        model="text-embedding-3-small",
                        input=texts
                    )
                    return [e.embedding for e in response.data]
                    
                store = InMemoryStore(
                    index={
                        "dims": 1536,
                        "embed": embed_texts
                    }
                )
                ```
                
                Using an asynchronous embedding function with `InMemoryStore`:

                ```python
                from openai import AsyncOpenAI
                from langgraph.store.memory import InMemoryStore
                
                client = AsyncOpenAI()
                
                async def aembed_texts(texts: list[str]) -> list[list[float]]:
                    response = await client.embeddings.create(
                        model="text-embedding-3-small",
                        input=texts
                    )
                    return [e.embedding for e in response.data]
                    
                store = InMemoryStore(
                    index={
                        "dims": 1536,
                        "embed": aembed_texts
                    }
                )
                ```
            """

            fields: list[str] | None
            """Fields to extract text from for embedding generation.
            
            Controls which parts of stored items are embedded for semantic search. Follows JSON path syntax:

            - `["$"]`: Embeds the entire JSON object as one vector  (default)
            - `["field1", "field2"]`: Embeds specific top-level fields
            - `["parent.child"]`: Embeds nested fields using dot notation
            - `["array[*].field"]`: Embeds field from each array element separately
            
            Note:
                You can always override this behavior when storing an item using the
                `index` parameter in the `put` or `aput` operations.
            
            ???+ example "Examples"

                ```python
                # Embed entire document (default)
                fields=["$"]
                
                # Embed specific fields
                fields=["text", "summary"]
                
                # Embed nested fields
                fields=["metadata.title", "content.body"]
                
                # Embed from arrays
                fields=["messages[*].content"]  # Each message content separately
                fields=["context[0].text"]      # First context item's text
                ```
            
            Note:
                - Fields missing from a document are skipped
                - Array notation creates separate embeddings for each element
                - Complex nested paths are supported (e.g., `"a.b[*].c.d"`)
            """
  - source_sentence: Explain the UpdateType logic
    sentences:
      - |-
        def test_subgraph_checkpoint_true(
            sync_checkpointer: BaseCheckpointSaver, durability: Durability
        ) -> None:
            class InnerState(TypedDict):
                my_key: Annotated[str, operator.add]
                my_other_key: str

            def inner_1(state: InnerState):
                return {"my_key": " got here", "my_other_key": state["my_key"]}

            def inner_2(state: InnerState):
                return {"my_key": " and there"}

            inner = StateGraph(InnerState)
            inner.add_node("inner_1", inner_1)
            inner.add_node("inner_2", inner_2)
            inner.add_edge("inner_1", "inner_2")
            inner.set_entry_point("inner_1")
            inner.set_finish_point("inner_2")

            class State(TypedDict):
                my_key: str

            graph = StateGraph(State)
            graph.add_node("inner", inner.compile(checkpointer=True))
            graph.add_edge(START, "inner")
            graph.add_conditional_edges(
                "inner", lambda s: "inner" if s["my_key"].count("there") < 2 else END
            )
            app = graph.compile(checkpointer=sync_checkpointer)

            config = {"configurable": {"thread_id": "2"}}
            assert [
                c
                for c in app.stream(
                    {"my_key": ""}, config, subgraphs=True, durability=durability
                )
            ] == [
                (("inner",), {"inner_1": {"my_key": " got here", "my_other_key": ""}}),
                (("inner",), {"inner_2": {"my_key": " and there"}}),
                ((), {"inner": {"my_key": " got here and there"}}),
                (
                    ("inner",),
                    {
                        "inner_1": {
                            "my_key": " got here",
                            "my_other_key": " got here and there got here and there",
                        }
                    },
                ),
                (("inner",), {"inner_2": {"my_key": " and there"}}),
                (
                    (),
                    {
                        "inner": {
                            "my_key": " got here and there got here and there got here and there"
                        }
                    },
                ),
            ]

            checkpoints = list(app.get_state_history(config))
            if durability != "exit":
                assert len(checkpoints) == 4
            else:
                assert len(checkpoints) == 1
      - |-
        def is_available(self) -> bool:
                return self.value is not MISSING
      - |-
        def UpdateType(self) -> type[Value]:
                """The type of the update received by the channel."""
                return self.typ
  - source_sentence: |-
      Example usage of ToolOutputMixin:  # type: ignore[no-redef]
              pass
    sentences:
      - >-
        def task(__func_or_none__: Callable[P, Awaitable[T]]) ->
        _TaskFunction[P, T]: ...
      - |-
        def test_graph_with_jitter_retry_policy():
            """Test a graph with a RetryPolicy that uses jitter."""

            class State(TypedDict):
                foo: str

            attempt_count = 0

            def failing_node(state):
                nonlocal attempt_count
                attempt_count += 1
                if attempt_count < 2:  # Fail the first attempt
                    raise ValueError("Intentional failure")
                return {"foo": "success"}

            # Create a retry policy with jitter enabled
            retry_policy = RetryPolicy(
                max_attempts=3,
                initial_interval=0.01,
                jitter=True,  # Enable jitter for randomized backoff
                retry_on=ValueError,
            )

            # Create and compile the graph
            graph = (
                StateGraph(State)
                .add_node("failing_node", failing_node, retry_policy=retry_policy)
                .add_edge(START, "failing_node")
                .compile()
            )

            # Test graph execution with mocked random and sleep
            with (
                patch("random.uniform", return_value=0.05) as mock_random,
                patch("time.sleep") as mock_sleep,
            ):
                result = graph.invoke({"foo": ""})

            # Verify retry behavior
            assert attempt_count == 2  # The node should have been tried twice
            assert result["foo"] == "success"

            # Verify jitter was applied
            mock_random.assert_called_with(0, 1)  # Jitter should use random.uniform(0, 1)
            mock_sleep.assert_called_with(0.01 + 0.05)
      - |-
        class ToolOutputMixin:  # type: ignore[no-redef]
                pass
  - source_sentence: Best practices for async test_async_entrypoint_without_checkpointer
    sentences:
      - |-
        def __init__(
                self,
                assistant_id: str,  # graph_id
                /,
                *,
                url: str | None = None,
                api_key: str | None = None,
                headers: dict[str, str] | None = None,
                client: LangGraphClient | None = None,
                sync_client: SyncLangGraphClient | None = None,
                config: RunnableConfig | None = None,
                name: str | None = None,
                distributed_tracing: bool = False,
            ):
                """Specify `url`, `api_key`, and/or `headers` to create default sync and async clients.

                If `client` or `sync_client` are provided, they will be used instead of the default clients.
                See `LangGraphClient` and `SyncLangGraphClient` for details on the default clients. At least
                one of `url`, `client`, or `sync_client` must be provided.

                Args:
                    assistant_id: The assistant ID or graph name of the remote graph to use.
                    url: The URL of the remote API.
                    api_key: The API key to use for authentication. If not provided, it will be read from the environment (`LANGGRAPH_API_KEY`, `LANGSMITH_API_KEY`, or `LANGCHAIN_API_KEY`).
                    headers: Additional headers to include in the requests.
                    client: A `LangGraphClient` instance to use instead of creating a default client.
                    sync_client: A `SyncLangGraphClient` instance to use instead of creating a default client.
                    config: An optional `RunnableConfig` instance with additional configuration.
                    name: Human-readable name to attach to the RemoteGraph instance.
                        This is useful for adding `RemoteGraph` as a subgraph via `graph.add_node(remote_graph)`.
                        If not provided, defaults to the assistant ID.
                    distributed_tracing: Whether to enable sending LangSmith distributed tracing headers.
                """
                self.assistant_id = assistant_id
                if name is None:
                    self.name = assistant_id
                else:
                    self.name = name
                self.config = config
                self.distributed_tracing = distributed_tracing

                if client is None and url is not None:
                    client = get_client(url=url, api_key=api_key, headers=headers)
                self.client = client

                if sync_client is None and url is not None:
                    sync_client = get_sync_client(url=url, api_key=api_key, headers=headers)
                self.sync_client = sync_client
      - |-
        async def test_async_entrypoint_without_checkpointer() -> None:
            """Test no checkpointer."""
            states = []
            config = {"configurable": {"thread_id": "1"}}

            # Test without previous
            @entrypoint()
            async def foo(inputs: Any) -> Any:
                states.append(inputs)
                return inputs

            assert (await foo.ainvoke({"a": "1"}, config)) == {"a": "1"}

            @entrypoint()
            async def foo(inputs: Any, *, previous: Any) -> Any:
                states.append(previous)
                return {"previous": previous, "current": inputs}

            assert (await foo.ainvoke({"a": "1"}, config)) == {
                "current": {"a": "1"},
                "previous": None,
            }
            assert (await foo.ainvoke({"a": "1"}, config)) == {
                "current": {"a": "1"},
                "previous": None,
            }
      - |-
        class _InjectedStatePydanticV2Schema(BaseModel):
            messages: list
            foo: str
  - source_sentence: Explain the validate_autoresponse logic
    sentences:
      - |-
        def task_path_str(tup: str | int | tuple) -> str:
            """Generate a string representation of the task path."""
            return (
                f"~{', '.join(task_path_str(x) for x in tup)}"
                if isinstance(tup, (tuple, list))
                else f"{tup:010d}"
                if isinstance(tup, int)
                else str(tup)
            )
      - |-
        def ValueType(self) -> type[Value]:
                """The type of the value stored in the channel."""
                return self.typ
      - |-
        def validate_autoresponse(cls, v):
                    if v is not None and not isinstance(v, dict):
                        raise TypeError("autoresponse must be a dict or None")
                    return v
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: codeBert dense retriever
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 768
          type: dim_768
        metrics:
          - type: cosine_accuracy@1
            value: 0.65
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.8
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.85
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.65
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.2666666666666666
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.17000000000000007
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.10000000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.65
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.8
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.85
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 1
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.8047507161733674
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.7455555555555555
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.7455555555555555
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 512
          type: dim_512
        metrics:
          - type: cosine_accuracy@1
            value: 0.7
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.75
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.85
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.95
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.7
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.24999999999999994
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.17000000000000007
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09500000000000001
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.7
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.75
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.85
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.95
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7959488813947496
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.7499999999999999
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.7545454545454545
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 256
          type: dim_256
        metrics:
          - type: cosine_accuracy@1
            value: 0.65
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.75
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.8
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.95
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.65
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.24999999999999994
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.16000000000000006
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09500000000000001
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.65
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.75
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.8
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.95
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7682506698908595
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.7141666666666666
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.7180128205128204
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 128
          type: dim_128
        metrics:
          - type: cosine_accuracy@1
            value: 0.6
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.75
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.9
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.9
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.6
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.24999999999999994
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.18000000000000005
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09000000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.6
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.75
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.9
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.9
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7417655963056966
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.6908333333333333
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.6987121212121211
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 64
          type: dim_64
        metrics:
          - type: cosine_accuracy@1
            value: 0.55
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.7
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.75
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.95
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.55
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.2333333333333333
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.15000000000000005
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09500000000000001
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.55
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.7
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.75
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.95
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7155704014087189
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.6454166666666665
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.647202380952381
            name: Cosine Map@100

codeBert dense retriever

This is a sentence-transformers model finetuned from shubharuidas/codebert-embed-base-dense-retriever. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: shubharuidas/codebert-embed-base-dense-retriever
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("anaghaj111/codebert-base-code-embed-mrl-langchain-langgraph")
# Run inference
sentences = [
    'Explain the validate_autoresponse logic',
    'def validate_autoresponse(cls, v):\n            if v is not None and not isinstance(v, dict):\n                raise TypeError("autoresponse must be a dict or None")\n            return v',
    'def task_path_str(tup: str | int | tuple) -> str:\n    """Generate a string representation of the task path."""\n    return (\n        f"~{\', \'.join(task_path_str(x) for x in tup)}"\n        if isinstance(tup, (tuple, list))\n        else f"{tup:010d}"\n        if isinstance(tup, int)\n        else str(tup)\n    )',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.8070, 0.2282],
#         [0.8070, 1.0000, 0.3158],
#         [0.2282, 0.3158, 1.0000]])
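Because the model was trained with MatryoshkaLoss over the dimensions [768, 512, 256, 128, 64], its embeddings can be truncated to a leading prefix and re-normalized, trading some accuracy for smaller vectors. A minimal numpy sketch of the principle (the vectors below are random placeholders standing in for `model.encode(...)` output, not real embeddings):

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-apply L2 normalization."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Placeholder embeddings in place of model.encode(sentences)
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768))

small = truncate_and_normalize(full, 256)
print(small.shape)  # (3, 256)
# After re-normalization, cosine similarity is a plain dot product
sims = small @ small.T
print(np.allclose(np.diag(sims), 1.0))  # True
```

Recent versions of Sentence Transformers also expose a `truncate_dim` argument on the `SentenceTransformer` constructor that performs this truncation for you.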

Evaluation

Metrics

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.65
cosine_accuracy@3 0.8
cosine_accuracy@5 0.85
cosine_accuracy@10 1.0
cosine_precision@1 0.65
cosine_precision@3 0.2667
cosine_precision@5 0.17
cosine_precision@10 0.1
cosine_recall@1 0.65
cosine_recall@3 0.8
cosine_recall@5 0.85
cosine_recall@10 1.0
cosine_ndcg@10 0.8048
cosine_mrr@10 0.7456
cosine_map@100 0.7456

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.7
cosine_accuracy@3 0.75
cosine_accuracy@5 0.85
cosine_accuracy@10 0.95
cosine_precision@1 0.7
cosine_precision@3 0.25
cosine_precision@5 0.17
cosine_precision@10 0.095
cosine_recall@1 0.7
cosine_recall@3 0.75
cosine_recall@5 0.85
cosine_recall@10 0.95
cosine_ndcg@10 0.7959
cosine_mrr@10 0.75
cosine_map@100 0.7545

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.65
cosine_accuracy@3 0.75
cosine_accuracy@5 0.8
cosine_accuracy@10 0.95
cosine_precision@1 0.65
cosine_precision@3 0.25
cosine_precision@5 0.16
cosine_precision@10 0.095
cosine_recall@1 0.65
cosine_recall@3 0.75
cosine_recall@5 0.8
cosine_recall@10 0.95
cosine_ndcg@10 0.7683
cosine_mrr@10 0.7142
cosine_map@100 0.718

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.6
cosine_accuracy@3 0.75
cosine_accuracy@5 0.9
cosine_accuracy@10 0.9
cosine_precision@1 0.6
cosine_precision@3 0.25
cosine_precision@5 0.18
cosine_precision@10 0.09
cosine_recall@1 0.6
cosine_recall@3 0.75
cosine_recall@5 0.9
cosine_recall@10 0.9
cosine_ndcg@10 0.7418
cosine_mrr@10 0.6908
cosine_map@100 0.6987

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.55
cosine_accuracy@3 0.7
cosine_accuracy@5 0.75
cosine_accuracy@10 0.95
cosine_precision@1 0.55
cosine_precision@3 0.2333
cosine_precision@5 0.15
cosine_precision@10 0.095
cosine_recall@1 0.55
cosine_recall@3 0.7
cosine_recall@5 0.75
cosine_recall@10 0.95
cosine_ndcg@10 0.7156
cosine_mrr@10 0.6454
cosine_map@100 0.6472
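The accuracy@k and MRR@10 figures above come from the rank at which each query's relevant document is retrieved. A self-contained sketch of these two metrics, using made-up ranks rather than this model's actual evaluation data:

```python
def accuracy_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose relevant document appears within the top k (1-indexed ranks)."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr_at_k(ranks: list[int], k: int) -> float:
    """Mean reciprocal rank, counting only hits within the top k."""
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the single relevant document for 4 queries
ranks = [1, 2, 5, 11]
print(accuracy_at_k(ranks, 1))   # 0.25
print(accuracy_at_k(ranks, 10))  # 0.75
print(mrr_at_k(ranks, 10))       # (1 + 0.5 + 0.2) / 4 = 0.425
```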

Training Details

Training Dataset

Unnamed Dataset

  • Size: 180 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 180 samples:
    anchor: string, min 6 / mean 14.07 / max 354 tokens
    positive: string, min 14 / mean 272.19 / max 512 tokens
  • Samples:
    anchor positive
    Best practices for test_search_items def test_search_items(
    fake_embeddings: CharacterEmbeddings,
    ) -> None:
    """Test search_items functionality by calling store methods directly."""
    base = "test_search_items"
    test_namespaces = [
    (base, "documents", "user1"),
    (base, "documents", "user2"),
    (base, "reports", "department1"),
    (base, "reports", "department2"),
    ]
    test_items = [
    {"title": "Doc 1", "author": "John Doe", "tags": ["important"]},
    {"title": "Doc 2", "author": "Jane Smith", "tags": ["draft"]},
    {"title": "Report A", "author": "John Doe", "tags": ["final"]},
    {"title": "Report B", "author": "Alice Johnson", "tags": ["draft"]},
    ]

    with create_vector_store(
    fake_embeddings, text_fields=["key0", "key1", "key3"]
    ) as store:
    # Insert test data
    for ns, item in zip(test_namespaces, test_items, strict=False):
    key = f"item_{ns[-1]}"
    store.put(ns, key, item)

    # 1. Search documen...
    How does async store work in Python? async def store(request) -> AsyncIterator[AsyncPostgresStore]:
    database = f"test_{uuid.uuid4().hex[:16]}"
    uri_parts = DEFAULT_URI.split("/")
    uri_base = "/".join(uri_parts[:-1])
    query_params = ""
    if "?" in uri_parts[-1]:
    db_name, query_params = uri_parts[-1].split("?", 1)
    query_params = "?" + query_params

    conn_string = f"{uri_base}/{database}{query_params}"
    admin_conn_string = DEFAULT_URI
    ttl_config = {
    "default_ttl": TTL_MINUTES,
    "refresh_on_read": True,
    "sweep_interval_minutes": TTL_MINUTES / 2,
    }
    async with await AsyncConnection.connect(
    admin_conn_string, autocommit=True
    ) as conn:
    await conn.execute(f"CREATE DATABASE {database}")
    try:
    async with AsyncPostgresStore.from_conn_string(
    conn_string, ttl=ttl_config
    ) as store:
    store.MIGRATIONS = [
    (
    mig.replace("ttl_minutes INT;", "ttl_minutes FLOAT;")
    ...
    How to implement list? def list(
    self,
    config: RunnableConfig | None,
    *,
    filter: dict[str, Any] | None = None,
    before: RunnableConfig | None = None,
    limit: int | None = None,
    ) -> Iterator[CheckpointTuple]:
    """List checkpoints from the database.

    This method retrieves a list of checkpoint tuples from the Postgres database based
    on the provided config. For ShallowPostgresSaver, this method returns a list with
    ONLY the most recent checkpoint.
    """
    aiter_ = self.alist(config, filter=filter, before=before, limit=limit)
    while True:
    try:
    yield asyncio.run_coroutine_threadsafe(
    anext(aiter_), # type: ignore[arg-type] # noqa: F821
    self.loop,
    ).result()
    except StopAsyncIteration:
    break
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
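MatryoshkaLoss trains the leading sub-vectors of each embedding to be usable on their own, so at inference time an embedding can be truncated to any of the dims listed above and renormalized. A minimal sketch of that truncation step, with random vectors standing in for model outputs:

```python
import numpy as np

def truncate_and_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates, then L2-normalize so cosine
    similarity reduces to a plain dot product at the smaller dimension."""
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(2, 768))          # stand-ins for 768-dim model outputs
for dim in (768, 512, 256, 128, 64):      # the matryoshka_dims trained above
    a, b = truncate_and_normalize(full, dim)
    print(dim, float(a @ b))              # cosine similarity at each dim
```

The dim_512 through dim_64 evaluation columns below are produced exactly this way: the same embeddings, truncated and renormalized before scoring.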
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 2
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • warmup_steps: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • optim: adamw_torch
  • batch_sampler: no_duplicates
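The non-default values above correspond to a `SentenceTransformerTrainingArguments` configuration roughly like the following sketch (the output path is a placeholder, and `save_strategy="epoch"` is added here because `load_best_model_at_end=True` requires the save and eval strategies to match):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder path
    eval_strategy="epoch",
    save_strategy="epoch",  # must match eval_strategy for load_best_model_at_end
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    optim="adamw_torch",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```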

All Hyperparameters

  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: None
  • warmup_ratio: 0.1
  • warmup_steps: 0.1
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • enable_jit_checkpoint: False
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • use_cpu: False
  • seed: 42
  • data_seed: None
  • bf16: False
  • fp16: True
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: -1
  • ddp_backend: None
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • auto_find_batch_size: False
  • full_determinism: False
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • use_cache: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
1.0 3 0.7612 0.7137 0.7083 0.6926 0.6624
2.0 6 0.8048 0.7959 0.7683 0.7418 0.7156
  • The epoch 2.0 row denotes the saved checkpoint.

Framework Versions

  • Python: 3.14.0
  • Sentence Transformers: 5.2.1
  • Transformers: 5.0.0
  • PyTorch: 2.10.0
  • Accelerate: 1.12.0
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}