language:
  - en
license: apache-2.0
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:180
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: shubharuidas/codebert-embed-base-dense-retriever
widget:
  - source_sentence: Explain the tool1 logic
    sentences:
      - |-
        def stream(
                self,
                thread_id: str,
                assistant_id: str,
                *,
                input: Input | None = None,
                command: Command | None = None,
                stream_mode: StreamMode | Sequence[StreamMode] = "values",
                stream_subgraphs: bool = False,
                stream_resumable: bool = False,
                metadata: Mapping[str, Any] | None = None,
                config: Config | None = None,
                context: Context | None = None,
                checkpoint: Checkpoint | None = None,
                checkpoint_id: str | None = None,
                checkpoint_during: bool | None = None,
                interrupt_before: All | Sequence[str] | None = None,
                interrupt_after: All | Sequence[str] | None = None,
                feedback_keys: Sequence[str] | None = None,
                on_disconnect: DisconnectMode | None = None,
                webhook: str | None = None,
                multitask_strategy: MultitaskStrategy | None = None,
                if_not_exists: IfNotExists | None = None,
                after_seconds: int | None = None,
                headers: Mapping[str, str] | None = None,
                params: QueryParamTypes | None = None,
                on_run_created: Callable[[RunCreateMetadata], None] | None = None,
            ) -> AsyncIterator[StreamPart]: ...
      - |-
        def tool1(some_val: int, some_other_val: str) -> str:
            """Tool 1 docstring."""
            if some_val == 0:
                msg = "Test error"
                raise ValueError(msg)
            return f"{some_val} - {some_other_val}"
      - |-
        class IndexConfig(TypedDict, total=False):
            """Configuration for indexing documents for semantic search in the store.

            If not provided to the store, the store will not support vector search.
            In that case, all `index` arguments to `put()` and `aput()` operations will be ignored.
            """

            dims: int
            """Number of dimensions in the embedding vectors.
            
            Common embedding models have the following dimensions:
                - `openai:text-embedding-3-large`: `3072`
                - `openai:text-embedding-3-small`: `1536`
                - `openai:text-embedding-ada-002`: `1536`
                - `cohere:embed-english-v3.0`: `1024`
                - `cohere:embed-english-light-v3.0`: `384`
                - `cohere:embed-multilingual-v3.0`: `1024`
                - `cohere:embed-multilingual-light-v3.0`: `384`
            """

            embed: Embeddings | EmbeddingsFunc | AEmbeddingsFunc | str
            """Optional function to generate embeddings from text.
            
            Can be specified in four ways:
                1. A LangChain `Embeddings` instance
                2. A synchronous embedding function (`EmbeddingsFunc`)
                3. An asynchronous embedding function (`AEmbeddingsFunc`)
                4. A provider string (e.g., `"openai:text-embedding-3-small"`)
            
            ???+ example "Examples"

                Using LangChain's initialization with `InMemoryStore`:

                ```python
                from langchain.embeddings import init_embeddings
                from langgraph.store.memory import InMemoryStore
                
                store = InMemoryStore(
                    index={
                        "dims": 1536,
                        "embed": init_embeddings("openai:text-embedding-3-small")
                    }
                )
                ```
                
                Using a custom embedding function with `InMemoryStore`:

                ```python
                from openai import OpenAI
                from langgraph.store.memory import InMemoryStore
                
                client = OpenAI()
                
                def embed_texts(texts: list[str]) -> list[list[float]]:
                    response = client.embeddings.create(
                        model="text-embedding-3-small",
                        input=texts
                    )
                    return [e.embedding for e in response.data]
                    
                store = InMemoryStore(
                    index={
                        "dims": 1536,
                        "embed": embed_texts
                    }
                )
                ```
                
                Using an asynchronous embedding function with `InMemoryStore`:

                ```python
                from openai import AsyncOpenAI
                from langgraph.store.memory import InMemoryStore
                
                client = AsyncOpenAI()
                
                async def aembed_texts(texts: list[str]) -> list[list[float]]:
                    response = await client.embeddings.create(
                        model="text-embedding-3-small",
                        input=texts
                    )
                    return [e.embedding for e in response.data]
                    
                store = InMemoryStore(
                    index={
                        "dims": 1536,
                        "embed": aembed_texts
                    }
                )
                ```
            """

            fields: list[str] | None
            """Fields to extract text from for embedding generation.
            
            Controls which parts of stored items are embedded for semantic search. Follows JSON path syntax:

            - `["$"]`: Embeds the entire JSON object as one vector  (default)
            - `["field1", "field2"]`: Embeds specific top-level fields
            - `["parent.child"]`: Embeds nested fields using dot notation
            - `["array[*].field"]`: Embeds field from each array element separately
            
            Note:
                You can always override this behavior when storing an item using the
                `index` parameter in the `put` or `aput` operations.
            
            ???+ example "Examples"

                ```python
                # Embed entire document (default)
                fields=["$"]
                
                # Embed specific fields
                fields=["text", "summary"]
                
                # Embed nested fields
                fields=["metadata.title", "content.body"]
                
                # Embed from arrays
                fields=["messages[*].content"]  # Each message content separately
                fields=["context[0].text"]      # First context item's text
                ```
            
            Note:
                - Fields missing from a document are skipped
                - Array notation creates separate embeddings for each element
                - Complex nested paths are supported (e.g., `"a.b[*].c.d"`)
            """
  - source_sentence: Explain the UpdateType logic
    sentences:
      - |-
        def test_subgraph_checkpoint_true(
            sync_checkpointer: BaseCheckpointSaver, durability: Durability
        ) -> None:
            class InnerState(TypedDict):
                my_key: Annotated[str, operator.add]
                my_other_key: str

            def inner_1(state: InnerState):
                return {"my_key": " got here", "my_other_key": state["my_key"]}

            def inner_2(state: InnerState):
                return {"my_key": " and there"}

            inner = StateGraph(InnerState)
            inner.add_node("inner_1", inner_1)
            inner.add_node("inner_2", inner_2)
            inner.add_edge("inner_1", "inner_2")
            inner.set_entry_point("inner_1")
            inner.set_finish_point("inner_2")

            class State(TypedDict):
                my_key: str

            graph = StateGraph(State)
            graph.add_node("inner", inner.compile(checkpointer=True))
            graph.add_edge(START, "inner")
            graph.add_conditional_edges(
                "inner", lambda s: "inner" if s["my_key"].count("there") < 2 else END
            )
            app = graph.compile(checkpointer=sync_checkpointer)

            config = {"configurable": {"thread_id": "2"}}
            assert [
                c
                for c in app.stream(
                    {"my_key": ""}, config, subgraphs=True, durability=durability
                )
            ] == [
                (("inner",), {"inner_1": {"my_key": " got here", "my_other_key": ""}}),
                (("inner",), {"inner_2": {"my_key": " and there"}}),
                ((), {"inner": {"my_key": " got here and there"}}),
                (
                    ("inner",),
                    {
                        "inner_1": {
                            "my_key": " got here",
                            "my_other_key": " got here and there got here and there",
                        }
                    },
                ),
                (("inner",), {"inner_2": {"my_key": " and there"}}),
                (
                    (),
                    {
                        "inner": {
                            "my_key": " got here and there got here and there got here and there"
                        }
                    },
                ),
            ]

            checkpoints = list(app.get_state_history(config))
            if durability != "exit":
                assert len(checkpoints) == 4
            else:
                assert len(checkpoints) == 1
      - |-
        def is_available(self) -> bool:
                return self.value is not MISSING
      - |-
        def UpdateType(self) -> type[Value]:
                """The type of the update received by the channel."""
                return self.typ
  - source_sentence: |-
      Example usage of ToolOutputMixin:  # type: ignore[no-redef]
              pass
    sentences:
      - >-
        def task(__func_or_none__: Callable[P, Awaitable[T]]) ->
        _TaskFunction[P, T]: ...
      - |-
        def test_graph_with_jitter_retry_policy():
            """Test a graph with a RetryPolicy that uses jitter."""

            class State(TypedDict):
                foo: str

            attempt_count = 0

            def failing_node(state):
                nonlocal attempt_count
                attempt_count += 1
                if attempt_count < 2:  # Fail the first attempt
                    raise ValueError("Intentional failure")
                return {"foo": "success"}

            # Create a retry policy with jitter enabled
            retry_policy = RetryPolicy(
                max_attempts=3,
                initial_interval=0.01,
                jitter=True,  # Enable jitter for randomized backoff
                retry_on=ValueError,
            )

            # Create and compile the graph
            graph = (
                StateGraph(State)
                .add_node("failing_node", failing_node, retry_policy=retry_policy)
                .add_edge(START, "failing_node")
                .compile()
            )

            # Test graph execution with mocked random and sleep
            with (
                patch("random.uniform", return_value=0.05) as mock_random,
                patch("time.sleep") as mock_sleep,
            ):
                result = graph.invoke({"foo": ""})

            # Verify retry behavior
            assert attempt_count == 2  # The node should have been tried twice
            assert result["foo"] == "success"

            # Verify jitter was applied
            mock_random.assert_called_with(0, 1)  # Jitter should use random.uniform(0, 1)
            mock_sleep.assert_called_with(0.01 + 0.05)
      - |-
        class ToolOutputMixin:  # type: ignore[no-redef]
                pass
  - source_sentence: Best practices for async test_async_entrypoint_without_checkpointer
    sentences:
      - |-
        def __init__(
                self,
                assistant_id: str,  # graph_id
                /,
                *,
                url: str | None = None,
                api_key: str | None = None,
                headers: dict[str, str] | None = None,
                client: LangGraphClient | None = None,
                sync_client: SyncLangGraphClient | None = None,
                config: RunnableConfig | None = None,
                name: str | None = None,
                distributed_tracing: bool = False,
            ):
                """Specify `url`, `api_key`, and/or `headers` to create default sync and async clients.

                If `client` or `sync_client` are provided, they will be used instead of the default clients.
                See `LangGraphClient` and `SyncLangGraphClient` for details on the default clients. At least
                one of `url`, `client`, or `sync_client` must be provided.

                Args:
                    assistant_id: The assistant ID or graph name of the remote graph to use.
                    url: The URL of the remote API.
                    api_key: The API key to use for authentication. If not provided, it will be read from the environment (`LANGGRAPH_API_KEY`, `LANGSMITH_API_KEY`, or `LANGCHAIN_API_KEY`).
                    headers: Additional headers to include in the requests.
                    client: A `LangGraphClient` instance to use instead of creating a default client.
                    sync_client: A `SyncLangGraphClient` instance to use instead of creating a default client.
                    config: An optional `RunnableConfig` instance with additional configuration.
                    name: Human-readable name to attach to the RemoteGraph instance.
                        This is useful for adding `RemoteGraph` as a subgraph via `graph.add_node(remote_graph)`.
                        If not provided, defaults to the assistant ID.
                    distributed_tracing: Whether to enable sending LangSmith distributed tracing headers.
                """
                self.assistant_id = assistant_id
                if name is None:
                    self.name = assistant_id
                else:
                    self.name = name
                self.config = config
                self.distributed_tracing = distributed_tracing

                if client is None and url is not None:
                    client = get_client(url=url, api_key=api_key, headers=headers)
                self.client = client

                if sync_client is None and url is not None:
                    sync_client = get_sync_client(url=url, api_key=api_key, headers=headers)
                self.sync_client = sync_client
      - |-
        async def test_async_entrypoint_without_checkpointer() -> None:
            """Test no checkpointer."""
            states = []
            config = {"configurable": {"thread_id": "1"}}

            # Test without previous
            @entrypoint()
            async def foo(inputs: Any) -> Any:
                states.append(inputs)
                return inputs

            assert (await foo.ainvoke({"a": "1"}, config)) == {"a": "1"}

            @entrypoint()
            async def foo(inputs: Any, *, previous: Any) -> Any:
                states.append(previous)
                return {"previous": previous, "current": inputs}

            assert (await foo.ainvoke({"a": "1"}, config)) == {
                "current": {"a": "1"},
                "previous": None,
            }
            assert (await foo.ainvoke({"a": "1"}, config)) == {
                "current": {"a": "1"},
                "previous": None,
            }
      - |-
        class _InjectedStatePydanticV2Schema(BaseModel):
            messages: list
            foo: str
  - source_sentence: Explain the validate_autoresponse logic
    sentences:
      - |-
        def task_path_str(tup: str | int | tuple) -> str:
            """Generate a string representation of the task path."""
            return (
                f"~{', '.join(task_path_str(x) for x in tup)}"
                if isinstance(tup, (tuple, list))
                else f"{tup:010d}"
                if isinstance(tup, int)
                else str(tup)
            )
      - |-
        def ValueType(self) -> type[Value]:
                """The type of the value stored in the channel."""
                return self.typ
      - |-
        def validate_autoresponse(cls, v):
                    if v is not None and not isinstance(v, dict):
                        raise TypeError("autoresponse must be a dict or None")
                    return v
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: codeBert dense retriever
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 768
          type: dim_768
        metrics:
          - type: cosine_accuracy@1
            value: 0.65
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.8
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.85
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.65
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.2666666666666666
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.17000000000000007
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.10000000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.65
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.8
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.85
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 1
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.8047507161733674
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.7455555555555555
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.7455555555555555
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 512
          type: dim_512
        metrics:
          - type: cosine_accuracy@1
            value: 0.7
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.75
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.85
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.95
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.7
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.24999999999999994
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.17000000000000007
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09500000000000001
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.7
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.75
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.85
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.95
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7959488813947496
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.7499999999999999
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.7545454545454545
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 256
          type: dim_256
        metrics:
          - type: cosine_accuracy@1
            value: 0.65
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.75
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.8
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.95
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.65
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.24999999999999994
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.16000000000000006
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09500000000000001
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.65
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.75
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.8
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.95
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7682506698908595
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.7141666666666666
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.7180128205128204
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 128
          type: dim_128
        metrics:
          - type: cosine_accuracy@1
            value: 0.6
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.75
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.9
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.9
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.6
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.24999999999999994
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.18000000000000005
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09000000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.6
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.75
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.9
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.9
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7417655963056966
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.6908333333333333
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.6987121212121211
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 64
          type: dim_64
        metrics:
          - type: cosine_accuracy@1
            value: 0.55
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.7
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.75
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.95
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.55
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.2333333333333333
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.15000000000000005
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09500000000000001
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.55
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.7
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.75
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.95
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7155704014087189
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.6454166666666665
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.647202380952381
            name: Cosine Map@100

codeBert dense retriever

This is a sentence-transformers model finetuned from shubharuidas/codebert-embed-base-dense-retriever. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: shubharuidas/codebert-embed-base-dense-retriever
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("anaghaj111/codebert-base-code-embed-mrl-langchain-langgraph")
# Run inference
sentences = [
    'Explain the validate_autoresponse logic',
    'def validate_autoresponse(cls, v):\n            if v is not None and not isinstance(v, dict):\n                raise TypeError("autoresponse must be a dict or None")\n            return v',
    'def task_path_str(tup: str | int | tuple) -> str:\n    """Generate a string representation of the task path."""\n    return (\n        f"~{\', \'.join(task_path_str(x) for x in tup)}"\n        if isinstance(tup, (tuple, list))\n        else f"{tup:010d}"\n        if isinstance(tup, int)\n        else str(tup)\n    )',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.8070, 0.2282],
#         [0.8070, 1.0000, 0.3158],
#         [0.2282, 0.3158, 1.0000]])
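Because the model was trained with MatryoshkaLoss over the dimensions [768, 512, 256, 128, 64], its embeddings can be truncated to a leading prefix and re-normalized, trading some accuracy for smaller vectors. A minimal numpy sketch of the principle (the vectors below are random placeholders standing in for `model.encode(...)` output, not real embeddings):

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-apply L2 normalization."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Placeholder embeddings in place of model.encode(sentences)
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768))

small = truncate_and_normalize(full, 256)
print(small.shape)  # (3, 256)
# After re-normalization, cosine similarity is a plain dot product
sims = small @ small.T
print(np.allclose(np.diag(sims), 1.0))  # True
```

Recent versions of Sentence Transformers also expose a `truncate_dim` argument on the `SentenceTransformer` constructor that performs this truncation for you.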

Evaluation

Metrics

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.65
cosine_accuracy@3 0.8
cosine_accuracy@5 0.85
cosine_accuracy@10 1.0
cosine_precision@1 0.65
cosine_precision@3 0.2667
cosine_precision@5 0.17
cosine_precision@10 0.1
cosine_recall@1 0.65
cosine_recall@3 0.8
cosine_recall@5 0.85
cosine_recall@10 1.0
cosine_ndcg@10 0.8048
cosine_mrr@10 0.7456
cosine_map@100 0.7456

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.7
cosine_accuracy@3 0.75
cosine_accuracy@5 0.85
cosine_accuracy@10 0.95
cosine_precision@1 0.7
cosine_precision@3 0.25
cosine_precision@5 0.17
cosine_precision@10 0.095
cosine_recall@1 0.7
cosine_recall@3 0.75
cosine_recall@5 0.85
cosine_recall@10 0.95
cosine_ndcg@10 0.7959
cosine_mrr@10 0.75
cosine_map@100 0.7545

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.65
cosine_accuracy@3 0.75
cosine_accuracy@5 0.8
cosine_accuracy@10 0.95
cosine_precision@1 0.65
cosine_precision@3 0.25
cosine_precision@5 0.16
cosine_precision@10 0.095
cosine_recall@1 0.65
cosine_recall@3 0.75
cosine_recall@5 0.8
cosine_recall@10 0.95
cosine_ndcg@10 0.7683
cosine_mrr@10 0.7142
cosine_map@100 0.718

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.6
cosine_accuracy@3 0.75
cosine_accuracy@5 0.9
cosine_accuracy@10 0.9
cosine_precision@1 0.6
cosine_precision@3 0.25
cosine_precision@5 0.18
cosine_precision@10 0.09
cosine_recall@1 0.6
cosine_recall@3 0.75
cosine_recall@5 0.9
cosine_recall@10 0.9
cosine_ndcg@10 0.7418
cosine_mrr@10 0.6908
cosine_map@100 0.6987

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.55
cosine_accuracy@3 0.7
cosine_accuracy@5 0.75
cosine_accuracy@10 0.95
cosine_precision@1 0.55
cosine_precision@3 0.2333
cosine_precision@5 0.15
cosine_precision@10 0.095
cosine_recall@1 0.55
cosine_recall@3 0.7
cosine_recall@5 0.75
cosine_recall@10 0.95
cosine_ndcg@10 0.7156
cosine_mrr@10 0.6454
cosine_map@100 0.6472
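The accuracy@k and MRR@10 figures above come from the rank at which each query's relevant document is retrieved. A self-contained sketch of these two metrics, using made-up ranks rather than this model's actual evaluation data:

```python
def accuracy_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose relevant document appears within the top k (1-indexed ranks)."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr_at_k(ranks: list[int], k: int) -> float:
    """Mean reciprocal rank, counting only hits within the top k."""
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the single relevant document for 4 queries
ranks = [1, 2, 5, 11]
print(accuracy_at_k(ranks, 1))   # 0.25
print(accuracy_at_k(ranks, 10))  # 0.75
print(mrr_at_k(ranks, 10))       # (1 + 0.5 + 0.2) / 4 = 0.425
```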

Training Details

Training Dataset

Unnamed Dataset

  • Size: 180 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 180 samples:
    anchor: string, min 6 / mean 14.07 / max 354 tokens
    positive: string, min 14 / mean 272.19 / max 512 tokens
  • Samples:
    anchor positive
    Best practices for test_search_items def test_search_items(
    fake_embeddings: CharacterEmbeddings,
    ) -> None:
    """Test search_items functionality by calling store methods directly."""
    base = "test_search_items"
    test_namespaces = [
    (base, "documents", "user1"),
    (base, "documents", "user2"),
    (base, "reports", "department1"),
    (base, "reports", "department2"),
    ]
    test_items = [
    {"title": "Doc 1", "author": "John Doe", "tags": ["important"]},
    {"title": "Doc 2", "author": "Jane Smith", "tags": ["draft"]},
    {"title": "Report A", "author": "John Doe", "tags": ["final"]},
    {"title": "Report B", "author": "Alice Johnson", "tags": ["draft"]},
    ]

    with create_vector_store(
    fake_embeddings, text_fields=["key0", "key1", "key3"]
    ) as store:
    # Insert test data
    for ns, item in zip(test_namespaces, test_items, strict=False):
    key = f"item_{ns[-1]}"
    store.put(ns, key, item)

    # 1. Search documen...
    How does async store work in Python? async def store(request) -> AsyncIterator[AsyncPostgresStore]:
    database = f"test_{uuid.uuid4().hex[:16]}"
    uri_parts = DEFAULT_URI.split("/")
    uri_base = "/".join(uri_parts[:-1])
    query_params = ""
    if "?" in uri_parts[-1]:
    db_name, query_params = uri_parts[-1].split("?", 1)
    query_params = "?" + query_params

    conn_string = f"{uri_base}/{database}{query_params}"
    admin_conn_string = DEFAULT_URI
    ttl_config = {
    "default_ttl": TTL_MINUTES,
    "refresh_on_read": True,
    "sweep_interval_minutes": TTL_MINUTES / 2,
    }
    async with await AsyncConnection.connect(
    admin_conn_string, autocommit=True
    ) as conn:
    await conn.execute(f"CREATE DATABASE {database}")
    try:
    async with AsyncPostgresStore.from_conn_string(
    conn_string, ttl=ttl_config
    ) as store:
    store.MIGRATIONS = [
    (
    mig.replace("ttl_minutes INT;", "ttl_minutes FLOAT;")
    ...
    How to implement list? def list(
    self,
    config: RunnableConfig | None,
    *,
    filter: dict[str, Any] | None = None,
    before: RunnableConfig | None = None,
    limit: int | None = None,
    ) -> Iterator[CheckpointTuple]:
    """List checkpoints from the database.

    This method retrieves a list of checkpoint tuples from the Postgres database based
    on the provided config. For ShallowPostgresSaver, this method returns a list with
    ONLY the most recent checkpoint.
    """
    aiter_ = self.alist(config, filter=filter, before=before, limit=limit)
    while True:
    try:
    yield asyncio.run_coroutine_threadsafe(
    anext(aiter_), # type: ignore[arg-type] # noqa: F821
    self.loop,
    ).result()
    except StopAsyncIteration:
    break
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
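MatryoshkaLoss trains the leading sub-vectors of each embedding to be usable on their own, so at inference time an embedding can be truncated to any of the dims listed above and renormalized. A minimal sketch of that truncation step, with random vectors standing in for model outputs:

```python
import numpy as np

def truncate_and_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates, then L2-normalize so cosine
    similarity reduces to a plain dot product at the smaller dimension."""
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(2, 768))          # stand-ins for 768-dim model outputs
for dim in (768, 512, 256, 128, 64):      # the matryoshka_dims trained above
    a, b = truncate_and_normalize(full, dim)
    print(dim, float(a @ b))              # cosine similarity at each dim
```

The dim_512 through dim_64 evaluation columns below are produced exactly this way: the same embeddings, truncated and renormalized before scoring.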
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 2
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • warmup_steps: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • optim: adamw_torch
  • batch_sampler: no_duplicates
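The non-default values above correspond to a `SentenceTransformerTrainingArguments` configuration roughly like the following sketch (the output path is a placeholder, and `save_strategy="epoch"` is added here because `load_best_model_at_end=True` requires the save and eval strategies to match):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder path
    eval_strategy="epoch",
    save_strategy="epoch",  # must match eval_strategy for load_best_model_at_end
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    optim="adamw_torch",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```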

All Hyperparameters

  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: None
  • warmup_ratio: 0.1
  • warmup_steps: 0.1
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • enable_jit_checkpoint: False
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • use_cpu: False
  • seed: 42
  • data_seed: None
  • bf16: False
  • fp16: True
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: -1
  • ddp_backend: None
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • auto_find_batch_size: False
  • full_determinism: False
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • use_cache: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
1.0 3 0.7612 0.7137 0.7083 0.6926 0.6624
2.0 6 0.8048 0.7959 0.7683 0.7418 0.7156
  • The epoch 2.0 row denotes the saved checkpoint.

Framework Versions

  • Python: 3.14.0
  • Sentence Transformers: 5.2.1
  • Transformers: 5.0.0
  • PyTorch: 2.10.0
  • Accelerate: 1.12.0
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}