---
tags:
- sentence-transformers
- code
- code-retrieval
- retrieval-augmented-generation
- rag
- python
- java 
- go 
- php
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: |-
    search_query: Finds the top long, short, and absolute positions.

        Parameters
        ----------
        positions : pd.DataFrame
            The positions that the strategy takes over time.
        top : int, optional
            How many of each to find (default 10).

        Returns
        -------
        df_top_long : pd.DataFrame
            Top long positions.
        df_top_short : pd.DataFrame
            Top short positions.
        df_top_abs : pd.DataFrame
            Top absolute positions.
  sentences:
  - >-
    search_document: def symmetric_ema(xolds, yolds, low=None, high=None, n=512,
    decay_steps=1., low_counts_threshold=1e-8):
        '''
        perform symmetric EMA (exponential moving average)
        smoothing and resampling to an even grid with n points.
        Does not do extrapolation, so we assume
        xolds[0] <= low && high <= xolds[-1]

        Arguments:

        xolds: array or list  - x values of data. Needs to be sorted in ascending order
        yolds: array of list  - y values of data. Has to have the same length as xolds

        low: float            - min value of the new x grid. By default equals to xolds[0]
        high: float           - max value of the new x grid. By default equals to xolds[-1]

        n: int                - number of points in new x grid

        decay_steps: float    - EMA decay factor, expressed in new x grid steps.

        low_counts_threshold: float or int
                              - y values with counts less than this value will be set to NaN

        Returns:
            tuple sum_ys, count_ys where
                xs        - array with new x grid
                ys        - array of EMA of y at each point of the new x grid
                count_ys  - array of EMA of y counts at each point of the new x grid

        '''
        xs, ys1, count_ys1 = one_sided_ema(xolds, yolds, low, high, n, decay_steps, low_counts_threshold=0)
        _,  ys2, count_ys2 = one_sided_ema(-xolds[::-1], yolds[::-1], -high, -low, n, decay_steps, low_counts_threshold=0)
        ys2 = ys2[::-1]
        count_ys2 = count_ys2[::-1]
        count_ys = count_ys1 + count_ys2
        ys = (ys1 * count_ys1 + ys2 * count_ys2) / count_ys
        ys[count_ys < low_counts_threshold] = np.nan
        return xs, ys, count_ys
  - |-
    search_document: def project(self, from_shape, to_shape):
            """
            Project the polygon onto an image with different shape.

            The relative coordinates of all points remain the same.
            E.g. a point at (x=20, y=20) on an image (width=100, height=200) will be
            projected on a new image (width=200, height=100) to (x=40, y=10).

            This is intended for cases where the original image is resized.
            It cannot be used for more complex changes (e.g. padding, cropping).

            Parameters
            ----------
            from_shape : tuple of int
                Shape of the original image. (Before resize.)

            to_shape : tuple of int
                Shape of the new image. (After resize.)

            Returns
            -------
            imgaug.Polygon
                Polygon object with new coordinates.

            """
            if from_shape[0:2] == to_shape[0:2]:
                return self.copy()
            ls_proj = self.to_line_string(closed=False).project(
                from_shape, to_shape)
            return self.copy(exterior=ls_proj.coords)
  - |-
    search_document: def get_top_long_short_abs(positions, top=10):
        """
        Finds the top long, short, and absolute positions.

        Parameters
        ----------
        positions : pd.DataFrame
            The positions that the strategy takes over time.
        top : int, optional
            How many of each to find (default 10).

        Returns
        -------
        df_top_long : pd.DataFrame
            Top long positions.
        df_top_short : pd.DataFrame
            Top short positions.
        df_top_abs : pd.DataFrame
            Top absolute positions.
        """

        positions = positions.drop('cash', axis='columns')
        df_max = positions.max()
        df_min = positions.min()
        df_abs_max = positions.abs().max()
        df_top_long = df_max[df_max > 0].nlargest(top)
        df_top_short = df_min[df_min < 0].nsmallest(top)
        df_top_abs = df_abs_max.nlargest(top)
        return df_top_long, df_top_short, df_top_abs
- source_sentence: |-
    search_query: Draw text on an image.

        This uses by default DejaVuSans as its font, which is included in this library.

        dtype support::

            * ``uint8``: yes; fully tested
            * ``uint16``: no
            * ``uint32``: no
            * ``uint64``: no
            * ``int8``: no
            * ``int16``: no
            * ``int32``: no
            * ``int64``: no
            * ``float16``: no
            * ``float32``: yes; not tested
            * ``float64``: no
            * ``float128``: no
            * ``bool``: no

            TODO check if other dtypes could be enabled

        Parameters
        ----------
        img : (H,W,3) ndarray
            The image array to draw text on.
            Expected to be of dtype uint8 or float32 (value range 0.0 to 255.0).

        y : int
            x-coordinate of the top left corner of the text.

        x : int
            y- coordinate of the top left corner of the text.

        text : str
            The text to draw.

        color : iterable of int, optional
            Color of the text to draw. For RGB-images this is expected to be an RGB color.

        size : int, optional
            Font size of the text to draw.

        Returns
        -------
        img_np : (H,W,3) ndarray
            Input image with text drawn on it.
  sentences:
  - >-
    search_document: def cross_entropy_seq_with_mask(logits, target_seqs,
    input_mask, return_details=False, name=None):
        """Returns the expression of cross-entropy of two sequences, implement
        softmax internally. Normally be used for Dynamic RNN with Synced sequence input and output.

        Parameters
        -----------
        logits : Tensor
            2D tensor with shape of [batch_size * ?, n_classes], `?` means dynamic IDs for each example.
            - Can be get from `DynamicRNNLayer` by setting ``return_seq_2d`` to `True`.
        target_seqs : Tensor
            int of tensor, like word ID. [batch_size, ?], `?` means dynamic IDs for each example.
        input_mask : Tensor
            The mask to compute loss, it has the same size with `target_seqs`, normally 0 or 1.
        return_details : boolean
            Whether to return detailed losses.
                - If False (default), only returns the loss.
                - If True, returns the loss, losses, weights and targets (see source code).

        Examples
        --------
        >>> batch_size = 64
        >>> vocab_size = 10000
        >>> embedding_size = 256
        >>> input_seqs = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="input")
        >>> target_seqs = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="target")
        >>> input_mask = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="mask")
        >>> net = tl.layers.EmbeddingInputlayer(
        ...         inputs = input_seqs,
        ...         vocabulary_size = vocab_size,
        ...         embedding_size = embedding_size,
        ...         name = 'seq_embedding')
        >>> net = tl.layers.DynamicRNNLayer(net,
        ...         cell_fn = tf.contrib.rnn.BasicLSTMCell,
        ...         n_hidden = embedding_size,
        ...         dropout = (0.7 if is_train else None),
        ...         sequence_length = tl.layers.retrieve_seq_length_op2(input_seqs),
        ...         return_seq_2d = True,
        ...         name = 'dynamicrnn')
        >>> print(net.outputs)
        (?, 256)
        >>> net = tl.layers.DenseLayer(net, n_units=vocab_size, name="output")
        >>> print(net.outputs)
        (?, 10000)
        >>> loss = tl.cost.cross_entropy_seq_with_mask(net.outputs, target_seqs, input_mask)

        """
        targets = tf.reshape(target_seqs, [-1])  # to one vector
        weights = tf.to_float(tf.reshape(input_mask, [-1]))  # to one vector like targets
        losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=targets, name=name) * weights
        # losses = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=targets, name=name)) # for TF1.0 and others

        loss = tf.divide(
            tf.reduce_sum(losses),  # loss from mask. reduce_sum before element-wise mul with mask !!
            tf.reduce_sum(weights),
            name="seq_loss_with_mask"
        )

        if return_details:
            return loss, losses, weights, targets
        else:
            return loss
  - |-
    search_document: def pickle_load(path, compression=False):
        """Unpickle a possible compressed pickle.

        Parameters
        ----------
        path: str
            path to the output file
        compression: bool
            if true assumes that pickle was compressed when created and attempts decompression.

        Returns
        -------
        obj: object
            the unpickled object
        """

        if compression:
            with zipfile.ZipFile(path, "r", compression=zipfile.ZIP_DEFLATED) as myzip:
                with myzip.open("data") as f:
                    return pickle.load(f)
        else:
            with open(path, "rb") as f:
                return pickle.load(f)
  - |-
    search_document: def draw_text(img, y, x, text, color=(0, 255, 0), size=25):
        """
        Draw text on an image.

        This uses by default DejaVuSans as its font, which is included in this library.

        dtype support::

            * ``uint8``: yes; fully tested
            * ``uint16``: no
            * ``uint32``: no
            * ``uint64``: no
            * ``int8``: no
            * ``int16``: no
            * ``int32``: no
            * ``int64``: no
            * ``float16``: no
            * ``float32``: yes; not tested
            * ``float64``: no
            * ``float128``: no
            * ``bool``: no

            TODO check if other dtypes could be enabled

        Parameters
        ----------
        img : (H,W,3) ndarray
            The image array to draw text on.
            Expected to be of dtype uint8 or float32 (value range 0.0 to 255.0).

        y : int
            x-coordinate of the top left corner of the text.

        x : int
            y- coordinate of the top left corner of the text.

        text : str
            The text to draw.

        color : iterable of int, optional
            Color of the text to draw. For RGB-images this is expected to be an RGB color.

        size : int, optional
            Font size of the text to draw.

        Returns
        -------
        img_np : (H,W,3) ndarray
            Input image with text drawn on it.

        """
        do_assert(img.dtype in [np.uint8, np.float32])

        input_dtype = img.dtype
        if img.dtype == np.float32:
            img = img.astype(np.uint8)

        img = PIL_Image.fromarray(img)
        font = PIL_ImageFont.truetype(DEFAULT_FONT_FP, size)
        context = PIL_ImageDraw.Draw(img)
        context.text((x, y), text, fill=tuple(color), font=font)
        img_np = np.asarray(img)

        # PIL/asarray returns read only array
        if not img_np.flags["WRITEABLE"]:
            try:
                # this seems to no longer work with np 1.16 (or was pillow updated?)
                img_np.setflags(write=True)
            except ValueError as ex:
                if "cannot set WRITEABLE flag to True of this array" in str(ex):
                    img_np = np.copy(img_np)

        if img_np.dtype != input_dtype:
            img_np = img_np.astype(input_dtype)

        return img_np
- source_sentence: >-
    search_query: Choice and return an an action by given the action probability
    distribution.

        Parameters
        ------------
        probs : list of float.
            The probability distribution of all actions.
        action_list : None or a list of int or others
            A list of action in integer, string or others. If None, returns an integer range between 0 and len(probs)-1.

        Returns
        --------
        float int or str
            The chosen action.

        Examples
        ----------
        >>> for _ in range(5):
        >>>     a = choice_action_by_probs([0.2, 0.4, 0.4])
        >>>     print(a)
        0
        1
        1
        2
        1
        >>> for _ in range(3):
        >>>     a = choice_action_by_probs([0.5, 0.5], ['a', 'b'])
        >>>     print(a)
        a
        b
        b
  sentences:
  - >-
    search_document: def from_keypoint_image(image, if_not_found_coords={"x":
    -1, "y": -1}, threshold=1, nb_channels=None): # pylint:
    disable=locally-disabled, dangerous-default-value, line-too-long
            """
            Converts an image generated by ``to_keypoint_image()`` back to a KeypointsOnImage object.

            Parameters
            ----------
            image : (H,W,N) ndarray
                The keypoints image. N is the number of keypoints.

            if_not_found_coords : tuple or list or dict or None, optional
                Coordinates to use for keypoints that cannot be found in `image`.
                If this is a list/tuple, it must have two integer values.
                If it is a dictionary, it must have the keys ``x`` and ``y`` with
                each containing one integer value.
                If this is None, then the keypoint will not be added to the final
                KeypointsOnImage object.

            threshold : int, optional
                The search for keypoints works by searching for the argmax in
                each channel. This parameters contains the minimum value that
                the max must have in order to be viewed as a keypoint.

            nb_channels : None or int, optional
                Number of channels of the image on which the keypoints are placed.
                Some keypoint augmenters require that information.
                If set to None, the keypoint's shape will be set
                to ``(height, width)``, otherwise ``(height, width, nb_channels)``.

            Returns
            -------
            out : KeypointsOnImage
                The extracted keypoints.

            """
            ia.do_assert(len(image.shape) == 3)
            height, width, nb_keypoints = image.shape

            drop_if_not_found = False
            if if_not_found_coords is None:
                drop_if_not_found = True
                if_not_found_x = -1
                if_not_found_y = -1
            elif isinstance(if_not_found_coords, (tuple, list)):
                ia.do_assert(len(if_not_found_coords) == 2)
                if_not_found_x = if_not_found_coords[0]
                if_not_found_y = if_not_found_coords[1]
            elif isinstance(if_not_found_coords, dict):
                if_not_found_x = if_not_found_coords["x"]
                if_not_found_y = if_not_found_coords["y"]
            else:
                raise Exception("Expected if_not_found_coords to be None or tuple or list or dict, got %s." % (
                    type(if_not_found_coords),))

            keypoints = []
            for i in sm.xrange(nb_keypoints):
                maxidx_flat = np.argmax(image[..., i])
                maxidx_ndim = np.unravel_index(maxidx_flat, (height, width))
                found = (image[maxidx_ndim[0], maxidx_ndim[1], i] >= threshold)
                if found:
                    keypoints.append(Keypoint(x=maxidx_ndim[1], y=maxidx_ndim[0]))
                else:
                    if drop_if_not_found:
                        pass  # dont add the keypoint to the result list, i.e. drop it
                    else:
                        keypoints.append(Keypoint(x=if_not_found_x, y=if_not_found_y))

            out_shape = (height, width)
            if nb_channels is not None:
                out_shape += (nb_channels,)
            return KeypointsOnImage(keypoints, shape=out_shape)
  - >-
    search_document: def choice_action_by_probs(probs=(0.5, 0.5),
    action_list=None):
        """Choice and return an an action by given the action probability distribution.

        Parameters
        ------------
        probs : list of float.
            The probability distribution of all actions.
        action_list : None or a list of int or others
            A list of action in integer, string or others. If None, returns an integer range between 0 and len(probs)-1.

        Returns
        --------
        float int or str
            The chosen action.

        Examples
        ----------
        >>> for _ in range(5):
        >>>     a = choice_action_by_probs([0.2, 0.4, 0.4])
        >>>     print(a)
        0
        1
        1
        2
        1
        >>> for _ in range(3):
        >>>     a = choice_action_by_probs([0.5, 0.5], ['a', 'b'])
        >>>     print(a)
        a
        b
        b

        """
        if action_list is None:
            n_action = len(probs)
            action_list = np.arange(n_action)
        else:
            if len(action_list) != len(probs):
                raise Exception("number of actions should equal to number of probabilities.")
        return np.random.choice(action_list, p=probs)
  - |-
    search_document: def __validateExperimentControl(self, control):
        """ Validates control dictionary for the experiment context"""
        # Validate task list
        taskList = control.get('tasks', None)
        if taskList is not None:
          taskLabelsList = []

          for task in taskList:
            validateOpfJsonValue(task, "opfTaskSchema.json")
            validateOpfJsonValue(task['taskControl'], "opfTaskControlSchema.json")

            taskLabel = task['taskLabel']

            assert isinstance(taskLabel, types.StringTypes), \
                   "taskLabel type: %r" % type(taskLabel)
            assert len(taskLabel) > 0, "empty string taskLabel not is allowed"

            taskLabelsList.append(taskLabel.lower())

          taskLabelDuplicates = filter(lambda x: taskLabelsList.count(x) > 1,
                                       taskLabelsList)
          assert len(taskLabelDuplicates) == 0, \
                 "Duplcate task labels are not allowed: %s" % taskLabelDuplicates

        return
- source_sentence: |-
    search_query: Augment endlessly images in the source queue.

            This is a worker function for that endlessly queries the source queue (input batches),
            augments batches in it and sends the result to the output queue.
  sentences:
  - >-
    search_document: def _augment_images_worker(self, augseq, queue_source,
    queue_result, seedval):
            """
            Augment endlessly images in the source queue.

            This is a worker function for that endlessly queries the source queue (input batches),
            augments batches in it and sends the result to the output queue.

            """
            np.random.seed(seedval)
            random.seed(seedval)
            augseq.reseed(seedval)
            ia.seed(seedval)

            loader_finished = False

            while not loader_finished:
                # wait for a new batch in the source queue and load it
                try:
                    batch_str = queue_source.get(timeout=0.1)
                    batch = pickle.loads(batch_str)
                    if batch is None:
                        loader_finished = True
                        # put it back in so that other workers know that the loading queue is finished
                        queue_source.put(pickle.dumps(None, protocol=-1))
                    else:
                        batch_aug = augseq.augment_batch(batch)

                        # send augmented batch to output queue
                        batch_str = pickle.dumps(batch_aug, protocol=-1)
                        queue_result.put(batch_str)
                except QueueEmpty:
                    time.sleep(0.01)

            queue_result.put(pickle.dumps(None, protocol=-1))
            time.sleep(0.01)
  - |-
    search_document: def show_perf_attrib_stats(returns,
                               positions,
                               factor_returns,
                               factor_loadings,
                               transactions=None,
                               pos_in_dollars=True):
        """
        Calls `perf_attrib` using inputs, and displays outputs using
        `utils.print_table`.
        """
        risk_exposures, perf_attrib_data = perf_attrib(
            returns,
            positions,
            factor_returns,
            factor_loadings,
            transactions,
            pos_in_dollars=pos_in_dollars,
        )

        perf_attrib_stats, risk_exposure_stats =\
            create_perf_attrib_stats(perf_attrib_data, risk_exposures)

        percentage_formatter = '{:.2%}'.format
        float_formatter = '{:.2f}'.format

        summary_stats = perf_attrib_stats.loc[['Annualized Specific Return',
                                               'Annualized Common Return',
                                               'Annualized Total Return',
                                               'Specific Sharpe Ratio']]

        # Format return rows in summary stats table as percentages.
        for col_name in (
            'Annualized Specific Return',
            'Annualized Common Return',
            'Annualized Total Return',
        ):
            summary_stats[col_name] = percentage_formatter(summary_stats[col_name])

        # Display sharpe to two decimal places.
        summary_stats['Specific Sharpe Ratio'] = float_formatter(
            summary_stats['Specific Sharpe Ratio']
        )

        print_table(summary_stats, name='Summary Statistics')

        print_table(
            risk_exposure_stats,
            name='Exposures Summary',
            # In exposures table, format exposure column to 2 decimal places, and
            # return columns  as percentages.
            formatters={
                'Average Risk Factor Exposure': float_formatter,
                'Annualized Return': percentage_formatter,
                'Cumulative Return': percentage_formatter,
            },
        )
  - >-
    search_document: def binary_cross_entropy(output, target, epsilon=1e-8,
    name='bce_loss'):
        """Binary cross entropy operation.

        Parameters
        ----------
        output : Tensor
            Tensor with type of `float32` or `float64`.
        target : Tensor
            The target distribution, format the same with `output`.
        epsilon : float
            A small value to avoid output to be zero.
        name : str
            An optional name to attach to this function.

        References
        -----------
        - `ericjang-DRAW <https://github.com/ericjang/draw/blob/master/draw.py#L73>`__

        """
        #     with ops.op_scope([output, target], name, "bce_loss") as name:
        #         output = ops.convert_to_tensor(output, name="preds")
        #         target = ops.convert_to_tensor(targets, name="target")

        # with tf.name_scope(name):
        return tf.reduce_mean(
            tf.reduce_sum(-(target * tf.log(output + epsilon) + (1. - target) * tf.log(1. - output + epsilon)), axis=1),
            name=name
        )
- source_sentence: 'search_query: episode_batch: array(batch_size x (T or T+1) x dim_key)'
  sentences:
  - |-
    search_document: def get_txn_vol(transactions):
        """
        Extract daily transaction data from set of transaction objects.

        Parameters
        ----------
        transactions : pd.DataFrame
            Time series containing one row per symbol (and potentially
            duplicate datetime indices) and columns for amount and
            price.

        Returns
        -------
        pd.DataFrame
            Daily transaction volume and number of shares.
             - See full explanation in tears.create_full_tear_sheet.
        """

        txn_norm = transactions.copy()
        txn_norm.index = txn_norm.index.normalize()
        amounts = txn_norm.amount.abs()
        prices = txn_norm.price
        values = amounts * prices
        daily_amounts = amounts.groupby(amounts.index).sum()
        daily_values = values.groupby(values.index).sum()
        daily_amounts.name = "txn_shares"
        daily_values.name = "txn_volume"
        return pd.concat([daily_values, daily_amounts], axis=1)
  - |-
    search_document: def deepcopy(self, exterior=None, label=None):
            """
            Create a deep copy of the Polygon object.

            Parameters
            ----------
            exterior : list of Keypoint or list of tuple or (N,2) ndarray, optional
                List of points defining the polygon. See `imgaug.Polygon.__init__` for details.

            label : None or str
                If not None, then the label of the copied object will be set to this value.

            Returns
            -------
            imgaug.Polygon
                Deep copy.

            """
            return Polygon(
                exterior=np.copy(self.exterior) if exterior is None else exterior,
                label=self.label if label is None else label
            )
  - |-
    search_document: def store_episode(self, episode_batch):
            """episode_batch: array(batch_size x (T or T+1) x dim_key)
            """
            batch_sizes = [len(episode_batch[key]) for key in episode_batch.keys()]
            assert np.all(np.array(batch_sizes) == batch_sizes[0])
            batch_size = batch_sizes[0]

            with self.lock:
                idxs = self._get_storage_idx(batch_size)

                # load inputs into buffers
                for key in self.buffers.keys():
                    self.buffers[key][idxs] = episode_batch[key]

                self.n_transitions_stored += batch_size * self.T
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: codesearchnet val
      type: codesearchnet_val
    metrics:
    - type: cosine_accuracy@1
      value: 0.8926
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.9453666666666667
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.9545
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.9637666666666667
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.8926
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.31512222222222214
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.19090000000000004
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.09637666666666667
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.8926
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.9453666666666667
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.9545
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9637666666666667
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.9313201256618757
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.9206047883597835
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.9212040995599341
      name: Cosine Map@100
license: mit
datasets:
- code-search-net/code_search_net
language:
- en
base_model:
- answerdotai/ModernBERT-base
---

# SentenceTransformer

This is an [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) model fine-tuned on the [code_search_net](https://huggingface.co/datasets/code-search-net/code_search_net) dataset using
[<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with in-batch negatives. The model can be used for code retrieval and reranking.
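
To illustrate what "in-batch negatives" means, here is a minimal pure-Python sketch of the idea behind this kind of loss: for each query, the paired document is the positive and every other document in the batch serves as a negative. It is not the training code of this model; the function names, the toy 2-dimensional vectors, and the `scale=20.0` value (the default in Sentence Transformers) are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i
    and all other docs in the batch act as negatives."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the correct (diagonal) document.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch: hypothetical 2-d embeddings, not real model outputs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]  # doc i matches query i
shuffled_docs = aligned_docs[::-1]       # pairs deliberately mismatched

loss_aligned = in_batch_negatives_loss(queries, aligned_docs)
loss_shuffled = in_batch_negatives_loss(queries, shuffled_docs)
print(loss_aligned < loss_shuffled)
```

The loss is small when each query is closest to its own document and large when another document in the batch scores higher, which is what pushes matching query/code pairs together during training.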

## Performance on code retrieval benchmarks

**RTEB**

As of 14.10.2025, the model ranks **6th** on the RTEB leaderboard among models with <500M parameters:
<details>
  <summary>Click to expand</summary>
    <figure>
      <img src="Rteb_top.jpg">
    </figure>
</details>

Performance per task:

| Model | AppsRetrieval | Code1Retrieval (Private) | DS1000Retrieval | FreshStackRetrieval | HumanEvalRetrieval | JapaneseCode1Retrieval (Private) | MBPPRetrieval | WikiSQLRetrieval |
|-------|---------------|----------------|-----------------|---------------------|--------------------|------------------------|---------------|------------------|
| english_code_retriever | 8.04 | 75.36 | 32.42 | 18.30 | 71.82 | 46.59 | 72.06 | 87.92 |

**COIR**

| Model | AppsRetrieval | COIRCodeSearchNetRetrieval | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCCRetrieval | CodeTransOceanContest | CodeTransOceanDL | CosQA | StackOverflowQA | SyntheticText2SQL |
|-------|---------------|----------------------------|----------------|----------------|--------------------------|------------------------|------------------|-------|------------------|-------------------|
| english_code_retriever | 8.04 | 74.23 | 44.01 | 57.79 | 42.71 | 60.68 | 35.16 | 25.56 | 56.53 | 42.79 |

More information can be found on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).


## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768
- **Similarity Function:** Cosine Similarity
- **Pooling:** Mean pooling
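
As a rough illustration of the last two items, the sketch below shows mean pooling over token vectors followed by cosine similarity. It uses plain Python lists and toy 2-dimensional vectors instead of the model's 768-dimensional outputs; the function names and the sample data are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    # Guard against an all-padding input to avoid division by zero.
    return [t / max(count, 1) for t in total]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)
print(cosine_similarity(pooled, [2.0, 3.0]))
```

The padded position is ignored, so the sentence embedding depends only on real tokens; two such embeddings are then compared by cosine similarity.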

## Usage

The model is easy to use with Sentence Transformers.

Note that the model was trained with the prefixes `search_query` for queries and `search_document` for code documents,
so using these prefixes should improve retrieval quality.
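
The widget examples in this card's metadata show inputs written as `search_query: ...` and `search_document: ...`. Assuming that exact format (prefix name, colon, space), the prefixes can also be prepended by hand, which may be useful when the model is called without configured prompts. The helper and constant names below are hypothetical.

```python
# Assumed prefix strings, taken from the format of the widget examples.
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "


def add_prefix(texts, prefix):
    """Return a new list with the given prefix prepended to every text."""
    return [prefix + text for text in texts]


raw_queries = ["Combine two sorted lists into a single sorted list."]
raw_docs = ["def add_lists(a, b):\n    return [x + y for x, y in zip(a, b)]"]

prefixed_queries = add_prefix(raw_queries, QUERY_PREFIX)
prefixed_docs = add_prefix(raw_docs, DOCUMENT_PREFIX)

print(prefixed_queries[0])
print(prefixed_docs[0])
```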

```python
import torch
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("fyaronskiy/english_code_retriever").to(device)

queries = [
    "Write a Python function that calculates the factorial of a number recursively.",
    "How to check if a given string reads the same backward and forward?",
    "Combine two sorted lists into a single sorted list."
]

corpus = [
    # Relevant for Q1
    """def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n-1)""",

    # Hard negative for Q1 (similar structure but computes sum)
    """def sum_recursive(n):
    if n == 0:
        return 0
    return n + sum_recursive(n-1)""",

    # Relevant for Q2
    """def is_palindrome(s: str) -> bool:
    s = s.lower().replace(" ", "")
    return s == s[::-1]""",

    # Hard negative for Q2 (string reverse but not palindrome check)
    """def reverse_string(s: str) -> str:
    return s[::-1]""",

    # Relevant for Q3
    """def merge_sorted_lists(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result""",

    # Hard negative for Q3 (similar iteration but sums two lists elementwise)
    """def add_lists(a, b):
    return [x + y for x, y in zip(a, b)]"""
]


# Documents use the 'search_document' prompt and queries use 'search_query', matching the training setup
doc_embeddings = model.encode(corpus, prompt_name='search_document', convert_to_tensor=True, device=device)
query_embeddings = model.encode(queries, prompt_name='search_query', convert_to_tensor=True, device=device)

# Compute cosine similarity and retrieve top-1
for i, query in enumerate(queries):
    scores = util.cos_sim(query_embeddings[i], doc_embeddings)[0]
    best_idx = torch.argmax(scores).item()
    print(f"\n Query {i+1}: {query}")
    print(f"Top-1 match (score={scores[best_idx]:.4f}):\n{corpus[best_idx]}")

''' Query 1: Write a Python function that calculates the factorial of a number recursively.
Top-1 match (score=0.5983):
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n-1)

 Query 2: How to check if a given string reads the same backward and forward?
Top-1 match (score=0.4925):
def is_palindrome(s: str) -> bool:
    s = s.lower().replace(" ", "")
    return s == s[::-1]

 Query 3: Combine two sorted lists into a single sorted list.
Top-1 match (score=0.6524):
def merge_sorted_lists(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result
'''
```

Using the model with Transformers:
```python
import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "fyaronskiy/english_code_retriever"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
model.eval()


queries = [
    "function of addition of two numbers",
    "finding the maximum element in an array",
    "sorting a list in ascending order"
]

corpus = [
    "def add(a, b): return a + b",
    "def find_max(arr): return max(arr)",
    "def sort_list(lst): return sorted(lst)"
]

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # (batch_size, seq_len, hidden_dim)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * input_mask_expanded).sum(1) / input_mask_expanded.sum(1).clamp(min=1e-9)

def encode_texts(texts):
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt",
        max_length=8192
    ).to(device)
    with torch.no_grad():
        model_output = model(**encoded)
    return mean_pooling(model_output, encoded["attention_mask"])

doc_embeddings = encode_texts(["search_document: " + document  for document in corpus])
query_embeddings = encode_texts(["search_query: " + query  for query in queries])

# Normalize embeddings for cosine similarity
doc_embeddings = torch.nn.functional.normalize(doc_embeddings, p=2, dim=1)
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)

# Compute cosine similarity and retrieve top-1
for i, query in enumerate(queries):
    scores = torch.matmul(query_embeddings[i], doc_embeddings.T)
    best_idx = torch.argmax(scores).item()
    print(f"\n Query {i+1}: {query}")
    print(f"Top-1 match (score={scores[best_idx]:.4f}):\n{corpus[best_idx]}")

''' Query 1: function of addition of two numbers
Top-1 match (score=0.6047):
def add(a, b): return a + b

 Query 2: finding the maximum element in an array
Top-1 match (score=0.7772):
def find_max(arr): return max(arr)

 Query 3: sorting a list in ascending order
Top-1 match (score=0.7389):
def sort_list(lst): return sorted(lst)
'''
```
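
When the model is called without the `prompt_name` argument of Sentence Transformers, as in the Transformers example above, the prefixes have to be prepended by hand. A small helper can keep this consistent across a codebase. The sketch below is only an illustration: `PREFIXES` and `add_prefix` are hypothetical names, and the prefix strings are copied from the example above.

```python
# Prefix strings as used in the Transformers example above.
PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
}


def add_prefix(texts, kind):
    """Prepend the retrieval prefix for `kind` ("query" or "document") to each text."""
    if kind not in PREFIXES:
        raise ValueError(f"kind must be one of {sorted(PREFIXES)}, got {kind!r}")
    prefix = PREFIXES[kind]
    # Leave texts that already carry the prefix untouched to avoid double-prefixing.
    return [t if t.startswith(prefix) else prefix + t for t in texts]


prefixed_queries = add_prefix(["sorting a list in ascending order"], "query")
prefixed_docs = add_prefix(["def sort_list(lst): return sorted(lst)"], "document")
print(prefixed_queries[0])  # search_query: sorting a list in ascending order
print(prefixed_docs[0])     # search_document: def sort_list(lst): return sorted(lst)
```

The returned lists can be passed directly to a function such as `encode_texts` from the example above.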

## Evaluation

### Metrics

#### Information Retrieval

* Dataset: validation part of `codesearchnet_val`
* Size: 30,000 evaluation samples

* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.8926     |
| cosine_accuracy@3   | 0.9454     |
| cosine_accuracy@5   | 0.9545     |
| cosine_accuracy@10  | 0.9638     |
| cosine_precision@1  | 0.8926     |
| cosine_precision@3  | 0.3151     |
| cosine_precision@5  | 0.1909     |
| cosine_precision@10 | 0.0964     |
| cosine_recall@1     | 0.8926     |
| cosine_recall@3     | 0.9454     |
| cosine_recall@5     | 0.9545     |
| cosine_recall@10    | 0.9638     |
| **cosine_ndcg@10**  | **0.9313** |
| cosine_mrr@10       | 0.9206     |
| cosine_map@100      | 0.9212     |
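
In the table, precision@k equals recall@k divided by k (for example, 0.9454 / 3 ≈ 0.3151), which is consistent with each query having exactly one relevant document. Under that assumption, the metrics reduce to simple functions of the rank at which the relevant document is retrieved. The sketch below is a simplified illustration and not the `InformationRetrievalEvaluator` implementation; `ir_metrics` and the sample ranks are hypothetical.

```python
import math


def ir_metrics(ranks, k):
    """Compute retrieval metrics at cutoff k, assuming one relevant document per query.

    `ranks` holds the 1-based rank of the relevant document for each query,
    or None when it was not retrieved at all.
    """
    n = len(ranks)
    hits = [r is not None and r <= k for r in ranks]
    hit_rate = sum(hits) / n
    # Reciprocal rank counts only when the relevant document is inside the top k.
    mrr = sum(1.0 / r for r, h in zip(ranks, hits) if h) / n
    # With a single relevant document the ideal DCG is 1, so NDCG equals DCG.
    ndcg = sum(1.0 / math.log2(r + 1) for r, h in zip(ranks, hits) if h) / n
    return {
        "accuracy": hit_rate,        # at least one relevant document in the top k
        "recall": hit_rate,          # identical to accuracy with one relevant document
        "precision": hit_rate / k,   # at most one of the k results can be relevant
        "mrr": mrr,
        "ndcg": ndcg,
    }


# Hypothetical ranks for five queries; the last query missed its document.
sample_ranks = [1, 1, 2, 4, None]
for cutoff in (1, 3, 10):
    print(cutoff, ir_metrics(sample_ranks, cutoff))
```

This also explains why precision falls as k grows while accuracy and recall rise: the number of relevant documents per query stays at one while the number of returned results increases.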

## Training Details

### Training Dataset

#### code_search_net

* Dataset: train part of code_search_net
* Size: 1,880,853 training samples
* Queries are function docstrings in English; the relevant document for each query is the code of that function
* Negatives were sampled from the same batch (in-batch negatives)
* Distribution of programming languages:

![image](https://cdn-uploads.huggingface.co/production/uploads/620118ee1283421e22448fc2/IefFD32ihUSTGZjbb78px.png)
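
Sampling negatives from the batch means that, for each docstring, the code of every other function in the same batch serves as a negative. The card's tags list `MultipleNegativesRankingLoss`, which implements this idea as a cross-entropy over the scaled query-document similarity matrix. The following is a minimal pure-Python sketch of that computation, not the library code; `in_batch_negatives_loss` is a hypothetical name, and `scale=20.0` is the library default for this loss, assumed here because the value used for this model is not stated.

```python
import math


def in_batch_negatives_loss(sim, scale=20.0):
    """Mean cross-entropy over a square similarity matrix.

    sim[i][j] is the cosine similarity between query i and document j.
    Document i is the positive for query i; all other documents in the
    row act as negatives.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the positive document.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Hypothetical similarities for a batch of two (query, code) pairs.
well_separated = [[0.9, 0.1], [0.2, 0.8]]
confused = [[0.1, 0.9], [0.8, 0.2]]
print(in_batch_negatives_loss(well_separated))
print(in_batch_negatives_loss(confused))
```

The loss is small when the diagonal entries dominate their rows and large when a negative scores higher than the positive. Larger batches supply more negatives per query.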


### Training Hyperparameters
#### Non-Default Hyperparameters

- `batch_size`: 64
- `learning_rate`: 2e-05
- `num_epochs`: 2
- `warmup_ratio`: 0.1
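
To give a sense of the training schedule these values imply, the arithmetic below estimates the number of optimizer steps and warmup steps. It assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these are stated in this card, so the real numbers may differ.

```python
import math

train_samples = 1_880_853  # training set size reported above
batch_size = 64
num_epochs = 2
warmup_ratio = 0.1

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)  # 29389 58778 5878
```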


### Framework Versions
- Python: 3.10.11
- Sentence Transformers: 5.1.0
- Transformers: 4.52.3
- PyTorch: 2.6.0+cu124
- Accelerate: 1.10.0
- Datasets: 3.6.0
- Tokenizers: 0.21.4

