| | --- |
| | tags: |
| | - sentence-transformers |
| | - code |
| | - code-retrieval |
| | - retrieval-augmented-generation |
| | - rag |
| | - python |
| | - java |
| | - go |
| | - php |
| | - sentence-similarity |
| | - feature-extraction |
| | - generated_from_trainer |
| | - loss:MultipleNegativesRankingLoss |
| | widget: |
| | - source_sentence: |- |
| | search_query: Finds the top long, short, and absolute positions. |
| | |
| | Parameters |
| | ---------- |
| | positions : pd.DataFrame |
| | The positions that the strategy takes over time. |
| | top : int, optional |
| | How many of each to find (default 10). |
| |
|
| | Returns |
| | ------- |
| | df_top_long : pd.DataFrame |
| | Top long positions. |
| | df_top_short : pd.DataFrame |
| | Top short positions. |
| | df_top_abs : pd.DataFrame |
| | Top absolute positions. |
| | sentences: |
| | - >- |
| | search_document: def symmetric_ema(xolds, yolds, low=None, high=None, n=512, |
| | decay_steps=1., low_counts_threshold=1e-8): |
| | ''' |
| | perform symmetric EMA (exponential moving average) |
| | smoothing and resampling to an even grid with n points. |
| | Does not do extrapolation, so we assume |
| | xolds[0] <= low && high <= xolds[-1] |
| | |
| | Arguments: |
| |
|
| | xolds: array or list - x values of data. Needs to be sorted in ascending order |
| | yolds: array of list - y values of data. Has to have the same length as xolds |
| |
|
| | low: float - min value of the new x grid. By default equals to xolds[0] |
| | high: float - max value of the new x grid. By default equals to xolds[-1] |
| |
|
| | n: int - number of points in new x grid |
| |
|
| | decay_steps: float - EMA decay factor, expressed in new x grid steps. |
| |
|
| | low_counts_threshold: float or int |
| | - y values with counts less than this value will be set to NaN |
| |
|
| | Returns: |
| | tuple sum_ys, count_ys where |
| | xs - array with new x grid |
| | ys - array of EMA of y at each point of the new x grid |
| | count_ys - array of EMA of y counts at each point of the new x grid |
| |
|
| | ''' |
| | xs, ys1, count_ys1 = one_sided_ema(xolds, yolds, low, high, n, decay_steps, low_counts_threshold=0) |
| | _, ys2, count_ys2 = one_sided_ema(-xolds[::-1], yolds[::-1], -high, -low, n, decay_steps, low_counts_threshold=0) |
| | ys2 = ys2[::-1] |
| | count_ys2 = count_ys2[::-1] |
| | count_ys = count_ys1 + count_ys2 |
| | ys = (ys1 * count_ys1 + ys2 * count_ys2) / count_ys |
| | ys[count_ys < low_counts_threshold] = np.nan |
| | return xs, ys, count_ys |
| | - |- |
| | search_document: def project(self, from_shape, to_shape): |
| | """ |
| | Project the polygon onto an image with different shape. |
| | |
| | The relative coordinates of all points remain the same. |
| | E.g. a point at (x=20, y=20) on an image (width=100, height=200) will be |
| | projected on a new image (width=200, height=100) to (x=40, y=10). |
| | |
| | This is intended for cases where the original image is resized. |
| | It cannot be used for more complex changes (e.g. padding, cropping). |
| | |
| | Parameters |
| | ---------- |
| | from_shape : tuple of int |
| | Shape of the original image. (Before resize.) |
| | |
| | to_shape : tuple of int |
| | Shape of the new image. (After resize.) |
| | |
| | Returns |
| | ------- |
| | imgaug.Polygon |
| | Polygon object with new coordinates. |
| | |
| | """ |
| | if from_shape[0:2] == to_shape[0:2]: |
| | return self.copy() |
| | ls_proj = self.to_line_string(closed=False).project( |
| | from_shape, to_shape) |
| | return self.copy(exterior=ls_proj.coords) |
| | - |- |
| | search_document: def get_top_long_short_abs(positions, top=10): |
| | """ |
| | Finds the top long, short, and absolute positions. |
| | |
| | Parameters |
| | ---------- |
| | positions : pd.DataFrame |
| | The positions that the strategy takes over time. |
| | top : int, optional |
| | How many of each to find (default 10). |
| | |
| | Returns |
| | ------- |
| | df_top_long : pd.DataFrame |
| | Top long positions. |
| | df_top_short : pd.DataFrame |
| | Top short positions. |
| | df_top_abs : pd.DataFrame |
| | Top absolute positions. |
| | """ |
| | |
| | positions = positions.drop('cash', axis='columns') |
| | df_max = positions.max() |
| | df_min = positions.min() |
| | df_abs_max = positions.abs().max() |
| | df_top_long = df_max[df_max > 0].nlargest(top) |
| | df_top_short = df_min[df_min < 0].nsmallest(top) |
| | df_top_abs = df_abs_max.nlargest(top) |
| | return df_top_long, df_top_short, df_top_abs |
| | - source_sentence: |- |
| | search_query: Draw text on an image. |
| | |
| | This uses by default DejaVuSans as its font, which is included in this library. |
| |
|
| | dtype support:: |
| |
|
| | * ``uint8``: yes; fully tested |
| | * ``uint16``: no |
| | * ``uint32``: no |
| | * ``uint64``: no |
| | * ``int8``: no |
| | * ``int16``: no |
| | * ``int32``: no |
| | * ``int64``: no |
| | * ``float16``: no |
| | * ``float32``: yes; not tested |
| | * ``float64``: no |
| | * ``float128``: no |
| | * ``bool``: no |
| |
|
| | TODO check if other dtypes could be enabled |
| |
|
| | Parameters |
| | ---------- |
| | img : (H,W,3) ndarray |
| | The image array to draw text on. |
| | Expected to be of dtype uint8 or float32 (value range 0.0 to 255.0). |
| |
|
| | y : int |
| | x-coordinate of the top left corner of the text. |
| |
|
| | x : int |
| | y- coordinate of the top left corner of the text. |
| |
|
| | text : str |
| | The text to draw. |
| |
|
| | color : iterable of int, optional |
| | Color of the text to draw. For RGB-images this is expected to be an RGB color. |
| |
|
| | size : int, optional |
| | Font size of the text to draw. |
| |
|
| | Returns |
| | ------- |
| | img_np : (H,W,3) ndarray |
| | Input image with text drawn on it. |
| | sentences: |
| | - >- |
| | search_document: def cross_entropy_seq_with_mask(logits, target_seqs, |
| | input_mask, return_details=False, name=None): |
| | """Returns the expression of cross-entropy of two sequences, implement |
| | softmax internally. Normally be used for Dynamic RNN with Synced sequence input and output. |
| | |
| | Parameters |
| | ----------- |
| | logits : Tensor |
| | 2D tensor with shape of [batch_size * ?, n_classes], `?` means dynamic IDs for each example. |
| | - Can be get from `DynamicRNNLayer` by setting ``return_seq_2d`` to `True`. |
| | target_seqs : Tensor |
| | int of tensor, like word ID. [batch_size, ?], `?` means dynamic IDs for each example. |
| | input_mask : Tensor |
| | The mask to compute loss, it has the same size with `target_seqs`, normally 0 or 1. |
| | return_details : boolean |
| | Whether to return detailed losses. |
| | - If False (default), only returns the loss. |
| | - If True, returns the loss, losses, weights and targets (see source code). |
| |
|
| | Examples |
| | -------- |
| | >>> batch_size = 64 |
| | >>> vocab_size = 10000 |
| | >>> embedding_size = 256 |
| | >>> input_seqs = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="input") |
| | >>> target_seqs = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="target") |
| | >>> input_mask = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="mask") |
| | >>> net = tl.layers.EmbeddingInputlayer( |
| | ... inputs = input_seqs, |
| | ... vocabulary_size = vocab_size, |
| | ... embedding_size = embedding_size, |
| | ... name = 'seq_embedding') |
| | >>> net = tl.layers.DynamicRNNLayer(net, |
| | ... cell_fn = tf.contrib.rnn.BasicLSTMCell, |
| | ... n_hidden = embedding_size, |
| | ... dropout = (0.7 if is_train else None), |
| | ... sequence_length = tl.layers.retrieve_seq_length_op2(input_seqs), |
| | ... return_seq_2d = True, |
| | ... name = 'dynamicrnn') |
| | >>> print(net.outputs) |
| | (?, 256) |
| | >>> net = tl.layers.DenseLayer(net, n_units=vocab_size, name="output") |
| | >>> print(net.outputs) |
| | (?, 10000) |
| | >>> loss = tl.cost.cross_entropy_seq_with_mask(net.outputs, target_seqs, input_mask) |
| |
|
| | """ |
| | targets = tf.reshape(target_seqs, [-1]) # to one vector |
| | weights = tf.to_float(tf.reshape(input_mask, [-1])) # to one vector like targets |
| | losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=targets, name=name) * weights |
| | # losses = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=targets, name=name)) # for TF1.0 and others |
| | |
| | loss = tf.divide( |
| | tf.reduce_sum(losses), # loss from mask. reduce_sum before element-wise mul with mask !! |
| | tf.reduce_sum(weights), |
| | name="seq_loss_with_mask" |
| | ) |
| |
|
| | if return_details: |
| | return loss, losses, weights, targets |
| | else: |
| | return loss |
| | - |- |
| | search_document: def pickle_load(path, compression=False): |
| | """Unpickle a possible compressed pickle. |
| | |
| | Parameters |
| | ---------- |
| | path: str |
| | path to the output file |
| | compression: bool |
| | if true assumes that pickle was compressed when created and attempts decompression. |
| |
|
| | Returns |
| | ------- |
| | obj: object |
| | the unpickled object |
| | """ |
| | |
| | if compression: |
| | with zipfile.ZipFile(path, "r", compression=zipfile.ZIP_DEFLATED) as myzip: |
| | with myzip.open("data") as f: |
| | return pickle.load(f) |
| | else: |
| | with open(path, "rb") as f: |
| | return pickle.load(f) |
| | - |- |
| | search_document: def draw_text(img, y, x, text, color=(0, 255, 0), size=25): |
| | """ |
| | Draw text on an image. |
| | |
| | This uses by default DejaVuSans as its font, which is included in this library. |
| |
|
| | dtype support:: |
| |
|
| | * ``uint8``: yes; fully tested |
| | * ``uint16``: no |
| | * ``uint32``: no |
| | * ``uint64``: no |
| | * ``int8``: no |
| | * ``int16``: no |
| | * ``int32``: no |
| | * ``int64``: no |
| | * ``float16``: no |
| | * ``float32``: yes; not tested |
| | * ``float64``: no |
| | * ``float128``: no |
| | * ``bool``: no |
| |
|
| | TODO check if other dtypes could be enabled |
| |
|
| | Parameters |
| | ---------- |
| | img : (H,W,3) ndarray |
| | The image array to draw text on. |
| | Expected to be of dtype uint8 or float32 (value range 0.0 to 255.0). |
| |
|
| | y : int |
| | x-coordinate of the top left corner of the text. |
| |
|
| | x : int |
| | y- coordinate of the top left corner of the text. |
| |
|
| | text : str |
| | The text to draw. |
| |
|
| | color : iterable of int, optional |
| | Color of the text to draw. For RGB-images this is expected to be an RGB color. |
| |
|
| | size : int, optional |
| | Font size of the text to draw. |
| |
|
| | Returns |
| | ------- |
| | img_np : (H,W,3) ndarray |
| | Input image with text drawn on it. |
| |
|
| | """ |
| | do_assert(img.dtype in [np.uint8, np.float32]) |
| | |
| | input_dtype = img.dtype |
| | if img.dtype == np.float32: |
| | img = img.astype(np.uint8) |
| | |
| | img = PIL_Image.fromarray(img) |
| | font = PIL_ImageFont.truetype(DEFAULT_FONT_FP, size) |
| | context = PIL_ImageDraw.Draw(img) |
| | context.text((x, y), text, fill=tuple(color), font=font) |
| | img_np = np.asarray(img) |
| | |
| | # PIL/asarray returns read only array |
| | if not img_np.flags["WRITEABLE"]: |
| | try: |
| | |
| | img_np.setflags(write=True) |
| | except ValueError as ex: |
| | if "cannot set WRITEABLE flag to True of this array" in str(ex): |
| | img_np = np.copy(img_np) |
| |
|
| | if img_np.dtype != input_dtype: |
| | img_np = img_np.astype(input_dtype) |
| |
|
| | return img_np |
| | - source_sentence: >- |
| | search_query: Choice and return an an action by given the action probability |
| | distribution. |
| | |
| | Parameters |
| | ------------ |
| | probs : list of float. |
| | The probability distribution of all actions. |
| | action_list : None or a list of int or others |
| | A list of action in integer, string or others. If None, returns an integer range between 0 and len(probs)-1. |
| |
|
| | Returns |
| | -------- |
| | float int or str |
| | The chosen action. |
| |
|
| | Examples |
| | ---------- |
| | >>> for _ in range(5): |
| | >>> a = choice_action_by_probs([0.2, 0.4, 0.4]) |
| | >>> print(a) |
| | 0 |
| | 1 |
| | 1 |
| | 2 |
| | 1 |
| | >>> for _ in range(3): |
| | >>> a = choice_action_by_probs([0.5, 0.5], ['a', 'b']) |
| | >>> print(a) |
| | a |
| | b |
| | b |
| | sentences: |
| | - >- |
| | search_document: def from_keypoint_image(image, if_not_found_coords={"x": |
| | -1, "y": -1}, threshold=1, nb_channels=None): # pylint: |
| | disable=locally-disabled, dangerous-default-value, line-too-long |
| | """ |
| | Converts an image generated by ``to_keypoint_image()`` back to a KeypointsOnImage object. |
| | |
| | Parameters |
| | ---------- |
| | image : (H,W,N) ndarray |
| | The keypoints image. N is the number of keypoints. |
| |
|
| | if_not_found_coords : tuple or list or dict or None, optional |
| | Coordinates to use for keypoints that cannot be found in `image`. |
| | If this is a list/tuple, it must have two integer values. |
| | If it is a dictionary, it must have the keys ``x`` and ``y`` with |
| | each containing one integer value. |
| | If this is None, then the keypoint will not be added to the final |
| | KeypointsOnImage object. |
| |
|
| | threshold : int, optional |
| | The search for keypoints works by searching for the argmax in |
| | each channel. This parameters contains the minimum value that |
| | the max must have in order to be viewed as a keypoint. |
| |
|
| | nb_channels : None or int, optional |
| | Number of channels of the image on which the keypoints are placed. |
| | Some keypoint augmenters require that information. |
| | If set to None, the keypoint's shape will be set |
| | to ``(height, width)``, otherwise ``(height, width, nb_channels)``. |
| |
|
| | Returns |
| | ------- |
| | out : KeypointsOnImage |
| | The extracted keypoints. |
| |
|
| | """ |
| | ia.do_assert(len(image.shape) == 3) |
| | height, width, nb_keypoints = image.shape |
| | |
| | drop_if_not_found = False |
| | if if_not_found_coords is None: |
| | drop_if_not_found = True |
| | if_not_found_x = -1 |
| | if_not_found_y = -1 |
| | elif isinstance(if_not_found_coords, (tuple, list)): |
| | ia.do_assert(len(if_not_found_coords) == 2) |
| | if_not_found_x = if_not_found_coords[0] |
| | if_not_found_y = if_not_found_coords[1] |
| | elif isinstance(if_not_found_coords, dict): |
| | if_not_found_x = if_not_found_coords["x"] |
| | if_not_found_y = if_not_found_coords["y"] |
| | else: |
| | raise Exception("Expected if_not_found_coords to be None or tuple or list or dict, got %s." % ( |
| | type(if_not_found_coords),)) |
| |
|
| | keypoints = [] |
| | for i in sm.xrange(nb_keypoints): |
| | maxidx_flat = np.argmax(image[..., i]) |
| | maxidx_ndim = np.unravel_index(maxidx_flat, (height, width)) |
| | found = (image[maxidx_ndim[0], maxidx_ndim[1], i] >= threshold) |
| | if found: |
| | keypoints.append(Keypoint(x=maxidx_ndim[1], y=maxidx_ndim[0])) |
| | else: |
| | if drop_if_not_found: |
| | pass |
| | else: |
| | keypoints.append(Keypoint(x=if_not_found_x, y=if_not_found_y)) |
| |
|
| | out_shape = (height, width) |
| | if nb_channels is not None: |
| | out_shape += (nb_channels,) |
| | return KeypointsOnImage(keypoints, shape=out_shape) |
| | - >- |
| | search_document: def choice_action_by_probs(probs=(0.5, 0.5), |
| | action_list=None): |
| | """Choice and return an an action by given the action probability distribution. |
| | |
| | Parameters |
| | ------------ |
| | probs : list of float. |
| | The probability distribution of all actions. |
| | action_list : None or a list of int or others |
| | A list of action in integer, string or others. If None, returns an integer range between 0 and len(probs)-1. |
| |
|
| | Returns |
| | -------- |
| | float int or str |
| | The chosen action. |
| |
|
| | Examples |
| | ---------- |
| | >>> for _ in range(5): |
| | >>> a = choice_action_by_probs([0.2, 0.4, 0.4]) |
| | >>> print(a) |
| | 0 |
| | 1 |
| | 1 |
| | 2 |
| | 1 |
| | >>> for _ in range(3): |
| | >>> a = choice_action_by_probs([0.5, 0.5], ['a', 'b']) |
| | >>> print(a) |
| | a |
| | b |
| | b |
| |
|
| | """ |
| | if action_list is None: |
| | n_action = len(probs) |
| | action_list = np.arange(n_action) |
| | else: |
| | if len(action_list) != len(probs): |
| | raise Exception("number of actions should equal to number of probabilities.") |
| | return np.random.choice(action_list, p=probs) |
| | - |- |
| | search_document: def __validateExperimentControl(self, control): |
| | """ Validates control dictionary for the experiment context""" |
| | # Validate task list |
| | taskList = control.get('tasks', None) |
| | if taskList is not None: |
| | taskLabelsList = [] |
| | |
| | for task in taskList: |
| | validateOpfJsonValue(task, "opfTaskSchema.json") |
| | validateOpfJsonValue(task['taskControl'], "opfTaskControlSchema.json") |
| |
|
| | taskLabel = task['taskLabel'] |
| |
|
| | assert isinstance(taskLabel, types.StringTypes), \ |
| | "taskLabel type: %r" % type(taskLabel) |
| | assert len(taskLabel) > 0, "empty string taskLabel not is allowed" |
| |
|
| | taskLabelsList.append(taskLabel.lower()) |
| |
|
| | taskLabelDuplicates = filter(lambda x: taskLabelsList.count(x) > 1, |
| | taskLabelsList) |
| | assert len(taskLabelDuplicates) == 0, \ |
| | "Duplcate task labels are not allowed: %s" % taskLabelDuplicates |
| |
|
| | return |
| | - source_sentence: |- |
| | search_query: Augment endlessly images in the source queue. |
| | |
| | This is a worker function for that endlessly queries the source queue (input batches), |
| | augments batches in it and sends the result to the output queue. |
| | sentences: |
| | - >- |
| | search_document: def _augment_images_worker(self, augseq, queue_source, |
| | queue_result, seedval): |
| | """ |
| | Augment endlessly images in the source queue. |
| | |
| | This is a worker function for that endlessly queries the source queue (input batches), |
| | augments batches in it and sends the result to the output queue. |
| |
|
| | """ |
| | np.random.seed(seedval) |
| | random.seed(seedval) |
| | augseq.reseed(seedval) |
| | ia.seed(seedval) |
| | |
| | loader_finished = False |
| | |
| | while not loader_finished: |
| | # wait for a new batch in the source queue and load it |
| | try: |
| | batch_str = queue_source.get(timeout=0.1) |
| | batch = pickle.loads(batch_str) |
| | if batch is None: |
| | loader_finished = True |
| | # put it back in so that other workers know that the loading queue is finished |
| | queue_source.put(pickle.dumps(None, protocol=-1)) |
| | else: |
| | batch_aug = augseq.augment_batch(batch) |
| | |
| | # send augmented batch to output queue |
| | batch_str = pickle.dumps(batch_aug, protocol=-1) |
| | queue_result.put(batch_str) |
| | except QueueEmpty: |
| | time.sleep(0.01) |
| | |
| | queue_result.put(pickle.dumps(None, protocol=-1)) |
| | time.sleep(0.01) |
| | - |- |
| | search_document: def show_perf_attrib_stats(returns, |
| | positions, |
| | factor_returns, |
| | factor_loadings, |
| | transactions=None, |
| | pos_in_dollars=True): |
| | """ |
| | Calls `perf_attrib` using inputs, and displays outputs using |
| | `utils.print_table`. |
| | """ |
| | risk_exposures, perf_attrib_data = perf_attrib( |
| | returns, |
| | positions, |
| | factor_returns, |
| | factor_loadings, |
| | transactions, |
| | pos_in_dollars=pos_in_dollars, |
| | ) |
| | |
| | perf_attrib_stats, risk_exposure_stats =\ |
| | create_perf_attrib_stats(perf_attrib_data, risk_exposures) |
| | |
| | percentage_formatter = '{:.2%}'.format |
| | float_formatter = '{:.2f}'.format |
| | |
| | summary_stats = perf_attrib_stats.loc[['Annualized Specific Return', |
| | 'Annualized Common Return', |
| | 'Annualized Total Return', |
| | 'Specific Sharpe Ratio']] |
| | |
| | # Format return rows in summary stats table as percentages. |
| | for col_name in ( |
| | 'Annualized Specific Return', |
| | 'Annualized Common Return', |
| | 'Annualized Total Return', |
| | ): |
| | summary_stats[col_name] = percentage_formatter(summary_stats[col_name]) |
| | |
| | # Display sharpe to two decimal places. |
| | summary_stats['Specific Sharpe Ratio'] = float_formatter( |
| | summary_stats['Specific Sharpe Ratio'] |
| | ) |
| | |
| | print_table(summary_stats, name='Summary Statistics') |
| | |
| | print_table( |
| | risk_exposure_stats, |
| | name='Exposures Summary', |
| | # In exposures table, format exposure column to 2 decimal places, and |
| | # return columns as percentages. |
| | formatters={ |
| | 'Average Risk Factor Exposure': float_formatter, |
| | 'Annualized Return': percentage_formatter, |
| | 'Cumulative Return': percentage_formatter, |
| | }, |
| | ) |
| | - >- |
| | search_document: def binary_cross_entropy(output, target, epsilon=1e-8, |
| | name='bce_loss'): |
| | """Binary cross entropy operation. |
| |
|
| | Parameters |
| | ---------- |
| | output : Tensor |
| | Tensor with type of `float32` or `float64`. |
| | target : Tensor |
| | The target distribution, format the same with `output`. |
| | epsilon : float |
| | A small value to avoid output to be zero. |
| | name : str |
| | An optional name to attach to this function. |
| |
|
| | References |
| | ----------- |
| | - `ericjang-DRAW <https://github.com/ericjang/draw/blob/master/draw.py#L73>`__ |
| |
|
| | """ |
| | # with ops.op_scope([output, target], name, "bce_loss") as name: |
| | |
| | |
| |
|
| | |
| | return tf.reduce_mean( |
| | tf.reduce_sum(-(target * tf.log(output + epsilon) + (1. - target) * tf.log(1. - output + epsilon)), axis=1), |
| | name=name |
| | ) |
| | - source_sentence: 'search_query: episode_batch: array(batch_size x (T or T+1) x dim_key)' |
| | sentences: |
| | - |- |
| | search_document: def get_txn_vol(transactions): |
| | """ |
| | Extract daily transaction data from set of transaction objects. |
| | |
| | Parameters |
| | ---------- |
| | transactions : pd.DataFrame |
| | Time series containing one row per symbol (and potentially |
| | duplicate datetime indices) and columns for amount and |
| | price. |
| |
|
| | Returns |
| | ------- |
| | pd.DataFrame |
| | Daily transaction volume and number of shares. |
| | - See full explanation in tears.create_full_tear_sheet. |
| | """ |
| | |
| | txn_norm = transactions.copy() |
| | txn_norm.index = txn_norm.index.normalize() |
| | amounts = txn_norm.amount.abs() |
| | prices = txn_norm.price |
| | values = amounts * prices |
| | daily_amounts = amounts.groupby(amounts.index).sum() |
| | daily_values = values.groupby(values.index).sum() |
| | daily_amounts.name = "txn_shares" |
| | daily_values.name = "txn_volume" |
| | return pd.concat([daily_values, daily_amounts], axis=1) |
| | - |- |
| | search_document: def deepcopy(self, exterior=None, label=None): |
| | """ |
| | Create a deep copy of the Polygon object. |
| | |
| | Parameters |
| | ---------- |
| | exterior : list of Keypoint or list of tuple or (N,2) ndarray, optional |
| | List of points defining the polygon. See `imgaug.Polygon.__init__` for details. |
| |
|
| | label : None or str |
| | If not None, then the label of the copied object will be set to this value. |
| |
|
| | Returns |
| | ------- |
| | imgaug.Polygon |
| | Deep copy. |
| |
|
| | """ |
| | return Polygon( |
| | exterior=np.copy(self.exterior) if exterior is None else exterior, |
| | label=self.label if label is None else label |
| | ) |
| | - |- |
| | search_document: def store_episode(self, episode_batch): |
| | """episode_batch: array(batch_size x (T or T+1) x dim_key) |
| | """ |
| | batch_sizes = [len(episode_batch[key]) for key in episode_batch.keys()] |
| | assert np.all(np.array(batch_sizes) == batch_sizes[0]) |
| | batch_size = batch_sizes[0] |
| | |
| | with self.lock: |
| | idxs = self._get_storage_idx(batch_size) |
| | |
| | # load inputs into buffers |
| | for key in self.buffers.keys(): |
| | self.buffers[key][idxs] = episode_batch[key] |
| | |
| | self.n_transitions_stored += batch_size * self.T |
| | pipeline_tag: sentence-similarity |
| | library_name: sentence-transformers |
| | metrics: |
| | - cosine_accuracy@1 |
| | - cosine_accuracy@3 |
| | - cosine_accuracy@5 |
| | - cosine_accuracy@10 |
| | - cosine_precision@1 |
| | - cosine_precision@3 |
| | - cosine_precision@5 |
| | - cosine_precision@10 |
| | - cosine_recall@1 |
| | - cosine_recall@3 |
| | - cosine_recall@5 |
| | - cosine_recall@10 |
| | - cosine_ndcg@10 |
| | - cosine_mrr@10 |
| | - cosine_map@100 |
| | model-index: |
| | - name: SentenceTransformer |
| | results: |
| | - task: |
| | type: information-retrieval |
| | name: Information Retrieval |
| | dataset: |
| | name: codesearchnet val |
| | type: codesearchnet_val |
| | metrics: |
| | - type: cosine_accuracy@1 |
| | value: 0.8926 |
| | name: Cosine Accuracy@1 |
| | - type: cosine_accuracy@3 |
| | value: 0.9453666666666667 |
| | name: Cosine Accuracy@3 |
| | - type: cosine_accuracy@5 |
| | value: 0.9545 |
| | name: Cosine Accuracy@5 |
| | - type: cosine_accuracy@10 |
| | value: 0.9637666666666667 |
| | name: Cosine Accuracy@10 |
| | - type: cosine_precision@1 |
| | value: 0.8926 |
| | name: Cosine Precision@1 |
| | - type: cosine_precision@3 |
| | value: 0.31512222222222214 |
| | name: Cosine Precision@3 |
| | - type: cosine_precision@5 |
| | value: 0.19090000000000004 |
| | name: Cosine Precision@5 |
| | - type: cosine_precision@10 |
| | value: 0.09637666666666667 |
| | name: Cosine Precision@10 |
| | - type: cosine_recall@1 |
| | value: 0.8926 |
| | name: Cosine Recall@1 |
| | - type: cosine_recall@3 |
| | value: 0.9453666666666667 |
| | name: Cosine Recall@3 |
| | - type: cosine_recall@5 |
| | value: 0.9545 |
| | name: Cosine Recall@5 |
| | - type: cosine_recall@10 |
| | value: 0.9637666666666667 |
| | name: Cosine Recall@10 |
| | - type: cosine_ndcg@10 |
| | value: 0.9313201256618757 |
| | name: Cosine Ndcg@10 |
| | - type: cosine_mrr@10 |
| | value: 0.9206047883597835 |
| | name: Cosine Mrr@10 |
| | - type: cosine_map@100 |
| | value: 0.9212040995599341 |
| | name: Cosine Map@100 |
| | license: mit |
| | datasets: |
| | - code-search-net/code_search_net |
| | language: |
| | - en |
| | base_model: |
| | - answerdotai/ModernBERT-base |
| | --- |
| | |
| | # SentenceTransformer |
| |
|
This is an [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) model trained on the [code_search_net](https://huggingface.co/datasets/code-search-net/code_search_net) dataset with
[<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) using in-batch negatives. The model can be used for code retrieval and reranking.
| |
|
## Performance on code retrieval benchmarks
| |
|
| | **RTEB** |
| |
|
As of 14.10.2025, the model ranks **6th** on the RTEB leaderboard among models with <500M parameters:
<details>
<summary>Click to expand</summary>
| | <figure> |
| | <img src="Rteb_top.jpg"> |
| | </figure> |
| | </details> |
| | |
Performance per task:

| Model | AppsRetrieval | Code1Retrieval (Private) | DS1000Retrieval | FreshStackRetrieval | HumanEvalRetrieval | JapaneseCode1Retrieval (Private) | MBPPRetrieval | WikiSQLRetrieval |
|-------|---------------|--------------------------|-----------------|---------------------|--------------------|----------------------------------|---------------|------------------|
| english_code_retriever | 8.04 | 75.36 | 32.42 | 18.30 | 71.82 | 46.59 | 72.06 | 87.92 |
| |
|
**COIR**:

| Model | AppsRetrieval | COIRCodeSearchNetRetrieval | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCCRetrieval | CodeTransOceanContest | CodeTransOceanDL | CosQA | StackOverflowQA | SyntheticText2SQL |
|-------|---------------|----------------------------|----------------|----------------|--------------------------|-----------------------|------------------|-------|-----------------|-------------------|
| english_code_retriever | 8.04 | 74.23 | 44.01 | 57.79 | 42.71 | 60.68 | 35.16 | 25.56 | 56.53 | 42.79 |
| |
|
More information can be found on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
| |
|
| |
|
| | ## Model Details |
| |
|
| | ### Model Description |
- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Pooling:** Mean pooling
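
A quick check of the properties above (a minimal sketch; the example query is illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("fyaronskiy/english_code_retriever")
print(model.max_seq_length)  # 8192
embedding = model.encode("search_query: read a CSV file into a DataFrame")
print(embedding.shape)       # (768,)
```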
| |
|
| | ## Usage |
| |
|
The model is easy to use with Sentence Transformers.
| |
|
Note that the model was trained with the prefixes 'search_query: ' for queries and 'search_document: ' for code documents,
so encoding with these prefixes improves retrieval quality.
| |
|
| | ```python |
| | import torch |
| | from sentence_transformers import SentenceTransformer, util |
| | |
| | device = "cuda" if torch.cuda.is_available() else "cpu" |
| | model = SentenceTransformer("fyaronskiy/english_code_retriever").to(device) |
| | |
| | queries = [ |
| | "Write a Python function that calculates the factorial of a number recursively.", |
| | "How to check if a given string reads the same backward and forward?", |
| | "Combine two sorted lists into a single sorted list." |
| | ] |
| | |
| | corpus = [ |
| | # Relevant for Q1 |
| | """def factorial(n): |
| | if n == 0: |
| | return 1 |
| | return n * factorial(n-1)""", |
| | |
| | # Hard negative for Q1 (similar structure but computes sum) |
| | """def sum_recursive(n): |
| | if n == 0: |
| | return 0 |
| | return n + sum_recursive(n-1)""", |
| | |
| | # Relevant for Q2 |
| | """def is_palindrome(s: str) -> bool: |
| | s = s.lower().replace(" ", "") |
| | return s == s[::-1]""", |
| | |
| | # Hard negative for Q2 (string reverse but not palindrome check) |
| | """def reverse_string(s: str) -> str: |
| | return s[::-1]""", |
| | |
| | # Relevant for Q3 |
| | """def merge_sorted_lists(a, b): |
| | result = [] |
| | i = j = 0 |
| | while i < len(a) and j < len(b): |
| | if a[i] < b[j]: |
| | result.append(a[i]) |
| | i += 1 |
| | else: |
| | result.append(b[j]) |
| | j += 1 |
| | result.extend(a[i:]) |
| | result.extend(b[j:]) |
| | return result""", |
| | |
| | # Hard negative for Q3 (similar iteration but sums two lists elementwise) |
| | """def add_lists(a, b): |
| | return [x + y for x, y in zip(a, b)]""" |
| | ] |
| | |
| | |
doc_embeddings = model.encode(corpus, prompt_name='search_document', convert_to_tensor=True, device=device)
query_embeddings = model.encode(queries, prompt_name='search_query', convert_to_tensor=True, device=device)
| | |
| | # Compute cosine similarity and retrieve top-1 |
| | for i, query in enumerate(queries): |
| | scores = util.cos_sim(query_embeddings[i], doc_embeddings)[0] |
| | best_idx = torch.argmax(scores).item() |
| | print(f"\n Query {i+1}: {query}") |
| | print(f"Top-1 match (score={scores[best_idx]:.4f}):\n{corpus[best_idx]}") |
| | |
| | ''' Query 1: Write a Python function that calculates the factorial of a number recursively. |
| | Top-1 match (score=0.5983): |
| | def factorial(n): |
| | if n == 0: |
| | return 1 |
| | return n * factorial(n-1) |
| | |
| | Query 2: How to check if a given string reads the same backward and forward? |
| | Top-1 match (score=0.4925): |
| | def is_palindrome(s: str) -> bool: |
| | s = s.lower().replace(" ", "") |
| | return s == s[::-1] |
| | |
| | Query 3: Combine two sorted lists into a single sorted list. |
| | Top-1 match (score=0.6524): |
| | def merge_sorted_lists(a, b): |
| | result = [] |
| | i = j = 0 |
| | while i < len(a) and j < len(b): |
| | if a[i] < b[j]: |
| | result.append(a[i]) |
| | i += 1 |
| | else: |
| | result.append(b[j]) |
| | j += 1 |
| | result.extend(a[i:]) |
| | result.extend(b[j:]) |
| | return result |
| | ''' |
| | ``` |
| |
|
The model can also be used directly with Transformers:
| | ```python |
| | import torch |
| | from transformers import AutoTokenizer, AutoModel |
| | |
| | device = "cuda" if torch.cuda.is_available() else "cpu" |
| | |
| | model_name = "fyaronskiy/english_code_retriever" |
| | tokenizer = AutoTokenizer.from_pretrained(model_name) |
| | model = AutoModel.from_pretrained(model_name).to(device) |
| | model.eval() |
| | |
| | |
| | |
| | queries = [ |
| | "function of addition of two numbers", |
| | "finding the maximum element in an array", |
| | "sorting a list in ascending order" |
| | ] |
| | |
| | corpus = [ |
| | "def add(a, b): return a + b", |
| | "def find_max(arr): return max(arr)", |
| | "def sort_list(lst): return sorted(lst)" |
| | ] |
| | |
| | def mean_pooling(model_output, attention_mask): |
| | token_embeddings = model_output[0] # (batch_size, seq_len, hidden_dim) |
| | input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() |
| | return (token_embeddings * input_mask_expanded).sum(1) / input_mask_expanded.sum(1).clamp(min=1e-9) |
| | |
| | def encode_texts(texts): |
| | encoded = tokenizer( |
| | texts, |
| | padding=True, |
| | truncation=True, |
| | return_tensors="pt", |
| | max_length=8192 |
| | ).to(device) |
| | with torch.no_grad(): |
| | model_output = model(**encoded) |
| | return mean_pooling(model_output, encoded["attention_mask"]) |
| | |
| | doc_embeddings = encode_texts(["search_document: " + document for document in corpus]) |
| | query_embeddings = encode_texts(["search_query: " + query for query in queries]) |
| | |
| | # Normalize embeddings for cosine similarity |
| | doc_embeddings = torch.nn.functional.normalize(doc_embeddings, p=2, dim=1) |
| | query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1) |
| | |
| | # Compute cosine similarity and retrieve top-1 |
| | for i, query in enumerate(queries): |
| | scores = torch.matmul(query_embeddings[i], doc_embeddings.T) |
| | best_idx = torch.argmax(scores).item() |
| | print(f"\n Query {i+1}: {query}") |
| | print(f"Top-1 match (score={scores[best_idx]:.4f}):\n{corpus[best_idx]}") |
| | |
| | ''' Query 1: function of addition of two numbers |
| | Top-1 match (score=0.6047): |
| | def add(a, b): return a + b |
| | |
| | Query 2: finding the maximum element in an array |
| | Top-1 match (score=0.7772): |
| | def find_max(arr): return max(arr) |
| | |
| | Query 3: sorting a list in ascending order |
| | Top-1 match (score=0.7389): |
| | def sort_list(lst): return sorted(lst) |
| | ''' |
| | ``` |
| |
|
| | <!-- |
| | ### Downstream Usage (Sentence Transformers) |
| |
|
| | You can finetune this model on your own dataset. |
| |
|
| | <details><summary>Click to expand</summary> |
| |
|
| | </details> |
| | --> |
| |
|
| | <!-- |
| | ### Out-of-Scope Use |
| |
|
| | *List how the model may foreseeably be misused and address what users ought not to do with the model.* |
| | --> |
| |
|
| | ## Evaluation |
| |
|
| | ### Metrics |
| |
|
| | #### Information Retrieval |
| |
|
* Dataset: validation split of code_search_net (evaluator name: `codesearchnet_val`)
| | * Size: 30,000 evaluation samples |
| |
|
| | * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) |
| |
|
| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.8926     |
| cosine_accuracy@3   | 0.9454     |
| cosine_accuracy@5   | 0.9545     |
| cosine_accuracy@10  | 0.9638     |
| cosine_precision@1  | 0.8926     |
| cosine_precision@3  | 0.3151     |
| cosine_precision@5  | 0.1909     |
| cosine_precision@10 | 0.0964     |
| cosine_recall@1     | 0.8926     |
| cosine_recall@3     | 0.9454     |
| cosine_recall@5     | 0.9545     |
| cosine_recall@10    | 0.9638     |
| **cosine_ndcg@10**  | **0.9313** |
| cosine_mrr@10       | 0.9206     |
| cosine_map@100      | 0.9212     |
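
A minimal sketch of how a comparable evaluation can be run with `InformationRetrievalEvaluator` (the toy queries, corpus, and ids below are illustrative, not the exact validation setup behind the numbers above):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("fyaronskiy/english_code_retriever")

# In practice, queries/corpus are built from the CodeSearchNet validation split:
# one docstring query per example, with its function code as the relevant document.
queries = {"q1": "search_query: Finds the top long, short, and absolute positions."}
corpus = {
    "d1": "search_document: def get_top_long_short_abs(positions, top=10): ...",
    "d2": "search_document: def get_txn_vol(transactions): ...",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="codesearchnet_val",
)
print(evaluator(model))  # metrics dict: accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100
```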
| | |
| | <!-- |
| | ## Bias, Risks and Limitations |
| | |
| | *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.* |
| | --> |
| | |
| | <!-- |
| | ### Recommendations |
| | |
| | *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.* |
| | --> |
| | |
| | ## Training Details |
| | |
| | ### Training Dataset |
| | |
| | #### code_search_net |
| | |
* Dataset: train split of code_search_net
* Size: 1,880,853 training samples
* Queries are function docstrings in English; the relevant document is the corresponding function's code
* Negatives were sampled in-batch
* Distribution of programming languages:
| | |
| |  |
| | |
| | |
| | ### Training Hyperparameters |
| | #### Non-Default Hyperparameters |
| | |
| | - `batch_size`: 64 |
| | - `learning_rate`: 2e-05 |
| | - `num_epochs`: 2 |
| | - `warmup_ratio`: 0.1 |
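
For context, a minimal training sketch that matches these hyperparameters (the dataset config, column names, and the mapping of `batch_size` to `per_device_train_batch_size` are assumptions, not the original training script):

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("answerdotai/ModernBERT-base")

# Build (anchor, positive) pairs: docstring query -> function code, using the training prefixes.
raw = load_dataset("code-search-net/code_search_net", "all", split="train")
pairs = raw.map(
    lambda ex: {
        "anchor": "search_query: " + ex["func_documentation_string"],
        "positive": "search_document: " + ex["func_code_string"],
    },
    remove_columns=raw.column_names,
)

loss = MultipleNegativesRankingLoss(model)  # in-batch negatives
args = SentenceTransformerTrainingArguments(
    output_dir="english_code_retriever",
    num_train_epochs=2,
    per_device_train_batch_size=64,
    learning_rate=2e-5,
    warmup_ratio=0.1,
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=pairs, loss=loss)
trainer.train()
```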
| | |
| | |
| | |
| | ### Framework Versions |
| | - Python: 3.10.11 |
| | - Sentence Transformers: 5.1.0 |
| | - Transformers: 4.52.3 |
| | - PyTorch: 2.6.0+cu124 |
| | - Accelerate: 1.10.0 |
| | - Datasets: 3.6.0 |
| | - Tokenizers: 0.21.4 |
| | |
| | |
| | <!-- |
| | ## Glossary |
| | |
| | *Clearly define terms in order to be accessible across audiences.* |
| | --> |
| | |
| | <!-- |
| | ## Model Card Authors |
| | |
| | *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.* |
| | --> |
| | |
| | <!-- |
| | ## Model Card Contact |
| | |
| | *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.* |
| | --> |