--- tags: - sentence-transformers - code - code-retrieval - retrieval-augmented-generation - rag - python - java - go - php - sentence-similarity - feature-extraction - generated_from_trainer - loss:MultipleNegativesRankingLoss widget: - source_sentence: |- search_query: Finds the top long, short, and absolute positions. Parameters ---------- positions : pd.DataFrame The positions that the strategy takes over time. top : int, optional How many of each to find (default 10). Returns ------- df_top_long : pd.DataFrame Top long positions. df_top_short : pd.DataFrame Top short positions. df_top_abs : pd.DataFrame Top absolute positions. sentences: - >- search_document: def symmetric_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_counts_threshold=1e-8): ''' perform symmetric EMA (exponential moving average) smoothing and resampling to an even grid with n points. Does not do extrapolation, so we assume xolds[0] <= low && high <= xolds[-1] Arguments: xolds: array or list - x values of data. Needs to be sorted in ascending order yolds: array of list - y values of data. Has to have the same length as xolds low: float - min value of the new x grid. By default equals to xolds[0] high: float - max value of the new x grid. By default equals to xolds[-1] n: int - number of points in new x grid decay_steps: float - EMA decay factor, expressed in new x grid steps. 
low_counts_threshold: float or int - y values with counts less than this value will be set to NaN Returns: tuple sum_ys, count_ys where xs - array with new x grid ys - array of EMA of y at each point of the new x grid count_ys - array of EMA of y counts at each point of the new x grid ''' xs, ys1, count_ys1 = one_sided_ema(xolds, yolds, low, high, n, decay_steps, low_counts_threshold=0) _, ys2, count_ys2 = one_sided_ema(-xolds[::-1], yolds[::-1], -high, -low, n, decay_steps, low_counts_threshold=0) ys2 = ys2[::-1] count_ys2 = count_ys2[::-1] count_ys = count_ys1 + count_ys2 ys = (ys1 * count_ys1 + ys2 * count_ys2) / count_ys ys[count_ys < low_counts_threshold] = np.nan return xs, ys, count_ys - |- search_document: def project(self, from_shape, to_shape): """ Project the polygon onto an image with different shape. The relative coordinates of all points remain the same. E.g. a point at (x=20, y=20) on an image (width=100, height=200) will be projected on a new image (width=200, height=100) to (x=40, y=10). This is intended for cases where the original image is resized. It cannot be used for more complex changes (e.g. padding, cropping). Parameters ---------- from_shape : tuple of int Shape of the original image. (Before resize.) to_shape : tuple of int Shape of the new image. (After resize.) Returns ------- imgaug.Polygon Polygon object with new coordinates. """ if from_shape[0:2] == to_shape[0:2]: return self.copy() ls_proj = self.to_line_string(closed=False).project( from_shape, to_shape) return self.copy(exterior=ls_proj.coords) - |- search_document: def get_top_long_short_abs(positions, top=10): """ Finds the top long, short, and absolute positions. Parameters ---------- positions : pd.DataFrame The positions that the strategy takes over time. top : int, optional How many of each to find (default 10). Returns ------- df_top_long : pd.DataFrame Top long positions. df_top_short : pd.DataFrame Top short positions. df_top_abs : pd.DataFrame Top absolute positions. 
""" positions = positions.drop('cash', axis='columns') df_max = positions.max() df_min = positions.min() df_abs_max = positions.abs().max() df_top_long = df_max[df_max > 0].nlargest(top) df_top_short = df_min[df_min < 0].nsmallest(top) df_top_abs = df_abs_max.nlargest(top) return df_top_long, df_top_short, df_top_abs - source_sentence: |- search_query: Draw text on an image. This uses by default DejaVuSans as its font, which is included in this library. dtype support:: * ``uint8``: yes; fully tested * ``uint16``: no * ``uint32``: no * ``uint64``: no * ``int8``: no * ``int16``: no * ``int32``: no * ``int64``: no * ``float16``: no * ``float32``: yes; not tested * ``float64``: no * ``float128``: no * ``bool``: no TODO check if other dtypes could be enabled Parameters ---------- img : (H,W,3) ndarray The image array to draw text on. Expected to be of dtype uint8 or float32 (value range 0.0 to 255.0). y : int x-coordinate of the top left corner of the text. x : int y- coordinate of the top left corner of the text. text : str The text to draw. color : iterable of int, optional Color of the text to draw. For RGB-images this is expected to be an RGB color. size : int, optional Font size of the text to draw. Returns ------- img_np : (H,W,3) ndarray Input image with text drawn on it. sentences: - >- search_document: def cross_entropy_seq_with_mask(logits, target_seqs, input_mask, return_details=False, name=None): """Returns the expression of cross-entropy of two sequences, implement softmax internally. Normally be used for Dynamic RNN with Synced sequence input and output. Parameters ----------- logits : Tensor 2D tensor with shape of [batch_size * ?, n_classes], `?` means dynamic IDs for each example. - Can be get from `DynamicRNNLayer` by setting ``return_seq_2d`` to `True`. target_seqs : Tensor int of tensor, like word ID. [batch_size, ?], `?` means dynamic IDs for each example. 
input_mask : Tensor The mask to compute loss, it has the same size with `target_seqs`, normally 0 or 1. return_details : boolean Whether to return detailed losses. - If False (default), only returns the loss. - If True, returns the loss, losses, weights and targets (see source code). Examples -------- >>> batch_size = 64 >>> vocab_size = 10000 >>> embedding_size = 256 >>> input_seqs = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="input") >>> target_seqs = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="target") >>> input_mask = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="mask") >>> net = tl.layers.EmbeddingInputlayer( ... inputs = input_seqs, ... vocabulary_size = vocab_size, ... embedding_size = embedding_size, ... name = 'seq_embedding') >>> net = tl.layers.DynamicRNNLayer(net, ... cell_fn = tf.contrib.rnn.BasicLSTMCell, ... n_hidden = embedding_size, ... dropout = (0.7 if is_train else None), ... sequence_length = tl.layers.retrieve_seq_length_op2(input_seqs), ... return_seq_2d = True, ... name = 'dynamicrnn') >>> print(net.outputs) (?, 256) >>> net = tl.layers.DenseLayer(net, n_units=vocab_size, name="output") >>> print(net.outputs) (?, 10000) >>> loss = tl.cost.cross_entropy_seq_with_mask(net.outputs, target_seqs, input_mask) """ targets = tf.reshape(target_seqs, [-1]) # to one vector weights = tf.to_float(tf.reshape(input_mask, [-1])) # to one vector like targets losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=targets, name=name) * weights # losses = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=targets, name=name)) # for TF1.0 and others loss = tf.divide( tf.reduce_sum(losses), # loss from mask. reduce_sum before element-wise mul with mask !! 
tf.reduce_sum(weights), name="seq_loss_with_mask" ) if return_details: return loss, losses, weights, targets else: return loss - |- search_document: def pickle_load(path, compression=False): """Unpickle a possible compressed pickle. Parameters ---------- path: str path to the output file compression: bool if true assumes that pickle was compressed when created and attempts decompression. Returns ------- obj: object the unpickled object """ if compression: with zipfile.ZipFile(path, "r", compression=zipfile.ZIP_DEFLATED) as myzip: with myzip.open("data") as f: return pickle.load(f) else: with open(path, "rb") as f: return pickle.load(f) - |- search_document: def draw_text(img, y, x, text, color=(0, 255, 0), size=25): """ Draw text on an image. This uses by default DejaVuSans as its font, which is included in this library. dtype support:: * ``uint8``: yes; fully tested * ``uint16``: no * ``uint32``: no * ``uint64``: no * ``int8``: no * ``int16``: no * ``int32``: no * ``int64``: no * ``float16``: no * ``float32``: yes; not tested * ``float64``: no * ``float128``: no * ``bool``: no TODO check if other dtypes could be enabled Parameters ---------- img : (H,W,3) ndarray The image array to draw text on. Expected to be of dtype uint8 or float32 (value range 0.0 to 255.0). y : int x-coordinate of the top left corner of the text. x : int y- coordinate of the top left corner of the text. text : str The text to draw. color : iterable of int, optional Color of the text to draw. For RGB-images this is expected to be an RGB color. size : int, optional Font size of the text to draw. Returns ------- img_np : (H,W,3) ndarray Input image with text drawn on it. 
""" do_assert(img.dtype in [np.uint8, np.float32]) input_dtype = img.dtype if img.dtype == np.float32: img = img.astype(np.uint8) img = PIL_Image.fromarray(img) font = PIL_ImageFont.truetype(DEFAULT_FONT_FP, size) context = PIL_ImageDraw.Draw(img) context.text((x, y), text, fill=tuple(color), font=font) img_np = np.asarray(img) # PIL/asarray returns read only array if not img_np.flags["WRITEABLE"]: try: # this seems to no longer work with np 1.16 (or was pillow updated?) img_np.setflags(write=True) except ValueError as ex: if "cannot set WRITEABLE flag to True of this array" in str(ex): img_np = np.copy(img_np) if img_np.dtype != input_dtype: img_np = img_np.astype(input_dtype) return img_np - source_sentence: >- search_query: Choice and return an an action by given the action probability distribution. Parameters ------------ probs : list of float. The probability distribution of all actions. action_list : None or a list of int or others A list of action in integer, string or others. If None, returns an integer range between 0 and len(probs)-1. Returns -------- float int or str The chosen action. Examples ---------- >>> for _ in range(5): >>> a = choice_action_by_probs([0.2, 0.4, 0.4]) >>> print(a) 0 1 1 2 1 >>> for _ in range(3): >>> a = choice_action_by_probs([0.5, 0.5], ['a', 'b']) >>> print(a) a b b sentences: - >- search_document: def from_keypoint_image(image, if_not_found_coords={"x": -1, "y": -1}, threshold=1, nb_channels=None): # pylint: disable=locally-disabled, dangerous-default-value, line-too-long """ Converts an image generated by ``to_keypoint_image()`` back to a KeypointsOnImage object. Parameters ---------- image : (H,W,N) ndarray The keypoints image. N is the number of keypoints. if_not_found_coords : tuple or list or dict or None, optional Coordinates to use for keypoints that cannot be found in `image`. If this is a list/tuple, it must have two integer values. 
If it is a dictionary, it must have the keys ``x`` and ``y`` with each containing one integer value. If this is None, then the keypoint will not be added to the final KeypointsOnImage object. threshold : int, optional The search for keypoints works by searching for the argmax in each channel. This parameters contains the minimum value that the max must have in order to be viewed as a keypoint. nb_channels : None or int, optional Number of channels of the image on which the keypoints are placed. Some keypoint augmenters require that information. If set to None, the keypoint's shape will be set to ``(height, width)``, otherwise ``(height, width, nb_channels)``. Returns ------- out : KeypointsOnImage The extracted keypoints. """ ia.do_assert(len(image.shape) == 3) height, width, nb_keypoints = image.shape drop_if_not_found = False if if_not_found_coords is None: drop_if_not_found = True if_not_found_x = -1 if_not_found_y = -1 elif isinstance(if_not_found_coords, (tuple, list)): ia.do_assert(len(if_not_found_coords) == 2) if_not_found_x = if_not_found_coords[0] if_not_found_y = if_not_found_coords[1] elif isinstance(if_not_found_coords, dict): if_not_found_x = if_not_found_coords["x"] if_not_found_y = if_not_found_coords["y"] else: raise Exception("Expected if_not_found_coords to be None or tuple or list or dict, got %s." % ( type(if_not_found_coords),)) keypoints = [] for i in sm.xrange(nb_keypoints): maxidx_flat = np.argmax(image[..., i]) maxidx_ndim = np.unravel_index(maxidx_flat, (height, width)) found = (image[maxidx_ndim[0], maxidx_ndim[1], i] >= threshold) if found: keypoints.append(Keypoint(x=maxidx_ndim[1], y=maxidx_ndim[0])) else: if drop_if_not_found: pass # dont add the keypoint to the result list, i.e. 
drop it else: keypoints.append(Keypoint(x=if_not_found_x, y=if_not_found_y)) out_shape = (height, width) if nb_channels is not None: out_shape += (nb_channels,) return KeypointsOnImage(keypoints, shape=out_shape) - >- search_document: def choice_action_by_probs(probs=(0.5, 0.5), action_list=None): """Choice and return an an action by given the action probability distribution. Parameters ------------ probs : list of float. The probability distribution of all actions. action_list : None or a list of int or others A list of action in integer, string or others. If None, returns an integer range between 0 and len(probs)-1. Returns -------- float int or str The chosen action. Examples ---------- >>> for _ in range(5): >>> a = choice_action_by_probs([0.2, 0.4, 0.4]) >>> print(a) 0 1 1 2 1 >>> for _ in range(3): >>> a = choice_action_by_probs([0.5, 0.5], ['a', 'b']) >>> print(a) a b b """ if action_list is None: n_action = len(probs) action_list = np.arange(n_action) else: if len(action_list) != len(probs): raise Exception("number of actions should equal to number of probabilities.") return np.random.choice(action_list, p=probs) - |- search_document: def __validateExperimentControl(self, control): """ Validates control dictionary for the experiment context""" # Validate task list taskList = control.get('tasks', None) if taskList is not None: taskLabelsList = [] for task in taskList: validateOpfJsonValue(task, "opfTaskSchema.json") validateOpfJsonValue(task['taskControl'], "opfTaskControlSchema.json") taskLabel = task['taskLabel'] assert isinstance(taskLabel, types.StringTypes), \ "taskLabel type: %r" % type(taskLabel) assert len(taskLabel) > 0, "empty string taskLabel not is allowed" taskLabelsList.append(taskLabel.lower()) taskLabelDuplicates = filter(lambda x: taskLabelsList.count(x) > 1, taskLabelsList) assert len(taskLabelDuplicates) == 0, \ "Duplcate task labels are not allowed: %s" % taskLabelDuplicates return - source_sentence: |- search_query: Augment endlessly 
images in the source queue. This is a worker function for that endlessly queries the source queue (input batches), augments batches in it and sends the result to the output queue. sentences: - >- search_document: def _augment_images_worker(self, augseq, queue_source, queue_result, seedval): """ Augment endlessly images in the source queue. This is a worker function for that endlessly queries the source queue (input batches), augments batches in it and sends the result to the output queue. """ np.random.seed(seedval) random.seed(seedval) augseq.reseed(seedval) ia.seed(seedval) loader_finished = False while not loader_finished: # wait for a new batch in the source queue and load it try: batch_str = queue_source.get(timeout=0.1) batch = pickle.loads(batch_str) if batch is None: loader_finished = True # put it back in so that other workers know that the loading queue is finished queue_source.put(pickle.dumps(None, protocol=-1)) else: batch_aug = augseq.augment_batch(batch) # send augmented batch to output queue batch_str = pickle.dumps(batch_aug, protocol=-1) queue_result.put(batch_str) except QueueEmpty: time.sleep(0.01) queue_result.put(pickle.dumps(None, protocol=-1)) time.sleep(0.01) - |- search_document: def show_perf_attrib_stats(returns, positions, factor_returns, factor_loadings, transactions=None, pos_in_dollars=True): """ Calls `perf_attrib` using inputs, and displays outputs using `utils.print_table`. """ risk_exposures, perf_attrib_data = perf_attrib( returns, positions, factor_returns, factor_loadings, transactions, pos_in_dollars=pos_in_dollars, ) perf_attrib_stats, risk_exposure_stats =\ create_perf_attrib_stats(perf_attrib_data, risk_exposures) percentage_formatter = '{:.2%}'.format float_formatter = '{:.2f}'.format summary_stats = perf_attrib_stats.loc[['Annualized Specific Return', 'Annualized Common Return', 'Annualized Total Return', 'Specific Sharpe Ratio']] # Format return rows in summary stats table as percentages. 
for col_name in ( 'Annualized Specific Return', 'Annualized Common Return', 'Annualized Total Return', ): summary_stats[col_name] = percentage_formatter(summary_stats[col_name]) # Display sharpe to two decimal places. summary_stats['Specific Sharpe Ratio'] = float_formatter( summary_stats['Specific Sharpe Ratio'] ) print_table(summary_stats, name='Summary Statistics') print_table( risk_exposure_stats, name='Exposures Summary', # In exposures table, format exposure column to 2 decimal places, and # return columns as percentages. formatters={ 'Average Risk Factor Exposure': float_formatter, 'Annualized Return': percentage_formatter, 'Cumulative Return': percentage_formatter, }, ) - >- search_document: def binary_cross_entropy(output, target, epsilon=1e-8, name='bce_loss'): """Binary cross entropy operation. Parameters ---------- output : Tensor Tensor with type of `float32` or `float64`. target : Tensor The target distribution, format the same with `output`. epsilon : float A small value to avoid output to be zero. name : str An optional name to attach to this function. References ----------- - `ericjang-DRAW `__ """ # with ops.op_scope([output, target], name, "bce_loss") as name: # output = ops.convert_to_tensor(output, name="preds") # target = ops.convert_to_tensor(targets, name="target") # with tf.name_scope(name): return tf.reduce_mean( tf.reduce_sum(-(target * tf.log(output + epsilon) + (1. - target) * tf.log(1. - output + epsilon)), axis=1), name=name ) - source_sentence: 'search_query: episode_batch: array(batch_size x (T or T+1) x dim_key)' sentences: - |- search_document: def get_txn_vol(transactions): """ Extract daily transaction data from set of transaction objects. Parameters ---------- transactions : pd.DataFrame Time series containing one row per symbol (and potentially duplicate datetime indices) and columns for amount and price. Returns ------- pd.DataFrame Daily transaction volume and number of shares. 
- See full explanation in tears.create_full_tear_sheet. """ txn_norm = transactions.copy() txn_norm.index = txn_norm.index.normalize() amounts = txn_norm.amount.abs() prices = txn_norm.price values = amounts * prices daily_amounts = amounts.groupby(amounts.index).sum() daily_values = values.groupby(values.index).sum() daily_amounts.name = "txn_shares" daily_values.name = "txn_volume" return pd.concat([daily_values, daily_amounts], axis=1) - |- search_document: def deepcopy(self, exterior=None, label=None): """ Create a deep copy of the Polygon object. Parameters ---------- exterior : list of Keypoint or list of tuple or (N,2) ndarray, optional List of points defining the polygon. See `imgaug.Polygon.__init__` for details. label : None or str If not None, then the label of the copied object will be set to this value. Returns ------- imgaug.Polygon Deep copy. """ return Polygon( exterior=np.copy(self.exterior) if exterior is None else exterior, label=self.label if label is None else label ) - |- search_document: def store_episode(self, episode_batch): """episode_batch: array(batch_size x (T or T+1) x dim_key) """ batch_sizes = [len(episode_batch[key]) for key in episode_batch.keys()] assert np.all(np.array(batch_sizes) == batch_sizes[0]) batch_size = batch_sizes[0] with self.lock: idxs = self._get_storage_idx(batch_size) # load inputs into buffers for key in self.buffers.keys(): self.buffers[key][idxs] = episode_batch[key] self.n_transitions_stored += batch_size * self.T pipeline_tag: sentence-similarity library_name: sentence-transformers metrics: - cosine_accuracy@1 - cosine_accuracy@3 - cosine_accuracy@5 - cosine_accuracy@10 - cosine_precision@1 - cosine_precision@3 - cosine_precision@5 - cosine_precision@10 - cosine_recall@1 - cosine_recall@3 - cosine_recall@5 - cosine_recall@10 - cosine_ndcg@10 - cosine_mrr@10 - cosine_map@100 model-index: - name: SentenceTransformer results: - task: type: information-retrieval name: Information Retrieval dataset: name: 
codesearchnet val type: codesearchnet_val metrics: - type: cosine_accuracy@1 value: 0.8926 name: Cosine Accuracy@1 - type: cosine_accuracy@3 value: 0.9453666666666667 name: Cosine Accuracy@3 - type: cosine_accuracy@5 value: 0.9545 name: Cosine Accuracy@5 - type: cosine_accuracy@10 value: 0.9637666666666667 name: Cosine Accuracy@10 - type: cosine_precision@1 value: 0.8926 name: Cosine Precision@1 - type: cosine_precision@3 value: 0.31512222222222214 name: Cosine Precision@3 - type: cosine_precision@5 value: 0.19090000000000004 name: Cosine Precision@5 - type: cosine_precision@10 value: 0.09637666666666667 name: Cosine Precision@10 - type: cosine_recall@1 value: 0.8926 name: Cosine Recall@1 - type: cosine_recall@3 value: 0.9453666666666667 name: Cosine Recall@3 - type: cosine_recall@5 value: 0.9545 name: Cosine Recall@5 - type: cosine_recall@10 value: 0.9637666666666667 name: Cosine Recall@10 - type: cosine_ndcg@10 value: 0.9313201256618757 name: Cosine Ndcg@10 - type: cosine_mrr@10 value: 0.9206047883597835 name: Cosine Mrr@10 - type: cosine_map@100 value: 0.9212040995599341 name: Cosine Map@100 license: mit datasets: - code-search-net/code_search_net language: - en base_model: - answerdotai/ModernBERT-base ---

# SentenceTransformer

This is an [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) model trained on the [code_search_net](https://huggingface.co/datasets/code-search-net/code_search_net) dataset using [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with in-batch negatives.

The model can be used for code retrieval and reranking.

## Performance on code retrieval benchmarks

**RTEB**

On 14.10.2025 the model ranked **6th** on the RTEB leaderboard among models with fewer than 500M parameters.
Performance per task:

| Model | AppsRetrieval | Code1Retrieval (Private) | DS1000Retrieval | FreshStackRetrieval | HumanEvalRetrieval | JapaneseCode1Retrieval (Private) | MBPPRetrieval | WikiSQLRetrieval |
|-------|---------------|----------------|-----------------|---------------------|--------------------|------------------------|---------------|------------------|
| english_code_retriever | 8.04 | 75.36 | 32.42 | 18.30 | 71.82 | 46.59 | 72.06 | 87.92 |

**COIR**:

| Model | AppsRetrieval | COIRCodeSearchNetRetrieval | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCCRetrieval | CodeTransOceanContest | CodeTransOceanDL | CosQA | StackOverflowQA | SyntheticText2SQL |
|-------|---------------|----------------------------|----------------|----------------|--------------------------|------------------------|------------------|-------|------------------|-------------------|
| english_code_retriever | 8.04 | 74.23 | 44.01 | 57.79 | 42.71 | 60.68 | 35.16 | 25.56 | 56.53 | 42.79 |

More information can be found on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768
- **Similarity Function:** Cosine Similarity
- **Pooling:** Mean pooling

## Usage

The model is easy to use with Sentence Transformers. Note that it was trained with the prefix `search_query` for queries and `search_document` for code documents, so applying the same prefixes at inference time improves retrieval quality.

```python
import torch
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("fyaronskiy/english_code_retriever").to(device)

queries = [
    "Write a Python function that calculates the factorial of a number recursively.",
    "How to check if a given string reads the same backward and forward?",
    "Combine two sorted lists into a single sorted list."
]

corpus = [
    # Relevant for Q1
    """def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n-1)""",

    # Hard negative for Q1 (similar structure but computes sum)
    """def sum_recursive(n):
    if n == 0:
        return 0
    return n + sum_recursive(n-1)""",

    # Relevant for Q2
    """def is_palindrome(s: str) -> bool:
    s = s.lower().replace(" ", "")
    return s == s[::-1]""",

    # Hard negative for Q2 (string reverse but not palindrome check)
    """def reverse_string(s: str) -> str:
    return s[::-1]""",

    # Relevant for Q3
    """def merge_sorted_lists(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result""",

    # Hard negative for Q3 (similar iteration but sums two lists elementwise)
    """def add_lists(a, b):
    return [x + y for x, y in zip(a, b)]"""
]

# Documents take the 'search_document' prompt, queries take 'search_query'
doc_embeddings = model.encode(corpus, prompt_name='search_document', convert_to_tensor=True, device=device)
query_embeddings = model.encode(queries, prompt_name='search_query', convert_to_tensor=True, device=device)

# Compute cosine similarity and retrieve top-1
for i, query in enumerate(queries):
    scores = util.cos_sim(query_embeddings[i], doc_embeddings)[0]
    best_idx = torch.argmax(scores).item()
    print(f"\n Query {i+1}: {query}")
    print(f"Top-1 match (score={scores[best_idx]:.4f}):\n{corpus[best_idx]}")

'''
 Query 1: Write a Python function that calculates the factorial of a number recursively.
Top-1 match (score=0.5983):
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n-1)

 Query 2: How to check if a given string reads the same backward and forward?
Top-1 match (score=0.4925):
def is_palindrome(s: str) -> bool:
    s = s.lower().replace(" ", "")
    return s == s[::-1]

 Query 3: Combine two sorted lists into a single sorted list.
Top-1 match (score=0.6524):
def merge_sorted_lists(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result
'''
```

Using the model with Transformers:

```python
import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "fyaronskiy/english_code_retriever"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
model.eval()

queries = [
    "function of addition of two numbers",
    "finding the maximum element in an array",
    "sorting a list in ascending order"
]

corpus = [
    "def add(a, b): return a + b",
    "def find_max(arr): return max(arr)",
    "def sort_list(lst): return sorted(lst)"
]

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]  # (batch_size, seq_len, hidden_dim)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * input_mask_expanded).sum(1) / input_mask_expanded.sum(1).clamp(min=1e-9)

def encode_texts(texts):
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt",
        max_length=8192
    ).to(device)

    with torch.no_grad():
        model_output = model(**encoded)
    return mean_pooling(model_output, encoded["attention_mask"])

# Prefixes are added manually here, since plain Transformers has no prompt_name argument
doc_embeddings = encode_texts(["search_document: " + document for document in corpus])
query_embeddings = encode_texts(["search_query: " + query for query in queries])

# Normalize embeddings for cosine similarity
doc_embeddings = torch.nn.functional.normalize(doc_embeddings, p=2, dim=1)
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)

# Compute cosine similarity and retrieve top-1
for i, query in enumerate(queries):
    scores = torch.matmul(query_embeddings[i], doc_embeddings.T)
    best_idx = torch.argmax(scores).item()
    print(f"\n Query {i+1}: {query}")
    print(f"Top-1 match (score={scores[best_idx]:.4f}):\n{corpus[best_idx]}")

'''
 Query 1: function of addition of two numbers
Top-1 match (score=0.6047):
def add(a, b): return a + b

 Query 2: finding the maximum element in an array
Top-1 match (score=0.7772):
def find_max(arr): return max(arr)

 Query 3: sorting a list in ascending order
Top-1 match (score=0.7389):
def sort_list(lst): return sorted(lst)
'''
```

## Evaluation

### Metrics

#### Information Retrieval

* Dataset: validation split, `codesearchnet_val`
* Size: 30,000 evaluation samples
* Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.8926     |
| cosine_accuracy@3   | 0.9454     |
| cosine_accuracy@5   | 0.9545     |
| cosine_accuracy@10  | 0.9638     |
| cosine_precision@1  | 0.8926     |
| cosine_precision@3  | 0.3151     |
| cosine_precision@5  | 0.1909     |
| cosine_precision@10 | 0.0964     |
| cosine_recall@1     | 0.8926     |
| cosine_recall@3     | 0.9454     |
| cosine_recall@5     | 0.9545     |
| cosine_recall@10    | 0.9638     |
| **cosine_ndcg@10**  | **0.9313** |
| cosine_mrr@10       | 0.9206     |
| cosine_map@100      | 0.9212     |

## Training Details

### Training Dataset

#### code_search_net

* Dataset: train split of code_search_net
* Size: 1,880,853 training samples
* Queries are function docstrings in English; the relevant document for each query is the code of that function
* Negatives were sampled from the batch
* Distribution of programming languages:

  ![image](https://cdn-uploads.huggingface.co/production/uploads/620118ee1283421e22448fc2/IefFD32ihUSTGZjbb78px.png)

### Training Hyperparameters
#### Non-Default Hyperparameters

- `batch_size`: 64
- `learning_rate`: 2e-05
- `num_epochs`: 2
- `warmup_ratio`: 0.1

### Framework Versions
- Python: 3.10.11
- Sentence Transformers: 5.1.0
- Transformers: 4.52.3
- PyTorch: 2.6.0+cu124
- Accelerate: 1.10.0
- Datasets: 3.6.0
- Tokenizers: 0.21.4
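
The in-batch negatives objective used for training can be illustrated with a small, dependency-free sketch. It assumes cosine similarity and a scale factor of 20, which are the Sentence Transformers defaults for MultipleNegativesRankingLoss; this card does not state the values actually used. The function names and toy vectors below are hypothetical and only meant to show the idea: for query `i`, document `i` is the positive and every other document in the batch acts as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i.

    scale=20.0 is an assumption (the library default), not a value
    confirmed by the model card.
    """
    losses = []
    for i, query in enumerate(query_embs):
        # Scaled similarity of this query to every document in the batch
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        # Numerically stable log-sum-exp over the batch
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        # Negative log-probability assigned to the matching document
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-d "embeddings": aligned[i] is close to queries[i]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
# Same documents, but paired with the wrong queries
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = in_batch_negatives_loss(queries, aligned)
loss_shuffled = in_batch_negatives_loss(queries, shuffled)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

The loss is small when each query is most similar to its own document and large when the pairing is wrong. With this objective, a batch of 64 pairs gives every query 63 negatives without any extra mining step.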