| | --- |
| | tags: |
| | - sentence-transformers |
| | - code |
| | - code-retrieval |
| | - retrieval-augmented-generation |
| | - rag |
| | - python |
| | - java |
| | - go |
| | - php |
| | - sentence-similarity |
| | - feature-extraction |
| | - generated_from_trainer |
| | - loss:MultipleNegativesRankingLoss |
| | widget: |
| | - source_sentence: |- |
| | search_query: Finds the top long, short, and absolute positions. |
| | |
| | Parameters |
| | ---------- |
| | positions : pd.DataFrame |
| | The positions that the strategy takes over time. |
| | top : int, optional |
| | How many of each to find (default 10). |
| |
|
| | Returns |
| | ------- |
| | df_top_long : pd.DataFrame |
| | Top long positions. |
| | df_top_short : pd.DataFrame |
| | Top short positions. |
| | df_top_abs : pd.DataFrame |
| | Top absolute positions. |
| | sentences: |
| | - >- |
| | search_document: def symmetric_ema(xolds, yolds, low=None, high=None, n=512, |
| | decay_steps=1., low_counts_threshold=1e-8): |
| | ''' |
| | perform symmetric EMA (exponential moving average) |
| | smoothing and resampling to an even grid with n points. |
| | Does not do extrapolation, so we assume |
| | xolds[0] <= low && high <= xolds[-1] |
| | |
| | Arguments: |
| |
|
| | xolds: array or list - x values of data. Needs to be sorted in ascending order |
| | yolds: array of list - y values of data. Has to have the same length as xolds |
| |
|
| | low: float - min value of the new x grid. By default equals to xolds[0] |
| | high: float - max value of the new x grid. By default equals to xolds[-1] |
| |
|
| | n: int - number of points in new x grid |
| |
|
| | decay_steps: float - EMA decay factor, expressed in new x grid steps. |
| |
|
| | low_counts_threshold: float or int |
| | - y values with counts less than this value will be set to NaN |
| |
|
| | Returns: |
| | tuple sum_ys, count_ys where |
| | xs - array with new x grid |
| | ys - array of EMA of y at each point of the new x grid |
| | count_ys - array of EMA of y counts at each point of the new x grid |
| |
|
| | ''' |
| | xs, ys1, count_ys1 = one_sided_ema(xolds, yolds, low, high, n, decay_steps, low_counts_threshold=0) |
| | _, ys2, count_ys2 = one_sided_ema(-xolds[::-1], yolds[::-1], -high, -low, n, decay_steps, low_counts_threshold=0) |
| | ys2 = ys2[::-1] |
| | count_ys2 = count_ys2[::-1] |
| | count_ys = count_ys1 + count_ys2 |
| | ys = (ys1 * count_ys1 + ys2 * count_ys2) / count_ys |
| | ys[count_ys < low_counts_threshold] = np.nan |
| | return xs, ys, count_ys |
| | - |- |
| | search_document: def project(self, from_shape, to_shape): |
| | """ |
| | Project the polygon onto an image with different shape. |
| | |
| | The relative coordinates of all points remain the same. |
| | E.g. a point at (x=20, y=20) on an image (width=100, height=200) will be |
| | projected on a new image (width=200, height=100) to (x=40, y=10). |
| | |
| | This is intended for cases where the original image is resized. |
| | It cannot be used for more complex changes (e.g. padding, cropping). |
| | |
| | Parameters |
| | ---------- |
| | from_shape : tuple of int |
| | Shape of the original image. (Before resize.) |
| | |
| | to_shape : tuple of int |
| | Shape of the new image. (After resize.) |
| | |
| | Returns |
| | ------- |
| | imgaug.Polygon |
| | Polygon object with new coordinates. |
| | |
| | """ |
| | if from_shape[0:2] == to_shape[0:2]: |
| | return self.copy() |
| | ls_proj = self.to_line_string(closed=False).project( |
| | from_shape, to_shape) |
| | return self.copy(exterior=ls_proj.coords) |
| | - |- |
| | search_document: def get_top_long_short_abs(positions, top=10): |
| | """ |
| | Finds the top long, short, and absolute positions. |
| | |
| | Parameters |
| | ---------- |
| | positions : pd.DataFrame |
| | The positions that the strategy takes over time. |
| | top : int, optional |
| | How many of each to find (default 10). |
| | |
| | Returns |
| | ------- |
| | df_top_long : pd.DataFrame |
| | Top long positions. |
| | df_top_short : pd.DataFrame |
| | Top short positions. |
| | df_top_abs : pd.DataFrame |
| | Top absolute positions. |
| | """ |
| | |
| | positions = positions.drop('cash', axis='columns') |
| | df_max = positions.max() |
| | df_min = positions.min() |
| | df_abs_max = positions.abs().max() |
| | df_top_long = df_max[df_max > 0].nlargest(top) |
| | df_top_short = df_min[df_min < 0].nsmallest(top) |
| | df_top_abs = df_abs_max.nlargest(top) |
| | return df_top_long, df_top_short, df_top_abs |
| | - source_sentence: |- |
| | search_query: Draw text on an image. |
| | |
| | This uses by default DejaVuSans as its font, which is included in this library. |
| |
|
| | dtype support:: |
| |
|
| | * ``uint8``: yes; fully tested |
| | * ``uint16``: no |
| | * ``uint32``: no |
| | * ``uint64``: no |
| | * ``int8``: no |
| | * ``int16``: no |
| | * ``int32``: no |
| | * ``int64``: no |
| | * ``float16``: no |
| | * ``float32``: yes; not tested |
| | * ``float64``: no |
| | * ``float128``: no |
| | * ``bool``: no |
| |
|
| | TODO check if other dtypes could be enabled |
| |
|
| | Parameters |
| | ---------- |
| | img : (H,W,3) ndarray |
| | The image array to draw text on. |
| | Expected to be of dtype uint8 or float32 (value range 0.0 to 255.0). |
| |
|
| | y : int |
| | x-coordinate of the top left corner of the text. |
| |
|
| | x : int |
| | y- coordinate of the top left corner of the text. |
| |
|
| | text : str |
| | The text to draw. |
| |
|
| | color : iterable of int, optional |
| | Color of the text to draw. For RGB-images this is expected to be an RGB color. |
| |
|
| | size : int, optional |
| | Font size of the text to draw. |
| |
|
| | Returns |
| | ------- |
| | img_np : (H,W,3) ndarray |
| | Input image with text drawn on it. |
| | sentences: |
| | - >- |
| | search_document: def cross_entropy_seq_with_mask(logits, target_seqs, |
| | input_mask, return_details=False, name=None): |
| | """Returns the expression of cross-entropy of two sequences, implement |
| | softmax internally. Normally be used for Dynamic RNN with Synced sequence input and output. |
| | |
| | Parameters |
| | ----------- |
| | logits : Tensor |
| | 2D tensor with shape of [batch_size * ?, n_classes], `?` means dynamic IDs for each example. |
| | - Can be get from `DynamicRNNLayer` by setting ``return_seq_2d`` to `True`. |
| | target_seqs : Tensor |
| | int of tensor, like word ID. [batch_size, ?], `?` means dynamic IDs for each example. |
| | input_mask : Tensor |
| | The mask to compute loss, it has the same size with `target_seqs`, normally 0 or 1. |
| | return_details : boolean |
| | Whether to return detailed losses. |
| | - If False (default), only returns the loss. |
| | - If True, returns the loss, losses, weights and targets (see source code). |
| |
|
| | Examples |
| | -------- |
| | >>> batch_size = 64 |
| | >>> vocab_size = 10000 |
| | >>> embedding_size = 256 |
| | >>> input_seqs = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="input") |
| | >>> target_seqs = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="target") |
| | >>> input_mask = tf.placeholder(dtype=tf.int64, shape=[batch_size, None], name="mask") |
| | >>> net = tl.layers.EmbeddingInputlayer( |
| | ... inputs = input_seqs, |
| | ... vocabulary_size = vocab_size, |
| | ... embedding_size = embedding_size, |
| | ... name = 'seq_embedding') |
| | >>> net = tl.layers.DynamicRNNLayer(net, |
| | ... cell_fn = tf.contrib.rnn.BasicLSTMCell, |
| | ... n_hidden = embedding_size, |
| | ... dropout = (0.7 if is_train else None), |
| | ... sequence_length = tl.layers.retrieve_seq_length_op2(input_seqs), |
| | ... return_seq_2d = True, |
| | ... name = 'dynamicrnn') |
| | >>> print(net.outputs) |
| | (?, 256) |
| | >>> net = tl.layers.DenseLayer(net, n_units=vocab_size, name="output") |
| | >>> print(net.outputs) |
| | (?, 10000) |
| | >>> loss = tl.cost.cross_entropy_seq_with_mask(net.outputs, target_seqs, input_mask) |
| |
|
| | """ |
| | targets = tf.reshape(target_seqs, [-1]) # to one vector |
| | weights = tf.to_float(tf.reshape(input_mask, [-1])) # to one vector like targets |
| | losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=targets, name=name) * weights |
| | # losses = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=targets, name=name)) # for TF1.0 and others |
| | |
| | loss = tf.divide( |
| | tf.reduce_sum(losses), # loss from mask. reduce_sum before element-wise mul with mask !! |
| | tf.reduce_sum(weights), |
| | name="seq_loss_with_mask" |
| | ) |
| |
|
| | if return_details: |
| | return loss, losses, weights, targets |
| | else: |
| | return loss |
| | - |- |
| | search_document: def pickle_load(path, compression=False): |
| | """Unpickle a possible compressed pickle. |
| | |
| | Parameters |
| | ---------- |
| | path: str |
| | path to the output file |
| | compression: bool |
| | if true assumes that pickle was compressed when created and attempts decompression. |
| |
|
| | Returns |
| | ------- |
| | obj: object |
| | the unpickled object |
| | """ |
| | |
| | if compression: |
| | with zipfile.ZipFile(path, "r", compression=zipfile.ZIP_DEFLATED) as myzip: |
| | with myzip.open("data") as f: |
| | return pickle.load(f) |
| | else: |
| | with open(path, "rb") as f: |
| | return pickle.load(f) |
| | - |- |
| | search_document: def draw_text(img, y, x, text, color=(0, 255, 0), size=25): |
| | """ |
| | Draw text on an image. |
| | |
| | This uses by default DejaVuSans as its font, which is included in this library. |
| |
|
| | dtype support:: |
| |
|
| | * ``uint8``: yes; fully tested |
| | * ``uint16``: no |
| | * ``uint32``: no |
| | * ``uint64``: no |
| | * ``int8``: no |
| | * ``int16``: no |
| | * ``int32``: no |
| | * ``int64``: no |
| | * ``float16``: no |
| | * ``float32``: yes; not tested |
| | * ``float64``: no |
| | * ``float128``: no |
| | * ``bool``: no |
| |
|
| | TODO check if other dtypes could be enabled |
| |
|
| | Parameters |
| | ---------- |
| | img : (H,W,3) ndarray |
| | The image array to draw text on. |
| | Expected to be of dtype uint8 or float32 (value range 0.0 to 255.0). |
| |
|
| | y : int |
| | x-coordinate of the top left corner of the text. |
| |
|
| | x : int |
| | y- coordinate of the top left corner of the text. |
| |
|
| | text : str |
| | The text to draw. |
| |
|
| | color : iterable of int, optional |
| | Color of the text to draw. For RGB-images this is expected to be an RGB color. |
| |
|
| | size : int, optional |
| | Font size of the text to draw. |
| |
|
| | Returns |
| | ------- |
| | img_np : (H,W,3) ndarray |
| | Input image with text drawn on it. |
| |
|
| | """ |
| | do_assert(img.dtype in [np.uint8, np.float32]) |
| | |
| | input_dtype = img.dtype |
| | if img.dtype == np.float32: |
| | img = img.astype(np.uint8) |
| | |
| | img = PIL_Image.fromarray(img) |
| | font = PIL_ImageFont.truetype(DEFAULT_FONT_FP, size) |
| | context = PIL_ImageDraw.Draw(img) |
| | context.text((x, y), text, fill=tuple(color), font=font) |
| | img_np = np.asarray(img) |
| | |
| | # PIL/asarray returns read only array |
| | if not img_np.flags["WRITEABLE"]: |
| | try: |
| | |
| | img_np.setflags(write=True) |
| | except ValueError as ex: |
| | if "cannot set WRITEABLE flag to True of this array" in str(ex): |
| | img_np = np.copy(img_np) |
| |
|
| | if img_np.dtype != input_dtype: |
| | img_np = img_np.astype(input_dtype) |
| |
|
| | return img_np |
| | - source_sentence: >- |
| | search_query: Choice and return an an action by given the action probability |
| | distribution. |
| | |
| | Parameters |
| | ------------ |
| | probs : list of float. |
| | The probability distribution of all actions. |
| | action_list : None or a list of int or others |
| | A list of action in integer, string or others. If None, returns an integer range between 0 and len(probs)-1. |
| |
|
| | Returns |
| | -------- |
| | float int or str |
| | The chosen action. |
| |
|
| | Examples |
| | ---------- |
| | >>> for _ in range(5): |
| | >>> a = choice_action_by_probs([0.2, 0.4, 0.4]) |
| | >>> print(a) |
| | 0 |
| | 1 |
| | 1 |
| | 2 |
| | 1 |
| | >>> for _ in range(3): |
| | >>> a = choice_action_by_probs([0.5, 0.5], ['a', 'b']) |
| | >>> print(a) |
| | a |
| | b |
| | b |
| | sentences: |
| | - >- |
| | search_document: def from_keypoint_image(image, if_not_found_coords={"x": |
| | -1, "y": -1}, threshold=1, nb_channels=None): # pylint: |
| | disable=locally-disabled, dangerous-default-value, line-too-long |
| | """ |
| | Converts an image generated by ``to_keypoint_image()`` back to a KeypointsOnImage object. |
| | |
| | Parameters |
| | ---------- |
| | image : (H,W,N) ndarray |
| | The keypoints image. N is the number of keypoints. |
| |
|
| | if_not_found_coords : tuple or list or dict or None, optional |
| | Coordinates to use for keypoints that cannot be found in `image`. |
| | If this is a list/tuple, it must have two integer values. |
| | If it is a dictionary, it must have the keys ``x`` and ``y`` with |
| | each containing one integer value. |
| | If this is None, then the keypoint will not be added to the final |
| | KeypointsOnImage object. |
| |
|
| | threshold : int, optional |
| | The search for keypoints works by searching for the argmax in |
| | each channel. This parameters contains the minimum value that |
| | the max must have in order to be viewed as a keypoint. |
| |
|
| | nb_channels : None or int, optional |
| | Number of channels of the image on which the keypoints are placed. |
| | Some keypoint augmenters require that information. |
| | If set to None, the keypoint's shape will be set |
| | to ``(height, width)``, otherwise ``(height, width, nb_channels)``. |
| |
|
| | Returns |
| | ------- |
| | out : KeypointsOnImage |
| | The extracted keypoints. |
| |
|
| | """ |
| | ia.do_assert(len(image.shape) == 3) |
| | height, width, nb_keypoints = image.shape |
| | |
| | drop_if_not_found = False |
| | if if_not_found_coords is None: |
| | drop_if_not_found = True |
| | if_not_found_x = -1 |
| | if_not_found_y = -1 |
| | elif isinstance(if_not_found_coords, (tuple, list)): |
| | ia.do_assert(len(if_not_found_coords) == 2) |
| | if_not_found_x = if_not_found_coords[0] |
| | if_not_found_y = if_not_found_coords[1] |
| | elif isinstance(if_not_found_coords, dict): |
| | if_not_found_x = if_not_found_coords["x"] |
| | if_not_found_y = if_not_found_coords["y"] |
| | else: |
| | raise Exception("Expected if_not_found_coords to be None or tuple or list or dict, got %s." % ( |
| | type(if_not_found_coords),)) |
| |
|
| | keypoints = [] |
| | for i in sm.xrange(nb_keypoints): |
| | maxidx_flat = np.argmax(image[..., i]) |
| | maxidx_ndim = np.unravel_index(maxidx_flat, (height, width)) |
| | found = (image[maxidx_ndim[0], maxidx_ndim[1], i] >= threshold) |
| | if found: |
| | keypoints.append(Keypoint(x=maxidx_ndim[1], y=maxidx_ndim[0])) |
| | else: |
| | if drop_if_not_found: |
| | pass |
| | else: |
| | keypoints.append(Keypoint(x=if_not_found_x, y=if_not_found_y)) |
| |
|
| | out_shape = (height, width) |
| | if nb_channels is not None: |
| | out_shape += (nb_channels,) |
| | return KeypointsOnImage(keypoints, shape=out_shape) |
| | - >- |
| | search_document: def choice_action_by_probs(probs=(0.5, 0.5), |
| | action_list=None): |
| | """Choice and return an an action by given the action probability distribution. |
| | |
| | Parameters |
| | ------------ |
| | probs : list of float. |
| | The probability distribution of all actions. |
| | action_list : None or a list of int or others |
| | A list of action in integer, string or others. If None, returns an integer range between 0 and len(probs)-1. |
| |
|
| | Returns |
| | -------- |
| | float int or str |
| | The chosen action. |
| |
|
| | Examples |
| | ---------- |
| | >>> for _ in range(5): |
| | >>> a = choice_action_by_probs([0.2, 0.4, 0.4]) |
| | >>> print(a) |
| | 0 |
| | 1 |
| | 1 |
| | 2 |
| | 1 |
| | >>> for _ in range(3): |
| | >>> a = choice_action_by_probs([0.5, 0.5], ['a', 'b']) |
| | >>> print(a) |
| | a |
| | b |
| | b |
| |
|
| | """ |
| | if action_list is None: |
| | n_action = len(probs) |
| | action_list = np.arange(n_action) |
| | else: |
| | if len(action_list) != len(probs): |
| | raise Exception("number of actions should equal to number of probabilities.") |
| | return np.random.choice(action_list, p=probs) |
| | - |- |
| | search_document: def __validateExperimentControl(self, control): |
| | """ Validates control dictionary for the experiment context""" |
| | # Validate task list |
| | taskList = control.get('tasks', None) |
| | if taskList is not None: |
| | taskLabelsList = [] |
| | |
| | for task in taskList: |
| | validateOpfJsonValue(task, "opfTaskSchema.json") |
| | validateOpfJsonValue(task['taskControl'], "opfTaskControlSchema.json") |
| |
|
| | taskLabel = task['taskLabel'] |
| |
|
| | assert isinstance(taskLabel, types.StringTypes), \ |
| | "taskLabel type: %r" % type(taskLabel) |
| | assert len(taskLabel) > 0, "empty string taskLabel not is allowed" |
| |
|
| | taskLabelsList.append(taskLabel.lower()) |
| |
|
| | taskLabelDuplicates = filter(lambda x: taskLabelsList.count(x) > 1, |
| | taskLabelsList) |
| | assert len(taskLabelDuplicates) == 0, \ |
| | "Duplcate task labels are not allowed: %s" % taskLabelDuplicates |
| |
|
| | return |
| | - source_sentence: |- |
| | search_query: Augment endlessly images in the source queue. |
| | |
| | This is a worker function for that endlessly queries the source queue (input batches), |
| | augments batches in it and sends the result to the output queue. |
| | sentences: |
| | - >- |
| | search_document: def _augment_images_worker(self, augseq, queue_source, |
| | queue_result, seedval): |
| | """ |
| | Augment endlessly images in the source queue. |
| | |
| | This is a worker function for that endlessly queries the source queue (input batches), |
| | augments batches in it and sends the result to the output queue. |
| |
|
| | """ |
| | np.random.seed(seedval) |
| | random.seed(seedval) |
| | augseq.reseed(seedval) |
| | ia.seed(seedval) |
| | |
| | loader_finished = False |
| | |
| | while not loader_finished: |
| | # wait for a new batch in the source queue and load it |
| | try: |
| | batch_str = queue_source.get(timeout=0.1) |
| | batch = pickle.loads(batch_str) |
| | if batch is None: |
| | loader_finished = True |
| | # put it back in so that other workers know that the loading queue is finished |
| | queue_source.put(pickle.dumps(None, protocol=-1)) |
| | else: |
| | batch_aug = augseq.augment_batch(batch) |
| | |
| | # send augmented batch to output queue |
| | batch_str = pickle.dumps(batch_aug, protocol=-1) |
| | queue_result.put(batch_str) |
| | except QueueEmpty: |
| | time.sleep(0.01) |
| | |
| | queue_result.put(pickle.dumps(None, protocol=-1)) |
| | time.sleep(0.01) |
| | - |- |
| | search_document: def show_perf_attrib_stats(returns, |
| | positions, |
| | factor_returns, |
| | factor_loadings, |
| | transactions=None, |
| | pos_in_dollars=True): |
| | """ |
| | Calls `perf_attrib` using inputs, and displays outputs using |
| | `utils.print_table`. |
| | """ |
| | risk_exposures, perf_attrib_data = perf_attrib( |
| | returns, |
| | positions, |
| | factor_returns, |
| | factor_loadings, |
| | transactions, |
| | pos_in_dollars=pos_in_dollars, |
| | ) |
| | |
| | perf_attrib_stats, risk_exposure_stats =\ |
| | create_perf_attrib_stats(perf_attrib_data, risk_exposures) |
| | |
| | percentage_formatter = '{:.2%}'.format |
| | float_formatter = '{:.2f}'.format |
| | |
| | summary_stats = perf_attrib_stats.loc[['Annualized Specific Return', |
| | 'Annualized Common Return', |
| | 'Annualized Total Return', |
| | 'Specific Sharpe Ratio']] |
| | |
| | # Format return rows in summary stats table as percentages. |
| | for col_name in ( |
| | 'Annualized Specific Return', |
| | 'Annualized Common Return', |
| | 'Annualized Total Return', |
| | ): |
| | summary_stats[col_name] = percentage_formatter(summary_stats[col_name]) |
| | |
| | # Display sharpe to two decimal places. |
| | summary_stats['Specific Sharpe Ratio'] = float_formatter( |
| | summary_stats['Specific Sharpe Ratio'] |
| | ) |
| | |
| | print_table(summary_stats, name='Summary Statistics') |
| | |
| | print_table( |
| | risk_exposure_stats, |
| | name='Exposures Summary', |
| | # In exposures table, format exposure column to 2 decimal places, and |
| | # return columns as percentages. |
| | formatters={ |
| | 'Average Risk Factor Exposure': float_formatter, |
| | 'Annualized Return': percentage_formatter, |
| | 'Cumulative Return': percentage_formatter, |
| | }, |
| | ) |
| | - >- |
| | search_document: def binary_cross_entropy(output, target, epsilon=1e-8, |
| | name='bce_loss'): |
| | """Binary cross entropy operation. |
| |
|
| | Parameters |
| | ---------- |
| | output : Tensor |
| | Tensor with type of `float32` or `float64`. |
| | target : Tensor |
| | The target distribution, format the same with `output`. |
| | epsilon : float |
| | A small value to avoid output to be zero. |
| | name : str |
| | An optional name to attach to this function. |
| |
|
| | References |
| | ----------- |
| | - `ericjang-DRAW <https://github.com/ericjang/draw/blob/master/draw.py#L73>`__ |
| |
|
| | """ |
| | # with ops.op_scope([output, target], name, "bce_loss") as name: |
| | |
| | |
| |
|
| | |
| | return tf.reduce_mean( |
| | tf.reduce_sum(-(target * tf.log(output + epsilon) + (1. - target) * tf.log(1. - output + epsilon)), axis=1), |
| | name=name |
| | ) |
| | - source_sentence: 'search_query: episode_batch: array(batch_size x (T or T+1) x dim_key)' |
| | sentences: |
| | - |- |
| | search_document: def get_txn_vol(transactions): |
| | """ |
| | Extract daily transaction data from set of transaction objects. |
| | |
| | Parameters |
| | ---------- |
| | transactions : pd.DataFrame |
| | Time series containing one row per symbol (and potentially |
| | duplicate datetime indices) and columns for amount and |
| | price. |
| |
|
| | Returns |
| | ------- |
| | pd.DataFrame |
| | Daily transaction volume and number of shares. |
| | - See full explanation in tears.create_full_tear_sheet. |
| | """ |
| | |
| | txn_norm = transactions.copy() |
| | txn_norm.index = txn_norm.index.normalize() |
| | amounts = txn_norm.amount.abs() |
| | prices = txn_norm.price |
| | values = amounts * prices |
| | daily_amounts = amounts.groupby(amounts.index).sum() |
| | daily_values = values.groupby(values.index).sum() |
| | daily_amounts.name = "txn_shares" |
| | daily_values.name = "txn_volume" |
| | return pd.concat([daily_values, daily_amounts], axis=1) |
| | - |- |
| | search_document: def deepcopy(self, exterior=None, label=None): |
| | """ |
| | Create a deep copy of the Polygon object. |
| | |
| | Parameters |
| | ---------- |
| | exterior : list of Keypoint or list of tuple or (N,2) ndarray, optional |
| | List of points defining the polygon. See `imgaug.Polygon.__init__` for details. |
| |
|
| | label : None or str |
| | If not None, then the label of the copied object will be set to this value. |
| |
|
| | Returns |
| | ------- |
| | imgaug.Polygon |
| | Deep copy. |
| |
|
| | """ |
| | return Polygon( |
| | exterior=np.copy(self.exterior) if exterior is None else exterior, |
| | label=self.label if label is None else label |
| | ) |
| | - |- |
| | search_document: def store_episode(self, episode_batch): |
| | """episode_batch: array(batch_size x (T or T+1) x dim_key) |
| | """ |
| | batch_sizes = [len(episode_batch[key]) for key in episode_batch.keys()] |
| | assert np.all(np.array(batch_sizes) == batch_sizes[0]) |
| | batch_size = batch_sizes[0] |
| | |
| | with self.lock: |
| | idxs = self._get_storage_idx(batch_size) |
| | |
| | # load inputs into buffers |
| | for key in self.buffers.keys(): |
| | self.buffers[key][idxs] = episode_batch[key] |
| | |
| | self.n_transitions_stored += batch_size * self.T |
| | pipeline_tag: sentence-similarity |
| | library_name: sentence-transformers |
| | metrics: |
| | - cosine_accuracy@1 |
| | - cosine_accuracy@3 |
| | - cosine_accuracy@5 |
| | - cosine_accuracy@10 |
| | - cosine_precision@1 |
| | - cosine_precision@3 |
| | - cosine_precision@5 |
| | - cosine_precision@10 |
| | - cosine_recall@1 |
| | - cosine_recall@3 |
| | - cosine_recall@5 |
| | - cosine_recall@10 |
| | - cosine_ndcg@10 |
| | - cosine_mrr@10 |
| | - cosine_map@100 |
| | model-index: |
| | - name: SentenceTransformer |
| | results: |
| | - task: |
| | type: information-retrieval |
| | name: Information Retrieval |
| | dataset: |
| | name: codesearchnet val |
| | type: codesearchnet_val |
| | metrics: |
| | - type: cosine_accuracy@1 |
| | value: 0.8926 |
| | name: Cosine Accuracy@1 |
| | - type: cosine_accuracy@3 |
| | value: 0.9453666666666667 |
| | name: Cosine Accuracy@3 |
| | - type: cosine_accuracy@5 |
| | value: 0.9545 |
| | name: Cosine Accuracy@5 |
| | - type: cosine_accuracy@10 |
| | value: 0.9637666666666667 |
| | name: Cosine Accuracy@10 |
| | - type: cosine_precision@1 |
| | value: 0.8926 |
| | name: Cosine Precision@1 |
| | - type: cosine_precision@3 |
| | value: 0.31512222222222214 |
| | name: Cosine Precision@3 |
| | - type: cosine_precision@5 |
| | value: 0.19090000000000004 |
| | name: Cosine Precision@5 |
| | - type: cosine_precision@10 |
| | value: 0.09637666666666667 |
| | name: Cosine Precision@10 |
| | - type: cosine_recall@1 |
| | value: 0.8926 |
| | name: Cosine Recall@1 |
| | - type: cosine_recall@3 |
| | value: 0.9453666666666667 |
| | name: Cosine Recall@3 |
| | - type: cosine_recall@5 |
| | value: 0.9545 |
| | name: Cosine Recall@5 |
| | - type: cosine_recall@10 |
| | value: 0.9637666666666667 |
| | name: Cosine Recall@10 |
| | - type: cosine_ndcg@10 |
| | value: 0.9313201256618757 |
| | name: Cosine Ndcg@10 |
| | - type: cosine_mrr@10 |
| | value: 0.9206047883597835 |
| | name: Cosine Mrr@10 |
| | - type: cosine_map@100 |
| | value: 0.9212040995599341 |
| | name: Cosine Map@100 |
| | license: mit |
| | datasets: |
| | - code-search-net/code_search_net |
| | language: |
| | - en |
| | base_model: |
| | - answerdotai/ModernBERT-base |
| | --- |
| | |
| | # SentenceTransformer |
| |
|
This is an [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) model trained on the [code_search_net](https://huggingface.co/datasets/code-search-net/code_search_net) dataset with
[<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) using in-batch negatives. The model can be used for code retrieval and reranking.
| |
|
## Performance on code retrieval benchmarks
| |
|
| | **RTEB** |
| |
|
As of 14.10.2025, the model ranks **6th** on the RTEB leaderboard among models with <500M parameters:
<details>
<summary>Click to expand</summary>
| | <figure> |
| | <img src="Rteb_top.jpg"> |
| | </figure> |
| | </details> |
| | |
Performance per task:

| Model | AppsRetrieval | Code1Retrieval (Private) | DS1000Retrieval | FreshStackRetrieval | HumanEvalRetrieval | JapaneseCode1Retrieval (Private) | MBPPRetrieval | WikiSQLRetrieval |
|-------|---------------|--------------------------|-----------------|---------------------|--------------------|----------------------------------|---------------|------------------|
| english_code_retriever | 8.04 | 75.36 | 32.42 | 18.30 | 71.82 | 46.59 | 72.06 | 87.92 |
| |
|
**COIR**:

| Model | AppsRetrieval | COIRCodeSearchNetRetrieval | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCCRetrieval | CodeTransOceanContest | CodeTransOceanDL | CosQA | StackOverflowQA | SyntheticText2SQL |
|-------|---------------|----------------------------|----------------|----------------|--------------------------|-----------------------|------------------|-------|-----------------|-------------------|
| english_code_retriever | 8.04 | 74.23 | 44.01 | 57.79 | 42.71 | 60.68 | 35.16 | 25.56 | 56.53 | 42.79 |
| |
|
More information can be found on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
| |
|
| |
|
| | ## Model Details |
| |
|
| | ### Model Description |
- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Pooling:** Mean pooling
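
A quick check of the properties above (a minimal sketch; the example query is illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("fyaronskiy/english_code_retriever")
print(model.max_seq_length)  # 8192
embedding = model.encode("search_query: read a CSV file into a DataFrame")
print(embedding.shape)       # (768,)
```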
| |
|
| | ## Usage |
| |
|
The model is easy to use with Sentence Transformers.
| |
|
Note that the model was trained with the prefixes 'search_query: ' for queries and 'search_document: ' for code documents,
so encoding with these prefixes improves retrieval quality.
| |
|
| | ```python |
| | import torch |
| | from sentence_transformers import SentenceTransformer, util |
| | |
| | device = "cuda" if torch.cuda.is_available() else "cpu" |
| | model = SentenceTransformer("fyaronskiy/english_code_retriever").to(device) |
| | |
| | queries = [ |
| | "Write a Python function that calculates the factorial of a number recursively.", |
| | "How to check if a given string reads the same backward and forward?", |
| | "Combine two sorted lists into a single sorted list." |
| | ] |
| | |
| | corpus = [ |
| | # Relevant for Q1 |
| | """def factorial(n): |
| | if n == 0: |
| | return 1 |
| | return n * factorial(n-1)""", |
| | |
| | # Hard negative for Q1 (similar structure but computes sum) |
| | """def sum_recursive(n): |
| | if n == 0: |
| | return 0 |
| | return n + sum_recursive(n-1)""", |
| | |
| | # Relevant for Q2 |
| | """def is_palindrome(s: str) -> bool: |
| | s = s.lower().replace(" ", "") |
| | return s == s[::-1]""", |
| | |
| | # Hard negative for Q2 (string reverse but not palindrome check) |
| | """def reverse_string(s: str) -> str: |
| | return s[::-1]""", |
| | |
| | # Relevant for Q3 |
| | """def merge_sorted_lists(a, b): |
| | result = [] |
| | i = j = 0 |
| | while i < len(a) and j < len(b): |
| | if a[i] < b[j]: |
| | result.append(a[i]) |
| | i += 1 |
| | else: |
| | result.append(b[j]) |
| | j += 1 |
| | result.extend(a[i:]) |
| | result.extend(b[j:]) |
| | return result""", |
| | |
| | # Hard negative for Q3 (similar iteration but sums two lists elementwise) |
| | """def add_lists(a, b): |
| | return [x + y for x, y in zip(a, b)]""" |
| | ] |
| | |
| | |
doc_embeddings = model.encode(corpus, prompt_name='search_document', convert_to_tensor=True, device=device)
query_embeddings = model.encode(queries, prompt_name='search_query', convert_to_tensor=True, device=device)
| | |
| | # Compute cosine similarity and retrieve top-1 |
| | for i, query in enumerate(queries): |
| | scores = util.cos_sim(query_embeddings[i], doc_embeddings)[0] |
| | best_idx = torch.argmax(scores).item() |
| | print(f"\n Query {i+1}: {query}") |
| | print(f"Top-1 match (score={scores[best_idx]:.4f}):\n{corpus[best_idx]}") |
| | |
| | ''' Query 1: Write a Python function that calculates the factorial of a number recursively. |
| | Top-1 match (score=0.5983): |
| | def factorial(n): |
| | if n == 0: |
| | return 1 |
| | return n * factorial(n-1) |
| | |
| | Query 2: How to check if a given string reads the same backward and forward? |
| | Top-1 match (score=0.4925): |
| | def is_palindrome(s: str) -> bool: |
| | s = s.lower().replace(" ", "") |
| | return s == s[::-1] |
| | |
| | Query 3: Combine two sorted lists into a single sorted list. |
| | Top-1 match (score=0.6524): |
| | def merge_sorted_lists(a, b): |
| | result = [] |
| | i = j = 0 |
| | while i < len(a) and j < len(b): |
| | if a[i] < b[j]: |
| | result.append(a[i]) |
| | i += 1 |
| | else: |
| | result.append(b[j]) |
| | j += 1 |
| | result.extend(a[i:]) |
| | result.extend(b[j:]) |
| | return result |
| | ''' |
| | ``` |
| |
|
The model can also be used directly with Transformers:
| | ```python |
| | import torch |
| | from transformers import AutoTokenizer, AutoModel |
| | |
| | device = "cuda" if torch.cuda.is_available() else "cpu" |
| | |
| | model_name = "fyaronskiy/english_code_retriever" |
| | tokenizer = AutoTokenizer.from_pretrained(model_name) |
| | model = AutoModel.from_pretrained(model_name).to(device) |
| | model.eval() |
| | |
| | |
| | |
| | queries = [ |
| | "function of addition of two numbers", |
| | "finding the maximum element in an array", |
| | "sorting a list in ascending order" |
| | ] |
| | |
| | corpus = [ |
| | "def add(a, b): return a + b", |
| | "def find_max(arr): return max(arr)", |
| | "def sort_list(lst): return sorted(lst)" |
| | ] |
| | |
| | def mean_pooling(model_output, attention_mask): |
| | token_embeddings = model_output[0] # (batch_size, seq_len, hidden_dim) |
| | input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() |
| | return (token_embeddings * input_mask_expanded).sum(1) / input_mask_expanded.sum(1).clamp(min=1e-9) |
| | |
| | def encode_texts(texts): |
| | encoded = tokenizer( |
| | texts, |
| | padding=True, |
| | truncation=True, |
| | return_tensors="pt", |
| | max_length=8192 |
| | ).to(device) |
| | with torch.no_grad(): |
| | model_output = model(**encoded) |
| | return mean_pooling(model_output, encoded["attention_mask"]) |
| | |
| | doc_embeddings = encode_texts(["search_document: " + document for document in corpus]) |
| | query_embeddings = encode_texts(["search_query: " + query for query in queries]) |
| | |
| | # Normalize embeddings for cosine similarity |
| | doc_embeddings = torch.nn.functional.normalize(doc_embeddings, p=2, dim=1) |
| | query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1) |
| | |
| | # Compute cosine similarity and retrieve top-1 |
| | for i, query in enumerate(queries): |
| | scores = torch.matmul(query_embeddings[i], doc_embeddings.T) |
| | best_idx = torch.argmax(scores).item() |
| | print(f"\n Query {i+1}: {query}") |
| | print(f"Top-1 match (score={scores[best_idx]:.4f}):\n{corpus[best_idx]}") |
| | |
| | ''' Query 1: function of addition of two numbers |
| | Top-1 match (score=0.6047): |
| | def add(a, b): return a + b |
| | |
| | Query 2: finding the maximum element in an array |
| | Top-1 match (score=0.7772): |
| | def find_max(arr): return max(arr) |
| | |
| | Query 3: sorting a list in ascending order |
| | Top-1 match (score=0.7389): |
| | def sort_list(lst): return sorted(lst) |
| | ''' |
| | ``` |
| |
|
| | <!-- |
| | ### Downstream Usage (Sentence Transformers) |
| |
|
| | You can finetune this model on your own dataset. |
| |
|
| | <details><summary>Click to expand</summary> |
| |
|
| | </details> |
| | --> |
| |
|
| | <!-- |
| | ### Out-of-Scope Use |
| |
|
| | *List how the model may foreseeably be misused and address what users ought not to do with the model.* |
| | --> |
| |
|
| | ## Evaluation |
| |
|
| | ### Metrics |
| |
|
| | #### Information Retrieval |
| |
|
* Dataset: validation split of code_search_net (evaluator name: `codesearchnet_val`)
| | * Size: 30,000 evaluation samples |
| |
|
| | * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) |
| |
|
| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.8926     |
| cosine_accuracy@3   | 0.9454     |
| cosine_accuracy@5   | 0.9545     |
| cosine_accuracy@10  | 0.9638     |
| cosine_precision@1  | 0.8926     |
| cosine_precision@3  | 0.3151     |
| cosine_precision@5  | 0.1909     |
| cosine_precision@10 | 0.0964     |
| cosine_recall@1     | 0.8926     |
| cosine_recall@3     | 0.9454     |
| cosine_recall@5     | 0.9545     |
| cosine_recall@10    | 0.9638     |
| **cosine_ndcg@10**  | **0.9313** |
| cosine_mrr@10       | 0.9206     |
| cosine_map@100      | 0.9212     |
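
A minimal sketch of how a comparable evaluation can be run with `InformationRetrievalEvaluator` (the toy queries, corpus, and ids below are illustrative, not the exact validation setup behind the numbers above):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("fyaronskiy/english_code_retriever")

# In practice, queries/corpus are built from the CodeSearchNet validation split:
# one docstring query per example, with its function code as the relevant document.
queries = {"q1": "search_query: Finds the top long, short, and absolute positions."}
corpus = {
    "d1": "search_document: def get_top_long_short_abs(positions, top=10): ...",
    "d2": "search_document: def get_txn_vol(transactions): ...",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="codesearchnet_val",
)
print(evaluator(model))  # metrics dict: accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100
```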
| | |
| | <!-- |
| | ## Bias, Risks and Limitations |
| | |
| | *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.* |
| | --> |
| | |
| | <!-- |
| | ### Recommendations |
| | |
| | *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.* |
| | --> |
| | |
| | ## Training Details |
| | |
| | ### Training Dataset |
| | |
| | #### code_search_net |
| | |
* Dataset: train split of code_search_net
* Size: 1,880,853 training samples
* Queries are function docstrings in English; the relevant document is the corresponding function's code
* Negatives were sampled in-batch
* Distribution of programming languages:
| | |
| |  |
| | |
| | |
| | ### Training Hyperparameters |
| | #### Non-Default Hyperparameters |
| | |
| | - `batch_size`: 64 |
| | - `learning_rate`: 2e-05 |
| | - `num_epochs`: 2 |
| | - `warmup_ratio`: 0.1 |
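
For context, a minimal training sketch that matches these hyperparameters (the dataset config, column names, and the mapping of `batch_size` to `per_device_train_batch_size` are assumptions, not the original training script):

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("answerdotai/ModernBERT-base")

# Build (anchor, positive) pairs: docstring query -> function code, using the training prefixes.
raw = load_dataset("code-search-net/code_search_net", "all", split="train")
pairs = raw.map(
    lambda ex: {
        "anchor": "search_query: " + ex["func_documentation_string"],
        "positive": "search_document: " + ex["func_code_string"],
    },
    remove_columns=raw.column_names,
)

loss = MultipleNegativesRankingLoss(model)  # in-batch negatives
args = SentenceTransformerTrainingArguments(
    output_dir="english_code_retriever",
    num_train_epochs=2,
    per_device_train_batch_size=64,
    learning_rate=2e-5,
    warmup_ratio=0.1,
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=pairs, loss=loss)
trainer.train()
```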
| | |
| | |
| | |
| | ### Framework Versions |
| | - Python: 3.10.11 |
| | - Sentence Transformers: 5.1.0 |
| | - Transformers: 4.52.3 |
| | - PyTorch: 2.6.0+cu124 |
| | - Accelerate: 1.10.0 |
| | - Datasets: 3.6.0 |
| | - Tokenizers: 0.21.4 |
| | |
| | |
| | <!-- |
| | ## Glossary |
| | |
| | *Clearly define terms in order to be accessible across audiences.* |
| | --> |
| | |
| | <!-- |
| | ## Model Card Authors |
| | |
| | *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.* |
| | --> |
| | |
| | <!-- |
| | ## Model Card Contact |
| | |
| | *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.* |
| | --> |