SentenceTransformer
This is a sentence-transformers model trained on the cornstack_python, cornstack_python_pairs, codesearchnet, codesearchnet_pairs, and solyanka_qa datasets. It maps sentences and paragraphs to a 768-dimensional dense vector space.
The model can be used for text-to-code and code-to-text retrieval tasks where the text is in Russian or English and the code is in Python, Java, JavaScript, Go, PHP, or Ruby. Queries and documents can also be a mix of natural-language text and code. Performance on code-to-code tasks was not measured.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: RuModernBERT-base
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Datasets:
- cornstack_python
- cornstack_python_pairs
- codesearchnet
- codesearchnet_pairs
- solyanka_qa
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
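The Pooling module above uses mean pooling: token embeddings are averaged, counting only positions where the attention mask is 1. A minimal pure-Python sketch of that step on toy vectors (illustrative only, not the library's implementation):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping masked (padding) positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:  # only real tokens contribute to the sentence embedding
            for i, v in enumerate(vec):
                summed[i] += v
            n += 1
    return [s / n for s in summed]

# Three token vectors; the last one is padding (mask = 0)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
print(mean_pool(tokens, [1, 1, 0]))  # → [2.0, 3.0]
```

The padding vector is ignored entirely, which is why the averaged output reflects only the real tokens.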
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
import torch
from sentence_transformers import SentenceTransformer, util
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("fyaronskiy/code_retriever_ru_en").to(device)
queries_ru = [
    "Напиши функцию на Python, которая рекурсивно вычисляет факториал числа.",
    "Как проверить, является ли строка палиндромом?",
    "Объедини два отсортированных списка в один отсортированный список."
]

corpus_ru = [
    # Relevant to Q1
    """def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)""",
    # Hard negative for Q1
    """def sum_recursive(n):
    if n == 0:
        return 0
    return n + sum_recursive(n - 1)""",
    # Relevant to Q2
    """def is_palindrome(s: str) -> bool:
    s = s.lower().replace(" ", "")
    return s == s[::-1]""",
    # Hard negative for Q2
    """def reverse_string(s: str) -> str:
    return s[::-1]""",
    # Relevant to Q3
    """def merge_sorted_lists(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result""",
    # Hard negative for Q3
    """def add_lists(a, b):
    return [x + y for x, y in zip(a, b)]"""
]

doc_embeddings = model.encode(corpus_ru, convert_to_tensor=True, device=device)
query_embeddings = model.encode(queries_ru, convert_to_tensor=True, device=device)

# Run the search for each query
for i, query in enumerate(queries_ru):
    scores = util.cos_sim(query_embeddings[i], doc_embeddings)[0]
    best_idx = torch.argmax(scores).item()
    print(f"\nQuery {i+1}: {query}")
    print("Scores for all documents in the corpus:", scores)
    print(f"Best matching document (score={scores[best_idx]:.4f}):\n{corpus_ru[best_idx]}")
The model was trained with Matryoshka loss at dimensions 768, 512, 256, 128, and 64, so you can truncate embeddings to reduce the memory footprint of your vector database and to speed up inference.
To do this, initialize the model as follows:
matryoshka_dim = 128
model = SentenceTransformer("fyaronskiy/code_retriever_ru_en", truncate_dim=matryoshka_dim).to(device)
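Conceptually, a Matryoshka-truncated embedding is just the first truncate_dim components of the full vector, re-normalized to unit length before computing cosine similarity. A minimal pure-Python sketch of that idea (illustrative only, not the library's implementation):

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    v = vec[:dim]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

emb = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
small = truncate_and_normalize(emb, 4)
print(len(small))                           # → 4
print(round(sum(x * x for x in small), 6))  # → 1.0
```

Because the Matryoshka objective optimizes every prefix of the embedding, the truncated vectors remain usable for retrieval at a fraction of the storage cost.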
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.8684 |
| cosine_accuracy@3 | 0.9439 |
| cosine_accuracy@5 | 0.9566 |
| cosine_accuracy@10 | 0.9668 |
| cosine_precision@1 | 0.8684 |
| cosine_precision@3 | 0.3146 |
| cosine_precision@5 | 0.1913 |
| cosine_precision@10 | 0.0967 |
| cosine_recall@1 | 0.8684 |
| cosine_recall@3 | 0.9439 |
| cosine_recall@5 | 0.9566 |
| cosine_recall@10 | 0.9668 |
| cosine_ndcg@10 | 0.9224 |
| cosine_mrr@10 | 0.9076 |
| cosine_map@100 | 0.9083 |
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.8742 |
| cosine_accuracy@3 | 0.9425 |
| cosine_accuracy@5 | 0.9549 |
| cosine_accuracy@10 | 0.9644 |
| cosine_precision@1 | 0.8742 |
| cosine_precision@3 | 0.3142 |
| cosine_precision@5 | 0.191 |
| cosine_precision@10 | 0.0964 |
| cosine_recall@1 | 0.8742 |
| cosine_recall@3 | 0.9425 |
| cosine_recall@5 | 0.9549 |
| cosine_recall@10 | 0.9644 |
| cosine_ndcg@10 | 0.9234 |
| cosine_mrr@10 | 0.9098 |
| cosine_map@100 | 0.9105 |
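For reference, accuracy@k in the tables above is the fraction of queries whose relevant document appears in the top k results, and MRR@10 averages the reciprocal rank of the first relevant hit within the top 10. A minimal pure-Python sketch of these two metrics on toy data (illustrative only, not the InformationRetrievalEvaluator implementation):

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for ranking, rel in zip(ranked_ids, relevant_ids)
        if any(doc in rel for doc in ranking[:k])
    )
    return hits / len(ranked_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranking, rel in zip(ranked_ids, relevant_ids):
        for pos, doc in enumerate(ranking[:k], start=1):
            if doc in rel:
                total += 1.0 / pos
                break
    return total / len(ranked_ids)

# Two toy queries: the first has its relevant doc at rank 1, the second at rank 3
ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d1"}, {"d6"}]
print(accuracy_at_k(ranked, relevant, 1))  # → 0.5
print(mrr_at_k(ranked, relevant, 10))      # → 0.6666666666666666
```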
Training Details
Training Datasets
cornstack_python
- Dataset: cornstack_python
- Size: 2,869,969 training samples
- Columns:
ru_query, document, and negative_0 through negative_15
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| ru_query | string | 7 tokens | 27.46 tokens | 162 tokens |
| document | string | 6 tokens | 304.38 tokens | 5574 tokens |
| negative_0 | string | 6 tokens | 237.08 tokens | 3627 tokens |
| negative_1 | string | 6 tokens | 229.94 tokens | 6691 tokens |
| negative_2 | string | 6 tokens | 230.06 tokens | 6229 tokens |
| negative_3 | string | 7 tokens | 230.7 tokens | 4876 tokens |
| negative_4 | string | 8 tokens | 220.57 tokens | 4876 tokens |
| negative_5 | string | 7 tokens | 236.08 tokens | 5880 tokens |
| negative_6 | string | 6 tokens | 247.91 tokens | 6621 tokens |
| negative_7 | string | 6 tokens | 207.62 tokens | 3350 tokens |
| negative_8 | string | 6 tokens | 222.54 tokens | 6863 tokens |
| negative_9 | string | 6 tokens | 221.53 tokens | 4976 tokens |
| negative_10 | string | 7 tokens | 216.06 tokens | 4876 tokens |
| negative_11 | string | 7 tokens | 197.03 tokens | 4763 tokens |
| negative_12 | string | 6 tokens | 200.83 tokens | 8192 tokens |
| negative_13 | string | 6 tokens | 204.94 tokens | 3210 tokens |
| negative_14 | string | 6 tokens | 188.51 tokens | 2754 tokens |
| negative_15 | string | 6 tokens | 188.27 tokens | 4876 tokens |
- Samples:
- ru_query: установите значение business_id сообщения данных в конкретное значение
  document:
      def step_impl_the_ru_is_set_to(context, business_id):
          context.bdd_helper.message_data["business_id"] = business_id
  negative_0:
      def business_id(self, business_id):
          self._business_id = business_id
  negative_1:
      def business_phone(self, business_phone):
          self._business_phone = business_phone
  negative_2 … negative_15: further distractor setters and handlers (business_phone_number, bus_ob_id, _set_id, business_email, mailing_id, message_id, business_model, business_account, update_business, set_company_id_value, id, set_bribe, business_owner)
- ru_query: Установить состояние правил sid
  document:
      def set_state_sid_request(ruleset_name, sid):
          message = json.loads(request.stream.read().decode('utf-8'))
          message['sid'] = sid
          result = host.patch_state(ruleset_name, message)
          return jsonify(result)
  negative_0 … negative_15: distractor state/sid setters (sid, set_state, setstate, set_rule, set_ident, state_id, set_domain_sid, set_srid, and similar)
- ru_query: Отправить события sid в ruleset
  document:
      def post_sid_events(ruleset_name, sid):
          message = json.loads(request.stream.read().decode('utf-8'))
          message['sid'] = sid
          result = host.post(ruleset_name, message)
          return jsonify(result)
  negative_0 … negative_15: distractor POST/event handlers (post_events, set_state_sid_request, post, post_event, store_event, setFilterOnRule, and similar)
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
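This dataset is trained with MatryoshkaLoss wrapping CachedMultipleNegativesRankingLoss, which scores a query against its positive document and the in-batch/hard negatives and applies a cross-entropy over the scaled similarities. A simplified pure-Python sketch of that objective for a single query (the scale value mirrors a common default but is illustrative here, not the cached implementation used in training):

```python
import math

def multiple_negatives_ranking_loss(sim_pos, sim_negs, scale=20.0):
    """-log softmax of the positive similarity against all candidates."""
    logits = [scale * sim_pos] + [scale * s for s in sim_negs]
    m = max(logits)  # stabilize the log-sum-exp
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(scale * sim_pos - log_sum)

# The loss shrinks as the positive outscores the hard negatives
easy = multiple_negatives_ranking_loss(0.9, [0.1, 0.2])
hard = multiple_negatives_ranking_loss(0.4, [0.5, 0.6])
print(easy < hard)  # → True
```

MatryoshkaLoss then applies this same objective to each embedding prefix (768, 512, 256, 128, 64) with equal weights, which is what makes truncated embeddings retrieval-ready.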
cornstack_python_pairs
- Dataset: cornstack_python_pairs
- Size: 1,434,984 training samples
- Columns:
en_query, ru_query, and label
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| en_query | string | 7 tokens | 26.96 tokens | 150 tokens |
| ru_query | string | 7 tokens | 27.46 tokens | 162 tokens |
| label | float | 1.0 | 1.0 | 1.0 |
- Samples:
| en_query | ru_query | label |
|---|---|---|
| set the message data business_id to a specific value | установите значение business_id сообщения данных в конкретное значение | 1.0 |
| Set ruleset state sid | Установить состояние правил sid | 1.0 |
| Post sid events to the ruleset | Отправить события sid в ruleset | 1.0 |
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CoSENTLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
codesearchnet
- Dataset: codesearchnet at 3f90200
- Size: 1,880,853 training samples
- Columns:
ru_func_documentation_string and func_code_string
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| ru_func_documentation_string | string | 5 tokens | 95.0 tokens | 619 tokens |
| func_code_string | string | 62 tokens | 522.56 tokens | 8192 tokens |
- Samples:
ru_func_documentation_string: Мультипроцессинг-целевой объект для устройства очереди zmq
func_code_string:
def zmq_device(self):
'''
Multiprocessing target for the zmq queue device
'''
self.__setup_signals()
salt.utils.process.appendproctitle('MWorkerQueue')
self.context = zmq.Context(self.opts['worker_threads'])
# Prepare the zeromq sockets
self.uri = 'tcp://{interface}:{ret_port}'.format(**self.opts)
self.clients = self.context.socket(zmq.ROUTER)
if self.opts['ipv6'] is True and hasattr(zmq, 'IPV4ONLY'):
# IPv6 sockets work for both IPv6 and IPv4 addresses
self.clients.setsockopt(zmq.IPV4ONLY, 0)
self.clients.setsockopt(zmq.BACKLOG, self.opts.get('zmq_backlog', 1000))
self._start_zmq_monitor()
self.workers = self.context.socket(zmq.DEALER)
if self.opts.get('ipc_mode', '') == 'tcp':
self.w_uri = 'tcp://127.0.0.1:{0}'.format(
self.opts.get('tcp_master_workers', 4515)
)
else:
self.w_uri = 'ipc:...
ru_func_documentation_string: Чисто завершите работу сокета роутера
func_code_string:
def close(self):
'''
Cleanly shutdown the router socket
'''
if self._closing:
return
log.info('MWorkerQueue under PID %s is closing', os.getpid())
self._closing = True
# pylint: disable=E0203
if getattr(self, '_monitor', None) is not None:
self._monitor.stop()
self._monitor = None
if getattr(self, '_w_monitor', None) is not None:
self._w_monitor.stop()
self._w_monitor = None
if hasattr(self, 'clients') and self.clients.closed is False:
self.clients.close()
if hasattr(self, 'workers') and self.workers.closed is False:
self.workers.close()
if hasattr(self, 'stream'):
self.stream.close()
if hasattr(self, '_socket') and self._socket.closed is False:
self._socket.close()
if hasattr(self, 'context') and self.context.closed is False:
self.context.term()
ru_func_documentation_string: До форка нам нужно создать устройство zmq роутера
:param func process_manager: Экземпляр класса salt.utils.process.ProcessManager
func_code_string:
def pre_fork(self, process_manager):
'''
Pre-fork we need to create the zmq router device
:param func process_manager: An instance of salt.utils.process.ProcessManager
'''
salt.transport.mixins.auth.AESReqServerMixin.pre_fork(self, process_manager)
process_manager.add_process(self.zmq_device)
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
codesearchnet_pairs
- Dataset: codesearchnet_pairs at 3f90200
- Size: 940,426 training samples
- Columns:
en_func_documentation_string, ru_func_documentation_string, and label
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| en_func_documentation_string | string | 5 tokens | 102.69 tokens | 1485 tokens |
| ru_func_documentation_string | string | 5 tokens | 95.0 tokens | 619 tokens |
| label | float | 1.0 | 1.0 | 1.0 |
- Samples:
| en_func_documentation_string | ru_func_documentation_string | label |
|---|---|---|
| Multiprocessing target for the zmq queue device | Мультипроцессинг-целевой объект для устройства очереди zmq | 1.0 |
| Cleanly shutdown the router socket | Чисто завершите работу сокета роутера | 1.0 |
| Pre-fork we need to create the zmq router device :param func process_manager: An instance of salt.utils.process.ProcessManager | До форка нам нужно создать устройство zmq роутера :param func process_manager: Экземпляр класса salt.utils.process.ProcessManager | 1.0 |
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CoSENTLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
solyanka_qa
- Dataset: solyanka_qa at deeac62
- Size: 85,523 training samples
- Columns:
anchor and positive
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| anchor | string | 19 tokens | 202.49 tokens | 518 tokens |
| positive | string | 16 tokens | 196.36 tokens | 524 tokens |
- Samples:
anchor positive Как происходит взаимодействие нескольких языков программирования? Понятно, что большинство (если не все) крупные энтерпрайз сервисы, приложения и тд. (не только веб) написаны с использованием не одного языка программирования, а нескольких. И эти составные части, написанные на разных языках, как-то взаимодействуют между собой (фронт, бизнес-логика, еще что-то).
Опыта разработки подобных систем у меня нет, поэтому не совсем могу представить, как это происходит. Подозреваю, что взаимодействие идет через независимые от языков средства. Например, нечто написанное на одном языке, шлет через TCP-IP пакет, который ловится и обрабатывается чем-то написанным на другом языке. Либо через HTTP запросы. Либо через запись/чтение из БД. Либо через файловый обмен, XML например.
Хотелось бы, чтобы знающие люди привели пару примеров, как это обычно происходит. Не просто в двух словах, мол "фронт на яваскрипте, бэк на яве", а с техническими нюансами. Заранее спасибо.
positive: Несколько языков могут сосуществовать как в рамках одного процесса, так и в рамках нескольких.
Проще всего сосуществовать в рамках нескольких процессов: если процессы обмениваются данными, то совершенно всё равно (ну, в известных рамках), на каком языке эти данные были созданы, и какой язык их читает. Например, вы можете генерировать данные в виде HTML сервером на ASP.NET, а читать браузером, написанным на C++. (Да, пара из сервера и клиента — тоже взаимодействие языков.)
Теперь, если мы хотим взаимодействие в рамках одного процесса, нам нужно уметь вызывать друг друга. Для этого нужен общий стандарт вызова. Часто таким общим стандартом являются бинарные соглашения C (extern "C", экспорт из DLL в Windows).
Ещё пример общего стандарта — COM: COM-объекты можно писать на многих языках, так что если в языке есть часть, реализующая стандарт COM, он может вполне пользоваться им.
Отдельная возможность, популярная сейчас — языки, компилирующиеся в общий промежуточный код. Например, Java и Sc...
anchor: Слэши и ковычки после использования stringify Есть подобный скрипт:
[code]
var output = {
lol: [
{name: "hahaha"}
]
};
console.log(output);
output = JSON.stringify(output);
console.log(output);
[/code]
в итоге получаем
почему он вставил слэши и кавычки там, где не надо?
positive: Может сразу сделать валидный JSON
[code]
var output = {
lol: {name: "hahaha"}
};
console.log(output);
output = JSON.stringify(output);
console.log(output);
[/code]
Правда я незнаю что за переменная name
anchor: Оптимизация поиска числа в списке Есть функция. Она принимает число от 1 до 9 (мы ищем, есть ли оно в списке), и список, в котором мы его ищем)
[code]
def is_number_already_in(number, line):
equality = False
for i in line:
if i == number:
equality = True
if equality:
return True
else:
return False
[/code]
Как можно этот код оптимизировать и как называется способ (тема) оптимизации, чтобы я мог загуглить
Только не через лямбду, пожалуйста)
positive:
>
[code]
> if equality:
> return True
> else:
> return False
>
[/code]
[code]
return equality
[/code]
>
[code]
> equality = False
> for i in line:
> if i == number:
> equality = True
>
[/code]
[code]
equality = any(i == number for i in line)
[/code]
Всё целиком:
[code]
def is_number_already_in(number, line):
return any(i == number for i in line)
[/code]
Хотя на самом деле вроде бы можно гораздо проще
[code]
def is_number_already_in(number, line):
return number in line
[/code]
PS: Не проверял, но в любом случае идея должна быть понятна.
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
Evaluation Datasets
codesearchnet
- Dataset: codesearchnet at 3f90200
- Size: 30,000 evaluation samples
- Columns:
ru_func_documentation_string and func_code_string
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| ru_func_documentation_string | string | 6 tokens | 194.76 tokens | 1278 tokens |
| func_code_string | string | 58 tokens | 580.66 tokens | 8192 tokens |
- Samples:
ru_func_documentation_string:
Обучить модель deepq.
Параметры
-------
env: gym.Env
среда для обучения
network: строка или функция
нейронная сеть, используемая в качестве аппроксиматора функции Q. Если строка, она должна быть одной из имен зарегистрированных моделей в baselines.common.models
(mlp, cnn, conv_only). Если функция, она должна принимать тензор наблюдения и возвращать тензор скрытой переменной, которая
будет отображена в головы функции Q (см. build_q_func в baselines.deepq.models для деталей по этому поводу)
seed: int или None
seed генератора случайных чисел. Запуски с одинаковым seed "должны" давать одинаковые результаты. Если None, используется отсутствие семени.
lr: float
скорость обучения для оптимизатора Adam
total_timesteps: int
количество шагов среды для оптимизации
buffer_size: int
размер буфера воспроизведения
exploration_fraction: float
доля всего периода обучения, в течение которого прои...
func_code_string:
def learn(env,
network,
seed=None,
lr=5e-4,
total_timesteps=100000,
buffer_size=50000,
exploration_fraction=0.1,
exploration_final_eps=0.02,
train_freq=1,
batch_size=32,
print_freq=100,
checkpoint_freq=10000,
checkpoint_path=None,
learning_starts=1000,
gamma=1.0,
target_network_update_freq=500,
prioritized_replay=False,
prioritized_replay_alpha=0.6,
prioritized_replay_beta0=0.4,
prioritized_replay_beta_iters=None,
prioritized_replay_eps=1e-6,
param_noise=False,
callback=None,
load_path=None,
**network_kwargs
):
"""Train a deepq model.
Parameters
-------
env: gym.Env
environment to train on
network: string or a function
neural network to use as a q function approximator. If string, has to be one of the ...
ru_func_documentation_string: Сохранить модель в pickle, расположенный по пути path
func_code_string:
def save_act(self, path=None):
"""Save model to a pickle located atpath"""
if path is None:
path = os.path.join(logger.get_dir(), "model.pkl")
with tempfile.TemporaryDirectory() as td:
save_variables(os.path.join(td, "model"))
arc_name = os.path.join(td, "packed.zip")
with zipfile.ZipFile(arc_name, 'w') as zipf:
for root, dirs, files in os.walk(td):
for fname in files:
file_path = os.path.join(root, fname)
if file_path != arc_name:
zipf.write(file_path, os.path.relpath(file_path, td))
with open(arc_name, "rb") as f:
model_data = f.read()
with open(path, "wb") as f:
cloudpickle.dump((model_data, self._act_params), f)
ru_func_documentation_string: CNN из статьи Nature.
func_code_string:
def nature_cnn(unscaled_images, **conv_kwargs):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
**conv_kwargs))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
codesearchnet_en
- Dataset: codesearchnet_en at 3f90200
- Size: 30,000 evaluation samples
- Columns:
en_func_documentation_string and func_code_string
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| en_func_documentation_string | string | 6 tokens | 200.33 tokens | 2498 tokens |
| func_code_string | string | 58 tokens | 580.66 tokens | 8192 tokens |
- Samples:
en_func_documentation_string:
Train a deepq model.
Parameters
-------
env: gym.Env
environment to train on
network: string or a function
neural network to use as a q function approximator. If string, has to be one of the names of registered models in baselines.common.models
(mlp, cnn, conv_only). If a function, should take an observation tensor and return a latent variable tensor, which
will be mapped to the Q function heads (see build_q_func in baselines.deepq.models for details on that)
seed: int or None
prng seed. The runs with the same seed "should" give the same results. If None, no seeding is used.
lr: float
learning rate for adam optimizer
total_timesteps: int
number of env steps to optimizer for
buffer_size: int
size of the replay buffer
exploration_fraction: float
fraction of entire training period over which the exploration rate is annealed
exploration_final_eps: float
final value of ra...
func_code_string:
def learn(env,
network,
seed=None,
lr=5e-4,
total_timesteps=100000,
buffer_size=50000,
exploration_fraction=0.1,
exploration_final_eps=0.02,
train_freq=1,
batch_size=32,
print_freq=100,
checkpoint_freq=10000,
checkpoint_path=None,
learning_starts=1000,
gamma=1.0,
target_network_update_freq=500,
prioritized_replay=False,
prioritized_replay_alpha=0.6,
prioritized_replay_beta0=0.4,
prioritized_replay_beta_iters=None,
prioritized_replay_eps=1e-6,
param_noise=False,
callback=None,
load_path=None,
**network_kwargs
):
"""Train a deepq model.
Parameters
-------
env: gym.Env
environment to train on
network: string or a function
neural network to use as a q function approximator. If string, has to be one of the ...
en_func_documentation_string: Save model to a pickle located at path
func_code_string:
def save_act(self, path=None):
"""Save model to a pickle located atpath"""
if path is None:
path = os.path.join(logger.get_dir(), "model.pkl")
with tempfile.TemporaryDirectory() as td:
save_variables(os.path.join(td, "model"))
arc_name = os.path.join(td, "packed.zip")
with zipfile.ZipFile(arc_name, 'w') as zipf:
for root, dirs, files in os.walk(td):
for fname in files:
file_path = os.path.join(root, fname)
if file_path != arc_name:
zipf.write(file_path, os.path.relpath(file_path, td))
with open(arc_name, "rb") as f:
model_data = f.read()
with open(path, "wb") as f:
cloudpickle.dump((model_data, self._act_params), f)
en_func_documentation_string: CNN from Nature paper.
func_code_string:
def nature_cnn(unscaled_images, **conv_kwargs):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
**conv_kwargs))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
codesearchnet_pairs
- Dataset: codesearchnet_pairs at 3f90200
- Size: 30,000 evaluation samples
- Columns:
en_func_documentation_string, ru_func_documentation_string, and label
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| en_func_documentation_string | string | 6 tokens | 200.33 tokens | 2498 tokens |
| ru_func_documentation_string | string | 6 tokens | 194.76 tokens | 1278 tokens |
| label | float | 1.0 | 1.0 | 1.0 |
- Samples:
en_func_documentation_string:
Train a deepq model.
Parameters
-------
env: gym.Env
environment to train on
network: string or a function
neural network to use as a q function approximator. If string, has to be one of the names of registered models in baselines.common.models
(mlp, cnn, conv_only). If a function, should take an observation tensor and return a latent variable tensor, which
will be mapped to the Q function heads (see build_q_func in baselines.deepq.models for details on that)
seed: int or None
prng seed. The runs with the same seed "should" give the same results. If None, no seeding is used.
lr: float
learning rate for adam optimizer
total_timesteps: int
number of env steps to optimizer for
buffer_size: int
size of the replay buffer
exploration_fraction: float
fraction of entire training period over which the exploration rate is annealed
exploration_final_eps: float
final value of ra...Обучить модель deepq.
Параметры
-------
env: gym.Env
среда для обучения
network: строка или функция
нейронная сеть, используемая в качестве аппроксиматора функции Q. Если строка, она должна быть одной из имен зарегистрированных моделей в baselines.common.models
(mlp, cnn, conv_only). Если функция, она должна принимать тензор наблюдения и возвращать тензор скрытой переменной, которая
будет отображена в головы функции Q (см. build_q_func в baselines.deepq.models для деталей по этому поводу)
seed: int или None
seed генератора случайных чисел. Запуски с одинаковым seed "должны" давать одинаковые результаты. Если None, используется отсутствие семени.
lr: float
скорость обучения для оптимизатора Adam
total_timesteps: int
количество шагов среды для оптимизации
buffer_size: int
размер буфера воспроизведения
exploration_fraction: float
доля всего периода обучения, в течение которого прои...1.0Save model to a pickle located atpathСохранить модель в pickle, расположенный по путиpath1.0CNN from Nature paper.CNN из статьи Nature.1.0 - Loss:
MatryoshkaLosswith these parameters:{ "loss": "CoSENTLoss", "matryoshka_dims": [ 768, 512, 256, 128, 64 ], "matryoshka_weights": [ 1, 1, 1, 1, 1 ], "n_dims_per_step": -1 }
solyanka_qa
- Dataset: solyanka_qa at deeac62
- Size: 5,000 evaluation samples
- Columns: anchor and positive
- Approximate statistics based on the first 1000 samples:

| | anchor | positive |
|:--|:--|:--|
| type | string | string |
| details | min: 17 tokens<br>mean: 200.35 tokens<br>max: 533 tokens | min: 19 tokens<br>mean: 202.53 tokens<br>max: 525 tokens |

- Samples:
anchor / positive

Sample 1:

anchor:

    Atom IDE произвольное изменение строк Пользуюсь Atom IDE, установлены плагины для GIT'а, использую тему Material theme (может быть кому то это что то даст), в общем проблема такая, что в php файлах при сохранении файла, даже если я изменил всего один символ, он добавляет изменения очень странные, берет 2-3 строки (хз как выбирает) и удаляет их, а потом вставляет их же, без каких-либо изменений. При этом GIT фиксирует это изменение...
    Вот скрин в blob формате: "blob:https://web.telegram.org/04094604-204d-47b0-a083-f8cd090bdfa0"

positive:

    Проблема заключалась в том, что все IDE используют свой символ перехода на следующую строку; если в команде разработчики используют разные IDE, у которых разный перенос строки, то при сохранении файла чужие переносы строк будут заменяться на свои :)

Sample 2:

anchor:

    print() с частью текста и форматированием как переменная Python3 Есть повторяющаяся функция print('\n' + f'{" ЗАПУСКАЕМ ТЕСТ ":=^120}' + '\n')
    на выходе получаем что-то типа
    ================ ЗАПУСКАЕМ ТЕСТ ================
    или с другим текстом
    ================= КОНЕЦ ТЕСТА ==================
    Текст внутри может меняться, форматирование - нет.
    Как обернуть print('\n' + f'{"":=^120}' + '\n') в переменную, с возможностью подставлять нужный текст, типа print_var('ПРИМЕР ТЕКСТА')?

positive:

    [code]
    def print_var(str):
        print(f'\n{" " + str + " ":=^120}\n')
    [/code]
    В результате:
    [code]
    >>> print_var('КАКОЙ_ТО ТЕКСТ')
    ===================================================== КАКОЙ_ТО ТЕКСТ =====================================================
    [/code]

Sample 3:

anchor:

    Не получается перегрузить оператор присваивания в шаблонном классе Нужно перегрузить оператор присваивания в шаблонном классе, не могу понять, почему не работает стандартный синтаксис, при реализации выдает эту ошибку (/home/anton/Programming/tree/tree.h:96: ошибка: overloaded 'operator=' must be a binary operator (has 1 parameter)). Объявление и реализация в одном .h файле.
    Объявление:
    [code]
    tree& operator = (tree &other);
    [/code]
    реализация:
    [code]
    template
    tree& operator = (tree &other)
    {
    }
    [/code]

positive:

    Ну надо указать, какому классу он принадлежит... А так вы пытались реализовать унарный оператор=...
    [code]
    template
    tree& tree::operator = (tree &other)
    {
    }
    [/code]
    И еще - вы точно планируете при присваивании менять присваиваемое? Может, лучше
    [code]
    template
    tree& tree::operator = (const tree &other)
    {
    }
    [/code]

- Loss: MatryoshkaLoss with these parameters:

```json
{
    "loss": "CachedMultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
```
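Because every loss above is wrapped in MatryoshkaLoss with dims [768, 512, 256, 128, 64], the model's embeddings can be truncated to any of those prefix lengths and re-normalized, trading a modest quality drop for smaller indexes. A minimal sketch of that truncation (random vectors stand in for real `model.encode(...)` output; in practice you can get the same effect by passing `truncate_dim=256` to `SentenceTransformer`):

```python
import numpy as np

def truncate_and_normalize(emb, dim):
    """Keep the first `dim` Matryoshka dimensions and re-normalize,
    so that dot products are again cosine similarities."""
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(2, 768))           # stand-in for model.encode([...]) output
small = truncate_and_normalize(full, 256)  # any of 768 / 512 / 256 / 128 / 64
sim = float(small[0] @ small[1])           # cosine similarity at 256 dims
```

Re-normalizing after truncation matters: without it, dot products of the truncated vectors are no longer cosine similarities.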
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 4
- per_device_eval_batch_size: 16
- gradient_accumulation_steps: 32
- learning_rate: 2e-05
- num_train_epochs: 2
- warmup_ratio: 0.1
- bf16: True
- resume_from_checkpoint: ../models/RuModernBERT-base_bs128_lr_2e-05_2nd_epoch/checkpoint-27400
- auto_find_batch_size: True
- batch_sampler: no_duplicates
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 4
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 32
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: ../models/RuModernBERT-base_bs128_lr_2e-05_2nd_epoch/checkpoint-27400
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: True
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}
Framework Versions
- Python: 3.10.11
- Sentence Transformers: 5.1.2
- Transformers: 4.52.3
- PyTorch: 2.6.0+cu124
- Accelerate: 1.12.0
- Datasets: 4.0.0
- Tokenizers: 0.21.4
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
CachedMultipleNegativesRankingLoss
@misc{gao2021scaling,
title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
year={2021},
eprint={2101.06983},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
CoSENTLoss
@article{10531646,
author={Huang, Xiang and Peng, Hao and Zou, Dongcheng and Liu, Zhiwei and Li, Jianxin and Liu, Kay and Wu, Jia and Su, Jianlin and Yu, Philip S.},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={CoSENT: Consistent Sentence Embedding via Similarity Ranking},
year={2024},
doi={10.1109/TASLP.2024.3402087}
}
Model tree for fyaronskiy/code_retriever_ru_en
- Base model: deepvk/RuModernBERT-base
Evaluation results
Self-reported metrics on two evaluation sets (evaluation dataset names not specified):

| Metric | Eval set 1 | Eval set 2 |
|:--|--:|--:|
| Cosine Accuracy@1 | 0.868 | 0.874 |
| Cosine Accuracy@3 | 0.944 | 0.943 |
| Cosine Accuracy@5 | 0.957 | 0.955 |
| Cosine Accuracy@10 | 0.967 | 0.964 |
| Cosine Precision@1 | 0.868 | 0.874 |
| Cosine Recall@1 | 0.868 | 0.874 |
| Cosine Recall@3 | 0.944 | 0.943 |
| Cosine Recall@5 | 0.957 | 0.955 |
| Cosine Recall@10 | 0.967 | |
| Cosine NDCG@10 | 0.922 | |
| Cosine MRR@10 | 0.908 | |
| Cosine MAP@100 | 0.908 | |
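In these results Accuracy@k equals Recall@k because each query has exactly one relevant document. The metrics follow the standard single-gold-document retrieval definitions, which can be sketched as follows (a toy 3x3 score matrix is used for illustration; this is not the card's actual evaluation code):

```python
import numpy as np

def retrieval_metrics(scores, k=3):
    """scores[i, j] = similarity between query i and document j;
    the relevant document for query i is document i."""
    n = scores.shape[0]
    ranks = []
    for i in range(n):
        order = np.argsort(-scores[i])               # documents sorted by score, best first
        rank = int(np.where(order == i)[0][0]) + 1   # 1-based rank of the gold document
        ranks.append(rank)
    ranks = np.array(ranks)
    accuracy_at_k = float(np.mean(ranks <= k))       # equals recall@k with one gold doc
    mrr = float(np.mean(1.0 / ranks))
    return accuracy_at_k, mrr

# Toy scores: gold document ranked 1st, 1st, and 2nd for the three queries.
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.8, 0.4],
                   [0.7, 0.2, 0.6]])
acc3, mrr = retrieval_metrics(scores, k=3)
```

With these toy ranks (1, 1, 2), Accuracy@3 is 1.0 while Accuracy@1 is 2/3, mirroring how the reported Accuracy@k figures grow with k.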