SentenceTransformer
This is a sentence-transformers model trained on the cornstack_python, cornstack_python_pairs, codesearchnet, codesearchnet_pairs, and solyanka_qa datasets. It maps sentences and paragraphs to a 768-dimensional dense vector space.
The model can be used for text-to-code and code-to-text retrieval tasks where the text is in Russian or English and the code is in Python, Java, JavaScript, Go, PHP, or Ruby. Queries and documents can also be a mix of natural-language text and code. Performance on code-to-code tasks was not measured.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: RuModernBERT-base
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Datasets:
- cornstack_python
- cornstack_python_pairs
- codesearchnet
- codesearchnet_pairs
- solyanka_qa
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
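The Pooling module above uses mean pooling: token embeddings are averaged, counting only positions where the attention mask is 1. A minimal pure-Python sketch of that step on toy vectors (illustrative only, not the library's implementation):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping masked (padding) positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:  # only real tokens contribute to the sentence embedding
            for i, v in enumerate(vec):
                summed[i] += v
            n += 1
    return [s / n for s in summed]

# Three token vectors; the last one is padding (mask = 0)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
print(mean_pool(tokens, [1, 1, 0]))  # → [2.0, 3.0]
```

The padding vector is ignored entirely, which is why the averaged output reflects only the real tokens.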
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
import torch
from sentence_transformers import SentenceTransformer, util
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("fyaronskiy/code_retriever_ru_en").to(device)
queries_ru = [
    "Напиши функцию на Python, которая рекурсивно вычисляет факториал числа.",
    "Как проверить, является ли строка палиндромом?",
    "Объедини два отсортированных списка в один отсортированный список."
]

corpus_ru = [
    # Relevant to Q1
    """def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)""",
    # Hard negative for Q1
    """def sum_recursive(n):
    if n == 0:
        return 0
    return n + sum_recursive(n - 1)""",
    # Relevant to Q2
    """def is_palindrome(s: str) -> bool:
    s = s.lower().replace(" ", "")
    return s == s[::-1]""",
    # Hard negative for Q2
    """def reverse_string(s: str) -> str:
    return s[::-1]""",
    # Relevant to Q3
    """def merge_sorted_lists(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result""",
    # Hard negative for Q3
    """def add_lists(a, b):
    return [x + y for x, y in zip(a, b)]"""
]

doc_embeddings = model.encode(corpus_ru, convert_to_tensor=True, device=device)
query_embeddings = model.encode(queries_ru, convert_to_tensor=True, device=device)

# Run the search for each query
for i, query in enumerate(queries_ru):
    scores = util.cos_sim(query_embeddings[i], doc_embeddings)[0]
    best_idx = torch.argmax(scores).item()
    print(f"\nQuery {i+1}: {query}")
    print("Scores for all documents in the corpus:", scores)
    print(f"Best matching document (score={scores[best_idx]:.4f}):\n{corpus_ru[best_idx]}")
The model was trained with Matryoshka loss at dimensions 768, 512, 256, 128, and 64, so you can truncate embeddings to reduce the memory footprint of your vector database and to speed up inference.
To do this, initialize the model as follows:
matryoshka_dim = 128
model = SentenceTransformer("fyaronskiy/code_retriever_ru_en", truncate_dim=matryoshka_dim).to(device)
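Conceptually, a Matryoshka-truncated embedding is just the first truncate_dim components of the full vector, re-normalized to unit length before computing cosine similarity. A minimal pure-Python sketch of that idea (illustrative only, not the library's implementation):

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    v = vec[:dim]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

emb = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
small = truncate_and_normalize(emb, 4)
print(len(small))                           # → 4
print(round(sum(x * x for x in small), 6))  # → 1.0
```

Because the Matryoshka objective optimizes every prefix of the embedding, the truncated vectors remain usable for retrieval at a fraction of the storage cost.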
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.8684 |
| cosine_accuracy@3 | 0.9439 |
| cosine_accuracy@5 | 0.9566 |
| cosine_accuracy@10 | 0.9668 |
| cosine_precision@1 | 0.8684 |
| cosine_precision@3 | 0.3146 |
| cosine_precision@5 | 0.1913 |
| cosine_precision@10 | 0.0967 |
| cosine_recall@1 | 0.8684 |
| cosine_recall@3 | 0.9439 |
| cosine_recall@5 | 0.9566 |
| cosine_recall@10 | 0.9668 |
| cosine_ndcg@10 | 0.9224 |
| cosine_mrr@10 | 0.9076 |
| cosine_map@100 | 0.9083 |
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.8742 |
| cosine_accuracy@3 | 0.9425 |
| cosine_accuracy@5 | 0.9549 |
| cosine_accuracy@10 | 0.9644 |
| cosine_precision@1 | 0.8742 |
| cosine_precision@3 | 0.3142 |
| cosine_precision@5 | 0.191 |
| cosine_precision@10 | 0.0964 |
| cosine_recall@1 | 0.8742 |
| cosine_recall@3 | 0.9425 |
| cosine_recall@5 | 0.9549 |
| cosine_recall@10 | 0.9644 |
| cosine_ndcg@10 | 0.9234 |
| cosine_mrr@10 | 0.9098 |
| cosine_map@100 | 0.9105 |
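For reference, accuracy@k in the tables above is the fraction of queries whose relevant document appears in the top k results, and MRR@10 averages the reciprocal rank of the first relevant hit within the top 10. A minimal pure-Python sketch of these two metrics on toy data (illustrative only, not the InformationRetrievalEvaluator implementation):

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for ranking, rel in zip(ranked_ids, relevant_ids)
        if any(doc in rel for doc in ranking[:k])
    )
    return hits / len(ranked_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranking, rel in zip(ranked_ids, relevant_ids):
        for pos, doc in enumerate(ranking[:k], start=1):
            if doc in rel:
                total += 1.0 / pos
                break
    return total / len(ranked_ids)

# Two toy queries: the first has its relevant doc at rank 1, the second at rank 3
ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d1"}, {"d6"}]
print(accuracy_at_k(ranked, relevant, 1))  # → 0.5
print(mrr_at_k(ranked, relevant, 10))      # → 0.6666666666666666
```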
Training Details
Training Datasets
cornstack_python
- Dataset: cornstack_python
- Size: 2,869,969 training samples
- Columns:
ru_query, document, and negative_0 through negative_15
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| ru_query | string | 7 tokens | 27.46 tokens | 162 tokens |
| document | string | 6 tokens | 304.38 tokens | 5574 tokens |
| negative_0 | string | 6 tokens | 237.08 tokens | 3627 tokens |
| negative_1 | string | 6 tokens | 229.94 tokens | 6691 tokens |
| negative_2 | string | 6 tokens | 230.06 tokens | 6229 tokens |
| negative_3 | string | 7 tokens | 230.7 tokens | 4876 tokens |
| negative_4 | string | 8 tokens | 220.57 tokens | 4876 tokens |
| negative_5 | string | 7 tokens | 236.08 tokens | 5880 tokens |
| negative_6 | string | 6 tokens | 247.91 tokens | 6621 tokens |
| negative_7 | string | 6 tokens | 207.62 tokens | 3350 tokens |
| negative_8 | string | 6 tokens | 222.54 tokens | 6863 tokens |
| negative_9 | string | 6 tokens | 221.53 tokens | 4976 tokens |
| negative_10 | string | 7 tokens | 216.06 tokens | 4876 tokens |
| negative_11 | string | 7 tokens | 197.03 tokens | 4763 tokens |
| negative_12 | string | 6 tokens | 200.83 tokens | 8192 tokens |
| negative_13 | string | 6 tokens | 204.94 tokens | 3210 tokens |
| negative_14 | string | 6 tokens | 188.51 tokens | 2754 tokens |
| negative_15 | string | 6 tokens | 188.27 tokens | 4876 tokens |
- Samples:
- ru_query: установите значение business_id сообщения данных в конкретное значение
  document:
      def step_impl_the_ru_is_set_to(context, business_id):
          context.bdd_helper.message_data["business_id"] = business_id
  negative_0:
      def business_id(self, business_id):
          self._business_id = business_id
  negative_1:
      def business_phone(self, business_phone):
          self._business_phone = business_phone
  negative_2 … negative_15: further distractor setters and handlers (business_phone_number, bus_ob_id, _set_id, business_email, mailing_id, message_id, business_model, business_account, update_business, set_company_id_value, id, set_bribe, business_owner)
- ru_query: Установить состояние правил sid
  document:
      def set_state_sid_request(ruleset_name, sid):
          message = json.loads(request.stream.read().decode('utf-8'))
          message['sid'] = sid
          result = host.patch_state(ruleset_name, message)
          return jsonify(result)
  negative_0 … negative_15: distractor state/sid setters (sid, set_state, setstate, set_rule, set_ident, state_id, set_domain_sid, set_srid, and similar)
- ru_query: Отправить события sid в ruleset
  document:
      def post_sid_events(ruleset_name, sid):
          message = json.loads(request.stream.read().decode('utf-8'))
          message['sid'] = sid
          result = host.post(ruleset_name, message)
          return jsonify(result)
  negative_0 … negative_15: distractor POST/event handlers (post_events, set_state_sid_request, post, post_event, store_event, setFilterOnRule, and similar)
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
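This dataset is trained with MatryoshkaLoss wrapping CachedMultipleNegativesRankingLoss, which scores a query against its positive document and the in-batch/hard negatives and applies a cross-entropy over the scaled similarities. A simplified pure-Python sketch of that objective for a single query (the scale value mirrors a common default but is illustrative here, not the cached implementation used in training):

```python
import math

def multiple_negatives_ranking_loss(sim_pos, sim_negs, scale=20.0):
    """-log softmax of the positive similarity against all candidates."""
    logits = [scale * sim_pos] + [scale * s for s in sim_negs]
    m = max(logits)  # stabilize the log-sum-exp
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(scale * sim_pos - log_sum)

# The loss shrinks as the positive outscores the hard negatives
easy = multiple_negatives_ranking_loss(0.9, [0.1, 0.2])
hard = multiple_negatives_ranking_loss(0.4, [0.5, 0.6])
print(easy < hard)  # → True
```

MatryoshkaLoss then applies this same objective to each embedding prefix (768, 512, 256, 128, 64) with equal weights, which is what makes truncated embeddings retrieval-ready.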
cornstack_python_pairs
- Dataset: cornstack_python_pairs
- Size: 1,434,984 training samples
- Columns:
en_query, ru_query, and label
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| en_query | string | 7 tokens | 26.96 tokens | 150 tokens |
| ru_query | string | 7 tokens | 27.46 tokens | 162 tokens |
| label | float | 1.0 | 1.0 | 1.0 |
- Samples:
| en_query | ru_query | label |
|---|---|---|
| set the message data business_id to a specific value | установите значение business_id сообщения данных в конкретное значение | 1.0 |
| Set ruleset state sid | Установить состояние правил sid | 1.0 |
| Post sid events to the ruleset | Отправить события sid в ruleset | 1.0 |
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CoSENTLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
codesearchnet
- Dataset: codesearchnet at 3f90200
- Size: 1,880,853 training samples
- Columns:
ru_func_documentation_string and func_code_string
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| ru_func_documentation_string | string | 5 tokens | 95.0 tokens | 619 tokens |
| func_code_string | string | 62 tokens | 522.56 tokens | 8192 tokens |
- Samples:
ru_func_documentation_string: Мультипроцессинг-целевой объект для устройства очереди zmq
func_code_string:
def zmq_device(self):
'''
Multiprocessing target for the zmq queue device
'''
self.__setup_signals()
salt.utils.process.appendproctitle('MWorkerQueue')
self.context = zmq.Context(self.opts['worker_threads'])
# Prepare the zeromq sockets
self.uri = 'tcp://{interface}:{ret_port}'.format(**self.opts)
self.clients = self.context.socket(zmq.ROUTER)
if self.opts['ipv6'] is True and hasattr(zmq, 'IPV4ONLY'):
# IPv6 sockets work for both IPv6 and IPv4 addresses
self.clients.setsockopt(zmq.IPV4ONLY, 0)
self.clients.setsockopt(zmq.BACKLOG, self.opts.get('zmq_backlog', 1000))
self._start_zmq_monitor()
self.workers = self.context.socket(zmq.DEALER)
if self.opts.get('ipc_mode', '') == 'tcp':
self.w_uri = 'tcp://127.0.0.1:{0}'.format(
self.opts.get('tcp_master_workers', 4515)
)
else:
self.w_uri = 'ipc:...
ru_func_documentation_string: Чисто завершите работу сокета роутера
func_code_string:
def close(self):
'''
Cleanly shutdown the router socket
'''
if self._closing:
return
log.info('MWorkerQueue under PID %s is closing', os.getpid())
self._closing = True
# pylint: disable=E0203
if getattr(self, '_monitor', None) is not None:
self._monitor.stop()
self._monitor = None
if getattr(self, '_w_monitor', None) is not None:
self._w_monitor.stop()
self._w_monitor = None
if hasattr(self, 'clients') and self.clients.closed is False:
self.clients.close()
if hasattr(self, 'workers') and self.workers.closed is False:
self.workers.close()
if hasattr(self, 'stream'):
self.stream.close()
if hasattr(self, '_socket') and self._socket.closed is False:
self._socket.close()
if hasattr(self, 'context') and self.context.closed is False:
self.context.term()
ru_func_documentation_string: До форка нам нужно создать устройство zmq роутера
:param func process_manager: Экземпляр класса salt.utils.process.ProcessManager
func_code_string:
def pre_fork(self, process_manager):
'''
Pre-fork we need to create the zmq router device
:param func process_manager: An instance of salt.utils.process.ProcessManager
'''
salt.transport.mixins.auth.AESReqServerMixin.pre_fork(self, process_manager)
process_manager.add_process(self.zmq_device)
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
codesearchnet_pairs
- Dataset: codesearchnet_pairs at 3f90200
- Size: 940,426 training samples
- Columns:
en_func_documentation_string, ru_func_documentation_string, and label
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| en_func_documentation_string | string | 5 tokens | 102.69 tokens | 1485 tokens |
| ru_func_documentation_string | string | 5 tokens | 95.0 tokens | 619 tokens |
| label | float | 1.0 | 1.0 | 1.0 |
- Samples:
| en_func_documentation_string | ru_func_documentation_string | label |
|---|---|---|
| Multiprocessing target for the zmq queue device | Мультипроцессинг-целевой объект для устройства очереди zmq | 1.0 |
| Cleanly shutdown the router socket | Чисто завершите работу сокета роутера | 1.0 |
| Pre-fork we need to create the zmq router device :param func process_manager: An instance of salt.utils.process.ProcessManager | До форка нам нужно создать устройство zmq роутера :param func process_manager: Экземпляр класса salt.utils.process.ProcessManager | 1.0 |
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CoSENTLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
solyanka_qa
- Dataset: solyanka_qa at deeac62
- Size: 85,523 training samples
- Columns:
anchor and positive
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| anchor | string | 19 tokens | 202.49 tokens | 518 tokens |
| positive | string | 16 tokens | 196.36 tokens | 524 tokens |
- Samples:
anchor positive Как происходит взаимодействие нескольких языков программирования? Понятно, что большинство (если не все) крупные энтерпрайз сервисы, приложения и тд. (не только веб) написаны с использованием не одного языка программирования, а нескольких. И эти составные части, написанные на разных языках, как-то взаимодействуют между собой (фронт, бизнес-логика, еще что-то).
Опыта разработки подобных систем у меня нет, поэтому не совсем могу представить, как это происходит. Подозреваю, что взаимодействие идет через независимые от языков средства. Например, нечто написанное на одном языке, шлет через TCP-IP пакет, который ловится и обрабатывается чем-то написанным на другом языке. Либо через HTTP запросы. Либо через запись/чтение из БД. Либо через файловый обмен, XML например.
Хотелось бы, чтобы знающие люди привели пару примеров, как это обычно происходит. Не просто в двух словах, мол "фронт на яваскрипте, бэк на яве", а с техническими нюансами. Заранее спасибо.
positive: Несколько языков могут сосуществовать как в рамках одного процесса, так и в рамках нескольких.
Проще всего сосуществовать в рамках нескольких процессов: если процессы обмениваются данными, то совершенно всё равно (ну, в известных рамках), на каком языке эти данные были созданы, и какой язык их читает. Например, вы можете генерировать данные в виде HTML сервером на ASP.NET, а читать браузером, написанным на C++. (Да, пара из сервера и клиента — тоже взаимодействие языков.)
Теперь, если мы хотим взаимодействие в рамках одного процесса, нам нужно уметь вызывать друг друга. Для этого нужен общий стандарт вызова. Часто таким общим стандартом являются бинарные соглашения C (extern "C", экспорт из DLL в Windows).
Ещё пример общего стандарта — COM: COM-объекты можно писать на многих языках, так что если в языке есть часть, реализующая стандарт COM, он может вполне пользоваться им.
Отдельная возможность, популярная сейчас — языки, компилирующиеся в общий промежуточный код. Например, Java и Sc...
anchor: Слэши и ковычки после использования stringify Есть подобный скрипт:
[code]
var output = {
lol: [
{name: "hahaha"}
]
};
console.log(output);
output = JSON.stringify(output);
console.log(output);
[/code]
в итоге получаем
почему он вставил слэши и кавычки там, где не надо?
positive: Может сразу сделать валидный JSON
[code]
var output = {
lol: {name: "hahaha"}
};
console.log(output);
output = JSON.stringify(output);
console.log(output);
[/code]
Правда я незнаю что за переменная name
anchor: Оптимизация поиска числа в списке Есть функция. Она принимает число от 1 до 9 (мы ищем, есть ли оно в списке), и список, в котором мы его ищем)
[code]
def is_number_already_in(number, line):
equality = False
for i in line:
if i == number:
equality = True
if equality:
return True
else:
return False
[/code]
Как можно этот код оптимизировать и как называется способ (тема) оптимизации, чтобы я мог загуглить
Только не через лямбду, пожалуйста)
positive:
>
[code]
> if equality:
> return True
> else:
> return False
>
[/code]
[code]
return equality
[/code]
>
[code]
> equality = False
> for i in line:
> if i == number:
> equality = True
>
[/code]
[code]
equality = any(i == number for i in line)
[/code]
Всё целиком:
[code]
def is_number_already_in(number, line):
return any(i == number for i in line)
[/code]
Хотя на самом деле вроде бы можно гораздо проще
[code]
def is_number_already_in(number, line):
return number in line
[/code]
PS: Не проверял, но в любом случае идея должна быть понятна.
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
Evaluation Datasets
codesearchnet
- Dataset: codesearchnet at 3f90200
- Size: 30,000 evaluation samples
- Columns:
ru_func_documentation_string and func_code_string
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| ru_func_documentation_string | string | 6 tokens | 194.76 tokens | 1278 tokens |
| func_code_string | string | 58 tokens | 580.66 tokens | 8192 tokens |
- Samples:
ru_func_documentation_string:
Обучить модель deepq.
Параметры
-------
env: gym.Env
среда для обучения
network: строка или функция
нейронная сеть, используемая в качестве аппроксиматора функции Q. Если строка, она должна быть одной из имен зарегистрированных моделей в baselines.common.models
(mlp, cnn, conv_only). Если функция, она должна принимать тензор наблюдения и возвращать тензор скрытой переменной, которая
будет отображена в головы функции Q (см. build_q_func в baselines.deepq.models для деталей по этому поводу)
seed: int или None
seed генератора случайных чисел. Запуски с одинаковым seed "должны" давать одинаковые результаты. Если None, используется отсутствие семени.
lr: float
скорость обучения для оптимизатора Adam
total_timesteps: int
количество шагов среды для оптимизации
buffer_size: int
размер буфера воспроизведения
exploration_fraction: float
доля всего периода обучения, в течение которого прои...
func_code_string:
def learn(env,
network,
seed=None,
lr=5e-4,
total_timesteps=100000,
buffer_size=50000,
exploration_fraction=0.1,
exploration_final_eps=0.02,
train_freq=1,
batch_size=32,
print_freq=100,
checkpoint_freq=10000,
checkpoint_path=None,
learning_starts=1000,
gamma=1.0,
target_network_update_freq=500,
prioritized_replay=False,
prioritized_replay_alpha=0.6,
prioritized_replay_beta0=0.4,
prioritized_replay_beta_iters=None,
prioritized_replay_eps=1e-6,
param_noise=False,
callback=None,
load_path=None,
**network_kwargs
):
"""Train a deepq model.
Parameters
-------
env: gym.Env
environment to train on
network: string or a function
neural network to use as a q function approximator. If string, has to be one of the ...
ru_func_documentation_string: Сохранить модель в pickle, расположенный по пути path
func_code_string:
def save_act(self, path=None):
"""Save model to a pickle located atpath"""
if path is None:
path = os.path.join(logger.get_dir(), "model.pkl")
with tempfile.TemporaryDirectory() as td:
save_variables(os.path.join(td, "model"))
arc_name = os.path.join(td, "packed.zip")
with zipfile.ZipFile(arc_name, 'w') as zipf:
for root, dirs, files in os.walk(td):
for fname in files:
file_path = os.path.join(root, fname)
if file_path != arc_name:
zipf.write(file_path, os.path.relpath(file_path, td))
with open(arc_name, "rb") as f:
model_data = f.read()
with open(path, "wb") as f:
cloudpickle.dump((model_data, self._act_params), f)
ru_func_documentation_string: CNN из статьи Nature.
func_code_string:
def nature_cnn(unscaled_images, **conv_kwargs):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
**conv_kwargs))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
codesearchnet_en
- Dataset: codesearchnet_en at 3f90200
- Size: 30,000 evaluation samples
- Columns:
en_func_documentation_string and func_code_string
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| en_func_documentation_string | string | 6 tokens | 200.33 tokens | 2498 tokens |
| func_code_string | string | 58 tokens | 580.66 tokens | 8192 tokens |
- Samples:
en_func_documentation_string:
Train a deepq model.
Parameters
-------
env: gym.Env
environment to train on
network: string or a function
neural network to use as a q function approximator. If string, has to be one of the names of registered models in baselines.common.models
(mlp, cnn, conv_only). If a function, should take an observation tensor and return a latent variable tensor, which
will be mapped to the Q function heads (see build_q_func in baselines.deepq.models for details on that)
seed: int or None
prng seed. The runs with the same seed "should" give the same results. If None, no seeding is used.
lr: float
learning rate for adam optimizer
total_timesteps: int
number of env steps to optimizer for
buffer_size: int
size of the replay buffer
exploration_fraction: float
fraction of entire training period over which the exploration rate is annealed
exploration_final_eps: float
final value of ra...
func_code_string:
def learn(env,
network,
seed=None,
lr=5e-4,
total_timesteps=100000,
buffer_size=50000,
exploration_fraction=0.1,
exploration_final_eps=0.02,
train_freq=1,
batch_size=32,
print_freq=100,
checkpoint_freq=10000,
checkpoint_path=None,
learning_starts=1000,
gamma=1.0,
target_network_update_freq=500,
prioritized_replay=False,
prioritized_replay_alpha=0.6,
prioritized_replay_beta0=0.4,
prioritized_replay_beta_iters=None,
prioritized_replay_eps=1e-6,
param_noise=False,
callback=None,
load_path=None,
**network_kwargs
):
"""Train a deepq model.
Parameters
-------
env: gym.Env
environment to train on
network: string or a function
neural network to use as a q function approximator. If string, has to be one of the ...
en_func_documentation_string: Save model to a pickle located at path
func_code_string:
def save_act(self, path=None):
"""Save model to a pickle located atpath"""
if path is None:
path = os.path.join(logger.get_dir(), "model.pkl")
with tempfile.TemporaryDirectory() as td:
save_variables(os.path.join(td, "model"))
arc_name = os.path.join(td, "packed.zip")
with zipfile.ZipFile(arc_name, 'w') as zipf:
for root, dirs, files in os.walk(td):
for fname in files:
file_path = os.path.join(root, fname)
if file_path != arc_name:
zipf.write(file_path, os.path.relpath(file_path, td))
with open(arc_name, "rb") as f:
model_data = f.read()
with open(path, "wb") as f:
cloudpickle.dump((model_data, self._act_params), f)
en_func_documentation_string: CNN from Nature paper.
func_code_string:
def nature_cnn(unscaled_images, **conv_kwargs):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
**conv_kwargs))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
- Loss:
  MatryoshkaLoss with these parameters:
  { "loss": "CachedMultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
codesearchnet_pairs
- Dataset: codesearchnet_pairs at 3f90200
- Size: 30,000 evaluation samples
- Columns:
en_func_documentation_string, ru_func_documentation_string, and label
- Approximate statistics based on the first 1000 samples:

| Column | Type | Min | Mean | Max |
|---|---|---|---|---|
| en_func_documentation_string | string | 6 tokens | 200.33 tokens | 2498 tokens |
| ru_func_documentation_string | string | 6 tokens | 194.76 tokens | 1278 tokens |
| label | float | 1.0 | 1.0 | 1.0 |
- Samples:
en_func_documentation_string:
Train a deepq model.
Parameters
-------
env: gym.Env
environment to train on
network: string or a function
neural network to use as a q function approximator. If string, has to be one of the names of registered models in baselines.common.models
(mlp, cnn, conv_only). If a function, should take an observation tensor and return a latent variable tensor, which
will be mapped to the Q function heads (see build_q_func in baselines.deepq.models for details on that)
seed: int or None
prng seed. The runs with the same seed "should" give the same results. If None, no seeding is used.
lr: float
learning rate for adam optimizer
total_timesteps: int
number of env steps to optimizer for
buffer_size: int
size of the replay buffer
exploration_fraction: float
fraction of entire training period over which the exploration rate is annealed
exploration_final_eps: float
final value of ra...Обучить модель deepq.
Параметры
-------
env: gym.Env
среда для обучения
network: строка или функция
нейронная сеть, используемая в качестве аппроксиматора функции Q. Если строка, она должна быть одной из имен зарегистрированных моделей в baselines.common.models
(mlp, cnn, conv_only). Если функция, она должна принимать тензор наблюдения и возвращать тензор скрытой переменной, которая
будет отображена в головы функции Q (см. build_q_func в baselines.deepq.models для деталей по этому поводу)
seed: int или None
seed генератора случайных чисел. Запуски с одинаковым seed "должны" давать одинаковые результаты. Если None, используется отсутствие семени.
lr: float
скорость обучения для оптимизатора Adam
total_timesteps: int
количество шагов среды для оптимизации
buffer_size: int
размер буфера воспроизведения
exploration_fraction: float
доля всего периода обучения, в течение которого прои...1.0Save model to a pickle located atpathСохранить модель в pickle, расположенный по путиpath1.0CNN from Nature paper.CNN из статьи Nature.1.0 - Loss:
MatryoshkaLosswith these parameters:{ "loss": "CoSENTLoss", "matryoshka_dims": [ 768, 512, 256, 128, 64 ], "matryoshka_weights": [ 1, 1, 1, 1, 1 ], "n_dims_per_step": -1 }
solyanka_qa
- Dataset: solyanka_qa at deeac62
- Size: 5,000 evaluation samples
- Columns: anchor and positive
- Approximate statistics based on the first 1000 samples:

| | anchor | positive |
|:--|:--|:--|
| type | string | string |
| details | min: 17 tokens<br>mean: 200.35 tokens<br>max: 533 tokens | min: 19 tokens<br>mean: 202.53 tokens<br>max: 525 tokens |

- Samples:
anchor / positive

Sample 1:

anchor:

    Atom IDE произвольное изменение строк Пользуюсь Atom IDE, установлены плагины для GIT'а, использую тему Material theme (может быть кому то это что то даст), в общем проблема такая, что в php файлах при сохранении файла, даже если я изменил всего один символ, он добавляет изменения очень странные, берет 2-3 строки (хз как выбирает) и удаляет их, а потом вставляет их же, без каких-либо изменений. При этом GIT фиксирует это изменение...
    Вот скрин в blob формате: "blob:https://web.telegram.org/04094604-204d-47b0-a083-f8cd090bdfa0"

positive:

    Проблема заключалась в том, что все IDE используют свой символ перехода на следующую строку; если в команде разработчики используют разные IDE, у которых разный перенос строки, то при сохранении файла чужие переносы строк будут заменяться на свои :)

Sample 2:

anchor:

    print() с частью текста и форматированием как переменная Python3 Есть повторяющаяся функция print('\n' + f'{" ЗАПУСКАЕМ ТЕСТ ":=^120}' + '\n')
    на выходе получаем что-то типа
    ================ ЗАПУСКАЕМ ТЕСТ ================
    или с другим текстом
    ================= КОНЕЦ ТЕСТА ==================
    Текст внутри может меняться, форматирование - нет.
    Как обернуть print('\n' + f'{"":=^120}' + '\n') в переменную, с возможностью подставлять нужный текст, типа print_var('ПРИМЕР ТЕКСТА')?

positive:

    [code]
    def print_var(str):
        print(f'\n{" " + str + " ":=^120}\n')
    [/code]
    В результате:
    [code]
    >>> print_var('КАКОЙ_ТО ТЕКСТ')
    ===================================================== КАКОЙ_ТО ТЕКСТ =====================================================
    [/code]

Sample 3:

anchor:

    Не получается перегрузить оператор присваивания в шаблонном классе Нужно перегрузить оператор присваивания в шаблонном классе, не могу понять, почему не работает стандартный синтаксис, при реализации выдает эту ошибку (/home/anton/Programming/tree/tree.h:96: ошибка: overloaded 'operator=' must be a binary operator (has 1 parameter)). Объявление и реализация в одном .h файле.
    Объявление:
    [code]
    tree& operator = (tree &other);
    [/code]
    реализация:
    [code]
    template
    tree& operator = (tree &other)
    {
    }
    [/code]

positive:

    Ну надо указать, какому классу он принадлежит... А так вы пытались реализовать унарный оператор=...
    [code]
    template
    tree& tree::operator = (tree &other)
    {
    }
    [/code]
    И еще - вы точно планируете при присваивании менять присваиваемое? Может, лучше
    [code]
    template
    tree& tree::operator = (const tree &other)
    {
    }
    [/code]

- Loss: MatryoshkaLoss with these parameters:

```json
{
    "loss": "CachedMultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
```
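Because every loss above is wrapped in MatryoshkaLoss with dims [768, 512, 256, 128, 64], the model's embeddings can be truncated to any of those prefix lengths and re-normalized, trading a modest quality drop for smaller indexes. A minimal sketch of that truncation (random vectors stand in for real `model.encode(...)` output; in practice you can get the same effect by passing `truncate_dim=256` to `SentenceTransformer`):

```python
import numpy as np

def truncate_and_normalize(emb, dim):
    """Keep the first `dim` Matryoshka dimensions and re-normalize,
    so that dot products are again cosine similarities."""
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(2, 768))           # stand-in for model.encode([...]) output
small = truncate_and_normalize(full, 256)  # any of 768 / 512 / 256 / 128 / 64
sim = float(small[0] @ small[1])           # cosine similarity at 256 dims
```

Re-normalizing after truncation matters: without it, dot products of the truncated vectors are no longer cosine similarities.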
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 4
- per_device_eval_batch_size: 16
- gradient_accumulation_steps: 32
- learning_rate: 2e-05
- num_train_epochs: 2
- warmup_ratio: 0.1
- bf16: True
- resume_from_checkpoint: ../models/RuModernBERT-base_bs128_lr_2e-05_2nd_epoch/checkpoint-27400
- auto_find_batch_size: True
- batch_sampler: no_duplicates
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 4
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 32
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: ../models/RuModernBERT-base_bs128_lr_2e-05_2nd_epoch/checkpoint-27400
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: True
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}
Framework Versions
- Python: 3.10.11
- Sentence Transformers: 5.1.2
- Transformers: 4.52.3
- PyTorch: 2.6.0+cu124
- Accelerate: 1.12.0
- Datasets: 4.0.0
- Tokenizers: 0.21.4
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
CachedMultipleNegativesRankingLoss
@misc{gao2021scaling,
title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
year={2021},
eprint={2101.06983},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
CoSENTLoss
@article{10531646,
author={Huang, Xiang and Peng, Hao and Zou, Dongcheng and Liu, Zhiwei and Li, Jianxin and Liu, Kay and Wu, Jia and Su, Jianlin and Yu, Philip S.},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={CoSENT: Consistent Sentence Embedding via Similarity Ranking},
year={2024},
doi={10.1109/TASLP.2024.3402087}
}
Model tree for fyaronskiy/code_retriever_ru_en
- Base model: deepvk/RuModernBERT-base
Evaluation results
Self-reported metrics on two evaluation sets (evaluation dataset names not specified):

| Metric | Eval set 1 | Eval set 2 |
|:--|--:|--:|
| Cosine Accuracy@1 | 0.868 | 0.874 |
| Cosine Accuracy@3 | 0.944 | 0.943 |
| Cosine Accuracy@5 | 0.957 | 0.955 |
| Cosine Accuracy@10 | 0.967 | 0.964 |
| Cosine Precision@1 | 0.868 | 0.874 |
| Cosine Recall@1 | 0.868 | 0.874 |
| Cosine Recall@3 | 0.944 | 0.943 |
| Cosine Recall@5 | 0.957 | 0.955 |
| Cosine Recall@10 | 0.967 | |
| Cosine NDCG@10 | 0.922 | |
| Cosine MRR@10 | 0.908 | |
| Cosine MAP@100 | 0.908 | |
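In these results Accuracy@k equals Recall@k because each query has exactly one relevant document. The metrics follow the standard single-gold-document retrieval definitions, which can be sketched as follows (a toy 3x3 score matrix is used for illustration; this is not the card's actual evaluation code):

```python
import numpy as np

def retrieval_metrics(scores, k=3):
    """scores[i, j] = similarity between query i and document j;
    the relevant document for query i is document i."""
    n = scores.shape[0]
    ranks = []
    for i in range(n):
        order = np.argsort(-scores[i])               # documents sorted by score, best first
        rank = int(np.where(order == i)[0][0]) + 1   # 1-based rank of the gold document
        ranks.append(rank)
    ranks = np.array(ranks)
    accuracy_at_k = float(np.mean(ranks <= k))       # equals recall@k with one gold doc
    mrr = float(np.mean(1.0 / ranks))
    return accuracy_at_k, mrr

# Toy scores: gold document ranked 1st, 1st, and 2nd for the three queries.
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.8, 0.4],
                   [0.7, 0.2, 0.6]])
acc3, mrr = retrieval_metrics(scores, k=3)
```

With these toy ranks (1, 1, 2), Accuracy@3 is 1.0 while Accuracy@1 is 2/3, mirroring how the reported Accuracy@k figures grow with k.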