ThienLe committed
Commit 8aaa77b · verified · 1 Parent(s): 7f17332

Add new CrossEncoder model

Files changed (6)
  1. .gitattributes +1 -0
  2. README.md +341 -0
  3. config.json +73 -0
  4. model.safetensors +3 -0
  5. tokenizer.json +3 -0
  6. tokenizer_config.json +14 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,341 @@
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - cross-encoder
5
+ - reranker
6
+ - generated_from_trainer
7
+ - dataset_size:49346
8
+ - loss:CachedMultipleNegativesRankingLoss
9
+ pipeline_tag: text-ranking
10
+ library_name: sentence-transformers
11
+ ---
12
+
13
+ # CrossEncoder
14
+
15
+ This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model trained on the json dataset using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.
16
+
17
+ ## Model Details
18
+
19
+ ### Model Description
20
+ - **Model Type:** Cross Encoder
21
+ <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
22
+ - **Maximum Sequence Length:** 4096 tokens
23
+ - **Number of Output Labels:** 1 label
24
+ - **Training Dataset:**
25
+ - json
26
+ <!-- - **Language:** Unknown -->
27
+ <!-- - **License:** Unknown -->
28
+
29
+ ### Model Sources
30
+
31
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
32
+ - **Documentation:** [Cross Encoder Documentation](https://www.sbert.net/docs/cross_encoder/usage/usage.html)
33
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
34
+ - **Hugging Face:** [Cross Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=cross-encoder)
35
+
36
+ ## Usage
37
+
38
+ ### Direct Usage (Sentence Transformers)
39
+
40
+ First install the Sentence Transformers library:
41
+
42
+ ```bash
43
+ pip install -U sentence-transformers
44
+ ```
45
+
46
+ Then you can load this model and run inference.
47
+ ```python
48
+ from sentence_transformers import CrossEncoder
49
+
50
+ # Download from the 🤗 Hub
51
+ model = CrossEncoder("ThienLe/Qwen3-SecRerank")
52
+ # Get scores for pairs of texts
53
+ pairs = [
54
+ ['<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: Security Account Manager\nAdversaries may attempt to extract credential material from the Security Account Manager (SAM) database either through in-memory techniques or through the Windows Registry where the SAM database is stored. The SAM is a database file that contains local accounts for the host, typically those found with the net user command. Enumerating the SAM database requires SYSTEM level access. A number of tools can be used to retrieve the SAM file through in-memory techniques: Alternatively, the SAM can be extracted from the Registry with Reg: Creddump7 can then be used to process the SAM database locally to retrieve hashes.[1] Notes:\n', "<Document>: APT29\nAPT29 is threat group that has been attributed to Russia's Foreign Intelligence Service (SVR).[1][2] They have operated since at least 2008, often targeting government networks in Europe and NATO member countries, research institutes, and think tanks. APT29 reportedly compromised the Democratic National Committee starting in the summer of 2015.[3][4][5][6] In April 2021, the US and UK governments attributed the SolarWinds Compromise to the SVR; public statements included citations to APT29, Cozy Bear, and The Dukes.[7][8] Industry reporting also referred to the actors involved in this campaign as UNC2452, NOBELIUM, StellarParticle, Dark Halo, and SolarStorm.[9][10][11][12][13][14]\nAPT29 has used the reg save command to save registry hives.[4]<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"],
55
+ ['<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: Why don\'t we use MAC address instead of IP address?\n\nI can use the system function in PHP to get the MAC address of site visitors (probably most of you know). Why do we use IP addresss to check whether someone is stealing a cookie or not?\nDoes the system function have more overhead, or is it still insecure when we don\'t send any parameter to the function?\nI know there are some situations in which users change their MAC address, but it happens less than IP address.\nCould you shed some light on it?\n', "<Document>: The reason for that is very simple: You won't get the MAC address of your website visitor over the Internet, because they are lost when the packets are routed. You can only get the MAC addresses from your subnet (through, for example, ARP).<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"],
56
+ ['<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: Native API\nAdversaries may interact with the native OS application programming interface (API) to execute behaviors. Native APIs provide a controlled means of calling low-level OS services within the kernel, such as those involving hardware/devices, memory, and processes.[1][2] These native APIs are leveraged by the OS during system boot (when other system components are not yet initialized) as well as carrying out tasks and requests during routine operations. Adversaries may abuse these OS API functions as a means of executing behaviors. Similar to Command and Scripting Interpreter, the native API and its hierarchy of interfaces provide mechanisms to interact with and utilize various components of a victimized system. Native API functions (such as NtCreateProcess) may be directed invoked via system calls / syscalls, but these features are also often exposed to user-mode applications via interfaces and libraries.[3][4][5] For example, functions such as the Windows API CreateProcess() or GNU fork() will allow programs and scripts to start other processes.[6][7] This may allow API callers to execute a binary, run a CLI command, load modules, etc. as thousands of similar API functions exist for various system operations.[8][9][10] Higher level software frameworks, such as Microsoft .NET and macOS Cocoa, are also available to interact with native APIs. These frameworks typically provide language wrappers/abstractions to API functionalities and are designed for ease-of-use/portability of code.[11][12][13][14] Adversaries may use assembly to directly or in-directly invoke syscalls in an attempt to subvert defensive sensors and detection signatures such as user mode API-hooks.[15] Adversaries may also attempt to tamper with sensors and defensive tools associated with API monitoring, such as unhooking monitored functions via Disable or Modify Tools.\n', '<Document>: Kapeka\nKapeka is a backdoor written in C++ used against victims in Eastern Europe since at least mid-2022. Kapeka has technical overlaps with Exaramel for Windows and Prestige malware variants, both of which are linked to Sandworm Team. Kapeka may have been used in advance of Prestige deployment in late 2022.[1][2]\nKapeka utilizes WinAPI calls to gather victim system information.[124]<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'],
57
+ ['<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: Why aren\'t infinite-depth wildcard certificates allowed?\n\nAs far as I can tell, an SSL certificate for *.example.com is good for foo.example.com and bar.example.com, but not foo.bar.example.com.\nWildcards certificates cannot have *.*.example.com as their subject. I guess this is due to the fact that certificates like example.* aren\'t allowed -- allowing characters before the wildcard can lead to a malicious user matching their certificate with the wrong domain.\nHowever, I don\'t see any problem with allowing certificates of the *.example.com variety to apply to all subdomains, including sub-subdomains to an infinite depth. I don\'t see any use case where the subdomains of a site are "trusted" but the sub-subdomains are not.\nThis probably causes many problems. As far as I can tell, there\'s no way to cleanly get certificates for all sub-subdomains; you either become a CA, or you buy certificates for each subdomain.\nWhat\'s the reasoning, if any, behind restricting *.example.com to single-depth subdomains only?\nBonus question: Similarly, is there a reason behind the blanket ban on characters before a wildcard? After all, if you allow only dots and asterisks before a wildcard, there\'s no way that the site from a different domain can be spoofed.\n', '<Document>: Technically, usage of wildcards is defined in RFC 2818, which does allow names like "*.*.example.com" or "foo.*.bar.*.example.com" or even "*.*.*". However, between theory and practice, there can be, let\'s say, practical differences (theory and practice match perfectly only in theory, not in practice). Web browsers have implemented stricter rules, because:\n\nImplementing multi-level wildcard matching takes a good five minutes more than implementing matching of names with a single wildcard.\nBrowser vendors did not trust existing CA for never issuing an "*.*.com" certificate.\nDevelopers are human beings, thus very good at not seeing what they cannot imagine, so multi-wildcard names were not implemented by people who did not realize that they were possible.\n\nSo Web browsers will apply restrictions, which RFC 6125 tries to at least document. Most RFC are pragmatist: if reality does not match specification, amend the specification, not reality. Note that browsers will also enforce extra rules, like forbidding "*.co.uk" (not all browsers use the same rules, though, and they are not documented either).\nProfessional CA also enter the dance with their own constraints, such as identity checking tests before issuing certificates, or simply unwillingness to issue too broad wildcard certificates: the core business of a professional CA is to sell many certificates, and wildcard certificates don\'t help for that. Quite the opposite, in fact. People want to buy wildcard certificates precisely to avoid buying many individual certificates.\n\nAnother theory which failed to make it into practice is Name Constraints. With this X.509 extension, it would be possible for a CA to issue a certificate to a sub-CA, restricting that sub-CA so that it may issue server certificates only in a given domain tree. A Name Constraints extension with an "explicit subtree" of value ".example.com" would allow www.example.com and foo.bar.example.com. In that model, the owner of the example.com domain would run his own CA, restricted by its über-CA to only names in the example.com subtree. That would be fine and dandy.\nUnfortunately, anything you do with X.509 certificates is completely worthless if deployed implementations (i.e. Web browsers) don\'t support it, and existing browsers don\'t support name constraints. They don\'t, because there is no certificate with name constraints to process (so that would be useless code), and there is no such certificate because Web browsers would not be able to process them anyway. To bootstrap things, someone must start the cycle, but browser vendors wait after professional CA, and professional CA are unwilling to support name constraints, for the same reasons as previously (which all come down to money, in the long run).<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'],
58
+ ['<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: Are Kismet (on OpenWRT) and Snort IDS (on a linux server) compatitble?\n\nI\'m trying to develop an IDS/IPS system project to include these elements:\nA router running OpenWRT running Kismet drone (Attitude Adjustment 12.09rc1)\nA Linux server (Running Kismet server + client)\nI have successfully installed Kismet drone/server on their respective platforms.\nBut I have heard that Snort is rather more well-implemented for an IDPS system.\n\nIs there a way to pass packets captured by Kismet to Snort IDS? I have looked on the internet but I have only found outdated and incomplete answers.\nWould it be a good idea to develop this system by just using Kismet as an IDS, without using Snort at all?\nAny other ideas and suggests are mostly welcome, thank you.\n', "<Document>: I think libpcap and/or tcpdump are what you're looking for. Kismet is a wireless analyzer that will display the 802.11 metadata. There was some activity awhile ago to add 802.11 intrusion detection capabilities (snort-wireless). You might still find useful remnanats of that.\nKismet pulls the 802.11 frame data and analyzes that, then examines some of the TCP/IP data (when it's unencrypted or decrypted) for additional network activity reporting. Kismet can also log the captured data, which you could then feed into an IDS (or any other process) using tcpreplay.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"],
59
+ ]
60
+ scores = model.predict(pairs)
61
+ print(scores.shape)
62
+ # (5,)
63
+
64
+ # Or rank different texts based on similarity to a single text
65
+ ranks = model.rank(
66
+ '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n<Query>: Security Account Manager\nAdversaries may attempt to extract credential material from the Security Account Manager (SAM) database either through in-memory techniques or through the Windows Registry where the SAM database is stored. The SAM is a database file that contains local accounts for the host, typically those found with the net user command. Enumerating the SAM database requires SYSTEM level access. A number of tools can be used to retrieve the SAM file through in-memory techniques: Alternatively, the SAM can be extracted from the Registry with Reg: Creddump7 can then be used to process the SAM database locally to retrieve hashes.[1] Notes:\n',
67
+ [
68
+ "<Document>: APT29\nAPT29 is threat group that has been attributed to Russia's Foreign Intelligence Service (SVR).[1][2] They have operated since at least 2008, often targeting government networks in Europe and NATO member countries, research institutes, and think tanks. APT29 reportedly compromised the Democratic National Committee starting in the summer of 2015.[3][4][5][6] In April 2021, the US and UK governments attributed the SolarWinds Compromise to the SVR; public statements included citations to APT29, Cozy Bear, and The Dukes.[7][8] Industry reporting also referred to the actors involved in this campaign as UNC2452, NOBELIUM, StellarParticle, Dark Halo, and SolarStorm.[9][10][11][12][13][14]\nAPT29 has used the reg save command to save registry hives.[4]<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n",
69
+ "<Document>: The reason for that is very simple: You won't get the MAC address of your website visitor over the Internet, because they are lost when the packets are routed. You can only get the MAC addresses from your subnet (through, for example, ARP).<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n",
70
+ '<Document>: Kapeka\nKapeka is a backdoor written in C++ used against victims in Eastern Europe since at least mid-2022. Kapeka has technical overlaps with Exaramel for Windows and Prestige malware variants, both of which are linked to Sandworm Team. Kapeka may have been used in advance of Prestige deployment in late 2022.[1][2]\nKapeka utilizes WinAPI calls to gather victim system information.[124]<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n',
71
+ '<Document>: Technically, usage of wildcards is defined in RFC 2818, which does allow names like "*.*.example.com" or "foo.*.bar.*.example.com" or even "*.*.*". However, between theory and practice, there can be, let\'s say, practical differences (theory and practice match perfectly only in theory, not in practice). Web browsers have implemented stricter rules, because:\n\nImplementing multi-level wildcard matching takes a good five minutes more than implementing matching of names with a single wildcard.\nBrowser vendors did not trust existing CA for never issuing an "*.*.com" certificate.\nDevelopers are human beings, thus very good at not seeing what they cannot imagine, so multi-wildcard names were not implemented by people who did not realize that they were possible.\n\nSo Web browsers will apply restrictions, which RFC 6125 tries to at least document. Most RFC are pragmatist: if reality does not match specification, amend the specification, not reality. Note that browsers will also enforce extra rules, like forbidding "*.co.uk" (not all browsers use the same rules, though, and they are not documented either).\nProfessional CA also enter the dance with their own constraints, such as identity checking tests before issuing certificates, or simply unwillingness to issue too broad wildcard certificates: the core business of a professional CA is to sell many certificates, and wildcard certificates don\'t help for that. Quite the opposite, in fact. People want to buy wildcard certificates precisely to avoid buying many individual certificates.\n\nAnother theory which failed to make it into practice is Name Constraints. With this X.509 extension, it would be possible for a CA to issue a certificate to a sub-CA, restricting that sub-CA so that it may issue server certificates only in a given domain tree. A Name Constraints extension with an "explicit subtree" of value ".example.com" would allow www.example.com and foo.bar.example.com. In that model, the owner of the example.com domain would run his own CA, restricted by its über-CA to only names in the example.com subtree. That would be fine and dandy.\nUnfortunately, anything you do with X.509 certificates is completely worthless if deployed implementations (i.e. Web browsers) don\'t support it, and existing browsers don\'t support name constraints. They don\'t, because there is no certificate with name constraints to process (so that would be useless code), and there is no such certificate because Web browsers would not be able to process them anyway. To bootstrap things, someone must start the cycle, but browser vendors wait after professional CA, and professional CA are unwilling to support name constraints, for the same reasons as previously (which all come down to money, in the long run).<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n',
72
+ "<Document>: I think libpcap and/or tcpdump are what you're looking for. Kismet is a wireless analyzer that will display the 802.11 metadata. There was some activity awhile ago to add 802.11 intrusion detection capabilities (snort-wireless). You might still find useful remnanats of that.\nKismet pulls the 802.11 frame data and analyzes that, then examines some of the TCP/IP data (when it's unencrypted or decrypted) for additional network activity reporting. Kismet can also log the captured data, which you could then feed into an IDS (or any other process) using tcpreplay.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n",
73
+ ]
74
+ )
75
+ # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
76
+ ```
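For new query/document pairs, the pre-formatted strings in the example above can be assembled programmatically. The following is a minimal sketch: the `format_pair` helper and the template constants are inferred from the example pairs shown here, not an official API of this model, so verify them against the model's tokenizer/chat template before relying on them.

```python
# Hypothetical helper (inferred from the example pairs above, not an official API).
# The template wraps the query and document in the Qwen-style chat markers that
# the example pairs use verbatim.
PREFIX = (
    '<|im_start|>system\nJudge whether the Document meets the requirements '
    'based on the Query and the Instruct provided. Note that the answer can '
    'only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
)
SUFFIX = '<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'
INSTRUCT = 'Given a web search query, retrieve relevant passages that answer the query'

def format_pair(query: str, document: str) -> list[str]:
    """Build a [formatted_query, formatted_document] pair for model.predict()."""
    left = f'{PREFIX}<Instruct>: {INSTRUCT}\n<Query>: {query}\n'
    right = f'<Document>: {document}{SUFFIX}'
    return [left, right]

pair = format_pair(
    "Security Account Manager",
    "APT29 has used the reg save command to save registry hives.",
)
```

A list of such pairs can then be passed directly to `model.predict(pairs)` as in the usage example above.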
77
+
78
+ <!--
79
+ ### Direct Usage (Transformers)
80
+
81
+ <details><summary>Click to see the direct usage in Transformers</summary>
82
+
83
+ </details>
84
+ -->
85
+
86
+ <!--
87
+ ### Downstream Usage (Sentence Transformers)
88
+
89
+ You can finetune this model on your own dataset.
90
+
91
+ <details><summary>Click to expand</summary>
92
+
93
+ </details>
94
+ -->
95
+
96
+ <!--
97
+ ### Out-of-Scope Use
98
+
99
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
100
+ -->
101
+
102
+ <!--
103
+ ## Bias, Risks and Limitations
104
+
105
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
106
+ -->
107
+
108
+ <!--
109
+ ### Recommendations
110
+
111
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
112
+ -->
113
+
114
+ ## Training Details
115
+
116
+ ### Training Dataset
117
+
118
+ #### json
119
+
120
+ * Dataset: json
121
+ * Size: 49,346 training samples
122
+ * Columns: <code>sentence1</code> and <code>sentence2</code>
123
+ * Approximate statistics based on the first 1000 samples:
124
+ | | sentence1 | sentence2 |
125
+ |:--------|:----------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------|
126
+ | type | string | string |
127
+ | details | <ul><li>min: 414 characters</li><li>mean: 1301.6 characters</li><li>max: 19975 characters</li></ul> | <ul><li>min: 84 characters</li><li>mean: 1069.28 characters</li><li>max: 9940 characters</li></ul> |
128
+ * Samples:
129
+ | sentence1 | sentence2 |
130
+ |:---|:---|
131
+ | <code><\|im_start\|>system<br>Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<\|im_end\|><br><\|im_start\|>user<br><Instruct>: Given a web search query, retrieve relevant passages that answer the query<br><Query>: Is it possible to send data from an open-source program but make it impossible for a user with source code to do the same?<br><br>If I want to store a global scoreboard for a game running locally on the user's computer and I want to make sure that all the requests coming to the server are really generated by the game and not spoofed by the users, is there anything that can be done to prevent cheating? The program is open-source, so no obfuscation can be used.<br>The problem I have with coming up with a solution is that whatever I choose to implement in code must necessarily include all the components necessary for creating a spoofed program that can send any data user wants, especially any encryption keys and hash...</code> | <code><Document>: In a word, no. The best option is to move all the actual game logic to the server, and have the client be a thin client that just displays state and sends input. That wouldn't prevent various types of cheating (such as game automation tools) but it's the only way to comprehensively avoid the user sending fraudulent game results. It's a popular approach in multi-player, too, as it avoids telling the user anything they don't need to know. However, it's considerably more expensive in server hardware.<br>The next option is to make it possible to validate high scores. One option would be to record, for every game, the RNG seed value followed by every input that the user provides combined with a timestamp of some sort (for turn-based games this is just an action count; for real-time games it would probably be a game engine frame/tick count). Transmitting all of that back to the server would be a lot, but - combined with the game version and level or whatever - allow the server to ef...</code> |
132
+ | <code><\|im_start\|>system<br>Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<\|im_end\|><br><\|im_start\|>user<br><Instruct>: Given a web search query, retrieve relevant passages that answer the query<br><Query>: Encrypted/Encoded File<br>Adversaries may encrypt or encode files to obfuscate strings, bytes, and other specific patterns to impede detection. Encrypting and/or encoding file content aims to conceal malicious artifacts within a file used in an intrusion. Many other techniques, such as Software Packing, Steganography, and Embedded Payloads, share this same broad objective. Encrypting and/or encoding files could lead to a lapse in detection of static signatures, only for this malicious content to be revealed (i.e., Deobfuscate/Decode Files or Information) at the time of execution/use. This type of file obfuscation can be applied to many file artifacts present on victim hosts, such as malware log/configuration...</code> | <code><Document>: Dark Caracal<br>Dark Caracal is threat group that has been attributed to the Lebanese General Directorate of General Security (GDGS) and has operated since at least 2012. [1]<br>Dark Caracal has obfuscated strings in Bandook by base64 encoding, and then encrypting them.[58]<\|im_end\|><br><\|im_start\|>assistant<br><think><br><br></think><br><br></code> |
133
+ | <code><\|im_start\|>system<br>Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<\|im_end\|><br><\|im_start\|>user<br><Instruct>: Given a web search query, retrieve relevant passages that answer the query<br><Query>: OS Credential Dumping<br>Adversaries may attempt to dump credentials to obtain account login and credential material, normally in the form of a hash or a clear text password. Credentials can be obtained from OS caches, memory, or structures.[1] Credentials can then be used to perform Lateral Movement and access restricted information. Several of the tools mentioned in associated sub-techniques may be used by both adversaries and professional security testers. Additional custom tools likely exist as well.<br></code> | <code><Document>: Ember Bear<br>Ember Bear is a Russian state-sponsored cyber espionage group that has been active since at least 2020, linked to Russia's General Staff Main Intelligence Directorate (GRU) 161st Specialist Training Center (Unit 29155).[1] Ember Bear has primarily focused operations against Ukrainian government and telecommunication entities, but has also operated against critical infrastructure entities in Europe and the Americas.[2] Ember Bear conducted the WhisperGate destructive wiper attacks against Ukraine in early 2022.[3][4][1] There is some confusion as to whether Ember Bear overlaps with another Russian-linked entity referred to as Saint Bear. At present available evidence strongly suggests these are distinct activities with different behavioral profiles.[2][5]<br>Ember Bear gathers credential material from target systems, such as SSH keys, to facilitate access to victim environments.[12]<\|im_end\|><br><\|im_start\|>assistant<br><think><br><br></think><br><br></code> |
134
+ * Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
135
+ ```json
136
+ {
137
+ "scale": 10.0,
138
+ "num_negatives": 4,
139
+ "activation_fn": "torch.nn.modules.activation.Sigmoid",
140
+ "mini_batch_size": 32
141
+ }
142
+ ```
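For intuition, a multiple-negatives ranking loss treats each query's positive document plus its sampled negatives as a multiple-choice problem: cross-encoder logits pass through the activation, are scaled, and are scored with softmax cross-entropy against the positive. The sketch below is an illustrative plain-Python approximation under those assumptions, using the parameters listed above (`scale=10.0`, Sigmoid activation, positive assumed at index 0); it is not the library's cached implementation, which additionally processes large batches in `mini_batch_size`-sized chunks.

```python
import math

def mnrl_loss(raw_logits, scale=10.0):
    """Illustrative loss for one query: raw_logits[0] is the positive document,
    the rest are negatives (num_negatives=4 in the config above)."""
    # Sigmoid activation maps raw logits into (0, 1), then the scale sharpens
    # the resulting softmax distribution.
    acts = [scale * (1.0 / (1.0 + math.exp(-z))) for z in raw_logits]
    # Numerically stable softmax cross-entropy with the positive at index 0.
    m = max(acts)
    log_sum = m + math.log(sum(math.exp(a - m) for a in acts))
    return log_sum - acts[0]

# When the positive clearly outscores the negatives, the loss is near zero;
# when a negative outscores the positive, the loss is large.
low = mnrl_loss([4.0, -3.0, -2.5, -4.0, -1.0])
high = mnrl_loss([-3.0, 4.0, 2.5, 1.0, 0.0])
print(low < high)  # True
```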
143
+
144
+ ### Evaluation Dataset
145
+
146
+ #### json
147
+
148
+ * Dataset: json
149
+ * Size: 12,337 evaluation samples
150
+ * Columns: <code>sentence1</code> and <code>sentence2</code>
151
+ * Approximate statistics based on the first 1000 samples:
152
+ | | sentence1 | sentence2 |
153
+ |:--------|:-----------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------|
154
+ | type | string | string |
155
+ | details | <ul><li>min: 352 characters</li><li>mean: 1300.22 characters</li><li>max: 10030 characters</li></ul> | <ul><li>min: 85 characters</li><li>mean: 1080.48 characters</li><li>max: 9526 characters</li></ul> |
156
+ * Samples:
157
+ | sentence1 | sentence2 |
158
+ |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | <code><\|im_start\|>system<br>Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<\|im_end\|><br><\|im_start\|>user<br><Instruct>: Given a web search query, retrieve relevant passages that answer the query<br><Query>: Security Account Manager<br>Adversaries may attempt to extract credential material from the Security Account Manager (SAM) database either through in-memory techniques or through the Windows Registry where the SAM database is stored. The SAM is a database file that contains local accounts for the host, typically those found with the net user command. Enumerating the SAM database requires SYSTEM level access. A number of tools can be used to retrieve the SAM file through in-memory techniques: Alternatively, the SAM can be extracted from the Registry with Reg: Creddump7 can then be used to process the SAM database locally to retrieve hashes.[1] Notes:<br></code> | <code><Document>: APT29<br>APT29 is threat group that has been attributed to Russia's Foreign Intelligence Service (SVR).[1][2] They have operated since at least 2008, often targeting government networks in Europe and NATO member countries, research institutes, and think tanks. APT29 reportedly compromised the Democratic National Committee starting in the summer of 2015.[3][4][5][6] In April 2021, the US and UK governments attributed the SolarWinds Compromise to the SVR; public statements included citations to APT29, Cozy Bear, and The Dukes.[7][8] Industry reporting also referred to the actors involved in this campaign as UNC2452, NOBELIUM, StellarParticle, Dark Halo, and SolarStorm.[9][10][11][12][13][14]<br>APT29 has used the reg save command to save registry hives.[4]<\|im_end\|><br><\|im_start\|>assistant<br><think><br><br></think><br><br></code> |
+ | <code><\|im_start\|>system<br>Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<\|im_end\|><br><\|im_start\|>user<br><Instruct>: Given a web search query, retrieve relevant passages that answer the query<br><Query>: Why don't we use MAC address instead of IP address?<br><br>I can use the system function in PHP to get the MAC address of site visitors (probably most of you know). Why do we use IP addresss to check whether someone is stealing a cookie or not?<br>Does the system function have more overhead, or is it still insecure when we don't send any parameter to the function?<br>I know there are some situations in which users change their MAC address, but it happens less than IP address.<br>Could you shed some light on it?<br></code> | <code><Document>: The reason for that is very simple: You won't get the MAC address of your website visitor over the Internet, because they are lost when the packets are routed. You can only get the MAC addresses from your subnet (through, for example, ARP).<\|im_end\|><br><\|im_start\|>assistant<br><think><br><br></think><br><br></code> |
+ | <code><\|im_start\|>system<br>Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<\|im_end\|><br><\|im_start\|>user<br><Instruct>: Given a web search query, retrieve relevant passages that answer the query<br><Query>: Native API<br>Adversaries may interact with the native OS application programming interface (API) to execute behaviors. Native APIs provide a controlled means of calling low-level OS services within the kernel, such as those involving hardware/devices, memory, and processes.[1][2] These native APIs are leveraged by the OS during system boot (when other system components are not yet initialized) as well as carrying out tasks and requests during routine operations. Adversaries may abuse these OS API functions as a means of executing behaviors. Similar to Command and Scripting Interpreter, the native API and its hierarchy of interfaces provide mechanisms to interact with and utilize various components of a vict...</code> | <code><Document>: Kapeka<br>Kapeka is a backdoor written in C++ used against victims in Eastern Europe since at least mid-2022. Kapeka has technical overlaps with Exaramel for Windows and Prestige malware variants, both of which are linked to Sandworm Team. Kapeka may have been used in advance of Prestige deployment in late 2022.[1][2]<br>Kapeka utilizes WinAPI calls to gather victim system information.[124]<\|im_end\|><br><\|im_start\|>assistant<br><think><br><br></think><br><br></code> |
+ * Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
+ ```json
+ {
+     "scale": 10.0,
+     "num_negatives": 4,
+     "activation_fn": "torch.nn.modules.activation.Sigmoid",
+     "mini_batch_size": 32
+ }
+ ```
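As a rough illustration of what these parameters mean (not the library's actual implementation, which runs batched in PyTorch with cached gradients), MultipleNegativesRankingLoss is a cross-entropy over one positive and its sampled negatives, with the `scale` factor applied to the model's sigmoid-activated scores. The function name and inputs below are ours, chosen for the sketch:

```python
import math

def mnrl_loss(pos_score, neg_scores, scale=10.0):
    """Minimal sketch of (Cached)MultipleNegativesRankingLoss for one query.

    Assumes scores are the cross-encoder's sigmoid outputs in [0, 1];
    `scale=10.0` and len(neg_scores)=4 mirror the settings above.
    """
    # Scale the positive and negative scores into softmax logits.
    logits = [scale * pos_score] + [scale * s for s in neg_scores]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability that the positive outranks the negatives.
    return log_sum_exp - logits[0]

# With num_negatives = 4, each positive competes with 4 sampled negatives;
# a confidently-separated pair yields a much smaller loss than an unsure one.
confident = mnrl_loss(0.9, [0.1, 0.1, 0.1, 0.1])
unsure = mnrl_loss(0.5, [0.4, 0.4, 0.4, 0.4])
```

The `mini_batch_size: 32` setting only controls how the cached variant chunks the forward/backward passes to save memory; it does not change the loss value.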
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `num_train_epochs`: 1
+ - `warmup_steps`: 200
+ - `optim`: adamw_8bit
+ - `gradient_accumulation_steps`: 2
+ - `bf16`: True
+ - `gradient_checkpointing`: True
+ - `eval_strategy`: steps
+ - `dataloader_num_workers`: 4
+ - `dataloader_pin_memory`: False
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `per_device_train_batch_size`: 8
+ - `num_train_epochs`: 1
+ - `max_steps`: -1
+ - `learning_rate`: 5e-05
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: None
+ - `warmup_steps`: 200
+ - `optim`: adamw_8bit
+ - `optim_args`: None
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `optim_target_modules`: None
+ - `gradient_accumulation_steps`: 2
+ - `average_tokens_across_devices`: True
+ - `max_grad_norm`: 1.0
+ - `label_smoothing_factor`: 0.0
+ - `bf16`: True
+ - `fp16`: False
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `gradient_checkpointing`: True
+ - `gradient_checkpointing_kwargs`: None
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `use_liger_kernel`: False
+ - `liger_kernel_config`: None
+ - `use_cache`: False
+ - `neftune_noise_alpha`: None
+ - `torch_empty_cache_steps`: None
+ - `auto_find_batch_size`: False
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `include_num_input_tokens_seen`: no
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `disable_tqdm`: False
+ - `project`: huggingface
+ - `trackio_space_id`: trackio
+ - `eval_strategy`: steps
+ - `per_device_eval_batch_size`: 8
+ - `prediction_loss_only`: True
+ - `eval_on_start`: False
+ - `eval_do_concat_batches`: True
+ - `eval_use_gather_object`: False
+ - `eval_accumulation_steps`: None
+ - `include_for_metrics`: []
+ - `batch_eval_metrics`: False
+ - `save_only_model`: False
+ - `save_on_each_node`: False
+ - `enable_jit_checkpoint`: False
+ - `push_to_hub`: False
+ - `hub_private_repo`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_always_push`: False
+ - `hub_revision`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `full_determinism`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `use_cpu`: False
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `parallelism_config`: None
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 4
+ - `dataloader_pin_memory`: False
+ - `dataloader_persistent_workers`: False
+ - `dataloader_prefetch_factor`: None
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `train_sampling_strategy`: random
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `ddp_backend`: None
+ - `ddp_timeout`: 1800
+ - `fsdp`: []
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `deepspeed`: None
+ - `debug`: []
+ - `skip_memory_metrics`: True
+ - `do_predict`: False
+ - `resume_from_checkpoint`: None
+ - `warmup_ratio`: None
+ - `local_rank`: -1
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}
+
+ </details>
+
+ ### Training Logs
+ | Epoch | Step | Training Loss | Validation Loss |
+ |:------:|:----:|:-------------:|:---------------:|
+ | 0.2594 | 800 | 0.0868 | 0.1018 |
+ | 0.3890 | 1200 | 0.0697 | 0.0742 |
+ | 0.5187 | 1600 | 0.0502 | 0.0609 |
+ | 0.6484 | 2000 | 0.0468 | 0.0405 |
+ | 0.7781 | 2400 | 0.0361 | 0.0316 |
+ | 0.9078 | 2800 | 0.0257 | 0.0303 |
+
+
+ ### Framework Versions
+ - Python: 3.12.12
+ - Sentence Transformers: 5.2.3
+ - Transformers: 5.2.0
+ - PyTorch: 2.10.0+cu128
+ - Accelerate: 1.12.0
+ - Datasets: 4.5.0
+ - Tokenizers: 0.22.2
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,73 @@
+ {
+   "architectures": [
+     "Qwen3ForSequenceClassification"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": null,
+   "dtype": "float32",
+   "eos_token_id": 151645,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_types": [
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 40960,
+   "max_window_layers": 28,
+   "model_type": "qwen3",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 28,
+   "num_key_value_heads": 8,
+   "pad_token_id": 151643,
+   "rms_norm_eps": 1e-06,
+   "rope_parameters": {
+     "rope_theta": 1000000,
+     "rope_type": "default"
+   },
+   "sentence_transformers": {
+     "activation_fn": "torch.nn.modules.activation.Sigmoid",
+     "version": "5.2.3"
+   },
+   "sliding_window": null,
+   "tie_word_embeddings": true,
+   "transformers_version": "5.2.0",
+   "use_cache": false,
+   "use_sliding_window": false,
+   "vocab_size": 151669
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:02048200d717a682d4b10db3b26c2ea976159a21622d4f9978d72a7a5c23931b
+ size 2383145520
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6f875f0221c8864f51fc5e5c0a65eeb68415a42431247ffbd2b9018dd46b4e2d
+ size 11422917
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "add_prefix_space": false,
+   "backend": "tokenizers",
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "is_local": true,
+   "model_max_length": 4096,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }