Instructions to use ThienLe/Qwen3-SecEmbed with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use ThienLe/Qwen3-SecEmbed with sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ThienLe/Qwen3-SecEmbed")

sentences = [
    "How can I prevent forwarding a manipulated email?\n\nHow can I prevent someone from modifying the contents of an email they received and then forwarding it to others?  Some employees cheat managers by changing the content of emails and forwarding the modified email to them.  I need  a policy  that  prevents this backdoor.",
    "They're identical\nThey both implement the same algorithm, so it's not like one can be faster than the other. Use whichever tool is available on whichever platform you use.\nIn Windows one uses certUtil as\ncertUtil -hashfile <PATH_TO_FILE> <HASH_ALGORITHM>\nand, available hash algorithms are MD2 MD4 MD5 SHA1 SHA256 SHA384 SHA512. These are different hash algorithms with different output sizes and they provide different security/insecurity levels. One should not use MD2, MD4, MD5, or SHA-1 as long as they really know what they are doing.\nBe aware of encoding, even some of the online hashings are not directly compatible, as we can see in  StackOverflow some questions are about the interoperability of the sites and libraries.\nAnd never use online hashing for your secret/private files.",
    "Direct Network Flood Detection across IaaS, Linux, Windows, and macOS\nWindows\nHigh-volume packet generation by local processes (e.g., PowerShell, cmd, curl.exe) or network service processes resulting in excessive outbound traffic over short time window, correlated with abnormal resource usage or degraded host responsiveness.",
    "Email is unsafe -- deal with it.\nEmail can be made safe for an adequately defined value of \"safe\", through the use of signatures (S/MIME or OpenPGP). This is not as easy as it seems (I mean, it does not look easy, but in reality it is worse). The cornerstone of the system is that unsigned emails should be rejected automatically; human users should never see them at all, because if they read them, they will always believe them a little, regardless of how much you may have explained to them how insecure and unsafe plain emails are. Therefore, switching to signed emails is like a big jump into the unknown. In practice, it is essentially a way to break emails (or to induce users to switch to gmail...).\nWhat you can do is to educate and then to educate again:\n\nThe smooth education: explain to your users how untrustworthy email is as a medium. Show how easy it is to forge an email (e.g. with this answer). Try to prevent the \"wizardry effect\" which makes most human beings lose common sense as soon as a computer is involved (as Clarke was putting it, computers are beyond the \"magical horizon\" of most people -- solution is to make them understand how a computer works). As a bonus, this makes the users more resilient to phishing.\n\nThe less smooth education: let all the might of the Law fall on wannabe fraudsters. Have it known that the slightest phony game with email is a shooting offense; the guilty will be fired, jailed, shot and flogged (not necessarily in that order). The idea is to make faking emails not worth it. This works well: this is how the non-computer world deals with handwritten signatures, and it has done so for several centuries."
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

Notebooks
Google Colab
Kaggle

SentenceTransformer based on Qwen/Qwen3-Embedding-0.6B

This is a sentence-transformers model finetuned from Qwen/Qwen3-Embedding-0.6B on the json dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: Qwen/Qwen3-Embedding-0.6B
Maximum Sequence Length: 2048 tokens
Output Dimensionality: 1024 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- json

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Qwen3Model'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("ThienLe/Qwen3-SecEmbed")
# Run inference
queries = [
    "Scheduled Task\nAdversaries may abuse the Windows Task Scheduler to perform task scheduling for initial or recurring execution of malicious code. There are multiple ways to access the Task Scheduler in Windows. The schtasks utility can be run directly on the command line, or the Task Scheduler can be opened through the GUI within the Administrator Tools section of the Control Panel.[1] In some cases, adversaries have used a .NET wrapper for the Windows Task Scheduler, and alternatively, adversaries have used the Windows netapi32 library and Windows Management Instrumentation (WMI) to create a scheduled task. Adversaries may also utilize the Powershell Cmdlet Invoke-CimMethod, which leverages WMI class PS_ScheduledTask to create a scheduled task via an XML path.[2] An adversary may use Windows Task Scheduler to execute programs at system startup or on a scheduled basis for persistence. The Windows Task Scheduler can also be abused to conduct remote Execution as part of Lateral Movement and/or to run a process under the context of a specified account (such as SYSTEM). Similar to System Binary Proxy Execution, adversaries have also abused the Windows Task Scheduler to potentially mask one-time execution under signed/trusted system processes.[3] Adversaries may also create \"hidden\" scheduled tasks (i.e. Hide Artifacts) that may not be visible to defender tools and manual queries used to enumerate tasks. Specifically, an adversary may hide a task from schtasks /query and the Task Scheduler by deleting the associated Security Descriptor (SD) registry value (where deletion of this value must be completed using SYSTEM permissions).[4][5] Adversaries may also employ alternate methods to hide tasks, such as altering the metadata (e.g., Index value) within associated registry keys.[6]",
]
documents = [
    'BlackByte\nBlackByte is a ransomware threat actor operating since at least 2021. BlackByte is associated with several versions of ransomware also labeled BlackByte Ransomware. BlackByte ransomware operations initially used a common encryption key allowing for the development of a universal decryptor, but subsequent versions such as BlackByte 2.0 Ransomware use more robust encryption mechanisms. BlackByte is notable for operations targeting critical infrastructure entities among other targets across North America.[1][2][3][4][5]\nBlackByte created scheduled tasks for payload execution.[38][39]',
    'APT32\nAPT32 is a suspected Vietnam-based threat group that has been active since at least 2014. The group has targeted multiple private sector industries as well as foreign governments, dissidents, and journalists with a strong focus on Southeast Asian countries like Vietnam, the Philippines, Laos, and Cambodia. They have extensively used strategic web compromises to compromise victims.[1][2][3]\nAPT32 used the net view command to show all shares available, including the administrative shares such as C$ and ADMIN$.[5]',
    'RFC 4880, the standard for the format of PGP messages, says:\n\nThe high 16 bits (first two octets) of the hash are included in the\n\nSignature packet to provide a quick test to reject some invalid\nsignatures.\n\nHowever, you are thinking it wrong. Signatures are not encryption, and signatures are not encrypted. In fact, given the value of the public key (which is public) and the signature itself, one can recompute the complete hash value of the message which is signed (at least for RSA, which is technically known as a signature algorithm with recovery). The first 16 bits are just a helper so that software can avoid many modular exponentiations when it is looking for the "correct" public key among a set of candidates; they save a few milliseconds worth of computation, that\'s all.\n\nA generic note is that signatures can leak information on that which is signed; so if you sign and encrypt a confidential message, then you should, conceptually, either sign the encrypted message, or encrypt the signature along with the message contents as well. OpenPGP uses the latter method.\nWhen a message is just signed, not encrypted, then it makes no sense to hide the message hash, since the message itself, by definition, is transmitted as cleartext.',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 1024] [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.8511,  0.1452, -0.0968]])

Training Details

Training Dataset

json

Dataset: json
Size: 49,346 training samples
Columns: sentence1 and sentence2
Approximate statistics based on the first 1000 samples:
sentence1 sentence2
type string string
details
min: 27 tokens
mean: 220.82 tokens
max: 2048 tokens

min: 13 tokens
mean: 220.84 tokens
max: 1945 tokens

	sentence1	sentence2
type	string	string
details	min: 27 tokens mean: 220.82 tokens max: 2048 tokens	min: 13 tokens mean: 220.84 tokens max: 1945 tokens

Samples:

sentence1	sentence2
Is it possible to send data from an open-source program but make it impossible for a user with source code to do the same? If I want to store a global scoreboard for a game running locally on the user's computer and I want to make sure that all the requests coming to the server are really generated by the game and not spoofed by the users, is there anything that can be done to prevent cheating? The program is open-source, so no obfuscation can be used. The problem I have with coming up with a solution is that whatever I choose to implement in code must necessarily include all the components necessary for creating a spoofed program that can send any data user wants, especially any encryption keys and hashing algorithms used.	In a word, no. The best option is to move all the actual game logic to the server, and have the client be a thin client that just displays state and sends input. That wouldn't prevent various types of cheating (such as game automation tools) but it's the only way to comprehensively avoid the user sending fraudulent game results. It's a popular approach in multi-player, too, as it avoids telling the user anything they don't need to know. However, it's considerably more expensive in server hardware. The next option is to make it possible to validate high scores. One option would be to record, for every game, the RNG seed value followed by every input that the user provides combined with a timestamp of some sort (for turn-based games this is just an action count; for real-time games it would probably be a game engine frame/tick count). Transmitting all of that back to the server would be a lot, but - combined with the game version and level or whatever - allow the server to effectively re...
Encrypted/Encoded File Adversaries may encrypt or encode files to obfuscate strings, bytes, and other specific patterns to impede detection. Encrypting and/or encoding file content aims to conceal malicious artifacts within a file used in an intrusion. Many other techniques, such as Software Packing, Steganography, and Embedded Payloads, share this same broad objective. Encrypting and/or encoding files could lead to a lapse in detection of static signatures, only for this malicious content to be revealed (i.e., Deobfuscate/Decode Files or Information) at the time of execution/use. This type of file obfuscation can be applied to many file artifacts present on victim hosts, such as malware log/configuration and payload files.[1] Files can be encrypted with a hardcoded or user-supplied key, as well as otherwise obfuscated using standard encoding schemes such as Base64. The entire content of a file may be obfuscated, or just specific functions or values (such as C2 addresses). Encryption a...	`Dark Caracal Dark Caracal is threat group that has been attributed to the Lebanese General Directorate of General Security (GDGS) and has operated since at least 2012. [1] Dark Caracal has obfuscated strings in Bandook by base64 encoding, and then encrypting them.[58]`
OS Credential Dumping Adversaries may attempt to dump credentials to obtain account login and credential material, normally in the form of a hash or a clear text password. Credentials can be obtained from OS caches, memory, or structures.[1] Credentials can then be used to perform Lateral Movement and access restricted information. Several of the tools mentioned in associated sub-techniques may be used by both adversaries and professional security testers. Additional custom tools likely exist as well.	Ember Bear Ember Bear is a Russian state-sponsored cyber espionage group that has been active since at least 2020, linked to Russia's General Staff Main Intelligence Directorate (GRU) 161st Specialist Training Center (Unit 29155).[1] Ember Bear has primarily focused operations against Ukrainian government and telecommunication entities, but has also operated against critical infrastructure entities in Europe and the Americas.[2] Ember Bear conducted the WhisperGate destructive wiper attacks against Ukraine in early 2022.[3][4][1] There is some confusion as to whether Ember Bear overlaps with another Russian-linked entity referred to as Saint Bear. At present available evidence strongly suggests these are distinct activities with different behavioral profiles.[2][5] Ember Bear gathers credential material from target systems, such as SSH keys, to facilitate access to victim environments.[12]

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "gather_across_devices": false
}

Evaluation Dataset

json

Dataset: json
Size: 12,337 evaluation samples
Columns: sentence1 and sentence2
Approximate statistics based on the first 1000 samples:
sentence1 sentence2
type string string
details
min: 13 tokens
mean: 222.63 tokens
max: 2048 tokens

min: 13 tokens
mean: 226.13 tokens
max: 2048 tokens

	sentence1	sentence2
type	string	string
details	min: 13 tokens mean: 222.63 tokens max: 2048 tokens	min: 13 tokens mean: 226.13 tokens max: 2048 tokens

Samples:

sentence1	sentence2
Security Account Manager Adversaries may attempt to extract credential material from the Security Account Manager (SAM) database either through in-memory techniques or through the Windows Registry where the SAM database is stored. The SAM is a database file that contains local accounts for the host, typically those found with the net user command. Enumerating the SAM database requires SYSTEM level access. A number of tools can be used to retrieve the SAM file through in-memory techniques: Alternatively, the SAM can be extracted from the Registry with Reg: Creddump7 can then be used to process the SAM database locally to retrieve hashes.[1] Notes:	APT29 APT29 is threat group that has been attributed to Russia's Foreign Intelligence Service (SVR).[1][2] They have operated since at least 2008, often targeting government networks in Europe and NATO member countries, research institutes, and think tanks. APT29 reportedly compromised the Democratic National Committee starting in the summer of 2015.[3][4][5][6] In April 2021, the US and UK governments attributed the SolarWinds Compromise to the SVR; public statements included citations to APT29, Cozy Bear, and The Dukes.[7][8] Industry reporting also referred to the actors involved in this campaign as UNC2452, NOBELIUM, StellarParticle, Dark Halo, and SolarStorm.[9][10][11][12][13][14] APT29 has used the reg save command to save registry hives.[4]
`Why don't we use MAC address instead of IP address? I can use the system function in PHP to get the MAC address of site visitors (probably most of you know). Why do we use IP addresss to check whether someone is stealing a cookie or not? Does the system function have more overhead, or is it still insecure when we don't send any parameter to the function? I know there are some situations in which users change their MAC address, but it happens less than IP address. Could you shed some light on it?`	`The reason for that is very simple: You won't get the MAC address of your website visitor over the Internet, because they are lost when the packets are routed. You can only get the MAC addresses from your subnet (through, for example, ARP).`
Native API Adversaries may interact with the native OS application programming interface (API) to execute behaviors. Native APIs provide a controlled means of calling low-level OS services within the kernel, such as those involving hardware/devices, memory, and processes.[1][2] These native APIs are leveraged by the OS during system boot (when other system components are not yet initialized) as well as carrying out tasks and requests during routine operations. Adversaries may abuse these OS API functions as a means of executing behaviors. Similar to Command and Scripting Interpreter, the native API and its hierarchy of interfaces provide mechanisms to interact with and utilize various components of a victimized system. Native API functions (such as NtCreateProcess) may be directed invoked via system calls / syscalls, but these features are also often exposed to user-mode applications via interfaces and libraries.[3][4][5] For example, functions such as the Windows API CreateProcess() o...	`Kapeka Kapeka is a backdoor written in C++ used against victims in Eastern Europe since at least mid-2022. Kapeka has technical overlaps with Exaramel for Windows and Prestige malware variants, both of which are linked to Sandworm Team. Kapeka may have been used in advance of Prestige deployment in late 2022.[1][2] Kapeka utilizes WinAPI calls to gather victim system information.[124]`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "gather_across_devices": false
}

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 16
num_train_epochs: 1
warmup_steps: 300
optim: adamw_8bit
gradient_accumulation_steps: 2
bf16: True
gradient_checkpointing: True
eval_strategy: steps
per_device_eval_batch_size: 16
dataloader_num_workers: 4
dataloader_pin_memory: False

All Hyperparameters

Click to expand

per_device_train_batch_size: 16
num_train_epochs: 1
max_steps: -1
learning_rate: 5e-05
lr_scheduler_type: linear
lr_scheduler_kwargs: None
warmup_steps: 300
optim: adamw_8bit
optim_args: None
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
optim_target_modules: None
gradient_accumulation_steps: 2
average_tokens_across_devices: True
max_grad_norm: 1.0
label_smoothing_factor: 0.0
bf16: True
fp16: False
bf16_full_eval: False
fp16_full_eval: False
tf32: None
gradient_checkpointing: True
gradient_checkpointing_kwargs: None
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
use_liger_kernel: False
liger_kernel_config: None
use_cache: False
neftune_noise_alpha: None
torch_empty_cache_steps: None
auto_find_batch_size: False
log_on_each_node: True
logging_nan_inf_filter: True
include_num_input_tokens_seen: no
log_level: passive
log_level_replica: warning
disable_tqdm: False
project: huggingface
trackio_space_id: trackio
eval_strategy: steps
per_device_eval_batch_size: 16
prediction_loss_only: True
eval_on_start: False
eval_do_concat_batches: True
eval_use_gather_object: False
eval_accumulation_steps: None
include_for_metrics: []
batch_eval_metrics: False
save_only_model: False
save_on_each_node: False
enable_jit_checkpoint: False
push_to_hub: False
hub_private_repo: None
hub_model_id: None
hub_strategy: every_save
hub_always_push: False
hub_revision: None
load_best_model_at_end: False
ignore_data_skip: False
restore_callback_states_from_checkpoint: False
full_determinism: False
seed: 42
data_seed: None
use_cpu: False
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
dataloader_drop_last: False
dataloader_num_workers: 4
dataloader_pin_memory: False
dataloader_persistent_workers: False
dataloader_prefetch_factor: None
remove_unused_columns: True
label_names: None
train_sampling_strategy: random
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
ddp_backend: None
ddp_timeout: 1800
fsdp: []
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
deepspeed: None
debug: []
skip_memory_metrics: True
do_predict: False
resume_from_checkpoint: None
warmup_ratio: None
local_rank: -1
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional
router_mapping: {}
learning_rate_mapping: {}

Training Logs

Epoch	Step	Training Loss	Validation Loss
0.0648	100	0.2915	-
0.1297	200	0.0912	0.0981
0.1945	300	0.1002	-
0.2593	400	0.1025	0.0940
0.3241	500	0.0851	-
0.3890	600	0.0707	0.0785
0.4538	700	0.0498	-
0.5186	800	0.0683	0.0609
0.5835	900	0.0536	-
0.6483	1000	0.0484	0.0562
0.7131	1100	0.0406	-
0.7780	1200	0.0468	0.0503
0.8428	1300	0.0392	-
0.9076	1400	0.0386	0.0486
0.9724	1500	0.0406	-

Framework Versions

Python: 3.12.12
Sentence Transformers: 5.2.3
Transformers: 5.2.0
PyTorch: 2.10.0+cu128
Accelerate: 1.12.0
Datasets: 4.5.0
Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}