Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Paper • 1908.10084 • Published • 13
How to use ThienLe/Qwen3-SecEmbed with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("ThienLe/Qwen3-SecEmbed")
sentences = [
"How can I prevent forwarding a manipulated email?\n\nHow can I prevent someone from modifying the contents of an email they received and then forwarding it to others? Some employees cheat managers by changing the content of emails and forwarding the modified email to them. I need a policy that prevents this backdoor.",
"They're identical\nThey both implement the same algorithm, so it's not like one can be faster than the other. Use whichever tool is available on whichever platform you use.\nIn Windows one uses certUtil as\ncertUtil -hashfile <PATH_TO_FILE> <HASH_ALGORITHM>\nand, available hash algorithms are MD2 MD4 MD5 SHA1 SHA256 SHA384 SHA512. These are different hash algorithms with different output sizes and they provide different security/insecurity levels. One should not use MD2, MD4, MD5, or SHA-1 as long as they really know what they are doing.\nBe aware of encoding, even some of the online hashings are not directly compatible, as we can see in StackOverflow some questions are about the interoperability of the sites and libraries.\nAnd never use online hashing for your secret/private files.",
"Direct Network Flood Detection across IaaS, Linux, Windows, and macOS\nWindows\nHigh-volume packet generation by local processes (e.g., PowerShell, cmd, curl.exe) or network service processes resulting in excessive outbound traffic over short time window, correlated with abnormal resource usage or degraded host responsiveness.",
"Email is unsafe -- deal with it.\nEmail can be made safe for an adequately defined value of \"safe\", through the use of signatures (S/MIME or OpenPGP). This is not as easy as it seems (I mean, it does not look easy, but in reality it is worse). The cornerstone of the system is that unsigned emails should be rejected automatically; human users should never see them at all, because if they read them, they will always believe them a little, regardless of how much you may have explained to them how insecure and unsafe plain emails are. Therefore, switching to signed emails is like a big jump into the unknown. In practice, it is essentially a way to break emails (or to induce users to switch to gmail...).\nWhat you can do is to educate and then to educate again:\n\nThe smooth education: explain to your users how untrustworthy email is as a medium. Show how easy it is to forge an email (e.g. with this answer). Try to prevent the \"wizardry effect\" which makes most human beings lose common sense as soon as a computer is involved (as Clarke was putting it, computers are beyond the \"magical horizon\" of most people -- solution is to make them understand how a computer works). As a bonus, this makes the users more resilient to phishing.\n\nThe less smooth education: let all the might of the Law fall on wannabe fraudsters. Have it known that the slightest phony game with email is a shooting offense; the guilty will be fired, jailed, shot and flogged (not necessarily in that order). The idea is to make faking emails not worth it. This works well: this is how the non-computer world deals with handwritten signatures, and it has done so for several centuries."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]This is a sentence-transformers model finetuned from Qwen/Qwen3-Embedding-0.6B on the json dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Qwen3Model'})
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
(2): Normalize()
)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("ThienLe/Qwen3-SecEmbed")
# Run inference
queries = [
"Scheduled Task\nAdversaries may abuse the Windows Task Scheduler to perform task scheduling for initial or recurring execution of malicious code. There are multiple ways to access the Task Scheduler in Windows. The schtasks utility can be run directly on the command line, or the Task Scheduler can be opened through the GUI within the Administrator Tools section of the Control Panel.[1] In some cases, adversaries have used a .NET wrapper for the Windows Task Scheduler, and alternatively, adversaries have used the Windows netapi32 library and Windows Management Instrumentation (WMI) to create a scheduled task. Adversaries may also utilize the Powershell Cmdlet Invoke-CimMethod, which leverages WMI class PS_ScheduledTask to create a scheduled task via an XML path.[2] An adversary may use Windows Task Scheduler to execute programs at system startup or on a scheduled basis for persistence. The Windows Task Scheduler can also be abused to conduct remote Execution as part of Lateral Movement and/or to run a process under the context of a specified account (such as SYSTEM). Similar to System Binary Proxy Execution, adversaries have also abused the Windows Task Scheduler to potentially mask one-time execution under signed/trusted system processes.[3] Adversaries may also create \"hidden\" scheduled tasks (i.e. Hide Artifacts) that may not be visible to defender tools and manual queries used to enumerate tasks. Specifically, an adversary may hide a task from schtasks /query and the Task Scheduler by deleting the associated Security Descriptor (SD) registry value (where deletion of this value must be completed using SYSTEM permissions).[4][5] Adversaries may also employ alternate methods to hide tasks, such as altering the metadata (e.g., Index value) within associated registry keys.[6]",
]
documents = [
'BlackByte\nBlackByte is a ransomware threat actor operating since at least 2021. BlackByte is associated with several versions of ransomware also labeled BlackByte Ransomware. BlackByte ransomware operations initially used a common encryption key allowing for the development of a universal decryptor, but subsequent versions such as BlackByte 2.0 Ransomware use more robust encryption mechanisms. BlackByte is notable for operations targeting critical infrastructure entities among other targets across North America.[1][2][3][4][5]\nBlackByte created scheduled tasks for payload execution.[38][39]',
'APT32\nAPT32 is a suspected Vietnam-based threat group that has been active since at least 2014. The group has targeted multiple private sector industries as well as foreign governments, dissidents, and journalists with a strong focus on Southeast Asian countries like Vietnam, the Philippines, Laos, and Cambodia. They have extensively used strategic web compromises to compromise victims.[1][2][3]\nAPT32 used the net view command to show all shares available, including the administrative shares such as C$ and ADMIN$.[5]',
'RFC 4880, the standard for the format of PGP messages, says:\n\nThe high 16 bits (first two octets) of the hash are included in the\n\nSignature packet to provide a quick test to reject some invalid\nsignatures.\n\nHowever, you are thinking it wrong. Signatures are not encryption, and signatures are not encrypted. In fact, given the value of the public key (which is public) and the signature itself, one can recompute the complete hash value of the message which is signed (at least for RSA, which is technically known as a signature algorithm with recovery). The first 16 bits are just a helper so that software can avoid many modular exponentiations when it is looking for the "correct" public key among a set of candidates; they save a few milliseconds worth of computation, that\'s all.\n\nA generic note is that signatures can leak information on that which is signed; so if you sign and encrypt a confidential message, then you should, conceptually, either sign the encrypted message, or encrypt the signature along with the message contents as well. OpenPGP uses the latter method.\nWhen a message is just signed, not encrypted, then it makes no sense to hide the message hash, since the message itself, by definition, is transmitted as cleartext.',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 1024] [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.8511, 0.1452, -0.0968]])
sentence1 and sentence2| sentence1 | sentence2 | |
|---|---|---|
| type | string | string |
| details |
|
|
| sentence1 | sentence2 |
|---|---|
Is it possible to send data from an open-source program but make it impossible for a user with source code to do the same? |
In a word, no. The best option is to move all the actual game logic to the server, and have the client be a thin client that just displays state and sends input. That wouldn't prevent various types of cheating (such as game automation tools) but it's the only way to comprehensively avoid the user sending fraudulent game results. It's a popular approach in multi-player, too, as it avoids telling the user anything they don't need to know. However, it's considerably more expensive in server hardware. |
Encrypted/Encoded File |
Dark Caracal |
OS Credential Dumping |
Ember Bear |
MultipleNegativesRankingLoss with these parameters:{
"scale": 20.0,
"similarity_fct": "cos_sim",
"gather_across_devices": false
}
sentence1 and sentence2| sentence1 | sentence2 | |
|---|---|---|
| type | string | string |
| details |
|
|
| sentence1 | sentence2 |
|---|---|
Security Account Manager |
APT29 |
Why don't we use MAC address instead of IP address? |
The reason for that is very simple: You won't get the MAC address of your website visitor over the Internet, because they are lost when the packets are routed. You can only get the MAC addresses from your subnet (through, for example, ARP). |
Native API |
Kapeka |
MultipleNegativesRankingLoss with these parameters:{
"scale": 20.0,
"similarity_fct": "cos_sim",
"gather_across_devices": false
}
per_device_train_batch_size: 16num_train_epochs: 1warmup_steps: 300optim: adamw_8bitgradient_accumulation_steps: 2bf16: Truegradient_checkpointing: Trueeval_strategy: stepsper_device_eval_batch_size: 16dataloader_num_workers: 4dataloader_pin_memory: Falseper_device_train_batch_size: 16num_train_epochs: 1max_steps: -1learning_rate: 5e-05lr_scheduler_type: linearlr_scheduler_kwargs: Nonewarmup_steps: 300optim: adamw_8bitoptim_args: Noneweight_decay: 0.0adam_beta1: 0.9adam_beta2: 0.999adam_epsilon: 1e-08optim_target_modules: Nonegradient_accumulation_steps: 2average_tokens_across_devices: Truemax_grad_norm: 1.0label_smoothing_factor: 0.0bf16: Truefp16: Falsebf16_full_eval: Falsefp16_full_eval: Falsetf32: Nonegradient_checkpointing: Truegradient_checkpointing_kwargs: Nonetorch_compile: Falsetorch_compile_backend: Nonetorch_compile_mode: Noneuse_liger_kernel: Falseliger_kernel_config: Noneuse_cache: Falseneftune_noise_alpha: Nonetorch_empty_cache_steps: Noneauto_find_batch_size: Falselog_on_each_node: Truelogging_nan_inf_filter: Trueinclude_num_input_tokens_seen: nolog_level: passivelog_level_replica: warningdisable_tqdm: Falseproject: huggingfacetrackio_space_id: trackioeval_strategy: stepsper_device_eval_batch_size: 16prediction_loss_only: Trueeval_on_start: Falseeval_do_concat_batches: Trueeval_use_gather_object: Falseeval_accumulation_steps: Noneinclude_for_metrics: []batch_eval_metrics: Falsesave_only_model: Falsesave_on_each_node: Falseenable_jit_checkpoint: Falsepush_to_hub: Falsehub_private_repo: Nonehub_model_id: Nonehub_strategy: every_savehub_always_push: Falsehub_revision: Noneload_best_model_at_end: Falseignore_data_skip: Falserestore_callback_states_from_checkpoint: Falsefull_determinism: Falseseed: 42data_seed: Noneuse_cpu: Falseaccelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}parallelism_config: Nonedataloader_drop_last: Falsedataloader_num_workers: 4dataloader_pin_memory: Falsedataloader_persistent_workers: Falsedataloader_prefetch_factor: Noneremove_unused_columns: Truelabel_names: Nonetrain_sampling_strategy: randomlength_column_name: lengthddp_find_unused_parameters: Noneddp_bucket_cap_mb: Noneddp_broadcast_buffers: Falseddp_backend: Noneddp_timeout: 1800fsdp: []fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}deepspeed: Nonedebug: []skip_memory_metrics: Truedo_predict: Falseresume_from_checkpoint: Nonewarmup_ratio: Nonelocal_rank: -1prompts: Nonebatch_sampler: batch_samplermulti_dataset_batch_sampler: proportionalrouter_mapping: {}learning_rate_mapping: {}| Epoch | Step | Training Loss | Validation Loss |
|---|---|---|---|
| 0.0648 | 100 | 0.2915 | - |
| 0.1297 | 200 | 0.0912 | 0.0981 |
| 0.1945 | 300 | 0.1002 | - |
| 0.2593 | 400 | 0.1025 | 0.0940 |
| 0.3241 | 500 | 0.0851 | - |
| 0.3890 | 600 | 0.0707 | 0.0785 |
| 0.4538 | 700 | 0.0498 | - |
| 0.5186 | 800 | 0.0683 | 0.0609 |
| 0.5835 | 900 | 0.0536 | - |
| 0.6483 | 1000 | 0.0484 | 0.0562 |
| 0.7131 | 1100 | 0.0406 | - |
| 0.7780 | 1200 | 0.0468 | 0.0503 |
| 0.8428 | 1300 | 0.0392 | - |
| 0.9076 | 1400 | 0.0386 | 0.0486 |
| 0.9724 | 1500 | 0.0406 | - |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}