Training update: 161,623/163,996 rows (98.55%) | +78 new @ 2025-11-12 02:06:02


	---
	language:
	- en
	- id
	tags:
	- bert
	- text-classification
	- token-classification
	- cybersecurity
	- fill-mask
	- named-entity-recognition
	- transformers
	- tensorflow
	- pytorch
	- masked-language-modeling
	base_model: boltuix/bert-micro
	library_name: transformers
	pipeline_tag: fill-mask
	---
	# bert-micro-cybersecurity

	## 1. Model Details
	### Model description
	"bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs benign content).
	- Model type: fine-tuned lightweight BERT variant
	- Languages: English and Indonesian
	- Finetuned from: `boltuix/bert-micro`
	- Status: Early version — trained on 98.55% of planned data.

	### Model sources
	- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
	- Data: Cybersecurity Data

	## 2. Uses
	### Direct use
	You can use this model to classify cybersecurity-related text — for example, whether a given message, report or log entry indicates malicious intent, abnormal behaviour, or threat presence.
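The metadata of this card lists `fill-mask` as the pipeline tag, so a minimal sketch of querying the model through the `transformers` pipeline API might look like the following. The repository id passed as `model_id` is a placeholder, since the full id is not stated in this card, and `mask_word` is a hypothetical helper written only for this example.

```python
def mask_word(text, word, mask_token="[MASK]"):
    """Replace the first occurrence of `word` with the mask token (hypothetical helper)."""
    if word not in text:
        raise ValueError(f"{word!r} not found in text")
    return text.replace(word, mask_token, 1)


def top_fill_mask_predictions(model_id, masked_text, top_k=5):
    """Return (token, score) pairs predicted for the masked position.

    `model_id` is a placeholder for the Hub repository id of this model.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    results = fill_mask(masked_text, top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]


masked = mask_word("The attacker sent a phishing email to the finance team.", "phishing")
print(masked)
# Example call (needs network access and the real repository id):
# top_fill_mask_predictions("<namespace>/bert-micro-cybersecurity", masked)
```

Using the model as a classifier, as described above, would additionally need a classification head trained on labeled data; this sketch only exercises the masked-language-modeling interface.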
	### Downstream use
	- Embedding extraction for clustering.
	- Named Entity Recognition on log or security data.
	- Classification of security data.
	- Anomaly detection in security logs.
	- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
	- As a feature extractor feeding a downstream system (e.g., alert-generation, SOC dashboard).
	### Out-of-scope use
	- Not meant for high-stakes automated blocking decisions without human review.
	- Not optimized for languages other than English and Indonesian.
	- Not tested for non-cybersecurity domains or out-of-distribution data.

	### Downstream use cases in development
	- NER on security logs, botnet data, and JSON data.
	- Early classification of SIEM alerts and events.

	## 3. Bias, Risks, and Limitations
	Because the model has so far been trained on 98.55% of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control, IoT logs, other languages).
	- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data — e.g., over-representation of certain threat types, vendor or tooling-specific vocabulary.
	- Should not be used as sole authority for incident decisions; only as an aid to human analysts.

	## 4. Training Details

	### Text Processing & Chunking
	Cybersecurity data often contains lengthy alert descriptions and execution logs that exceed BERT's 512-token limit, so we use an overlapping chunking strategy:
	- Max sequence length: 512 tokens
	- Stride: 32 tokens (overlap between consecutive chunks)
	- Chunking behavior: Long texts are split into overlapping segments so that context is preserved across chunk boundaries. For example, with max_length=512 and stride=128 (an illustrative value; the configuration above uses a stride of 32), a 1000-token document becomes about 3 chunks, each sharing 128 tokens with its neighbor.
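The chunking behavior above can be sketched in plain Python. This is a simplified illustration, not the training code: `chunk_token_ids` is a hypothetical helper, it treats `stride` as the number of tokens shared by consecutive chunks (the same meaning the `stride` argument has in Hugging Face tokenizers), and it ignores the special tokens (`[CLS]`, `[SEP]`) that a real tokenizer counts toward the 512-token limit.

```python
def chunk_token_ids(token_ids, max_length=512, stride=32):
    """Split a token sequence into windows of at most `max_length` tokens.

    Consecutive windows overlap by `stride` tokens, so each new window
    starts `max_length - stride` tokens after the previous one.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if not token_ids:
        return []

    step = max_length - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        # Stop once the current window reaches the end of the sequence.
        if start + max_length >= len(token_ids):
            break
        start += step
    return chunks


document = list(range(1000))  # stand-in for 1000 token ids

for overlap in (32, 128):
    chunks = chunk_token_ids(document, max_length=512, stride=overlap)
    print(overlap, [len(c) for c in chunks])
```

Under these assumptions a 1000-token document yields three chunks for either stride; only the length of the last chunk and the size of the overlap differ.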

	### Training Hyperparameters
	- Base model: `boltuix/bert-micro`
	- Training epochs: 3
	- Learning rate: 5e-05
	- Batch size: 16
	- Weight decay: 0.01
	- Warmup ratio: 0.06
	- Gradient accumulation steps: 1
	- Optimizer: AdamW
	- LR scheduler: Linear with warmup
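To make the "linear with warmup" schedule concrete, here is a minimal sketch of how the learning rate would evolve with the values listed above (peak 5e-05, warmup ratio 0.06). The total number of optimizer steps is not stated in this card, so `TOTAL_STEPS` is a hypothetical value, and rounding the warmup length up is an assumption of this sketch.

```python
import math


def linear_warmup_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.06):
    """Learning rate at `step`: linear ramp to `base_lr`, then linear decay to zero."""
    # Assumption: the warmup length is the ratio times total steps, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1000  # hypothetical; depends on dataset size, batch size, and epochs

for step in (0, 30, 60, 530, 1000):
    print(step, linear_warmup_lr(step, TOTAL_STEPS))
```

With a batch size of 16 and one gradient accumulation step, each optimizer step in this picture corresponds to 16 training samples.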

	### Training Data
	- Total database rows: 163,996
	- Rows processed (cumulative): 161,623 (98.55%)
	- Training date: 2025-11-12 02:06:02
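The coverage figure follows directly from the two row counts above; a quick check, using only numbers from this card:

```python
TOTAL_ROWS = 163_996
PROCESSED_ROWS = 161_623

# Share of the planned data processed so far, as a percentage.
coverage_pct = round(100 * PROCESSED_ROWS / TOTAL_ROWS, 2)
# Rows that have not been processed yet.
remaining_rows = TOTAL_ROWS - PROCESSED_ROWS

print(f"coverage: {coverage_pct}%  remaining rows: {remaining_rows}")
```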

	### Post-Training Metrics
	- Final training loss:
	- Rows→Samples ratio: