---
language:
- en
- id
tags:
- bert
- text-classification
- token-classification
- cybersecurity
- fill-mask
- named-entity-recognition
- transformers
- tensorflow
- pytorch
- masked-language-modeling
base_model: boltuix/bert-micro
library_name: transformers
pipeline_tag: fill-mask
---
# bert-micro-cybersecurity
## 1. Model Details
**Model description**
"bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs benign content).
- Model type: fine-tuned lightweight BERT variant
- Languages: English & Indonesian
- Fine-tuned from: `boltuix/bert-micro`
- Status: **Early version** — trained on **3.10%** of planned data.
**Model sources**
- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
- Data: Cybersecurity Data
## 2. Uses
### Direct use
You can use this model to classify cybersecurity-related text — for example, whether a given message, report, or log entry indicates malicious intent, abnormal behavior, or the presence of a threat.
### Downstream use
- Embedding extraction for clustering or anomaly detection in security logs (see the sketch after this list).
- As part of a pipeline for phishing detection, malicious-email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert generation, a SOC dashboard).
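
For embedding extraction, a minimal sketch is below. It assumes the checkpoint loads with `AutoModel` (the base encoder without a task head); the mean-pooling strategy is our illustrative choice, not something this model card specifies:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the base encoder without a classification head; the pooling strategy
# below is an assumption, not documented behavior of this checkpoint.
tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
model = AutoModel.from_pretrained("codechrl/bert-micro-cybersecurity")
model.eval()

logs = [
    "Failed SSH login from 10.0.0.5 (5th attempt in 60s)",
    "Scheduled backup completed successfully",
]
inputs = tokenizer(logs, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per log line.
mask = inputs["attention_mask"].unsqueeze(-1)       # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                             # (2, hidden_dim)
```

The resulting vectors can be fed to any off-the-shelf clustering or anomaly-detection method.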
### Out-of-scope use
- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested for non-cybersecurity domains or out-of-distribution data.
## 3. Bias, Risks, and Limitations
Because the model has so far been trained on a small subset (3.10%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control systems, IoT logs, text in other languages).
- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data — e.g., over-representation of certain threat types, vendor or tooling-specific vocabulary.
- The model should not be used as the sole authority for incident decisions; it is intended only as an aid to human analysts.
## 4. How to Get Started with the Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
model = AutoModelForSequenceClassification.from_pretrained("codechrl/bert-micro-cybersecurity")
model.eval()

# Tokenize a single log line; truncation guards against over-length inputs.
inputs = tokenizer(
    "The server logged an unusual outbound connection to 123.123.123.123",
    return_tensors="pt", truncation=True, padding=True,
)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(dim=-1).item()
```
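
If the checkpoint's config includes label names, `model.config.id2label[predicted_class]` maps the predicted index back to a human-readable label.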
## 5. Training Details
### Text Processing & Chunking
Since cybersecurity data often contains lengthy alert descriptions and execution logs that exceed BERT's 512-token limit, we implement an overlapping chunking strategy:
- **Max sequence length**: 512 tokens
- **Stride**: 32 tokens (overlap between consecutive chunks)
- **Chunking behavior**: Long texts are split into overlapping segments. For example, with `max_length=512` and `stride=32`, a 1,000-token document becomes ~3 chunks, each sharing 32 tokens with its neighbor, preserving context across chunk boundaries (see the sketch below).
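
A minimal sketch of this chunking via the `transformers` fast-tokenizer API, where `stride` is the token overlap between consecutive chunks. The exact training pipeline is not published, so treat this as illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")

long_log = " ".join(["event"] * 1000)  # stand-in for a long alert or log entry

# return_overflowing_tokens=True yields one encoding per chunk; stride=32
# matches the 32-token overlap described above.
encoded = tokenizer(
    long_log,
    max_length=512,
    truncation=True,
    stride=32,
    return_overflowing_tokens=True,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # number of chunks produced for this text
```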
### Training Hyperparameters
- **Base model**: `boltuix/bert-micro`
- **Training epochs**: 3
- **Learning rate**: 5e-05
- **Batch size**: 16
- **Weight decay**: 0.01
- **Warmup ratio**: 0.06
- **Gradient accumulation steps**: 1
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup
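
These settings correspond one-to-one to Hugging Face `TrainingArguments` fields. A minimal sketch of how they might be expressed, assuming the standard `Trainer` API was used (the actual training script is not published, and `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters listed above; output_dir is a placeholder and
# the real training script may differ.
args = TrainingArguments(
    output_dir="bert-micro-cybersecurity",
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.06,
    gradient_accumulation_steps=1,
    optim="adamw_torch",          # AdamW
    lr_scheduler_type="linear",   # linear decay with warmup
)
```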
### Training Data
- **Total database rows**: 238,453
- **Rows processed (cumulative)**: 7,394 (3.10%)
- **Rows in this session**: 2
- **Training samples (after chunking)**: 36
- **Training date**: 2025-10-23 03:13:01
### Post-Training Metrics
- **Final training loss**: 0.0000
- **Rows→Samples ratio**: 18.00× (36 training samples ÷ 2 rows; average chunks per row)