---
language:
- en
- id
tags:
- bert
- text-classification
- token-classification
- cybersecurity
- fill-mask
- named-entity-recognition
base_model: boltuix/bert-micro
library_name: transformers
---

# bert-micro-cybersecurity

## 1. Model Details

**Model description**

"bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs. benign content).

- Model type: fine-tuned lightweight BERT variant
- Languages: English & Indonesian
- Fine-tuned from: `boltuix/bert-micro`
- Status: **Early version**, trained on **0.89%** of the planned data.

**Model sources**

- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
- Data: cybersecurity data

## 2. Uses

### Direct use

You can use this model to classify cybersecurity-related text, for example to determine whether a given message, report, or log entry indicates malicious intent, abnormal behaviour, or the presence of a threat.

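For quick experiments, the checkpoint can also be loaded through the `pipeline` API. A minimal sketch follows; the label names in the output are whatever this checkpoint's `id2label` mapping defines, which the card does not document:

```python
from transformers import pipeline

# Minimal sketch: high-level text-classification pipeline.
# The label strings in the output depend on this checkpoint's config;
# inspect classifier.model.config.id2label for the actual names.
classifier = pipeline("text-classification", model="codechrl/bert-micro-cybersecurity")

print(classifier("Multiple failed SSH logins followed by a successful root login"))
```
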
### Downstream use

- Embedding extraction for clustering or anomaly detection in security logs (see the sketch after this list).
- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert generation, a SOC dashboard).

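The embedding-extraction use can be sketched as below. This is illustrative only, not part of the published model card: it loads the bare encoder with `AutoModel`, takes the `[CLS]` vector for each log line, and clusters with scikit-learn's `KMeans` (an assumed choice of clustering algorithm):

```python
import torch
from sklearn.cluster import KMeans
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
encoder = AutoModel.from_pretrained("codechrl/bert-micro-cybersecurity")

# Hypothetical log lines standing in for a real security log feed.
logs = [
    "Outbound connection to unknown host on port 4444",
    "User password changed via self-service portal",
]

inputs = tokenizer(logs, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Use the [CLS] token vector as a fixed-size embedding per log line.
embeddings = hidden[:, 0, :].numpy()

clusters = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(clusters)
```
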
### Out-of-scope use

- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested on non-cybersecurity domains or out-of-distribution data.

## 3. Bias, Risks, and Limitations

Because the model is trained on a small subset (0.89%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control systems, IoT logs, foreign-language text).

- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data, e.g., over-representation of certain threat types or vendor- and tooling-specific vocabulary.
- Should not be used as the sole authority for incident decisions; only as an aid to human analysts.

## 4. How to Get Started with the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
model = AutoModelForSequenceClassification.from_pretrained("codechrl/bert-micro-cybersecurity")

# Tokenize a single log line; truncation guards against inputs longer
# than the model's 512-token maximum sequence length.
inputs = tokenizer(
    "The server logged an unusual outbound connection to 123.123.123.123",
    return_tensors="pt", truncation=True, padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class])
```

## 5. Training Details

- **Trained records**: 2,116 / 237,881 (0.89%)
- **Learning rate**: 5e-05
- **Epochs**: 3
- **Batch size**: 16
- **Max sequence length**: 512

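A hypothetical reconstruction of a fine-tuning setup using these hyperparameters is shown below. The actual training script and dataset are not published, so the label count (`num_labels=2`) and the `preprocess` helper are assumptions for illustration:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-micro")

# num_labels=2 is an assumption; the card does not list the label set.
model = AutoModelForSequenceClassification.from_pretrained(
    "boltuix/bert-micro", num_labels=2
)

args = TrainingArguments(
    output_dir="bert-micro-cybersecurity",
    learning_rate=5e-5,              # learning rate from the table above
    num_train_epochs=3,              # epochs
    per_device_train_batch_size=16,  # batch size
)

def preprocess(batch):
    # Apply the 512-token maximum sequence length at tokenization time.
    return tokenizer(batch["text"], truncation=True, max_length=512)
```
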