---
language:
- en
- id
tags:
- bert
- text-classification
- token-classification
- cybersecurity
- fill-mask
- named-entity-recognition
- transformers
- tensorflow
- pytorch
- masked-language-modeling
base_model: boltuix/bert-micro
library_name: transformers
pipeline_tag: fill-mask
---
|
|
# bert-micro-cybersecurity
|
|
|
|
|
## 1. Model Details

**Model description**

`bert-micro-cybersecurity` is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs. benign content).

- Model type: fine-tuned lightweight BERT variant
- Languages: English & Indonesian
- Fine-tuned from: `boltuix/bert-micro`
- Status: **Early version** — trained on **66.14%** of planned data.

**Model sources**

- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
- Data: Cybersecurity Data
|
|
|
|
|
## 2. Uses

### Direct use

You can use this model to classify cybersecurity-related text — for example, whether a given message, report, or log entry indicates malicious intent, abnormal behavior, or the presence of a threat.
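
Since the card's `pipeline_tag` is `fill-mask`, the checkpoint can be exercised directly through the `transformers` fill-mask pipeline; classification as described above assumes a task-specific head on top. A minimal sketch (the repo id below is a placeholder for the actual checkpoint path):

```python
from transformers import pipeline

# Placeholder repo id; substitute the actual checkpoint path.
fill_mask = pipeline("fill-mask", model="your-org/bert-micro-cybersecurity")

# Domain-flavored probe: the model fills in the masked token.
for pred in fill_mask("The firewall blocked a [MASK] connection from an unknown host."):
    print(pred["token_str"], round(pred["score"], 3))
```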
|
|
### Downstream use

- Embedding extraction for clustering (see the sketch after this list).
- Named Entity Recognition on log or security data.
- Classification of security data.
- Anomaly detection in security logs.
- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert generation, a SOC dashboard).
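
A minimal sketch of the first item, embedding extraction (the repo id is a placeholder, and mean pooling is one reasonable choice, not necessarily what any particular downstream system uses):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder repo id; substitute the actual checkpoint path.
model_id = "your-org/bert-micro-cybersecurity"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

logs = [
    "Multiple failed SSH logins followed by a successful root login.",
    "Scheduled nightly backup completed without errors.",
]
inputs = tokenizer(logs, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings (ignoring padding) to get one vector per entry,
# usable as features for clustering or anomaly detection.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (num_entries, hidden_size)
```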
|
|
### Out-of-scope use

- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested for non-cybersecurity domains or out-of-distribution data.
|
|
|
|
|
### Downstream use cases in development using this model

- NER on security logs, botnet data, and JSON data.
- Early classification of SIEM alerts & events.
|
|
|
|
|
## 3. Bias, Risks, and Limitations

Because the model has so far been trained on only a subset (66.14%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control systems, IoT logs, text in other languages).

- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data — e.g., over-representation of certain threat types, or vendor- and tooling-specific vocabulary.
- **Should not be used as the sole authority for incident decisions; only as an aid to human analysts.**
|
|
|
|
|
## 4. Training Details

### Text Processing & Chunking

Since cybersecurity data often contains lengthy alert descriptions and execution logs that exceed BERT's 512-token limit, we implement an overlapping chunking strategy:

- **Max sequence length**: 512 tokens
- **Stride**: 32 tokens (overlap between consecutive chunks)
|
|
- **Chunking behavior**: Long texts are split into overlapping segments. For example, with `max_length=512` and `stride=32`, a 1000-token document becomes ~3 chunks, each sharing 32 tokens with its neighbor, preserving context across boundaries.
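
A minimal sketch of this chunking with the Hugging Face tokenizer API (using the base model's tokenizer; the repeated string stands in for a real log):

```python
from transformers import AutoTokenizer

# Tokenizer of the base model the checkpoint was fine-tuned from.
tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-micro")

long_text = "failed login from 10.0.0.5 " * 200  # stand-in for a lengthy execution log

# Split into overlapping 512-token windows; `stride` is the number of tokens
# shared between consecutive chunks (32 here, matching the training setup).
encoded = tokenizer(
    long_text,
    max_length=512,
    stride=32,
    truncation=True,
    padding=True,
    return_overflowing_tokens=True,
)

print(len(encoded["input_ids"]))     # number of chunks
print(len(encoded["input_ids"][0]))  # 512 tokens per chunk
```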
|
|
|
|
|
### Training Hyperparameters

- **Base model**: `boltuix/bert-micro`
- **Training epochs**: 3
- **Learning rate**: 5e-05
- **Batch size**: 16
- **Weight decay**: 0.01
- **Warmup ratio**: 0.06
- **Gradient accumulation steps**: 1
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup
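
For reference, a sketch of how the hyperparameters above map onto `transformers.TrainingArguments`, assuming the standard `Trainer` API (the actual training script is not published, and `output_dir` is hypothetical):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-micro-cybersecurity",  # hypothetical output path
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.06,
    gradient_accumulation_steps=1,
    optim="adamw_torch",         # AdamW optimizer
    lr_scheduler_type="linear",  # linear decay after warmup
)
```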
|
|
|
|
|
### Training Data

- **Total database rows**: 246,838
- **Rows processed (cumulative)**: 163,258 (66.14%)
- **Training date**: 2025-12-30 04:18:17
|
|
|
|
|
### Post-Training Metrics

- **Final training loss**:
- **Rows→Samples ratio**:
|
|
|