---
language:
- en
- id
tags:
- bert
- text-classification
- token-classification
- cybersecurity
- fill-mask
- named-entity-recognition
- transformers
- tensorflow
- pytorch
- masked-language-modeling
base_model: boltuix/bert-micro
library_name: transformers
pipeline_tag: fill-mask
---
# bert-micro-cybersecurity


## 1. Model Details
**Model description**
"bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs. benign content).
- Model type: fine-tuned lightweight BERT variant
- Languages: English & Indonesian
- Finetuned from: `boltuix/bert-micro`
- Status: **Early version**, trained on **54.84%** of the planned data.


**Model sources**
- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
- Data: cybersecurity data


## 2. Uses
### Direct use
You can use this model to classify cybersecurity-related text, for example to judge whether a given message, report, or log entry indicates malicious intent, abnormal behaviour, or threat presence.
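A minimal sketch with the `transformers` pipeline API; the Hub id below is a placeholder, and the label names depend on the fine-tuned classification head:

```python
from transformers import pipeline

# Hypothetical Hub id; substitute the actual repository path of this checkpoint.
classifier = pipeline("text-classification", model="your-org/bert-micro-cybersecurity")

result = classifier(
    "Multiple failed SSH logins followed by a successful root login from a new IP."
)
print(result)  # e.g. [{'label': 'malicious', 'score': 0.98}] (labels depend on the head)
```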
### Downstream use
- Embedding extraction for clustering (see the sketch after this list).
- Named Entity Recognition on logs or other security data.
- Classification of security data.
- Anomaly detection in security logs.
- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert generation, a SOC dashboard).
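For embedding extraction, a minimal sketch that mean-pools the last hidden state (the Hub id is again a placeholder):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "your-org/bert-micro-cybersecurity"  # hypothetical Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden state over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["Port scan detected from 10.0.0.5",
                 "User requested a password reset"])
print(vectors.shape)  # (2, hidden_size)
```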
### Out-of-scope use
- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested on non-cybersecurity domains or out-of-distribution data.


### Downstream use cases in development
- NER on security logs, botnet data, and JSON data (see the sketch after this list).
- Early classification of SIEM alerts and events.
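As an illustration of the planned NER use, a sketch with the token-classification pipeline; the checkpoint id and entity labels are assumptions, since no NER head has been published yet:

```python
from transformers import pipeline

# Hypothetical checkpoint with a token-classification head for security entities.
ner = pipeline(
    "token-classification",
    model="your-org/bert-micro-cybersecurity-ner",
    aggregation_strategy="simple",  # merge subword pieces into whole entities
)

print(ner("Blocked outbound traffic from 192.168.1.24 to C2 domain evil.example.com"))
```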
## 3. Bias, Risks, and Limitations
Because the model was trained on only a subset (54.84%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control systems, IoT logs, other languages).
- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data, e.g., over-representation of certain threat types or vendor- and tooling-specific vocabulary.
- **Should not be used as the sole authority for incident decisions; only as an aid to human analysts.**
## 4. Training Details


### Text Processing & Chunking
Since cybersecurity data often contains lengthy alert descriptions and execution logs that exceed BERT's 512-token limit, we implement an overlapping chunking strategy (see the sketch after this list):
- **Max sequence length**: 512 tokens
- **Stride**: 32 tokens (overlap between consecutive chunks)
- **Chunking behavior**: Long texts are split into overlapping segments. For example, with max_length=512 and stride=32, a 1000-token document becomes ~3 chunks whose 32-token overlaps preserve context across chunk boundaries.
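A minimal sketch of this chunking using a fast tokenizer's built-in overlapping-window support (`return_overflowing_tokens` with `stride`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-micro")

long_log = " ".join(f"event_{i}" for i in range(2000))  # stand-in for a long alert log

encoding = tokenizer(
    long_log,
    max_length=512,
    stride=32,                       # 32-token overlap between consecutive chunks
    truncation=True,
    return_overflowing_tokens=True,  # emit every window, not just the first
)
print(len(encoding["input_ids"]))                   # number of chunks
print([len(ids) for ids in encoding["input_ids"]])  # each chunk is <= 512 tokens
```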
### Training Hyperparameters
The run used the following settings (a `TrainingArguments` sketch follows the list):
- **Base model**: `boltuix/bert-micro`
- **Training epochs**: 3
- **Learning rate**: 5e-05
- **Batch size**: 16
- **Weight decay**: 0.01
- **Warmup ratio**: 0.06
- **Gradient accumulation steps**: 1
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup
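A sketch of how these settings map onto `transformers.TrainingArguments`; the output directory is a placeholder and the dataset/`Trainer` wiring is omitted:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-micro-cybersecurity",  # placeholder path
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.06,
    gradient_accumulation_steps=1,
    optim="adamw_torch",          # AdamW optimizer
    lr_scheduler_type="linear",   # linear decay after warmup
)
```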
### Training Data
- **Total database rows**: 240,595
- **Rows processed (cumulative)**: 131,944 (54.84%)
- **Training date**: 2025-10-29 13:09:17

### Post-Training Metrics
- **Final training loss**:
- **Rows→Samples ratio**: