---
language:
- en
- id
tags:
- bert
- text-classification
- token-classification
- cybersecurity
- fill-mask
- named-entity-recognition
- transformers
- tensorflow
- pytorch
- masked-language-modeling
base_model: boltuix/bert-micro
library_name: transformers
pipeline_tag: fill-mask
---
# bert-micro-cybersecurity
## 1. Model Details
**Model description**
"bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs benign content).
- Model type: fine-tuned lightweight BERT variant
- Languages: English & Indonesian
- Fine-tuned from: `boltuix/bert-micro`
- Status: **Early version** — trained on **3.10%** of planned data.
**Model sources**
- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
- Data: Cybersecurity Data
## 2. Uses
### Direct use
You can use this model to classify cybersecurity-related text — for example, whether a given message, report, or log entry indicates malicious intent, abnormal behavior, or the presence of a threat.
### Downstream use
- Embedding extraction for clustering or anomaly detection in security logs (see the sketch after this list).
- As part of a pipeline for phishing detection, malicious-email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert generation, a SOC dashboard).
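
For embedding extraction, a minimal sketch is below. It assumes the checkpoint loads with `AutoModel` (the base encoder without a task head); the mean-pooling strategy is our illustrative choice, not something this model card specifies:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the base encoder without a classification head; the pooling strategy
# below is an assumption, not documented behavior of this checkpoint.
tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
model = AutoModel.from_pretrained("codechrl/bert-micro-cybersecurity")
model.eval()

logs = [
    "Failed SSH login from 10.0.0.5 (5th attempt in 60s)",
    "Scheduled backup completed successfully",
]
inputs = tokenizer(logs, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per log line.
mask = inputs["attention_mask"].unsqueeze(-1)       # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                             # (2, hidden_dim)
```

The resulting vectors can be fed to any off-the-shelf clustering or anomaly-detection method.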
### Out-of-scope use
- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested for non-cybersecurity domains or out-of-distribution data.
## 3. Bias, Risks, and Limitations
Because the model has so far been trained on a small subset (3.10%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control systems, IoT logs, text in other languages).
- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data — e.g., over-representation of certain threat types, vendor or tooling-specific vocabulary.
- The model should not be used as the sole authority for incident decisions; it is intended only as an aid to human analysts.
## 4. How to Get Started with the Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
model = AutoModelForSequenceClassification.from_pretrained("codechrl/bert-micro-cybersecurity")
model.eval()

# Tokenize a single log line; truncation guards against over-length inputs.
inputs = tokenizer(
    "The server logged an unusual outbound connection to 123.123.123.123",
    return_tensors="pt", truncation=True, padding=True,
)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(dim=-1).item()
```
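
If the checkpoint's config includes label names, `model.config.id2label[predicted_class]` maps the predicted index back to a human-readable label.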
## 5. Training Details
### Text Processing & Chunking
Since cybersecurity data often contains lengthy alert descriptions and execution logs that exceed BERT's 512-token limit, we implement an overlapping chunking strategy:
- **Max sequence length**: 512 tokens
- **Stride**: 32 tokens (overlap between consecutive chunks)
- **Chunking behavior**: Long texts are split into overlapping segments. For example, with `max_length=512` and `stride=32`, a 1,000-token document becomes ~3 chunks, each sharing 32 tokens with its neighbor, preserving context across chunk boundaries (see the sketch below).
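
A minimal sketch of this chunking via the `transformers` fast-tokenizer API, where `stride` is the token overlap between consecutive chunks. The exact training pipeline is not published, so treat this as illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")

long_log = " ".join(["event"] * 1000)  # stand-in for a long alert or log entry

# return_overflowing_tokens=True yields one encoding per chunk; stride=32
# matches the 32-token overlap described above.
encoded = tokenizer(
    long_log,
    max_length=512,
    truncation=True,
    stride=32,
    return_overflowing_tokens=True,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # number of chunks produced for this text
```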
### Training Hyperparameters
- **Base model**: `boltuix/bert-micro`
- **Training epochs**: 3
- **Learning rate**: 5e-05
- **Batch size**: 16
- **Weight decay**: 0.01
- **Warmup ratio**: 0.06
- **Gradient accumulation steps**: 1
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup
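
These settings correspond one-to-one to Hugging Face `TrainingArguments` fields. A minimal sketch of how they might be expressed, assuming the standard `Trainer` API was used (the actual training script is not published, and `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters listed above; output_dir is a placeholder and
# the real training script may differ.
args = TrainingArguments(
    output_dir="bert-micro-cybersecurity",
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.06,
    gradient_accumulation_steps=1,
    optim="adamw_torch",          # AdamW
    lr_scheduler_type="linear",   # linear decay with warmup
)
```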
### Training Data
- **Total database rows**: 238,453
- **Rows processed (cumulative)**: 7,394 (3.10%)
- **Rows in this session**: 2
- **Training samples (after chunking)**: 36
- **Training date**: 2025-10-23 03:13:01
### Post-Training Metrics
- **Final training loss**: 0.0000
- **Rows→Samples ratio**: 18.00× (36 training samples ÷ 2 rows; average chunks per row)