---
language:
- en
- id
tags:
- bert
- text-classification
- token-classification
- cybersecurity
- fill-mask
- named-entity-recognition
- transformers
- tensorflow
- pytorch
- masked-language-modeling
base_model: boltuix/bert-micro
library_name: transformers
pipeline_tag: fill-mask
---
# bert-micro-cybersecurity
## 1. Model Details
**Model description**
"bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs benign content).
- Model type: fine-tuned lightweight BERT variant
- Languages: English & Indonesian
- Finetuned from: `boltuix/bert-micro`
- Status: **Early version** — trained on **3.10%** of planned data.
**Model sources**
- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
- Data: Cybersecurity Data
## 2. Uses
### Direct use
You can use this model to classify cybersecurity-related text — for example, whether a given message, report or log entry indicates malicious intent, abnormal behaviour, or threat presence.
### Downstream use
- Embedding extraction for clustering or anomaly detection in security logs.
- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert-generation, SOC dashboard).
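For the embedding-extraction use above, a common approach is attention-masked mean pooling over the encoder's last hidden states. The pooling step is sketched below in plain Python for clarity; in practice you would pool `outputs.last_hidden_state` from `AutoModel.from_pretrained("codechrl/bert-micro-cybersecurity")`, and the toy numbers here are illustrative, not taken from the model:

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0 (padding)."""
    pooled = []
    for tokens, mask in zip(hidden_states, attention_mask):
        kept = [vec for vec, m in zip(tokens, mask) if m == 1]
        dim = len(tokens[0])
        # Sum each dimension over the kept tokens, then divide by their count.
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled

# Toy batch: one sequence of three 2-d token vectors, last position padded.
hidden = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
mask = [[1, 1, 0]]
print(mean_pool(hidden, mask))  # [[2.0, 3.0]]
```

The resulting sentence vectors can then be fed to a clustering or anomaly-detection step.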
### Out-of-scope use
- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested for non-cybersecurity domains or out-of-distribution data.
## 3. Bias, Risks, and Limitations
Because the model is based on a small subset (3.10%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (e.g., industrial control systems, IoT logs, or languages other than English and Indonesian).
- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data — e.g., over-representation of certain threat types, vendor or tooling-specific vocabulary.
- Should not be used as sole authority for incident decisions; only as an aid to human analysts.
## 4. How to Get Started with the Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
model = AutoModelForSequenceClassification.from_pretrained("codechrl/bert-micro-cybersecurity")

inputs = tokenizer(
    "The server logged an unusual outbound connection to 123.123.123.123",
    return_tensors="pt", truncation=True, padding=True,
)
with torch.no_grad():  # inference only; no gradients needed
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(dim=-1).item()
```
## 5. Training Details
### Text Processing & Chunking
Since cybersecurity data often contains lengthy alert descriptions and execution logs that exceed BERT's 512 token limit, we implement an overlapping chunking strategy:
- **Max sequence length**: 512 tokens
- **Stride**: 32 tokens (overlap between consecutive chunks)
- **Chunking behavior**: Long texts are split into overlapping segments. For example, with max_length=512 and stride=32, a 1,000-token document becomes 3 chunks (starting at tokens 0, 480, and 960) with 32-token overlaps, preserving context across chunk boundaries.
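The chunking above amounts to a sliding window over token IDs, where consecutive windows share `stride` tokens. A minimal dependency-free sketch (the actual pipeline presumably relies on the tokenizer's built-in overflow handling, e.g. `return_overflowing_tokens=True` with `stride`):

```python
def chunk_tokens(token_ids, max_length=512, stride=32):
    """Split a token sequence into windows of at most max_length tokens.

    Consecutive windows overlap by `stride` tokens, so the step between
    window starts is max_length - stride.
    """
    if len(token_ids) <= max_length:
        return [token_ids]
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # final window reached the end of the document
    return chunks

doc = list(range(1000))  # stand-in for a 1,000-token log entry
chunks = chunk_tokens(doc)
print([len(c) for c in chunks])  # [512, 512, 40]
```

Each chunk is then treated as an independent training sample, which is why the sample count exceeds the row count after chunking.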
### Training Hyperparameters
- **Base model**: `boltuix/bert-micro`
- **Training epochs**: 3
- **Learning rate**: 5e-05
- **Batch size**: 16
- **Weight decay**: 0.01
- **Warmup ratio**: 0.06
- **Gradient accumulation steps**: 1
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup
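The hyperparameters above map onto Hugging Face `TrainingArguments` roughly as follows. This is a configuration sketch, not the actual training script: `output_dir` is a placeholder, and AdamW with a linear warmup schedule is the `Trainer` default for these settings.

```python
from transformers import TrainingArguments

# Sketch of the reported settings; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="bert-micro-cybersecurity",
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.06,
    gradient_accumulation_steps=1,
    lr_scheduler_type="linear",  # linear decay after warmup
)
```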
### Training Data
- **Total database rows**: 238,453
- **Rows processed (cumulative)**: 7,394 (3.10%)
- **Rows in this session**: 2
- **Training samples (after chunking)**: 36
- **Training date**: 2025-10-23 03:13:01
### Post-Training Metrics
- **Final training loss**: 0.0000
- **Rows→Samples ratio**: 18.00x (average chunks per row)