---
language:
- en
- id
tags:
- bert
- text-classification
- token-classification
- cybersecurity
- fill-mask
- named-entity-recognition
base_model: boltuix/bert-micro
library_name: transformers
---
# bert-micro-cybersecurity
## 1. Model Details
**Model description**
"bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident-report triage, separating malicious from benign content).
- Model type: fine-tuned lightweight BERT variant
- Languages: English and Indonesian
- Fine-tuned from: `boltuix/bert-micro`
- Status: **Early version**, trained on **0.68%** of the planned data.
**Model sources**
- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
- Data: Cybersecurity Data
## 2. Uses
### Direct use
You can use this model to classify cybersecurity-related text — for example, whether a given message, report or log entry indicates malicious intent, abnormal behaviour, or threat presence.
### Downstream use
- Embedding extraction for clustering or anomaly detection in security logs.
- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert-generation, SOC dashboard).
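For the embedding-extraction use, one common approach (an assumption here, not something this model card specifies) is to mean-pool the encoder's token vectors while skipping padding positions. The sketch below shows only that pooling step, in plain Python on toy values; `mean_pool` is a hypothetical helper, and in practice the token vectors would come from the model's last hidden state and the same computation would run on tensors.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention_mask entry is 1 (hypothetical helper)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have equal length")
    # Keep only real tokens; padding positions carry mask value 0.
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: three token vectors of width 2; the last position is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = mean_pool(tokens, mask)
```

The resulting fixed-length vector can then be passed to a clustering or anomaly-detection method of your choice.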
### Out-of-scope use
- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested for non-cybersecurity domains or out-of-distribution data.
## 3. Bias, Risks, and Limitations
Because the model has been trained on only a small subset (0.68%) of the planned data, its performance is preliminary and may degrade on unseen or specialized domains (e.g., industrial control systems, IoT logs, or languages other than English and Indonesian).
- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data — e.g., over-representation of certain threat types, vendor or tooling-specific vocabulary.
- Should not be used as the sole authority for incident decisions; use it only as an aid to human analysts.
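One way to keep analysts in the loop is to convert the model's logits to probabilities and flag low-confidence predictions for closer scrutiny. This is a minimal sketch under stated assumptions: `triage` and `review_threshold` are hypothetical names, and the 0.9 default is an arbitrary illustration, not a calibrated value for this model.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits: List[float], review_threshold: float = 0.9) -> Tuple[int, float, bool]:
    """Return (predicted class index, confidence, needs_extra_review).

    review_threshold is an assumed, uncalibrated cut-off (hypothetical default).
    """
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[predicted]
    return predicted, confidence, confidence < review_threshold


# Two close logits give a low-confidence prediction that gets flagged.
label, confidence, needs_review = triage([0.2, 0.5])
```

The flag only marks predictions that deserve extra skepticism; given the early training state, confident predictions should still be reviewed by a human before any action is taken.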
## 4. How to Get Started with the Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
model = AutoModelForSequenceClassification.from_pretrained("codechrl/bert-micro-cybersecurity")
model.eval()

inputs = tokenizer(
    "The server logged an unusual outbound connection to 123.123.123.123",
    return_tensors="pt", truncation=True, padding=True,
)
with torch.no_grad():  # inference only, so no gradients are needed
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()  # index of the highest-scoring class
```
## 5. Training Details
- **Trained records**: 1,614 / 237,699 (0.68%)
- **Learning rate**: 5e-05
- **Epochs**: 3
- **Batch size**: 1000
- **Max sequence length**: 512
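To put these numbers in perspective, the short calculation below reproduces the 0.68% figure and estimates how many optimizer steps the run involved. The step count is an estimate only: it assumes all 1,614 records were used for training, the batch size applies per optimizer step, there is no gradient accumulation, and the final partial batch is kept.

```python
import math

trained_records = 1_614
planned_records = 237_699
batch_size = 1_000
epochs = 3

# Share of the planned data seen so far
fraction_trained = trained_records / planned_records

# Assumption: last partial batch is kept, no gradient accumulation
steps_per_epoch = math.ceil(trained_records / batch_size)
total_steps = steps_per_epoch * epochs

print(f"{fraction_trained:.2%} of planned data")
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```

Under those assumptions the model received roughly 2 updates per epoch and 6 in total, which is consistent with describing the current checkpoint as an early version.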