|
|
--- |
|
|
language: |
|
|
- en |
|
|
- id |
|
|
tags: |
|
|
- text-classification |
|
|
- cybersecurity |
|
|
base_model: boltuix/bert-micro |
|
|
--- |
|
|
|
|
|
# Model Card for “bert-micro-cybersecurity” |
|
|
|
|
|
## 1. Model Details |
|
|
|
|
|
**Model description** |
|
|
“bert-micro-cybersecurity” is a compact transformer model derived from `boltuix/bert-micro`, adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs benign content). |
|
|
- Model type: fine-tuned lightweight BERT variant |
|
|
- Languages: English & Indonesian
|
|
- Finetuned from: `boltuix/bert-micro` |
|
|
- Status: **Early version** — trained on ~ **2%** of planned data. |
|
|
|
|
|
**Model sources** |
|
|
- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
|
|
- Training data: cybersecurity text data
|
|
|
|
|
## 2. Uses |
|
|
|
|
|
### Direct use |
|
|
You can use this model to classify cybersecurity-related text — for example, whether a given message, report, or log entry indicates malicious intent, abnormal behaviour, or the presence of a threat.
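As a minimal sketch of direct classification, the snippet below runs a randomly initialized tiny BERT classifier as a stand-in for the fine-tuned checkpoint (in practice you would load it with `AutoModelForSequenceClassification.from_pretrained(...)`); the `benign`/`malicious` label set is an illustrative assumption, not the model's actual labels:

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Assumed, illustrative label set — substitute the checkpoint's real labels.
id2label = {0: "benign", 1: "malicious"}

# Tiny random-init config standing in for the fine-tuned checkpoint.
config = BertConfig(
    hidden_size=128, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=256, num_labels=2,
    id2label=id2label, label2id={v: k for k, v in id2label.items()},
)
model = BertForSequenceClassification(config)
model.eval()

# Dummy token ids standing in for tokenizer output.
input_ids = torch.randint(0, config.vocab_size, (1, 12))
with torch.no_grad():
    logits = model(input_ids=input_ids).logits

# Map logits to a probability distribution and a human-readable label.
probs = torch.softmax(logits, dim=-1)
label = id2label[int(probs.argmax(dim=-1))]
print(label, probs.max().item())
```

The same logits-to-label mapping applies unchanged once the real checkpoint is loaded.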
|
|
|
|
|
### Downstream use |
|
|
- Embedding extraction for clustering or anomaly detection in security logs. |
|
|
- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
|
|
- As a feature extractor feeding a downstream system (e.g., alert-generation, SOC dashboard). |
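The embedding-extraction use above can be sketched as mean pooling over the encoder's last hidden state. A randomly initialized tiny BERT stands in for the fine-tuned checkpoint here (in practice, load it via `AutoModel.from_pretrained(...)` together with its tokenizer), and dummy token ids stand in for tokenizer output:

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random-init encoder standing in for the fine-tuned checkpoint.
config = BertConfig(
    hidden_size=128, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=256,
)
model = BertModel(config)
model.eval()

# Dummy batch of 2 sequences, 16 tokens each (stands in for tokenizer output).
input_ids = torch.randint(0, config.vocab_size, (2, 16))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    hidden = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

# Mean-pool over non-padding tokens: one fixed-size vector per sequence,
# usable as features for clustering or anomaly detection on security logs.
mask = attention_mask.unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # one 128-dim vector per input sequence
```

The resulting vectors can be fed directly to clustering (e.g., k-means) or distance-based anomaly scoring.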
|
|
|
|
|
### Out-of-scope use |
|
|
- Not meant for high-stakes automated blocking decisions without human review. |
|
|
- Not optimized for languages other than English and Indonesian.
|
|
- Not tested for non-cybersecurity domains or out-of-distribution data. |
|
|
|
|
|
## 3. Bias, Risks, and Limitations |
|
|
Because the model has been trained on only a small subset (~2%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control systems, IoT logs, other languages).
|
|
- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data — e.g., over-representation of certain threat types, vendor or tooling-specific vocabulary.
|
|
- Should not be used as sole authority for incident decisions; only as an aid to human analysts. |
|
|
|
|
|
## 4. How to Get Started with the Model |
|
|
```python |
|
|
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "your-username/bert-micro-cybersecurity" is a placeholder; substitute the actual repo id.
tokenizer = AutoTokenizer.from_pretrained("your-username/bert-micro-cybersecurity")
model = AutoModelForSequenceClassification.from_pretrained("your-username/bert-micro-cybersecurity")

inputs = tokenizer(
    "The server logged an unusual outbound connection to 123.123.123.123",
    return_tensors="pt", truncation=True, padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to a predicted class index (and, optionally, probabilities).
probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = int(probs.argmax(dim=-1))
```