codechrl

Training update: 6,737/238,451 rows (2.83%) | +26 new @ 2025-10-23 02:35:06

feaddf8 verified 4 months ago

metadata

language:
  - en
  - id
tags:
  - bert
  - text-classification
  - token-classification
  - cybersecurity
  - fill-mask
  - named-entity-recognition
  - transformers
  - tensorflow
  - pytorch
  - masked-language-modeling
base_model: boltuix/bert-micro
library_name: transformers
pipeline_tag: fill-mask

bert-micro-cybersecurity

1. Model Details

Model description
"bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs benign content).

Model type: fine-tuned lightweight BERT variant
Languages: English and Indonesian
Finetuned from: boltuix/bert-micro
Status: Early version, trained on 2.83% of the planned data.

Model sources
Base model: boltuix/bert-micro
Data: Cybersecurity Data

2. Uses

Direct use

You can use this model to classify cybersecurity-related text — for example, whether a given message, report or log entry indicates malicious intent, abnormal behaviour, or threat presence.

Downstream use

Embedding extraction for clustering or anomaly detection in security logs.
As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
As a feature extractor feeding a downstream system (e.g., alert-generation, SOC dashboard).
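
One common way to turn per-token vectors into a single embedding for clustering or anomaly detection is masked mean pooling: average the token vectors while ignoring padding positions. The sketch below shows the idea on plain Python lists so it runs without any dependencies. The function name `masked_mean_pool` and the toy values are illustrative assumptions, not part of this model's API. With a Transformers model, the same operation would typically be applied to the encoder's `last_hidden_state` and the tokenizer's `attention_mask`.

```python
from typing import List, Sequence


def masked_mean_pool(token_vectors: Sequence[Sequence[float]],
                     attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    # Keep only the vectors of real (non-padding) tokens.
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (hypothetical values): three 2-dimensional token vectors,
# where the last position is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = masked_mean_pool(tokens, mask)
print(embedding)  # [2.0, 3.0]
```

Whether mean pooling, the first-token vector, or another strategy works best for a given security-log task is an empirical question, especially at this early training stage.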

Out-of-scope use

Not meant for high-stakes automated blocking decisions without human review.
Not optimized for languages other than English and Indonesian.
Not tested for non-cybersecurity domains or out-of-distribution data.

3. Bias, Risks, and Limitations

Because the model has so far been trained on only a small subset (2.83%) of the planned data, its performance is preliminary and may degrade on unseen or specialized domains (e.g., industrial control systems, IoT logs, or languages other than English and Indonesian).

Inherits any biases present in the base model (boltuix/bert-micro) and in the fine-tuning data — e.g., over-representation of certain threat types, vendor or tooling-specific vocabulary.
Should not be used as the sole authority for incident decisions; use it only as an aid to human analysts.
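
One way to keep a human in the loop is to convert the classifier's logits into probabilities and escalate any prediction whose top probability falls below a cutoff. The following is a minimal sketch under stated assumptions: the `triage` helper and the `0.9` threshold are hypothetical and would need tuning on validation data. Softmax scores are not calibrated probabilities, so a high score is a reason to deprioritize review, not a reason to act automatically.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits: List[float], threshold: float = 0.9) -> Tuple[int, float, bool]:
    """Return (predicted class, confidence, needs_review).

    `threshold` is an illustrative value, not one recommended by the
    model authors.
    """
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[predicted]
    return predicted, confidence, confidence < threshold


# Hypothetical logits for a two-class model.
print(triage([0.2, 0.4]))    # low confidence: flagged for analyst review
print(triage([-3.0, 4.0]))   # high confidence: not flagged by this rule
```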

4. How to Get Started with the Model

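import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
model = AutoModelForSequenceClassification.from_pretrained("codechrl/bert-micro-cybersecurity")

# Tokenize a single example; inputs longer than the model limit are truncated.
inputs = tokenizer("The server logged an unusual outbound connection to 123.123.123.123",
                   return_tensors="pt", truncation=True, padding=True)

# Run inference without tracking gradients and pick the highest-scoring class.
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
print(predicted_class, model.config.id2label[predicted_class])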

5. Training Details

Trained records: 6,737 / 238,451 (2.83%)
Learning rate: 5e-05
Epochs: 3
Batch size: 16
Max sequence length: 512
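
As a quick sanity check on the figures above, the snippet below recomputes the training progress and estimates the number of optimizer steps. The step counts are a rough sketch that assumes a single run over all trained rows on one device, no gradient accumulation, and a final partial batch that is kept. The card does not state how the incremental training updates are actually scheduled.

```python
import math

# Values taken from the training details above.
trained_rows = 6_737
planned_rows = 238_451
batch_size = 16
epochs = 3

# Share of the planned data seen so far.
progress_pct = 100 * trained_rows / planned_rows

# Assumption: one pass per epoch over all trained rows, last partial batch kept.
steps_per_epoch = math.ceil(trained_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(f"{progress_pct:.2f}%")  # 2.83%
print(steps_per_epoch)         # 422
print(total_steps)             # 1266
```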