---
language:
- en
- id
tags:
- bert
- text-classification
- token-classification
- cybersecurity
- fill-mask
- named-entity-recognition
base_model: boltuix/bert-micro
library_name: transformers
---

# bert-micro-cybersecurity

## 1. Model Details

**Model description**

"bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs. benign content).

- Model type: fine-tuned lightweight BERT variant
- Languages: English & Indonesian
- Fine-tuned from: `boltuix/bert-micro`
- Status: **Early version**, trained on **0.89%** of the planned data.

**Model sources**

- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
- Data: cybersecurity data

## 2. Uses

### Direct use

You can use this model to classify cybersecurity-related text, for example to determine whether a given message, report, or log entry indicates malicious intent, abnormal behaviour, or the presence of a threat.

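For quick experiments, the checkpoint can also be loaded through the `pipeline` API. A minimal sketch follows; the label names in the output are whatever this checkpoint's `id2label` mapping defines, which the card does not document:

```python
from transformers import pipeline

# Minimal sketch: high-level text-classification pipeline.
# The label strings in the output depend on this checkpoint's config;
# inspect classifier.model.config.id2label for the actual names.
classifier = pipeline("text-classification", model="codechrl/bert-micro-cybersecurity")

print(classifier("Multiple failed SSH logins followed by a successful root login"))
```
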
### Downstream use

- Embedding extraction for clustering or anomaly detection in security logs (see the sketch after this list).
- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert generation, a SOC dashboard).

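The embedding-extraction use can be sketched as below. This is illustrative only, not part of the published model card: it loads the bare encoder with `AutoModel`, takes the `[CLS]` vector for each log line, and clusters with scikit-learn's `KMeans` (an assumed choice of clustering algorithm):

```python
import torch
from sklearn.cluster import KMeans
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
encoder = AutoModel.from_pretrained("codechrl/bert-micro-cybersecurity")

# Hypothetical log lines standing in for a real security log feed.
logs = [
    "Outbound connection to unknown host on port 4444",
    "User password changed via self-service portal",
]

inputs = tokenizer(logs, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Use the [CLS] token vector as a fixed-size embedding per log line.
embeddings = hidden[:, 0, :].numpy()

clusters = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(clusters)
```
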
### Out-of-scope use

- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested on non-cybersecurity domains or out-of-distribution data.

## 3. Bias, Risks, and Limitations

Because the model is trained on a small subset (0.89%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control systems, IoT logs, foreign-language text).

- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data, e.g., over-representation of certain threat types or vendor- and tooling-specific vocabulary.
- Should not be used as the sole authority for incident decisions; only as an aid to human analysts.

## 4. How to Get Started with the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
model = AutoModelForSequenceClassification.from_pretrained("codechrl/bert-micro-cybersecurity")

# Tokenize a single log line; truncation guards against inputs longer
# than the model's 512-token maximum sequence length.
inputs = tokenizer(
    "The server logged an unusual outbound connection to 123.123.123.123",
    return_tensors="pt", truncation=True, padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class])
```

## 5. Training Details

- **Trained records**: 2,116 / 237,881 (0.89%)
- **Learning rate**: 5e-05
- **Epochs**: 3
- **Batch size**: 16
- **Max sequence length**: 512

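A hypothetical reconstruction of a fine-tuning setup using these hyperparameters is shown below. The actual training script and dataset are not published, so the label count (`num_labels=2`) and the `preprocess` helper are assumptions for illustration:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-micro")

# num_labels=2 is an assumption; the card does not list the label set.
model = AutoModelForSequenceClassification.from_pretrained(
    "boltuix/bert-micro", num_labels=2
)

args = TrainingArguments(
    output_dir="bert-micro-cybersecurity",
    learning_rate=5e-5,              # learning rate from the table above
    num_train_epochs=3,              # epochs
    per_device_train_batch_size=16,  # batch size
)

def preprocess(batch):
    # Apply the 512-token maximum sequence length at tokenization time.
    return tokenizer(batch["text"], truncation=True, max_length=512)
```
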