salsazufar committed on
Commit e73a2d5 · verified · 1 Parent(s): 5c5591d

Upload folder using huggingface_hub
README.md CHANGED
---
license: apache-2.0
tags:
- intrusion-detection
- host-based-ids
- adfa-ld
- distilbert
- sequence-classification
- security
- cybersecurity
- binary-classification
datasets:
- ADFA-LD
model-index:
- name: distilbert-base-uncased-hids-adfa
  results:
  - task:
      type: text-classification
      name: Host-based Intrusion Detection
    dataset:
      name: ADFA-LD
      type: custom
    metrics:
    - type: accuracy
      value: 0.9403
    - type: f1
      value: 0.9450
    - type: precision
      value: 0.9245
    - type: recall
      value: 0.9664
    - type: auc
      value: 0.9630
---

# DistilBERT for Host-based Intrusion Detection System (HIDS)

This model is a fine-tuned DistilBERT for binary classification of system call sequences, detecting intrusions in the ADFA-LD dataset. Its hyperparameters were tuned to maximize detection performance for host-based intrusion detection.

## Model Details

### Base Model
- **Architecture**: DistilBERT (DistilBertForSequenceClassification)
- **Base Model**: `distilbert-base-uncased`
- **Task**: Binary Sequence Classification (Normal vs Attack)
- **Number of Labels**: 2

### Training Configuration
- **Training Epochs**: 8
- **Batch Size**: 32
- **Learning Rate**: 2e-05
- **Weight Decay**: 0.0
- **Warmup Ratio**: 0.1
- **Optimizer**: AdamW
- **Scheduler**: LinearLR
+
57
+ ### Dataset
58
+ - **Dataset**: ADFA-LD (Australian Defence Force Academy Linux Dataset)
59
+ - **Preprocessing**: 18-gram sequences
60
+
61
+
62
+ ## Performance
63
+
64
+ ### Validation Metrics
65
+ - **Accuracy**: 94.03%
66
+ - **F1 Score**: 94.50%
67
+ - **Precision**: 92.45%
68
+ - **Recall**: 96.64%
69
+ - **AUC-ROC**: 96.30%
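
As a quick sanity check, the reported F1 score is consistent with the reported precision and recall, since F1 is their harmonic mean:

```python
precision, recall = 0.9245, 0.9664

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # ≈ 0.945, matching the reported F1 score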

## Usage

You can use this model directly with a `text-classification` pipeline. The pipeline returns only the top label by default; pass `top_k=None` to get scores for both classes. Since the config defines no label names, `LABEL_0` corresponds to "Normal" and `LABEL_1` to "Attack":

```python
>>> from transformers import pipeline

>>> classifier = pipeline('text-classification', model='salsazufar/distilbert-base-hids-adfa', top_k=None)
>>> classifier("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18")

[{'label': 'LABEL_0',
  'score': 0.9876},
 {'label': 'LABEL_1',
  'score': 0.0124}]
```

Here is how to classify a system call sequence with this model directly in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model = AutoModelForSequenceClassification.from_pretrained('salsazufar/distilbert-base-hids-adfa')

# Prepare input (18-gram system call sequence)
text = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=20)

# Forward pass
with torch.no_grad():
    output = model(**encoded_input)
    logits = output.logits
    probabilities = torch.softmax(logits, dim=-1)
    predicted_class = torch.argmax(logits, dim=-1).item()

# Interpret results
class_names = ["Normal", "Attack"]
print(f"Predicted class: {class_names[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class].item():.4f}")
print(f"Probabilities: Normal={probabilities[0][0].item():.4f}, Attack={probabilities[0][1].item():.4f}")
```

### Data Preprocessing

This model expects input in 18-gram format. If you have raw system call traces, you need to:

1. Extract system calls from trace files
2. Convert to n-grams (n=18)
3. Format as space-separated strings
4. Ensure sequences are exactly 18 tokens (pad or truncate if necessary)

Example preprocessing pipeline:

```python
def create_ngrams(trace, n=18):
    """Convert system call trace to n-grams"""
    ngrams = []
    for i in range(len(trace) - n + 1):
        ngram = trace[i:i+n]
        ngrams.append(" ".join(map(str, ngram)))
    return ngrams
```
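
The numbered steps above can be sketched end-to-end. `pad_trace` below is a hypothetical helper for step 4, and padding with 0 is an illustrative choice, not something the model card specifies:

```python
def create_ngrams(trace, n=18):
    """Convert a system call trace to overlapping n-grams (from the model card)."""
    ngrams = []
    for i in range(len(trace) - n + 1):
        ngrams.append(" ".join(map(str, trace[i:i + n])))
    return ngrams

def pad_trace(trace, n=18, pad_value=0):
    """Hypothetical helper for step 4: pad short traces so at least one
    n-gram exists (pad_value=0 is an illustrative assumption)."""
    return trace + [pad_value] * max(0, n - len(trace))

trace = list(range(1, 21))    # a toy trace of 20 system call numbers
ngrams = create_ngrams(pad_trace(trace))
print(len(ngrams))            # 20 - 18 + 1 = 3 overlapping 18-grams
print(ngrams[0].split()[:3])  # ['1', '2', '3']
```

Each resulting n-gram is then a space-separated string ready to pass to the tokenizer as shown in the PyTorch example.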

### Limitations and Considerations

1. **Domain Specific**: This model is trained specifically on the ADFA-LD dataset and may not generalize well to other system call datasets without retraining.

2. **Input Format**: The model expects 18-gram sequences; raw system call traces must be preprocessed accordingly.

3. **Binary Classification**: The model only distinguishes between "Normal" and "Attack" classes. It does not classify specific attack types.

### BibTeX entry and citation info

```bibtex
@misc{distilbert-hids-adfa,
  title={DistilBERT for Host-based Intrusion Detection on ADFA-LD Dataset},
  author={salsazufar},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/salsazufar/distilbert-base-hids-adfa}}
}
```

## References

- ADFA-LD Dataset: [ADFA-LD: An Anomaly Detection Dataset for Linux-based Host Intrusion Detection Systems](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-LD-Dataset/)
- DistilBERT: [DistilBERT, a distilled version of BERT](https://arxiv.org/abs/1910.01108)

## License

This model is licensed under the Apache 2.0 license.

config.json ADDED
```json
{
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "dtype": "float32",
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.1",
  "vocab_size": 30522
}
```
model.safetensors ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:2169f4d48ad544ab78462e050f2c99593d8abaf01aee3c3cdcf6f90ad27648a8
size 267832560
```
special_tokens_map.json ADDED
```json
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
```
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
```json
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
```
vocab.txt ADDED
The diff for this file is too large to render. See raw diff