salsazufar committed on
Commit e73a2d5 · verified · 1 Parent(s): 5c5591d

Upload folder using huggingface_hub
README.md CHANGED
---
license: apache-2.0
tags:
- intrusion-detection
- host-based-ids
- adfa-ld
- distilbert
- sequence-classification
- security
- cybersecurity
- binary-classification
datasets:
- ADFA-LD
model-index:
- name: distilbert-base-uncased-hids-adfa
  results:
  - task:
      type: text-classification
      name: Host-based Intrusion Detection
    dataset:
      name: ADFA-LD
      type: custom
    metrics:
    - type: accuracy
      value: 0.9403
    - type: f1
      value: 0.9450
    - type: precision
      value: 0.9245
    - type: recall
      value: 0.9664
    - type: auc
      value: 0.9630
---

# DistilBERT for Host-based Intrusion Detection System (HIDS)

This model is a fine-tuned DistilBERT for binary classification of system call sequences, detecting intrusions in the ADFA-LD dataset. Its hyperparameters were tuned to maximize detection performance for host-based intrusion detection.

## Model Details

### Base Model
- **Architecture**: DistilBERT (DistilBertForSequenceClassification)
- **Base Model**: `distilbert-base-uncased`
- **Task**: Binary Sequence Classification (Normal vs Attack)
- **Number of Labels**: 2

### Training Configuration
- **Training Epochs**: 8
- **Batch Size**: 32
- **Learning Rate**: 2e-05
- **Weight Decay**: 0.0
- **Warmup Ratio**: 0.1
- **Optimizer**: AdamW
- **Scheduler**: LinearLR
+
57
+ ### Dataset
58
+ - **Dataset**: ADFA-LD (Australian Defence Force Academy Linux Dataset)
59
+ - **Preprocessing**: 18-gram sequences
60
+
61
+
62
+ ## Performance
63
+
64
+ ### Validation Metrics
65
+ - **Accuracy**: 94.03%
66
+ - **F1 Score**: 94.50%
67
+ - **Precision**: 92.45%
68
+ - **Recall**: 96.64%
69
+ - **AUC-ROC**: 96.30%
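
As a quick sanity check, the reported F1 score is consistent with the reported precision and recall, since F1 is their harmonic mean:

```python
precision, recall = 0.9245, 0.9664

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # ≈ 0.945, matching the reported F1 score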

## Usage

You can use this model directly with a `text-classification` pipeline. The pipeline returns only the top label by default; pass `top_k=None` to get scores for both classes. Since the config defines no label names, `LABEL_0` corresponds to "Normal" and `LABEL_1` to "Attack":

```python
>>> from transformers import pipeline

>>> classifier = pipeline('text-classification', model='salsazufar/distilbert-base-hids-adfa', top_k=None)
>>> classifier("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18")

[{'label': 'LABEL_0',
  'score': 0.9876},
 {'label': 'LABEL_1',
  'score': 0.0124}]
```

Here is how to classify a system call sequence with this model directly in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model = AutoModelForSequenceClassification.from_pretrained('salsazufar/distilbert-base-hids-adfa')

# Prepare input (18-gram system call sequence)
text = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=20)

# Forward pass
with torch.no_grad():
    output = model(**encoded_input)
    logits = output.logits
    probabilities = torch.softmax(logits, dim=-1)
    predicted_class = torch.argmax(logits, dim=-1).item()

# Interpret results
class_names = ["Normal", "Attack"]
print(f"Predicted class: {class_names[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class].item():.4f}")
print(f"Probabilities: Normal={probabilities[0][0].item():.4f}, Attack={probabilities[0][1].item():.4f}")
```

### Data Preprocessing

This model expects input in 18-gram format. If you have raw system call traces, you need to:

1. Extract system calls from trace files
2. Convert to n-grams (n=18)
3. Format as space-separated strings
4. Ensure sequences are exactly 18 tokens (pad or truncate if necessary)

Example preprocessing pipeline:

```python
def create_ngrams(trace, n=18):
    """Convert system call trace to n-grams"""
    ngrams = []
    for i in range(len(trace) - n + 1):
        ngram = trace[i:i+n]
        ngrams.append(" ".join(map(str, ngram)))
    return ngrams
```
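
The numbered steps above can be sketched end-to-end. `pad_trace` below is a hypothetical helper for step 4, and padding with 0 is an illustrative choice, not something the model card specifies:

```python
def create_ngrams(trace, n=18):
    """Convert a system call trace to overlapping n-grams (from the model card)."""
    ngrams = []
    for i in range(len(trace) - n + 1):
        ngrams.append(" ".join(map(str, trace[i:i + n])))
    return ngrams

def pad_trace(trace, n=18, pad_value=0):
    """Hypothetical helper for step 4: pad short traces so at least one
    n-gram exists (pad_value=0 is an illustrative assumption)."""
    return trace + [pad_value] * max(0, n - len(trace))

trace = list(range(1, 21))    # a toy trace of 20 system call numbers
ngrams = create_ngrams(pad_trace(trace))
print(len(ngrams))            # 20 - 18 + 1 = 3 overlapping 18-grams
print(ngrams[0].split()[:3])  # ['1', '2', '3']
```

Each resulting n-gram is then a space-separated string ready to pass to the tokenizer as shown in the PyTorch example.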

### Limitations and Considerations

1. **Domain Specific**: This model is trained specifically on the ADFA-LD dataset and may not generalize well to other system call datasets without retraining.

2. **Input Format**: The model expects 18-gram sequences; raw system call traces must be preprocessed accordingly.

3. **Binary Classification**: The model only distinguishes between "Normal" and "Attack" classes. It does not classify specific attack types.

### BibTeX entry and citation info

```bibtex
@misc{distilbert-hids-adfa,
  title={DistilBERT for Host-based Intrusion Detection on ADFA-LD Dataset},
  author={salsazufar},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/salsazufar/distilbert-base-hids-adfa}}
}
```

## References

- ADFA-LD Dataset: [ADFA-LD: An Anomaly Detection Dataset for Linux-based Host Intrusion Detection Systems](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-LD-Dataset/)
- DistilBERT: [DistilBERT, a distilled version of BERT](https://arxiv.org/abs/1910.01108)

## License

This model is licensed under the Apache 2.0 license.

config.json ADDED
```json
{
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "dtype": "float32",
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.1",
  "vocab_size": 30522
}
```
model.safetensors ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:2169f4d48ad544ab78462e050f2c99593d8abaf01aee3c3cdcf6f90ad27648a8
size 267832560
```
special_tokens_map.json ADDED
```json
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
```
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
```json
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
```
vocab.txt ADDED
The diff for this file is too large to render. See raw diff