TanmaySK committed on
Commit 81a1537 · verified · 1 Parent(s): 5d5ade4

Update README.md

Files changed (1)
  1. README.md +105 -39
README.md CHANGED
@@ -3,7 +3,13 @@ library_name: transformers
  license: apache-2.0
  base_model: distilbert-base-uncased
  tags:
- - generated_from_trainer
  metrics:
  - accuracy
  - f1
@@ -14,60 +20,120 @@ model-index:
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # results

- This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0000
- - Accuracy: 1.0
- - F1: 1.0
- - Precision: 1.0
- - Recall: 1.0

- ## Model description

- This model is a fine-tuned version of distilbert-base-uncased for binary text classification tasks. It is designed to classify input text into two categories, such as malicious vs. benign network traffic or positive vs. negative sentiment, depending on the dataset used. DistilBERT provides a lightweight yet powerful transformer architecture, making this model suitable for real-time or resource-constrained environments.

- ## Intended uses & limitations

- - Detecting malicious or benign traffic (if from network data)
- - Sentiment classification (if from reviews/tweets)

- ## Training and evaluation data

- The model was trained on a custom binary classification dataset containing text samples labeled as 0 (benign) or 1 (malicious). The dataset was split into training and validation sets. Text inputs were preprocessed using lowercase tokenization, padding, and truncation to a maximum length of 512 tokens.

- ## Training procedure

- The model was fine-tuned using the Hugging Face Trainer API for binary text classification. It was trained for 3 epochs with a batch size of 16, using the AdamW optimizer and a linear learning rate scheduler. The dataset was tokenized with distilbert-base-uncased, and evaluation was performed on a validation split using metrics like accuracy, precision, recall, and F1-score.

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 16
- - eval_batch_size: 16
- - seed: 42
- - optimizer: AdamW (ADAMW_TORCH) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- - lr_scheduler_type: linear
- - num_epochs: 3

- ### Training results

- | Training Loss | Epoch | Step  | Validation Loss | Accuracy | F1  | Precision | Recall |
- |:-------------:|:-----:|:-----:|:---------------:|:--------:|:---:|:---------:|:------:|
- | 0.0           | 1.0   | 3375  | 0.0000          | 1.0      | 1.0 | 1.0       | 1.0    |
- | 0.0           | 2.0   | 6750  | 0.0000          | 1.0      | 1.0 | 1.0       | 1.0    |
- | 0.0           | 3.0   | 10125 | 0.0000          | 1.0      | 1.0 | 1.0       | 1.0    |

- ### Framework versions

- - Transformers 4.50.3
- - Pytorch 2.6.0+cu124
- - Tokenizers 0.21.1
  license: apache-2.0
  base_model: distilbert-base-uncased
  tags:
+ - text-classification
+ - binary-classification
+ - cybersecurity
+ - wireshark
+ - distilbert
+ - transformers
+ - huggingface
  metrics:
  - accuracy
  - f1

  results: []
  ---

+ # 🧠 results – DistilBERT for Malicious Traffic Classification

+ This model is a fine-tuned version of [`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) for **binary classification of network traffic**, especially useful for distinguishing **malicious vs. benign** packets based on preprocessed Wireshark-style logs.

+ ---
+
+ ## 📊 Evaluation Results
+
+ | Metric    | Value  |
+ |-----------|--------|
+ | Accuracy  | 1.0    |
+ | Precision | 1.0    |
+ | Recall    | 1.0    |
+ | F1 Score  | 1.0    |
+ | Eval Loss | 0.0000 |
+
+ > ⚠️ These perfect results are on the validation set and may not generalize to unseen or noisy real-world data. Be sure to test on diverse inputs.
+
+ ---
+
+ ## 🧩 Model Description
+
+ This model uses the lightweight and efficient **DistilBERT** transformer, fine-tuned for binary classification. Input data should be short text sequences (e.g., protocol descriptions, IP headers, or Wireshark logs).
+
+ ---
+
+ ## 💡 Intended Use & Limitations
+
+ ### ✅ Intended Uses
+
+ - **Malicious traffic detection** (from packet text)
+ - **Intrusion detection system (IDS)** aid
+ - Sentiment analysis or spam detection (if retrained)
+
+ ### ❌ Limitations
+
+ - English and network-related text only
+ - Binary classification (0 = benign, 1 = malicious)
+ - Not trained on raw PCAPs; requires preprocessing
+
+ ---
+
+ ## 🏋️ Training Procedure
+
+ - Model: `distilbert-base-uncased`
+ - Framework: `Transformers` Trainer API
+ - Optimizer: AdamW
+ - Scheduler: Linear LR decay
+ - Epochs: 3
+ - Batch Size: 16
+ - Seed: 42
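For reference, the linear LR decay listed above can be sketched in plain Python. This is a minimal sketch that assumes no warmup steps; the base rate (2e-05) and total step count (10125 = 3 epochs × 3375 steps) come from the hyperparameters and results table in the previous revision of this card.

```python
# Minimal sketch of a linear learning-rate decay schedule (no warmup assumed).
BASE_LR = 2e-5       # learning_rate from the earlier card revision
TOTAL_STEPS = 10125  # 3 epochs x 3375 optimizer steps per epoch

def linear_lr(step: int) -> float:
    """Learning rate after `step` optimizer steps, decaying linearly to zero."""
    return BASE_LR * max(0.0, 1.0 - step / TOTAL_STEPS)

print(linear_lr(0))            # full base rate at the first step
print(linear_lr(TOTAL_STEPS))  # decayed to 0.0 at the last step
```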
+
+ ---
+
+ ## 📊 Training and Evaluation Data
+
+ The model was trained on a custom dataset with binary labels:
+ - `input`: stringified packet details (e.g., IPs, protocol, flags)
+ - `BinaryLabel`: `0` = benign, `1` = malicious
+
+ Text was tokenized using the DistilBERT tokenizer with truncation and padding.
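For illustration, a packet record can be flattened into the kind of stringified `input` the model expects. The field names below are hypothetical, not the dataset's actual schema; adapt them to your own export.

```python
# Hypothetical sketch: flatten a packet record into a space-separated
# "Key:Value" string, matching the input format used in the usage examples.
def packet_to_text(packet: dict) -> str:
    """Join packet fields into a single 'Key:Value' string."""
    return " ".join(f"{key}:{value}" for key, value in packet.items())

packet = {"SrcIP": "10.0.0.1", "DstIP": "192.168.1.1", "Protocol": "TCP", "Flags": "SYN"}
print(packet_to_text(packet))
# SrcIP:10.0.0.1 DstIP:192.168.1.1 Protocol:TCP Flags:SYN
```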
+
+ ---
+
+ ## 🧪 Example Usage
+
+ ### 🔌 Hugging Face Pipeline (Single Prediction)
+
+ ```python
+ from transformers import pipeline
+
+ # Load from Hugging Face Hub
+ classifier = pipeline("text-classification", model="TanmaySK/results")
+
+ # Predict
+ text = "SrcIP:10.0.0.1 DstIP:192.168.1.1 Protocol:TCP Flags:SYN"
+ result = classifier(text)
+
+ # Interpret label
+ label_map = {"LABEL_0": "Benign", "LABEL_1": "Malicious"}
+ print(f"Prediction: {label_map[result[0]['label']]} (Confidence: {result[0]['score']:.4f})")
+ ```
+
+ ### 📁 CSV Batch Prediction (Local Wireshark Data)
+
+ ```python
+ import pandas as pd
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model = AutoModelForSequenceClassification.from_pretrained("TanmaySK/results")
+ tokenizer = AutoTokenizer.from_pretrained("TanmaySK/results")
+
+ # Device setup
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+ model.eval()
+
+ # Load CSV (must have an 'input' column)
+ df = pd.read_csv("wireshark_unlabeled.csv")
+ label_map = {0: "Benign", 1: "Malicious"}
+ predictions = []
+
+ # Predict each row
+ for text in df["input"]:
+     inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
+     inputs = {k: v.to(device) for k, v in inputs.items() if k != "token_type_ids"}
+
+     with torch.no_grad():
+         logits = model(**inputs).logits
+     pred = torch.argmax(logits, dim=1).item()
+     predictions.append(pred)
+
+ # Save results
+ df["PredictedLabel"] = predictions
+ df["PredictionText"] = [label_map[p] for p in predictions]
+ df.to_csv("wireshark_predictions.csv", index=False)
+ print("✅ Saved to wireshark_predictions.csv")
+ ```