TanmaySK/results

@@ -3,7 +3,13 @@ library_name: transformers
 license: apache-2.0
 base_model: distilbert-base-uncased
 tags:
-- generated_from_trainer
 metrics:
 - accuracy
 - f1
@@ -14,60 +20,120 @@ model-index:
   results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# results
-This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on an unknown dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.0000
-- Accuracy: 1.0
-- F1: 1.0
-- Precision: 1.0
-- Recall: 1.0
-## Model description
-This model is a fine-tuned version of distilbert-base-uncased for binary text classification tasks. It is designed to classify input text into two categories — such as malicious vs. benign network traffic, or positive vs. negative sentiment — depending on the dataset used. DistilBERT provides a lightweight yet powerful transformer architecture, making this model suitable for real-time or resource-constrained environments.
-## Intended uses & limitations
-- Detecting malicious or benign traffic (if from network data)
-- Sentiment classification (if from reviews/tweets)
-## Training and evaluation data
-The model was trained on a custom binary classification dataset containing text samples labeled as 0 (benign) or 1 (malicious). The dataset was split into training and validation sets. Text inputs were preprocessed using lowercase tokenization, padding, and truncation to a maximum length of 512 tokens.
-## Training procedure
-The model was fine-tuned using the Hugging Face Trainer API for binary text classification. It was trained for 3 epochs with a batch size of 16, using the AdamW optimizer and a linear learning rate scheduler. The dataset was tokenized with distilbert-base-uncased, and evaluation was performed on a validation split using metrics like accuracy, precision, recall, and F1-score.
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 2e-05
-- train_batch_size: 16
-- eval_batch_size: 16
-- seed: 42
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: linear
-- num_epochs: 3
-### Training results
-| Training Loss | Epoch | Step  | Validation Loss | Accuracy | F1  | Precision | Recall |
-|:-------------:|:-----:|:-----:|:---------------:|:--------:|:---:|:---------:|:------:|
-| 0.0           | 1.0   | 3375  | 0.0000          | 1.0      | 1.0 | 1.0       | 1.0    |
-| 0.0           | 2.0   | 6750  | 0.0000          | 1.0      | 1.0 | 1.0       | 1.0    |
-| 0.0           | 3.0   | 10125 | 0.0000          | 1.0      | 1.0 | 1.0       | 1.0    |
-### Framework versions
-- Transformers 4.50.3
-- Pytorch 2.6.0+cu124
-- Tokenizers 0.21.1

 license: apache-2.0
 base_model: distilbert-base-uncased
 tags:
+- text-classification
+- binary-classification
+- cybersecurity
+- wireshark
+- distilbert
+- transformers
+- huggingface
 metrics:
 - accuracy
 - f1
   results: []
 ---
+# 🧠 results – DistilBERT for Malicious Traffic Classification
+This model is a fine-tuned version of [`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) for **binary classification of network traffic**. It distinguishes **malicious from benign** packets using preprocessed, Wireshark-style text logs.
+---
+## 📊 Evaluation Results
+| Metric      | Value |
+|-------------|-------|
+| Accuracy    | 1.0   |
+| Precision   | 1.0   |
+| Recall      | 1.0   |
+| F1 Score    | 1.0   |
+| Eval Loss   | 0.0000 |
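+> ⚠️ These perfect scores were measured on the validation split and may not generalize to unseen or noisy real-world traffic. Perfect metrics with a loss of 0.0000 can also indicate an overly easy split or overlap between training and validation data, so evaluate on independent, diverse inputs before relying on the model.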
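
The four scores in the table all derive from the confusion counts of a binary classifier. The sketch below shows how they are computed, assuming labels `0` (benign) and `1` (malicious). The function name `binary_metrics` and the toy labels are illustrative and are not taken from the training run.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = malicious)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example with made-up labels: one malicious sample is missed.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

A model reaches 1.0 on all four metrics only when it produces no false positives and no false negatives on the evaluated split.
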
+---
+## 🧩 Model Description
+This model uses the lightweight and efficient **DistilBERT** transformer, fine-tuned for binary classification. Input data should be short text sequences (e.g., protocol descriptions, IP headers, or Wireshark logs).
+---
+## 💡 Intended Use & Limitations
+### ✅ Intended Uses
+- **Malicious traffic detection** (from packet text)
+- **Intrusion detection system (IDS)** aid
+- Sentiment analysis or spam detection (if retrained)
+### ❌ Limitations
+- Handles English, network-related text only
+- Binary output only (0 = benign, 1 = malicious)
+- Not trained on raw PCAP files; packets must first be converted to text
+---
+## 🏋️ Training Procedure
+- Model: `distilbert-base-uncased`
+- Framework: `Transformers` Trainer API
+- Optimizer: AdamW
+- Learning rate: 2e-05
+- Scheduler: linear learning-rate decay
+- Epochs: 3
+- Batch Size: 16
+- Seed: 42
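
To make the linear decay concrete, here is a minimal sketch of the schedule. The base learning rate of 2e-05 and the 10125 total steps come from the earlier auto-generated version of this card (removed in this diff). The warmup of 0 steps is an assumption, since the card does not state one, and `linear_lr` is a hypothetical helper, not a Transformers API.

```python
def linear_lr(step, base_lr=2e-5, total_steps=10125, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup then linear decay.

    warmup_steps=0 is an assumption; the model card does not report warmup.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start and at the end of each of the 3 epochs (3375 steps each).
for step in (0, 3375, 6750, 10125):
    print(step, linear_lr(step))
```

Under these assumptions the rate falls by one third of its initial value per epoch and reaches zero at the final step.
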
+---
+## 📊 Training and Evaluation Data
+The model was trained on a custom dataset with binary labels:
+- `input`: stringified packet details (e.g., IPs, protocol, flags)
+- `BinaryLabel`: `0` = benign, `1` = malicious
+Text was tokenized using the DistilBERT tokenizer, with padding and truncation to a maximum length of 512 tokens.
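
The card does not document exactly how packets were serialized into the `input` column. As a rough sketch, the helper below builds a string in the same `Key:value` style as the sample text in the usage section (`SrcIP:... DstIP:... Protocol:... Flags:...`). The function name `packet_to_text` and the dictionary keys are hypothetical; adjust them to match your own Wireshark export.

```python
def packet_to_text(packet):
    """Serialize selected packet fields into a single 'Key:value' string.

    The field names and their order are assumptions for illustration only.
    """
    fields = [
        ("SrcIP", "src_ip"),
        ("DstIP", "dst_ip"),
        ("Protocol", "protocol"),
        ("Flags", "flags"),
    ]
    # Skip fields that are missing or empty so the string stays compact.
    parts = [
        f"{label}:{packet[key]}"
        for label, key in fields
        if packet.get(key) not in (None, "")
    ]
    return " ".join(parts)


row = {"src_ip": "10.0.0.1", "dst_ip": "192.168.1.1", "protocol": "TCP", "flags": "SYN"}
text = packet_to_text(row)
print(text)
```

Whatever format you choose, it should match the one used at training time; a model fine-tuned on one serialization is unlikely to behave reliably on another.
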
+---
+## 🧪 Example Usage
+### 🔌 Hugging Face Pipeline (Single Prediction)
+```python
+from transformers import pipeline
+# Load from Hugging Face Hub
+classifier = pipeline("text-classification", model="TanmaySK/results")
+# Predict
+text = "SrcIP:10.0.0.1 DstIP:192.168.1.1 Protocol:TCP Flags:SYN"
+result = classifier(text)
+# Interpret label
+label_map = {"LABEL_0": "Benign", "LABEL_1": "Malicious"}
+print(f"Prediction: {label_map[result[0]['label']]} (Confidence: {result[0]['score']:.4f})")
+```
+### 📁 CSV Batch Prediction (Local Wireshark Data)
+```python
+import pandas as pd
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+# Load model and tokenizer from the Hugging Face Hub
+model = AutoModelForSequenceClassification.from_pretrained("TanmaySK/results")
+tokenizer = AutoTokenizer.from_pretrained("TanmaySK/results")
+# Device setup: use the GPU when available, and switch to inference mode
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.to(device)
+model.eval()
+# Load CSV
+df = pd.read_csv("wireshark_unlabeled.csv")  # Must have an 'input' column
+label_map = {0: "Benign", 1: "Malicious"}
+predictions = []
+# Predict each row
+for text in df["input"]:
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
+    # Move tensors to the model's device; DistilBERT does not accept token_type_ids
+    inputs = {k: v.to(device) for k, v in inputs.items() if k != "token_type_ids"}
+    with torch.no_grad():
+        logits = model(**inputs).logits
+        pred = torch.argmax(logits, dim=1).item()
+        predictions.append(pred)
+# Save results
+df["PredictedLabel"] = predictions
+df["PredictionText"] = [label_map[p] for p in predictions]
+df.to_csv("wireshark_predictions.csv", index=False)
+print("✅ Saved to wireshark_predictions.csv")
+```