SamSec007
/

phishbyte

@@ -1,20 +1,130 @@
----
-library_name: phishbyte
-license: mit
-pipeline_tag: text-classification
-tags:
-- cascading-inference
-- email-security
-- from-scratch
-- lightweight
-- model_hub_mixin
-- no-pretrained-weights
-- phishing-detection
-- pytorch
-- pytorch_model_hub_mixin
----
-This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
-- Code: https://github.com/AnonymousSingh-007/Phish_Byte
-- Paper: [More Information Needed]
-- Docs: https://github.com/AnonymousSingh-007/Phish_Byte#readme

+---
+language: en
+license: mit
+library_name: phishbyte
+pipeline_tag: text-classification
+tags:
+  - phishing-detection
+  - email-security
+  - pytorch
+  - from-scratch
+  - no-pretrained-weights
+  - cascading-inference
+  - lightweight
+  - explainable-ai
+datasets:
+  - CEAS-2008
+metrics:
+  - f1
+  - precision
+  - recall
+  - accuracy
+model-index:
+  - name: phishbyte
+    results:
+      - task:
+          type: text-classification
+          name: Phishing Email Detection
+        dataset:
+          name: CEAS-2008
+          type: ceas-2008
+        metrics:
+          - type: f1
+            value: 0.948
+          - type: accuracy
+            value: 0.944
+          - type: precision
+            value: 0.954
+          - type: recall
+            value: 0.943
+---
+# Phish_Byte
+A from-scratch PyTorch model for **email phishing detection**.
+**F1 0.948** on CEAS-2008. **12,545 parameters** (≈5,000× smaller than DistilBERT's ~66 million).
+**1,500+ emails/sec** on a laptop GPU. Every verdict explains itself.
+## Why this exists
+Most phishing detection models on the Hugging Face Hub are fine-tuned
+transformers (DistilBERT, BERT, RoBERTa): 65 to 110 million parameters,
+~250 MB or more on disk, ~50 ms per email on GPU. Phish_Byte takes a
+different bet: a small custom MLP trained from scratch, fed by 29
+hand-picked features, and placed behind fast rule scorers in a cascading
+inference pipeline. The model is roughly **5,000× smaller** than
+DistilBERT, reaches F1 0.948 on CEAS-2008, runs without a GPU, and
+explains every decision.
+## Usage
+```python
+from phishbyte import PhishByteEngine
+engine  = PhishByteEngine.from_pretrained("SamSec007/phishbyte")
+# raw_email_string: the full raw message (headers and body) as a str
+verdict = engine.analyze(raw_email_string)
+print(verdict.label)             # 'phishing'
+print(verdict.probability)       # 0.9735
+print(verdict.confidence)        # 'high'
+print(verdict.layer_used)        # 2 — MLP made this call
+print(verdict.feature_weights)   # full per-feature attribution
+```
+## Architecture
+```
+Layer 1 — rule scorers (~1 ms): domain + URL + SPF + subject
+            │
+            ├──► obvious phishing? short-circuit verdict
+            │
+            └──► otherwise route to MLP
+                       │
+Layer 2 — MLP (~3 ms): 29 → 96 → 48 → 1 (sigmoid)
+            │
+            ▼
+        PhishVerdict {label, probability, confidence, layer_used, feature_weights}
+```
+## Performance (CEAS-2008, n=2000 held-out)
+| Metric           | Value     |
+|------------------|----------:|
+| F1 score         | **0.948** |
+| Accuracy         | 94.40%    |
+| Precision        | 0.9537    |
+| Recall           | 0.9432    |
+| Parameters       | 12,545    |
+| Model size       | ~50 KB    |
+| Throughput (GPU) | 1,527 /s  |
+| Throughput (CPU) | ~800 /s   |
+## Features (29 inputs)
+- **Domain (5)**: From/Reply-To/Return-Path mismatch, freemail flag, brand impersonation
+- **URL (5)**: HTTPS ratio, anchor mismatch, suspicious TLD, urgency, link density
+- **SPF (3)**: SPF fail, no record, no sending IP
+- **Subject (7)**: urgency, security theme, brand name, currency, all caps, fake RE, fake transaction ID
+- **Character-level (5)**: caps ratio, digit ratio, special chars, avg word length, HTML/text ratio
+- **Composite (4)**: one normalized score per Layer 1 rule scorer (domain, URL, SPF, subject)
+## Limitations
+- About 5.6% of held-out decisions are wrong (accuracy 94.40%). Use it as one signal in defence-in-depth, not as the only gate.
+- Trained on CEAS-2008, which is English-language email from 2008. Expect degraded performance on modern attack patterns and non-English emails.
+- SPF validation is bypassed for training (historical domains don't resolve) but runs live at inference time, so the three SPF features may behave differently in production than they did in training.
+- Adversarial emails crafted specifically to game these features will get through.
+## Citation
+```bibtex
+@software{phishbyte2026,
+  author  = {Singh, Samratth},
+  title   = {Phish_Byte: A cascading from-scratch PyTorch model for email phishing detection},
+  year    = {2026},
+  url     = {https://github.com/AnonymousSingh-007/Phish_Byte}
+}
+```
+## License
+MIT
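
As a rough sanity check on the parameter count in the card above, the sketch below counts weights and biases for a fully connected stack with the widths shown in the architecture diagram (29 → 96 → 48 → 1). It assumes the MLP contains only dense layers, which the card does not confirm. `dense_params` and `batchnorm_params` are hypothetical helper names, not part of the `phishbyte` package. Under that assumption the diagram's widths give 7,585 parameters, not the reported 12,545. A hypothetical 29 → 128 → 64 → 1 stack with a batch-norm layer after each hidden layer would give exactly 12,545, so the diagram and the reported count may describe different revisions of the model.

```python
def dense_params(widths):
    """Weights plus biases for a fully connected stack with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


def batchnorm_params(hidden_widths):
    """Learnable scale and shift (2 per unit), one batch-norm layer per hidden width."""
    return sum(2 * width for width in hidden_widths)


# Widths exactly as written in the architecture diagram (dense layers only: an assumption).
stated = dense_params([29, 96, 48, 1])

# Hypothetical alternative: wider hidden layers plus batch norm (not stated in the card).
alt = dense_params([29, 128, 64, 1]) + batchnorm_params([128, 64])

print(stated)  # 7585
print(alt)     # 12545
```

At 4 bytes per float32 value, 12,545 parameters come to about 50 KB, which agrees with the model size in the performance table.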