| # WAF-DistilBERT: Web Application Firewall using DistilBERT | |
| ## Model Description | |
| WAF-DistilBERT is a fine-tuned version of DistilBERT, trained to detect malicious web requests in real time. It serves as the core component of a Web Application Firewall (WAF) system. | |
| ## Intended Use | |
| This model is designed for: | |
| - Real-time detection of malicious web requests | |
| - Integration into web application security systems | |
| - Identifying common web attacks such as SQL injection, XSS, and path traversal | |
| - Enhancing existing security infrastructure | |
| ### Out-of-Scope Use Cases | |
| This model should not be used as: | |
| - The sole security measure for web applications | |
| - A replacement for traditional rule-based WAF systems | |
| - A tool for generating malicious payloads | |
| - A security measure for non-HTTP traffic | |
| ## Training Data | |
| The model was trained on the CSIC 2010 HTTP Dataset, which includes: | |
| - Normal HTTP requests | |
| - Various attack patterns, including SQL injection, XSS, and buffer overflow | |
| - A balanced distribution of benign and malicious requests | |
| ### Training Procedure | |
| - Base model: distilbert-base-uncased | |
| - Training type: Fine-tuning | |
| - Training hardware: NVIDIA GPU | |
| - Number of epochs: 3 | |
| - Batch size: 32 | |
| - Learning rate: 2e-5 | |
| - Optimizer: AdamW | |
| - Loss function: Binary Cross-Entropy | |
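To make the loss function listed above concrete, here is a minimal sketch of binary cross-entropy computed from a raw logit in plain Python. It assumes a single-logit output head and labels of 0 (benign) and 1 (malicious). The function name `bce_with_logits` is illustrative and is not taken from the model's training code.

```python
import math


def bce_with_logits(logit: float, label: int) -> float:
    """Binary cross-entropy for one example, computed from a raw logit.

    Uses the numerically stable form
        max(x, 0) - x * y + log(1 + exp(-|x|))
    which equals -[y * log(p) + (1 - y) * log(1 - p)] with p = sigmoid(x).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Assumed label convention for illustration: 1 = malicious, 0 = benign.
confident_correct = bce_with_logits(4.0, 1)  # high logit, label agrees: small loss
confident_wrong = bce_with_logits(4.0, 0)    # high logit, label disagrees: large loss
```

The loss grows roughly linearly with the logit when the prediction is confidently wrong, which is what pushes the classifier away from overconfident mistakes during fine-tuning.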
| ## Performance and Limitations | |
| ### Performance Metrics | |
| - Accuracy: >95% | |
| - F1-Score: >0.94 | |
| - False Positive Rate: <1% | |
| - Average inference time: <100ms per request | |
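For readers who want to reproduce these metrics on their own traffic, the sketch below shows how accuracy, F1-score, and false positive rate follow from confusion-matrix counts. The counts in the example are hypothetical and chosen only to illustrate the arithmetic; they are not results for this model.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, F1, and false positive rate from confusion counts.

    tp/fn count malicious requests (caught/missed);
    tn/fp count benign requests (passed/wrongly flagged).
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical counts, not measurements of WAF-DistilBERT.
metrics = classification_metrics(tp=90, fp=5, tn=895, fn=10)
```

Note that accuracy can look high on traffic that is mostly benign even when recall is modest, so the false positive rate and F1-score are worth tracking alongside it.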
| ### Limitations | |
| - Limited to HTTP request analysis | |
| - May require retraining for organization-specific traffic patterns | |
| - Performance may degrade on attack patterns absent from the training data, including zero-day attacks | |
| - Best used in conjunction with traditional security measures | |
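One way to combine the model with traditional measures is to run signature rules first and consult the model only when no rule matches. The sketch below illustrates that layering. The rule list, the `should_block` function, and the `stub_scorer` stand-in are all hypothetical; a real deployment would pass a function that calls the model and returns its malicious-probability score.

```python
import re
from typing import Callable

# Hypothetical signature rules, standing in for a real rule-based WAF.
RULES = [
    re.compile(r"\.\./"),                              # path traversal
    re.compile(r"<script\b", re.IGNORECASE),           # script injection (XSS)
    re.compile(r"\bunion\s+select\b", re.IGNORECASE),  # SQL injection
]


def should_block(request: str,
                 score_request: Callable[[str], float],
                 threshold: float = 0.5) -> bool:
    """Block if any rule matches; otherwise defer to the model score."""
    if any(rule.search(request) for rule in RULES):
        return True
    return score_request(request) > threshold


# Stub scorer for demonstration only; it does not call the real model.
def stub_scorer(request: str) -> float:
    return 0.9 if "1=1" in request else 0.1
```

Running the cheap rules first also means the model is not invoked for requests that an existing rule already rejects.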
| ## Bias and Risks | |
| ### Bias | |
| The model may show bias towards: | |
| - Common attack patterns in the training data | |
| - English-language payloads | |
| - HTTP requests that follow the conventions of standard web frameworks | |
| ### Risks | |
| - False positives may block legitimate traffic | |
| - False negatives could allow attacks through | |
| - May require regular updates to maintain effectiveness | |
| - High resource consumption under heavy request load | |
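The trade-off between false positives and false negatives is governed by the decision threshold. As a rough sketch, assuming you have model scores for a validation set of known-benign requests, the hypothetical helper below picks the lowest threshold that keeps the false positive rate within a chosen budget. The scores shown are made up for illustration.

```python
def threshold_for_fpr(benign_scores: list[float], max_fpr: float) -> float:
    """Return the lowest threshold whose false positive rate on the given
    benign scores is at most max_fpr.

    A request is flagged when score > threshold.
    """
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # benign requests we may misflag
    # Because flagging uses a strict comparison, taking the (allowed + 1)-th
    # highest score as the threshold misflags at most `allowed` requests.
    return ranked[allowed] if allowed < len(ranked) else 0.0


# Made-up scores for ten benign validation requests.
benign = [0.05, 0.10, 0.20, 0.30, 0.60, 0.70, 0.02, 0.01, 0.15, 0.40]
threshold = threshold_for_fpr(benign, max_fpr=0.1)
flagged = sum(score > threshold for score in benign)
```

Raising the threshold this way reduces blocked legitimate traffic at the cost of letting more attacks through, so the budget should be set with both risks in mind.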
| ## Usage | |
| ```python | |
| from transformers import AutoTokenizer, AutoModelForSequenceClassification | |
| import torch | |
| # Load model and tokenizer | |
| tokenizer = AutoTokenizer.from_pretrained("jacpacd/waf-distilbert") | |
| model = AutoModelForSequenceClassification.from_pretrained("jacpacd/waf-distilbert") | |
| # Prepare input | |
| request = "GET /admin?id=1 OR 1=1" | |
| inputs = tokenizer(request, return_tensors="pt", truncation=True, max_length=512) | |
| # Make prediction (gradients are not needed for inference) | |
| with torch.no_grad(): | |
|     outputs = model(**inputs) | |
| # Assumes a single output logit; sigmoid maps it to the probability of "malicious" | |
| prediction = torch.sigmoid(outputs.logits) | |
| confidence = prediction.item() | |
| is_malicious = confidence > 0.5 | |
| ``` | |
| ## Environmental Impact | |
| - Model Size: ~268MB | |
| - Inference Energy Cost: Low (compared to larger models) | |
| - Training Energy Cost: Moderate | |
| ## Technical Specifications | |
| - Model Architecture: DistilBERT | |
| - Language(s): English | |
| - License: MIT | |
| - Input Format: Text (HTTP requests) | |
| - Output Format: Binary classification with confidence score | |
| - Model Size: ~268MB | |
| - Number of Parameters: ~67M (the ~66M-parameter DistilBERT encoder plus the classification head) | |
| ## Citation | |
| If you use this model in your research, please cite: | |
| ```bibtex | |
| @misc{waf-distilbert, | |
| author = {jacpacd}, | |
| title = {WAF-DistilBERT: Web Application Firewall using DistilBERT}, | |
| year = {2025}, | |
| publisher = {Hugging Face}, | |
| journal = {Hugging Face model repository}, | |
| howpublished = {\url{https://huggingface.co/jacpacd/waf-distilbert}} | |
| } | |
| ``` | |
| ## Contact | |
| For questions and feedback about the model, please: | |
| - Open an issue on GitHub | |
| - Contact through Hugging Face | |
| - Submit pull requests for improvements |