# WAF-DistilBERT: Web Application Firewall using DistilBERT

## Model Description

WAF-DistilBERT is a fine-tuned version of DistilBERT trained to detect malicious web requests in real time. It is intended to serve as the core classification component of a Web Application Firewall (WAF) system.

## Intended Use

This model is designed for:

- Real-time detection of malicious web requests
- Integration into web application security systems
- Identifying common web attacks such as SQL injection, XSS, and path traversal
- Enhancing existing security infrastructure

### Out-of-Scope Use Cases

This model should not be used as:

- The sole security measure for web applications
- A replacement for traditional rule-based WAF systems
- A tool for generating malicious payloads
- A security measure for non-HTTP traffic

## Training Data

The model was trained on the CSIC 2010 HTTP Dataset, which includes:

- Normal HTTP requests
- Various attack patterns, including SQL injection, XSS, and buffer overflow
- A balanced distribution of benign and malicious requests

### Training Procedure

- Base model: DistilBERT-base-uncased
- Training type: Fine-tuning
- Training hardware: NVIDIA GPU
- Number of epochs: 3
- Batch size: 32
- Learning rate: 2e-5
- Optimizer: AdamW
- Loss function: Binary Cross-Entropy

## Performance and Limitations

### Performance Metrics

- Accuracy: >95%
- F1-Score: >0.94
- False Positive Rate: <1%
- Average inference time: <100 ms per request

### Limitations

- Limited to HTTP request analysis
- May require retraining for organization-specific traffic patterns
- Performance may vary on zero-day attacks
- Best used in conjunction with traditional security measures

## Bias and Risks

### Bias

The model may be biased toward:

- Attack patterns common in the training data
- English-language payloads
- HTTP requests that follow standard web-framework conventions

### Risks

- False positives may block legitimate traffic
- False negatives could allow attacks through
- Regular updates may be required to maintain effectiveness
- Resource consumption can grow under high load

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained("jacpacd/waf-distilbert")
model = AutoModelForSequenceClassification.from_pretrained("jacpacd/waf-distilbert")

# Tokenize a raw HTTP request string (truncated to the model's 512-token limit)
request = "GET /admin?id=1 OR 1=1"
inputs = tokenizer(request, return_tensors="pt", truncation=True, max_length=512)

# Score the request; the sigmoid output is the estimated probability
# that the request is malicious
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.sigmoid(outputs.logits)
is_malicious = prediction.item() > 0.5
confidence = prediction.item()
```

## Environmental Impact

- Model size: ~268 MB
- Inference energy cost: Low (compared to larger models)
- Training energy cost: Moderate

## Technical Specifications

- Model architecture: DistilBERT
- Language(s): English
- License: MIT
- Input format: Text (raw HTTP requests)
- Output format: Binary classification with a confidence score
- Model size: ~268 MB
- Number of parameters: ~66M

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{waf-distilbert,
  author       = {jacpacd},
  title        = {WAF-DistilBERT: Web Application Firewall using DistilBERT},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/jacpacd/waf-distilbert}}
}
```

## Contact

For questions and feedback about the model, please:

- Open an issue on GitHub
- Contact through Hugging Face
- Submit pull requests for improvements
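
## Appendix: Batched Screening Sketch

The Usage section scores one request at a time. For screening traffic in bulk, a small wrapper can batch requests through the tokenizer and expose the decision threshold explicitly. The sketch below is illustrative and not part of any released package: the `RequestScreener` class, its `screen` method, and the threshold default are hypothetical names and values, and the sketch assumes the classification head emits a single sigmoid logit, consistent with the Binary Cross-Entropy loss listed under Training Procedure.

```python
from typing import List, Tuple

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


class RequestScreener:
    """Hypothetical convenience wrapper for batch screening (not part of this repo)."""

    def __init__(self, model_name: str = "jacpacd/waf-distilbert", threshold: float = 0.5):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()
        # Raising the threshold trades false positives for false negatives
        self.threshold = threshold

    @torch.no_grad()
    def screen(self, requests: List[str]) -> List[Tuple[bool, float]]:
        # Pad to the longest request in the batch; truncate to the 512-token limit
        inputs = self.tokenizer(
            requests, return_tensors="pt", truncation=True, max_length=512, padding=True
        )
        logits = self.model(**inputs).logits
        # Assumes a single-logit head (BCE training); squeeze (batch, 1) -> (batch,)
        scores = torch.sigmoid(logits).squeeze(-1)
        return [(s.item() > self.threshold, s.item()) for s in scores]


# Example usage:
# screener = RequestScreener(threshold=0.7)
# verdicts = screener.screen(["GET /index.html", "GET /admin?id=1 OR 1=1"])
```

In a deployment, the threshold would typically be tuned against the false-positive budget noted under Performance Metrics rather than left at 0.5.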