---
language: en
license: mit
tags:
- security
- ner
- vulnerability-detection
- codebert
- lora
library_name: transformers
pipeline_tag: token-classification
---

# Vulnerability Extractor - CodeBERT with LoRA

## Model Description

This model extracts vulnerability indicators from security logs using Named Entity Recognition (NER).

**Task**: Token Classification / Named Entity Recognition

**Base Model**: microsoft/codebert-base

**Fine-tuning Method**: LoRA (about 98% fewer trainable parameters than full fine-tuning)

## Extracted Entities

- **SOFTWARE**: Software/service names (e.g., Apache, nginx, OpenSSL)
- **VERSION**: Version numbers (e.g., 2.4.49, 1.1.0)
- **ERROR**: Error types (e.g., buffer overflow, authentication failure)
- **EXPLOIT**: Exploit hints (e.g., Heartbleed, path traversal)
- **IP**: IP addresses
- **PORT**: Port numbers
- **USER**: Usernames
- **PATH**: File paths

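Because IP and PORT entities have strict formats, extracted values of those types can be sanity-checked after the model runs. The following is a minimal sketch of such a check and is not part of the model itself: the helper name `is_plausible` and the sample entities are hypothetical, and the dictionaries are assumed to use the same `text`/`type` keys as the Usage example.

```python
import ipaddress


def is_plausible(entity):
    """Return False for IP/PORT entities whose text cannot be a valid value."""
    text, kind = entity["text"].strip(), entity["type"]
    if kind == "IP":
        try:
            ipaddress.ip_address(text)  # accepts IPv4 and IPv6 literals
            return True
        except ValueError:
            return False
    if kind == "PORT":
        return text.isascii() and text.isdigit() and 0 <= int(text) <= 65535
    return True  # other entity types have no strict format to check


# Hand-written sample entities (not real model output)
entities = [
    {"text": "192.168.1.10", "type": "IP"},
    {"text": "999.1.1.1", "type": "IP"},
    {"text": "443", "type": "PORT"},
    {"text": "70000", "type": "PORT"},
    {"text": "nginx", "type": "SOFTWARE"},
]
kept = [e for e in entities if is_plausible(e)]
print([e["text"] for e in kept])
```

A filter like this only removes values that are impossible; it does not confirm that a surviving entity was labeled correctly.
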
## Performance

- **Entity Recognition F1**: ~0.88
- **Inference Speed**: ~60 ms per log (GPU)

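This card does not state how the F1 score was computed. A common convention for NER is entity-level F1, where a prediction counts only if both its span and its type match the reference exactly. The sketch below illustrates that definition on made-up data; the function name `entity_f1` and the sample spans are hypothetical and are not taken from this model's evaluation code.

```python
def entity_f1(gold, pred):
    """Entity-level F1: an entity counts only if span and type both match."""
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (start, end, type) character spans for one log line
gold = [(0, 6, "SOFTWARE"), (7, 13, "VERSION"), (14, 28, "EXPLOIT")]
pred = [(0, 6, "SOFTWARE"), (7, 13, "VERSION"), (14, 20, "EXPLOIT")]

score = entity_f1(gold, pred)
print(f"F1 = {score:.3f}")  # F1 = 0.667
```

Here the truncated EXPLOIT span counts as both a false positive and a false negative, which is why partial matches lower the score under this definition.
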
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("Swapnanil09/vulnerability-extractor")
model = AutoModelForTokenClassification.from_pretrained("Swapnanil09/vulnerability-extractor")
model.eval()

# Extract vulnerabilities
log = "Apache 2.4.49 path traversal attack attempt detected"
# Truncate to the 128-token input limit stated under Model Details
inputs = tokenizer(log, return_tensors="pt", truncation=True, max_length=128, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode entities
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

entities = []
current_entity = None

for token, label in zip(tokens, labels):
    # Skip special tokens added by the tokenizer
    if token in ['<s>', '</s>', '<pad>']:
        continue
    if label.startswith('B-'):
        # A new entity starts: store the previous one, if any
        if current_entity:
            entities.append(current_entity)
        current_entity = {'text': token.replace('Ġ', ' ').strip(), 'type': label[2:]}
    elif label.startswith('I-') and current_entity:
        # 'Ġ' marks a leading space in the byte-level BPE vocabulary
        current_entity['text'] += token.replace('Ġ', ' ')
    else:
        # An 'O' label (or a stray 'I-' label) ends the current entity
        if current_entity:
            entities.append(current_entity)
        current_entity = None

if current_entity:
    entities.append(current_entity)

print(f"Entities: {entities}")
```

## Model Details

- **Parameters**: ~125M (only ~2M trainable with LoRA)
- **Input**: Security log text (max 128 tokens)
- **Output**: Token-level entity labels (BIO tagging)
- **Entity Types**: 8 types + O (outside)

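The figures above can be checked with simple arithmetic. Under standard BIO tagging, 8 entity types give one `B-` and one `I-` label each plus a single `O` label, and the approximate parameter counts give the trainable share. The sketch below assumes that standard scheme; the label ordering is an assumption, and the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Entity types as listed in this card
ENTITY_TYPES = ["SOFTWARE", "VERSION", "ERROR", "EXPLOIT", "IP", "PORT", "USER", "PATH"]

# BIO tagging: one B- and one I- label per type, plus a single O label.
# The ordering here is an assumption; check model.config.id2label for the real one.
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
print(len(labels))  # 17

# Approximate parameter counts quoted in this card
total_params = 125_000_000
trainable_params = 2_000_000
trainable_share = trainable_params / total_params
print(f"{trainable_share:.1%} trainable, {1 - trainable_share:.1%} frozen")
# 1.6% trainable, 98.4% frozen
```
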
## Use Cases

1. Automated vulnerability scanning
2. Security log analysis
3. Threat intelligence extraction
4. CVE mapping preparation

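This card does not describe how CVE mapping is carried out. One plausible preparation step is to pair each SOFTWARE entity with the VERSION entity that follows it, so the pair can later be used as a lookup key. The sketch below shows that idea only; the helper `pair_software_versions` is hypothetical, and the entity list is hand-written in the `text`/`type` format of the Usage example rather than real model output.

```python
def pair_software_versions(entities):
    """Attach each VERSION to the most recent SOFTWARE entity before it."""
    products = []
    for entity in entities:
        if entity["type"] == "SOFTWARE":
            products.append({"software": entity["text"], "version": None})
        elif (
            entity["type"] == "VERSION"
            and products
            and products[-1]["version"] is None
        ):
            products[-1]["version"] = entity["text"]
    return products


# Hand-written sample entities (not real model output)
entities = [
    {"text": "Apache", "type": "SOFTWARE"},
    {"text": "2.4.49", "type": "VERSION"},
    {"text": "path traversal", "type": "EXPLOIT"},
]
products = pair_software_versions(entities)
print(products)  # [{'software': 'Apache', 'version': '2.4.49'}]
```

This pairing relies on the version appearing after the software name in the log, which will not hold for every log format.
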
## Limitations

- Extraction accuracy depends on the log format
- May miss entities with unusual formatting
- Recognizes only the eight entity types listed under Extracted Entities

## Citation

```bibtex
@misc{vulnerability-extractor,
  author = {Your Name},
  title = {Vulnerability Extractor with CodeBERT and LoRA},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Swapnanil09/vulnerability-extractor}}
}
```

## License

MIT License