---
language: en
license: mit
tags:
- security
- ner
- vulnerability-detection
- codebert
- lora
library_name: transformers
pipeline_tag: token-classification
---
# Vulnerability Extractor - CodeBERT with LoRA
## Model Description
This model extracts vulnerability indicators from security logs using Named Entity Recognition (NER).
- **Task**: Token Classification / Named Entity Recognition
- **Base Model**: microsoft/codebert-base
- **Fine-tuning Method**: LoRA (about 98% fewer trainable parameters than full fine-tuning)
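
The 98% figure can be checked against the approximate counts given under Model Details (~125M total parameters, ~2M trainable). A minimal sketch of that arithmetic, using those rounded numbers rather than values read from the model:

```python
# Approximate counts quoted in this model card (rounded, not measured here).
total_params = 125_000_000
trainable_params = 2_000_000

# Share of parameters updated during LoRA fine-tuning.
trainable_fraction = trainable_params / total_params
# Reduction relative to updating every parameter.
reduction = 1 - trainable_fraction

print(f"Trainable fraction: {trainable_fraction:.1%}")
print(f"Reduction in trainable parameters: {reduction:.1%}")
```

With these rounded inputs the reduction comes out at roughly 98%, which matches the figure quoted above.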
## Extracted Entities
- **SOFTWARE**: Software/service names (e.g., Apache, nginx, OpenSSL)
- **VERSION**: Version numbers (e.g., 2.4.49, 1.1.0)
- **ERROR**: Error types (e.g., buffer overflow, authentication failure)
- **EXPLOIT**: Exploit hints (e.g., Heartbleed, path traversal)
- **IP**: IP addresses
- **PORT**: Port numbers
- **USER**: Usernames
- **PATH**: File paths
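
Under BIO tagging, each of the eight entity types gets a `B-` (begin) and an `I-` (inside) label, plus a single `O` label. The sketch below builds such a label set for illustration. The ordering and the `id2label` mapping here are assumptions; the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Entity types listed in this model card.
ENTITY_TYPES = ["SOFTWARE", "VERSION", "ERROR", "EXPLOIT", "IP", "PORT", "USER", "PATH"]

# Assumption: "O" first, then B-/I- pairs in the order above.
# The real order is defined by model.config.id2label and may differ.
labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))   # 8 types * 2 prefixes + "O"
print(labels[:5])
```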
## Performance
- **Entity Recognition F1**: ~0.88
- **Inference Speed**: ~60ms per log (GPU)
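
Latency depends on hardware, batch size, and log length, so it may be worth measuring on your own setup. The helper below is a hypothetical sketch, not the procedure used to obtain the figure above. `measure_latency_ms` and `fake_extract` are illustrative names, and `fake_extract` is a stand-in for a real inference call.

```python
import statistics
import time

def measure_latency_ms(fn, sample, warmup=3, runs=20):
    """Return the median wall-clock time of fn(sample) in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        fn(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

# Stand-in for a real inference function (assumption for illustration only).
def fake_extract(log):
    return log.split()

latency = measure_latency_ms(fake_extract, "Apache 2.4.49 path traversal attack attempt detected")
print(f"Median latency: {latency:.3f} ms")
```

When timing a real model on a GPU, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels run asynchronously.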
## Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the tokenizer and model
model_id = "Swapnanil09/vulnerability-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Run token classification on a single log line (inputs are capped at 128 tokens)
log = "Apache 2.4.49 path traversal attack attempt detected"
inputs = tokenizer(log, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Map each token to its predicted BIO label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Merge B-/I- tagged tokens into entities
entities = []
current_entity = None
for token, label in zip(tokens, labels):
    if token in ("<s>", "</s>", "<pad>"):
        continue
    if label.startswith("B-"):
        if current_entity:
            entities.append(current_entity)
        # "Ġ" marks a leading space in the byte-level BPE vocabulary
        current_entity = {"text": token.replace("Ġ", " ").strip(), "type": label[2:]}
    elif label.startswith("I-") and current_entity:
        current_entity["text"] += token.replace("Ġ", " ")
    else:
        # An "O" label (or a stray "I-") closes any open entity
        if current_entity:
            entities.append(current_entity)
        current_entity = None
if current_entity:
    entities.append(current_entity)

print(f"Entities: {entities}")
```
## Model Details
- **Parameters**: ~125M (only ~2M trainable with LoRA)
- **Input**: Security log text (max 128 tokens)
- **Output**: Token-level entity labels (BIO tagging)
- **Entity Types**: 8 types + O (outside)
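
To illustrate what BIO output looks like once grouped, here is a small self-contained sketch that merges word-level tags into spans. The `group_bio` helper is hypothetical, and the tags are hand-written for illustration, not actual model predictions.

```python
def group_bio(words, tags):
    """Merge word-level BIO tags into (text, entity_type) spans."""
    spans = []
    current = None  # [list of words, entity type] for the open span
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if current:
                spans.append(current)
            current = [[word], tag[2:]]
        elif tag.startswith("I-") and current and current[1] == tag[2:]:
            # An I- tag extends the open span only if the types match.
            current[0].append(word)
        else:
            # "O" or an inconsistent I- tag closes the open span.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(ws), etype) for ws, etype in spans]

# Hand-written example tags (assumption, not model output).
words = ["Apache", "2.4.49", "path", "traversal", "attack", "attempt", "detected"]
tags = ["B-SOFTWARE", "B-VERSION", "B-EXPLOIT", "I-EXPLOIT", "O", "O", "O"]
print(group_bio(words, tags))
# [('Apache', 'SOFTWARE'), ('2.4.49', 'VERSION'), ('path traversal', 'EXPLOIT')]
```

Dropping an `I-` tag whose type does not match the open span is one possible design choice; other decoders treat it as the start of a new span instead.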
## Use Cases
1. Automated vulnerability scanning
2. Security log analysis
3. Threat intelligence extraction
4. CVE mapping preparation
## Limitations
- Entity extraction accuracy depends on log format
- May miss entities with unusual formatting
- Trained on specific entity types only
## Citation
```bibtex
@misc{vulnerability-extractor,
author = {Your Name},
title = {Vulnerability Extractor with CodeBERT and LoRA},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Swapnanil09/vulnerability-extractor}}
}
```
## License
MIT License