Swapnanil09

metadata

language: en
license: mit
tags:
  - security
  - ner
  - vulnerability-detection
  - codebert
  - lora
library_name: transformers
pipeline_tag: token-classification

Vulnerability Extractor - CodeBERT with LoRA

Model Description

This model extracts vulnerability indicators from security logs using Named Entity Recognition (NER).

Task: Token Classification / Named Entity Recognition

Base Model: microsoft/codebert-base

Fine-tuning Method: LoRA (~98% fewer trainable parameters than full fine-tuning)
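
As a rough check on the LoRA figure, the approximate parameter counts listed under Model Details (~125M total, ~2M trainable) give the trainable fraction below. The helper `lora_param_count` and the rank and shape values in its example are illustrative assumptions; this card does not state the adapter configuration actually used.

```python
# Rough arithmetic only; counts are the approximate figures from this card.
total_params = 125_000_000      # ~125M parameters in the base model
trainable_params = 2_000_000    # ~2M trainable parameters with LoRA

trainable_fraction = trainable_params / total_params
reduction = 1.0 - trainable_fraction

print(f"Trainable fraction: {trainable_fraction:.1%}")
print(f"Reduction in trainable parameters: {reduction:.1%}")


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_out x d_in weight.

    LoRA learns two low-rank matrices, A (rank x d_in) and B (d_out x rank).
    """
    return rank * d_in + d_out * rank


# Hypothetical example: rank 8 on one 768 x 768 projection.
# The rank and target layers used for this model are NOT stated in the card.
print(lora_param_count(768, 768, 8))
```

With these rounded counts, about 1.6% of the parameters are trained, which is consistent with the ~98% reduction quoted above.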

Extracted Entities

SOFTWARE: Software/service names (e.g., Apache, nginx, OpenSSL)
VERSION: Version numbers (e.g., 2.4.49, 1.1.0)
ERROR: Error types (e.g., buffer overflow, authentication failure)
EXPLOIT: Exploit hints (e.g., Heartbleed, path traversal)
IP: IP addresses
PORT: Port numbers
USER: Usernames
PATH: File paths

Performance

Entity Recognition F1: ~0.88
Inference Speed: ~60 ms per log (GPU)
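
The card does not say how the F1 score was computed. One common convention for NER is entity-level F1, where a predicted entity counts as correct only if both its type and its text match a gold entity exactly. A minimal sketch under that assumption, using hand-written example data (not results from this model):

```python
def entity_f1(gold, predicted):
    """Entity-level precision, recall, and F1 with exact matching.

    Both arguments are collections of (entity_type, text) tuples.
    Duplicate tuples are collapsed, so this is suited to per-log scoring.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only; not produced by the model.
gold = [("SOFTWARE", "Apache"), ("VERSION", "2.4.49"), ("EXPLOIT", "path traversal")]
predicted = [("SOFTWARE", "Apache"), ("VERSION", "2.4.49"), ("ERROR", "attack")]

precision, recall, f1 = entity_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Token-level F1 would usually give a different number for the same predictions, so the two should not be compared directly.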

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("Swapnanil09/vulnerability-extractor")
model = AutoModelForTokenClassification.from_pretrained("Swapnanil09/vulnerability-extractor")

# Extract vulnerabilities
log = "Apache 2.4.49 path traversal attack attempt detected"
# Truncate to the 128-token input limit listed under Model Details
inputs = tokenizer(log, return_tensors="pt", truncation=True, max_length=128, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode entities
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

entities = []
current_entity = None

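for token, label in zip(tokens, labels):
    # Skip special tokens added by the tokenizer
    if token in ['<s>', '</s>', '<pad>']:
        continue
    if label.startswith('B-'):
        # A B- tag starts a new entity; flush the previous one first
        if current_entity:
            entities.append(current_entity)
        current_entity = {'text': token.replace('Ġ', ' ').strip(), 'type': label[2:]}
    elif label.startswith('I-') and current_entity and label[2:] == current_entity['type']:
        # An I- tag of the same type continues the current entity
        current_entity['text'] += token.replace('Ġ', ' ')
    else:
        # An O tag (or a stray I- tag) ends the current entity
        if current_entity:
            entities.append(current_entity)
        current_entity = None

# Flush the last entity if the sequence ended inside one
if current_entity:
    entities.append(current_entity)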

print(f"Entities: {entities}")

Model Details

Parameters: ~125M (only ~2M trainable with LoRA)
Input: Security log text (max 128 tokens)
Output: Token-level entity labels (BIO tagging)
Entity Types: 8 types + O (outside)
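
With BIO tagging, each of the 8 entity types gets a B- (begin) and an I- (inside) label, plus a single O label, for 17 labels in total. The sketch below builds such a label map. The ordering of the labels is an assumption; the authoritative mapping is whatever `model.config.id2label` contains.

```python
ENTITY_TYPES = ["SOFTWARE", "VERSION", "ERROR", "EXPLOIT", "IP", "PORT", "USER", "PATH"]

# "O" first, then a B-/I- pair per entity type. This ordering is an assumption;
# read model.config.id2label for the mapping the model actually uses.
bio_labels = ["O"] + [
    f"{prefix}-{entity_type}"
    for entity_type in ENTITY_TYPES
    for prefix in ("B", "I")
]

id2label = dict(enumerate(bio_labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(bio_labels))   # 17
print(bio_labels[:5])
```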

Use Cases

Automated vulnerability scanning
Security log analysis
Threat intelligence extraction
CVE mapping preparation
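
For CVE mapping preparation, one possible post-processing step is to pair each SOFTWARE entity with the VERSION entity that follows it, producing product/version pairs that a downstream lookup could use. The sketch below assumes the `entities` list format produced in the Usage section. The helper name `pair_software_versions` is hypothetical, and the lookup itself is out of scope here.

```python
def pair_software_versions(entities):
    """Pair each SOFTWARE entity with the next VERSION entity.

    `entities` is a list of {'text': ..., 'type': ...} dicts in log order.
    A SOFTWARE entity that is followed by another SOFTWARE entity, or by
    the end of the list, before any VERSION is paired with None.
    """
    pairs = []
    pending_software = None
    for entity in entities:
        if entity["type"] == "SOFTWARE":
            if pending_software is not None:
                pairs.append((pending_software, None))
            pending_software = entity["text"]
        elif entity["type"] == "VERSION" and pending_software is not None:
            pairs.append((pending_software, entity["text"]))
            pending_software = None
    if pending_software is not None:
        pairs.append((pending_software, None))
    return pairs


# Hand-written example input; not actual model output.
example_entities = [
    {"text": "Apache", "type": "SOFTWARE"},
    {"text": "2.4.49", "type": "VERSION"},
    {"text": "path traversal", "type": "EXPLOIT"},
    {"text": "OpenSSL", "type": "SOFTWARE"},
]
pairs = pair_software_versions(example_entities)
print(pairs)  # [('Apache', '2.4.49'), ('OpenSSL', None)]
```

Pairing by order is a simple heuristic and can mis-associate versions in logs that mention several products, so the pairs are best treated as candidates rather than confirmed matches.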

Limitations

Entity extraction accuracy depends on log format
May miss entities with unusual formatting
Recognizes only the 8 entity types listed under Extracted Entities

Citation

@misc{vulnerability-extractor,
  author = {Your Name},
  title = {Vulnerability Extractor with CodeBERT and LoRA},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Swapnanil09/vulnerability-extractor}}
}

License

MIT License