metadata
language: en
license: mit
tags:
  - security
  - ner
  - vulnerability-detection
  - codebert
  - lora
library_name: transformers
pipeline_tag: token-classification

Vulnerability Extractor - CodeBERT with LoRA

Model Description

This model extracts vulnerability indicators from security logs using Named Entity Recognition (NER).

Task: Token Classification / Named Entity Recognition

Base Model: microsoft/codebert-base

Fine-tuning Method: LoRA (about 98% fewer trainable parameters than full fine-tuning)
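
The reduction figure can be checked against the approximate parameter counts listed under Model Details (~125M total, ~2M trainable). A minimal sketch of that arithmetic, using those rounded counts rather than exact values:

```python
# Approximate counts from the Model Details section of this card.
total_params = 125_000_000      # CodeBERT base, ~125M
trainable_params = 2_000_000    # LoRA adapter weights, ~2M

trainable_fraction = trainable_params / total_params
reduction = 1 - trainable_fraction

print(f"Trainable: {trainable_fraction:.1%}")   # Trainable: 1.6%
print(f"Reduction: {reduction:.1%}")            # Reduction: 98.4%
```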

Extracted Entities

  • SOFTWARE: Software/service names (e.g., Apache, nginx, OpenSSL)
  • VERSION: Version numbers (e.g., 2.4.49, 1.1.0)
  • ERROR: Error types (e.g., buffer overflow, authentication failure)
  • EXPLOIT: Exploit hints (e.g., Heartbleed, path traversal)
  • IP: IP addresses
  • PORT: Port numbers
  • USER: Usernames
  • PATH: File paths

Performance

  • Entity Recognition F1: ~0.88
  • Inference Speed: ~60ms per log (GPU)
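
This card does not state how the F1 score was computed. A common convention for NER is entity-level F1, where a prediction counts only if both its text and its type match a gold entity exactly. The sketch below assumes that convention; `entity_f1` is a hypothetical helper and the data is hand-written for illustration, not taken from the model's evaluation set.

```python
def entity_f1(gold, predicted):
    """Entity-level F1: an entity counts only if text and type both match exactly."""
    # Sets ignore duplicate mentions, which is a simplification.
    gold_set = {(e["text"], e["type"]) for e in gold}
    pred_set = {(e["text"], e["type"]) for e in predicted}
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Illustrative data only.
gold = [
    {"text": "Apache", "type": "SOFTWARE"},
    {"text": "2.4.49", "type": "VERSION"},
    {"text": "path traversal", "type": "EXPLOIT"},
]
predicted = [
    {"text": "Apache", "type": "SOFTWARE"},
    {"text": "2.4.49", "type": "VERSION"},
    {"text": "attack", "type": "EXPLOIT"},
]
print(round(entity_f1(gold, predicted), 3))  # 0.667
```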

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("Swapnanil09/vulnerability-extractor")
model = AutoModelForTokenClassification.from_pretrained("Swapnanil09/vulnerability-extractor")

# Tokenize one log line (the model accepts at most 128 tokens)
log = "Apache 2.4.49 path traversal attack attempt detected"
inputs = tokenizer(log, return_tensors="pt", truncation=True, padding=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

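# Map token ids and predicted label ids back to strings
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Merge BIO-tagged tokens into entity spans
entities = []
current_entity = None

for token, label in zip(tokens, labels):
    if token in ['<s>', '</s>', '<pad>']:
        continue
    if label.startswith('B-'):
        # A B- tag always starts a new entity
        if current_entity:
            entities.append(current_entity)
        current_entity = {'text': token.replace('Ġ', ' ').strip(), 'type': label[2:]}
    elif label.startswith('I-') and current_entity and label[2:] == current_entity['type']:
        # An I- tag of the same type continues the entity ('Ġ' marks a preceding space)
        current_entity['text'] += token.replace('Ġ', ' ')
    else:
        # An O tag (or a stray I- tag) ends the current entity
        if current_entity:
            entities.append(current_entity)
        current_entity = None

if current_entity:
    entities.append(current_entity)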

print(f"Entities: {entities}")
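
The `entities` list produced above is a flat sequence of `{'text': ..., 'type': ...}` dictionaries. For downstream use it is often handier to group the texts by entity type. The following is a minimal sketch; `group_by_type` is a hypothetical helper, and the sample input is hand-written in the same shape, not real model output.

```python
from collections import defaultdict

def group_by_type(entities):
    """Collect entity texts under their type, dropping blanks and duplicates."""
    grouped = defaultdict(list)
    for entity in entities:
        text = entity["text"].strip()
        if text and text not in grouped[entity["type"]]:
            grouped[entity["type"]].append(text)
    return dict(grouped)

# Hand-written sample, not real model output.
sample = [
    {"text": "Apache", "type": "SOFTWARE"},
    {"text": "2.4.49", "type": "VERSION"},
    {"text": "path traversal", "type": "EXPLOIT"},
    {"text": "Apache", "type": "SOFTWARE"},
]
print(group_by_type(sample))
# {'SOFTWARE': ['Apache'], 'VERSION': ['2.4.49'], 'EXPLOIT': ['path traversal']}
```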

Model Details

  • Parameters: ~125M (only ~2M trainable with LoRA)
  • Input: Security log text (max 128 tokens)
  • Output: Token-level entity labels (BIO tagging)
  • Entity Types: 8 types + O (outside)
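
To make the BIO scheme concrete, the sketch below builds a label set for the eight entity types. It assumes every type has both a `B-` (begin) and an `I-` (inside) tag, which would give 17 labels; the label order here is illustrative, and the authoritative mapping is whatever `model.config.id2label` contains.

```python
ENTITY_TYPES = ["SOFTWARE", "VERSION", "ERROR", "EXPLOIT", "IP", "PORT", "USER", "PATH"]

# Assumption: one B- and one I- tag per type, plus O for tokens outside any entity.
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}

print(len(labels))  # 17

# Hand-written word-level tagging of a sample log, for illustration only.
words = ["Apache", "2.4.49", "path", "traversal", "attack"]
tags = ["B-SOFTWARE", "B-VERSION", "B-EXPLOIT", "I-EXPLOIT", "O"]
for word, tag in zip(words, tags):
    print(f"{word:<10} {tag}")
```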

Use Cases

  1. Automated vulnerability scanning
  2. Security log analysis
  3. Threat intelligence extraction
  4. CVE mapping preparation
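
As an illustration of the CVE mapping preparation use case, extracted entities can be reduced to (software, version) pairs that a later lookup step could consume. This is a sketch under a simple assumption: a VERSION entity belongs to the SOFTWARE entity immediately before it. `software_version_pairs` is a hypothetical helper, the sample input is hand-written, and the CVE lookup itself is not shown.

```python
def software_version_pairs(entities):
    """Pair each SOFTWARE entity with the VERSION entity that directly follows it."""
    pairs = []
    # Shift the list by one so each entity is seen alongside its successor.
    for current, following in zip(entities, entities[1:] + [None]):
        if current["type"] != "SOFTWARE":
            continue
        has_version = following is not None and following["type"] == "VERSION"
        pairs.append((current["text"], following["text"] if has_version else None))
    return pairs

# Hand-written sample, not real model output.
sample = [
    {"text": "Apache", "type": "SOFTWARE"},
    {"text": "2.4.49", "type": "VERSION"},
    {"text": "path traversal", "type": "EXPLOIT"},
    {"text": "OpenSSL", "type": "SOFTWARE"},
]
print(software_version_pairs(sample))
# [('Apache', '2.4.49'), ('OpenSSL', None)]
```

Software mentioned without an adjacent version is kept with `None`, so a later step can decide whether to query by name alone.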

Limitations

  • Entity extraction accuracy depends on log format
  • May miss entities with unusual formatting
  • Recognizes only the eight entity types listed under Extracted Entities

Citation

@misc{vulnerability-extractor,
  author = {Your Name},
  title = {Vulnerability Extractor with CodeBERT and LoRA},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Swapnanil09/vulnerability-extractor}}
}

License

MIT License