	---
	language: en
	license: mit
	tags:
	- security
	- ner
	- vulnerability-detection
	- codebert
	- lora
	library_name: transformers
	pipeline_tag: token-classification
	---

	# Vulnerability Extractor - CodeBERT with LoRA

	## Model Description

	This model extracts vulnerability indicators from security logs using Named Entity Recognition (NER).

	Task: Token Classification / Named Entity Recognition

	Base Model: microsoft/codebert-base

	Fine-tuning Method: LoRA (about 98% fewer trainable parameters than full fine-tuning)
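To give a feel for where a reduction of this size comes from, here is a minimal sketch of the LoRA parameter arithmetic for a single weight matrix. LoRA freezes a weight of shape `d_out x d_in` and trains two low-rank factors instead, which adds `rank * (d_in + d_out)` trainable weights. The hidden size of 768 matches `microsoft/codebert-base`; the rank of 8 is a hypothetical value chosen for illustration, since this card does not state the rank or the target modules that were actually used.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights LoRA adds for one frozen d_out x d_in matrix."""
    # LoRA trains A (rank x d_in) and B (d_out x rank) instead of the full matrix.
    return rank * (d_in + d_out)


hidden_size = 768  # hidden size of microsoft/codebert-base
rank = 8           # hypothetical rank, not taken from this model card

full_params = hidden_size * hidden_size
lora_params = lora_trainable_params(hidden_size, hidden_size, rank)
reduction = 1 - lora_params / full_params

print(f"{lora_params} trainable weights instead of {full_params} ({reduction:.1%} fewer)")
```

Under these assumptions a single attention projection trains about 2% of the weights that full fine-tuning would. The figure reported for the whole model may be computed differently.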

	## Extracted Entities

	- SOFTWARE: Software/service names (e.g., Apache, nginx, OpenSSL)
	- VERSION: Version numbers (e.g., 2.4.49, 1.1.0)
	- ERROR: Error types (e.g., buffer overflow, authentication failure)
	- EXPLOIT: Exploit hints (e.g., Heartbleed, path traversal)
	- IP: IP addresses
	- PORT: Port numbers
	- USER: Usernames
	- PATH: File paths

	## Performance

	- Entity Recognition F1: ~0.88
	- Inference Speed: ~60ms per log (GPU)
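The card does not say how the F1 score was computed. One common convention for NER is entity-level F1 with exact matching, where a prediction counts only if both its type and its text match a gold entity. The sketch below illustrates that convention on made-up data; the function name `entity_f1` and the example entities are hypothetical and are not taken from the model's evaluation set.

```python
def entity_f1(gold, predicted):
    """Entity-level F1 over exact (log_id, type, text) matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one log line (log_id 0)
gold = {
    (0, "SOFTWARE", "Apache"),
    (0, "VERSION", "2.4.49"),
    (0, "EXPLOIT", "path traversal"),
}
# Hypothetical predictions: the last entity has the wrong type
predicted = {
    (0, "SOFTWARE", "Apache"),
    (0, "VERSION", "2.4.49"),
    (0, "ERROR", "path traversal"),
}

score = entity_f1(gold, predicted)
print(f"Entity-level F1: {score:.3f}")
```

Here two of three predictions match, so precision and recall are both 2/3 and the F1 is also 2/3.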

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	# Load the tokenizer and the fine-tuned model from the Hub
	model_id = "Swapnanil09/vulnerability-extractor"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForTokenClassification.from_pretrained(model_id)
	model.eval()

	# Tokenize a log line (the model accepts at most 128 tokens)
	log = "Apache 2.4.49 path traversal attack attempt detected"
	inputs = tokenizer(log, return_tensors="pt", truncation=True, max_length=128)

	# Predict one label per token
	with torch.no_grad():
	    outputs = model(**inputs)
	predictions = torch.argmax(outputs.logits, dim=-1)

	# Map token ids and label ids back to strings
	tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
	labels = [model.config.id2label[p.item()] for p in predictions[0]]

	# Merge BIO-tagged tokens into entities
	entities = []
	current_entity = None

	for token, label in zip(tokens, labels):
	    if token in tokenizer.all_special_tokens:
	        continue
	    text = token.replace("Ġ", " ")  # "Ġ" marks a leading space in the BPE vocabulary
	    if label.startswith("B-"):
	        if current_entity:
	            entities.append(current_entity)
	        current_entity = {"text": text.strip(), "type": label[2:]}
	    elif label.startswith("I-") and current_entity:
	        current_entity["text"] += text
	    elif current_entity:
	        # An "O" label ends the current entity
	        entities.append(current_entity)
	        current_entity = None

	if current_entity:
	    entities.append(current_entity)

	print(f"Entities: {entities}")
	```
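The decoding loop above can also be factored into a standalone function, which makes it easy to check without downloading the model. The following is a minimal sketch: the function name `decode_bio` is hypothetical, and the tokens and labels are hand-written for illustration, so the real tokenizer and model may produce different values. Unlike the loop above, this variant continues an entity only when the `I-` tag has the same type as the entity being built.

```python
def decode_bio(tokens, labels, special_tokens=("<s>", "</s>", "<pad>")):
    """Merge BIO-tagged BPE tokens into a list of {'text', 'type'} entities."""
    entities, current = [], None
    for token, label in zip(tokens, labels):
        if token in special_tokens:
            continue
        text = token.replace("Ġ", " ")  # "Ġ" marks a leading space
        if label.startswith("B-"):
            if current:
                entities.append(current)
            current = {"text": text.strip(), "type": label[2:]}
        elif label.startswith("I-") and current and label[2:] == current["type"]:
            current["text"] += text
        else:
            # "O", or an "I-" tag that does not continue the open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


# Hand-written tokens and labels (illustrative, not real model output)
tokens = ["<s>", "Apache", "Ġ2", ".", "4", ".", "49",
          "Ġpath", "Ġtraversal", "Ġattack", "</s>"]
labels = ["O", "B-SOFTWARE", "B-VERSION", "I-VERSION", "I-VERSION",
          "I-VERSION", "I-VERSION", "B-EXPLOIT", "I-EXPLOIT", "O", "O"]

entities = decode_bio(tokens, labels)
print(entities)
```

For this input the function returns three entities: `Apache` (SOFTWARE), `2.4.49` (VERSION), and `path traversal` (EXPLOIT).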

	## Model Details

	- Parameters: ~125M (only ~2M trainable with LoRA)
	- Input: Security log text (max 128 tokens)
	- Output: Token-level entity labels (BIO tagging)
	- Entity Types: 8 types + O (outside)
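With BIO tagging, each of the eight entity types gets a `B-` (begin) and an `I-` (inside) tag, plus a single `O` tag, which gives 17 labels in total. The sketch below builds such a label set; the helper name `build_bio_labels` is hypothetical and the ordering shown is an assumption, so check `model.config.id2label` for the mapping the model actually uses.

```python
ENTITY_TYPES = ["SOFTWARE", "VERSION", "ERROR", "EXPLOIT", "IP", "PORT", "USER", "PATH"]


def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    return labels


labels = build_bio_labels(ENTITY_TYPES)
# Assumed ordering; the published model's ids may differ
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 17
```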

	## Use Cases

	1. Automated vulnerability scanning
	2. Security log analysis
	3. Threat intelligence extraction
	4. CVE mapping preparation

	## Limitations

	- Entity extraction accuracy depends on log format
	- May miss entities with unusual formatting
	- Recognizes only the eight entity types listed above

	## Citation

	```bibtex
	@misc{vulnerability-extractor,
	author = {Swapnanil09},
	title = {Vulnerability Extractor with CodeBERT and LoRA},
	year = {2025},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/Swapnanil09/vulnerability-extractor}}
	}
	```

	## License

	MIT License