	---
	license: mit
	language:
	- en
	tags:
	- cybersecurity
	- vulnerability
	- mitre-attack
	- text-classification
	- fine-tuned
	- securebert
	base_model: ehsanaghaei/SecureBERT
	---

	# SecureBERT — CVE-LMTune ATT&CK Classifier (Flat)

	<div align="center" style="display:inline-flex; gap:18px; align-items:center; flex-wrap:nowrap;"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5b/Logo_Universit%C3%A9_de_Lorraine.svg/1280px-Logo_Universit%C3%A9_de_Lorraine.svg.png" alt="Universite de Lorraine" style="height:50px; width:auto;" /> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/Inr_logo_rouge.svg/1280px-Inr_logo_rouge.svg.png" alt="INRIA" style="height:50px; width:auto;" /> <img src="https://upload.wikimedia.org/wikipedia/fr/6/6e/Logo_loria_abrege_couleur.png" alt="LORIA" style="height:70px; width:auto;" /> <img src="https://www.pepr-cybersecurite.fr/wp-content/uploads/2023/09/pep-cybersecurite-550x250-1.png" alt="SuperViZ" style="height:70px; width:auto;" /> </div>

	[![GitHub](https://img.shields.io/badge/GitHub-CVE--LMTune-black?logo=github)](https://github.com/terranovafr/CVE-LMTune)
	[![Paper](https://img.shields.io/badge/Paper-HAL-green?logo=information&logoColor=white)](https://hal.science/hal-05500820)
	[![PhD theses.fr](https://img.shields.io/badge/Project-theses.fr-orange?logo=university&logoColor=white)](https://theses.fr/s371241)
	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
	[![Zenodo Data](https://img.shields.io/badge/Zenodo-Data%20Repository-lightblue?logo=information&logoColor=white)](https://doi.org/10.5281/zenodo.16936476)

	Part of the CVE-LMTune model suite, a collection of language models fine-tuned for multi-taxonomy vulnerability classification across widely used cybersecurity taxonomies, including CWE, CAPEC, and MITRE ATT&CK.

	## Paper

	> Franco Terranova, Sana Rekbi, Abdelkader Lahmadi, Isabelle Chrisment.
	> Multi-Taxonomy Vulnerability Classification with Hierarchically Finetuned Language Models.
	> The 23rd Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA '26).

	## Overview

	This model performs multi-label ATT&CK classification from vulnerability descriptions. Given a CVE-style description, it predicts one or more ATT&CK identifiers associated with the described vulnerability.

	| Property | Value |
	|----------|-------|
	| Taxonomy | MITRE ATT&CK Enterprise Subtechniques |
	| Task | Multi-label text classification |
	| Input | Vulnerability description (e.g., CVE summary) |
	| Output | One or more ATT&CK identifiers |
	| Number of labels | 175 |
	| Number of samples | 231,009 |
	| Latest CVE update included | 17/06/2026 |
	| Split | train (60%), val (20%), test (20%) |

	## Evaluation Results

	The model was evaluated on the held-out test set using standard multi-label classification metrics. Label probabilities are obtained with a sigmoid activation, and the threshold-based metrics use a default threshold of 0.5.

	Ranking Metrics

	| LRAP | MRR | Coverage Error | Label Ranking Loss | P@1 | P@3 | P@5 | R@1 | R@3 | R@5 |
	|------|-----|----------------|--------------------|-----|-----|-----|-----|-----|-----|
	| 0.9152 | 0.9460 | 18.79 | 0.0173 | 0.9321 | 0.9084 | 0.8458 | 0.1286 | 0.3779 | 0.5554 |
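To make the P@k and R@k columns concrete, the following minimal sketch computes both for a single sample under one common definition: hits among the top-k ranked labels, divided by k for precision and by the number of true labels for recall. The function name and the toy scores are illustrative assumptions, and the paper's evaluation code may use a different variant.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    scores: Sequence[float], true_labels: Set[int], k: int
) -> Tuple[float, float]:
    """Return (P@k, R@k) for one sample (hypothetical helper, not from the paper)."""
    # Rank label indices by descending score and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = sum(1 for i in ranked if i in true_labels)
    precision = hits / k
    recall = hits / len(true_labels) if true_labels else 0.0
    return precision, recall


# Toy sample with 6 candidate labels, 4 of which are relevant (made-up values).
scores = [0.9, 0.8, 0.1, 0.7, 0.6, 0.2]
true_labels = {0, 1, 3, 4}

for k in (1, 3, 5):
    p, r = precision_recall_at_k(scores, true_labels, k)
    print(f"k={k}: P@k={p:.2f}, R@k={r:.2f}")
```

Under this definition R@k cannot exceed k divided by the number of true labels, so a high P@1 alongside a low R@1 is what one would expect when samples carry several labels each.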

	Threshold = 0.5

	| Micro P | Micro R | Micro F1 | Macro F1 | Weighted F1 | Hamming Loss | Subset Accuracy |
	|--------|--------|----------|----------|------------|--------------|----------------|
	| 0.8612 | 0.7767 | 0.8168 | 0.4286 | 0.8093 | 0.0264 | 0.6874 |
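As an illustration of how the threshold-based columns relate to each other, here is a small pure-Python sketch of micro F1, macro F1, Hamming loss, and subset accuracy on binary indicator matrices. The helper names and the toy data are assumptions made for this example; it is not the evaluation code used for the reported numbers.

```python
from typing import Dict, List


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Dict[str, float]:
    """Compute a few multi-label metrics (hypothetical helper for illustration)."""
    n_samples, n_labels = len(y_true), len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1
    return {
        # Micro: pool the counts over all labels, then compute F1 once.
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),
        # Macro: compute F1 per label, then average with equal weight.
        "macro_f1": sum(f1(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels,
        # Hamming loss: fraction of individual label decisions that are wrong.
        "hamming_loss": (sum(fp) + sum(fn)) / (n_samples * n_labels),
        # Subset accuracy: fraction of samples whose full label set matches.
        "subset_accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n_samples,
    }


# Toy data: 4 samples, 3 labels; the third label is rare and gets missed.
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
y_prob = [[0.9, 0.2, 0.1], [0.8, 0.7, 0.3], [0.4, 0.6, 0.1], [0.7, 0.1, 0.4]]
y_pred = [[int(p > 0.5) for p in row] for row in y_prob]

metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```

In the toy data a single missed rare label barely moves micro F1 but pulls macro F1 down sharply, because macro averaging weights every label equally. A gap between micro and macro F1 like the one in the table above is consistent with weaker performance on infrequent labels, although the table alone does not establish the cause.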

	## Quick Start

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	tokenizer = AutoTokenizer.from_pretrained("Sana9/securebert-vuln2attack-flat", use_fast=False)
	model = AutoModelForSequenceClassification.from_pretrained("Sana9/securebert-vuln2attack-flat")

	text = "Buffer overflow vulnerability in OpenSSL allows remote attackers to execute arbitrary code."

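	# Run inference without gradients and map the logits to per-label probabilities
	with torch.no_grad():
	    probs = torch.sigmoid(
	        model(**tokenizer(text, return_tensors="pt", truncation=True)).logits
	    )[0]

	# Keep every label whose probability exceeds the 0.5 threshold
	predictions = {
	    model.config.id2label[i]: p.item()
	    for i, p in enumerate(probs)
	    if p > 0.5
	}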

	print(predictions)
	```
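With a fixed threshold, the `predictions` dictionary above can come back empty when no label exceeds 0.5. One possible way to handle that case, sketched below, is to fall back to the top-ranked labels. The function `decode_predictions`, its `fallback_top_k` parameter, and the placeholder label names are assumptions for illustration and are not part of the model or the Transformers API.

```python
from typing import Dict, List, Sequence, Tuple


def decode_predictions(
    probs: Sequence[float],
    id2label: Dict[int, str],
    threshold: float = 0.5,
    fallback_top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return (label, probability) pairs sorted by descending probability.

    Labels above the threshold are returned; if none qualify, the
    fallback_top_k highest-scoring labels are returned instead.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    selected = [(id2label[i], p) for i, p in ranked if p > threshold]
    if not selected:
        selected = [(id2label[i], p) for i, p in ranked[:fallback_top_k]]
    return selected


# Placeholder label map; the real one comes from model.config.id2label.
toy_id2label = {0: "label_a", 1: "label_b", 2: "label_c"}

print(decode_predictions([0.9, 0.2, 0.6], toy_id2label))
print(decode_predictions([0.3, 0.1, 0.4], toy_id2label, fallback_top_k=1))
```

With the Quick Start variables, the call would look like `decode_predictions(probs.tolist(), model.config.id2label)`. Whether a fallback is appropriate depends on the application, since labels returned this way fall below the threshold used for the reported metrics.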

	## Citation

	```bibtex
	@inproceedings{terranova2026multitaxonomy,
	  author = {Franco Terranova and Sana Rekbi and Abdelkader Lahmadi and Isabelle Chrisment},
	  title = {Multi-Taxonomy Vulnerability Classification with Hierarchically Finetuned Language Models},
	  booktitle = {Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA)},
	  year = {2026},
	  month = jul,
	  address = {Chania, Crete, Greece},
	  note = {HAL identifier: hal-05500820v2}
	}
	```

	## Related Resources

	- 🤗 [Full model suite on Hugging Face](https://huggingface.co/Sana9)
	- 💻 [CVE-LMTune — Training code (GitHub)](https://github.com/terranovafr/CVE-LMTune)
	- 📦 [Zenodo — Data repository](https://doi.org/10.5281/zenodo.16936476)

	## Disclaimers

	- This product uses the NVD API but is not endorsed or certified by the NVD. The same applies to the CVE2CAPEC project and the Hugging Face API.
	- This project relies on data publicly available from the CWE, CAPEC, and MITRE ATT&CK projects.
	- This work has been partially supported by the French National Research Agency under the France 2030 label (Superviz ANR-22-PECY-0008). The views expressed herein do not necessarily reflect the opinion of the French government.