cedricbonhomme committed on
Commit 190c99c · verified · 1 Parent(s): d226ee0

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +54 -63
  2. config.json +1 -1
  3. model.safetensors +1 -1
  4. training_args.bin +1 -1
README.md CHANGED
@@ -1,84 +1,75 @@
  ---
- language:
- - zh
- license: cc-by-4.0
  library_name: transformers
- tags:
- - text-classification
- - vulnerability
- - severity
- - cybersecurity
- - cnvd
- datasets:
- - CIRCL/Vulnerability-CNVD
  base_model: hfl/chinese-macbert-base
- pipeline_tag: text-classification
  ---

- # VLAI: Automated Vulnerability Severity Classification (Chinese Text)
-
- A fine-tuned [hfl/chinese-macbert-base](https://huggingface.co/hfl/chinese-macbert-base) model for classifying Chinese vulnerability descriptions from the [China National Vulnerability Database (CNVD)](https://www.cnvd.org.cn/) into three severity levels: **Low**, **Medium**, and **High**.
-
- Trained on the [CIRCL/Vulnerability-CNVD](https://huggingface.co/datasets/CIRCL/Vulnerability-CNVD) dataset as part of the [VulnTrain](https://github.com/vulnerability-lookup/VulnTrain) project.
-
- ## Evaluation results
-
- Evaluated on a **deduplicated test set** (25,664 samples) where no description text appears in both train and test splits, preventing data leakage from CNVD's reuse of boilerplate descriptions across different vulnerability IDs.
-
- | Class | Precision | Recall | F1-score | Support |
- |--------|-----------|--------|----------|---------|
- | Low | 0.6091 | 0.3966 | 0.4804 | 2,267 |
- | Medium | 0.7743 | 0.8387 | 0.8052 | 14,353 |
- | High | 0.7808 | 0.7461 | 0.7631 | 9,044 |
-
- - **Overall accuracy**: 76.70%
- - **Macro F1**: 0.6829

- ### Class distribution

- The dataset is imbalanced: Low (8.8%), Medium (55.9%), High (35.2%).

- ## Usage

- ```python
- from transformers import pipeline

- classifier = pipeline(
-     "text-classification",
-     model="CIRCL/vulnerability-severity-classification-chinese-macbert-base"
- )

- description = "TOTOLINK A3600R存在缓冲区溢出漏洞,攻击者可利用该漏洞在系统上执行任意代码或者导致拒绝服务。"
- result = classifier(description)
- print(result)
- ```

- ## Known limitations

- - **Low severity recall**: the Low class has the lowest recall. Approximately 60% of Low-severity entries are misclassified, mostly as Medium. This reflects the vocabulary overlap between Low and Medium descriptions in CNVD data. Class-weighted loss and focal loss were tested but all degraded Medium recall disproportionately without a net benefit.

- - **Keyword dependency**: the model biases toward a vulnerability type's typical severity. For example, buffer overflow descriptions are predicted as High regardless of the actual assigned severity. On entries where the actual severity deviates from the type's typical severity, accuracy drops from ~89% to ~55%.

- - **Negation blindness**: the model does not understand negation. Descriptions like "does NOT allow remote code execution" can still produce high-confidence High severity predictions.

- - **CVE overlap**: 81% of CNVD entries have a corresponding CVE. The model primarily adds value for the ~19% of CNVD-only entries (concentrated in Chinese domestic software) where no CVE severity assessment exists.

- These limitations were identified through independent analysis in [VulnTrain#19](https://github.com/vulnerability-lookup/VulnTrain/issues/19).

- ## Training details

- - **Base model**: [hfl/chinese-macbert-base](https://huggingface.co/hfl/chinese-macbert-base)
- - **Dataset**: [CIRCL/Vulnerability-CNVD](https://huggingface.co/datasets/CIRCL/Vulnerability-CNVD)
- - **Train/test split**: deduplicated on description text (no leakage), 80/20 split
- - **Loss**: uniform cross-entropy (no class weighting)
- - **Learning rate**: 3e-05
- - **Batch size**: 16
- - **Epochs**: 5
- - **Best model selection**: by accuracy

- ## References

- - [Vulnerability-Lookup](https://vulnerability.circl.lu) - the vulnerability data source
- - [VulnTrain](https://github.com/vulnerability-lookup/VulnTrain) - training pipeline
- - [ML-Gateway](https://github.com/vulnerability-lookup/ML-Gateway) - inference API
- - [VLAI paper](https://arxiv.org/abs/2507.03607) - Bonhomme, C., Dulaunoy, A. (2025)
 
  ---
  library_name: transformers
+ license: apache-2.0
  base_model: hfl/chinese-macbert-base
+ tags:
+ - generated_from_trainer
+ metrics:
+ - accuracy
+ model-index:
+ - name: vulnerability-severity-classification-chinese-macbert-base
+   results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # vulnerability-severity-classification-chinese-macbert-base

+ This model is a fine-tuned version of [hfl/chinese-macbert-base](https://huggingface.co/hfl/chinese-macbert-base) on an unknown dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 1.3186
+ - Accuracy: 0.7657
+ - F1 Macro: 0.6796
+ - Low Precision: 0.5544
+ - Low Recall: 0.3987
+ - Low F1: 0.4638
+ - Medium Precision: 0.7805
+ - Medium Recall: 0.8196
+ - Medium F1: 0.7996
+ - High Precision: 0.7787
+ - High Recall: 0.7720
+ - High F1: 0.7753

+ ## Model description

+ More information needed

+ ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training:
+ - learning_rate: 3e-05
+ - train_batch_size: 32
+ - eval_batch_size: 32
+ - seed: 42
+ - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+ - lr_scheduler_type: linear
+ - num_epochs: 5

+ ### Training results

+ | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 Macro | Low Precision | Low Recall | Low F1 | Medium Precision | Medium Recall | Medium F1 | High Precision | High Recall | High F1 |
+ |:-------------:|:-----:|:-----:|:---------------:|:--------:|:--------:|:-------------:|:----------:|:------:|:----------------:|:-------------:|:---------:|:--------------:|:-----------:|:-------:|
+ | 1.2165 | 1.0 | 3221 | 1.2363 | 0.7429 | 0.5960 | 0.6380 | 0.1631 | 0.2598 | 0.7335 | 0.8549 | 0.7895 | 0.7679 | 0.7115 | 0.7386 |
+ | 1.1430 | 2.0 | 6442 | 1.1676 | 0.7625 | 0.6464 | 0.6316 | 0.2643 | 0.3726 | 0.7548 | 0.8568 | 0.8026 | 0.7909 | 0.7386 | 0.7639 |
+ | 0.8890 | 3.0 | 9663 | 1.1915 | 0.7631 | 0.6690 | 0.5884 | 0.3470 | 0.4365 | 0.7833 | 0.8091 | 0.7960 | 0.7564 | 0.7933 | 0.7744 |
+ | 0.8253 | 4.0 | 12884 | 1.2354 | 0.7675 | 0.6796 | 0.5739 | 0.3874 | 0.4626 | 0.7765 | 0.8305 | 0.8026 | 0.7849 | 0.7630 | 0.7738 |
+ | 0.5851 | 5.0 | 16105 | 1.3186 | 0.7657 | 0.6796 | 0.5544 | 0.3987 | 0.4638 | 0.7805 | 0.8196 | 0.7996 | 0.7787 | 0.7720 | 0.7753 |


+ ### Framework versions

+ - Transformers 5.9.0
+ - Pytorch 2.12.0+cu130
+ - Datasets 4.8.5
+ - Tokenizers 0.22.2
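
The per-class figures in the new README.md can be sanity-checked against its reported F1 Macro. The following is a minimal sketch, assuming that each F1 is the harmonic mean of precision and recall and that F1 Macro is the unweighted mean of the three per-class F1 scores; the README does not state how the Trainer computed them. The helper names `f1_score` and `macro_f1` are illustrative and not part of the repository.

```python
# Final-epoch (precision, recall) pairs copied from the new README.md.
FINAL_EPOCH = {
    "Low": (0.5544, 0.3987),
    "Medium": (0.7805, 0.8196),
    "High": (0.7787, 0.7720),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_f1(per_class: dict[str, tuple[float, float]]) -> float:
    """Unweighted mean of the per-class F1 scores (assumed definition)."""
    scores = [f1_score(p, r) for p, r in per_class.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for label, (p, r) in FINAL_EPOCH.items():
        print(f"{label}: F1 = {f1_score(p, r):.4f}")
    print(f"Macro F1 = {macro_f1(FINAL_EPOCH):.4f}")
```

Under these assumptions the recomputed values agree with the reported Low, Medium, and High F1 and with F1 Macro to four decimal places.
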
config.json CHANGED
@@ -39,7 +39,7 @@
  "pooler_type": "first_token_transform",
  "problem_type": "single_label_classification",
  "tie_word_embeddings": true,
- "transformers_version": "5.8.1",
  "type_vocab_size": 2,
  "use_cache": false,
  "vocab_size": 21128

  "pooler_type": "first_token_transform",
  "problem_type": "single_label_classification",
  "tie_word_embeddings": true,
+ "transformers_version": "5.9.0",
  "type_vocab_size": 2,
  "use_cache": false,
  "vocab_size": 21128
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:85b042c63a5b88954da96b16469622785fb8241352817f2c25b7685e1f663473
  size 409103316

  version https://git-lfs.github.com/spec/v1
+ oid sha256:9c3c95c84cd5eda329a19bb85b5541b787efe845b3928db83a1fc51f11ec0245
  size 409103316
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:575baa156bdc5e5f07e63e7d51c0971eb2b304d73bb6c1c735b76e9ac8a00b4c
  size 5329

  version https://git-lfs.github.com/spec/v1
+ oid sha256:fe676f030dfeaf223c576a6c298bff30cc2b00d682b7d519687ed39eaba7c8da
  size 5329
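
The two Git LFS pointer diffs above change only the `oid sha256:` line while `size` stays the same, so a downloaded file can be checked against its pointer by hashing it locally. The following is a minimal sketch using only the Python standard library. `parse_lfs_pointer`, `sha256_of_file`, and `matches_pointer` are hypothetical helper names, and the local path `model.safetensors` is an assumption about where the file was downloaded.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split a Git LFS pointer file into its space-separated key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    return fields


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weights need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(path: Path, pointer_text: str) -> bool:
    """True when the file's size and SHA-256 both match the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    expected_oid = fields["oid"].removeprefix("sha256:")
    expected_size = int(fields["size"])
    return (
        path.stat().st_size == expected_size
        and sha256_of_file(path) == expected_oid
    )


# Pointer for the new model.safetensors, copied from the diff above.
NEW_MODEL_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:9c3c95c84cd5eda329a19bb85b5541b787efe845b3928db83a1fc51f11ec0245
size 409103316
"""

if __name__ == "__main__":
    local_file = Path("model.safetensors")  # assumed download location
    if local_file.exists():
        ok = matches_pointer(local_file, NEW_MODEL_POINTER)
        print("match" if ok else "mismatch")
    else:
        print(f"{local_file} not found; download it first")
```

The size comparison runs first because it is cheap and avoids hashing a file that is obviously truncated or the wrong artifact.
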