naufalso committed · Commit dc4b429 · verified · 1 Parent(s): ade7974

Update README.md

Files changed (1):
  README.md +112 -174

README.md CHANGED
@@ -2,198 +2,136 @@
  library_name: transformers
  tags:
  - generated_from_trainer
- datasets:
- - naufalso/redsage_seed
- - naufalso/cybersecurity_seed_dump
- - trendmicro-ailab/Primus-Seed
- - trendmicro-ailab/Primus-Seed
- - naufalso/nvd-cve
  model-index:
- - name: outputs/pretrain/qwen/RedSage-Qwen3-8B-Pretrain_05-Seed-New
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- [<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
- <details><summary>See axolotl config</summary>
-
- axolotl version: `0.10.0`
- ```yaml
- # ------------------------------------------------------------------
- # Basic model + tokenizer
- # ------------------------------------------------------------------
- base_model: ./outputs/pretrain/qwen/dedup/RedSage-Qwen3-8b-Base-Pretrain-Dedup_05 # dense 8B variant
- # model_type: qwen
- # tokenizer_type: qwen
- trust_remote_code: true
- auto_resume_from_checkpoints: true
-
- # ------------------------------------------------------------------
- # Precision + distributed strategy
- # ------------------------------------------------------------------
- bf16: true # enable bf16 math
- deepspeed: deepspeed_configs/zero3_bf16.json # sharded weights/opt/grads
- gradient_checkpointing: true # recompute to save VRAM
- sequence_parallel: true # tiny extra memory win
-
- # ------------------------------------------------------------------
- # Batch, sequence, epochs
- # ------------------------------------------------------------------
- micro_batch_size: 32
- gradient_accumulation_steps: 1 # 32 micro-batch x 32 GPUs = 1024 global batch
- num_epochs: 5
- seq_length: 32768
-
- # ------------------------------------------------------------------
- # Optimiser & scheduler
- # ------------------------------------------------------------------
- optimizer: adamw_torch
- lr_scheduler: cosine
- learning_rate: 2.5e-5
- weight_decay: 0.05
- warmup_ratio: 0.01
- cosine_min_lr_ratio: 0.1
- cosine_constant_lr_ratio: 0.2
-
- # ------------------------------------------------------------------
- # Dataset (replace with your own)
- # ------------------------------------------------------------------
- chat_template: jinja
- chat_template_jinja: "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%-\
- \ if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n\
- \ {%- else %}\n {{- 'You are REDSAGE, cybersecurity-tuned model developed\
- \ by Khalifa University. You are a helpful assistant.' }}\n {%- endif %}\n \
- \ {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the\
- \ user query.\\n\\nYou are provided with function signatures within <tools></tools>\
- \ XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n\
- \ {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor\
- \ each function call, return a json object with function name and arguments within\
- \ <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>,\
- \ \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else\
- \ %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\\
- n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\\
- nYou are REDSAGE, cybersecurity-tuned model developed by Khalifa University. You\
- \ are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%-\
- \ for message in messages %}\n {%- if (message.role == \"user\") or (message.role\
- \ == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls)\
- \ %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>'\
- \ + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>'\
- \ + message.role }}\n {%- if message.content %}\n {{- '\\n' +\
- \ message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls\
- \ %}\n {%- if tool_call.function is defined %}\n {%- set\
- \ tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\\
- n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n \
- \ {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n\
- \ {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\\
- n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 ==\
- \ 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user'\
- \ }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{-\
- \ message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last\
- \ or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\\
- n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt\
- \ %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n"
-
- datasets:
-   - path: naufalso/redsage_seed
-     type: completion
-     name: all
-
-   - path: naufalso/cybersecurity_seed_dump
-     type: completion
-     name: default
-
-   - path: trendmicro-ailab/Primus-Seed
-     type: completion
-     name: cybersecurity_companies_websites
-     field: content
-
-   - path: trendmicro-ailab/Primus-Seed
-     type: completion
-     name: mitre
-     field: content
-
-   - path: naufalso/nvd-cve
-     type: completion
-     name: filtered
-
- # ------------------------------------------------------------------
- # Logging / output
- # ------------------------------------------------------------------
- output_dir: ./outputs/pretrain/qwen/RedSage-Qwen3-8B-Pretrain_05-Seed-New
- dataset_prepared_path: ./prepared_datasets/RedSage-Qwen3-8B-Pretrain_05-Seed-New
- saves_per_epoch: 1
- eval_steps: 0.5
- val_set_size: 0.05
- log_with:
-   - wandb
-   - tensorboard
-
- use_tensorboard: true
- wandb_mode: "offline"
- wandb_entity: naufalso
- wandb_project: redsage
- wandb_name: RedSage-Qwen3-8B-Pretrain_05-Seed-New
-
- # ------------------------------------------------------------------
- # Misc
- # ------------------------------------------------------------------
- save_total_limit: 5 # keep the last 5 checkpoints
- load_in_8bit: false # full fine-tune, no quantisation
- torch_compile: false # turn on only after the run is stable
- ```
-
- </details><br>
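For a no-tools conversation, the Jinja template in the config above renders messages into Qwen's ChatML-style layout, inserting the default REDSAGE system message when none is supplied. A minimal plain-Python sketch of that rendered layout (an approximation for illustration, not the actual template engine):

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate the rendered output of the chat template above
    for a simple conversation with no tool calls."""
    default_system = ("You are REDSAGE, cybersecurity-tuned model developed "
                      "by Khalifa University. You are a helpful assistant.")
    parts = []
    # The template injects a default system prompt when the first
    # message is not a system message.
    if not messages or messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{default_system}<|im_end|>\n")
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Open an assistant turn when a generation prompt is requested.
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chatml([{"role": "user", "content": "What is a CVE?"}])
print(prompt)
```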
 
 
 
- # outputs/pretrain/qwen/RedSage-Qwen3-8B-Pretrain_05-Seed-New

- This model was trained from scratch on the naufalso/redsage_seed, naufalso/cybersecurity_seed_dump, trendmicro-ailab/Primus-Seed (two subsets), and naufalso/nvd-cve datasets.
- It achieves the following results on the evaluation set:
- - Loss: 0.9952

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 2.5e-05
- - train_batch_size: 32
- - eval_batch_size: 32
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 32
- - total_train_batch_size: 1024
- - total_eval_batch_size: 1024
- - optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 69
- - training_steps: 6921

- ### Training results

- | Training Loss | Epoch | Step | Validation Loss |
- |:-------------:|:------:|:----:|:---------------:|
- | No log | 0 | 0 | 1.7388 |
- | 0.9127 | 2.4989 | 3461 | 0.9952 |

- ### Framework versions

- - Transformers 4.52.3
- - Pytorch 2.5.1+cu121
- - Datasets 3.6.0
- - Tokenizers 0.21.2
  library_name: transformers
  tags:
  - generated_from_trainer
+ - cybersecurity
+ - continual-pretraining
+ - targeted-pretraining
+ - text-generation
+ - causal-lm
+ - risys-lab
  model-index:
+ - name: RedSage-Qwen3-8B-Base
  results: []
+ language:
+ - en
+ base_model:
+ - RISys-Lab/RedSage-Qwen3-8B-CFW
+ pipeline_tag: text-generation
  ---

+ # RedSage-Qwen3-8B-Base
+
+ <div align="center">
+ <img src="https://img.shields.io/badge/Task-Cybersecurity-red" alt="Cybersecurity">
+ <img src="https://img.shields.io/badge/Stage-Targeted_Pretraining-blue" alt="Targeted Pretraining">
+ </div>
+
+ ## Model Summary
+
+ **RedSage-Qwen3-8B-Base** is a cybersecurity-specialized Large Language Model (LLM) developed by **RISys-Lab**. It represents the **second stage** of the RedSage pre-training pipeline.
+
+ This model builds upon **RedSage-Qwen3-8B-CFW** by undergoing **Targeted Pre-Training** on high-quality, curated cybersecurity resources (`RedSage-Seed` and `RedSage-Dump`). While the previous stage focused on breadth using web data, this stage focuses on depth, technical standards, and verified skills.
+
+ - **Paper:** [RedSage: A Cybersecurity Generalist LLM](https://openreview.net/forum?id=W4FAenIrQ2)
+ - **Repository:** [GitHub](https://github.com/RISys-Lab/RedSage)
+ - **Base Model:** [RISys-Lab/RedSage-Qwen3-8B-CFW](https://huggingface.co/RISys-Lab/RedSage-Qwen3-8B-CFW)
+ - **Variant:** Base (Final Pre-trained Checkpoint)
+
+ ## Intended Use
+
+ This model is a **base model** intended for:
+ 1. **Fine-tuning:** Serving as a high-quality foundation for downstream cybersecurity tasks (e.g., incident response, malware analysis).
+ 2. **Research:** Investigating the impact of curated versus web-scale data in domain adaptation.
+ 3. **Completion:** Code completion and technical writing in cybersecurity contexts.
+
+ **Note:** As a base model, this checkpoint has **not** been instruction-tuned (SFT) or aligned (DPO). It behaves like a completion engine. For a chat-ready assistant, please see `RISys-Lab/RedSage-Qwen3-8B-DPO`.
+
+ ## Training Lineage
+
+ RedSage employs a multi-stage training pipeline. This model represents the output of **Stage 2**.

+ 1. Stage 1: Continual Pre-Training (CPT) -> `RedSage-Qwen3-8B-CFW` (CyberFineWeb data)
+ 2. **Stage 2: Targeted Pre-Training** -> **`RedSage-Qwen3-8B-Base`** (Current Model)
+ 3. Stage 3: Supervised Fine-Tuning (SFT) -> `RedSage-Qwen3-8B-Ins`
+ 4. Stage 4: Direct Preference Optimization (DPO) -> `RedSage-Qwen3-8B-DPO`

+ ## Training Data: RedSage-Seed & Dump

+ This model was trained on approximately **850 million tokens** of curated data, split into two collections:

+ 1. **RedSage-Seed (~150M tokens):** A highly curated collection of 28,637 samples converted to structured Markdown.
+    * **Knowledge:** General concepts and frameworks (MITRE ATT&CK, CAPEC, CWE, OWASP).
+    * **Skills:** Offensive-security resources, including write-ups, hacking techniques, and payload examples.
+    * **Tools:** Manuals and cheat sheets for CLI tools and Kali Linux.

+ 2. **RedSage-Dump (~700M tokens):** A larger aggregation of 459K technical documents.
+    * **Sources:** Computer-education portals, cybersecurity news, RFC entries, NIST publications, and the National Vulnerability Database (NVD).

+ ## Performance

+ RedSage-8B-Base shows significant improvements over the general-purpose Qwen3-8B-Base and achieves the highest mean score on external benchmarks among all 8B base models tested.

+ ### RedSage-Bench (0-shot Accuracy)

+ | Category | Qwen3-8B-Base | **RedSage-8B-Base** |
+ | :--- | :---: | :---: |
+ | **Macro Average** | 84.24 | **85.05** |
+ | Knowledge (General) | 83.08 | 83.12 |
+ | Knowledge (Frameworks) | 81.94 | **84.94** |
+ | Skill (Offensive) | 88.23 | **88.72** |
+ | Tools (CLI) | 85.08 | **85.44** |
+ | Tools (Kali) | 78.86 | **79.36** |

+ ### External Cybersecurity Benchmarks (5-shot)

+ | Benchmark | Qwen3-8B-Base | **RedSage-8B-Base** |
+ | :--- | :---: | :---: |
+ | **Mean** | 80.81 | **84.56** |
+ | CTI-Bench (MCQ) | 68.80 | **71.04** |
+ | CTI-Bench (RCM) | 63.50 | **78.40** |
+ | CyberMetric (500) | 92.00 | **92.60** |
+ | MMLU (Security) | 83.00 | **87.00** |
+ | SecBench (En) | **82.84** | 81.76 |
+ | SecEval (MCQ) | 75.60 | **75.83** |
+ | SECURE (CWET) | 92.70 | **93.22** |
+ | SECURE (KCV) | 75.05 | **87.20** |
+ | SECURE (MEAT) | 93.81 | **94.00** |
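As a quick sanity check, the **Mean** row of the external-benchmark table is the simple average of the nine per-benchmark rows. A short sketch reproducing it from the numbers above (not the authors' evaluation code):

```python
# Per-benchmark scores copied from the table above, in row order:
# CTI-Bench (MCQ), CTI-Bench (RCM), CyberMetric (500), MMLU (Security),
# SecBench (En), SecEval (MCQ), SECURE (CWET), SECURE (KCV), SECURE (MEAT)
qwen3_base = [68.80, 63.50, 92.00, 83.00, 82.84, 75.60, 92.70, 75.05, 93.81]
redsage_base = [71.04, 78.40, 92.60, 87.00, 81.76, 75.83, 93.22, 87.20, 94.00]

def mean(xs):
    # Unweighted average, rounded to two decimals as in the table.
    return round(sum(xs) / len(xs), 2)

print(mean(qwen3_base))    # -> 80.81
print(mean(redsage_base))  # -> 84.56
```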
 
+ ## Training Procedure

+ The model was trained using the [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) framework.

+ - **Learning Rate:** 2.5e-6 (constant with linear warmup)
+ - **Optimizer:** AdamW
+ - **Epochs:** 1
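The three settings above could be expressed as an Axolotl-style config fragment along these lines (an illustrative sketch only; field names follow the conventions of the Stage-1 config shown earlier, and any field not listed in the bullets above is omitted or assumed):

```yaml
# Illustrative fragment; only these values come from the model card.
optimizer: adamw_torch
learning_rate: 2.5e-6
lr_scheduler: constant_with_warmup   # constant LR after a linear warmup
num_epochs: 1
```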
 

+ ## Usage

+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "RISys-Lab/RedSage-Qwen3-8B-Base"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+ text = "The primary difference between a firewall and an IDS is"
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+ outputs = model.generate(**inputs, max_new_tokens=50)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ## Citation
+
+ If you use this model or dataset, please cite our paper:
+
+ ```bibtex
+ @inproceedings{suryanto2026redsage,
+   title={RedSage: A Cybersecurity Generalist {LLM}},
+   author={Naufal Suryanto and Muzammal Naseer and Pengfei Li and Syed Talal Wasim and Jinhui Yi and Juergen Gall and Paolo Ceravolo and Ernesto Damiani},
+   booktitle={The Fourteenth International Conference on Learning Representations},
+   year={2026},
+   url={https://openreview.net/forum?id=W4FAenIrQ2}
+ }
+ ```
137