AlexXBueno committed commit 1ccde08 (verified) · Parent(s): 99dea0d

update model card

Files changed (1): README.md (+102 −143)

README.md CHANGED
@@ -6,202 +6,161 @@ tags:
  - base_model:adapter:mistralai/Mistral-7B-v0.3
  - lora
  - transformers
  ---
 
  # Model Card for Model ID
 
- <!-- Provide a quick summary of what the model is/does. -->
-
 
  ## Model Details
 
  ### Model Description
 
- <!-- Provide a longer summary of what this model is. -->
-
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
 
  ### Model Sources [optional]
 
- <!-- Provide the basic links for the model. -->
 
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
 
  ## Uses
 
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
  ### Direct Use
 
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
 
- ### Out-of-Scope Use
 
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 
- [More Information Needed]
 
  ## Bias, Risks, and Limitations
 
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
 
  ### Recommendations
 
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
 
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 
  ## How to Get Started with the Model
 
- Use the code below to get started with the model.
-
- [More Information Needed]
 
  ## Training Details
 
  ### Training Data
 
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
 
  ### Training Procedure
 
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
- #### Preprocessing [optional]
-
- [More Information Needed]
 
  #### Training Hyperparameters
 
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
 
- #### Summary
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
 
  ### Model Architecture and Objective
 
- [More Information Needed]
 
  ### Compute Infrastructure
 
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
 
  #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
- ### Framework versions
-
- - PEFT 0.18.1
 
  - base_model:adapter:mistralai/Mistral-7B-v0.3
  - lora
  - transformers
+ - cyber-threat-intelligence
+ - cti
+ - ner
+ - information-extraction
  ---
 
  # Model Card for Model ID
 
+ This model is a fine-tuned version of Mistral-7B-v0.3 designed to automate the extraction of Cyber Threat Intelligence (CTI) from unstructured security reports.
+ It transforms raw, technical text into a structured JSON format containing cybersecurity entities (e.g., Malware, Threat Actors, Attack Patterns, and Indicators of Compromise).
 
  ## Model Details
 
  ### Model Description
 
+ This model uses QLoRA (Quantized Low-Rank Adaptation) to efficiently adapt the Mistral-7B base model to the highly specific task of Named Entity Recognition (NER) in the cybersecurity domain.
+ The model outputs a strict JSON structure, making it ideal for integration into automated RAG pipelines, SIEMs, or autonomous agent workflows (such as LangGraph).
 
+ - **Developed by:** Alex Bueno
+ - **Model type:** Causal Language Model with LoRA adapters (PEFT)
+ - **Language(s) (NLP):** English
+ - **License:** Apache 2.0
+ - **Finetuned from model:** `mistralai/Mistral-7B-v0.3`
 
  ### Model Sources [optional]
 
+ - **Repository:** https://huggingface.co/AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor
 
  ## Uses
 
  ### Direct Use
 
+ The model is designed to be queried directly with unstructured cybersecurity text (such as threat reports, blogs, or logs) using a specific prompt template.
+ It extracts the relevant entities and returns them as a structured JSON array.
 
+ ### Downstream Use
 
+ - **Multi-Agent Systems:** as a dedicated tool node for an orchestrator agent (e.g., Llama-3-70B) to extract structured data before querying a vector database or SQL store.
+ - **CTI Pipelines:** automated ingestion and structuring of daily threat reports into a local database.
 
  ## Bias, Risks, and Limitations
 
+ The model may exhibit prior-knowledge bias: it can insert threat actors or malware names that are semantically related to the input but not explicitly mentioned in it.
 
  ### Recommendations
 
+ - **Temperature:** it is strongly recommended to use a low temperature (`temperature=0.1` or `0.0`) during inference to ensure deterministic extraction.
+ - **Validation:** use Pydantic or structured-decoding libraries (such as `Outlines` or `Guidance`) in production to enforce the JSON grammar, as the model may occasionally produce malformed JSON.
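
The validation step can be sketched as follows. This is a minimal illustration, not the card's documented schema: the `label`/`text` field names of `Entity` are assumptions about the output format.

```python
import json
from typing import List

from pydantic import BaseModel, ValidationError


class Entity(BaseModel):
    # Hypothetical schema: the field names are illustrative,
    # not the model's documented output format.
    label: str
    text: str


def parse_entities(raw: str) -> List[Entity]:
    """Parse and validate the model's JSON response; return [] on failure."""
    try:
        return [Entity(**item) for item in json.loads(raw)]
    except (json.JSONDecodeError, ValidationError, TypeError):
        return []  # caller can retry generation or log the malformed output


raw = '[{"label": "malware", "text": "Emotet"}, {"label": "ipv4", "text": "192.168.1.50"}]'
entities = parse_entities(raw)
```

Returning an empty list on failure lets a pipeline retry generation (ideally at `temperature=0.0`) instead of crashing on malformed JSON.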
 
  ## How to Get Started with the Model
 
+ Use the code below to load the quantized base model and apply the LoRA adapters for inference:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+ from peft import PeftModel
+
+ MODEL_NAME = "mistralai/Mistral-7B-v0.3"
+ ADAPTER_REPO = "AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor"
+
+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+ tokenizer.pad_token = tokenizer.eos_token
+
+ # Configure 4-bit quantization
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.float16,
+     bnb_4bit_use_double_quant=True,
+ )
+
+ # Load base model
+ base_model = AutoModelForCausalLM.from_pretrained(
+     MODEL_NAME,
+     quantization_config=bnb_config,
+     device_map="auto",
+     dtype=torch.float16,
+     low_cpu_mem_usage=True,
+     attn_implementation="sdpa",
+ )
+ base_model.config.use_cache = True
+
+ # Attach the LoRA adapters (loaded on top of the base model, not merged)
+ model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
+ model.eval()
+
+ # Inference
+ text = "The attacker used IP 192.168.1.50 to deliver the Emotet payload via phishing."
+ prompt = (
+     f"### Instruction: Extract cyber threat entities in JSON format.\n"
+     f"### Input: {text}\n"
+     f"### Response: "
+ )
+
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=256,
+         temperature=0.1,
+         do_sample=True,
+         pad_token_id=tokenizer.eos_token_id,
+     )
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:")[1].strip())
+ ```
 
  ## Training Details
 
  ### Training Data
 
+ The model was fine-tuned on the `mrmoor/cyber-threat-intelligence` dataset, which contains annotated cybersecurity entities.
 
  ### Training Procedure
 
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
+ #### Preprocessing
 
+ A custom data collator (`CTICompletionCollator`) was implemented for training.
+ It computes the loss only on the JSON response generated by the model: the instruction and input tokens are masked with -100 labels, so the model never learns to regenerate the prompt and focuses entirely on producing the JSON structure.
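
The masking logic described above can be sketched as follows. This is a simplified stand-in for the actual `CTICompletionCollator` (the marker-token search and toy token ids are illustrative):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss


def mask_prompt_labels(input_ids, marker_ids):
    """Copy input_ids as labels, then mask every token up to and including
    the '### Response:' marker so the loss covers only the completion."""
    labels = list(input_ids)
    cut = 0  # if the marker is missing, fall back to training on everything
    for start in range(len(input_ids) - len(marker_ids) + 1):
        if input_ids[start:start + len(marker_ids)] == list(marker_ids):
            cut = start + len(marker_ids)
            break
    for i in range(cut):
        labels[i] = IGNORE_INDEX
    return labels


# Toy token ids: [prompt..., marker (9, 9), response...]
labels = mask_prompt_labels([1, 2, 3, 9, 9, 4, 5], [9, 9])
# labels == [-100, -100, -100, -100, -100, 4, 5]
```

In a real collator this runs per batch after tokenization, with `marker_ids` obtained by tokenizing the `### Response:` delimiter.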
 
  #### Training Hyperparameters
 
+ - Training regime: QLoRA (4-bit base model, 16-bit adapters)
+ - Epochs: 3
+ - Learning rate: 2e-4
+ - Batch size: 2
+ - Gradient accumulation steps: 8 (effective batch size 16)
+ - Optimizer: AdamW
+ - LR scheduler: Linear
+ - LoRA rank (r): 8
+ - LoRA alpha: 32
+ - LoRA dropout: 0.05
+ - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
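
Expressed as a `peft` configuration, the adapter settings above would look roughly like this. The rank, alpha, dropout, and target modules come from the list above; `bias` and `task_type` are assumed defaults, not stated in the card:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                     # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",             # assumed; not stated in the card
    task_type="CAUSAL_LM",
)
```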
 
+ ## Technical Specifications
 
  ### Model Architecture and Objective
 
+ The objective is strictly Information Extraction (IE), framed as an instruction-following generation task.
 
  ### Compute Infrastructure
 
+ The entire stack was developed and validated on local/on-premise infrastructure, avoiding cloud dependencies to ensure data privacy for sensitive CTI documents.
 
  #### Software
 
+ - PEFT 0.18.1
+ - Transformers
+ - BitsAndBytes
+ - PyTorch 2.5.1