---
base_model: mistralai/Mistral-7B-v0.3
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:mistralai/Mistral-7B-v0.3
- lora
- transformers
- cyber-threat-intelligence
- cti
- ner
- information-extraction
license: apache-2.0
datasets:
- mrmoor/cyber-threat-intelligence
language:
- en
---
# Model Card for Mistral-7B Cyber Threat Intelligence Extractor
This model is a fine-tuned version of Mistral-7B-v0.3 designed to automate the extraction of Cyber Threat Intelligence (CTI) from unstructured security reports.
It transforms raw, technical text into structured JSON format containing cybersecurity entities (e.g., Malware, Threat Actors, Attack Patterns, and Indicators of Compromise).
## Model Details
### Model Description
This model uses QLoRA (Quantized Low-Rank Adaptation) to efficiently adapt the Mistral-7B base model for the highly specific task of Named Entity Recognition (NER) in the cybersecurity domain.
The model outputs a strict JSON structure, making it ideal for integration into automated RAG pipelines or autonomous agent workflows (like LangGraph).
- **Developed by:** Alex Bueno
- **Model type:** Causal Language Model with LoRA adapters (PEFT)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** `mistralai/Mistral-7B-v0.3`
### Model Sources
- **Repository:** https://huggingface.co/AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor
## Uses
### Direct Use
The model is designed to be directly queried with unstructured cybersecurity text (like threat reports, blogs, or logs) using a specific prompt template.
It will extract relevant entities and return them as a structured JSON array.
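For illustration, a hypothetical input/output pair is shown below. The entity labels and JSON keys used here are assumptions for demonstration purposes; the model's actual schema may differ.

```python
import json

# Hypothetical response for a sentence mentioning Emotet delivered via
# phishing from 192.168.1.50. Labels and keys are illustrative only.
raw_response = """
[
  {"entity": "Emotet", "type": "malware"},
  {"entity": "192.168.1.50", "type": "indicator"},
  {"entity": "phishing", "type": "attack-pattern"}
]
"""

entities = json.loads(raw_response)
print([e["type"] for e in entities])  # ['malware', 'indicator', 'attack-pattern']
```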
### Downstream Use
- **Multi-Agent Systems:** As a specific Tool Node for an orchestrator agent to extract structured data before querying a Vector Database or SQL.
- **CTI Pipelines:** Automated ingestion and structuring of daily threat reports into a local database.
## Bias, Risks, and Limitations
The model may suffer from prior-knowledge bias: it can insert threat actor or malware names that are semantically related to the input but not explicitly mentioned in it.
### Recommendations
- **Temperature:** It is recommended to use a low temperature (`temperature=0.1` or `0.0`) during inference to ensure deterministic extraction.
- **Validation:** Use Pydantic or structured decoding libraries (like `Outlines` or `Guidance`) in production to enforce JSON grammar, as the model may occasionally produce malformed JSON syntax.
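A minimal sketch of the validation step using Pydantic. The entity schema below is an assumption; align the field names with the model's actual output keys.

```python
import json

from pydantic import BaseModel, ValidationError


class CTIEntity(BaseModel):
    # Hypothetical schema; adapt to the model's real output format.
    entity: str
    type: str


def parse_entities(raw: str) -> list[CTIEntity]:
    """Validate the model's raw JSON response, raising on malformed output."""
    return [CTIEntity(**item) for item in json.loads(raw)]


try:
    entities = parse_entities('[{"entity": "Emotet", "type": "malware"}]')
except (json.JSONDecodeError, ValidationError):
    entities = []  # fall back, or retry generation with different sampling
```

In production, the `except` branch is where you would trigger a retry or route the raw text to a structured-decoding backend.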
## How to Get Started with the Model
Use the code below to load the quantized base model and apply the LoRA adapters for inference:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

MODEL_NAME = "mistralai/Mistral-7B-v0.3"
ADAPTER_REPO = "AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="sdpa",
)
base_model.config.use_cache = True

# Apply LoRA adapters on top of the quantized base model
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
model.eval()

# Inference
text = "The attacker used IP 192.168.1.50 to deliver the Emotet payload via phishing."
prompt = (
    f"### Instruction: Extract cyber threat entities in JSON format.\n"
    f"### Input: {text}\n"
    f"### Response: "
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:")[1].strip())
```
## Training Details
### Training Data
The model was fine-tuned on the `mrmoor/cyber-threat-intelligence` dataset, which contains annotated cybersecurity entities.
### Training Procedure
#### Preprocessing
A custom data collator (`CTICompletionCollator`) was implemented during training.
It computes the loss only on the JSON response generated by the model: the instruction and input tokens are masked with `-100` labels, so the model never learns to regenerate the prompt and focuses entirely on producing the JSON structure.
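The masking idea can be sketched as follows. The real collator operates on token tensors; plain Python lists keep the sketch minimal, and the token IDs are placeholders.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss ignores this label value


def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Build labels from input_ids, masking the instruction/input span so
    the loss is computed only on the response tokens."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Example: 5 prompt tokens followed by 3 response tokens (IDs are arbitrary)
ids = [101, 2009, 2003, 1037, 3231, 501, 502, 503]
labels = mask_prompt_labels(ids, prompt_len=5)
print(labels)  # [-100, -100, -100, -100, -100, 501, 502, 503]
```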
#### Training Hyperparameters
- Training regime: QLoRA (4-bit base model, 16-bit adapters)
- Epochs: 3
- Learning Rate: 2e-4
- Batch Size: 2
- Gradient Accumulation Steps: 8
- Optimizer: AdamW
- LR Scheduler: Linear
- LoRA Rank (r): 8
- LoRA Alpha: 32
- LoRA Dropout: 0.05
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
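The adapter settings above correspond roughly to the following `peft` configuration (a sketch reconstructed from the listed hyperparameters; `bias` and `task_type` values are assumptions):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                 # LoRA rank
    lora_alpha=32,       # scaling factor
    lora_dropout=0.05,
    bias="none",         # assumed default
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
)
```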
## Technical Specifications
### Model Architecture and Objective
The objective is strictly Information Extraction (IE) formatted as an Instruction-Following generation task.
### Compute Infrastructure
The entire stack was developed and validated on local infrastructure, avoiding cloud dependencies to ensure data privacy for sensitive CTI documents.
#### Software
- PEFT 0.18.1
- Transformers
- BitsAndBytes
- PyTorch 2.5.1 |