# Model Card for Mistral-7B Cyber Threat Intelligence Extractor
This model is a fine-tuned version of Mistral-7B-v0.3 designed to automate the extraction of Cyber Threat Intelligence (CTI) from unstructured security reports. It transforms raw, technical text into structured JSON format containing cybersecurity entities (e.g., Malware, Threat Actors, Attack Patterns, and Indicators of Compromise).
## Model Details

### Model Description
This model uses QLoRA (Quantized Low-Rank Adaptation) to efficiently adapt the Mistral-7B base model for the highly specific task of Named Entity Recognition (NER) in the cybersecurity domain. The model outputs a strict JSON structure, making it ideal for integration into automated RAG pipelines or autonomous agent workflows (like LangGraph).
- Developed by: Alex Bueno
- Model type: Causal Language Model with LoRA adapters (PEFT)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: `mistralai/Mistral-7B-v0.3`
## Uses

### Direct Use
The model is designed to be directly queried with unstructured cybersecurity text (like threat reports, blogs, or logs) using a specific prompt template. It will extract relevant entities and return them as a structured JSON array.
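As a minimal sketch of that flow (the entity schema with `type`/`value` keys is an illustrative assumption; the actual field names depend on the training annotations):

```python
import json

# Prompt template used by the model (see the quickstart section).
PROMPT_TEMPLATE = (
    "### Instruction: Extract cyber threat entities in JSON format.\n"
    "### Input: {text}\n"
    "### Response: "
)

def build_prompt(text: str) -> str:
    """Wrap raw report text in the instruction template."""
    return PROMPT_TEMPLATE.format(text=text)

def parse_response(raw: str) -> list:
    """Extract and parse the JSON array that follows the response marker."""
    payload = raw.split("### Response:")[-1].strip()
    return json.loads(payload)

# A decoded output of roughly this shape is what the model aims to produce:
example = '### Response: [{"type": "malware", "value": "Emotet"}]'
entities = parse_response(example)
```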
### Downstream Use
- Multi-Agent Systems: As a specific Tool Node for an orchestrator agent to extract structured data before querying a Vector Database or SQL.
- CTI Pipelines: Automated ingestion and structuring of daily threat reports into a local database.
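The tool-node pattern can be sketched as a plain callable that an orchestrator (LangGraph or otherwise) invokes; `generate_fn` here is a hypothetical stand-in for the model call shown in the quickstart section:

```python
import json

def cti_extraction_tool(report_text: str, generate_fn) -> list:
    """Tool-style wrapper: build the prompt, call the model, and return
    parsed entities for downstream nodes (vector DB upserts, SQL inserts)."""
    prompt = (
        "### Instruction: Extract cyber threat entities in JSON format.\n"
        f"### Input: {report_text}\n"
        "### Response: "
    )
    raw = generate_fn(prompt)  # model returns prompt + completion
    return json.loads(raw.split("### Response:")[-1].strip())
```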
## Bias, Risks, and Limitations

The model may exhibit prior-knowledge bias, which can lead it to insert threat actor or malware names that are semantically related to the input but not explicitly mentioned in it.
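A simple post-hoc mitigation is to drop any extracted entity whose surface form never appears in the source text (this assumes entities carry a `value` field, which is an illustrative schema choice):

```python
def filter_grounded(entities: list, source_text: str) -> list:
    """Keep only entities whose value literally occurs in the input text,
    discarding hallucinated but plausible-sounding names."""
    lowered = source_text.lower()
    return [
        e for e in entities
        if e.get("value") and e["value"].lower() in lowered
    ]
```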
### Recommendations

- Temperature: Use a low temperature (`temperature=0.1` or `0.0`) during inference to keep extraction deterministic.
- Validation: In production, use Pydantic or a structured-decoding library (such as `Outlines` or `Guidance`) to enforce the JSON grammar, as the model may occasionally produce malformed JSON syntax.
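A minimal sketch of the validation step with Pydantic (the `type`/`value` schema is an assumption; adjust it to the entity fields the model actually emits):

```python
import json
from pydantic import BaseModel, ValidationError

class CTIEntity(BaseModel):
    type: str
    value: str

def validate_output(raw_json: str) -> list[CTIEntity]:
    """Parse the model's response and fail loudly on schema violations."""
    items = json.loads(raw_json)
    return [CTIEntity.model_validate(item) for item in items]

try:
    entities = validate_output('[{"type": "ip", "value": "192.168.1.50"}]')
except (json.JSONDecodeError, ValidationError):
    entities = []  # fall back: retry generation or log for review
```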
## How to Get Started with the Model
Use the code below to load the quantized base model and apply the LoRA adapters for inference:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

MODEL_NAME = "mistralai/Mistral-7B-v0.3"
ADAPTER_REPO = "AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="sdpa",
)
base_model.config.use_cache = True

# Attach LoRA adapters (kept separate; call merge_and_unload() to fuse them)
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
model.eval()

# Inference
text = "The attacker used IP 192.168.1.50 to deliver the Emotet payload via phishing."
prompt = (
    f"### Instruction: Extract cyber threat entities in JSON format.\n"
    f"### Input: {text}\n"
    f"### Response: "
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("### Response:")[1].strip())
```
## Training Details

### Training Data
The model was fine-tuned on the mrmoor/cyber-threat-intelligence dataset, which contains annotated cybersecurity entities.
### Training Procedure

#### Preprocessing
A custom data collator (`CTICompletionCollator`) was implemented for training. It computes the loss only on the JSON response: instruction and input tokens are masked with `-100` labels, preventing the model from learning to reproduce the prompt and focusing it entirely on generating the JSON structure.
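The masking step can be sketched in plain Python (a simplified, list-based illustration of what the collator does per example; `response_start` would come from tokenizing the prompt portion):

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_prompt_labels(input_ids, response_start):
    """Build labels for completion-only training: copy the input ids,
    then mask every token before the response so the loss is computed
    only on the JSON completion."""
    labels = list(input_ids)
    labels[:response_start] = [IGNORE_INDEX] * response_start
    return labels
```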
#### Training Hyperparameters
- Training regime: QLoRA (4-bit base model, 16-bit adapters)
- Epochs: 3
- Learning Rate: 2e-4
- Batch Size: 2
- Gradient Accumulation Steps: 8
- Optimizer: AdamW
- LR Scheduler: Linear
- LoRA Rank (r): 8
- LoRA Alpha: 32
- LoRA Dropout: 0.05
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
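The adapter settings above map directly onto a PEFT `LoraConfig` (a sketch reconstructed from the listed hyperparameters, not the exact training script):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                 # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```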
## Technical Specifications

### Model Architecture and Objective
The objective is strictly Information Extraction (IE) formatted as an Instruction-Following generation task.
### Compute Infrastructure

The entire stack was developed and validated on local infrastructure, avoiding cloud dependencies to ensure data privacy for sensitive CTI documents.
#### Software
- PEFT 0.18.1
- Transformers
- BitsAndBytes
- PyTorch 2.5.1