---
base_model: mistralai/Mistral-7B-v0.3
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:mistralai/Mistral-7B-v0.3
- lora
- transformers
- cyber-threat-intelligence
- cti
- ner
- information-extraction
license: apache-2.0
datasets:
- mrmoor/cyber-threat-intelligence
language:
- en
---

# Model Card for Mistral-7B Cyber Threat Intelligence Extractor

This model is a fine-tuned version of Mistral-7B-v0.3 designed to automate the extraction of Cyber Threat Intelligence (CTI) from unstructured security reports. 
It transforms raw, technical text into structured JSON format containing cybersecurity entities (e.g., Malware, Threat Actors, Attack Patterns, and Indicators of Compromise).


## Model Details

### Model Description

This model uses QLoRA (Quantized Low-Rank Adaptation) to efficiently adapt the Mistral-7B base model for the highly specific task of Named Entity Recognition (NER) in the cybersecurity domain. 
The model outputs a strict JSON structure, making it ideal for integration into automated RAG pipelines or autonomous agent workflows (like LangGraph).

- **Developed by:** Alex Bueno
- **Model type:** Causal Language Model with LoRA adapters (PEFT)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** `mistralai/Mistral-7B-v0.3`

### Model Sources

- **Repository:** https://huggingface.co/AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor


## Uses

### Direct Use

The model is designed to be directly queried with unstructured cybersecurity text (like threat reports, blogs, or logs) using a specific prompt template. 
It will extract relevant entities and return them as a structured JSON array.

### Downstream Use

- **Multi-Agent Systems:** As a specific Tool Node for an orchestrator agent to extract structured data before querying a Vector Database or SQL.
- **CTI Pipelines:** Automated ingestion and structuring of daily threat reports into a local database.


## Bias, Risks, and Limitations

The model may suffer from prior-knowledge bias: it can insert threat actors or malware names that are semantically related to the input but not explicitly mentioned in it. Always verify extracted entities against the source text before ingesting them into downstream systems.

### Recommendations

- **Temperature:** Use a low temperature (`temperature=0.1` or `0.0`) during inference to keep extraction as deterministic as possible.
- **Validation:** Use Pydantic or a structured-decoding library (such as `Outlines` or `Guidance`) in production to enforce the JSON grammar, as the model may occasionally produce malformed JSON.
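As a minimal stdlib sketch of that validation step (the `entity`/`type` field names are an assumption, since this card does not pin down the exact schema; in production, Pydantic expresses the same checks declaratively):

```python
import json

# Hypothetical entity schema -- adjust "entity"/"type" to the
# field names the model actually emits.
REQUIRED_KEYS = {"entity", "type"}

def parse_entities(raw: str) -> list[dict]:
    """Parse and minimally validate the model's JSON response."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of entities")
    for item in data:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"entity missing keys: {missing}")
    return data

raw = '[{"entity": "Emotet", "type": "malware"}, {"entity": "192.168.1.50", "type": "IPv4"}]'
print([e["entity"] for e in parse_entities(raw)])  # ['Emotet', '192.168.1.50']
```

Rejecting a response here (rather than silently ingesting it) lets a pipeline retry generation or flag the document for manual review.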


## How to Get Started with the Model

Use the code below to load the quantized base model and apply the LoRA adapters for inference:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

MODEL_NAME = "mistralai/Mistral-7B-v0.3"
ADAPTER_REPO = "AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Configure 4-bit Quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load Base Model
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="sdpa"
)
base_model.config.use_cache = True

# Load LoRA Adapters
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
model.eval()

# Inference
text = "The attacker used IP 192.168.1.50 to deliver the Emotet payload via phishing."
prompt = (
    f"### Instruction: Extract cyber threat entities in JSON format.\n"
    f"### Input: {text}\n"
    f"### Response: "
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:")[1].strip())
```

## Training Details

### Training Data

The model was fine-tuned on the `mrmoor/cyber-threat-intelligence` dataset, which contains annotated cybersecurity entities.

### Training Procedure

#### Preprocessing

A custom data collator (`CTICompletionCollator`) was implemented for training. 
It computes the loss only on the JSON response generated by the model: the instruction and input tokens are masked with `-100` labels, so the model does not learn to regenerate the prompt and focuses entirely on producing the JSON structure.
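The masking idea can be sketched as follows (the real collator is not shown in this card; this only illustrates the `-100` labeling of the prompt span):

```python
# Minimal sketch of prompt-token masking, as used by causal-LM
# completion collators: labels matching the prompt are set to -100,
# which PyTorch's cross-entropy loss ignores, so gradients come only
# from the response (here, the JSON) tokens.
def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    labels = list(input_ids)
    labels[:prompt_len] = [-100] * prompt_len
    return labels

tokens = [101, 7, 8, 9, 42, 43, 44]   # first 4 tokens = prompt
print(mask_prompt_labels(tokens, 4))  # [-100, -100, -100, -100, 42, 43, 44]
```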

#### Training Hyperparameters
- Training regime: QLoRA (4-bit base model, 16-bit adapters)
- Epochs: 3
- Learning Rate: 2e-4
- Batch Size: 2
- Gradient Accumulation Steps: 8
- Optimizer: AdamW
- LR Scheduler: Linear
- LoRA Rank (r): 8
- LoRA Alpha: 32
- LoRA Dropout: 0.05
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
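The hyperparameters above map onto a PEFT/Transformers configuration roughly as follows (a config sketch reconstructed from the list, not the exact training script):

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size of 16
    optim="adamw_torch",
    lr_scheduler_type="linear",
    fp16=True,
)
```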


## Technical Specifications

### Model Architecture and Objective

The training objective is strictly Information Extraction (IE), framed as an instruction-following generation task.

### Compute Infrastructure

The entire stack was developed and validated on local infrastructure, avoiding cloud dependencies to ensure data privacy for sensitive CTI documents.

#### Software
- PEFT 0.18.1
- Transformers
- BitsAndBytes
- PyTorch 2.5.1