---
base_model: mistralai/Mistral-7B-v0.3
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:mistralai/Mistral-7B-v0.3
- lora
- transformers
- cyber-threat-intelligence
- cti
- ner
- information-extraction
license: apache-2.0
datasets:
- mrmoor/cyber-threat-intelligence
language:
- en
---

# Model Card for Mistral-7B CTI Extractor

This model is a fine-tuned version of Mistral-7B-v0.3 designed to automate the extraction of Cyber Threat Intelligence (CTI) from unstructured security reports. It transforms raw technical text into a structured JSON format containing cybersecurity entities (e.g., Malware, Threat Actors, Attack Patterns, and Indicators of Compromise).

## Model Details

### Model Description

This model uses QLoRA (Quantized Low-Rank Adaptation) to efficiently adapt the Mistral-7B base model for the highly specific task of Named Entity Recognition (NER) in the cybersecurity domain. The model outputs a strict JSON structure, making it well suited for integration into automated RAG pipelines or autonomous agent workflows (like LangGraph).

- **Developed by:** Alex Bueno
- **Model type:** Causal Language Model with LoRA adapters (PEFT)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** `mistralai/Mistral-7B-v0.3`

### Model Sources

- **Repository:** https://huggingface.co/AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor

## Uses

### Direct Use

The model is designed to be queried directly with unstructured cybersecurity text (such as threat reports, blog posts, or logs) using a specific prompt template. It extracts the relevant entities and returns them as a structured JSON array.
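
For illustration, a successful extraction returns an array along these lines (the entity labels below are assumptions for the sketch, not the model's guaranteed schema; consult the dataset's label set):

```python
import json

# Hypothetical completion for:
# "The attacker used IP 192.168.1.50 to deliver the Emotet payload via phishing."
# The "type" values here are illustrative.
raw_response = '''
[
  {"entity": "192.168.1.50", "type": "IP_ADDRESS"},
  {"entity": "Emotet", "type": "MALWARE"},
  {"entity": "phishing", "type": "ATTACK_PATTERN"}
]
'''

# Because the output is plain JSON, downstream code can parse it directly.
entities = json.loads(raw_response)
print(entities[1]["type"])  # MALWARE
```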

### Downstream Use

- **Multi-Agent Systems:** as a dedicated tool node that an orchestrator agent calls to extract structured data before querying a vector database or SQL store.
- **CTI Pipelines:** automated ingestion and structuring of daily threat reports into a local database.
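
The tool-node pattern can be sketched as a plain function that an orchestrator invokes; `generate_fn` below is a hypothetical stand-in for the actual model call, not part of this repository:

```python
import json

def extract_entities(report_text: str, generate_fn) -> list[dict]:
    """Hypothetical tool-node wrapper: generate_fn is any callable that runs
    the prompt through the fine-tuned model and returns the raw completion."""
    prompt = (
        "### Instruction: Extract cyber threat entities in JSON format.\n"
        f"### Input: {report_text}\n"
        "### Response: "
    )
    raw = generate_fn(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []  # fall back to an empty list on malformed output

# Usage with a stubbed model call:
stub = lambda p: '[{"entity": "Emotet", "type": "MALWARE"}]'
print(extract_entities("Emotet was delivered via phishing.", stub))
```

Returning an empty list on parse failure keeps the orchestrator loop running; a production agent might instead retry with a repair prompt.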

## Bias, Risks, and Limitations

The model may suffer from prior-knowledge bias, which can lead it to insert threat-actor or malware names that are semantically related but not explicitly mentioned in the input text.

### Recommendations

- **Temperature:** use a low temperature (`temperature=0.1` or `0.0`) during inference to ensure near-deterministic extraction.
- **Validation:** use Pydantic or structured-decoding libraries (such as `Outlines` or `Guidance`) in production to enforce the JSON grammar, as the model may occasionally produce malformed JSON.
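
As a minimal sketch of the validation step, using only the standard library (in production the Pydantic or structured-decoding route above is preferable; the `entity`/`type` schema here is an illustrative assumption):

```python
import json

REQUIRED_KEYS = {"entity", "type"}  # illustrative schema, not guaranteed

def validate_extraction(raw: str) -> list[dict]:
    """Reject malformed model output before it reaches the database."""
    data = json.loads(raw)  # raises on broken JSON syntax
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of entities")
    for item in data:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"entity missing keys: {missing}")
    return data

print(validate_extraction('[{"entity": "Emotet", "type": "MALWARE"}]'))
```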

## How to Get Started with the Model

Use the code below to load the quantized base model and apply the LoRA adapters for inference:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

MODEL_NAME = "mistralai/Mistral-7B-v0.3"
ADAPTER_REPO = "AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="sdpa"
)
base_model.config.use_cache = True

# Attach the LoRA adapters (loaded on top of the base model, not merged)
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
model.eval()

# Inference
text = "The attacker used IP 192.168.1.50 to deliver the Emotet payload via phishing."
prompt = (
    f"### Instruction: Extract cyber threat entities in JSON format.\n"
    f"### Input: {text}\n"
    f"### Response: "
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:")[1].strip())
```

## Training Details

### Training Data

The model was fine-tuned on the `mrmoor/cyber-threat-intelligence` dataset, which contains annotated cybersecurity entities.

### Training Procedure

#### Preprocessing

A custom data collator (`CTICompletionCollator`) was implemented for training. It computes the loss only on the JSON response generated by the model: the instruction and input tokens are masked with `-100` labels so the model does not learn to reproduce the prompt and focuses entirely on generating the JSON structure.
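
The masking idea can be sketched as follows (the real `CTICompletionCollator` is not reproduced here; this only illustrates the `-100` labeling on a list of token ids):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_prompt_labels(input_ids: list[int], response_start: int) -> list[int]:
    """Copy input_ids into labels, masking everything before the response."""
    labels = list(input_ids)
    labels[:response_start] = [IGNORE_INDEX] * response_start
    return labels

tokens = [10, 11, 12, 13, 14]  # prompt tokens: 10-12, response tokens: 13-14
print(mask_prompt_labels(tokens, 3))  # [-100, -100, -100, 13, 14]
```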

#### Training Hyperparameters

- Training regime: QLoRA (4-bit base model, 16-bit adapters)
- Epochs: 3
- Learning Rate: 2e-4
- Batch Size: 2
- Gradient Accumulation Steps: 8
- Optimizer: AdamW
- LR Scheduler: Linear
- LoRA Rank (r): 8
- LoRA Alpha: 32
- LoRA Dropout: 0.05
- Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
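
Assuming the standard `peft` API, the adapter settings above correspond to a `LoraConfig` along these lines (`bias` and `task_type` are assumptions, as they are not stated in the list):

```python
from peft import LoraConfig

# Reconstruction of the adapter configuration from the hyperparameters above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",           # assumption: the common default for QLoRA runs
    task_type="CAUSAL_LM", # assumption: consistent with the model type above
)
```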

## Technical Specifications

### Model Architecture and Objective

The objective is strictly Information Extraction (IE), framed as an instruction-following generation task.

### Compute Infrastructure

The entire stack was developed and validated on local infrastructure, avoiding cloud dependencies to ensure data privacy for sensitive CTI documents.

#### Software

- PEFT 0.18.1
- Transformers
- BitsAndBytes
- PyTorch 2.5.1