---
base_model: mistralai/Mistral-7B-v0.3
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:mistralai/Mistral-7B-v0.3
- lora
- transformers
- cyber-threat-intelligence
- cti
- ner
- information-extraction
license: apache-2.0
datasets:
- mrmoor/cyber-threat-intelligence
language:
- en
---

# Model Card for Mistral-7B CTI Extractor

This model is a fine-tuned version of Mistral-7B-v0.3 designed to automate the extraction of Cyber Threat Intelligence (CTI) from unstructured security reports. It transforms raw technical text into a structured JSON format containing cybersecurity entities (e.g., Malware, Threat Actors, Attack Patterns, and Indicators of Compromise).

## Model Details

### Model Description

This model uses QLoRA (Quantized Low-Rank Adaptation) to efficiently adapt the Mistral-7B base model for the highly specific task of Named Entity Recognition (NER) in the cybersecurity domain. The model outputs a strict JSON structure, making it well suited for integration into automated RAG pipelines or autonomous agent workflows (like LangGraph).

- **Developed by:** Alex Bueno
- **Model type:** Causal Language Model with LoRA adapters (PEFT)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** `mistralai/Mistral-7B-v0.3`

### Model Sources

- **Repository:** https://huggingface.co/AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor

## Uses

### Direct Use

The model is designed to be queried directly with unstructured cybersecurity text (such as threat reports, blog posts, or logs) using a specific prompt template. It extracts the relevant entities and returns them as a structured JSON array.
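
For illustration, a successful extraction returns an array along these lines (the entity labels below are assumptions for the sketch, not the model's guaranteed schema; consult the dataset's label set):

```python
import json

# Hypothetical completion for:
# "The attacker used IP 192.168.1.50 to deliver the Emotet payload via phishing."
# The "type" values here are illustrative.
raw_response = '''
[
  {"entity": "192.168.1.50", "type": "IP_ADDRESS"},
  {"entity": "Emotet", "type": "MALWARE"},
  {"entity": "phishing", "type": "ATTACK_PATTERN"}
]
'''

# Because the output is plain JSON, downstream code can parse it directly.
entities = json.loads(raw_response)
print(entities[1]["type"])  # MALWARE
```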

### Downstream Use

- **Multi-Agent Systems:** as a dedicated tool node that an orchestrator agent calls to extract structured data before querying a vector database or SQL store.
- **CTI Pipelines:** automated ingestion and structuring of daily threat reports into a local database.
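
The tool-node pattern can be sketched as a plain function that an orchestrator invokes; `generate_fn` below is a hypothetical stand-in for the actual model call, not part of this repository:

```python
import json

def extract_entities(report_text: str, generate_fn) -> list[dict]:
    """Hypothetical tool-node wrapper: generate_fn is any callable that runs
    the prompt through the fine-tuned model and returns the raw completion."""
    prompt = (
        "### Instruction: Extract cyber threat entities in JSON format.\n"
        f"### Input: {report_text}\n"
        "### Response: "
    )
    raw = generate_fn(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []  # fall back to an empty list on malformed output

# Usage with a stubbed model call:
stub = lambda p: '[{"entity": "Emotet", "type": "MALWARE"}]'
print(extract_entities("Emotet was delivered via phishing.", stub))
```

Returning an empty list on parse failure keeps the orchestrator loop running; a production agent might instead retry with a repair prompt.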

## Bias, Risks, and Limitations

The model may suffer from prior-knowledge bias, which can lead it to insert threat-actor or malware names that are semantically related but not explicitly mentioned in the input text.

### Recommendations

- **Temperature:** use a low temperature (`temperature=0.1` or `0.0`) during inference to ensure near-deterministic extraction.
- **Validation:** use Pydantic or structured-decoding libraries (such as `Outlines` or `Guidance`) in production to enforce the JSON grammar, as the model may occasionally produce malformed JSON.
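
As a minimal sketch of the validation step, using only the standard library (in production the Pydantic or structured-decoding route above is preferable; the `entity`/`type` schema here is an illustrative assumption):

```python
import json

REQUIRED_KEYS = {"entity", "type"}  # illustrative schema, not guaranteed

def validate_extraction(raw: str) -> list[dict]:
    """Reject malformed model output before it reaches the database."""
    data = json.loads(raw)  # raises on broken JSON syntax
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of entities")
    for item in data:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"entity missing keys: {missing}")
    return data

print(validate_extraction('[{"entity": "Emotet", "type": "MALWARE"}]'))
```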

## How to Get Started with the Model

Use the code below to load the quantized base model and apply the LoRA adapters for inference:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

MODEL_NAME = "mistralai/Mistral-7B-v0.3"
ADAPTER_REPO = "AlexXBueno/Mistral-7B-Cyber-Thread-Intelligence-Extractor"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="sdpa"
)
base_model.config.use_cache = True

# Attach the LoRA adapters (loaded on top of the base model, not merged)
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
model.eval()

# Inference
text = "The attacker used IP 192.168.1.50 to deliver the Emotet payload via phishing."
prompt = (
    f"### Instruction: Extract cyber threat entities in JSON format.\n"
    f"### Input: {text}\n"
    f"### Response: "
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:")[1].strip())
```

## Training Details

### Training Data

The model was fine-tuned on the `mrmoor/cyber-threat-intelligence` dataset, which contains annotated cybersecurity entities.

### Training Procedure

#### Preprocessing

A custom data collator (`CTICompletionCollator`) was implemented for training. It computes the loss only on the JSON response generated by the model: the instruction and input tokens are masked with `-100` labels so the model does not learn to reproduce the prompt and focuses entirely on generating the JSON structure.
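
The masking idea can be sketched as follows (the real `CTICompletionCollator` is not reproduced here; this only illustrates the `-100` labeling on a list of token ids):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_prompt_labels(input_ids: list[int], response_start: int) -> list[int]:
    """Copy input_ids into labels, masking everything before the response."""
    labels = list(input_ids)
    labels[:response_start] = [IGNORE_INDEX] * response_start
    return labels

tokens = [10, 11, 12, 13, 14]  # prompt tokens: 10-12, response tokens: 13-14
print(mask_prompt_labels(tokens, 3))  # [-100, -100, -100, 13, 14]
```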

#### Training Hyperparameters

- Training regime: QLoRA (4-bit base model, 16-bit adapters)
- Epochs: 3
- Learning Rate: 2e-4
- Batch Size: 2
- Gradient Accumulation Steps: 8
- Optimizer: AdamW
- LR Scheduler: Linear
- LoRA Rank (r): 8
- LoRA Alpha: 32
- LoRA Dropout: 0.05
- Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
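
Assuming the standard `peft` API, the adapter settings above correspond to a `LoraConfig` along these lines (`bias` and `task_type` are assumptions, as they are not stated in the list):

```python
from peft import LoraConfig

# Reconstruction of the adapter configuration from the hyperparameters above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",           # assumption: the common default for QLoRA runs
    task_type="CAUSAL_LM", # assumption: consistent with the model type above
)
```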

## Technical Specifications

### Model Architecture and Objective

The objective is strictly Information Extraction (IE), framed as an instruction-following generation task.

### Compute Infrastructure

The entire stack was developed and validated on local infrastructure, avoiding cloud dependencies to ensure data privacy for sensitive CTI documents.

#### Software

- PEFT 0.18.1
- Transformers
- BitsAndBytes
- PyTorch 2.5.1