# darkforensic-7b

Threat-intelligence dark-web assistant for European CISOs and DPOs. Spanish-first. Defensive role only. Built by Neural Ghost, a small, European-hosted threat-intelligence platform.
darkforensic-7b is a QLoRA fine-tune of Qwen2.5-7B-Instruct, trained on synthetic Q&A pairs generated by Anthropic Claude Sonnet 4.6 from a curated corpus of real dark-web findings (Tor + I2P, ~3,300 sites across the categories marketplace, hacking, fraud, forum, leak, cyberattack, services, and news) and filtered for quality, defensive intent, and toxicity.
This repo includes both:

- `adapter_model.safetensors`: the LoRA adapter (309 MB), for use with `peft` + the Qwen2.5-7B-Instruct base.
- `darkforensic-7b-lora.gguf`: the same LoRA in GGUF format (155 MB), for use with Ollama or `llama.cpp` against any Qwen2.5-7B GGUF base.
## What it does well
Answers operational questions that a CISO, DPO, or SOC analyst would ask about dark-web findings, in Spanish, with references to source findings when applicable, and without fabricating facts when the finding is silent.
## Evaluation results

Benchmarked against the qwen2.5:3b base on a held-out eval set of synthetic Q&A from real findings, judged by Claude Sonnet 4.6 on a 1-5 rubric across four dimensions:
| Metric | Base (qwen2.5:3b) | darkforensic-7b | Delta |
|---|---|---|---|
| Relevance | 3.23 | 4.64 | +43.7% |
| Faithfulness (cites source) | 2.74 | 4.04 | +47.4% |
| Conciseness | 2.59 | 4.24 | +63.7% |
| Defensiveness | 4.31 | 4.52 | +4.9% |
Note on defensiveness: in 2 of 25 evaluated cases the model reproduced source .onion URLs verbatim from the retrieved context. This is a known limitation of LoRA fine-tunes over corpora containing IOCs. See "Production deployment" below: the recommended deployment wraps inference with an IOC redaction filter.
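For reference, the judging loop looks roughly like the sketch below. This is illustrative, not the published eval harness: the model ID, rubric wording, and score parsing are assumptions; only the Anthropic `messages.create` call is standard SDK usage.

```python
# Minimal LLM-as-judge sketch: score one answer on one rubric dimension (1-5).
# Hypothetical model ID and rubric text; the messages.create call is the real SDK API.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = (
    "You are an evaluator. Score the dimension '{dim}' of the answer from 1 to 5, "
    "given the question and the source finding. Reply with the number only."
)

def judge(dim: str, question: str, finding: str, answer: str) -> int:
    msg = client.messages.create(
        model="claude-sonnet-4-6",  # hypothetical ID for "Claude Sonnet 4.6"
        max_tokens=8,
        system=RUBRIC.format(dim=dim),
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nFinding: {finding}\nAnswer: {answer}",
        }],
    )
    return int(msg.content[0].text.strip())
```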
## What it explicitly does NOT do
- Explain how to attack systems, buy stolen data, or access marketplaces. All training Q&A pairs were filtered for defensive-only intent.
- Provide attribution to nation-states or named individuals.
- Process or generate content classified as CSAM. Kill-switch inherited from the Neural Ghost CSAM-001 policy (see LICENSE).
## Intended use
Internal threat-intel assistants for European mid-market regulated organizations (cooperative banks, health-insurance mutuals, private hospitals, mid-sized city councils, law firms). Designed for the European compliance context (Schrems II, GDPR, NIS2, DORA).
NOT intended for general-purpose chatbots, English-language threat-intel, forensic analysis without RAG context, or autonomous decision-making.
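Since the model is meant to answer over retrieved findings rather than from memory, deployments should inject the relevant finding text into the prompt. A minimal sketch of prompt assembly (how findings are retrieved is up to your own index and is not shown):

```python
# Assemble chat turns from retrieved finding texts (retrieval itself not shown).
def build_messages(question: str, findings: list[str]) -> list[dict]:
    context = "\n\n".join(f"[Hallazgo {i + 1}] {f}" for i, f in enumerate(findings))
    return [
        {"role": "system", "content": "Eres darkforensic, asistente threat-intel defensivo."},
        {"role": "user", "content": f"Contexto:\n{context}\n\nPregunta: {question}"},
    ]
```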
## How to use

### With Ollama (recommended for VPS deployment)
```bash
# Base GGUF of Qwen2.5-7B-Instruct from a community quant
wget https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf

# This LoRA in GGUF format
wget https://huggingface.co/jmpicon2026/darkforensic-7b/resolve/main/darkforensic-7b-lora.gguf

# Modelfile
cat > Modelfile <<EOF
FROM ./Qwen2.5-7B-Instruct-Q4_K_M.gguf
ADAPTER ./darkforensic-7b-lora.gguf
PARAMETER temperature 0.4
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
SYSTEM """Eres darkforensic, asistente experto en threat-intelligence dark-web para CISOs y DPOs europeos. Respondes en castellano, factual, conciso, defensivo. NUNCA cites URLs .onion completas, redactalas como [URL .onion redactada]. NUNCA proporciones credenciales literales. Tu rol es DEFENSIVO."""
EOF

ollama create darkforensic-7b -f Modelfile
ollama run darkforensic-7b "Que hacer si detectamos credenciales nuestras en un combo-list?"
```
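Once the model is created, it can also be queried programmatically over the standard Ollama REST API (the prompt below is illustrative):

```python
# Query the local Ollama server; /api/generate is the standard Ollama endpoint.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "darkforensic-7b",
        "prompt": "Resume los riesgos de este hallazgo para un hospital privado.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```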
### With Transformers + PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype="bfloat16", device_map="auto"
)
model = PeftModel.from_pretrained(base, "jmpicon2026/darkforensic-7b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "system", "content": "Eres darkforensic, asistente threat-intel..."},
    {"role": "user", "content": "..."},
]
# add_generation_prompt=True appends the assistant turn header so generation starts cleanly
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=300)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
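For lower-latency serving, the adapter can optionally be merged into the base weights with peft's standard `merge_and_unload` (continuing from the snippet above; the output path is illustrative):

```python
# Fold the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("darkforensic-7b-merged")
tokenizer.save_pretrained("darkforensic-7b-merged")
```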
## Production deployment: IOC redaction

Always wrap inference with an IOC redaction filter in production. The model can reproduce .onion URLs, hashes, BTC/XMR addresses, or infostealer combos verbatim from the retrieved context; this is a side-effect of fine-tuning on findings that contain those IOCs. Production deployments must post-process model output before returning it to end-users.

A reference Python implementation lives in the Neural Ghost repo (redact_iocs.py). Sketch:
```python
import re

# v2 (16-char) and v3 (56-char) .onion addresses
ONION = re.compile(r"\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b", re.I)
# MD5 / SHA-1 / SHA-256 / SHA-512 hex digests
HASH = re.compile(r"\b(?:[a-f0-9]{32}|[a-f0-9]{40}|[a-f0-9]{64}|[a-f0-9]{128})\b", re.I)
# Bech32 and legacy Base58 Bitcoin addresses
BTC = re.compile(r"\b(?:bc1[a-z0-9]{8,87}|[13][a-km-zA-HJ-NP-Z1-9]{25,34})\b")
# email:password combo lines from infostealer dumps
COMBO = re.compile(r"\b[\w.+-]+@[\w.-]+\.[a-z]{2,}\s*[:|]\s*\S{4,}\b", re.I)

def redact(text: str) -> str:
    # keep the first 8 chars of .onion hosts so analysts can still correlate findings
    text = ONION.sub(lambda m: m.group(0)[:8] + "...[REDACTED-onion]", text)
    text = HASH.sub("[REDACTED-hash]", text)
    text = BTC.sub("[REDACTED-btc]", text)
    text = COMBO.sub("[REDACTED-combo]", text)
    return text
```
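Wiring this in means post-processing every completion before it leaves the service. In the sketch below, `generate_answer` is a hypothetical stand-in for whatever inference call you use:

```python
def safe_answer(prompt: str) -> str:
    raw = generate_answer(prompt)  # placeholder for your Ollama / transformers call
    return redact(raw)

# Verbatim IOCs in model output are masked before delivery:
sample = "Visto en abcdefghijklmnop.onion, combo admin@example.com:hunter22"
print(redact(sample))
# -> Visto en abcdefgh...[REDACTED-onion], combo [REDACTED-combo]
```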
This is standard practice among serious threat-intel providers: Recorded Future, Flashpoint, and KELA all post-process IOCs before customer delivery.
## Training details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct (Apache-2.0) |
| Method | QLoRA, r=32, alpha=64, dropout=0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Sequence length | 2048 |
| Effective batch size | 16 (micro 4 × accum 4) |
| Learning rate | 2e-4, cosine schedule, 3% warmup |
| Epochs | 3 |
| Hardware | 1× H100 80GB (RunPod EU), ~26 min |
| Training cost | ~$3 GPU + ~$36 dataset generation = ~$39 total |
| Training data | 3,290 dark-web findings → 9,376 synthetic Q&A pairs (filtered) |
| Train/eval split | 90/10 |
| Initial eval loss | 7.49 |
| Final eval loss | 1.12 |
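For readers reproducing the setup, the adapter hyperparameters above correspond roughly to this peft `LoraConfig` (a sketch; quantization, data handling, and the training loop are omitted):

```python
from peft import LoraConfig

# Mirrors the QLoRA adapter settings from the table above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```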
## Limitations

- Spanish-first: the model can answer in English, but those answers are not optimized.
- Knowledge cutoff: corpus indexed up to 2026-04-30.
- Synthetic-data origins: the model inherits stylistic patterns from Claude Sonnet 4.6.
- Not for autonomous decision-making: it is an assistant to a human analyst.
- Hallucination risk: always verify high-stakes claims against authoritative sources (BleepingComputer, Krebs, CCN-CERT, vendor advisories).
- IOC leakage in raw output: the model reproduces source IOCs in ~8% of answers. Production deployments MUST use the redaction filter above.
## License

Dual-licensed:

- Research / academic / personal use: free under CC-BY-NC-SA 4.0. Cite as below.
- Commercial use: requires a commercial license from Neural Ghost. Contact: hello@neural-ghost.com. Typically 500-2500 EUR/year.

Full license terms are in the LICENSE file.
## Citation

```bibtex
@misc{darkforensic-7b-2026,
  title={darkforensic-7b: A Spanish threat-intelligence dark-web assistant fine-tuned from Qwen2.5-7B-Instruct},
  author={Bobal, Jose and {Neural Ghost contributors}},
  year={2026},
  howpublished={\url{https://huggingface.co/jmpicon2026/darkforensic-7b}},
}
```
## Acknowledgements

- Qwen team: base model under Apache-2.0
- Anthropic: synthetic-data tooling (Sonnet 4.6)
- Axolotl: training pipeline
- llama.cpp: GGUF conversion
- bartowski: base model GGUF quants

Built with patience by Jose Bobal as part of Neural Ghost. European-hosted threat intel for organizations the American giants don't serve.