darkforensic-7b

Threat-intelligence dark-web assistant for European CISOs and DPOs. Spanish-first. Defensive role only. Built by Neural Ghost, a small, European-hosted threat-intelligence platform.

darkforensic-7b is a QLoRA fine-tune of Qwen2.5-7B-Instruct trained on a curated corpus of real dark-web findings (Tor + I2P, ~3,300 sites across categories: marketplace, hacking, fraud, forum, leak, cyberattack, services, news) with synthetic Q&A pairs generated by Anthropic Claude Sonnet 4.6 and filtered for quality, defensive intent, and toxicity.

This repo includes both:

  • adapter_model.safetensors β€” the LoRA adapter (309 MB) for use with peft + Qwen2.5-7B-Instruct base.
  • darkforensic-7b-lora.gguf β€” the same LoRA in GGUF format (155 MB) for use with Ollama or llama.cpp against any Qwen2.5-7B GGUF base.

What it does well

Answers operational questions that a CISO, DPO, or SOC analyst would ask about dark-web findings, in Spanish, citing the source finding where applicable and declining to fabricate facts when the finding is silent.
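This card does not document the retrieval prompt schema. Purely as a hedged illustration of how a retrieved finding might be injected into the user turn, here is a minimal sketch; every field name and identifier below is hypothetical, not the actual Neural Ghost pipeline:

```python
# Hypothetical sketch only: the real retrieval schema and prompt template
# used in production are not documented in this card.
finding = {
    "id": "F-2026-0412",      # hypothetical finding identifier
    "category": "leak",
    "summary": "Combo-list with 1,200 credentials for *.example.es domains",
}

question = "Que riesgo supone este hallazgo para nuestra organizacion?"
# ("What risk does this finding pose to our organization?")

user_turn = (
    f"Finding [{finding['id']}] ({finding['category']}): {finding['summary']}\n\n"
    f"Pregunta: {question}"
)
```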

Evaluation results

Benchmarked against the qwen2.5:3b base model on a held-out eval set of synthetic Q&A derived from real findings, judged by Claude Sonnet 4.6 on a 1-5 rubric across four dimensions:

| Metric | Base (qwen2.5:3b) | darkforensic-7b | Delta |
|---|---|---|---|
| Relevance | 3.23 | 4.64 | +43.7% |
| Faithfulness (cites source) | 2.74 | 4.04 | +47.4% |
| Conciseness | 2.59 | 4.24 | +63.7% |
| Defensiveness | 4.31 | 4.52 | +4.9% |

Note on defensiveness: in 2 of 25 evaluated cases the model reproduced source .onion URLs verbatim from the retrieved context. This is a known limitation of LoRA fine-tunes over corpora containing IOCs. See "Production deployment" below: the recommended deployment wraps inference with an IOC redaction filter.

What it explicitly does NOT do

  • Explain how to attack systems, buy stolen data, or access marketplaces. All training Q&A pairs were filtered for defensive-only intent.
  • Provide attribution to nation-states or named individuals.
  • Process or generate content classified as CSAM. Kill-switch inherited from the Neural Ghost CSAM-001 policy (see LICENSE).

Intended use

Internal threat-intel assistants for European mid-market regulated organizations (cooperative banks, health-insurance mutuals, private hospitals, mid-sized municipal governments, law firms). Designed for the European compliance context (Schrems II, GDPR, NIS2, DORA).

NOT intended for general-purpose chatbots, English-language threat-intel, forensic analysis without RAG context, or autonomous decision-making.

How to use

With Ollama (recommended for VPS deployment)

```bash
# Base GGUF of Qwen2.5-7B-Instruct from a community quant
wget https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf

# This LoRA in GGUF format
wget https://huggingface.co/jmpicon2026/darkforensic-7b/resolve/main/darkforensic-7b-lora.gguf

# Modelfile
cat > Modelfile <<EOF
FROM ./Qwen2.5-7B-Instruct-Q4_K_M.gguf
ADAPTER ./darkforensic-7b-lora.gguf
PARAMETER temperature 0.4
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
SYSTEM """Eres darkforensic, asistente experto en threat-intelligence dark-web para CISOs y DPOs europeos. Respondes en castellano, factual, conciso, defensivo. NUNCA cites URLs .onion completas, redactalas como [URL .onion redactada]. NUNCA proporciones credenciales literales. Tu rol es DEFENSIVO."""
EOF

ollama create darkforensic-7b -f Modelfile

# "What should we do if we detect our own credentials in a combo-list?"
ollama run darkforensic-7b "Que hacer si detectamos credenciales nuestras en un combo-list?"
```
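Once created, the model can also be queried programmatically over Ollama's local HTTP API. A minimal sketch, assuming Ollama is serving on its default port (11434) and the model was created as above:

```python
import json
import urllib.request

# Query the local Ollama server via its /api/chat endpoint.
payload = {
    "model": "darkforensic-7b",
    "messages": [
        # "What should we do if we detect our own credentials in a combo-list?"
        {"role": "user", "content": "Que hacer si detectamos credenciales nuestras en un combo-list?"}
    ],
    "stream": False,  # return one complete JSON response instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])
```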

With Transformers + PEFT

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter on top of it
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "jmpicon2026/darkforensic-7b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "system", "content": "Eres darkforensic, asistente threat-intel..."},
    {"role": "user", "content": "..."},
]
# add_generation_prompt=True appends the assistant header so the model
# starts generating a reply instead of continuing the user turn
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=300)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
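For latency-sensitive serving you can also merge the adapter into the base weights. This is a standard PEFT operation, not something specific to this repo; the output directory name below is just an example:

```python
# Merge the LoRA weights into the base model and drop the adapter wrappers;
# the merged model can then be saved and served like a plain Qwen2.5 checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("darkforensic-7b-merged")
tokenizer.save_pretrained("darkforensic-7b-merged")
```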

Production deployment: IOC redaction

Always wrap inference with an IOC redaction filter in production.

The model can reproduce .onion URLs, hashes, BTC/XMR addresses, or infostealer combos verbatim from the retrieved context (this is a side-effect of fine-tuning on findings that contain those IOCs). Production deployment must post-process model output before returning to end-users.

Reference Python implementation in the Neural Ghost repo (redact_iocs.py). Sketch:

```python
import re

# Tor v3 (56-char) and legacy v2 (16-char) .onion hostnames
ONION = re.compile(r"\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b", re.I)
# MD5 / SHA-1 / SHA-256 / SHA-512 hex digests
HASH  = re.compile(r"\b(?:[a-f0-9]{32}|[a-f0-9]{40}|[a-f0-9]{64}|[a-f0-9]{128})\b", re.I)
# Bech32 (bc1...) and legacy Base58 Bitcoin addresses
BTC   = re.compile(r"\b(?:bc1[a-z0-9]{8,87}|[13][a-km-zA-HJ-NP-Z1-9]{25,34})\b")
# email:password / email|password combo-list entries
COMBO = re.compile(r"\b[\w.+-]+@[\w.-]+\.[a-z]{2,}\s*[:|]\s*\S{4,}\b", re.I)

def redact(text: str) -> str:
    # Keep an 8-char .onion prefix so analysts can still correlate findings
    # without the full address being exposed
    text = ONION.sub(lambda m: m.group(0)[:8] + "...[REDACTED-onion]", text)
    text = HASH.sub("[REDACTED-hash]", text)
    text = BTC.sub("[REDACTED-btc]", text)
    text = COMBO.sub("[REDACTED-combo]", text)
    return text
```
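For instance, applied to a synthetic answer containing three IOC types (the values below are placeholders, not real IOCs):

```python
# Synthetic example output; 'a' * 56 stands in for a v3 .onion hostname
sample = (
    f"Leak posted at {'a' * 56}.onion; "
    "MD5 d41d8cd98f00b204e9800998ecf8427e; "
    "combo admin@example.com:hunter2"
)
print(redact(sample))
# Leak posted at aaaaaaaa...[REDACTED-onion]; MD5 [REDACTED-hash]; combo [REDACTED-combo]
```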

This is standard practice among serious threat-intel providers (Recorded Future, Flashpoint, and KELA all post-process IOCs before customer delivery).

Training details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct (Apache-2.0) |
| Method | QLoRA, r=32, alpha=64, dropout=0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Sequence length | 2048 |
| Effective batch size | 16 (micro 4 × accum 4) |
| Learning rate | 2e-4, cosine schedule, 3% warmup |
| Epochs | 3 |
| Hardware | 1× H100 80GB (RunPod EU), ~26 min |
| Training cost | ~$3 GPU + ~$36 dataset generation = ~$39 total |
| Training data | 3,290 dark-web findings → 9,376 synthetic Q&A pairs (filtered) |
| Train/eval split | 90/10 |
| Initial eval loss | 7.49 |
| Final eval loss | 1.12 |
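For reproducibility, the adapter hyperparameters in the table map onto a peft LoraConfig roughly as follows. This is a sketch implied by the table, not the published training script:

```python
from peft import LoraConfig

# Adapter configuration implied by the table above; the actual
# training script is not part of this repo.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```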

Limitations

  • Spanish-first. English answers exist but are not optimized.
  • Knowledge cutoff: corpus indexed up to 2026-04-30.
  • Synthetic-data origins: Claude Sonnet 4.6 stylistic patterns inherited.
  • Not for autonomous decision-making. Assistant to a human analyst.
  • Hallucination risk: always verify high-stakes claims against authoritative sources (BleepingComputer, Krebs, CCN-CERT, vendor advisories).
  • IOC leakage in raw output: the model reproduces source IOCs in ~8% of answers. Production deployment MUST use the redaction filter.

License

Dual-licensed:

  1. Research / academic / personal use: free under CC-BY-NC-SA 4.0. Cite as below.
  2. Commercial use: requires a commercial license from Neural Ghost. Contact: hello@neural-ghost.com. Typically 500-2500 EUR/year.

Full license terms in LICENSE file.

Citation

```bibtex
@misc{darkforensic-7b-2026,
  title={darkforensic-7b: A Spanish threat-intelligence dark-web assistant fine-tuned from Qwen2.5-7B-Instruct},
  author={Bobal, Jose and {Neural Ghost contributors}},
  year={2026},
  howpublished={\url{https://huggingface.co/jmpicon2026/darkforensic-7b}},
}
```

Acknowledgements


Built with patience by Jose Bobal as part of Neural Ghost. European-hosted threat-intel for the organizations the American giants don't serve.
