# darkforensic-7b

Threat-intelligence dark-web assistant for European CISOs and DPOs. Spanish-first. Defensive role only. Built by Neural Ghost, a small, European-hosted threat-intelligence platform.
darkforensic-7b is a QLoRA fine-tune of Qwen2.5-7B-Instruct, trained on synthetic Q&A pairs generated by Anthropic Claude Sonnet 4.6 from a curated corpus of real dark-web findings (Tor + I2P, ~3,300 sites across the categories marketplace, hacking, fraud, forum, leak, cyberattack, services, and news) and filtered for quality, defensive intent, and toxicity.
This repo includes both:

- `adapter_model.safetensors`: the LoRA adapter (309 MB), for use with `peft` + the Qwen2.5-7B-Instruct base.
- `darkforensic-7b-lora.gguf`: the same LoRA in GGUF format (155 MB), for use with Ollama or `llama.cpp` against any Qwen2.5-7B GGUF base.
## What it does well
Answers operational questions that a CISO, DPO, or SOC analyst would ask about dark-web findings, in Spanish, with references to source findings when applicable, and without fabricating facts when the finding is silent.
## Evaluation results

Benchmarked against the qwen2.5:3b base on a held-out eval set of synthetic Q&A from real findings, judged by Claude Sonnet 4.6 on a 1-5 rubric across four dimensions:
| Metric | Base (qwen2.5:3b) | darkforensic-7b | Delta |
|---|---|---|---|
| Relevance | 3.23 | 4.64 | +43.7% |
| Faithfulness (cites source) | 2.74 | 4.04 | +47.4% |
| Conciseness | 2.59 | 4.24 | +63.7% |
| Defensiveness | 4.31 | 4.52 | +4.9% |
Note on defensiveness: in 2 of 25 evaluated cases the model reproduced source .onion URLs verbatim from the retrieved context. This is a known limitation of LoRA fine-tunes over corpora containing IOCs. See "Production deployment" below: the recommended deployment wraps inference with an IOC redaction filter.
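For reference, the judging loop looks roughly like the sketch below. This is illustrative, not the published eval harness: the model ID, rubric wording, and score parsing are assumptions; only the Anthropic `messages.create` call is standard SDK usage.

```python
# Minimal LLM-as-judge sketch: score one answer on one rubric dimension (1-5).
# Hypothetical model ID and rubric text; the messages.create call is the real SDK API.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = (
    "You are an evaluator. Score the dimension '{dim}' of the answer from 1 to 5, "
    "given the question and the source finding. Reply with the number only."
)

def judge(dim: str, question: str, finding: str, answer: str) -> int:
    msg = client.messages.create(
        model="claude-sonnet-4-6",  # hypothetical ID for "Claude Sonnet 4.6"
        max_tokens=8,
        system=RUBRIC.format(dim=dim),
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nFinding: {finding}\nAnswer: {answer}",
        }],
    )
    return int(msg.content[0].text.strip())
```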
## What it explicitly does NOT do
- Explain how to attack systems, buy stolen data, or access marketplaces. All training Q&A pairs were filtered for defensive-only intent.
- Provide attribution to nation-states or named individuals.
- Process or generate content classified as CSAM. Kill-switch inherited from the Neural Ghost CSAM-001 policy (see LICENSE).
## Intended use
Internal threat-intel assistants for European mid-market regulated organizations (cooperative banks, health-insurance mutuals, private hospitals, mid-sized city councils, law firms). Designed for the European compliance context (Schrems II, GDPR, NIS2, DORA).
NOT intended for general-purpose chatbots, English-language threat-intel, forensic analysis without RAG context, or autonomous decision-making.
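Since the model is meant to answer over retrieved findings rather than from memory, deployments should inject the relevant finding text into the prompt. A minimal sketch of prompt assembly (how findings are retrieved is up to your own index and is not shown):

```python
# Assemble chat turns from retrieved finding texts (retrieval itself not shown).
def build_messages(question: str, findings: list[str]) -> list[dict]:
    context = "\n\n".join(f"[Hallazgo {i + 1}] {f}" for i, f in enumerate(findings))
    return [
        {"role": "system", "content": "Eres darkforensic, asistente threat-intel defensivo."},
        {"role": "user", "content": f"Contexto:\n{context}\n\nPregunta: {question}"},
    ]
```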
## How to use

### With Ollama (recommended for VPS deployment)
```bash
# Base GGUF of Qwen2.5-7B-Instruct from a community quant
wget https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf

# This LoRA in GGUF format
wget https://huggingface.co/jmpicon2026/darkforensic-7b/resolve/main/darkforensic-7b-lora.gguf

# Modelfile
cat > Modelfile <<EOF
FROM ./Qwen2.5-7B-Instruct-Q4_K_M.gguf
ADAPTER ./darkforensic-7b-lora.gguf
PARAMETER temperature 0.4
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
SYSTEM """Eres darkforensic, asistente experto en threat-intelligence dark-web para CISOs y DPOs europeos. Respondes en castellano, factual, conciso, defensivo. NUNCA cites URLs .onion completas, redactalas como [URL .onion redactada]. NUNCA proporciones credenciales literales. Tu rol es DEFENSIVO."""
EOF

ollama create darkforensic-7b -f Modelfile
ollama run darkforensic-7b "Que hacer si detectamos credenciales nuestras en un combo-list?"
```
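Once the model is created, it can also be queried programmatically over the standard Ollama REST API (the prompt below is illustrative):

```python
# Query the local Ollama server; /api/generate is the standard Ollama endpoint.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "darkforensic-7b",
        "prompt": "Resume los riesgos de este hallazgo para un hospital privado.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```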
### With Transformers + PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype="bfloat16", device_map="auto"
)
model = PeftModel.from_pretrained(base, "jmpicon2026/darkforensic-7b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "system", "content": "Eres darkforensic, asistente threat-intel..."},
    {"role": "user", "content": "..."},
]
# add_generation_prompt=True appends the assistant turn header so generation starts cleanly
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=300)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
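For lower-latency serving, the adapter can optionally be merged into the base weights with peft's standard `merge_and_unload` (continuing from the snippet above; the output path is illustrative):

```python
# Fold the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("darkforensic-7b-merged")
tokenizer.save_pretrained("darkforensic-7b-merged")
```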
## Production deployment: IOC redaction

Always wrap inference with an IOC redaction filter in production. The model can reproduce .onion URLs, hashes, BTC/XMR addresses, or infostealer combos verbatim from the retrieved context; this is a side-effect of fine-tuning on findings that contain those IOCs. Production deployments must post-process model output before returning it to end-users.

A reference Python implementation lives in the Neural Ghost repo (redact_iocs.py). Sketch:
```python
import re

# v2 (16-char) and v3 (56-char) .onion addresses
ONION = re.compile(r"\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b", re.I)
# MD5 / SHA-1 / SHA-256 / SHA-512 hex digests
HASH = re.compile(r"\b(?:[a-f0-9]{32}|[a-f0-9]{40}|[a-f0-9]{64}|[a-f0-9]{128})\b", re.I)
# Bech32 and legacy Base58 Bitcoin addresses
BTC = re.compile(r"\b(?:bc1[a-z0-9]{8,87}|[13][a-km-zA-HJ-NP-Z1-9]{25,34})\b")
# email:password combo lines from infostealer dumps
COMBO = re.compile(r"\b[\w.+-]+@[\w.-]+\.[a-z]{2,}\s*[:|]\s*\S{4,}\b", re.I)

def redact(text: str) -> str:
    # keep the first 8 chars of .onion hosts so analysts can still correlate findings
    text = ONION.sub(lambda m: m.group(0)[:8] + "...[REDACTED-onion]", text)
    text = HASH.sub("[REDACTED-hash]", text)
    text = BTC.sub("[REDACTED-btc]", text)
    text = COMBO.sub("[REDACTED-combo]", text)
    return text
```
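Wiring this in means post-processing every completion before it leaves the service. In the sketch below, `generate_answer` is a hypothetical stand-in for whatever inference call you use:

```python
def safe_answer(prompt: str) -> str:
    raw = generate_answer(prompt)  # placeholder for your Ollama / transformers call
    return redact(raw)

# Verbatim IOCs in model output are masked before delivery:
sample = "Visto en abcdefghijklmnop.onion, combo admin@example.com:hunter22"
print(redact(sample))
# -> Visto en abcdefgh...[REDACTED-onion], combo [REDACTED-combo]
```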
This is standard practice among serious threat-intel providers: Recorded Future, Flashpoint, and KELA all post-process IOCs before customer delivery.
## Training details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct (Apache-2.0) |
| Method | QLoRA, r=32, alpha=64, dropout=0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Sequence length | 2048 |
| Effective batch size | 16 (micro 4 × accum 4) |
| Learning rate | 2e-4, cosine schedule, 3% warmup |
| Epochs | 3 |
| Hardware | 1× H100 80GB (RunPod EU), ~26 min |
| Training cost | ~$3 GPU + ~$36 dataset generation = ~$39 total |
| Training data | 3,290 dark-web findings → 9,376 synthetic Q&A pairs (filtered) |
| Train/eval split | 90/10 |
| Initial eval loss | 7.49 |
| Final eval loss | 1.12 |
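For readers reproducing the setup, the adapter hyperparameters above correspond roughly to this peft `LoraConfig` (a sketch; quantization, data handling, and the training loop are omitted):

```python
from peft import LoraConfig

# Mirrors the QLoRA adapter settings from the table above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```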
## Limitations

- Spanish-first: the model can answer in English, but those answers are not optimized.
- Knowledge cutoff: corpus indexed up to 2026-04-30.
- Synthetic-data origins: the model inherits stylistic patterns from Claude Sonnet 4.6.
- Not for autonomous decision-making: it is an assistant to a human analyst.
- Hallucination risk: always verify high-stakes claims against authoritative sources (BleepingComputer, Krebs, CCN-CERT, vendor advisories).
- IOC leakage in raw output: the model reproduces source IOCs in ~8% of answers. Production deployments MUST use the redaction filter above.
## License

Dual-licensed:

- Research / academic / personal use: free under CC-BY-NC-SA 4.0. Cite as below.
- Commercial use: requires a commercial license from Neural Ghost. Contact: hello@neural-ghost.com. Typically 500-2500 EUR/year.

Full license terms are in the LICENSE file.
## Citation

```bibtex
@misc{darkforensic-7b-2026,
  title={darkforensic-7b: A Spanish threat-intelligence dark-web assistant fine-tuned from Qwen2.5-7B-Instruct},
  author={Bobal, Jose and {Neural Ghost contributors}},
  year={2026},
  howpublished={\url{https://huggingface.co/jmpicon2026/darkforensic-7b}},
}
```
## Acknowledgements

- Qwen team: base model under Apache-2.0
- Anthropic: synthetic-data tooling (Sonnet 4.6)
- Axolotl: training pipeline
- llama.cpp: GGUF conversion
- bartowski: base model GGUF quants

Built with patience by Jose Bobal as part of Neural Ghost. European-hosted threat intel for organizations the American giants don't serve.