@@ -49,16 +49,17 @@ findings collected by the GhostNet Intelligence Platform. The model was
 trained on 9,376 synthetic question--answer pairs distilled from a
 state-of-the-art teacher (Anthropic Claude Sonnet 4.6) over 3,290 real
 findings spanning eleven threat categories. In a 15-question
-head-to-head evaluation judged by Claude Sonnet 4.6 on a 4-dimension
-rubric, v2 reaches \textbf{6.03/10 average} versus 7.12 for the teacher
-itself and 1.25 for Gemini 2.5 Flash --- i.e.\@
-\textbf{${\sim}\,85\,\%$ of the teacher's quality at roughly $1\,\%$ of
-its per-token cost}, with \textbf{$0\,\%$ critical-quality responses}
-versus $100\,\%$ for Gemini Flash. We release weights, LoRA adapter,
-GGUF quantization, the training pipeline and the evaluation harness
-under a dual research \,+\, commercial license. A v3 cycle is currently
-in training, targeting a doubling of the corpus and incorporation of
-analyst preference feedback via DPO.
 \smallskip
 \noindent\textbf{Keywords:} threat intelligence, dark web, LLM
@@ -115,10 +116,10 @@ turns indexed content into operator-ready answers.
   Filtering pipeline that explicitly removes verbatim PII while
   retaining the analyst's ability to talk about IOCs in the abstract.
 \item A head-to-head evaluation against the state-of-the-art teacher
-  (Claude Sonnet 4.6) and against a widely-used commercial small model
-  (Gemini 2.5 Flash), with the same rubric and the same prompts.
-  Honest results: v2 reaches $85\,\%$ of the teacher's quality and
-  outperforms Gemini Flash by a wide margin.
 \item An end-to-end open architecture (LLM $\to$ RAG $\to$ crawler
   $\to$ entity store) released under dual research/commercial license,
   so that a customer can reproduce the pipeline in their own
@@ -273,30 +274,37 @@ the answer is graded by Claude Sonnet 4.6 acting as judge on a
 Each axis is scored 1--10. An \emph{average $\le 3.0$} counts as a
 \emph{critical failure} (the answer is operationally useless or worse).
-The same protocol is applied to two reference models:
 \begin{itemize}
 \item \textbf{Claude Sonnet 4.6}, the teacher used to generate the
   training Q\&A --- i.e.\@ a strong upper bound.
-\item \textbf{Gemini 2.5 Flash}, a widely-used commercial small model
-  with similar latency/cost profile to ours.
 \end{itemize}
 \subsection{Results}
 \begin{center}
-\begin{tabular}{lrrr}
 \toprule
 \textbf{Dimension} & \textbf{Claude Sonnet 4.6}
-                   & \textbf{DarkForensic-7B v2}
-                   & \textbf{Gemini 2.5 Flash} \\
 \midrule
-Exactitud      & 6.73 & \textbf{5.67} & 1.43 \\
-Profundidad    & 6.93 & \textbf{5.13} & 1.00 \\
-Accionabilidad & 7.20 & \textbf{6.20} & 1.00 \\
-Claridad       & 7.60 & \textbf{7.13} & 1.57 \\
 \midrule
-\textbf{Average}      & \textbf{7.12} & \textbf{6.03} & \textbf{1.25} \\
-Critical fails ($\le3.0$) & 0/15 & \textbf{0/15} & 15/15 \\
 \bottomrule
 \end{tabular}
 \end{center}
@@ -307,15 +315,8 @@ Critical fails ($\le3.0$) & 0/15 & \textbf{0/15} & 15/15 \\
 DarkForensic-7B v2 reaches $85\,\%$ of the teacher's quality on
 average. The gap is concentrated in \emph{profundidad} --- analytical
 depth --- which is plausibly bounded by the model's parameter count.
-\emph{Claridad} is essentially tied ($7.13$ vs $7.60$).
-\textbf{vs Gemini 2.5 Flash (commercial peer).}
-DarkForensic outperforms by ${\approx}\,5{\times}$ in every axis.
-Gemini Flash produced a critical-failure response on $15/15$ prompts
---- typically a truncated answer, a generic non-answer, or an output
-unrelated to the finding context. We do not think this is malicious;
-we read it as confirmation that a generalist small model without
-domain tuning is structurally not the right tool for this workload.
 \textbf{vs base (Qwen2.5:3b).}
 On a separate, earlier evaluation against the 3B base, v2 improves the
@@ -371,7 +372,7 @@ no cherry-picking was performed.
   independent judge and report inter-judge agreement.
 \end{itemize}
-\subsection{What's in v3}
 v3 is in training as of this paper. Targets:
 \begin{enumerate}
 \item Corpus doubling to ${\approx}\,18{,}000$ Q\&A pairs.
@@ -381,6 +382,8 @@ v3 is in training as of this paper. Targets:
   inter-judge agreement reported.
 \item Targeted improvement on \emph{profundidad}.
 \item Expanded coverage of Arabic and Mandarin sources.
 \end{enumerate}
 Objective: close the gap with Claude Sonnet 4.6 from the current
 $15\,\%$ down to under $10\,\%$, while staying under \$100 per training
@@ -419,8 +422,7 @@ dark-web case in Spanish-speaking Europe. It runs locally, leaks
 nothing, cost \$66 to produce, and is grounded in real,
 audit-traceable findings rather than scraped clearnet text. It reaches
 $85\,\%$ of the quality of Claude Sonnet 4.6 on this task at roughly
-$1\,\%$ of the per-token cost, and outperforms Gemini 2.5 Flash by
-$5{\times}$.
 The next version (v3) is already in training. The aim is to halve the
 remaining gap with Sonnet while keeping the model deployable on

 trained on 9,376 synthetic question--answer pairs distilled from a
 state-of-the-art teacher (Anthropic Claude Sonnet 4.6) over 3,290 real
 findings spanning eleven threat categories. In a 15-question
+head-to-head evaluation against the teacher, judged by the same teacher
+on a 4-dimension rubric, v2 reaches \textbf{6.03/10 average} versus
+$7.12$ for the teacher --- i.e.\@ \textbf{${\sim}\,85\,\%$ of the
+teacher's quality at roughly $1\,\%$ of its per-token cost}, with
+\textbf{$0\,\%$ critical-failure responses} (average $\le 3.0$ on the
+1--10 scale). We release weights, LoRA adapter, GGUF quantization, the
+training pipeline and the evaluation harness under a dual research
+\,+\, commercial license. A v3 cycle is currently in training,
+targeting a doubling of the corpus, a wider eval set, a second
+independent judge to control for self-judging bias, and incorporation
+of analyst preference feedback via DPO.
 \smallskip
 \noindent\textbf{Keywords:} threat intelligence, dark web, LLM
   Filtering pipeline that explicitly removes verbatim PII while
   retaining the analyst's ability to talk about IOCs in the abstract.
 \item A head-to-head evaluation against the state-of-the-art teacher
+  (Claude Sonnet 4.6), with both candidate and teacher constrained to
+  the same generation budget. Honest results: v2 reaches $85\,\%$ of
+  the teacher's quality, with zero critical failures on the
+  held-out set.
 \item An end-to-end open architecture (LLM $\to$ RAG $\to$ crawler
   $\to$ entity store) released under dual research/commercial license,
   so that a customer can reproduce the pipeline in their own
 Each axis is scored 1--10. An \emph{average $\le 3.0$} counts as a
 \emph{critical failure} (the answer is operationally useless or worse).
+Both candidate and reference are constrained to the same generation
+budget (\texttt{max\_tokens=1500}) so the rubric scores the substance
+of the answer rather than its length. The reference model is:
 \begin{itemize}
 \item \textbf{Claude Sonnet 4.6}, the teacher used to generate the
   training Q\&A --- i.e.\@ a strong upper bound.
 \end{itemize}
+A first run of the evaluation also included Gemini 2.5 Flash as a
+commercial small-model peer. The Gemini client used in that first
+run was misconfigured: in roughly half of the prompts it returned
+empty or sub-150-character outputs. Rather than report a flattering
+but unfair comparison against a broken-client run, we exclude Gemini
+from the headline results of this paper. A correctly configured re-run
+will accompany v3 (Section~\ref{sec:v3}).
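 \noindent The critical-failure criterion above can be stated compactly.
 The notation is ours and is introduced only for illustration: let
 $s_{q,d}$ be the judge's score on the 1--10 scale for prompt $q$ on
 rubric dimension $d$, with $D=4$ dimensions and $Q=15$ prompts, and
 assume the per-answer average is the unweighted mean over dimensions.
 \[
   \bar{s}_q = \frac{1}{D}\sum_{d=1}^{D} s_{q,d},
   \qquad
   q \ \mathrm{is\ critical} \iff \bar{s}_q \le 3.0,
   \qquad
   \mathrm{critical\ rate} =
     \frac{1}{Q}\,\bigl|\{\, q : \bar{s}_q \le 3.0 \,\}\bigr| .
 \]
 Under this reading, a critical-fail count of $0/15$ means that each of
 the 15 answers averaged above $3.0$ across the four dimensions.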
 \subsection{Results}
 \begin{center}
+\begin{tabular}{lrr}
 \toprule
 \textbf{Dimension} & \textbf{Claude Sonnet 4.6}
+                   & \textbf{DarkForensic-7B v2} \\
 \midrule
+Exactitud      & 6.73 & \textbf{5.67} \\
+Profundidad    & 6.93 & \textbf{5.13} \\
+Accionabilidad & 7.20 & \textbf{6.20} \\
+Claridad       & 7.60 & \textbf{7.13} \\
 \midrule
+\textbf{Average}        & \textbf{7.12} & \textbf{6.03} \\
+Critical fails ($\le3.0$) & 0/15 & \textbf{0/15} \\
 \bottomrule
 \end{tabular}
 \end{center}
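 \noindent As a consistency check on the table above, and assuming the
 \emph{Average} row is the unweighted mean of the four dimension rows
 (an assumption; the aggregation is not spelled out here), the reported
 figures can be reproduced as
 \[
 \begin{array}{rcl}
 \bar{s}_{\mathrm{teacher}} &=& (6.73 + 6.93 + 7.20 + 7.60)/4 \;=\; 7.115 \;\approx\; 7.12, \\[2pt]
 \bar{s}_{\mathrm{v2}} &=& (5.67 + 5.13 + 6.20 + 7.13)/4 \;=\; 6.0325 \;\approx\; 6.03, \\[2pt]
 \bar{s}_{\mathrm{v2}} \,/\, \bar{s}_{\mathrm{teacher}} &\approx& 6.03 / 7.12 \;\approx\; 0.85 .
 \end{array}
 \]
 The ratio matches the $85\,\%$ headline and leaves a gap of about
 $15\,\%$. The per-dimension differences (teacher minus v2) are $1.06$
 for Exactitud, $1.80$ for Profundidad, $1.00$ for Accionabilidad and
 $0.47$ for Claridad, so Profundidad is the largest single contributor.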
 DarkForensic-7B v2 reaches $85\,\%$ of the teacher's quality on
 average. The gap is concentrated in \emph{profundidad} --- analytical
 depth --- which is plausibly bounded by the model's parameter count.
+\emph{Claridad} is essentially tied ($7.13$ vs $7.60$); on readability
+alone, a CISO would have little reason to prefer one answer over the other.
 \textbf{vs base (Qwen2.5:3b).}
 On a separate, earlier evaluation against the 3B base, v2 improves the
   independent judge and report inter-judge agreement.
 \end{itemize}
+\subsection{What's in v3}\label{sec:v3}
 v3 is in training as of this paper. Targets:
 \begin{enumerate}
 \item Corpus doubling to ${\approx}\,18{,}000$ Q\&A pairs.
   inter-judge agreement reported.
 \item Targeted improvement on \emph{profundidad}.
 \item Expanded coverage of Arabic and Mandarin sources.
+\item Re-run of the Gemini 2.5 Flash comparison with a
+  correctly configured client.
 \end{enumerate}
 Objective: close the gap with Claude Sonnet 4.6 from the current
 $15\,\%$ down to under $10\,\%$, while staying under \$100 per training
 nothing, cost \$66 to produce, and is grounded in real,
 audit-traceable findings rather than scraped clearnet text. It reaches
 $85\,\%$ of the quality of Claude Sonnet 4.6 on this task at roughly
+$1\,\%$ of the per-token cost.
 The next version (v3) is already in training. The aim is to halve the
 remaining gap with Sonnet while keeping the model deployable on