Update LaTeX source — same change as PDF
paper/source/paper_darkforensic.tex
CHANGED
|
@@ -49,16 +49,17 @@ findings collected by the GhostNet Intelligence Platform. The model was
|
|
| 49 |
trained on 9,376 synthetic question--answer pairs distilled from a
|
| 50 |
state-of-the-art teacher (Anthropic Claude Sonnet 4.6) over 3,290 real
|
| 51 |
findings spanning eleven threat categories. In a 15-question
|
| 52 |
-
head-to-head evaluation
|
| 53 |
-
rubric, v2 reaches \textbf{6.03/10 average} versus
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
|
| 63 |
\smallskip
|
| 64 |
\noindent\textbf{Keywords:} threat intelligence, dark web, LLM
|
|
@@ -115,10 +116,10 @@ turns indexed content into operator-ready answers.
|
|
| 115 |
Filtering pipeline that explicitly removes verbatim PII while
|
| 116 |
retaining the analyst's ability to talk about IOCs in the abstract.
|
| 117 |
\item A head-to-head evaluation against the state-of-the-art teacher
|
| 118 |
-
(Claude Sonnet 4.6)
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
\item An end-to-end open architecture (LLM $\to$ RAG $\to$ crawler
|
| 123 |
$\to$ entity store) released under dual research/commercial license,
|
| 124 |
so that a customer can reproduce the pipeline in their own
|
|
@@ -273,30 +274,37 @@ the answer is graded by Claude Sonnet 4.6 acting as judge on a
|
|
| 273 |
Each axis is scored 1--10. An \emph{average $\le 3.0$} counts as a
|
| 274 |
\emph{critical failure} (the answer is operationally useless or worse).
|
| 275 |
|
| 276 |
-
|
| 277 |
\begin{itemize}
|
| 278 |
\item \textbf{Claude Sonnet 4.6}, the teacher used to generate the
|
| 279 |
training Q\&A --- i.e.\@ a strong upper bound.
|
| 280 |
-
\item \textbf{Gemini 2.5 Flash}, a widely-used commercial small model
|
| 281 |
-
with similar latency/cost profile to ours.
|
| 282 |
\end{itemize}
|
| 283 |
|
| 284 |
\subsection{Results}
|
| 285 |
|
| 286 |
\begin{center}
|
| 287 |
-
\begin{tabular}{
|
| 288 |
\toprule
|
| 289 |
\textbf{Dimension} & \textbf{Claude Sonnet 4.6}
|
| 290 |
-
& \textbf{DarkForensic-7B v2}
|
| 291 |
-
& \textbf{Gemini 2.5 Flash} \\
|
| 292 |
\midrule
|
| 293 |
-
Exactitud & 6.73 & \textbf{5.67}
|
| 294 |
-
Profundidad & 6.93 & \textbf{5.13}
|
| 295 |
-
Accionabilidad & 7.20 & \textbf{6.20}
|
| 296 |
-
Claridad & 7.60 & \textbf{7.13}
|
| 297 |
\midrule
|
| 298 |
-
\textbf{Average}
|
| 299 |
-
Critical fails ($\le3.0$) & 0/15 & \textbf{0/15}
|
| 300 |
\bottomrule
|
| 301 |
\end{tabular}
|
| 302 |
\end{center}
|
|
@@ -307,15 +315,8 @@ Critical fails ($\le3.0$) & 0/15 & \textbf{0/15} & 15/15 \\
|
|
| 307 |
DarkForensic-7B v2 reaches $85\,\%$ of the teacher's quality on
|
| 308 |
average. The gap is concentrated in \emph{profundidad} --- analytical
|
| 309 |
depth --- which is plausibly bounded by the model's parameter count.
|
| 310 |
-
\emph{Claridad} is essentially tied ($7.13$ vs $7.60$)
|
| 311 |
-
|
| 312 |
-
\textbf{vs Gemini 2.5 Flash (commercial peer).}
|
| 313 |
-
DarkForensic outperforms by ${\approx}\,5{\times}$ in every axis.
|
| 314 |
-
Gemini Flash produced a critical-failure response on $15/15$ prompts
|
| 315 |
-
--- typically a truncated answer, a generic non-answer, or an output
|
| 316 |
-
unrelated to the finding context. We do not think this is malicious;
|
| 317 |
-
we read it as confirmation that a generalist small model without
|
| 318 |
-
domain tuning is structurally not the right tool for this workload.
|
| 319 |
|
| 320 |
\textbf{vs base (Qwen2.5:3b).}
|
| 321 |
On a separate, earlier evaluation against the 3B base, v2 improves the
|
|
@@ -371,7 +372,7 @@ no cherry-picking was performed.
|
|
| 371 |
independent judge and report inter-judge agreement.
|
| 372 |
\end{itemize}
|
| 373 |
|
| 374 |
-
\subsection{What's in v3}
|
| 375 |
v3 is in training as of this paper. Targets:
|
| 376 |
\begin{enumerate}
|
| 377 |
\item Corpus doubling to ${\approx}\,18{,}000$ Q\&A pairs.
|
|
@@ -381,6 +382,8 @@ v3 is in training as of this paper. Targets:
|
|
| 381 |
inter-judge agreement reported.
|
| 382 |
\item Targeted improvement on \emph{profundidad}.
|
| 383 |
\item Expanded coverage of Arabic and Mandarin sources.
|
| 384 |
\end{enumerate}
|
| 385 |
Objective: close the gap with Claude Sonnet 4.6 from the current
|
| 386 |
$15\,\%$ down to under $10\,\%$, while staying under \$100 per training
|
|
@@ -419,8 +422,7 @@ dark-web case in Spanish-speaking Europe. It runs locally, leaks
|
|
| 419 |
nothing, cost \$66 to produce, and is grounded in real,
|
| 420 |
audit-traceable findings rather than scraped clearnet text. It reaches
|
| 421 |
$85\,\%$ of the quality of Claude Sonnet 4.6 on this task at roughly
|
| 422 |
-
$1\,\%$ of the per-token cost
|
| 423 |
-
$5{\times}$.
|
| 424 |
|
| 425 |
The next version (v3) is already in training. The aim is to halve the
|
| 426 |
remaining gap with Sonnet while keeping the model deployable on
|
|
|
|
| 49 |
trained on 9,376 synthetic question--answer pairs distilled from a
|
| 50 |
state-of-the-art teacher (Anthropic Claude Sonnet 4.6) over 3,290 real
|
| 51 |
findings spanning eleven threat categories. In a 15-question
|
| 52 |
+
head-to-head evaluation against the teacher, judged by the same teacher
|
| 53 |
+
on a 4-dimension rubric, v2 reaches \textbf{6.03/10 average} versus
|
| 54 |
+
$7.12$ for the teacher --- i.e.\@ \textbf{${\sim}\,85\,\%$ of the
|
| 55 |
+
teacher's quality at roughly $1\,\%$ of its per-token cost}, with
|
| 56 |
+
\textbf{$0\,\%$ critical-failure responses} (avg $\le 3.0$ on a 1--10
|
| 57 |
+
scale). We release weights, LoRA adapter, GGUF quantization, the
|
| 58 |
+
training pipeline and the evaluation harness under a dual research
|
| 59 |
+
\,+\, commercial license. A v3 cycle is currently in training,
|
| 60 |
+
targeting a doubling of the corpus, a wider eval set, a second
|
| 61 |
+
independent judge to control for self-judging bias, and incorporation
|
| 62 |
+
of analyst preference feedback via DPO.
|
| 63 |
|
| 64 |
\smallskip
|
| 65 |
\noindent\textbf{Keywords:} threat intelligence, dark web, LLM
|
|
|
|
| 116 |
Filtering pipeline that explicitly removes verbatim PII while
|
| 117 |
retaining the analyst's ability to talk about IOCs in the abstract.
|
| 118 |
\item A head-to-head evaluation against the state-of-the-art teacher
|
| 119 |
+
(Claude Sonnet 4.6), with both candidate and teacher constrained to
|
| 120 |
+
the same generation budget. Result: v2 reaches $85\,\%$ of
|
| 121 |
+
the teacher's quality, with zero critical failures on the
|
| 122 |
+
held-out set.
|
| 123 |
\item An end-to-end open architecture (LLM $\to$ RAG $\to$ crawler
|
| 124 |
$\to$ entity store) released under dual research/commercial license,
|
| 125 |
so that a customer can reproduce the pipeline in their own
|
|
|
|
| 274 |
Each axis is scored 1--10. An \emph{average $\le 3.0$} counts as a
|
| 275 |
\emph{critical failure} (the answer is operationally useless or worse).
|
| 276 |
|
| 277 |
+
Both candidate and reference are constrained to the same generation
|
| 278 |
+
budget (\texttt{max\_tokens=1500}) so the rubric scores the substance
|
| 279 |
+
of the answer rather than its length. The reference model is:
|
| 280 |
\begin{itemize}
|
| 281 |
\item \textbf{Claude Sonnet 4.6}, the teacher used to generate the
|
| 282 |
training Q\&A --- i.e.\@ a strong upper bound.
|
| 283 |
\end{itemize}
|
| 284 |
|
| 285 |
+
A first run of the evaluation also included Gemini 2.5 Flash as a
|
| 286 |
+
commercial small-model peer. The Gemini client used in that first
|
| 287 |
+
run was misconfigured: in roughly half of the prompts it returned
|
| 288 |
+
empty or sub-150-character outputs. Rather than report a flattering
|
| 289 |
+
but unfair comparison against a broken-client run, we exclude Gemini
|
| 290 |
+
from the headline results of this paper. A correctly configured re-run
|
| 291 |
+
will accompany v3 (Section~\ref{sec:v3}).
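
The failure mode described above (empty or sub-150-character outputs) can be caught before any answer reaches the judge. The following is a minimal sketch of such a pre-scoring check, not the evaluation harness shipped with the paper. The function names and the `max_rate` tolerance are hypothetical; only the 150-character threshold comes from the text.

```python
from typing import Optional, Sequence

MIN_CHARS = 150  # from the "sub-150-character" description in the text


def is_degenerate(response: Optional[str], min_chars: int = MIN_CHARS) -> bool:
    """Return True when a response is missing, empty, or shorter than min_chars."""
    return response is None or len(response.strip()) < min_chars


def degenerate_rate(responses: Sequence[Optional[str]]) -> float:
    """Fraction of responses that are missing, empty, or too short."""
    if not responses:
        raise ValueError("no responses to check")
    return sum(is_degenerate(r) for r in responses) / len(responses)


def client_looks_broken(
    responses: Sequence[Optional[str]], max_rate: float = 0.2
) -> bool:
    """Flag a run whose degenerate rate exceeds max_rate (hypothetical tolerance)."""
    return degenerate_rate(responses) > max_rate
```

Running a check of this kind on each candidate's raw outputs, and refusing to score a run that trips it, separates client or configuration faults from genuine model quality.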
|
| 292 |
+
|
| 293 |
\subsection{Results}
|
| 294 |
|
| 295 |
\begin{center}
|
| 296 |
+
\begin{tabular}{lrr}
|
| 297 |
\toprule
|
| 298 |
\textbf{Dimension} & \textbf{Claude Sonnet 4.6}
|
| 299 |
+
& \textbf{DarkForensic-7B v2} \\
|
| 300 |
\midrule
|
| 301 |
+
Exactitud & 6.73 & \textbf{5.67} \\
|
| 302 |
+
Profundidad & 6.93 & \textbf{5.13} \\
|
| 303 |
+
Accionabilidad & 7.20 & \textbf{6.20} \\
|
| 304 |
+
Claridad & 7.60 & \textbf{7.13} \\
|
| 305 |
\midrule
|
| 306 |
+
\textbf{Average} & \textbf{7.12} & \textbf{6.03} \\
|
| 307 |
+
Critical fails ($\le3.0$) & 0/15 & \textbf{0/15} \\
|
| 308 |
\bottomrule
|
| 309 |
\end{tabular}
|
| 310 |
\end{center}
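
The headline figures can be recomputed from the per-dimension means in the table. The sketch below does only that arithmetic: it assumes the overall average is the unweighted mean of the four dimensions, which is consistent with the reported 7.12 and 6.03. The helper names are hypothetical, and per-question scores are not available here, so the critical-failure check is shown as a function only.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-dimension means copied from the results table: (teacher, v2).
TABLE = {
    "Exactitud": (Decimal("6.73"), Decimal("5.67")),
    "Profundidad": (Decimal("6.93"), Decimal("5.13")),
    "Accionabilidad": (Decimal("7.20"), Decimal("6.20")),
    "Claridad": (Decimal("7.60"), Decimal("7.13")),
}

CRITICAL_THRESHOLD = 3.0  # rubric: an average <= 3.0 is a critical failure


def column_mean(column: int) -> Decimal:
    """Unweighted mean of one table column (0 = teacher, 1 = v2)."""
    values = [row[column] for row in TABLE.values()]
    return sum(values) / len(values)


def round_to(value: Decimal, step: str) -> Decimal:
    """Round half up to the precision given by step, e.g. '0.01'."""
    return value.quantize(Decimal(step), rounding=ROUND_HALF_UP)


def is_critical_failure(axis_scores) -> bool:
    """Apply the rubric rule to the four axis scores of a single answer."""
    return sum(axis_scores) / len(axis_scores) <= CRITICAL_THRESHOLD


teacher_avg = round_to(column_mean(0), "0.01")
v2_avg = round_to(column_mean(1), "0.01")
ratio_pct = round_to(column_mean(1) / column_mean(0) * 100, "1")
widest_gap = max(TABLE, key=lambda name: TABLE[name][0] - TABLE[name][1])

print(teacher_avg, v2_avg, ratio_pct, widest_gap)
```

Decimal arithmetic is used so that the teacher mean of exactly 7.115 rounds half up to the reported 7.12 instead of depending on binary floating-point representation.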
|
|
|
|
| 315 |
DarkForensic-7B v2 reaches $85\,\%$ of the teacher's quality on
|
| 316 |
average. The gap is concentrated in \emph{profundidad} --- analytical
|
| 317 |
depth --- which is plausibly bounded by the model's parameter count.
|
| 318 |
+
\emph{Claridad} is essentially tied ($7.13$ vs $7.60$); a CISO would
|
| 319 |
+
not preferentially read one answer over the other.
|
| 320 |
|
| 321 |
\textbf{vs base (Qwen2.5:3b).}
|
| 322 |
On a separate, earlier evaluation against the 3B base, v2 improves the
|
|
|
|
| 372 |
independent judge and report inter-judge agreement.
|
| 373 |
\end{itemize}
|
| 374 |
|
| 375 |
+
\subsection{What's in v3}\label{sec:v3}
|
| 376 |
v3 is in training as of this paper. Targets:
|
| 377 |
\begin{enumerate}
|
| 378 |
\item Corpus doubling to ${\approx}\,18{,}000$ Q\&A pairs.
|
|
|
|
| 382 |
inter-judge agreement reported.
|
| 383 |
\item Targeted improvement on \emph{profundidad}.
|
| 384 |
\item Expanded coverage of Arabic and Mandarin sources.
|
| 385 |
+
\item Re-run of the Gemini 2.5 Flash comparison with a
|
| 386 |
+
correctly configured client.
|
| 387 |
\end{enumerate}
|
| 388 |
Objective: close the gap with Claude Sonnet 4.6 from the current
|
| 389 |
$15\,\%$ down to under $10\,\%$, while staying under \$100 per training
|
|
|
|
| 422 |
nothing, cost \$66 to produce, and is grounded in real,
|
| 423 |
audit-traceable findings rather than scraped clearnet text. It reaches
|
| 424 |
$85\,\%$ of the quality of Claude Sonnet 4.6 on this task at roughly
|
| 425 |
+
$1\,\%$ of the per-token cost.
|
|
|
|
| 426 |
|
| 427 |
The next version (v3) is already in training. The aim is to halve the
|
| 428 |
remaining gap with Sonnet while keeping the model deployable on
|