Commit 00fe38d (verified), committed by jmpicon2026 · 1 Parent(s): 1113d32

Update LaTeX source — same change as PDF

Files changed (1)
  1. paper/source/paper_darkforensic.tex +40 -38
paper/source/paper_darkforensic.tex CHANGED
@@ -49,16 +49,17 @@ findings collected by the GhostNet Intelligence Platform. The model was
  trained on 9,376 synthetic question--answer pairs distilled from a
  state-of-the-art teacher (Anthropic Claude Sonnet 4.6) over 3,290 real
  findings spanning eleven threat categories. In a 15-question
- head-to-head evaluation judged by Claude Sonnet 4.6 on a 4-dimension
- rubric, v2 reaches \textbf{6.03/10 average} versus 7.12 for the teacher
- itself and 1.25 for Gemini 2.5 Flash --- i.e.\@
- \textbf{${\sim}\,85\,\%$ of the teacher's quality at roughly $1\,\%$ of
- its per-token cost}, with \textbf{$0\,\%$ critical-quality responses}
- versus $100\,\%$ for Gemini Flash. We release weights, LoRA adapter,
- GGUF quantization, the training pipeline and the evaluation harness
- under a dual research \,+\, commercial license. A v3 cycle is currently
- in training, targeting a doubling of the corpus and incorporation of
- analyst preference feedback via DPO.

  \smallskip
  \noindent\textbf{Keywords:} threat intelligence, dark web, LLM
@@ -115,10 +116,10 @@ turns indexed content into operator-ready answers.
  Filtering pipeline that explicitly removes verbatim PII while
  retaining the analyst's ability to talk about IOCs in the abstract.
  \item A head-to-head evaluation against the state-of-the-art teacher
- (Claude Sonnet 4.6) and against a widely-used commercial small model
- (Gemini 2.5 Flash), with the same rubric and the same prompts.
- Honest results: v2 reaches $85\,\%$ of the teacher's quality and
- outperforms Gemini Flash by a wide margin.
  \item An end-to-end open architecture (LLM $\to$ RAG $\to$ crawler
  $\to$ entity store) released under dual research/commercial license,
  so that a customer can reproduce the pipeline in their own
@@ -273,30 +274,37 @@ the answer is graded by Claude Sonnet 4.6 acting as judge on a
  Each axis is scored 1--10. An \emph{average $\le 3.0$} counts as a
  \emph{critical failure} (the answer is operationally useless or worse).

- The same protocol is applied to two reference models:
  \begin{itemize}
  \item \textbf{Claude Sonnet 4.6}, the teacher used to generate the
  training Q\&A --- i.e.\@ a strong upper bound.
- \item \textbf{Gemini 2.5 Flash}, a widely-used commercial small model
- with similar latency/cost profile to ours.
  \end{itemize}

  \subsection{Results}

  \begin{center}
- \begin{tabular}{lrrr}
  \toprule
  \textbf{Dimension} & \textbf{Claude Sonnet 4.6}
- & \textbf{DarkForensic-7B v2}
- & \textbf{Gemini 2.5 Flash} \\
  \midrule
- Exactitud & 6.73 & \textbf{5.67} & 1.43 \\
- Profundidad & 6.93 & \textbf{5.13} & 1.00 \\
- Accionabilidad & 7.20 & \textbf{6.20} & 1.00 \\
- Claridad & 7.60 & \textbf{7.13} & 1.57 \\
  \midrule
- \textbf{Average} & \textbf{7.12} & \textbf{6.03} & \textbf{1.25} \\
- Critical fails ($\le3.0$) & 0/15 & \textbf{0/15} & 15/15 \\
  \bottomrule
  \end{tabular}
  \end{center}
@@ -307,15 +315,8 @@ Critical fails ($\le3.0$) & 0/15 & \textbf{0/15} & 15/15 \\
  DarkForensic-7B v2 reaches $85\,\%$ of the teacher's quality on
  average. The gap is concentrated in \emph{profundidad} --- analytical
  depth --- which is plausibly bounded by the model's parameter count.
- \emph{Claridad} is essentially tied ($7.13$ vs $7.60$).
-
- \textbf{vs Gemini 2.5 Flash (commercial peer).}
- DarkForensic outperforms by ${\approx}\,5{\times}$ in every axis.
- Gemini Flash produced a critical-failure response on $15/15$ prompts
- --- typically a truncated answer, a generic non-answer, or an output
- unrelated to the finding context. We do not think this is malicious;
- we read it as confirmation that a generalist small model without
- domain tuning is structurally not the right tool for this workload.

  \textbf{vs base (Qwen2.5:3b).}
  On a separate, earlier evaluation against the 3B base, v2 improves the
@@ -371,7 +372,7 @@ no cherry-picking was performed.
  independent judge and report inter-judge agreement.
  \end{itemize}

- \subsection{What's in v3}
  v3 is in training as of this paper. Targets:
  \begin{enumerate}
  \item Corpus doubling to ${\approx}\,18{,}000$ Q\&A pairs.
@@ -381,6 +382,8 @@ v3 is in training as of this paper. Targets:
  inter-judge agreement reported.
  \item Targeted improvement on \emph{profundidad}.
  \item Expanded coverage of Arabic and Mandarin sources.
  \end{enumerate}
  Objective: close the gap with Claude Sonnet 4.6 from the current
  $15\,\%$ down to under $10\,\%$, while staying under \$100 per training
@@ -419,8 +422,7 @@ dark-web case in Spanish-speaking Europe. It runs locally, leaks
  nothing, cost \$66 to produce, and is grounded in real,
  audit-traceable findings rather than scraped clearnet text. It reaches
  $85\,\%$ of the quality of Claude Sonnet 4.6 on this task at roughly
- $1\,\%$ of the per-token cost, and outperforms Gemini 2.5 Flash by
- $5{\times}$.

  The next version (v3) is already in training. The aim is to halve the
  remaining gap with Sonnet while keeping the model deployable on
 
  trained on 9,376 synthetic question--answer pairs distilled from a
  state-of-the-art teacher (Anthropic Claude Sonnet 4.6) over 3,290 real
  findings spanning eleven threat categories. In a 15-question
+ head-to-head evaluation against the teacher, judged by the same teacher
+ on a 4-dimension rubric, v2 reaches \textbf{6.03/10 average} versus
+ $7.12$ for the teacher --- i.e.\@ \textbf{${\sim}\,85\,\%$ of the
+ teacher's quality at roughly $1\,\%$ of its per-token cost}, with
+ \textbf{$0\,\%$ critical-quality responses} (avg $\le 3.0$ on a 1--10
+ scale). We release weights, LoRA adapter, GGUF quantization, the
+ training pipeline and the evaluation harness under a dual research
+ \,+\, commercial license. A v3 cycle is currently in training,
+ targeting a doubling of the corpus, a wider eval set, a second
+ independent judge to control for self-judging bias, and incorporation
+ of analyst preference feedback via DPO.

  \smallskip
  \noindent\textbf{Keywords:} threat intelligence, dark web, LLM
 
  Filtering pipeline that explicitly removes verbatim PII while
  retaining the analyst's ability to talk about IOCs in the abstract.
  \item A head-to-head evaluation against the state-of-the-art teacher
+ (Claude Sonnet 4.6), with both candidate and teacher constrained to
+ the same generation budget. Honest results: v2 reaches $85\,\%$ of
+ the teacher's quality, with zero critical-quality failures on the
+ held-out set.
  \item An end-to-end open architecture (LLM $\to$ RAG $\to$ crawler
  $\to$ entity store) released under dual research/commercial license,
  so that a customer can reproduce the pipeline in their own
 
  Each axis is scored 1--10. An \emph{average $\le 3.0$} counts as a
  \emph{critical failure} (the answer is operationally useless or worse).

+ Both candidate and reference are constrained to the same generation
+ budget (\texttt{max\_tokens=1500}) so the rubric scores the substance
+ of the answer rather than its length. The reference model is:
  \begin{itemize}
  \item \textbf{Claude Sonnet 4.6}, the teacher used to generate the
  training Q\&A --- i.e.\@ a strong upper bound.
  \end{itemize}

+ A first run of the evaluation also included Gemini 2.5 Flash as a
+ commercial small-model peer. The Gemini client used in that first
+ run was misconfigured: in roughly half of the prompts it returned
+ empty or sub-150-character outputs. Rather than report a flattering
+ but unfair comparison against a broken-client run, we exclude Gemini
+ from the headline results of this paper. A correctly-configured re-run
+ will accompany v3 (Section~\ref{sec:v3}).
+
  \subsection{Results}

  \begin{center}
+ \begin{tabular}{lrr}
  \toprule
  \textbf{Dimension} & \textbf{Claude Sonnet 4.6}
+ & \textbf{DarkForensic-7B v2} \\
  \midrule
+ Exactitud & 6.73 & \textbf{5.67} \\
+ Profundidad & 6.93 & \textbf{5.13} \\
+ Accionabilidad & 7.20 & \textbf{6.20} \\
+ Claridad & 7.60 & \textbf{7.13} \\
  \midrule
+ \textbf{Average} & \textbf{7.12} & \textbf{6.03} \\
+ Critical fails ($\le3.0$) & 0/15 & \textbf{0/15} \\
  \bottomrule
  \end{tabular}
  \end{center}
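
Reading aid, not part of the committed source: the headline figures in the table above can be checked by hand. The sketch below assumes that the Average row is the unweighted mean of the four per-axis scores and that the "$85\,\%$ of the teacher's quality" figure is the ratio of the two averages; the diff does not state either definition explicitly. The symbol $\bar{s}$ is introduced here for illustration only, and the block needs \texttt{amsmath} for \texttt{align*} and \texttt{\textbackslash text}.

```latex
% Assumption: "Average" = unweighted mean of the four rubric axes
% (Exactitud, Profundidad, Accionabilidad, Claridad), and the quality
% ratio is the quotient of the two averages. Requires amsmath.
\begin{align*}
\bar{s}_{\text{teacher}} &= \tfrac{1}{4}\,(6.73 + 6.93 + 7.20 + 7.60)
                          = \tfrac{28.46}{4} \approx 7.12, \\
\bar{s}_{\text{v2}}      &= \tfrac{1}{4}\,(5.67 + 5.13 + 6.20 + 7.13)
                          = \tfrac{24.13}{4} \approx 6.03, \\
\frac{\bar{s}_{\text{v2}}}{\bar{s}_{\text{teacher}}}
                         &= \frac{6.03}{7.12} \approx 0.847 \approx 85\,\%.
\end{align*}
```

Under the same assumption, the per-axis differences between teacher and v2 are $1.06$, $1.80$, $1.00$ and $0.47$, so the largest gap falls on Profundidad and the smallest on Claridad. Both averages sit well above the $3.0$ critical-failure threshold, although the $0/15$ count itself is a per-question statistic that cannot be re-derived from the averaged table.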
 
  DarkForensic-7B v2 reaches $85\,\%$ of the teacher's quality on
  average. The gap is concentrated in \emph{profundidad} --- analytical
  depth --- which is plausibly bounded by the model's parameter count.
+ \emph{Claridad} is essentially tied ($7.13$ vs $7.60$); a CISO would
+ not preferentially read one answer over the other.

  \textbf{vs base (Qwen2.5:3b).}
  On a separate, earlier evaluation against the 3B base, v2 improves the
 
  independent judge and report inter-judge agreement.
  \end{itemize}

+ \subsection{What's in v3}\label{sec:v3}
  v3 is in training as of this paper. Targets:
  \begin{enumerate}
  \item Corpus doubling to ${\approx}\,18{,}000$ Q\&A pairs.
 
  inter-judge agreement reported.
  \item Targeted improvement on \emph{profundidad}.
  \item Expanded coverage of Arabic and Mandarin sources.
+ \item Re-run of the Gemini 2.5 Flash comparison with a
+ correctly-configured client.
  \end{enumerate}
  Objective: close the gap with Claude Sonnet 4.6 from the current
  $15\,\%$ down to under $10\,\%$, while staying under \$100 per training
 
  nothing, cost \$66 to produce, and is grounded in real,
  audit-traceable findings rather than scraped clearnet text. It reaches
  $85\,\%$ of the quality of Claude Sonnet 4.6 on this task at roughly
+ $1\,\%$ of the per-token cost.

  The next version (v3) is already in training. The aim is to halve the
  remaining gap with Sonnet while keeping the model deployable on