idobrovolskyi committed
Commit 4e83dd4 · verified · 1 Parent(s): 577c994

sync README with paper-final numbers

Files changed (1)
  1. README.md +62 -87
README.md CHANGED
@@ -12,104 +12,79 @@ base_model: Qwen/Qwen3.5-27B
 
  # TorchSight Beam q4_K_M
 
- Cybersecurity document classifier. LoRA fine-tune of **Qwen 3.5 27B**, quantized to q4_K_M. ~17 GB GGUF.
-
- Recommended hardware: 32 GB.
-
- ## Benchmark Results
-
- Two benchmarks evaluated under identical methodology
- (alpaca prompt, Ollama `/api/generate`, Modelfile temperature 0.1,
- `num_predict=2048`):
-
- ### Primary: eval-1000-synthetic (1000 stratified samples)
-
- | Model | Category Acc 95% CI | Subcategory Acc | Type |
- |---|---|---|---|
- | **Beam q4_K_M** | **95.1%** [93.8, 96.4] | 48.5% | Local (LoRA) |
- | Beam f16 | 93.0% [91.2, 94.5] | 51.3% | Local (LoRA) |
- | Beam q8_0 | 92.7% [90.9, 94.2] | 51.3% | Local (LoRA) |
- | Claude Sonnet 4 | 79.9% | 23.0% | Commercial API |
- | Claude Opus 4 | 79.9% | 22.5% | Commercial API |
- | GPT-5 | 76.9% | 11.6% | Commercial API |
- | Gemini 2.5 Pro | 75.4% | 21.0% | Commercial API |
- | Regex baseline (49 patterns) | 52.7% | – | Rule-based |
- | Qwen 3.5 27B base (no LoRA) | 43.3% | 4.3% | Local |
-
- ### External: eval-500-external (500 held-out samples from real public datasets)
-
- Held-out splits of training sources (NVD, NIST, AI4Privacy, Enron, phishing) plus
- MTSamples (medical transcriptions explicitly **excluded** from training).
-
- | Model | Category Acc 95% CI | Subcategory Acc | Δ vs. primary |
- |---|---|---|---|
- | **Beam q4_K_M** | **93.8%** [91.3, 95.6] | 51.4% | −1.3 pp |
- | Beam q8_0 | 91.2% [88.4, 93.4] | 46.4% | −1.5 pp |
- | Beam f16 | 91.0% [88.2, 93.2] | 47.2% | −2.0 pp |
- | Claude Sonnet 4 | 86.4% | – | +6.5 pp |
- | Gemini 2.5 Pro | 82.0% | – | +6.6 pp |
- | GPT-5 | 65.8% | – | −11.1 pp |
- | Regex baseline | 29.6% | – | −23.1 pp |
- | Qwen 3.5 27B base | 28.0% | 0% | −15.3 pp |
-
- Beam q4_K_M's gap over Claude Sonnet 4 is statistically significant
- (McNemar's χ²₁ = 126.7, p ≈ 2 × 10⁻²⁹), as is the gap over the
- unfine-tuned Qwen base (χ²₁ = 489.5, p ≈ 2 × 10⁻¹⁰⁸; fine-tuning
- contributes +65.8 pp on external data with the identical prompt).
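
As a sanity check on the p-values quoted in the removed paragraph above, the upper-tail probability of a χ² statistic with one degree of freedom can be computed with the Python standard library alone. This is a minimal sketch, not the paper's analysis code: it assumes the uncorrected McNemar statistic (b − c)² / (b + c) over discordant pairs (the diff does not say whether a continuity correction was applied), and the function names and the discordant counts in the last line are hypothetical.

```python
import math


def chi2_1df_pvalue(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 d.o.f., P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


def mcnemar_statistic(b: int, c: int) -> float:
    """Uncorrected McNemar statistic from the two discordant-pair counts.

    b: samples model A classified correctly and model B did not.
    c: samples model B classified correctly and model A did not.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    return (b - c) ** 2 / (b + c)


if __name__ == "__main__":
    # Statistics quoted in the README text.
    for stat in (126.7, 489.5):
        print(f"chi2 = {stat}: p = {chi2_1df_pvalue(stat):.1e}")
    # Hypothetical discordant counts, for illustration only.
    print(mcnemar_statistic(b=60, c=20))
```

Both quoted statistics land at roughly 2 × 10⁻²⁹ and 2 × 10⁻¹⁰⁸ under this formula, consistent with the values stated in the text.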
 
 
 
 
 
 
  ## Usage with Ollama
 
  ```bash
- # Pull from Ollama Hub
- ollama pull torchsight/beam:q4_K_M
-
- # Or build locally from this GGUF + Modelfile
- ollama create torchsight/beam:q4_K_M -f Modelfile
  ```
 
- Modelfile:
-
- ```
- FROM ./beam-1.0-q4_K_M.gguf
- SYSTEM "You are TorchSight, a cybersecurity document classifier. Analyze the provided text and identify ALL security-relevant findings.
-
- For each finding, output a JSON object with:
- - category: one of [pii, credentials, financial, medical, confidential, malicious, safe]
- - subcategory: specific type (e.g., pii.identity, malicious.injection, credentials.api_key)
- - severity: one of [critical, high, medium, low, info]
- - explanation: detailed explanation including specific values found.
-
- If a document contains multiple types of sensitive data, return a finding for EACH one.
- If the text is clean/safe, output a single finding with category \"safe\".
-
- Respond ONLY with a JSON array of findings."
- PARAMETER temperature 0.1
- PARAMETER top_p 0.9
- PARAMETER num_predict 2048
- ```
-
- ## Reproducibility
-
- Eval scripts and benchmark data: <https://github.com/torchsight/torchsight/tree/main/beam/evaluation>
 
  ```bash
- git clone https://github.com/torchsight/torchsight
- cd torchsight/beam/evaluation
- BEAM_MODEL=torchsight/beam:q4_K_M python scripts/eval_beam.py # primary
- BEAM_MODEL=torchsight/beam:q4_K_M python scripts/eval_external.py # external
  ```
 
- ## Citation
 
- ```bibtex
- @misc{torchsight-beam-q4_K_M-2026,
- title = {TorchSight Beam q4_K_M: cybersecurity document classifier},
- author = {Dobrovolskyi, Ivan},
- year = {2026},
- url = {https://huggingface.co/torchsight/beam-q4_K_M},
- }
- ```
 
  ## License
 
- Apache 2.0
 
 
 
  # TorchSight Beam q4_K_M
 
+ Cybersecurity document classifier. LoRA fine-tune of **Qwen 3.5 27B**,
+ quantized to q4_K_M. Approximately 17 GB GGUF.
+
+ Recommended hardware: 32 GB unified memory (e.g. M-series Mac) or 24 GB GPU.
+
+ This is the **default** quantization for the TorchSight system,
+ released alongside:
+
+ > Dobrovolskyi, I. *Security Document Classification with a Fine-Tuned Local
+ > Large Language Model: Benchmark Data and an Open-Source System.* Journal of
+ > Information Security and Applications, 2026.
+
+ ## Benchmark results
+
+ Evaluated under identical methodology (alpaca prompt, Ollama `/api/generate`,
+ temperature = 0, `num_predict = 2048`) on the companion dataset
+ [`torchsight/cybersecurity-classification-benchmark`](https://huggingface.co/datasets/torchsight/cybersecurity-classification-benchmark).
+ Canonical numbers live in that repo's `BENCHMARK_NUMBERS.md`.
+
+ ### Primary: eval-1000-synthetic (n = 1,000)
+
+ | Model | Type | Cat. acc [95% CI] | Subcat. acc |
+ |---|---|---:|---:|
+ | **Beam q4_K_M** | Local (LoRA) | **95.0%** [93.5, 96.2] | 48.2% |
+ | Beam f16 | Local (LoRA) | 93.2% [91.5, 94.6] | 51.1% |
+ | Beam q8_0 | Local (LoRA) | 93.0% [91.2, 94.4] | 51.4% |
+ | Claude Sonnet 4 | Commercial API | 79.9% [77.3, 82.3] | 23.0% |
+ | Claude Opus 4 | Commercial API | 79.9% [77.3, 82.3] | 22.5% |
+ | GPT-5 | Commercial API | 76.9% [74.2, 79.4] | 11.6% |
+ | Gemini 2.5 Pro | Commercial API | 75.4% [72.6, 78.0] | 21.0% |
+ | Qwen 3.5 27B base | Local (no LoRA) | 86.3% [84.0, 88.3] | 19.0% |
+ | Regex (48 patterns) | Rule-based | 52.7% [49.6, 55.8] | – |
+
+ 95% confidence intervals are Wilson-score. Beam q4_K_M's advantage over every
+ commercial baseline is significant under pairwise McNemar's tests after
+ Bonferroni correction (α = 0.05).
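
The Wilson-score intervals mentioned above can be recomputed from a correct-prediction count and a sample size. The following is a minimal sketch rather than the project's evaluation code: the function name is hypothetical, and the counts 950/1,000 and 469/500 are an assumption derived from the reported accuracies of 95.0% and 93.8%.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion (default 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p_hat + z2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Assumed counts: 95.0% of 1,000 and 93.8% of 500 correct predictions.
    for correct, total in ((950, 1000), (469, 500)):
        lo, hi = wilson_interval(correct, total)
        print(f"{correct}/{total}: [{100 * lo:.1f}, {100 * hi:.1f}]")
```

Under that assumption the two cases reproduce the intervals reported for Beam q4_K_M, [93.5, 96.2] on the primary set and [91.3, 95.6] on the external set.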
+
+ ### External: eval-500-external (n = 500)
+
+ | Model | Cat. acc [95% CI] | Δ vs. primary |
+ |---|---:|---:|
+ | **Beam q4_K_M** | **93.8%** [91.3, 95.6] | −1.2 pp |
+ | Beam f16 | 91.2% [88.4, 93.4] | −2.0 pp |
+ | Beam q8_0 | 91.2% [88.4, 93.4] | −1.8 pp |
+ | Claude Sonnet 4 | 86.4% [83.1, 89.1] | +6.5 pp |
+ | Gemini 2.5 Pro | 82.0% [78.4, 85.1] | +6.6 pp |
+ | Qwen 3.5 27B base | 86.6% [83.3, 89.3] | +0.3 pp |
+ | GPT-5 | 65.8% [61.5, 69.8] | −11.1 pp |
+ | Regex baseline | 29.6% [25.8, 33.7] | −23.1 pp |
 
  ## Usage with Ollama
 
  ```bash
+ ollama pull torchsight/beam-q4_K_M
+ ollama run torchsight/beam-q4_K_M
  ```
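
For programmatic use, the model can also be queried through Ollama's HTTP endpoint `/api/generate`, the interface named in the benchmark setup. The sketch below is a minimal illustration, not the project's evaluation script. It assumes an Ollama server on the default `localhost:11434`, sends the raw document text as the prompt (the exact alpaca template used for the benchmark is not reproduced here), and assumes the model replies with a JSON array of findings. All helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama address (assumption)
MODEL = "torchsight/beam-q4_K_M"  # tag used in the pull command above


def build_payload(text: str) -> dict:
    """Request body mirroring the benchmark settings: temperature 0, num_predict 2048."""
    return {
        "model": MODEL,
        "prompt": text,
        "stream": False,  # return a single JSON object rather than a token stream
        "options": {"temperature": 0, "num_predict": 2048},
    }


def parse_findings(raw: str) -> list:
    """Parse the model output, assuming it is a JSON array of finding objects."""
    findings = json.loads(raw)
    if not isinstance(findings, list):
        raise ValueError("expected a JSON array of findings")
    return findings


def classify(text: str) -> list:
    """Send one document to the local Ollama server and return its findings."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return parse_findings(body["response"])


if __name__ == "__main__":
    # Placeholder input; requires a running Ollama server with the model pulled.
    for finding in classify("AWS_SECRET_ACCESS_KEY=example-not-a-real-key"):
        print(finding.get("category"), finding.get("severity"))
```

Keeping payload construction and output parsing in separate functions makes both easy to check without a running server; if the model wraps its answer in extra text, `parse_findings` will raise rather than silently return partial results.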
 
+ Or via the [TorchSight CLI](https://github.com/IvanDobrovolsky/torchsight)
+ for the full document-scanning workflow:
 
  ```bash
+ ./install.sh
+ torchsight /path/to/scan
  ```
 
+ ## Training
 
+ - Base: Qwen 3.5 27B (dense)
+ - Method: LoRA (r = 128, α = 256), bf16, 5 epochs
+ - Dataset: 78,358 balanced samples; see [`torchsight/beam-training-data`](https://huggingface.co/datasets/torchsight/beam-training-data)
+ - Hardware: 8× NVIDIA A100 80GB SXM4, 10.5 hours
 
  ## License
 
+ Apache 2.0. The base model (Qwen 3.5 27B) carries its own license; consult
+ upstream terms for use.