diff --git a/.gitattributes b/.gitattributes index a6344aac8c09253b3b630fb776ae94478aa0275b..3f83a4e740962ef822ca9b36a90759a89e281ce0 100644 --- a/.gitattributes +++ b/.gitattributes @@ -33,3 +33,14 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text *.zip filter=lfs diff=lfs merge=lfs -text *.zst filter=lfs diff=lfs merge=lfs -text *tfevents* filter=lfs diff=lfs merge=lfs -text +data/processed/r8_5class_train.jsonl filter=lfs diff=lfs merge=lfs -text +data/processed/enriched_13class_train.jsonl filter=lfs diff=lfs merge=lfs -text +data/processed/enriched_5class_train.jsonl filter=lfs diff=lfs merge=lfs -text +data/processed/enriched_5class_train_cleaned_trimmed.jsonl filter=lfs diff=lfs merge=lfs -text +data/processed/r8_5class_train_propagated.jsonl filter=lfs diff=lfs merge=lfs -text +data/processed/enriched_5class_train_cleaned.jsonl filter=lfs diff=lfs merge=lfs -text +data/processed/r7_5class_train.jsonl filter=lfs diff=lfs merge=lfs -text +data/processed/r9_5class_train.jsonl filter=lfs diff=lfs merge=lfs -text +data/processed/enriched_5class_train_cleaned_deleaked.jsonl filter=lfs diff=lfs merge=lfs -text +data/processed/backup/enriched_13class_train.jsonl filter=lfs diff=lfs merge=lfs -text +data/processed/backup/enriched_5class_train.jsonl filter=lfs diff=lfs merge=lfs -text diff --git a/data/processed/backup/enriched_13class_train.jsonl b/data/processed/backup/enriched_13class_train.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..5c3f715443b6a71f7070cd29e370bc574e77eaa6 --- /dev/null +++ b/data/processed/backup/enriched_13class_train.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e5d1e4f4d9bd3a414fcd81d05242c1f913f575a553b3233adb15a8ae51740ecf +size 24261655 diff --git a/data/processed/backup/enriched_5class_train.jsonl b/data/processed/backup/enriched_5class_train.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..49a7844c24d66287d71e11be90344b586baa1efe --- /dev/null 
+++ b/data/processed/backup/enriched_5class_train.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:90c6603e997c5844aec40404907210816886f6a56bce2acc5f27577b2d7f9469 +size 21643218 diff --git a/data/processed/enriched_13class_train.jsonl b/data/processed/enriched_13class_train.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..9a473399acec02632a0d1435e83888cd63768e7b --- /dev/null +++ b/data/processed/enriched_13class_train.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f723221d386fe83d06916f9c1b0885e52327750bcfa4d9ccac36d0143b79d410 +size 21203019 diff --git a/data/processed/enriched_5class_train.jsonl b/data/processed/enriched_5class_train.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..5c01acda02e3409f234c443b5c54c5cdfdd9c89a --- /dev/null +++ b/data/processed/enriched_5class_train.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ae602b2b8c89136ac80c49061c23fc0b41edeeb677a56888580feea5476dd21a +size 19573553 diff --git a/data/processed/enriched_5class_train_cleaned.jsonl b/data/processed/enriched_5class_train_cleaned.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..e80f0db45751332bb3e2d90eb2ac77af6fa29745 --- /dev/null +++ b/data/processed/enriched_5class_train_cleaned.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f0f49bff19319f8210cc3e1ecfbc18488ef60d73682f318ced1f314afbe44297 +size 18736010 diff --git a/data/processed/enriched_5class_train_cleaned_deleaked.jsonl b/data/processed/enriched_5class_train_cleaned_deleaked.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..60813c217ecdd3a9086d27ce02f26667db4aecb9 --- /dev/null +++ b/data/processed/enriched_5class_train_cleaned_deleaked.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5a047e2b5d5d8197d2601e5ed575082026876be6861eed46cb9ace034213c6d0 +size 16417825 diff 
--git a/data/processed/enriched_5class_train_cleaned_trimmed.jsonl b/data/processed/enriched_5class_train_cleaned_trimmed.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..05d0360d4b7fee1c665c7155a843e3fc123c17a0 --- /dev/null +++ b/data/processed/enriched_5class_train_cleaned_trimmed.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f59710bfdd33666816cb79282c812ae8db62052478d3b740b46ec827fcad71e8 +size 18097434 diff --git a/data/processed/r7_5class_train.jsonl b/data/processed/r7_5class_train.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..52d846d8344202c13408a28b979f369770ee7a86 --- /dev/null +++ b/data/processed/r7_5class_train.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:772a78f8c6fce9a7f0b81120082e0f5579c89b6abd2dbb4c62ccebbd182b3508 +size 19579089 diff --git a/data/processed/r8_5class_train.jsonl b/data/processed/r8_5class_train.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..e9c23bdbbc2f1fddd17b796aa3349e5f0253ddc7 --- /dev/null +++ b/data/processed/r8_5class_train.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e5172ffe17d2f266d1aa9f0e815a7f03045afab7b42b9d2fdaf95555cb23fbd8 +size 18041668 diff --git a/data/processed/r8_5class_train_propagated.jsonl b/data/processed/r8_5class_train_propagated.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..0adaea6b4ab5584215a355a31fd3ea0d6172c38f --- /dev/null +++ b/data/processed/r8_5class_train_propagated.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:04d647cb77ac4a81db2d69e11cca934baed8fd6562488e99c67bd292e016118c +size 22410416 diff --git a/data/processed/r9_5class_train.jsonl b/data/processed/r9_5class_train.jsonl new file mode 100644 index 0000000000000000000000000000000000000000..f0da5e771f005ff7903901feaf790df024b64fb1 --- /dev/null +++ b/data/processed/r9_5class_train.jsonl @@ -0,0 +1,3 @@ 
+version https://git-lfs.github.com/spec/v1 +oid sha256:f8d4b825c5eab6b6ac91915aac239ead3b5622844d69163fe51af4b8f9826d7f +size 14918444 diff --git a/research/decisions/ADR-001-use-case-cybersecurity.md b/research/decisions/ADR-001-use-case-cybersecurity.md new file mode 100644 index 0000000000000000000000000000000000000000..c6d7d6018f346683d209b938630a6b0539aad066 --- /dev/null +++ b/research/decisions/ADR-001-use-case-cybersecurity.md @@ -0,0 +1,45 @@ +# ADR-001: Use Case Selection — Cybersecurity IOC/Entity Extraction + +**Date:** 2026-04-24 +**Status:** Accepted +**Deciders:** Human (lead) + Claude (research partner) + +## Context + +We evaluated 10+ verticals for repurposing OpenAI's Privacy Filter (50M active MoE bidirectional token classifier) beyond PII detection. The Opus research agent conducted a comprehensive landscape analysis covering existing tools, market gaps, available datasets, and architectural fit. + +## Options Considered + +1. **Cybersecurity IOC extraction** — Score 9/10 +2. **Developer tools (secret/annotation scanning)** — Score 8.5/10 +3. **Clinical de-identification** — Score 7.5/10 +4. **Improved PII (Presidio backend)** — Score 7/10 +5. **Financial entity extraction** — Score 6/10 +6. **Energy/power systems** — Low (no data, tiny market) +7. Several others scored lower (legal, scientific, education, supply chain, HR) + +## Decision + +**Cybersecurity IOC/entity extraction from threat intelligence reports**, with CyNER (560M params) as the primary benchmark competitor. + +## Reasoning + +- **Biggest efficiency gap:** CyNER uses 560M dense params; we use 50M active (MoE). 11x compute reduction is the clearest "same accuracy, fraction of the cost" story. +- **Architecture fit:** Cybersecurity entities (IPs, hashes, CVEs, malware names, threat actors) are short-to-medium spans with clear boundaries — ideal for BIOES + Viterbi. +- **257-token window is sufficient:** IOC context is almost always within 1-2 sentences. 
+- **Data exists:** PRISM benchmark, CyNER corpus, Pile-NER cybersecurity subset, MITRE ATT&CK structured data. +- **Privacy argument is strong:** Threat reports contain internal network topology, can't be sent to cloud APIs. +- **Publishable:** "Sparse MoE vs. dense transformer for cybersecurity NER" is a clean research question. +- **Practical tool:** Every SOC team, every SIEM vendor needs lightweight local IOC extraction. + +## Deliverables + +1. **Research paper** — Rigorous comparison of Arcspan vs. CyNER (and other baselines) +2. **Open-source tool** — Fine-tuned checkpoint + CLI/library for cybersecurity entity extraction + +## Consequences + +- Need to acquire and convert multiple cybersecurity NER datasets to BIOES JSONL format +- Need to design a unified label taxonomy across datasets +- Need reproducible experimental setup (fixed seeds, documented hyperparameters, held-out test sets) +- Energy/power systems remains a potential future vertical once the platform is proven diff --git a/research/decisions/ADR-002-strict-r9-and-benchmark-portfolio.md b/research/decisions/ADR-002-strict-r9-and-benchmark-portfolio.md new file mode 100644 index 0000000000000000000000000000000000000000..c3d317bca984ed7cb07029d2aeecfcd6440c7e7c --- /dev/null +++ b/research/decisions/ADR-002-strict-r9-and-benchmark-portfolio.md @@ -0,0 +1,57 @@ +# ADR-002: Strict R9 Dataset and Multi-Benchmark Evaluation + +**Date:** 2026-04-26 +**Status:** Accepted +**Deciders:** Human (lead) + Codex replacing Claude + +## Context + +R8 proved that OpenAI Privacy Filter can learn the 5-class cyber NER task, but the honest exact-match results show different benchmark weaknesses: + +- APTNER exact-match micro F1: 0.4982 +- CyNER exact-match micro F1: 0.4050 + +APTNER mainly exposes APT-report-style Organization/System recall gaps. CyNER mainly exposes Indicator boundary and format coverage gaps, especially defanged or unusual IOCs. 
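The exact-match micro F1 figures above count a predicted span as correct only when its start, end, and label all equal those of a gold span. A minimal sketch of that computation follows. It assumes spans are stored per example as hypothetical `(start, end, label)` tuples. The function name is illustrative, and this is not the project's evaluation script.

```python
from typing import List, Set, Tuple

# Hypothetical span representation: (start_offset, end_offset, label).
Span = Tuple[int, int, str]


def exact_match_micro_f1(
    gold: List[Set[Span]], pred: List[Set[Span]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1), micro-averaged over all examples."""
    # A prediction is a true positive only if the identical tuple is in gold.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# One example: the second predicted span is one character short, so it
# earns no credit under exact match.
gold_spans = [{(0, 5, "Organization"), (16, 29, "Vulnerability")}]
pred_spans = [{(0, 5, "Organization"), (16, 28, "Vulnerability")}]
scores = exact_match_micro_f1(gold_spans, pred_spans)
```

Under this metric a boundary error costs both a false positive and a false negative, which may be part of why Indicator boundary problems weigh so heavily on the CyNER score.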
+ +The entity-propagated R8 file is now available, but its audit found 156,929 added spans on top of 76,824 base spans, including many generic or ambiguous surfaces. Including it in the next run would make any result hard to interpret. + +## Options Considered + +1. **Strict R9 only:** Train on R8 + deleaked CyberNER_harmonized + deleaked DNRTI, with validation/test overlap removed before deduplication. +2. **R9 plus propagated R8:** Add the full propagated dataset immediately to maximize recall. +3. **Delay R9 for a larger data rebuild:** Wait until we harvest much more targeted APT-style and CyNER-style data. + +## Decision + +Run **strict R9** next. + +Do **not** include propagated R8 in strict R9. Treat propagation as a separate future experiment only after filtering/auditing. + +Report R9 with a benchmark portfolio: + +- APTNER exact-match as the independent APT-report benchmark +- CyNER exact-match as the original CyNER benchmark comparison +- Enriched 5-class and SecureBERT2 5-class as supplementary continuity checks +- OPF containment metrics as diagnostics only, not the primary paper-comparable score + +## Reasoning + +- Strict R9 is leakage-clean after the readiness gate: zero exact and zero prefix-80 train overlap with validation, enriched test, CyNER, SecureBERT2, and APTNER. +- The propagated dataset is too noisy for the next controlled experiment. It would likely improve some recall numbers while injecting false positives and benchmark memorization risk. +- A multi-benchmark protocol is necessary because improving APTNER and improving CyNER are not the same task. A single benchmark can be overfit unintentionally even with honest intent. +- Strict R9 gives a clean signal before larger data scaling. If it helps APTNER but not CyNER, the next branch should target Indicators. If it helps neither, we revisit training/decoding rather than blindly adding data. 
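The leakage claim above can be pictured with a small check. This sketch assumes that "prefix-80" means two texts share their first 80 characters; the notes do not define the term, so treat that reading, the function name, and the plain-string inputs as assumptions rather than the readiness gate's actual logic.

```python
from typing import Iterable, Tuple


def leakage_overlap(
    train_texts: Iterable[str], eval_texts: Iterable[str], prefix_len: int = 80
) -> Tuple[int, int]:
    """Count train texts that match an eval text exactly or by prefix.

    Assumption: a prefix match means the first `prefix_len` characters are
    identical. Exact matches are also counted as prefix matches.
    """
    eval_list = list(eval_texts)
    eval_exact = set(eval_list)
    eval_prefixes = {t[:prefix_len] for t in eval_list}
    exact = 0
    prefix = 0
    for text in train_texts:
        if text in eval_exact:
            exact += 1
        if text[:prefix_len] in eval_prefixes:
            prefix += 1
    return exact, prefix


# Toy data: one exact duplicate and one text that shares a long opening.
train = ["a" * 100 + "x", "hello", "unique"]
held_out = ["a" * 100 + "y", "hello"]
overlap = leakage_overlap(train, held_out)
```

A strict gate in this style would require both counts to be zero against every benchmark before training.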
+ +## Consequences + +- R9 may score lower than a noisy propagation-boosted run, but its result will be interpretable. +- Future data work should split into two explicit tracks: + - **Track A:** APT-report-style Organization/System examples. + - **Track B:** CyNER-style Indicator examples, including defanged domains/IPs/URLs, file paths, registry paths, package names, and odd multi-token indicators. +- Decode calibration should happen after strict R9, using validation only, then evaluated unchanged across the benchmark portfolio. + +## Source + +- R9 readiness audit: `results/r9_readiness_audit.md` +- Propagation audit: `results/entity_propagation_audit.md` +- R8 CyNER exact-match note: `research/notes/progress/2026-04-26-02-cyner-exact-match-and-gap-analysis.md` +- R9 readiness note: `research/notes/progress/2026-04-26-03-r9-readiness-and-propagation-audit.md` diff --git a/research/notes/class_balance_audit_2026-04-24.md b/research/notes/class_balance_audit_2026-04-24.md new file mode 100644 index 0000000000000000000000000000000000000000..f2bc37bceb7b840d5370cb63f0c163705319fd96 --- /dev/null +++ b/research/notes/class_balance_audit_2026-04-24.md @@ -0,0 +1,30 @@ +# Arcspan NER Dataset Class Balance Audit (2026-04-24) + +## Summary +Analyzed 5 fixed/deleaked training files comprising **54,139 examples** and **152,941 entity spans** across 5 security NER classes. 
+ +| Dataset | Examples | All-O % | Total Spans | Imbalance (max/min class) | +|---------|----------|---------|-------------|-----------| +| **enriched_trimmed** | 25,127 | 10.0% | 75,677 | 2.46x | +| **enriched_deleaked** | 24,339 | 19.3% | 63,831 | 2.77x | +| **aptner_deleaked** | 3,078 | 33.1% | 4,627 | 16.77x | +| **securebert2_deleaked** | 316 | 47.5% | 344 | 11.42x | +| **defanged_augmented** | 1,279 | 0.0% | 8,462 | 11.41x | +| **COMBINED** | **54,139** | **15.5%** | **152,941** | **2.94x** | + +## Entity Distribution (Combined) +- **Indicator**: 44,282 (29.0%) — most common +- **Malware**: 35,646 (23.3%) +- **Organization**: 31,946 (20.9%) +- **System**: 25,995 (17.0%) +- **Vulnerability**: 15,072 (9.9%) — least common + +## Key Findings +1. **Enriched files dominate**: `enriched_trimmed` + `enriched_deleaked` = 49.5k examples (91% of dataset) +2. **Moderate imbalance**: the 2.94x ratio between the largest and smallest class is within an acceptable range for sequence labeling +3. **All-O distribution**: 15.5% negative examples (reasonable for NER) +4. **Defanged boost**: Augmentation adds 8.5k spans, particularly boosting Indicator class +5. **Smaller sources volatile**: `aptner` and `securebert2` show high imbalance (11–17x) but contribute only about 6% of examples and about 3% of spans + +## Recommendation +**Dataset is well-balanced for training.** The 2.94x imbalance is healthy, and Vulnerability's underrepresentation (9.9%) is acceptable given domain scarcity. Enriched files provide a stable foundation; defanged augmentation adds diversity without distorting class ratios. 
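The combined figures in this audit can be recomputed from the per-class span counts. The sketch below reads the imbalance figure as the largest class count divided by the smallest; that reading matches the combined 2.94x value but is an assumption for the per-dataset rows. The function name is illustrative.

```python
from typing import Dict, Tuple

# Combined span counts copied from the entity distribution in this audit.
COMBINED_COUNTS: Dict[str, int] = {
    "Indicator": 44282,
    "Malware": 35646,
    "Organization": 31946,
    "System": 25995,
    "Vulnerability": 15072,
}


def class_balance(counts: Dict[str, int]) -> Tuple[int, Dict[str, float], float]:
    """Return (total spans, share per class, max/min imbalance ratio)."""
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    # Assumed definition: largest class count divided by smallest.
    ratio = max(counts.values()) / min(counts.values())
    return total, shares, ratio


total_spans, class_shares, imbalance = class_balance(COMBINED_COUNTS)
for label, share in class_shares.items():
    print(f"{label}: {share:.1%}")
print(f"total={total_spans} imbalance={imbalance:.2f}x")
```

Running the same function on each source file's counts would let the per-dataset ratios in the table be checked the same way.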
diff --git a/research/notes/progress/2026-04-24-12-r8-dataset-build.md b/research/notes/progress/2026-04-24-12-r8-dataset-build.md new file mode 100644 index 0000000000000000000000000000000000000000..6f7f5177c0da2ac3b0344cd4baa548c27b0cc4f2 --- /dev/null +++ b/research/notes/progress/2026-04-24-12-r8-dataset-build.md @@ -0,0 +1,27 @@ +# R8 Dataset Build + +## What we found +Built the R8 (likely final) cybersecurity NER training dataset from deleaked sources: +- **26,079 train** examples, **76,824 entities**, 12.0% all-O rate +- **2,999 valid** examples, **5,927 entities**, 12.3% all-O rate +- Sources: enriched (deleaked), APTNER (deleaked), SecureBERT2 (deleaked), defanged augmented +- Stucco excluded (too noisy) +- Trimmed all-O from 20% down to 12% by random subsampling negative examples + +## Entity distribution (train) +- Indicator: 24,685 +- Malware: 16,887 +- Organization: 14,815 +- System: 13,320 +- Vulnerability: 7,117 + +## Leakage verification +- **Zero exact matches** against all 4 test sets +- Prefix-80 matches are false positives (different texts sharing common openings) + +## Why it matters +This is the final clean dataset for training. All known leakage issues resolved. + +## Open questions +- Entity propagation (cross-document) running — will it meaningfully boost recall? +- Vulnerability class is smallest (7K) — may be the hardest to learn diff --git a/research/notes/progress/2026-04-24-16-baseline-eval-script.md b/research/notes/progress/2026-04-24-16-baseline-eval-script.md new file mode 100644 index 0000000000000000000000000000000000000000..34e267369b18b7ab88786f080f2c958069e83fc3 --- /dev/null +++ b/research/notes/progress/2026-04-24-16-baseline-eval-script.md @@ -0,0 +1,33 @@ +# Baseline Evaluation Script Created + +**Date:** 2026-04-24 + +## What we built + +`src/arcspan/eval/run_baselines.py` — evaluates HF NER models against our CyNER test data (748 examples, 5 entity types) with span-level exact-match P/R/F1. + +Two baselines wired up: +1. 
**SecureBERT2.0-NER** (`cisco-ai/SecureBERT2.0-NER`) — TF-based, BIO, 5 entity types matching ours directly +2. **SecureModernBERT-NER** (`attack-vector/SecureModernBERT-NER`) — PyTorch, 22 entity types mapped to our 5-class space + +## Key findings from 20-example smoke test + +| Model | Overall P | Overall R | Overall F1 | +|---|---|---|---| +| SecureBERT2.0-NER | 14.8% | 40.0% | 21.6% | +| SecureModernBERT-NER | 55.0% | 55.0% | 55.0% | + +- SecureBERT2.0 is very noisy — over-predicts spans (low precision), includes trailing punctuation and non-entity text +- SecureModernBERT is substantially better on exact match; cleaner span boundaries +- Both models produce offsets with leading whitespace; we strip it in post-processing +- Neither model saw any Malware or Indicator entities in the first 20 examples (those types appear later in the dataset) + +## Why it matters + +These are the baselines our fine-tuned Arcspan model will be measured against. The script is modular (`BASELINES` registry dict) so adding more models is trivial. + +## Open questions + +- Need to run full 748-example eval for real numbers +- Should we add a "relaxed match" mode (overlapping spans count as partial credit)? +- The 20-example sample is Organization-heavy; full eval will give better per-type coverage diff --git a/research/notes/progress/2026-04-24-20-paper-direction-decided.md b/research/notes/progress/2026-04-24-20-paper-direction-decided.md new file mode 100644 index 0000000000000000000000000000000000000000..889e276ab95309777046fd9944be2f9bae550191 --- /dev/null +++ b/research/notes/progress/2026-04-24-20-paper-direction-decided.md @@ -0,0 +1,24 @@ +# Paper Direction Decided + Experimental Framework + +## Decision +Both a **publishable research paper** and an **open-source tool**. If Arcspan matches or surpasses CyNER (560M dense) at 50M active params, it's a genuine contribution. 
+ +## Paper Thesis +"Sparse MoE token classifiers, fine-tuned with minimal data, can match dense transformer NER models at 1/11th the active compute for cybersecurity entity extraction." + +## The Five Key Experiments +1. **Main comparison table:** Arcspan vs CyNER vs BERT-base vs GLiNER-zero-shot vs regex-only +2. **Data efficiency curve (Figure 1 — the money chart):** F1 at 1%/5%/10%/25%/50%/100% of data +3. **Per-entity-type breakdown:** Where does MoE win vs lose? +4. **Viterbi vs argmax:** Our unique architectural advantage +5. **Expert routing ablation:** top-2 vs top-4 via OPF_EXPERTS_PER_TOKEN + +## Baselines to Implement +- CyNER (560M) — primary competitor +- BERT-base fine-tuned on same data (110M) — standard NER baseline +- GLiNER-M zero-shot (90M) — zero-shot ceiling +- Regex-only — lower bound +- SpaCy trf (110M) — out-of-domain baseline + +## What's Blocking Progress +Waiting on Opus agent for: CyNER exact label schema, dataset locations, PRISM benchmark details. Once we have those, we can design the label space JSON and start data conversion. 
diff --git a/research/notes/progress/2026-04-24-24-cyner2-baseline-discovered.md b/research/notes/progress/2026-04-24-24-cyner2-baseline-discovered.md new file mode 100644 index 0000000000000000000000000000000000000000..e8f9f237e036792950a80202c3285b2e115720de --- /dev/null +++ b/research/notes/progress/2026-04-24-24-cyner2-baseline-discovered.md @@ -0,0 +1,30 @@ +# CyNER 2.0 DeBERTa-v3-base — New Baseline Discovered + +## Key Facts +- **Model**: DeBERTa-v3-base (200M params, dense) +- **F1**: 91.88% (self-reported, needs verification) +- **Training data**: 11,074 examples (7,751 train) — augmented from bnsapa/cybersecurity-ner + AlienVault/OpenCTI +- **Label space**: 8 entity types (original 5 + Date, Location, ThreatGroup) +- **Training**: lr=2e-5, 3 epochs, batch_size=8, weight_decay=0.01 +- **License**: MIT +- **HuggingFace**: https://huggingface.co/PranavaKailash/CyNER-2.0-DeBERTa-v3-base + +## Critical Observations + +1. **91.88% F1 is likely on their own augmented test set** — NOT on the original CyNER test set. This makes direct comparison tricky. We need to eval them on the same test set. +2. **They use 8 entity types** vs CyNER's 5 — added Date, Location, ThreatGroup. Not apples-to-apples. +3. **7,751 training examples vs our 2,811** — they have ~2.8x more training data after augmentation (11,074 examples in total across splits). +4. **LR = 2e-5** — 10x lower than our first run (2e-4). This is a strong hint for our hyperparameter tuning. +5. **DeBERTa-v3-base is 200M dense params** — 4x our active params (50M). + +## What We Can Use From This + +- **Their augmented dataset** (MIT license) — we should download and convert it. 7,751 training examples is much better than our 2,811. +- **LR = 2e-5 as reference point** — consistent with our 2e-4 having been too aggressive. +- **As a baseline** — run their model on our CyNER test set for fair comparison. +- **Their additional entity types** (ThreatGroup, Date) overlap with our planned Tier 1 expansion. 
+ +## Source +- Model: https://huggingface.co/PranavaKailash/CyNER-2.0-DeBERTa-v3-base +- Dataset: https://huggingface.co/datasets/PranavaKailash/CyNER2.0_augmented_dataset +- GitHub: https://github.com/Pranava-Kailash/CyNER_2.0_API diff --git a/research/notes/progress/2026-04-24-25-competitor-landscape-deep-dive.md b/research/notes/progress/2026-04-24-25-competitor-landscape-deep-dive.md new file mode 100644 index 0000000000000000000000000000000000000000..529f931d64d0a6e4df6c91a9fef4a283e1f7ee4c --- /dev/null +++ b/research/notes/progress/2026-04-24-25-competitor-landscape-deep-dive.md @@ -0,0 +1,257 @@ +# Cybersecurity NER Competitor Landscape Deep Dive + +**Date:** 2026-04-24 +**Purpose:** Map the competitive landscape for cybersecurity NER to inform Arcspan's positioning, baselines, and related work section. + +--- + +## Summary Comparison Table + +| Model | Architecture | Active Params | Entity Types | Overall F1 | Dataset | Weights Public? | License | Runnable Baseline? | +|---|---|---|---|---|---|---|---|---| +| **SecureBERT 2.0 NER** | ModernBERT-base (22L, 768d) | ~149M | 5 (Malware, Indicator, Vulnerability, System, Organization) | **0.945** | Cisco internal (3,400 train / 717 test) | Yes ([HF](https://huggingface.co/cisco-ai/SecureBERT2.0-NER)) | Apache 2.0 | **Yes** | +| **SecureModernBERT-NER** | ModernBERT-large | ~395M | 22 fine-grained (MALWARE, THREAT-ACTOR, CVE, IPV4, IPV6, DOMAIN, URL, HASHES, EMAIL, REGISTRY-KEYS, ORG, PRODUCT, PLATFORM, SERVICE, SECTOR, LOC, FILEPATH, MITRE-TACTIC, TOOL, CAMPAIGN, ...) | **0.848** | 502,726 curated spans from real-world CTI | Yes ([HF](https://huggingface.co/attack-vector/SecureModernBERT-NER)) | MIT | **Yes** | +| **CyberLLaMA** | LLaMA-3.2-3B + BiLSTM + CRF | ~3B | BIO-tagged cybersecurity terms (4,788 unique) | **0.989** | 42,404 articles (newspapers, blogs, official sites) | No (paper only) | Unknown | **No** | +| **XLNet-CRF** | XLNet-base + CRF | ~110M | CTI entities (malware, IP, URL, hash, etc.) 
| **0.974** (CTI-Reports), **0.887** (MalwareTextDB) | CTI-Reports, MalwareTextDB | Code on GitHub (no pretrained weights) | Unknown | **Partial** (retrain needed) | +| **BERT-CRF for CTI** | BERT-base + CRF | ~110M | 13 types (DNRTI), malware/IP/URL/hash (CTI-Reports) | **0.900** (DNRTI), **0.773** (CTI-Reports) | DNRTI (182K words), CTI-Reports (310K records), MalwareTextDB | Code on [GitHub](https://github.com/stwater20/NER-BERT-CRF-for-CTI) | Unknown | **Partial** | +| **CyNER** | Transformer + heuristics ensemble | Varies | Malware, threat actors, indicators, vulnerabilities | ~0.74 (on CyberNER harmonized) | CyNER dataset | Yes ([GitHub](https://github.com/aiforsec/CyNER)) | Unknown | **Yes** | +| **CyberNER (Harmonized)** | RoBERTa/SecureBERT/CySecBERT + CRF | ~125M | 21 STIX 2.1 entity types | **0.736** (RoBERTa best) | 610K tokens, 23,477 sentences from 4 merged datasets | Dataset public, code public | Unknown | **Yes** (benchmark) | +| **SecLMNER** | LLM (generative) + SecureBERT (encoder) | <10B + 110M | Multi-source cybersecurity entities | SecureBERT +6-17% F1 | 5 cybersecurity text sources | No (paper only) | Unknown | **No** | + +--- + +## Detailed Model Profiles + +### 1. SecureBERT 2.0 NER (Cisco AI) + +**Architecture:** ModernBERT-base with 22 hidden layers, 768 hidden size, 12 attention heads, max 8,192 tokens. Fine-tuned for token classification. + +**Training data:** Cisco's internal hand-labeled NER corpus: 3,400 training samples, 717 test samples. The base model (SecureBERT 2.0) was pretrained on 13B+ text tokens and 53M code tokens from cybersecurity sources. + +**Performance:** +| Model | F1 | Recall | Precision | +|---|---|---|---| +| CyBERT | 0.351 | 0.281 | 0.467 | +| SecureBERT 1.0 | 0.734 | 0.759 | 0.717 | +| **SecureBERT 2.0** | **0.945** | **0.965** | **0.927** | + +Per-entity breakdown not publicly reported (only aggregate). Entity types: Malware, Indicator, Vulnerability, System, Organization (5 categories with 11 labels via BIO). 
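The "11 labels via BIO" figure follows from two tags (B, I) for each of the 5 entity types, plus a single `O` tag. A small sketch of that arithmetic is below. The BIOES count is shown only for comparison with the BIOES scheme mentioned elsewhere in these notes. The helper name is illustrative and does not come from either model's code.

```python
from typing import List

ENTITY_TYPES = ["Malware", "Indicator", "Vulnerability", "System", "Organization"]


def tag_set(entity_types: List[str], prefixes: str) -> List[str]:
    """Build a tag inventory: one O tag plus prefix-type pairs."""
    return ["O"] + [f"{p}-{t}" for t in entity_types for p in prefixes]


bio_tags = tag_set(ENTITY_TYPES, "BI")      # 5 types x 2 prefixes + O
bioes_tags = tag_set(ENTITY_TYPES, "BIES")  # 5 types x 4 prefixes + O
print(len(bio_tags), len(bioes_tags))
```

The same 5-type label space therefore has a larger output layer under BIOES, which is worth keeping in mind when comparing classifier heads across models.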
+ +**Availability:** +- HuggingFace: `cisco-ai/SecureBERT2.0-NER` (Apache 2.0) +- GitHub: `cisco-ai-defense/securebert2` +- Uses TF model (`TFAutoModelForTokenClassification`) -- note TensorFlow dependency + +**Paper:** Aghaei, E. et al. "SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence." arXiv:2510.00240 (2025). https://arxiv.org/abs/2510.00240 + +**Baseline verdict:** **PRIMARY BASELINE.** Directly downloadable and runnable. The 0.945 F1 is on their own dataset with only 5 entity types -- important caveat. We must either (a) eval on their dataset with their labels, or (b) eval on a shared benchmark. + +--- + +### 2. SecureModernBERT-NER (attack-vector) + +**Architecture:** ModernBERT-large (answerdotai/ModernBERT-large), ~395M params, fine-tuned for token classification. + +**Training data:** 502,726 manually curated text spans from real-world threat reports, vulnerability advisories, and incident analyses. Max sequence length 128 tokens during training. + +**Performance:** +- Precision: 0.847, Recall: 0.848, F1: 0.848, Accuracy: 0.959 +- Strong per-label: CVE (0.9995), SHA256 (0.9874), URL (0.9801), LOC (0.9557) +- Weaker: IPV6, EMAIL (rare types) + +**Entity types (22):** MALWARE, THREAT-ACTOR, CVE, IPV4, IPV6, DOMAIN, URL, MD5, SHA1, SHA256, EMAIL, REGISTRY-KEYS, ORG, PRODUCT, PLATFORM, SERVICE, SECTOR, LOC, FILEPATH, MITRE-TACTIC, TOOL, CAMPAIGN + +**Availability:** +- HuggingFace: `attack-vector/SecureModernBERT-NER` (MIT license) +- PyTorch model, standard `pipeline("token-classification")` inference + +**Paper:** No academic paper. Community model card only. + +**Baseline verdict:** **STRONG BASELINE.** 22 entity types makes this the most comprehensive label space. The 0.848 F1 across 22 types is arguably more impressive than SecureBERT 2.0's 0.945 across only 5 types. Directly runnable. MIT license is ideal. + +--- + +### 3. CyMapNER / CyNER + +**Note:** "CyMapNER" does not appear to be a real model. 
The actual model is **CyNER** — an open-source Python library from `aiforsec`. + +**Architecture:** Ensemble approach combining transformer-based models, heuristics for IOC extraction, and publicly available NER models. + +**Training data:** Custom cybersecurity corpus; integrates with MALOnt2.0 ontology. + +**Performance:** On the CyberNER harmonized benchmark, transformer models trained on CyNER data achieve ~0.74 F1. + +**Availability:** +- GitHub: `aiforsec/CyNER` +- arXiv: https://arxiv.org/abs/2204.05754 + +**Paper:** Alam, M.T. et al. "CyNER: A Python Library for Cybersecurity Named Entity Recognition." arXiv:2204.05754 (2022). + +**Baseline verdict:** **SECONDARY BASELINE.** Useful as a reference point for the CyberNER harmonized benchmark. Older (2022) and lower performance. + +--- + +### 4. CTI-BERT / BERT-CRF for CTI + +**Architecture:** BERT-base-uncased + CRF layer. Also evaluated with secBERT (domain-adapted BERT). + +**Training data:** Three public datasets: +- DNRTI: 182,452 words, 300+ threat reports, 13 entity classes +- CTI-Reports: 310,406 records (malware, IP, URL, hash) +- MalwareTextDB: malware text samples + +**Performance:** +- DNRTI: 90.02% F1 +- CTI-Reports: 77.29% F1 (high precision 98.37%, low recall 74.10%) +- MalwareTextDB: 58.57% F1 +- Real-world OSINT: 82.64% accuracy + +**Entity types:** 13 types on DNRTI (hacker groups, attacks, tools, vulnerabilities, methods); 4 types on CTI-Reports (malware, IP, URL, hash). + +**Availability:** +- GitHub: `stwater20/NER-BERT-CRF-for-CTI` +- No pretrained weights hosted; training code available + +**Paper:** Authors from NYCU. Published as a conference/workshop paper. PDF: https://speed.cs.nycu.edu.tw/~ydlin/Enhancing%20Cyber%20Threat%20Intelligence%20with%20Named%20Entity%20Recognition%20using%20BERT-CRF.pdf + +**Baseline verdict:** **REFERENCE ONLY.** No hosted weights. Useful as a literature comparison point for DNRTI/CTI-Reports benchmarks. + +--- + +### 5. 
LANCE + +**Note:** No model called "LANCE" was found in the cybersecurity NER literature. This may be a confusion with: +- **LanG** — a governance-aware agentic AI platform (unrelated) +- **SecLMNER** — the LLM+encoder pipeline framework +- **TTPrompt** — the retrieval-to-reasoning CTI NER framework + +The closest match to "LLM-based pipeline using GPT-4o/Llama on PRISM benchmark" is the **CyberBench** evaluation, which tested GPT-4 and Llama-2 on cybersecurity tasks. **PRISM** appears to be a GLM model variant, not a cybersecurity benchmark. + +**CyberBench results (AAAI-24 Workshop):** +- GPT-4: 69.6 average across all tasks +- GPT-3.5-Turbo: 62.6 +- Llama-2-13B: 54.1 +- CyberInstruct-13B (fine-tuned Llama-2): 70.4 +- For NER specifically: BERT-based models outperformed generative LLMs + +**Baseline verdict:** **NOT A REAL COMPETITOR.** "LANCE" likely doesn't exist as described. CyberBench/CyberInstruct results confirm that generative LLMs underperform specialized encoder models on NER. + +--- + +### 6. Additional High-Performers + +#### CyberLLaMA (2025) +- **Architecture:** LLaMA-3.2-3B + BiLSTM + CRF +- **F1: 98.88%** — but this is on their own custom dataset (42,404 articles, 4,788 terms). No cross-benchmark validation. +- **Paper:** Zhang, H. et al. "CyberLLaMA: A fine-tuned large language model for cybersecurity named entity recognition." Knowledge-Based Systems 328:114183 (2025). +- **Weights:** NOT public. Paper only. +- **Baseline verdict:** **NOT USABLE.** No weights, no shared benchmark. The 98.88% F1 is likely inflated by narrow label space and custom eval. Include in related work, not in experiments. + +#### XLNet-CRF (2025) +- **Architecture:** XLNet-base + CRF +- **F1: 97.43%** on CTI-Reports, 88.65% on MalwareTextDB +- **Paper:** Wang, T. et al. "XLNet-CRF: Efficient Named Entity Recognition for Cyber Threat Intelligence with Permutation Language Modeling." Electronics 14(15):3034 (2025). 
https://www.mdpi.com/2079-9292/14/15/3034 +- **Code:** GitHub (training code, no pretrained weights) +- **Baseline verdict:** **REFERENCE ONLY.** We can cite their numbers on CTI-Reports/MalwareTextDB. Could retrain if we use those datasets. + +#### CyberNER Harmonized Benchmark (2025) +- **Architecture:** Various (RoBERTa+CRF best at 0.736 F1) +- **21 STIX 2.1 entity types**, 610K tokens +- **Paper:** Ech-Chammakhy, Y. et al. "CyberNER: A Harmonized STIX Corpus for Cybersecurity Named Entity Recognition." arXiv:2510.26499 (2025). +- **Data + code:** Publicly available +- **Baseline verdict:** **USE AS BENCHMARK.** This is the most principled evaluation framework -- STIX-aligned, multi-dataset, public. Best baseline F1 is only 0.736, leaving huge room for Arcspan to demonstrate value. + +#### SecLMNER (2025) +- **Architecture:** Two-stage: generative LLM (<10B params) reformats text, then SecureBERT does NER +- **Performance:** +6-17% F1 over SecureBERT alone +- **Paper:** Zhang, Y. et al. "SecLMNER: A framework for enhanced NER in multi-source cybersecurity data using LLMs." Expert Systems with Applications 271:126651 (2025). +- **Weights:** NOT public. +- **Baseline verdict:** **REFERENCE ONLY.** Interesting architecture comparison (two-stage LLM+encoder vs. our single-pass approach). + +--- + +## Key Observations + +1. **The field is fragmented.** No single benchmark dominates. Everyone evaluates on different datasets with different label spaces, making direct comparison nearly impossible. + +2. **CyberNER harmonized benchmark is the best shared eval.** 21 STIX entity types, public data+code, multiple baselines. Best result is only 0.736 F1 -- enormous headroom. + +3. **SecureBERT 2.0's 0.945 F1 is on only 5 coarse entity types** with a small private dataset. Impressive but not directly comparable to models handling 20+ types. + +4. 
**SecureModernBERT-NER is our closest competitor** in terms of practical utility (22 types, MIT license, public weights, standard inference). Its 0.848 F1 is the number to beat. + +5. **The reported 97%+ F1 scores (CyberLLaMA at 98.88%, XLNet-CRF at 97.43%) are on narrow/custom benchmarks** and pretrained weights are not public. Not practically threatening. + +6. **Arcspan's architectural advantages:** 50M active params (vs. 149-395M for competitors), 128K context window (vs. 128-8192 for competitors), single-pass Viterbi decoding (vs. pipeline approaches), BIOES scheme (vs. BIO). + +--- + +## Recommended Baselines for Our Paper + +### Tier 1: Must Include (runnable, public weights) + +| Model | How to Run | What to Report | +|---|---|---| +| **SecureBERT 2.0 NER** | `pip install transformers tensorflow`; load `cisco-ai/SecureBERT2.0-NER`; standard NER pipeline. **Note:** TF model, may need `TFAutoModelForTokenClassification`. | F1 on our dataset + their dataset if we can get it | +| **SecureModernBERT-NER** | `pip install transformers`; load `attack-vector/SecureModernBERT-NER`; `pipeline("token-classification")`. PyTorch, straightforward. | F1 on our dataset (22 entity types, map to our label space) | +| **CyNER** | `pip install cyner`; GitHub `aiforsec/CyNER`. Ensemble approach. 
| F1 on CyberNER benchmark | + +### Tier 2: Benchmark Comparison (shared datasets) + +| Benchmark | Source | Best Published F1 | Our Target | +|---|---|---|---| +| **CyberNER (STIX harmonized)** | arXiv:2510.26499, public | 0.736 (RoBERTa+CRF) | >0.80 | +| **DNRTI** | Public, 13 entity types | 0.900 (BERT-CRF) | >0.90 | +| **CTI-Reports** | Public, 4 entity types | 0.974 (XLNet-CRF) | Competitive | + +### Tier 3: Literature Comparison (cite numbers, can't re-run) + +| Model | Reported F1 | Notes | +|---|---|---| +| CyberLLaMA | 0.989 | Custom dataset, no weights, 3B params | +| XLNet-CRF | 0.974 | CTI-Reports only, no pretrained weights | +| SecLMNER | SecureBERT +6-17% | Two-stage pipeline, no weights | +| BERT-CRF for CTI | 0.900 (DNRTI) | Can retrain from code if needed | + +### Practical Instructions + +```bash +# SecureBERT 2.0 NER +pip install transformers tensorflow +python -c " +from transformers import AutoTokenizer, TFAutoModelForTokenClassification, pipeline +model = TFAutoModelForTokenClassification.from_pretrained('cisco-ai/SecureBERT2.0-NER') +tokenizer = AutoTokenizer.from_pretrained('cisco-ai/SecureBERT2.0-NER') +nlp = pipeline('ner', model=model, tokenizer=tokenizer) +print(nlp('APT29 exploited CVE-2024-1234 using Cobalt Strike against Microsoft Exchange.')) +" + +# SecureModernBERT-NER +pip install transformers torch +python -c " +from transformers import pipeline +nlp = pipeline('token-classification', model='attack-vector/SecureModernBERT-NER', aggregation_strategy='first') +print(nlp('APT29 exploited CVE-2024-1234 using Cobalt Strike against Microsoft Exchange.')) +" + +# CyNER +pip install cyner +python -c " +import cyner +model = cyner.CyNER() +print(model.get_entities('APT29 exploited CVE-2024-1234 using Cobalt Strike.')) +" +``` + +--- + +## Arcspan Positioning + +Our key differentiators vs. the field: +1. **Roughly 3-8x smaller active footprint** (50M vs. 149-395M) -- crucial for edge/SOC deployment +2. 
**128K context window** -- can process entire threat reports in one pass (competitors max at 512-8192) +3. **Constrained Viterbi decoding with BIOES** -- structurally guaranteed valid spans (competitors use BIO + greedy/CRF) +4. **Single-pass architecture** -- no two-stage LLM preprocessing (vs. SecLMNER) +5. **MoE efficiency** -- 1.5B total params but only 50M active per token + +The CyberNER harmonized benchmark (0.736 best F1, 21 STIX types) is our ideal proving ground. If Arcspan can hit >0.80 F1 on that benchmark with 50M active params, the story writes itself. diff --git a/research/notes/progress/2026-04-24-26-dataset-aggregation-plan.md b/research/notes/progress/2026-04-24-26-dataset-aggregation-plan.md new file mode 100644 index 0000000000000000000000000000000000000000..d104afd0c73c44598471e611b4becab1cdc6b8e2 --- /dev/null +++ b/research/notes/progress/2026-04-24-26-dataset-aggregation-plan.md @@ -0,0 +1,99 @@ +# Cybersecurity NER Dataset Aggregation — Research & Results + +## What We Built + +A master aggregation pipeline (`src/arcspan/data/aggregate_datasets.py`) that combines 4 public cybersecurity NER datasets into a unified 13-class and 5-class OPF BIOES JSONL format, with deduplication. 
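
As a rough illustration of two steps named in the pipeline description above (converting CoNLL-style BIO tags to BIOES and dropping exact-text duplicates), a minimal sketch might look like the following. The function names and the `text` record field are assumptions made for illustration; this is not the actual contents of `aggregate_datasets.py`.

```python
def bio_to_bioes(tags):
    """Convert one sentence's BIO tags to BIOES (hypothetical helper)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # A span continues only if the next tag is I- with the same label.
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{label}"
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


def dedup_exact(records):
    """Keep the first record for each exact text string (assumed 'text' field)."""
    seen = set()
    kept = []
    for rec in records:
        if rec["text"] in seen:
            continue
        seen.add(rec["text"])
        kept.append(rec)
    return kept


if __name__ == "__main__":
    print(bio_to_bioes(["B-MALWARE", "I-MALWARE", "O", "B-TOOL"]))
    print(len(dedup_exact([{"text": "a"}, {"text": "a"}, {"text": "b"}])))
```

Keeping the first occurrence means source ordering decides which annotation survives, so a pipeline that prefers one source's labels would need to load that source first.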
+ +## Datasets Found & Status + +### ✅ Successfully Aggregated (4 datasets) + +| Dataset | Source | Format | Raw Types | Sentences | Spans | Notes | +|---------|--------|--------|-----------|-----------|-------|-------| +| **CyNER original** | `data/raw/CyNER/dataset/mitre/` | CoNLL BIO | 5 | 4,372 | 3,040 | Baseline dataset | +| **CyNER 2.0 augmented** | `data/raw/cyner2_augmented/hf_dataset/` | HF datasets | 8 | 11,074 | 15,036 | Adds ThreatActor, Date, Location | +| **CyberNER harmonized** | `data/raw/CyberNER_harmonized/` (GitHub: yasirech-chammakhy/CyberNER) | CSV w/ STIX tags | 21 | 10,042 | 42,329 | **Best single source** — harmonizes CyNER+DNRTI+APTNER+Attacker onto STIX 2.1 | +| **DNRTI** | `data/raw/DNRTI/DNRTI_Dataset/` (GitHub: LiuPeiP-CS/NER4CTI) | CoNLL BIO | 13 | 6,577 | 12,974 | Chinese-origin CTI dataset, 13 cybersec types | + +### ❌ Not Usable for NER Training + +| Dataset | Why Not | +|---------|---------| +| **SecureModernBERT-NER** | Only the *model* is on HuggingFace (attack-vector/SecureModernBERT-NER). Training data (502K spans, 22 classes) is NOT published. Model card describes the data but doesn't share it. | +| **PRISM** | IOC *classification* (IoC vs nonIoC per indicator), not span-level NER annotations. Already at `data/raw/LANCE/PRISM/GT.json`. | +| **CTI-Reports** | Behind IEEE DataPort download wall. XML format with IOC extractions, not token-level NER. | +| **MalwareTextDB** | Requires manual download from statnlp.org (link may be dead). Only has generic "Entity" labels — no typed NER. | +| **bnsapa/cybersecurity-ner** | Just a distilBERT fine-tune on original CyNER MITRE data — same data we already have. | +| **Pile-NER cybersecurity subset** | General-purpose GPT-3.5-generated NER, not cybersecurity-specific. Would need heavy filtering and label mapping. Low quality. | +| **MITRE ATT&CK STIX data** | Structured KB, not annotated text. Useful for distant supervision / data augmentation but not direct NER training. 
| + +### 🔍 Notable: CyberNER Already Subsumes Multiple Sources + +The CyberNER harmonized corpus (arXiv:2510.26499) already harmonizes CyNER, DNRTI, APTNER, and Attacker datasets. This means our aggregation has **significant overlap** between CyberNER and the individual CyNER/DNRTI datasets. The deduplication step removed ~3,766 duplicates (exact text match), but some paraphrased overlap likely remains. This is acceptable — the STIX-harmonized labels from CyberNER are higher quality than the raw source labels. + +## Final Aggregated Stats + +After deduplication: + +| Split | Sentences | Spans | +|-------|-----------|-------| +| Train | 20,436 | 52,331 | +| Valid | 3,966 | 8,229 | +| Test | 3,897 | 7,903 | +| **Total** | **28,299** | **68,463** | + +### 13-Class Label Distribution (train) + +| Label | Count | % | +|-------|-------|---| +| MALWARE | 12,537 | 24.0% | +| ORGANIZATION | 12,036 | 23.0% | +| THREAT_ACTOR | 11,589 | 22.1% | +| TOOL | 7,459 | 14.3% | +| SYSTEM | 3,672 | 7.0% | +| VULNERABILITY | 2,709 | 5.2% | +| FILEPATH | 1,764 | 3.4% | +| DOMAIN | 298 | 0.6% | +| IP_ADDRESS | 168 | 0.3% | +| URL | 71 | 0.1% | +| EMAIL | 28 | <0.1% | +| CVE_ID | 0 | 0% | +| HASH | 0 | 0% | + +## Unified Label Mapping + +### CyNER (5 types) → 13-class +- Malware → MALWARE +- System → SYSTEM +- Organization → ORGANIZATION +- Vulnerability → VULNERABILITY +- Indicator → **dropped** (mixed IOC types, can't reliably split) + +### CyNER 2.0 (8 types) → 13-class +- Malware → MALWARE, ThreatActor → THREAT_ACTOR, System → SYSTEM, Organization → ORGANIZATION, Vulnerability → VULNERABILITY +- Indicator → dropped, Date → dropped, Location → dropped + +### CyberNER STIX (21 types) → 13-class +- Malware → MALWARE, Threat-Actor → THREAT_ACTOR, Intrusion-Set → THREAT_ACTOR +- Tool → TOOL, Software → SYSTEM, Infrastructure → SYSTEM +- Identity → ORGANIZATION, Vulnerability → VULNERABILITY +- Domain-Name → DOMAIN, IPv4-Addr → IP_ADDRESS, URL → URL, Email-Addr → EMAIL, File → FILEPATH +- 
Dropped: Campaign, Attack-Pattern, Course-of-Action, Indicator, Location, Observed-Data, Malware-Analysis, Network-Traffic + +### DNRTI (13 types) → 13-class +- HackOrg → THREAT_ACTOR, SamFile → MALWARE, Tool → TOOL +- SecTeam → ORGANIZATION, Org → ORGANIZATION, Exp → VULNERABILITY +- Dropped: OffAct, Time, Purp, Area, Idus, Way, Features + +## Open Questions + +1. **CVE_ID and HASH have zero examples.** Need regex-based distant supervision from MITRE ATT&CK or synthetic generation to populate these. +2. **IOC classes are severely underrepresented** (IP, DOMAIN, URL, EMAIL, FILEPATH total ~2,329 in train). Consider augmenting with regex-extracted IOCs from CTI reports. +3. **CyberNER overlap with CyNER/DNRTI.** We deduplicate by exact text, but the same sentences appear with different tokenizations. Could do fuzzy dedup but risk losing valid data. +4. **SecureModernBERT training data** (502K spans) would be transformative if released. Worth reaching out to the authors. + +## Sources +- CyberNER: https://github.com/yasirech-chammakhy/CyberNER | arXiv:2510.26499 +- DNRTI: https://github.com/LiuPeiP-CS/NER4CTI +- CyNER 2.0: HuggingFace PranavaKailash/CyNER2.0_augmented_dataset +- SecureModernBERT-NER: https://huggingface.co/attack-vector/SecureModernBERT-NER (model only) diff --git a/research/notes/progress/2026-04-24-29-final-llm-merge-complete.md b/research/notes/progress/2026-04-24-29-final-llm-merge-complete.md new file mode 100644 index 0000000000000000000000000000000000000000..3068fe1c73a34f12266b67f331554140de8d4153 --- /dev/null +++ b/research/notes/progress/2026-04-24-29-final-llm-merge-complete.md @@ -0,0 +1,43 @@ +# Final LLM Annotation Merge — All 8 Sources Complete + +## Enriched Dataset Stats +- **enriched_13class_train**: 22,052 examples (20,436 aggregated + 1,616 LLM) +- **enriched_5class_train**: 21,891 examples +- Total LLM spans: 6,060 across all 13 entity types + +## LLM Span Distribution (all sources combined) +| Label | Count | +|-------|-------| 
+| MALWARE | 1,638 | +| THREAT_ACTOR | 959 | +| SYSTEM | 796 | +| CVE_ID | 485 | +| VULNERABILITY | 425 | +| TOOL | 325 | +| ORGANIZATION | 315 | +| DOMAIN | 271 | +| IP_ADDRESS | 248 | +| HASH | 248 | +| FILEPATH | 234 | +| URL | 69 | +| EMAIL | 47 | + +## Sources (8 annotation agents) +| Source | Examples | Spans | +|--------|----------|-------| +| MITRE ATT&CK | 954 | 2,750 | +| NVD CVEs | 339 | 990 | +| Synthetic | 100 | 752 | +| Vendor blogs | 67 | 446 | +| News articles | 51 | 362 | +| CISA advisories | 40 | 400 | +| AlienVault OTX | 40 | 295 | +| Malware reports | 25 | 464 | + +## Why It Matters +- Zero-count entity classes eliminated (CVE_ID: 0→485, HASH: 0→248); FILEPATH gains 234 LLM spans on top of the 1,764 already in the aggregated train split +- 8% more training data for Round 5 +- Diverse sources = better generalization + +## Next +Round 4 training in progress (~1h36m). Round 5 script staged and ready. diff --git a/research/notes/progress/2026-04-24-30-data-quality-audit.md b/research/notes/progress/2026-04-24-30-data-quality-audit.md new file mode 100644 index 0000000000000000000000000000000000000000..4a24f19c6f2097bc2413f3fcdd26f8e0236b3633 --- /dev/null +++ b/research/notes/progress/2026-04-24-30-data-quality-audit.md @@ -0,0 +1,245 @@ +# Data Quality Audit — LLM-Annotated Cybersecurity NER Data + +**Date:** 2026-04-24 +**Auditor:** Automated script + manual review +**Scope:** All 13 files in `data/processed/`, 17,516 total records + +--- + +## Executive Summary + +| Issue | Count | Severity | Action Required | +|-------|-------|----------|-----------------| +| Offset errors | **0** | — | None | +| Duplicate texts | **1,727 unique** (4,408 records) | HIGH | Deduplicate before training | +| Short texts (<20 chars) | **71** | MEDIUM | Remove — too short for meaningful NER | +| Mislabeled entities | **~10,854** | CRITICAL | See breakdown — most are label-space design issues | +| Overlapping spans | **1,060** | HIGH | Fix or pick longest-match | +| Garbage text (real HTML) | **~471** | MEDIUM | Strip HTML markup | +| 
Repetitive entities (50+) | **100 entities** | MEDIUM | Review for template artifacts | +| Empty spans (no annotations) | **942** | LOW-MEDIUM | Decide: keep as negatives or remove | + +**Overall data health: FAIR.** Offsets are clean (big win), but label consistency, overlaps, and duplicates need remediation before training. + +--- + +## 1. Offset Errors: 0 ✅ + +All `text[start:end]` slices match their declared entity text across all 17,516 records. The annotation pipeline produced correct character offsets. + +--- + +## 2. Duplicate Texts: 1,727 unique texts appear 2+ times (4,408 total records) + +**Within-file duplicates:** 78 unique texts +**Cross-file duplicates:** 1,649 unique texts + +### Worst offenders: +- `"Ransomware."` — **44 copies** in `llm_annotated_apt.jsonl` +- `"Ransomware"` — 7 copies in same file +- Many MITRE descriptions appear in **both** `llm_annotated_mitre.jsonl` AND `llm_annotated_mitre_v2.jsonl` AND `llm_annotated_apt.jsonl` (3-4 copies each) +- Oracle NVD boilerplate descriptions appear 4-6 times in `llm_annotated_nvd_v2.jsonl` + +### Root cause: +- `mitre` and `mitre_v2` are overlapping dataset versions that were both kept +- `apt` dataset ingested MITRE descriptions alongside its own data +- Very short texts like "Ransomware." are degenerate entries from APT descriptions + +### Recommendation: +**Deduplicate globally.** Keep the version with the best annotations when spans differ. Priority: `mitre_v2` > `mitre`, `nvd_v2` > `nvd`. + +--- + +## 3. Short Texts (<20 chars): 71 + +All 71 are from `llm_annotated_apt.jsonl`. Examples: +- `"WebShell."` (9 chars) — 2 occurrences +- `"Ransomware."` (11 chars) — 44+ occurrences +- `"Keylogger."` (10 chars) +- `"PyVil RAT"` (9 chars) + +These are malware "descriptions" that are just a single word. They have no spans (empty annotations) and provide zero training signal. + +### Recommendation: +**Remove all records with text <20 chars.** They cannot produce useful span examples. + +--- + +## 4. 
Mislabeled Entities: ~10,854 flagged + +This is the highest-count issue but most are **label-space design disagreements**, not random errors. Breakdown: + +### 4a. Security vendors labeled as SYSTEM instead of ORGANIZATION (200 instances) + +| Entity | Count | +|--------|-------| +| ESET | 37 | +| Trend Micro | 25 | +| Kaspersky | 16 | +| Symantec | 11 | +| SentinelOne | 8 | +| Avast | 7 | +| Fortinet | 7 | +| Bitdefender | 3 | +| Sophos | 2 | +| Palo Alto | 2 | +| McAfee | 1 | + +**Analysis:** The LLM annotator confused security product names with their parent companies. "Kaspersky" the company vs "Kaspersky" the antivirus product. This is genuinely ambiguous, but for cybersecurity NER, these should be **ORGANIZATION**. + +**Severity: HIGH.** These are real errors that will confuse the model. Fix by relabeling. + +### 4b. CVE_ID vs VULNERABILITY label (30 instances) + +CVE identifiers (e.g., `CVE-2023-1389`) are labeled as `CVE_ID` but the audit expected `VULNERABILITY`. + +**Analysis:** This is actually a **label-space design question**. If the label space includes both `CVE_ID` and `VULNERABILITY`, then CVE IDs should indeed be `CVE_ID`. Check if `CVE_ID` is in the intended label space. + +**Severity: LOW** if `CVE_ID` is a valid label. **HIGH** if it's not in the final label space. + +### 4c. URL and HASH labeled as their own types instead of INDICATOR (51 instances) + +URLs labeled `URL`, hashes labeled `HASH` — audit expected `INDICATOR`. + +**Analysis:** Same as 4b — depends on label-space design. If `URL`, `HASH`, `IP_ADDRESS`, `DOMAIN`, `EMAIL` are all valid labels (they appear in the label distribution), then these are **correct**. The audit's expectation of a single `INDICATOR` class was wrong. + +**Severity: NOT AN ISSUE** — the data uses fine-grained IOC labels which is actually better for cybersecurity NER. + +### 4d. 
Revised mislabel count + +Excluding label-space design issues (4b, 4c), the **real mislabel count is ~200** (security vendors as SYSTEM). This is much more manageable. + +--- + +## 5. Overlapping Spans: 1,060 + +### Dominant patterns: + +1. **"Google Play" triple overlap** (~100+ instances): + - `ORGANIZATION: Google [26:32]` + - `SYSTEM: Google Play [26:37]` + - `MALWARE: Play [33:37]` ← **This is wrong** — "Play" (as in Google Play) is not malware + +2. **Nested entity annotations** (e.g., `SYSTEM: Cisco` inside `ORGANIZATION: Cisco Talos`) + +3. **Partial overlaps** (e.g., `SYSTEM: Android` overlapping `SYSTEM: Android operating system`) + +### Root cause: +The LLM annotator is producing **all possible readings** of ambiguous spans instead of picking one. The BIOES tagging scheme used by the model **cannot represent overlapping spans** — the Viterbi decoder produces exactly one label per token. + +### Recommendation: +**Resolve all overlaps before training.** Strategy: +- For nested spans: keep the **longest** span +- For `Google Play`: annotate as `SYSTEM: Google Play` only (not three separate entities) +- For `MALWARE: Play`: **remove** — this is a false annotation. "Play" in "Google Play" is not the Play ransomware group +- General rule: prefer the span that covers the full entity mention + +--- + +## 6. Garbage Text / HTML Artifacts: ~471 records with real HTML + +Of 1,119 records flagged for HTML-like patterns: +- **~471** contain actual HTML markup tags (`

`, ``, ``, etc.) +- **~648** contain legitimate code references (`