@@ -1,199 +1,297 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 ### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
 ### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
 ### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
 ## Model Card Contact
-[More Information Needed]

 ---
+language:
+- en
+license: apache-2.0
 library_name: transformers
+tags:
+- cybersecurity
+- APT
+- threat-intelligence
+- contrastive-learning
+- embeddings
+- attribution
+- MITRE-ATTACK
+- CTI
+- ModernBERT
+datasets:
+- mitre-attack
+base_model: cisco-ai/SecureBERT2.0-base
+pipeline_tag: feature-extraction
+model-index:
+- name: FALCON
+  results:
+  - task:
+      type: text-classification
+      name: APT Group Attribution
+    metrics:
+    - type: accuracy
+      value: 0.0
+      name: Accuracy (5-fold CV)
+    - type: f1
+      value: 0.0
+      name: F1 Weighted (5-fold CV)
+    - type: f1
+      value: 0.0
+      name: F1 Macro (5-fold CV)
 ---
+# FALCON — Finetuned Actor Linking via CONtrastive Learning
+<p align="center">
+  <strong>A domain-adapted embedding model for automated APT group attribution from cyber threat intelligence text.</strong>
+</p>
+| | |
+|---|---|
+| **Developed by** | AIT — Austrian Institute of Technology, Cybersecurity Group |
+| **Model type** | Transformer encoder (ModernBERT) with contrastive fine-tuning |
+| **Language** | English |
+| **License** | Apache 2.0 |
+| **Base model** | [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base) |
+| **Paper** | *Coming soon* |
+---
+## Model Description
+FALCON (**F**inetuned **A**ctor **L**inking via **CON**trastive learning) is a cybersecurity embedding model that maps textual descriptions of attack behaviors to a vector space where descriptions belonging to the same APT group are close together and descriptions from different groups are far apart.
+Given a sentence like *"The group has used spearphishing emails with malicious macro-enabled attachments to deliver initial payloads"*, FALCON produces a 768-dimensional embedding that can be used to classify which APT group performed that behavior.
+### Training Pipeline
+```
+cisco-ai/SecureBERT2.0-base (ModernBERT, 150M params)
+        ↓
+   Tokenizer Extension — Added APT group names + aliases as single tokens
+        ↓
+   MLM Fine-Tuning — Taught the model meaningful representations for new tokens
+        ↓
+   Supervised Contrastive Fine-Tuning (SupCon) — Shaped the embedding space
+        so same-group descriptions cluster together
+        ↓
+   FALCON
+```
+### What Makes FALCON Different
+- **Domain-adapted base**: Built on SecureBERT 2.0, which already understands cybersecurity terminology, rather than a generic language model.
+- **Contrastive objective**: Unlike classification-only models, FALCON optimizes the embedding geometry directly using Supervised Contrastive Loss (Khosla et al., 2020), producing embeddings suitable for retrieval, clustering, and few-shot classification.
+- **Name-agnostic**: Group names are masked during contrastive training with `[MASK]`, forcing the model to learn behavioral patterns rather than memorizing name co-occurrences.
+- **Alias-aware tokenizer**: APT group names and their vendor-specific aliases (e.g., APT29, Cozy Bear, Midnight Blizzard, NOBELIUM) are single tokens, preventing subword fragmentation.
+---
+## Intended Uses
 ### Direct Use
+- **APT group attribution**: Given a behavioral description from a CTI report, classify which threat actor is most likely responsible.
+- **Semantic search over CTI**: Retrieve the most relevant threat actor profiles given a description of observed attack behavior.
+- **Threat actor clustering**: Group unlabeled incident descriptions by behavioral similarity.
+- **Few-shot attribution**: Attribute newly emerging APT groups with very few reference samples.
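One simple way to approach the few-shot case is nearest-centroid matching: average the handful of reference embeddings available for each group and rank groups by cosine similarity to a new description. This is an assumption about how the embeddings might be used, not a method stated by the card. The function names are hypothetical, and embeddings are plain lists of floats so the sketch has no dependencies.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_centroids(reference_embeddings, reference_labels):
    """Average the few available reference embeddings of each group."""
    grouped = defaultdict(list)
    for embedding, label in zip(reference_embeddings, reference_labels):
        grouped[label].append(embedding)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def rank_groups(query_embedding, centroids):
    """Return (group, similarity) pairs, most similar group first."""
    scores = [
        (label, cosine(query_embedding, centroid))
        for label, centroid in centroids.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

In use, the reference embeddings would come from the model's mean-pooled output, and the ranked list would be treated as a set of candidate groups for an analyst to review rather than a final answer.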
+### Downstream Use
+- Fine-tuning for organization-specific threat actor taxonomies.
+- Integration into SIEM/SOAR pipelines for automated triage.
+- Enrichment of threat intelligence platforms with behavioral similarity scoring.
 ### Out-of-Scope Use
+- Attribution based on IOCs (hashes, IPs, domains) — FALCON operates on natural language text only.
+- Real-time network traffic classification.
+- Definitive legal or geopolitical attribution — FALCON is a decision-support tool, not an oracle.
+---
+## How to Use
+### Feature Extraction (Embeddings)
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+model = AutoModel.from_pretrained("ait-cybersec/FALCON")
+tokenizer = AutoTokenizer.from_pretrained("ait-cybersec/FALCON")
+def get_embedding(text):
+    """Return a mean-pooled embedding as a NumPy array of shape (1, 768)."""
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
+    with torch.no_grad():
+        outputs = model(**inputs)
+    # Mean pooling (recommended): average token embeddings, ignoring padding positions
+    token_embs = outputs.last_hidden_state
+    attention_mask = inputs["attention_mask"].unsqueeze(-1).to(token_embs.dtype)
+    pooled = (token_embs * attention_mask).sum(dim=1) / attention_mask.sum(dim=1)
+    return pooled.numpy()
+embedding = get_embedding("The group used PowerShell scripts to download and execute additional payloads.")
+print(f"Embedding shape: {embedding.shape}")  # (1, 768)
+```
+### APT Group Classification (with sklearn probe)
+```python
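+import numpy as np
+from sklearn.linear_model import LogisticRegression
+# Assumes get_embedding(text) wraps the feature-extraction snippet above, and that
+# train_texts, train_labels, and test_texts hold your own labeled corpus.
+def encode(texts):
+    # Stack per-text embeddings into a 2-D (n_samples, 768) matrix, as sklearn expects
+    return np.vstack([np.asarray(get_embedding(text)).reshape(1, -1) for text in texts])
+train_embeddings = encode(train_texts)
+test_embeddings = encode(test_texts)
+clf = LogisticRegression(max_iter=2000)
+clf.fit(train_embeddings, train_labels)
+predictions = clf.predict(test_embeddings)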
+```
+### Semantic Similarity Between Descriptions
+```python
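+import numpy as np
+from sklearn.metrics.pairwise import cosine_similarity
+# Assumes get_embedding(text) wraps the feature-extraction snippet above.
+# cosine_similarity needs 2-D inputs, so each embedding is reshaped to (1, 768).
+def embed(text):
+    return np.asarray(get_embedding(text)).reshape(1, -1)
+emb1 = embed("The actor used spearphishing with malicious attachments.")
+emb2 = embed("The group sent phishing emails containing weaponized documents.")
+emb3 = embed("The adversary exploited a SQL injection vulnerability.")
+print(f"Phishing vs Phishing: {cosine_similarity(emb1, emb2)[0][0]:.4f}")  # expected to be higher
+print(f"Phishing vs SQLi:     {cosine_similarity(emb1, emb3)[0][0]:.4f}")  # expected to be lower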
+```
+---
+## Training Details
+### Training Data
+- **Source**: [MITRE ATT&CK Enterprise Groups](https://attack.mitre.org/groups/) — technique usage descriptions for all tracked APT groups.
+- **Preprocessing**:
+  - Canonicalized group aliases using `GroupID` (e.g., APT29 = Cozy Bear = Midnight Blizzard → single label).
+  - Filtered to groups with ≥30 unique technique usage descriptions.
+  - Masked all group names and aliases in training text with `[MASK]` to prevent name leakage.
+- **Final dataset**: ~144 unique APT groups, variable samples per group (30–200+).
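A minimal sketch of the three preprocessing steps listed above (alias canonicalization, the minimum-sample filter, and name masking), using only the Python standard library. The small alias table, the function names `mask_group_names` and `build_dataset`, and the regex-based matching are illustrative assumptions, not the authors' pipeline; in practice the alias table would be derived from the ATT&CK group data.

```python
import re
from collections import defaultdict

# Illustrative alias table: every known name maps to one canonical GroupID.
ALIAS_TO_GROUP_ID = {
    "APT29": "G0016",
    "Cozy Bear": "G0016",
    "Midnight Blizzard": "G0016",
    "NOBELIUM": "G0016",
    "Kimsuky": "G0094",
}

# Longest names first so multi-word aliases win over shorter overlapping ones.
_NAMES = sorted(ALIAS_TO_GROUP_ID, key=len, reverse=True)
_NAME_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in _NAMES) + r")\b",
    flags=re.IGNORECASE,
)


def mask_group_names(text, mask_token="[MASK]"):
    """Replace every known group name or alias with the mask token."""
    return _NAME_PATTERN.sub(mask_token, text)


def build_dataset(records, min_samples=30):
    """Canonicalize labels, deduplicate, drop small groups, and mask names.

    `records` is an iterable of (group_name_or_alias, description) pairs.
    Returns a list of (group_id, masked_description) pairs.
    """
    by_group = defaultdict(set)  # a set keeps only unique descriptions
    for name, description in records:
        group_id = ALIAS_TO_GROUP_ID.get(name)
        if group_id is not None:
            by_group[group_id].add(description)

    dataset = []
    for group_id, descriptions in sorted(by_group.items()):
        if len(descriptions) >= min_samples:
            for description in sorted(descriptions):
                dataset.append((group_id, mask_group_names(description)))
    return dataset
```

Counting unique descriptions before masking mirrors the "≥30 unique technique usage descriptions" filter; masking afterwards keeps the labels intact while removing the names from the text.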
+### Training Procedure
+#### Stage 1: Tokenizer Extension
+Extended the SecureBERT 2.0 tokenizer with APT group names and vendor-specific aliases as single tokens. This prevents names like "Kimsuky" from being split into subword fragments (illustratively, `['Kim', '##su', '##ky']` → `['Kimsuky']`; the exact fragments depend on the base tokenizer).
+#### Stage 2: Masked Language Modeling (MLM)
+| Hyperparameter | Value |
+|---|---|
+| Base model | cisco-ai/SecureBERT2.0-base |
+| Objective | MLM (15% masking probability) |
+| Learning rate | 2e-5 |
+| Batch size | 16 |
+| Epochs | 10 |
+| Weight decay | 0.01 |
+| Warmup ratio | 0.1 |
+| Max sequence length | 128 |
+| Text used | Unmasked (model sees group names to learn their embeddings) |
+#### Stage 3: Supervised Contrastive Learning (SupCon)
+| Hyperparameter | Value |
+|---|---|
+| Base checkpoint | Stage 2 MLM output |
+| Loss function | Supervised Contrastive Loss (Khosla et al., 2020) |
+| Temperature | 0.07 |
+| Projection head | 768 → 768 (ReLU) → 256 |
+| Unfrozen layers | Last 4 transformer layers + projection head |
+| Learning rate | 2e-5 |
+| Batch size | 64 |
+| Epochs | 15 |
+| Scheduler | Cosine annealing |
+| Gradient clipping | max_norm=1.0 |
+| Text used | Masked (group names replaced with `[MASK]`) |
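For reference, the loss named in the table can be written out in a few lines. The following is a dependency-free sketch of the supervised contrastive loss (the per-anchor average over positives from Khosla et al., 2020) with the temperature of 0.07 from the table. It is not the training code used for FALCON, and the function names are hypothetical; a real training loop would compute the same quantity on batched tensors from the projection head.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supcon_loss(embeddings, labels, temperature=0.07):
    """Supervised contrastive loss over one batch.

    Averages over anchors that have at least one positive (a different
    sample with the same label) in the batch.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarity to every other sample in the batch.
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        max_logit = max(logits.values())
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits.values())
        )
        total += -sum(logits[p] - log_denominator for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

The loss is small when samples sharing a label are each other's nearest neighbours and large when they are not, which is the geometry the card describes for same-group descriptions.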
+---
+## Evaluation
+Evaluation uses a **linear probing protocol**: freeze the model, extract embeddings, train a LogisticRegression classifier on top, and report metrics using **5-fold stratified cross-validation** with oversampling applied only to the training fold (no data leakage).
 ### Results
+<!-- UPDATE THESE WITH YOUR ACTUAL RESULTS -->
+| Model | Accuracy | F1 Weighted | F1 Macro |
+|---|---|---|---|
+| SecureBERT 2.0 (frozen baseline, CLS) | — | — | — |
+| SecureBERT 2.0 (frozen baseline, Mean) | — | — | — |
+| FALCON-base (MLM only) | — | — | — |
+| **FALCON (MLM + Contrastive)** | **—** | **—** | **—** |
+*Fill in after training completes.*
+### Evaluation Protocol Details
+- **No data leakage**: Oversampling is applied inside each training fold only; test folds contain only original, unique samples.
+- **Name masking**: All group names and aliases are replaced with `[MASK]` in evaluation text, ensuring the model is evaluated on behavioral understanding, not name recognition.
+- **Canonicalization**: All vendor-specific aliases are resolved to a single canonical label per `GroupID`, preventing inflated metrics from alias splits.
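The leakage-free part of the protocol can be illustrated with a short standard-library sketch: samples are split into stratified folds first, and only the training indices of each split are oversampled. Random duplication is assumed as the oversampling method, which the card does not specify, and the function names are hypothetical. With scikit-learn, `StratifiedKFold` would typically provide the split instead of the hand-written `stratified_folds`.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample index to a fold, spreading every class across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


def oversample(indices, labels, seed=0):
    """Duplicate minority-class indices at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index in indices:
        by_class[labels[index]].append(index)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for label in sorted(by_class):
        members = by_class[label]
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def leakage_free_splits(labels, n_splits=5, seed=0):
    """Yield (train_indices, test_indices); only the training side is oversampled."""
    folds = stratified_folds(labels, n_splits, seed)
    for k, test_indices in enumerate(folds):
        train_indices = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield oversample(train_indices, labels, seed), sorted(test_indices)
```

Because duplication happens after the split, a duplicated sample can never appear in both the training and the test side of the same fold.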
+---
+## Comparison with Related Models
+| Model | Domain | Architecture | Training Objective | Cybersecurity-Specific |
+|---|---|---|---|---|
+| BERT base | General | BERT | MLM + NSP | ❌ |
+| SecBERT | Cybersecurity | BERT | MLM | ✅ |
+| SecureBERT | Cybersecurity | RoBERTa | MLM (custom tokenizer) | ✅ |
+| ATTACK-BERT | Cybersecurity | Sentence-BERT | Sentence similarity | ✅ |
+| SecureBERT 2.0 | Cybersecurity | ModernBERT | MLM (text + code) | ✅ |
+| **FALCON** | **APT Attribution** | **ModernBERT** | **MLM + SupCon** | **✅ (task-specific)** |
+---
+## Limitations and Bias
+- **Training data bias**: MITRE ATT&CK over-represents well-documented state-sponsored groups (APT28, APT29, Lazarus). Less-known actors may have weaker representations.
+- **Behavioral overlap**: Many APT groups share identical TTPs (e.g., spearphishing, PowerShell usage). The model cannot reliably distinguish groups that employ the same techniques in the same way.
+- **English only**: The model is trained on English-language CTI text and will not perform well on non-English threat reports.
+- **Static knowledge**: The model reflects the MITRE ATT&CK knowledge base at training time and does not update as new groups or techniques emerge.
+- **Not a replacement for analyst judgment**: FALCON is a decision-support tool. Attribution conclusions should always be validated by human analysts.
+---
+## Ethical Considerations
+Automated threat attribution is a sensitive capability with potential for misuse. Incorrect attribution could lead to misguided defensive actions or geopolitical consequences. Users should:
+- Always treat model outputs as **hypotheses**, not conclusions.
+- Combine FALCON outputs with additional intelligence sources (IOCs, infrastructure analysis, geopolitical context).
+- Be aware that threat actors deliberately employ false-flag operations to mislead attribution.
+---
+## Citation
+```bibtex
+@misc{falcon2025,
+  title={FALCON: Finetuned Actor Linking via Contrastive Learning for APT Group Attribution},
+  author={AIT Austrian Institute of Technology, Cybersecurity Group},
+  year={2025},
+  url={https://huggingface.co/ait-cybersec/FALCON}
+}
+```
+### Related Work
+- Aghaei, E. et al. "SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence." arXiv:2510.00240 (2025).
+- Khosla, P. et al. "Supervised Contrastive Learning." NeurIPS (2020).
+- Irfan, S. et al. "A Comprehensive Survey of APT Attribution." arXiv:2409.11415 (2024).
+- Abdeen, B. et al. "SMET: Semantic Mapping of CVE to ATT&CK." (2023).
+---
+## Model Card Authors
+AIT — Austrian Institute of Technology, Cybersecurity Group
 ## Model Card Contact
+For inquiries, please open an issue on this repository or contact the AIT Cybersecurity Group.