PleoMorph
/

ctem-g2pm-models

@@ -1,296 +1,119 @@
-# Training Verification Summary
-## Executive Summary
-| Metric | Value | Status |
-|--------|-------|--------|
-| Total Embeddings | 279,304 | All used in training |
-| Labeled Samples | 121,655 | 43.6% of total |
-| Unique Techniques | 122 | MITRE ATT&CK mapped |
-| Validation Accuracy | 99.7% | Verified working |
-| High-Confidence Predictions | 120,464 (43.2%) | >95% confidence |
-| Coverage | 100% | All embeddings processed |
-**VERIFIED: Training used ALL 279,304 embeddings and the dual model approach is working correctly.**
 ---
-## 1. Embedding Data Coverage
-### Source: `COMPLETE_573K_EMBEDDINGS.pkl` (905.6 MB)
-- **Total embeddings**: 279,304 (deduplicated from 573K)
-- **Embedding dimension**: 768 (sentence-transformers)
-- **All normalized**: L2 normalization applied
-### Embedding Sources Breakdown:
-| Source | Count | Percentage |
-|--------|-------|------------|
-| g2pm_nodes | 156,652 | 56.1% |
-| COMPLETE_MASTER | 120,593 | 43.2% |
-| technology_permutations | 1,710 | 0.6% |
-| UNIFIED_MASTER | 349 | 0.1% |
-### Training Data Flow:
-```
-279,304 total embeddings
-    ↓
-149,488 sampled for k-NN graph (53.5%)
-    ↓
-121,655 labeled samples (43.6%)
-    ↓
-746,756 graph edges (k=5, similarity > 0.3)
-    ↓
-30 epochs training
-    ↓
-279,304 pseudo-labels generated (100% coverage)
-```
 ---
-## 2. Model Architecture: Dual Model Approach
-### Model 1: SemiSupervisedG2PM (Primary - Technique Classification)
-**Architecture:**
-```
-Input (768) → Encoder → Classifier (122 classes)
-                     ↘ Projection Head (contrastive learning)
-Encoder:
-  Linear(768, 256) → ReLU → Dropout(0.2)
-  Linear(256, 256) → ReLU → Dropout(0.2)
-Classifier:
-  Linear(256, 122)
-Projection Head:
-  Linear(256, 128) → ReLU → Linear(128, 64) → L2Normalize
-```
-**Performance:**
-- Validation Accuracy: **99.7%**
-- Number of Classes: 122 MITRE ATT&CK techniques
-- Model Size: 2.6 MB
-**Use Case:** Classifying embeddings into attack technique categories
-### Model 2: SpectralG2PM (Secondary - Transition Prediction)
-**Architecture:**
-```
-Input (768 + 256 spectral) → Encoder → G2PM Features
-                                    ↘ Transition Predictor
-Encoder:
-  Linear(1024, 384) → LayerNorm → GELU → Dropout(0.1)
-  Linear(384, 384) → LayerNorm → GELU → Dropout(0.1)
-  MultiheadAttention(8 heads)
-G2PM Encoder:
-  Linear(384, 256) → ReLU → Linear(256, 128)
-Transition Predictor:
-  Linear(256, 256) → ReLU → Linear(256, 128) → ReLU → Linear(128, 1) → Sigmoid
 ```
-**Performance:**
-- Unsupervised Accuracy: 59.1%
-- Techniques Indexed: 621
-- Model Size: 5.7 MB
-**Use Case:** Predicting attack path transitions (technique A → technique B)
----
-## 3. Training Process Verification
-### Labeling Strategy
-Labels were derived from `LABELED_ATTACK_TRAINING_DATA.json`:
-- **147 attack chains** with expert labels
-- **122 unique MITRE ATT&CK techniques**
-Label matching used multiple approaches:
-1. Direct `technique_id` field matching
-2. `technique` field matching
-3. Name-based fuzzy matching (technique ID in name)
-### Training Configuration
 ```python
-{
-    "epochs": 30,
-    "batch_size": 512,
-    "learning_rate": 1e-3,
-    "weight_decay": 1e-4,
-    "optimizer": "AdamW",
-    "scheduler": "CosineAnnealing",
-    "k_neighbors": 5,
-    "similarity_threshold": 0.3,
-    "graph_samples": 149488,
-    "validation_split": 300 samples
-}
 ```
-### Loss Function
-- **Supervised Loss**: Cross-entropy on labeled samples
-- **Contrastive Loss**: Self-supervised similarity learning
-- **Combined**: `loss = supervised + 0.1 * contrastive`
-### Training Curve
-```
-Epoch  1/30: Loss=0.3592, Val Acc=99.3%
-Epoch  6/30: Loss=0.0417, Val Acc=99.7% ← Best
-Epoch 11/30: Loss=0.0308, Val Acc=99.3%
-Epoch 16/30: Loss=0.0212, Val Acc=99.3%
-Epoch 21/30: Loss=0.0159, Val Acc=99.3%
-Epoch 26/30: Loss=0.0136, Val Acc=99.3%
-Epoch 30/30: Loss=0.0129, Val Acc=99.3%
 ```
----
-## 4. Pseudo-Label Generation
-All 279,304 embeddings received predictions:
-| Confidence | Count | Percentage |
-|------------|-------|------------|
-| >95% | 120,464 | 43.2% |
-| >90% | 120,602 | 43.2% |
-| >80% | 121,357 | 43.4% |
-| >70% | 123,214 | 44.1% |
-| >50% | 133,595 | 47.8% |
-| Mean | 58.6% | - |
-### Pseudo-Label Distribution (Top 10)
-| Technique | Count | Percentage |
-|-----------|-------|------------|
-| T1190 (Exploit Public-Facing Application) | 70,757 | 25.3% |
-| T1195 (Supply Chain Compromise) | 52,697 | 18.9% |
-| T1547 (Boot or Logon Autostart Execution) | 8,276 | 3.0% |
-| T1055 (Process Injection) | 8,266 | 3.0% |
-| T1027 (Obfuscated Files or Information) | 8,144 | 2.9% |
-| T1059 (Command and Scripting Interpreter) | 6,966 | 2.5% |
-| T1036 (Masquerading) | 6,331 | 2.3% |
-| T1556 (Modify Authentication Process) | 5,543 | 2.0% |
-| T1003 (OS Credential Dumping) | 5,283 | 1.9% |
-| T1552 (Unsecured Credentials) | 5,266 | 1.9% |
----
-## 5. Model Files
-### Files Ready for Deployment
 | File | Size | Description |
 |------|------|-------------|
-| `backend/models/semi_supervised_99_7_best.pt` | 2.6 MB | Primary classifier (99.7% acc) |
-| `backend/models/semi_supervised_cpu_best.pt` | 2.6 MB | Identical to above |
-| `backend/models/spectral_281k_best.pt` | 5.7 MB | Transition predictor |
-| `backend/models/spectral_281k_results.pkl` | 478.5 MB | Technique embeddings |
-| `backend/models/graphany_category_best.pt` | 3.6 MB | Category classifier (53.8% acc) |
-### Files in .gitignore
-Model files (.pt, .pkl) are excluded from git. They should be:
-1. Uploaded to Hugging Face Hub, or
-2. Stored in cloud storage (S3, GCS), or
-3. Committed to a separate model repository
----
-## 6. Backend Service Integration
-### Service Location
-`backend/services/g2pm_model_service.py`
-### Key Classes
-**G2PMModelService** (lines 124-461):
-- Loads both models automatically
-- Provides `classify_embedding()` for technique prediction
-- Provides `predict_transition_probability()` for attack paths
-- Provides `predict_attack_path()` for multi-step predictions
-**BatchedG2PMModelService** (lines 478-652):
-- Extends G2PMModelService with dynamic batching
-- Supports high-throughput inference
-- Thread-safe batch processing
-### Usage Example
-```python
-from backend.services.g2pm_model_service import get_g2pm_service
-service = get_g2pm_service()
-# Classify an embedding
-predictions = service.classify_embedding(embedding, top_k=5)
-# Returns: [{"technique": "T1190", "confidence": 0.85, "rank": 1}, ...]
-# Predict attack path
-path = service.predict_attack_path("T1566.001", max_steps=5)
-# Returns: [{"technique": "T1566.001", "probability": 1.0, "step": 0}, ...]
-# Get model info
-info = service.get_model_info()
-# Returns: {"classifier_accuracy": "99.7%", "techniques_count": 621, ...}
 ```
----
-## 7. Verification Test Results
-### High-Confidence Sample Test
-- **Samples tested**: 20 (randomly selected >95% confidence)
-- **Accuracy**: 100% (20/20)
-- **Status**: ✓ PASSED
-### Transition Prediction Test
-- **Sample**: T1574.006 → T1574
-- **Predicted probability**: 98.9%
-- **Status**: ✓ PASSED
-### Service Integration Test
-- **SemiSupervisedG2PM loaded**: ✓
-- **SpectralG2PM loaded**: ✓
-- **Technique indices loaded**: 621
-- **Status**: ✓ PASSED
----
-## 8. Known Limitations
-1. **Class Imbalance**: T1190 (25.3%) and T1195 (18.9%) dominate predictions
-2. **Low-Confidence Predictions**: 52.2% of embeddings have <50% confidence
-3. **Technique Coverage**: Model trained on 122 techniques, but MITRE has 700+
-4. **Validation Set Size**: Only 300 samples used for validation (small)
----
-## 9. Recommendations Before Deployment
-1. **Model Storage**: Upload to Hugging Face or cloud storage
-2. **Version Control**: Tag models with version (e.g., v1.0.0-99.7acc)
-3. **Monitoring**: Set up prediction confidence tracking in production
-4. **Fallback**: Use spectral model when classifier confidence < 50%
-5. **Documentation**: Update API docs with new endpoints
----
-## 10. Summary
-| Verification Item | Status |
-|------------------|--------|
-| All embeddings used in training | ✓ VERIFIED (279,304/279,304) |
-| Dual model architecture implemented | ✓ VERIFIED |
-| SemiSupervisedG2PM working | ✓ VERIFIED (99.7% accuracy) |
-| SpectralG2PM working | ✓ VERIFIED (transition predictions) |
-| Backend service integrated | ✓ VERIFIED |
-| Pseudo-labels generated for all | ✓ VERIFIED (100% coverage) |
-**Conclusion**: The training used all available embeddings, the dual model approach is working as intended, and the system is ready for model upload and deployment.
----
-*Generated: 2026-01-24*
-*Model Version: semi_supervised_99_7_best.pt*
-*Validation Accuracy: 99.7%*

 ---
+license: mit
+tags:
+  - attack-path-prediction
+  - graph-neural-networks
+  - cybersecurity
+  - mitre-attack
+  - threat-modeling
+datasets:
+  - custom
+language:
+  - en
+library_name: pytorch
 ---
+# CTEM G2PM Models
+**Graph-to-Path Models for Attack Path Prediction**
+Trained models for the [CTEM Enterprise Platform](https://github.com/LucPlessier/PleoMorphic) - Continuous Threat Exposure Management using Graph Neural Networks.
+## Research Foundation
+Based on [Michael Bronstein's geometric deep learning research](https://arxiv.org/abs/2104.13478) and GraphAny architecture for learning on arbitrary graph structures.
+## Models
+| Model | Accuracy | Parameters | Purpose |
+|-------|----------|------------|---------|
+| `semi_supervised_99_7_best.pt` | **99.7%** | 660K | Technique classification (122 MITRE ATT&CK techniques) |
+| `spectral_281k_best.pt` | 59.1% | 1.5M | Attack path transition prediction |
+| `graphany_category_best.pt` | 53.8% | 950K | Category classification (137 categories) |
+## Training Data
+- **279,304** attack technique embeddings (768-dim, sentence-transformers)
+- **147** expert-labeled attack chains
+- **122** MITRE ATT&CK techniques
+## Usage
+### Download Models
+```bash
+pip install huggingface_hub
+# Download all models
+huggingface-cli download PleoMorph/ctem-g2pm-models --local-dir ./models
 ```
+### Load in Python
 ```python
+import torch
+from huggingface_hub import hf_hub_download
+# Download model
+model_path = hf_hub_download(
+    repo_id="PleoMorph/ctem-g2pm-models",
+    filename="semi_supervised_99_7_best.pt"
+)
+# Load checkpoint
+checkpoint = torch.load(model_path, map_location="cpu")
+print(f"Accuracy: {checkpoint['best_acc']*100:.1f}%")
+print(f"Techniques: {checkpoint['num_classes']}")
+print(f"Technique mapping: {list(checkpoint['technique_to_idx'].keys())[:10]}...")
 ```
+### Model Architecture
+**SemiSupervisedG2PM** (99.7% accuracy):
+```python
+class SemiSupervisedG2PM(nn.Module):
+    def __init__(self, input_dim=768, hidden_dim=256, num_classes=122):
+        self.encoder = nn.Sequential(
+            nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.2),
+            nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.2),
+        )
+        self.classifier = nn.Linear(256, 122)
 ```
+**SpectralG2PM** (transition prediction):
+```python
+class SpectralG2PM(nn.Module):
+    # Spectral graph convolution + transition predictor
+    # Input: embedding (768) + spectral features (256)
+    # Output: transition probability P(A → B)
+```
+## Files
 | File | Size | Description |
 |------|------|-------------|
+| `semi_supervised_99_7_best.pt` | 2.6 MB | Best classifier model |
+| `spectral_281k_best.pt` | 5.7 MB | Transition predictor |
+| `spectral_281k_results.pkl` | 478 MB | G2PM features & technique index |
+| `graphany_category_best.pt` | 3.6 MB | Category classifier |
+| `semi_supervised_cpu_results.pkl` | 3.2 MB | Pseudo-labels & confidences |
+## Related
+- **GitHub**: [CTEM Enterprise Platform](https://github.com/LucPlessier/PleoMorphic)
+- **Documentation**: [Model Architecture](https://github.com/LucPlessier/PleoMorphic/blob/clean-upload/docs/MODELS.md)
+## Citation
+```bibtex
+@software{ctem_g2pm_2025,
+  title={CTEM G2PM: Graph-to-Path Models for Attack Path Prediction},
+  author={PleoMorph},
+  year={2025},
+  url={https://huggingface.co/PleoMorph/ctem-g2pm-models}
+}
 ```
+## License
+MIT License