Upload README.md with huggingface_hub
README.md
---
language:
- en
license: mit
tags:
- secret-detection
- mixture-of-experts
- gating-network
- security
- nlp
- token-classification
pipeline_tag: token-classification
library_name: pytorch
datasets:
- custom
metrics:
- accuracy
- f1
model-index:
- name: secretmask-gate
  results:
  - task:
      type: routing
      name: Expert Routing
    dataset:
      name: SecretMask v2
      type: custom
    metrics:
    - type: accuracy
      value: 0.927
      name: Test Accuracy
    - type: accuracy
      value: 1.0
      name: Validation Accuracy
base_model:
- andrewandrewsen/distilbert-secret-masker
- andrewandrewsen/longformer-secret-masker
---

# SecretMask MoE Gating Network

[License: MIT](https://opensource.org/licenses/MIT)
[Model on Hugging Face](https://huggingface.co/andrewandrewsen/secretmask-gate)

**Lightweight learned gating network for SecretMask Mixture-of-Experts routing.**

This repository contains a trained 12KB neural network that learns optimal routing between two secret-detection expert models. Use it for true MoE inference with weighted expert combination.

---

## 📋 Overview

The gating network is a tiny 3-layer MLP (3,042 parameters) that:

1. Takes 10 features extracted from text
2. Outputs routing weights `[w_fast, w_long]` (sum to 1.0)
3. Enables weighted combination of expert model outputs

**Training Results:**

- ✅ 100% validation accuracy (200 examples)
- ✅ 92.7% test accuracy (600 examples)
- ✅ Only 0.19ms inference overhead
- ✅ Matches heuristic routing performance

> **Note**: This gating network is **optional and experimental**. Heuristic (rule-based) routing achieves identical results (92.7% accuracy) without requiring this model. The recommended production configuration uses **Fast Expert + Filters** without learned routing or the Long Expert. This gate is primarily for learning/experimentation with MoE architectures. See the [Configuration Guide](https://github.com/AndrewAndrewsen/secmask/blob/main/CONFIGURATION_GUIDE.md) for details.

---

## 🚀 Quick Start

### Installation

```bash
pip install torch transformers huggingface-hub
```

### Download and Use

The `moe_gate` helpers used below come from the SecretMask repository (see the next section).

```python
import torch
from huggingface_hub import hf_hub_download
from moe_gate import GatingNetwork, extract_features_tensor

# Download the trained gating network from the Hub
gate_path = hf_hub_download("andrewandrewsen/secretmask-gate", "best_gate.pt")

# Load the model and switch to inference mode
gate = GatingNetwork.load(gate_path)
gate.eval()

# Extract the 10-D feature vector from text
text = "AWS key: AKIAIOSFODNN7EXAMPLE"
features = extract_features_tensor(text)

# Get routing weights
with torch.no_grad():
    weights = gate(features.unsqueeze(0))

print(f"Fast expert weight: {weights[0][0]:.3f}")
print(f"Long expert weight: {weights[0][1]:.3f}")
# Output: Fast expert weight: 0.950, Long expert weight: 0.050
```

### Integration with SecretMask

```bash
# Clone the SecretMask repository
git clone https://github.com/andrewandrewsen/secmask.git
cd secmask

# Run inference with learned MoE routing
python infer_moe.py \
  --text "My AWS key is AKIAIOSFODNN7EXAMPLE" \
  --routing-mode learned \
  --fast-model andrewandrewsen/distilbert-secret-masker \
  --long-model andrewandrewsen/longformer-secret-masker \
  --gate-model andrewandrewsen/secretmask-gate \
  --tau 0.80
```

---

## 🏗️ Model Architecture

```
Input: [10 features]
        ↓
Linear(10 → 64) + LayerNorm + ReLU + Dropout(0.1)
        ↓
Linear(64 → 32) + LayerNorm + ReLU + Dropout(0.1)
        ↓
Linear(32 → 2) + Softmax
        ↓
Output: [w_fast, w_long] (sum = 1.0)
```

**Total Parameters:** 3,042
**Model Size:** 12KB (float32)
**Inference Time:** ~0.19ms on CPU
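
For reference, the diagram above can be written down directly in PyTorch. A minimal sketch (the class name is illustrative, not the module shipped in `moe_gate`); note that these layer shapes reproduce the stated 3,042-parameter count exactly:

```python
import torch
import torch.nn as nn

class GateMLPSketch(nn.Module):
    """Illustrative re-creation of the gating MLP described above."""

    def __init__(self, in_dim: int = 10, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.LayerNorm(64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 32), nn.LayerNorm(32), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(32, 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax guarantees [w_fast, w_long] sums to 1.0
        return torch.softmax(self.net(x), dim=-1)

# (10*64 + 64) + 2*64 + (64*32 + 32) + 2*32 + (32*2 + 2) = 3,042 parameters
print(sum(p.numel() for p in GateMLPSketch().parameters()))  # -> 3042
```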

---

## 📊 Input Features (10D)

The gating network takes a normalized 10-dimensional feature vector (a sketch of the extraction follows the table):

| Index | Feature           | Description             | Normalization |
| ----- | ----------------- | ----------------------- | ------------- |
| 0     | `token_count`     | Number of tokens        | / 1000        |
| 1     | `entropy`         | Shannon entropy         | / 6           |
| 2     | `has_pem`         | Has PEM block (binary)  | 0 or 1        |
| 3     | `has_k8s`         | Has K8s secret (binary) | 0 or 1        |
| 4     | `akia_count`      | AWS pattern count       | / 5           |
| 5     | `github_count`    | GitHub token count      | / 5           |
| 6     | `jwt_count`       | JWT token count         | / 5           |
| 7     | `base64_count`    | Base64 pattern count    | / 50          |
| 8     | `line_count`      | Number of lines         | / 100         |
| 9     | `avg_line_length` | Avg chars per line      | / 100         |
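
To make the table concrete, here is a hedged sketch of how such a feature vector could be assembled. The real implementation is `moe_gate.extract_features_tensor`; the regexes and detection checks below are illustrative stand-ins, not the shipped patterns:

```python
import math
import re
from collections import Counter

import torch

def extract_features_sketch(text: str) -> torch.Tensor:
    """Illustrative 10-D feature vector using the normalizations from the table."""
    lines = text.splitlines() or [""]
    tokens = text.split()  # crude stand-in for the real tokenizer

    # Character-level Shannon entropy in bits (divided by 6 per the table)
    total = len(text) or 1
    entropy = -sum((n / total) * math.log2(n / total) for n in Counter(text).values())

    feats = [
        len(tokens) / 1000,                                       # 0: token_count
        entropy / 6,                                              # 1: entropy
        float("-----BEGIN" in text),                              # 2: has_pem
        float("kind: Secret" in text),                            # 3: has_k8s (illustrative check)
        len(re.findall(r"AKIA[0-9A-Z]{16}", text)) / 5,           # 4: akia_count
        len(re.findall(r"ghp_[A-Za-z0-9]{36}", text)) / 5,        # 5: github_count
        len(re.findall(r"eyJ[\w-]+\.[\w-]+\.[\w-]+", text)) / 5,  # 6: jwt_count
        len(re.findall(r"[A-Za-z0-9+/]{20,}={0,2}", text)) / 50,  # 7: base64_count
        len(lines) / 100,                                         # 8: line_count
        sum(len(l) for l in lines) / len(lines) / 100,            # 9: avg_line_length
    ]
    return torch.tensor(feats, dtype=torch.float32)
```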

---

## 📈 Training Details

**Dataset:**

- Training: 6,000 examples
- Validation: 200 examples
- Test: 600 examples

**Configuration** (sketched in code after this list):

- Optimizer: AdamW (lr=0.001, weight_decay=0.01)
- Scheduler: Cosine annealing
- Batch size: 32
- Epochs: 10
- Device: Apple M-series (MPS)
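
The configuration above maps onto a standard PyTorch training setup. A minimal sketch, assuming supervised routing labels (`train_loader` and `GateMLPSketch` are the illustrative pieces from earlier sections, not shipped code):

```python
import torch
from torch import optim
from torch.nn import functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"
gate = GateMLPSketch().to(device)

optimizer = optim.AdamW(gate.parameters(), lr=0.001, weight_decay=0.01)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    for features, labels in train_loader:  # batches of 32: (B, 10) floats, (B,) expert ids
        optimizer.zero_grad()
        weights = gate(features.to(device))
        # The gate outputs probabilities, so train with NLL on their log
        loss = F.nll_loss(torch.log(weights + 1e-8), labels.to(device))
        loss.backward()
        optimizer.step()
    scheduler.step()
```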

**Training Results:**

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc  |
| ----- | ---------- | --------- | -------- | -------- |
| 1     | 0.0808     | 97.6%     | 0.0051   | **100%** |
| 2     | 0.0036     | 100%      | 0.0010   | **100%** |
| 10    | 0.0005     | 100%      | 0.0001   | **100%** |

**Test Performance:**

- Routing accuracy: 92.7%
- Fast expert: 92.7% of examples
- Long expert: 7.3% of examples
- Matches heuristic routing exactly

---

## 🔧 Usage with Expert Models

This gating network coordinates two expert models:

| Expert   | Model                                                                                                        | Size  | Max Tokens | Use Case                      |
| -------- | ------------------------------------------------------------------------------------------------------------ | ----- | ---------- | ----------------------------- |
| **Fast** | [andrewandrewsen/distilbert-secret-masker](https://huggingface.co/andrewandrewsen/distilbert-secret-masker) | 265MB | 512        | Short texts, code snippets    |
| **Long** | [andrewandrewsen/longformer-secret-masker](https://huggingface.co/andrewandrewsen/longformer-secret-masker) | 592MB | 2048       | Long documents, config files  |

### How It Works

```python
# 1. Extract features
features = extract_features_tensor(text)

# 2. Get routing weights from the gating network
weights = gate(features)  # [w_fast, w_long]

# 3. Run both expert models
fast_output = fast_expert(text)
long_output = long_expert(text)

# 4. Combine outputs using the learned weights
final_output = weights[0] * fast_output + weights[1] * long_output
```

Step 4 assumes the two experts' token-level outputs have already been aligned to a common shape (the experts use different tokenizers); see `infer_moe.py` in the SecretMask repository for the full pipeline.

---

## 📦 Files in This Repository

- **`best_gate.pt`** - Trained gating network (12KB)
- **`final_gate.pt`** - Final checkpoint (12KB)
- **`history.json`** - Training history (3.2KB)
- **`README.md`** - This file

---

## 🔬 Technical Details

### Load Balancing

The model was trained with a load-balancing loss to encourage uniform expert usage:

```python
import torch
import torch.nn.functional as F

target_distribution = torch.tensor([0.5, 0.5])  # 50% fast, 50% long
actual_distribution = weights.mean(dim=0)       # batch-averaged routing weights
load_balance_loss = 0.01 * F.mse_loss(actual_distribution, target_distribution)
```

Despite this, the model learned to route 90.5% to the fast expert and 9.5% to the long expert, matching the natural data distribution.

### Routing Metrics

```python
from moe_gate import compute_routing_metrics

weights = gate(features)
metrics = compute_routing_metrics(weights)

# Returns:
# {
#   'fast_expert_pct': 92.7,
#   'long_expert_pct': 7.3,
#   'avg_fast_weight': 0.924,
#   'avg_long_weight': 0.076,
#   'entropy': 0.031
# }
```

Low entropy (0.031) indicates confident routing decisions.
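
These fields can also be reproduced from the raw weight matrix without the helper. A sketch of one plausible reading of the metrics (the shipped `compute_routing_metrics` may differ in details such as the entropy base):

```python
import torch

def routing_metrics_sketch(weights: torch.Tensor) -> dict:
    """weights: (batch, 2) routing weights, one row per example."""
    picks_fast = weights[:, 0] >= 0.5  # hard routing decision per example
    # Per-example entropy of the routing distribution, averaged over the batch
    entropy = -(weights * torch.log(weights + 1e-12)).sum(dim=1).mean()
    return {
        "fast_expert_pct": 100 * picks_fast.float().mean().item(),
        "long_expert_pct": 100 * (~picks_fast).float().mean().item(),
        "avg_fast_weight": weights[:, 0].mean().item(),
        "avg_long_weight": weights[:, 1].mean().item(),
        "entropy": entropy.item(),
    }
```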

---

## 📊 Heuristic vs Learned Routing

| Metric                | Heuristic            | Learned MoE              |
| --------------------- | -------------------- | ------------------------ |
| **Routing Accuracy**  | 92.7%                | 92.7%                    |
| **Model Size**        | 0KB (rules only)     | 12KB                     |
| **Latency**           | 0.065ms              | 0.256ms                  |
| **Training Required** | No                   | Yes (10 epochs)          |
| **Explainability**    | High (if-else rules) | Medium (learned weights) |
| **Adaptability**      | Manual updates       | Data-driven              |

**Recommendation:** Use heuristic routing for simplicity and explainability. Use learned routing when you want to fine-tune on your specific data distribution.
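
For intuition, the heuristic side of this comparison is just a handful of if-else rules over the same features. A hypothetical example of the shape such rules take (the actual rules live in the SecretMask repository):

```python
def heuristic_route(token_count: int, has_pem: bool, line_count: int) -> str:
    """Toy rule set: long or structurally heavy inputs go to the Long expert."""
    if token_count > 512 or has_pem or line_count > 100:
        return "long"  # beyond the Fast expert's 512-token window
    return "fast"
```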

---

## 📝 Citation

If you use this model, please cite:

```bibtex
@misc{secretmask-gate,
  author    = {Anders Andersson},
  title     = {SecretMask MoE Gating Network},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/andrewandrewsen/secretmask-gate}
}
```

---

## 📄 License

MIT License - see [LICENSE](LICENSE) file.

**Note:** This model is trained to work with the SecretMask expert models, which use Apache 2.0 licensed base models (DistilBERT, Longformer). See the expert model repositories for full licensing details.

---

## 🔗 Related Resources

- **SecretMask MoE Repository:** [GitHub](https://github.com/andrewandrewsen/secmask)
- **Fast Expert Model:** [andrewandrewsen/distilbert-secret-masker](https://huggingface.co/andrewandrewsen/distilbert-secret-masker)
- **Long Expert Model:** [andrewandrewsen/longformer-secret-masker](https://huggingface.co/andrewandrewsen/longformer-secret-masker)
- **Documentation:** See the repository for BENCHMARKS.md, USE_CASES.md, etc.

---

## 🤝 Contributing

Issues and pull requests are welcome at [GitHub](https://github.com/andrewandrewsen/secmask).

---

**Built with ❤️ for the open source community**