# SecretMask MoE Gating Network

Lightweight learned gating network for SecretMask Mixture-of-Experts routing.

This repository contains a trained 12KB neural network that learns optimal routing between two secret-detection expert models. Use it for true MoE inference with weighted expert combination.
## Overview

The gating network is a tiny 3-layer MLP (3,042 parameters) that:

- Takes 10 features extracted from text
- Outputs routing weights `[w_fast, w_long]` (sum to 1.0)
- Enables weighted combination of expert model outputs
**Training Results:**

- 100% validation accuracy (200 examples)
- 92.7% test accuracy (600 examples)
- Only 0.19ms inference overhead
- Matches heuristic routing performance
**Note:** This gating network is optional and experimental. Heuristic (rule-based) routing achieves identical results (92.7% accuracy) without requiring this model. The recommended production configuration uses the Fast Expert + Filters, without learned routing or the Long Expert. This gate is primarily for learning and experimentation with MoE architectures. See the Configuration Guide for details.
## Quick Start

### Installation

```bash
pip install torch transformers huggingface-hub
```

### Download and Use
```python
import torch
from huggingface_hub import hf_hub_download
from moe_gate import GatingNetwork, extract_features_tensor

# Download gating network
gate_path = hf_hub_download("andrewandrewsen/secretmask-gate", "best_gate.pt")

# Load model
gate = GatingNetwork.load(gate_path)
gate.eval()

# Extract features from text
text = "AWS key: AKIAIOSFODNN7EXAMPLE"
features = extract_features_tensor(text)

# Get routing weights
with torch.no_grad():
    weights = gate(features.unsqueeze(0))

print(f"Fast expert weight: {weights[0][0]:.3f}")
print(f"Long expert weight: {weights[0][1]:.3f}")
# Output: Fast expert weight: 0.950, Long expert weight: 0.050
```
### Integration with SecretMask

```bash
# Clone SecretMask repository
git clone https://github.com/andrewandrewsen/secmask.git
cd secmask

# Run inference with learned MoE routing
python infer_moe.py \
  --text "My AWS key is AKIAIOSFODNN7EXAMPLE" \
  --routing-mode learned \
  --fast-model andrewandrewsen/distilbert-secret-masker \
  --long-model andrewandrewsen/longformer-secret-masker \
  --gate-model andrewandrewsen/secretmask-gate \
  --tau 0.80
```
## Model Architecture

```text
Input: [10 features]
        ↓
Linear(10 → 64) + LayerNorm + ReLU + Dropout(0.1)
        ↓
Linear(64 → 32) + LayerNorm + ReLU + Dropout(0.1)
        ↓
Linear(32 → 2) + Softmax
        ↓
Output: [w_fast, w_long] (sum = 1.0)
```

- **Total Parameters:** 3,042
- **Model Size:** 12KB (float32)
- **Inference Time:** ~0.19ms on CPU
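For reference, a minimal PyTorch sketch matching this diagram and parameter count; the real `GatingNetwork` class ships with the SecretMask repository, and the class and attribute names below are illustrative:

```python
import torch
import torch.nn as nn

class GatingNetworkSketch(nn.Module):
    """Illustrative re-creation of the 3,042-parameter gate described above."""

    def __init__(self, in_features: int = 10, num_experts: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 64), nn.LayerNorm(64), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(64, 32), nn.LayerNorm(32), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(32, num_experts),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax guarantees the expert weights sum to 1.0.
        return torch.softmax(self.net(x), dim=-1)

# Sanity check: 704 + 128 + 2080 + 64 + 66 = 3,042 parameters.
print(sum(p.numel() for p in GatingNetworkSketch().parameters()))  # 3042
```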
## Input Features (10D)

The gating network takes a normalized 10-dimensional feature vector:

| Index | Feature | Description | Normalization |
|---|---|---|---|
| 0 | `token_count` | Number of tokens | / 1000 |
| 1 | `entropy` | Shannon entropy | / 6 |
| 2 | `has_pem` | Has PEM block (binary) | 0 or 1 |
| 3 | `has_k8s` | Has K8s secret (binary) | 0 or 1 |
| 4 | `akia_count` | AWS pattern count | / 5 |
| 5 | `github_count` | GitHub token count | / 5 |
| 6 | `jwt_count` | JWT token count | / 5 |
| 7 | `base64_count` | Base64 pattern count | / 50 |
| 8 | `line_count` | Number of lines | / 100 |
| 9 | `avg_line_length` | Avg chars per line | / 100 |
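As a rough illustration of how such a vector could be assembled (the canonical implementation is `extract_features_tensor` in `moe_gate`; the regexes, the whitespace token proxy, and the PEM/K8s markers below are simplifying assumptions):

```python
import math
import re
from collections import Counter

import torch

def extract_features_sketch(text: str) -> torch.Tensor:
    """Build the 10-D normalized feature vector described in the table above."""
    lines = text.splitlines() or [""]
    counts = Counter(text)
    total = max(len(text), 1)
    # Shannon entropy over characters, in bits.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    features = [
        len(text.split()) / 1000,                                 # token_count (whitespace proxy)
        entropy / 6,                                              # entropy
        float("-----BEGIN" in text),                              # has_pem
        float("kind: Secret" in text),                            # has_k8s
        len(re.findall(r"AKIA[0-9A-Z]{16}", text)) / 5,           # akia_count
        len(re.findall(r"ghp_[A-Za-z0-9]{36}", text)) / 5,        # github_count
        len(re.findall(r"eyJ[\w-]+\.[\w-]+\.[\w-]+", text)) / 5,  # jwt_count
        len(re.findall(r"[A-Za-z0-9+/]{20,}={0,2}", text)) / 50,  # base64_count
        len(lines) / 100,                                         # line_count
        (sum(len(l) for l in lines) / len(lines)) / 100,          # avg_line_length
    ]
    return torch.tensor(features, dtype=torch.float32)
```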
## Training Details

**Dataset:**

- Training: 6,000 examples
- Validation: 200 examples
- Test: 600 examples

**Configuration** (a minimal training-loop sketch follows this list):

- Optimizer: AdamW (lr=0.001, weight_decay=0.01)
- Scheduler: Cosine annealing
- Batch size: 32
- Epochs: 10
- Device: Apple M-series (MPS)
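A minimal sketch of a training loop under this configuration; the data loader and gate module are stand-ins, and only the optimizer, scheduler, batch size, and device follow the settings above:

```python
import torch
import torch.nn.functional as F

def train_gate(gate, train_loader, epochs: int = 10):
    # Device choice mirrors the Apple M-series (MPS) setup above.
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    gate.to(device)
    optimizer = torch.optim.AdamW(gate.parameters(), lr=1e-3, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for epoch in range(epochs):
        gate.train()
        for features, labels in train_loader:  # batch size 32; labels: 0=fast, 1=long
            features, labels = features.to(device), labels.to(device)
            weights = gate(features)  # softmax probabilities over the two experts
            # NLL over log-probabilities, since the gate already applies softmax.
            loss = F.nll_loss(torch.log(weights + 1e-8), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```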
**Training Results:**

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|---|---|---|---|---|
| 1 | 0.0808 | 97.6% | 0.0051 | 100% |
| 2 | 0.0036 | 100% | 0.0010 | 100% |
| 10 | 0.0005 | 100% | 0.0001 | 100% |
**Test Performance:**

- Routing accuracy: 92.7%
- Fast expert: 92.7% of examples
- Long expert: 7.3% of examples
- Matches heuristic routing exactly
## Usage with Expert Models

This gating network coordinates two expert models:

| Expert | Model | Size | Max Tokens | Use Case |
|---|---|---|---|---|
| Fast | `andrewandrewsen/distilbert-secret-masker` | 265MB | 512 | Short texts, code snippets |
| Long | `andrewandrewsen/longformer-secret-masker` | 592MB | 2048 | Long documents, config files |
### How It Works

```python
# 1. Extract features
features = extract_features_tensor(text)

# 2. Get routing weights from gating network
weights = gate(features)  # [w_fast, w_long]

# 3. Run both expert models
fast_output = fast_expert(text)
long_output = long_expert(text)

# 4. Combine outputs using learned weights
final_output = weights[0] * fast_output + weights[1] * long_output
```
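Concretely, step 4 assumes the two experts produce aligned per-token score tensors; a hedged sketch of the combination, where the shapes, label set, and helper name are illustrative:

```python
import torch

def combine_expert_logits(weights: torch.Tensor,
                          fast_logits: torch.Tensor,
                          long_logits: torch.Tensor) -> torch.Tensor:
    """Blend per-token logits of shape [seq_len, num_labels] from both experts.

    Assumes both experts were run on the same text and their outputs were
    aligned to a common tokenization (the SecretMask repo handles that step).
    """
    w_fast, w_long = weights[0], weights[1]
    return w_fast * fast_logits + w_long * long_logits

# Example: a 0.95/0.05 split over 8 tokens and 3 labels (O, B-SECRET, I-SECRET).
blended = combine_expert_logits(torch.tensor([0.95, 0.05]),
                                torch.randn(8, 3), torch.randn(8, 3))
labels = blended.argmax(dim=-1)  # final per-token predictions
```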
## Files in This Repository

- `best_gate.pt` - Trained gating network (12KB)
- `final_gate.pt` - Final checkpoint (12KB)
- `history.json` - Training history (3.2KB)
- `README.md` - This file
## Technical Details

### Load Balancing

The model was trained with a load-balancing loss to encourage uniform expert usage:

```python
import torch
import torch.nn.functional as F

target_distribution = torch.tensor([0.5, 0.5])  # 50% fast, 50% long
actual_distribution = weights.mean(dim=0)       # average gate output over the batch
load_balance_loss = 0.01 * F.mse_loss(actual_distribution, target_distribution)
```

Despite this, the model learned to route 90.5% to the fast expert and 9.5% to the long expert, matching the natural data distribution.
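Combining this penalty with the routing objective, the total training loss plausibly looks like the following sketch; the 0.01 factor is as stated above, while the NLL routing term is an assumption consistent with the softmax gate:

```python
import torch
import torch.nn.functional as F

def total_loss(weights: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Routing loss plus the 0.01-weighted load-balancing term described above."""
    routing_loss = F.nll_loss(torch.log(weights + 1e-8), labels)
    # Uniform target distribution over the experts (0.5 each for two experts).
    target = torch.full_like(weights.mean(dim=0), 1.0 / weights.shape[1])
    balance_loss = F.mse_loss(weights.mean(dim=0), target)
    return routing_loss + 0.01 * balance_loss
```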
### Routing Metrics

```python
from moe_gate import compute_routing_metrics

weights = gate(features)
metrics = compute_routing_metrics(weights)
# Returns:
# {
#     'fast_expert_pct': 92.7,
#     'long_expert_pct': 7.3,
#     'avg_fast_weight': 0.924,
#     'avg_long_weight': 0.076,
#     'entropy': 0.031
# }
```

Low entropy (0.031) indicates confident routing decisions.
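For intuition, such metrics could be derived from a batch of gate outputs roughly like this; a sketch only, as the authoritative implementation is `compute_routing_metrics` in `moe_gate`:

```python
import torch

def routing_metrics_sketch(weights: torch.Tensor) -> dict:
    """weights: [batch, 2] softmax outputs of the gate."""
    hard_choice = weights.argmax(dim=-1)  # 0 = fast, 1 = long
    mean_w = weights.mean(dim=0)
    # Mean per-example entropy; near 0 means near-one-hot (confident) routing.
    entropy = -(weights * torch.log(weights + 1e-8)).sum(dim=-1).mean()
    return {
        "fast_expert_pct": 100.0 * (hard_choice == 0).float().mean().item(),
        "long_expert_pct": 100.0 * (hard_choice == 1).float().mean().item(),
        "avg_fast_weight": mean_w[0].item(),
        "avg_long_weight": mean_w[1].item(),
        "entropy": entropy.item(),
    }
```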
## Heuristic vs Learned Routing

| Metric | Heuristic | Learned MoE |
|---|---|---|
| Routing Accuracy | 92.7% | 92.7% |
| Model Size | 0KB (rules only) | 12KB |
| Latency | 0.065ms | 0.256ms |
| Training Required | No | Yes (10 epochs) |
| Explainability | High (if-else rules) | Medium (learned weights) |
| Adaptability | Manual updates | Data-driven |

**Recommendation:** Use heuristic routing for simplicity and explainability. Use learned routing when you want to fine-tune on your specific data distribution.
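For comparison, a heuristic router can be as simple as a few if-else rules. The exact rules live in the SecretMask repository; the thresholds and markers below are illustrative guesses based on the experts' 512/2048-token limits:

```python
def heuristic_route(text: str) -> str:
    """Return 'fast' or 'long' using simple if-else rules (illustrative only)."""
    approx_tokens = len(text.split())
    if approx_tokens > 400:   # likely too long for the 512-token fast expert
        return "long"
    if "-----BEGIN" in text:  # PEM blocks tend to be long and multi-line
        return "long"
    return "fast"
```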
## Citation

If you use this model, please cite:

```bibtex
@misc{secretmask-gate,
  author    = {Anders Andersson},
  title     = {SecretMask MoE Gating Network},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/andrewandrewsen/secretmask-gate}
}
```
## License

MIT License - see LICENSE file.

**Note:** This model is trained to work with the SecretMask expert models, which use Apache 2.0 licensed base models (DistilBERT, Longformer). See the expert model repositories for full licensing details.
## Related Resources

- SecretMask MoE Repository: GitHub
- Fast Expert Model: `andrewandrewsen/distilbert-secret-masker`
- Long Expert Model: `andrewandrewsen/longformer-secret-masker`
- Documentation: See repository for BENCHMARKS.md, USE_CASES.md, etc.
## Contributing

Issues and pull requests welcome at GitHub.

Built with ❤️ for the open source community.
## Evaluation Results

- Test Accuracy on SecretMask v2 (self-reported): 0.927
- Validation Accuracy on SecretMask v2 (self-reported): 1.000