# SecretMask MoE Gating Network

Lightweight learned gating network for SecretMask Mixture-of-Experts routing.

This repository contains a trained 12KB neural network that learns optimal routing between two secret-detection expert models. Use it for true MoE inference with weighted expert combination.
## Overview

The gating network is a tiny 3-layer MLP (3,042 parameters) that:

- Takes 10 features extracted from text
- Outputs routing weights `[w_fast, w_long]` (sum to 1.0)
- Enables weighted combination of expert model outputs
**Training Results:**

- 100% validation accuracy (200 examples)
- 92.7% test accuracy (600 examples)
- Only 0.19ms inference overhead
- Matches heuristic routing performance
**Note:** This gating network is optional and experimental. Heuristic (rule-based) routing achieves identical results (92.7% accuracy) without requiring this model. The recommended production configuration uses the Fast Expert + Filters, without learned routing or the Long Expert. This gate is primarily for learning and experimentation with MoE architectures. See the Configuration Guide for details.
## Quick Start

### Installation

```bash
pip install torch transformers huggingface-hub
```

### Download and Use
```python
import torch
from huggingface_hub import hf_hub_download
from moe_gate import GatingNetwork, extract_features_tensor

# Download gating network
gate_path = hf_hub_download("andrewandrewsen/secretmask-gate", "best_gate.pt")

# Load model
gate = GatingNetwork.load(gate_path)
gate.eval()

# Extract features from text
text = "AWS key: AKIAIOSFODNN7EXAMPLE"
features = extract_features_tensor(text)

# Get routing weights
with torch.no_grad():
    weights = gate(features.unsqueeze(0))

print(f"Fast expert weight: {weights[0][0]:.3f}")
print(f"Long expert weight: {weights[0][1]:.3f}")
# Output: Fast expert weight: 0.950, Long expert weight: 0.050
```
### Integration with SecretMask

```bash
# Clone SecretMask repository
git clone https://github.com/andrewandrewsen/secmask.git
cd secmask

# Run inference with learned MoE routing
python infer_moe.py \
  --text "My AWS key is AKIAIOSFODNN7EXAMPLE" \
  --routing-mode learned \
  --fast-model andrewandrewsen/distilbert-secret-masker \
  --long-model andrewandrewsen/longformer-secret-masker \
  --gate-model andrewandrewsen/secretmask-gate \
  --tau 0.80
```
## Model Architecture

```text
Input: [10 features]
        ↓
Linear(10 → 64) + LayerNorm + ReLU + Dropout(0.1)
        ↓
Linear(64 → 32) + LayerNorm + ReLU + Dropout(0.1)
        ↓
Linear(32 → 2) + Softmax
        ↓
Output: [w_fast, w_long] (sum = 1.0)
```

- **Total Parameters:** 3,042
- **Model Size:** 12KB (float32)
- **Inference Time:** ~0.19ms on CPU
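For reference, a minimal PyTorch sketch matching this diagram and parameter count; the real `GatingNetwork` class ships with the SecretMask repository, and the class and attribute names below are illustrative:

```python
import torch
import torch.nn as nn

class GatingNetworkSketch(nn.Module):
    """Illustrative re-creation of the 3,042-parameter gate described above."""

    def __init__(self, in_features: int = 10, num_experts: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 64), nn.LayerNorm(64), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(64, 32), nn.LayerNorm(32), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(32, num_experts),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax guarantees the expert weights sum to 1.0.
        return torch.softmax(self.net(x), dim=-1)

# Sanity check: 704 + 128 + 2080 + 64 + 66 = 3,042 parameters.
print(sum(p.numel() for p in GatingNetworkSketch().parameters()))  # 3042
```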
## Input Features (10D)

The gating network takes a normalized 10-dimensional feature vector:

| Index | Feature | Description | Normalization |
|---|---|---|---|
| 0 | `token_count` | Number of tokens | / 1000 |
| 1 | `entropy` | Shannon entropy | / 6 |
| 2 | `has_pem` | Has PEM block (binary) | 0 or 1 |
| 3 | `has_k8s` | Has K8s secret (binary) | 0 or 1 |
| 4 | `akia_count` | AWS pattern count | / 5 |
| 5 | `github_count` | GitHub token count | / 5 |
| 6 | `jwt_count` | JWT token count | / 5 |
| 7 | `base64_count` | Base64 pattern count | / 50 |
| 8 | `line_count` | Number of lines | / 100 |
| 9 | `avg_line_length` | Avg chars per line | / 100 |
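As a rough illustration of how such a vector could be assembled (the canonical implementation is `extract_features_tensor` in `moe_gate`; the regexes, the whitespace token proxy, and the PEM/K8s markers below are simplifying assumptions):

```python
import math
import re
from collections import Counter

import torch

def extract_features_sketch(text: str) -> torch.Tensor:
    """Build the 10-D normalized feature vector described in the table above."""
    lines = text.splitlines() or [""]
    counts = Counter(text)
    total = max(len(text), 1)
    # Shannon entropy over characters, in bits.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    features = [
        len(text.split()) / 1000,                                 # token_count (whitespace proxy)
        entropy / 6,                                              # entropy
        float("-----BEGIN" in text),                              # has_pem
        float("kind: Secret" in text),                            # has_k8s
        len(re.findall(r"AKIA[0-9A-Z]{16}", text)) / 5,           # akia_count
        len(re.findall(r"ghp_[A-Za-z0-9]{36}", text)) / 5,        # github_count
        len(re.findall(r"eyJ[\w-]+\.[\w-]+\.[\w-]+", text)) / 5,  # jwt_count
        len(re.findall(r"[A-Za-z0-9+/]{20,}={0,2}", text)) / 50,  # base64_count
        len(lines) / 100,                                         # line_count
        (sum(len(l) for l in lines) / len(lines)) / 100,          # avg_line_length
    ]
    return torch.tensor(features, dtype=torch.float32)
```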
## Training Details

**Dataset:**

- Training: 6,000 examples
- Validation: 200 examples
- Test: 600 examples

**Configuration** (a minimal training-loop sketch follows this list):

- Optimizer: AdamW (lr=0.001, weight_decay=0.01)
- Scheduler: Cosine annealing
- Batch size: 32
- Epochs: 10
- Device: Apple M-series (MPS)
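A minimal sketch of a training loop under this configuration; the data loader and gate module are stand-ins, and only the optimizer, scheduler, batch size, and device follow the settings above:

```python
import torch
import torch.nn.functional as F

def train_gate(gate, train_loader, epochs: int = 10):
    # Device choice mirrors the Apple M-series (MPS) setup above.
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    gate.to(device)
    optimizer = torch.optim.AdamW(gate.parameters(), lr=1e-3, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for epoch in range(epochs):
        gate.train()
        for features, labels in train_loader:  # batch size 32; labels: 0=fast, 1=long
            features, labels = features.to(device), labels.to(device)
            weights = gate(features)  # softmax probabilities over the two experts
            # NLL over log-probabilities, since the gate already applies softmax.
            loss = F.nll_loss(torch.log(weights + 1e-8), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```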
**Training Results:**

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|---|---|---|---|---|
| 1 | 0.0808 | 97.6% | 0.0051 | 100% |
| 2 | 0.0036 | 100% | 0.0010 | 100% |
| 10 | 0.0005 | 100% | 0.0001 | 100% |
**Test Performance:**

- Routing accuracy: 92.7%
- Fast expert: 92.7% of examples
- Long expert: 7.3% of examples
- Matches heuristic routing exactly
## Usage with Expert Models

This gating network coordinates two expert models:

| Expert | Model | Size | Max Tokens | Use Case |
|---|---|---|---|---|
| Fast | `andrewandrewsen/distilbert-secret-masker` | 265MB | 512 | Short texts, code snippets |
| Long | `andrewandrewsen/longformer-secret-masker` | 592MB | 2048 | Long documents, config files |
### How It Works

```python
# 1. Extract features
features = extract_features_tensor(text)

# 2. Get routing weights from gating network
weights = gate(features)  # [w_fast, w_long]

# 3. Run both expert models
fast_output = fast_expert(text)
long_output = long_expert(text)

# 4. Combine outputs using learned weights
final_output = weights[0] * fast_output + weights[1] * long_output
```
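Concretely, step 4 assumes the two experts produce aligned per-token score tensors; a hedged sketch of the combination, where the shapes, label set, and helper name are illustrative:

```python
import torch

def combine_expert_logits(weights: torch.Tensor,
                          fast_logits: torch.Tensor,
                          long_logits: torch.Tensor) -> torch.Tensor:
    """Blend per-token logits of shape [seq_len, num_labels] from both experts.

    Assumes both experts were run on the same text and their outputs were
    aligned to a common tokenization (the SecretMask repo handles that step).
    """
    w_fast, w_long = weights[0], weights[1]
    return w_fast * fast_logits + w_long * long_logits

# Example: a 0.95/0.05 split over 8 tokens and 3 labels (O, B-SECRET, I-SECRET).
blended = combine_expert_logits(torch.tensor([0.95, 0.05]),
                                torch.randn(8, 3), torch.randn(8, 3))
labels = blended.argmax(dim=-1)  # final per-token predictions
```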
## Files in This Repository

- `best_gate.pt` - Trained gating network (12KB)
- `final_gate.pt` - Final checkpoint (12KB)
- `history.json` - Training history (3.2KB)
- `README.md` - This file
## Technical Details

### Load Balancing

The model was trained with a load-balancing loss to encourage uniform expert usage:

```python
import torch
import torch.nn.functional as F

target_distribution = torch.tensor([0.5, 0.5])  # 50% fast, 50% long
actual_distribution = weights.mean(dim=0)       # average gate output over the batch
load_balance_loss = 0.01 * F.mse_loss(actual_distribution, target_distribution)
```

Despite this, the model learned to route 90.5% to the fast expert and 9.5% to the long expert, matching the natural data distribution.
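Combining this penalty with the routing objective, the total training loss plausibly looks like the following sketch; the 0.01 factor is as stated above, while the NLL routing term is an assumption consistent with the softmax gate:

```python
import torch
import torch.nn.functional as F

def total_loss(weights: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Routing loss plus the 0.01-weighted load-balancing term described above."""
    routing_loss = F.nll_loss(torch.log(weights + 1e-8), labels)
    # Uniform target distribution over the experts (0.5 each for two experts).
    target = torch.full_like(weights.mean(dim=0), 1.0 / weights.shape[1])
    balance_loss = F.mse_loss(weights.mean(dim=0), target)
    return routing_loss + 0.01 * balance_loss
```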
### Routing Metrics

```python
from moe_gate import compute_routing_metrics

weights = gate(features)
metrics = compute_routing_metrics(weights)
# Returns:
# {
#     'fast_expert_pct': 92.7,
#     'long_expert_pct': 7.3,
#     'avg_fast_weight': 0.924,
#     'avg_long_weight': 0.076,
#     'entropy': 0.031
# }
```

Low entropy (0.031) indicates confident routing decisions.
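For intuition, such metrics could be derived from a batch of gate outputs roughly like this; a sketch only, as the authoritative implementation is `compute_routing_metrics` in `moe_gate`:

```python
import torch

def routing_metrics_sketch(weights: torch.Tensor) -> dict:
    """weights: [batch, 2] softmax outputs of the gate."""
    hard_choice = weights.argmax(dim=-1)  # 0 = fast, 1 = long
    mean_w = weights.mean(dim=0)
    # Mean per-example entropy; near 0 means near-one-hot (confident) routing.
    entropy = -(weights * torch.log(weights + 1e-8)).sum(dim=-1).mean()
    return {
        "fast_expert_pct": 100.0 * (hard_choice == 0).float().mean().item(),
        "long_expert_pct": 100.0 * (hard_choice == 1).float().mean().item(),
        "avg_fast_weight": mean_w[0].item(),
        "avg_long_weight": mean_w[1].item(),
        "entropy": entropy.item(),
    }
```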
## Heuristic vs Learned Routing

| Metric | Heuristic | Learned MoE |
|---|---|---|
| Routing Accuracy | 92.7% | 92.7% |
| Model Size | 0KB (rules only) | 12KB |
| Latency | 0.065ms | 0.256ms |
| Training Required | No | Yes (10 epochs) |
| Explainability | High (if-else rules) | Medium (learned weights) |
| Adaptability | Manual updates | Data-driven |

**Recommendation:** Use heuristic routing for simplicity and explainability. Use learned routing when you want to fine-tune on your specific data distribution.
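For comparison, a heuristic router can be as simple as a few if-else rules. The exact rules live in the SecretMask repository; the thresholds and markers below are illustrative guesses based on the experts' 512/2048-token limits:

```python
def heuristic_route(text: str) -> str:
    """Return 'fast' or 'long' using simple if-else rules (illustrative only)."""
    approx_tokens = len(text.split())
    if approx_tokens > 400:   # likely too long for the 512-token fast expert
        return "long"
    if "-----BEGIN" in text:  # PEM blocks tend to be long and multi-line
        return "long"
    return "fast"
```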
## Citation

If you use this model, please cite:

```bibtex
@misc{secretmask-gate,
  author    = {Anders Andersson},
  title     = {SecretMask MoE Gating Network},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/andrewandrewsen/secretmask-gate}
}
```
## License

MIT License - see LICENSE file.

**Note:** This model is trained to work with the SecretMask expert models, which use Apache 2.0 licensed base models (DistilBERT, Longformer). See the expert model repositories for full licensing details.
## Related Resources

- SecretMask MoE Repository: GitHub
- Fast Expert Model: `andrewandrewsen/distilbert-secret-masker`
- Long Expert Model: `andrewandrewsen/longformer-secret-masker`
- Documentation: See repository for BENCHMARKS.md, USE_CASES.md, etc.
## Contributing

Issues and pull requests welcome at GitHub.

Built with ❤️ for the open source community.
## Evaluation Results

- Test Accuracy on SecretMask v2 (self-reported): 0.927
- Validation Accuracy on SecretMask v2 (self-reported): 1.000