SecretMask MoE Gating Network


Lightweight learned gating network for SecretMask Mixture-of-Experts routing.

This repository contains a trained 12KB neural network that learns optimal routing between two secret detection expert models. Use this for true MoE inference with weighted expert combination.


πŸ“‹ Overview

The gating network is a tiny 3-layer MLP (3,042 parameters) that:

  1. Takes 10 features extracted from text
  2. Outputs routing weights [w_fast, w_long] (sum to 1.0)
  3. Enables weighted combination of expert model outputs

Training Results:

  • βœ… 100% validation accuracy (200 examples)
  • βœ… 92.7% test accuracy (600 examples)
  • βœ… Only 0.19ms inference overhead
  • βœ… Matches heuristic routing performance

Note: This gating network is optional and experimental. Heuristic (rule-based) routing achieves identical results (92.7% accuracy) without requiring this model. The recommended production configuration uses Fast Expert + Filters without learned routing or the Long Expert. This gate is primarily for learning/experimentation with MoE architectures. See Configuration Guide for details.


πŸš€ Quick Start

Installation

pip install torch transformers huggingface-hub

Download and Use

from huggingface_hub import hf_hub_download
from moe_gate import GatingNetwork, extract_features_tensor

# Download gating network
gate_path = hf_hub_download("andrewandrewsen/secretmask-gate", "best_gate.pt")

# Load model
gate = GatingNetwork.load(gate_path)
gate.eval()

# Extract features from text
text = "AWS key: AKIAIOSFODNN7EXAMPLE"
features = extract_features_tensor(text)

# Get routing weights
import torch
with torch.no_grad():
    weights = gate(features.unsqueeze(0))

print(f"Fast expert weight: {weights[0][0]:.3f}")
print(f"Long expert weight: {weights[0][1]:.3f}")
# Output: Fast expert weight: 0.950, Long expert weight: 0.050

Integration with SecretMask

# Clone SecretMask repository
git clone https://github.com/andrewandrewsen/secmask.git
cd secmask

# Run inference with learned MoE routing
python infer_moe.py \
    --text "My AWS key is AKIAIOSFODNN7EXAMPLE" \
    --routing-mode learned \
    --fast-model andrewandrewsen/distilbert-secret-masker \
    --long-model andrewandrewsen/longformer-secret-masker \
    --gate-model andrewandrewsen/secretmask-gate \
    --tau 0.80

πŸ—οΈ Model Architecture

Input: [10 features]
    ↓
Linear(10 β†’ 64) + LayerNorm + ReLU + Dropout(0.1)
    ↓
Linear(64 β†’ 32) + LayerNorm + ReLU + Dropout(0.1)
    ↓
Linear(32 β†’ 2) + Softmax
    ↓
Output: [w_fast, w_long]  (sum = 1.0)

Total Parameters: 3,042
Model Size: 12KB (float32)
Inference Time: ~0.19ms on CPU
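
For reference, here is a minimal PyTorch sketch of this architecture. The real GatingNetwork class ships with the SecretMask repository; this standalone version is illustrative, though the layer sizes in the diagram do yield exactly 3,042 parameters:

import torch
import torch.nn as nn

class GatingNetworkSketch(nn.Module):
    """Illustrative 3-layer MLP matching the diagram above (3,042 parameters)."""

    def __init__(self, in_dim=10, hidden1=64, hidden2=32, n_experts=2, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden1), nn.LayerNorm(hidden1), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden1, hidden2), nn.LayerNorm(hidden2), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden2, n_experts),
        )

    def forward(self, x):
        # Softmax guarantees the routing weights [w_fast, w_long] sum to 1.0
        return torch.softmax(self.net(x), dim=-1)

# Sanity check: sum(p.numel() for p in GatingNetworkSketch().parameters()) == 3042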


πŸ“Š Input Features (10D)

The gating network takes a normalized 10-dimensional feature vector:

| Index | Feature | Description | Normalization |
|-------|---------|-------------|---------------|
| 0 | token_count | Number of tokens | / 1000 |
| 1 | entropy | Shannon entropy | / 6 |
| 2 | has_pem | Has PEM block (binary) | 0 or 1 |
| 3 | has_k8s | Has K8s secret (binary) | 0 or 1 |
| 4 | akia_count | AWS key pattern count | / 5 |
| 5 | github_count | GitHub token count | / 5 |
| 6 | jwt_count | JWT token count | / 5 |
| 7 | base64_count | Base64 pattern count | / 50 |
| 8 | line_count | Number of lines | / 100 |
| 9 | avg_line_length | Avg chars per line | / 100 |
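
As an illustration of how such a feature vector could be computed, here is a rough sketch. The actual extract_features_tensor lives in the moe_gate module; the regexes, the whitespace token proxy, and the min(..., 1.0) clamping below are assumptions, while the normalization constants follow the table:

import math
import re
from collections import Counter
import torch

def extract_features_sketch(text: str) -> torch.Tensor:
    """Rough 10-D feature vector following the table above (illustrative only)."""
    counts = Counter(text)
    total = max(len(text), 1)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    lines = text.splitlines() or [""]
    features = [
        min(len(text.split()) / 1000, 1.0),                                  # token_count (whitespace proxy)
        min(entropy / 6, 1.0),                                               # Shannon entropy
        float("-----BEGIN" in text),                                         # has_pem
        float("kind: Secret" in text),                                       # has_k8s
        min(len(re.findall(r"AKIA[0-9A-Z]{16}", text)) / 5, 1.0),            # akia_count
        min(len(re.findall(r"ghp_[A-Za-z0-9]{36}", text)) / 5, 1.0),         # github_count
        min(len(re.findall(r"eyJ[\w-]+\.[\w-]+\.[\w-]+", text)) / 5, 1.0),   # jwt_count
        min(len(re.findall(r"[A-Za-z0-9+/]{20,}={0,2}", text)) / 50, 1.0),   # base64_count
        min(len(lines) / 100, 1.0),                                          # line_count
        min(sum(len(l) for l in lines) / len(lines) / 100, 1.0),             # avg_line_length
    ]
    return torch.tensor(features, dtype=torch.float32)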

πŸ“ˆ Training Details

Dataset:

  • Training: 6,000 examples
  • Validation: 200 examples
  • Test: 600 examples

Configuration:

  • Optimizer: AdamW (lr=0.001, weight_decay=0.01)
  • Scheduler: Cosine annealing
  • Batch size: 32
  • Epochs: 10
  • Device: Apple M-series (MPS)
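
A minimal training-loop sketch under this configuration (the dataset tensors X_train / y_train and the GatingNetworkSketch class from the architecture section are assumed; the real training script is in the SecretMask repository):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Assumed tensors: X_train is [N, 10] feature vectors,
# y_train is [N] expert labels (0 = fast, 1 = long)
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)

gate = GatingNetworkSketch()
optimizer = torch.optim.AdamW(gate.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    for features, labels in loader:
        weights = gate(features)  # [B, 2] routing probabilities
        # NLL over log-probabilities, since the gate already applies softmax
        loss = torch.nn.functional.nll_loss(torch.log(weights + 1e-9), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()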

Training Results:

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|-------|------------|-----------|----------|---------|
| 1 | 0.0808 | 97.6% | 0.0051 | 100% |
| 2 | 0.0036 | 100% | 0.0010 | 100% |
| 10 | 0.0005 | 100% | 0.0001 | 100% |

Test Performance:

  • Routing accuracy: 92.7%
  • Fast expert: 92.7% of examples
  • Long expert: 7.3% of examples
  • Matches heuristic routing exactly

πŸ”§ Usage with Expert Models

This gating network coordinates two expert models:

| Expert | Model | Size | Max Tokens | Use Case |
|--------|-------|------|------------|----------|
| Fast | andrewandrewsen/distilbert-secret-masker | 265MB | 512 | Short texts, code snippets |
| Long | andrewandrewsen/longformer-secret-masker | 592MB | 2048 | Long documents, config files |

How It Works

# 1. Extract features
features = extract_features_tensor(text)

# 2. Get routing weights from the gating network (batch of one)
with torch.no_grad():
    w_fast, w_long = gate(features.unsqueeze(0))[0]

# 3. Run both expert models
fast_output = fast_expert(text)
long_output = long_expert(text)

# 4. Combine the outputs using the learned weights
final_output = w_fast * fast_output + w_long * long_output

πŸ“¦ Files in This Repository

  • best_gate.pt - Trained gating network (12KB)
  • final_gate.pt - Final checkpoint (12KB)
  • history.json - Training history (3.2KB)
  • README.md - This file

πŸ”¬ Technical Details

Load Balancing

The model was trained with a load balancing loss to encourage uniform expert usage:

import torch.nn.functional as F

target_distribution = torch.tensor([0.5, 0.5])    # 50% fast, 50% long
actual_distribution = weights.mean(dim=0)         # batch-average routing weights
load_balance_loss = 0.01 * F.mse_loss(actual_distribution, target_distribution)

Despite this, the model learned to route 90.5% of examples to the fast expert and 9.5% to the long expert, matching the natural data distribution.

Routing Metrics

from moe_gate import compute_routing_metrics

weights = gate(features)
metrics = compute_routing_metrics(weights)

# Returns:
# {
#   'fast_expert_pct': 92.7,
#   'long_expert_pct': 7.3,
#   'avg_fast_weight': 0.924,
#   'avg_long_weight': 0.076,
#   'entropy': 0.031
# }

Low entropy (0.031) indicates confident routing decisions.
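
The entropy figure is the batch-mean Shannon entropy of the routing distribution; a quick way to reproduce it (the natural-log convention here is an assumption):

import torch

def routing_entropy(weights: torch.Tensor) -> float:
    """Mean entropy of per-example routing distributions, shape [B, 2]."""
    return -(weights * torch.log(weights + 1e-9)).sum(dim=-1).mean().item()

# A near-one-hot routing is nearly zero-entropy:
routing_entropy(torch.tensor([[0.99, 0.01]]))  # ~0.056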


πŸ†š Heuristic vs Learned Routing

| Metric | Heuristic | Learned MoE |
|--------|-----------|-------------|
| Routing Accuracy | 92.7% | 92.7% |
| Model Size | 0KB (rules only) | 12KB |
| Latency | 0.065ms | 0.256ms |
| Training Required | No | Yes (10 epochs) |
| Explainability | High (if-else rules) | Medium (learned weights) |
| Adaptability | Manual updates | Data-driven |

Recommendation: Use heuristic routing for simplicity and explainability. Use learned routing when you want to fine-tune on your specific data distribution.
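
For a sense of what heuristic routing means here, a plausible if-else sketch over the same signals (the actual rules live in the SecretMask repository; the thresholds below are assumptions):

def heuristic_route(text: str) -> str:
    """Rule-based routing: long expert for long or heavily structured inputs."""
    if len(text.split()) > 512:   # beyond the fast expert's 512-token window
        return "long"
    if "-----BEGIN" in text:      # PEM blocks are long, structured secrets
        return "long"
    return "fast"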


πŸ“š Citation

If you use this model, please cite:

@misc{secretmask-gate,
  author = {Anders Andersson},
  title = {SecretMask MoE Gating Network},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/andrewandrewsen/secretmask-gate}
}

πŸ“„ License

MIT License - see LICENSE file.

Note: This model is trained to work with the SecretMask expert models, which use Apache 2.0 licensed base models (DistilBERT, Longformer). See the expert model repositories for full licensing details.


πŸ”— Related Resources

  • Fast expert: https://huggingface.co/andrewandrewsen/distilbert-secret-masker
  • Long expert: https://huggingface.co/andrewandrewsen/longformer-secret-masker
  • SecretMask inference code: https://github.com/andrewandrewsen/secmask


🀝 Contributing

Issues and pull requests are welcome at https://github.com/andrewandrewsen/secmask.


Built with ❀️ for the open source community
