---
license: mit
license_link: LICENSE
library_name: custom
pipeline_tag: text-classification
tags:
- dns
- security
- threat-detection
- bert
- domain-classification
- zero-day
- malware
- phishing
- dga
- cybersecurity
- network-security
thumbnail: thumbnail.jpg
---
# zmsBERT - Zero-Millisecond Security
Real-time AI DNS threat classification by doxx.net
zmsBERT is a fine-tuned BERT model that classifies DNS domain names into 11 threat categories in real time. It catches zero-day phishing, malware, DGA (domain generation algorithm), and other threats that static blocklists miss - from the domain name string alone, with no network lookup required.
## Files
The model requires the following files to run:
| File | Size | Description |
|------|------|-------------|
| `weights.bin` | 423 MB | Model weights (flat float32 binary) |
| `config.json` | 1 KB | Model architecture config (layers, heads, hidden size, labels) |
| `vocab.json` | 567 KB | BPE vocabulary (token to ID mapping) |
| `merges.json` | 377 KB | BPE merge rules (31,173 pairs) |
| `manifest.json` | 28 KB | Tensor layout manifest (name, shape, offset for each weight tensor) |
All files are included in this repository. Download them to a single directory and point ZMS at it with `-weights /path/to/dir`.
Additionally, these optional data files improve classification accuracy:
| File | Description |
|------|-------------|
| `domain_categories.json` | Parent domain trust categories (1.6M+ domains mapped to hosting types) |
| `spam_tlds.txt` | Risky TLD list (437 TLDs from hagezi spam-tlds) |
These are available in the [ZMS repo](https://github.com/doxxcorp/ZMS).
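As a rough illustration of how a flat float32 file can be read with a tensor manifest, the sketch below unpacks tensors from an in-memory blob. The manifest layout used here (a JSON list of objects with `name`, `shape`, and `offset` keys, with `offset` in bytes and little-endian values) is an assumption based on the table above, not the documented schema of `manifest.json`.

```python
import json
import struct
from math import prod

def load_tensor(weights, entry):
    """Read one tensor from a flat float32 blob using a manifest entry."""
    count = prod(entry["shape"])
    # Assumption: "offset" is a byte offset and values are little-endian float32.
    return list(struct.unpack_from(f"<{count}f", weights, entry["offset"]))

# Tiny in-memory stand-ins for weights.bin and manifest.json (hypothetical names).
blob = struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
manifest = json.loads("""[
  {"name": "demo.weight", "shape": [2, 2], "offset": 0},
  {"name": "demo.bias", "shape": [2], "offset": 16}
]""")

tensors = {entry["name"]: load_tensor(blob, entry) for entry in manifest}
```

With the real files, `blob` would be the contents of `weights.bin` and `manifest` the parsed `manifest.json`, after checking the actual key names.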
## How It Works
### The Problem
Static DNS blocklists are reactive - a malicious domain must be discovered, reported, analyzed, and added to a list before it's blocked. The window between when an attacker registers a domain and when it appears on blocklists is the **zero-day gap**. zmsBERT closes this gap by classifying domains from their name alone.
### The Insight
Attackers face an unsolvable naming problem. A malicious domain must do one of three things:
- **Deceive humans** (phishing): `secure-paypal-login.xyz`, `microsoft365-verify.club`
- **Be algorithmically generated** (DGA/C2): `w10b8jin2uib3a6fl.shop`, `nexozerapexidexoviro.digital`
- **Mimic legitimate patterns** (typosquatting): `staemcommuniity.com`, `m1cr0s0ft.com.ru`
In all cases, the domain string carries signal that a language model can learn.
### Three Context Signals
Each domain gets three context tags prepended before classification:
**1. Hosting Provider (25 categories)**
The model knows *who hosts* the domain, not just what it looks like. The same suspicious subdomain means different things on different infrastructure:
```
[CDN_ENTERPRISE] [TLD_SAFE] [GEO_TRUSTED] claim-150pro -> benign (Akamai, enterprise CDN)
[CDN_FREE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro -> phishing (Cloudflare free tier)
```
Categories are split by actual abuse rates:
- **CDN_ENTERPRISE**: Akamai International, Imperva, Edgecast (<3% abuse)
- **CDN_STANDARD**: Fastly, Akamai Connected Cloud (~17% abuse)
- **CDN_FREE**: Cloudflare (~27% abuse, free tier)
- **TECH_CURATED**: Apple, Microsoft (<5% abuse)
- **TECH_CLOUD**: Amazon AWS, Google Cloud (~17% abuse)
- **HOST_FREESITE**: Wix, Squarespace, Vercel, Netlify (free tier, high abuse)
- **HOST_BUDGET**: Hostinger, Namecheap, GoDaddy, OVH
- Plus: ENTERPRISE_APP, CLOUD_PROVIDER, ECOMMERCE, SOCIAL, MEDIA, COMMS, FINANCE, SEARCH, CODE_HOSTING, GAMING, DYNDNS, and more
**2. TLD Risk**
- `TLD_SAFE`: .com, .org, .net, etc.
- `TLD_RISKY`: 437 spam TLDs (.xyz, .top, .club, .live, etc.)
**3. Geographic Risk**
Based on a MaxMind GeoLite2 ASN lookup of the hosting IP:
- `GEO_HOSTILE`: RU, CN, IR, KP, SY, BY
- `GEO_SKETCHY`: CY, VG, SC, IS, MD, LV, HK, PA (bulletproof hosting havens)
- `GEO_MODERATE`: BR, ID, TH, PK, BD, BG, MY, etc.
- `GEO_NEUTRAL`: US, DE, NL, CA, SG, AU, SE, etc.
- `GEO_TRUSTED`: JP, GB, FR, IE, KR, PL, FI, CH, etc.
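The TLD and geographic tags above amount to two table lookups. The sketch below is a minimal version that uses only the countries and TLDs named in the lists. The members hidden behind "etc." are unknown, so the fallback to `GEO_NEUTRAL` for unlisted countries and the helper names `geo_tag` and `tld_tag` are assumptions for illustration.

```python
# Country sets copied from the lists above; members behind "etc." are unknown.
GEO_TAGS = {
    "GEO_HOSTILE": {"RU", "CN", "IR", "KP", "SY", "BY"},
    "GEO_SKETCHY": {"CY", "VG", "SC", "IS", "MD", "LV", "HK", "PA"},
    "GEO_MODERATE": {"BR", "ID", "TH", "PK", "BD", "BG", "MY"},
    "GEO_NEUTRAL": {"US", "DE", "NL", "CA", "SG", "AU", "SE"},
    "GEO_TRUSTED": {"JP", "GB", "FR", "IE", "KR", "PL", "FI", "CH"},
}
# Four of the 437 risky TLDs named above; the full list is spam_tlds.txt.
RISKY_TLDS = {"xyz", "top", "club", "live"}

def geo_tag(country_code):
    """Map an ISO country code to a geographic risk tag."""
    code = country_code.upper()
    for tag, countries in GEO_TAGS.items():
        if code in countries:
            return tag
    return "GEO_NEUTRAL"  # assumption: unlisted countries are treated as neutral

def tld_tag(domain):
    """Map a domain's top-level label to a TLD risk tag."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return "TLD_RISKY" if tld in RISKY_TLDS else "TLD_SAFE"
```

For example, `tld_tag("secure-paypal-login.xyz")` yields `TLD_RISKY` and `geo_tag("jp")` yields `GEO_TRUSTED`.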
### Subdomain Isolation
When a known parent domain is found, it's stripped from the input and replaced with its trust category. The model learns subdomain patterns conditioned on the parent's context:
```
claim-150pro.firebaseapp.com -> [HOST_FREESITE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro
helix-go-webview.uber.com -> [ENTERPRISE_APP] [TLD_SAFE] [GEO_NEUTRAL] helix-go-webview
```
This prevents false positives on legitimate infrastructure subdomains (Apple courier servers, Microsoft SmartScreen, Zoom internal APIs) while still catching threats on free hosting platforms.
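The stripping step can be sketched as a longest-suffix match against a parent-domain table. The two-entry dictionary below is a hypothetical stand-in for `domain_categories.json`, and the `[UNKNOWN]` tag for unmatched domains is an assumption based on the `parent_tag` value shown in the sample query output, not a confirmed token.

```python
# Hypothetical two-entry stand-in for domain_categories.json.
PARENT_CATEGORIES = {
    "firebaseapp.com": "HOST_FREESITE",
    "uber.com": "ENTERPRISE_APP",
}

def isolate_subdomain(domain, tld_risk, geo_risk):
    """Replace a known parent domain with its category tag."""
    labels = domain.lower().rstrip(".").split(".")
    # Try the longest candidate parent first; a bare TLD is never tried.
    for i in range(len(labels) - 1):
        parent = ".".join(labels[i:])
        category = PARENT_CATEGORIES.get(parent)
        if category is not None:
            remainder = ".".join(labels[:i])
            return f"[{category}] [{tld_risk}] [{geo_risk}] {remainder}".strip()
    # Assumption: unmatched domains keep their full name under an UNKNOWN tag.
    return f"[UNKNOWN] [{tld_risk}] [{geo_risk}] {'.'.join(labels)}"

tagged = isolate_subdomain("claim-150pro.firebaseapp.com", "TLD_SAFE", "GEO_NEUTRAL")
```

Matching the longest suffix first matters when both a domain and one of its subdomains appear in the table, since the more specific entry should win.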
## Categories
| ID | Label | Description | Examples |
|----|-------|-------------|----------|
| 0 | benign | Legitimate domains | google.com, zoom.us |
| 1 | malware | Malware C2, distribution | urlhaus, malware_filter sources |
| 2 | phishing | Phishing, credential theft, fake shops | phishing_filter, hagezi fake |
| 3 | ads | Advertising networks | adguard, goodbyeads |
| 4 | mixed | Multi-category blocklist domains | stevenblack unified |
| 5 | trackers | Tracking, native telemetry | hagezi tif/pro/ultimate, native device telemetry |
| 6 | content | Gambling, adult, social media, fake news | Combined content categories |
| 7 | dga | Domain generation algorithm | hagezi dga7, campaign-deduplicated |
| 8 | nrd | Newly registered domains (past 7 days) | hagezi nrd7 |
| 9 | piracy | Piracy-related domains | hagezi anti.piracy |
| 10 | bypass | DoH/VPN/proxy bypass | hagezi doh-vpn-proxy-bypass |
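To show how the IDs in this table map to a prediction, the sketch below turns 11 raw scores into a label and a confidence. It assumes the reported confidence is a softmax probability over the classifier logits, which the source does not state explicitly, and the input scores are made up for the example.

```python
import math

# Index in this list equals the ID column of the table above.
LABELS = ["benign", "malware", "phishing", "ads", "mixed", "trackers",
          "content", "dga", "nrd", "piracy", "bypass"]

def decode(logits):
    """Softmax over 11 logits, then return the top label and its probability."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    best = max(range(len(exps)), key=exps.__getitem__)
    return LABELS[best], exps[best] / total

# Made-up scores with a clear peak at index 2 (phishing).
label, confidence = decode([0.1, 0.3, 6.0, 0.0, 0.2, 0.1, 0.0, 0.4, 0.2, 0.0, 0.1])
```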
## Architecture
- **Base model**: [DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) (110M parameters)
- **Classifier head**: Dropout(0.1) -> Linear(768, 256) -> ReLU -> Dropout(0.1) -> Linear(256, 11)
- **Tokenizer**: BPE with 31,173 merge rules + 36 special context tag tokens
- **Max sequence length**: 128 tokens
- **Training data**: 13.5M+ samples from 27+ blocklist sources, Tranco top-1M, live DNS traffic
- **Oversampling**: Fortune 1000 domains (300x), Tranco top-10K (150x), whitelists (150x), synthetic infrastructure patterns (75x), real benign subdomains from live DNS traffic (50x)
- **Weight format**: Flat float32 binary (no PyTorch, no ONNX)
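The classifier head listed above is small enough to write out by hand. The following is a pure-Python sketch of its inference-time forward pass with random placeholder weights. It assumes the head receives a single 768-dimensional vector from the encoder (how that vector is pooled is not stated in the source), and all function names are illustrative.

```python
import random

def linear(x, weight, bias):
    """y = W x + b, with weight stored as rows of length len(x)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init(rows, cols, rng):
    """Random placeholder weights; real values would come from weights.bin."""
    weight = [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]
    return weight, [0.0] * rows

def head_forward(pooled, w1, b1, w2, b2):
    # Dropout(0.1) is the identity at inference time, so it is omitted here.
    return linear(relu(linear(pooled, w1, b1)), w2, b2)

rng = random.Random(0)
w1, b1 = init(256, 768, rng)   # Linear(768, 256)
w2, b2 = init(11, 256, rng)    # Linear(256, 11)
pooled = [rng.uniform(-1.0, 1.0) for _ in range(768)]  # stand-in for encoder output
logits = head_forward(pooled, w1, b1, w2, b2)

# Weights plus biases of the two linear layers.
head_params = 768 * 256 + 256 + 256 * 11 + 11
```

The head contributes about 200K parameters, a small fraction of the 110M in the base model.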
## Performance
| Metric | Value |
|--------|-------|
| Model load time | 249ms |
| First classification | 30-50ms |
| Cached classification | <1 microsecond |
| CPU throughput | 30 domains/sec |
| GPU throughput | 4,585 domains/sec |
| Model size | 423 MB |
| Binary size | ~10 MB (static Go binary) |
## Usage
This model is designed for use with the [ZMS inference engine](https://github.com/doxxcorp/ZMS) - a pure Go BERT implementation with no Python or ONNX dependencies:
```bash
# Download the model
zms -update-model
# Start the DNS classifier
zms -bind-ipv4 127.0.0.1 -listen 54
# Query via DNS TXT
dig @127.0.0.1 -p 54 TXT suspicious-domain.xyz +short
# Example response:
# {"label":"phishing","confidence":0.995,"parent_tag":"UNKNOWN","tld_risk":"TLD_RISKY"}
```
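A client that consumes the TXT response needs to decode the JSON payload and apply its own policy. The sketch below parses the sample response shown above. The 0.9 threshold, the "block anything not benign" rule, and the `parse_verdict` helper are hypothetical choices for illustration, and the quote handling assumes `dig` may wrap the record in escaped quotes.

```python
import json

def parse_verdict(txt_record, threshold=0.9):
    """Decode the JSON TXT payload and decide whether to block the domain."""
    payload = txt_record.strip()
    # dig may wrap TXT data in quotes and escape inner quotes; undo that if present.
    if payload.startswith('"') and payload.endswith('"'):
        payload = payload[1:-1].replace('\\"', '"')
    verdict = json.loads(payload)
    # Hypothetical policy: block any non-benign label at or above the threshold.
    verdict["block"] = (verdict["label"] != "benign"
                        and verdict["confidence"] >= threshold)
    return verdict

sample = '{"label":"phishing","confidence":0.995,"parent_tag":"UNKNOWN","tld_risk":"TLD_RISKY"}'
result = parse_verdict(sample)
```

A real deployment would likely treat categories such as `ads` or `nrd` differently from `malware` or `phishing`, so the single threshold here is only a starting point.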
## Zero-Day Catch Examples
```
99.8% malware narr9-vector.aurorift.in.net [FREE_HOSTING]
99.7% phishing pub-a7aa109e9db04b97ba2fc89747a05209.r2.dev [CLOUD_STORAGE]
99.7% phishing reappeal-site-c9843io.vercel.app [FREE_HOSTING]
99.6% malware solflare-blocklist.moonshot.workers.dev [FREE_HOSTING]
99.6% phishing mintptojects211.vercel.app [FREE_HOSTING]
99.5% malware svc2base.absolutecontinuity.in.net [FREE_HOSTING]
99.4% phishing claim-nwomyboxpro.firebaseapp.com [FREE_HOSTING]
99.4% phishing smartwebcontractdapps.netlify.app [FREE_HOSTING]
99.3% phishing blocksdappsrectify.vercel.app [FREE_HOSTING]
99.1% phishing trustwalletsupport.vercel.app [FREE_HOSTING]
98.5% phishing claim-150pro.firebaseapp.com [FREE_HOSTING]
```
Legitimate infrastructure domains are correctly classified as benign:
```
99.6% benign google.com [TECH_PLATFORM]
99.6% benign microsoft.com [TECH_PLATFORM]
98.8%  benign      zoom.us                                          [COMMS]
98.4%  benign      apple.com                                        [TECH_PLATFORM]
97.4% benign statuspage.io [ENTERPRISE_APP]
```
## Training Data Sources
- **Blocklists**: 27+ sources including urlhaus, malware-filter, phishing-filter, adguard, goodbyeads, hagezi, stevenblack, and native telemetry lists
- **Benign**: Tranco top-1M, hagezi whitelists, synthetic infrastructure patterns, real benign subdomains from live DNS traffic
## License
This model is released under the [MIT License](LICENSE) with a commercial use restriction. Non-commercial use is freely permitted. Commercial use requires written permission from Barrett Lyon / Doxx Corp. Contact legal@doxx.net for licensing.
This model is a derivative work based on [BERT](https://github.com/google-research/bert) (Apache 2.0, Google) and [DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) (Abdelkader El Mahdaouy et al.).
## Citation
```
zmsBERT: Zero-Millisecond Security DNS Classifier
doxx.net, 2026
https://huggingface.co/doxxnet/zmsBERT
```
---
doxx.net - Privacy without compromise