| --- |
| license: mit |
| license_link: LICENSE |
| library_name: custom |
| pipeline_tag: text-classification |
| tags: |
| - dns |
| - security |
| - threat-detection |
| - bert |
| - domain-classification |
| - zero-day |
| - malware |
| - phishing |
| - dga |
| - cybersecurity |
| - network-security |
| thumbnail: thumbnail.jpg |
| --- |
| |
| # zmsBERT - Zero-Millisecond Security |
|
|
| <p align="center"> |
| <img src="doxxnet-logo.png" width="256" alt="doxx.net"> |
| <br><br> |
| <strong>Real-time AI DNS threat classification by <a href="https://doxx.net">doxx.net</a></strong> |
| </p> |
|
|
| zmsBERT is a fine-tuned BERT model that classifies DNS domain names into 11 categories (benign plus ten threat and content classes) in real time. It catches zero-day phishing, malware, DGA (domain generation algorithm) domains, and other threats that static blocklists miss, working from the domain name string alone with no network lookup required. |
|
|
| ## Files |
|
|
| The model requires the following files to run: |
|
|
| | File | Size | Description | |
| |------|------|-------------| |
| | `weights.bin` | 423 MB | Model weights (flat float32 binary) | |
| | `config.json` | 1 KB | Model architecture config (layers, heads, hidden size, labels) | |
| | `vocab.json` | 567 KB | BPE vocabulary (token to ID mapping) | |
| | `merges.json` | 377 KB | BPE merge rules (31,173 pairs) | |
| | `manifest.json` | 28 KB | Tensor layout manifest (name, shape, offset for each weight tensor) | |
|
|
| All files are included in this repository. Download them to a single directory and point ZMS at it with `-weights /path/to/dir`. |
|
|
| Additionally, these optional data files improve classification accuracy: |
|
|
| | File | Description | |
| |------|-------------| |
| | `domain_categories.json` | Parent domain trust categories (1.6M+ domains mapped to hosting types) | |
| | `spam_tlds.txt` | Risky TLD list (437 TLDs from hagezi spam-tlds) | |
|
|
| These are available in the [ZMS repo](https://github.com/doxxcorp/ZMS). |
|
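Because the weights are a flat float32 binary described by a manifest of names, shapes, and offsets, a single tensor can in principle be read with plain file I/O. The sketch below is illustrative only. It assumes `manifest.json` is a JSON list of objects with `name`, `shape`, and `offset` keys, that offsets are in bytes, and that values are little-endian. None of this is confirmed by the file descriptions above, so check the ZMS repo for the real schema. `read_tensor` is a hypothetical helper name.

```python
import json
import struct
from math import prod
from pathlib import Path


def read_tensor(model_dir, tensor_name):
    """Return (shape, values) for one tensor from a flat float32 weights file.

    Assumed (hypothetical) manifest schema: a JSON list of
    {"name": str, "shape": [int, ...], "offset": int}, with byte offsets
    and little-endian float32 values. The real layout may differ.
    """
    model_dir = Path(model_dir)
    manifest = json.loads((model_dir / "manifest.json").read_text())

    # Find the manifest entry for the requested tensor.
    entry = next((e for e in manifest if e["name"] == tensor_name), None)
    if entry is None:
        raise KeyError(f"tensor {tensor_name!r} not in manifest")

    # Element count is the product of the shape dimensions.
    count = prod(entry["shape"])
    with open(model_dir / "weights.bin", "rb") as f:
        f.seek(entry["offset"])
        raw = f.read(4 * count)  # 4 bytes per float32
    if len(raw) != 4 * count:
        raise ValueError(f"truncated tensor {tensor_name!r}")

    return entry["shape"], list(struct.unpack(f"<{count}f", raw))
```

Reading one tensor at a time like this avoids loading the full 423 MB file just to inspect a single layer.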
|
| ## How It Works |
|
|
| ### The Problem |
|
|
| Static DNS blocklists are reactive - a malicious domain must be discovered, reported, analyzed, and added to a list before it's blocked. The window between when an attacker registers a domain and when it appears on blocklists is the **zero-day gap**. zmsBERT closes this gap by classifying domains from their name alone. |
|
|
| ### The Insight |
|
|
| Attackers face an unsolvable naming problem. A malicious domain must do at least one of the following: |
| - **Deceive humans** (phishing): `secure-paypal-login.xyz`, `microsoft365-verify.club` |
| - **Be algorithmically generated** (DGA/C2): `w10b8jin2uib3a6fl.shop`, `nexozerapexidexoviro.digital` |
| - **Mimic legitimate patterns** (typosquatting): `staemcommuniity.com`, `m1cr0s0ft.com.ru` |
|
|
| In all cases, the domain string carries signal that a language model can learn. |
|
|
| ### Three Context Signals |
|
|
| Each domain gets three context tags prepended before classification: |
|
|
| **1. Hosting Provider (25 categories)** |
|
|
| The model knows *who hosts* the domain, not just what it looks like. The same suspicious subdomain means different things on different infrastructure: |
|
|
| ``` |
| [CDN_ENTERPRISE] [TLD_SAFE] [GEO_TRUSTED] claim-150pro -> benign (Akamai, enterprise CDN) |
| [CDN_FREE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro -> phishing (Cloudflare free tier) |
| ``` |
|
|
| Categories are split by observed abuse rates: |
| - **CDN_ENTERPRISE**: Akamai International, Imperva, Edgecast (<3% abuse) |
| - **CDN_STANDARD**: Fastly, Akamai Connected Cloud (~17% abuse) |
| - **CDN_FREE**: Cloudflare (~27% abuse, free tier) |
| - **TECH_CURATED**: Apple, Microsoft (<5% abuse) |
| - **TECH_CLOUD**: Amazon AWS, Google Cloud (~17% abuse) |
| - **HOST_FREESITE**: Wix, Squarespace, Vercel, Netlify (free tier, high abuse) |
| - **HOST_BUDGET**: Hostinger, Namecheap, GoDaddy, OVH |
| - Plus: ENTERPRISE_APP, CLOUD_PROVIDER, ECOMMERCE, SOCIAL, MEDIA, COMMS, FINANCE, SEARCH, CODE_HOSTING, GAMING, DYNDNS, and more |
| |
| **2. TLD Risk** |
| - `TLD_SAFE`: .com, .org, .net, etc. |
| - `TLD_RISKY`: 437 spam TLDs (.xyz, .top, .club, .live, etc.) |
|
|
| **3. Geographic Risk** |
|
|
| Based on MaxMind GeoLite2 ASN lookup of the hosting IP: |
| - `GEO_HOSTILE`: RU, CN, IR, KP, SY, BY |
| - `GEO_SKETCHY`: CY, VG, SC, IS, MD, LV, HK, PA (bulletproof hosting havens) |
| - `GEO_MODERATE`: BR, ID, TH, PK, BD, BG, MY, etc. |
| - `GEO_NEUTRAL`: US, DE, NL, CA, SG, AU, SE, etc. |
| - `GEO_TRUSTED`: JP, GB, FR, IE, KR, PL, FI, CH, etc. |
|
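The TLD and geographic tags described above amount to simple set lookups. Here is a minimal sketch of how they could be derived. The country sets are copied from the lists above, which end in "etc.", so they are incomplete. The risky TLD set is a four-entry sample of the 437-entry list. The fallbacks (`TLD_SAFE` for unlisted TLDs, `GEO_NEUTRAL` for unlisted countries) are assumptions, and the function names are hypothetical rather than part of ZMS.

```python
# Country buckets copied from the lists above; the source lists are
# truncated with "etc.", so these sets are incomplete.
GEO_BUCKETS = {
    "GEO_HOSTILE": {"RU", "CN", "IR", "KP", "SY", "BY"},
    "GEO_SKETCHY": {"CY", "VG", "SC", "IS", "MD", "LV", "HK", "PA"},
    "GEO_MODERATE": {"BR", "ID", "TH", "PK", "BD", "BG", "MY"},
    "GEO_NEUTRAL": {"US", "DE", "NL", "CA", "SG", "AU", "SE"},
    "GEO_TRUSTED": {"JP", "GB", "FR", "IE", "KR", "PL", "FI", "CH"},
}

# Illustrative sample of the 437 risky TLDs (see spam_tlds.txt).
RISKY_TLDS = {"xyz", "top", "club", "live"}


def tld_tag(domain):
    """Tag the rightmost label; unlisted TLDs default to TLD_SAFE (assumed)."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return "TLD_RISKY" if tld in RISKY_TLDS else "TLD_SAFE"


def geo_tag(country_code, default="GEO_NEUTRAL"):
    """Map an ISO country code to a bucket; the default is an assumption."""
    code = country_code.upper()
    for tag, countries in GEO_BUCKETS.items():
        if code in countries:
            return tag
    return default


def build_prefix(hosting_tag, domain, country_code):
    """Join the three context tags in the order used in the examples above."""
    return f"[{hosting_tag}] [{tld_tag(domain)}] [{geo_tag(country_code)}]"
```

For example, `build_prefix("CDN_FREE", "claim-150pro.example.com", "US")` yields `[CDN_FREE] [TLD_SAFE] [GEO_NEUTRAL]`, matching the tag layout shown earlier.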
|
| ### Subdomain Isolation |
|
|
| When a known parent domain is found, it's stripped from the input and replaced with its trust category. The model learns subdomain patterns conditioned on the parent's context: |
|
|
| ``` |
| claim-150pro.firebaseapp.com -> [HOST_FREESITE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro |
| helix-go-webview.uber.com -> [ENTERPRISE_APP] [TLD_SAFE] [GEO_NEUTRAL] helix-go-webview |
| ``` |
|
|
| This prevents false positives on legitimate infrastructure subdomains (Apple courier servers, Microsoft SmartScreen, Zoom internal APIs) while still catching threats on free hosting platforms. |
|
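One plausible way to implement the parent-stripping step is a longest-suffix match against the parent domain table. The sketch below assumes that approach. `PARENT_CATEGORIES` is a two-entry stand-in for `domain_categories.json`, whose real format is not described here. The `UNKNOWN` fallback mirrors the `parent_tag` value in the sample query output, but passing the full name through in that case is an assumption.

```python
# Hypothetical stand-in for domain_categories.json (parent -> category).
PARENT_CATEGORIES = {
    "firebaseapp.com": "HOST_FREESITE",
    "uber.com": "ENTERPRISE_APP",
}


def isolate_subdomain(domain, categories=PARENT_CATEGORIES):
    """Return (category, remainder) using the longest known parent suffix.

    The remainder is the part of the name left of the matched parent.
    Unknown parents return ("UNKNOWN", full name), which is an assumption.
    """
    labels = domain.lower().rstrip(".").split(".")
    # Try suffixes from longest (the whole name) to shortest (the TLD).
    for i in range(len(labels)):
        parent = ".".join(labels[i:])
        if parent in categories:
            return categories[parent], ".".join(labels[:i])
    return "UNKNOWN", ".".join(labels)
```

With this table, `isolate_subdomain("claim-150pro.firebaseapp.com")` returns `("HOST_FREESITE", "claim-150pro")`, the same split shown in the example above.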
|
| ## Categories |
|
|
| | ID | Label | Description | Examples or training sources | |
| |----|-------|-------------|----------| |
| | 0 | benign | Legitimate domains | google.com, zoom.us | |
| | 1 | malware | Malware C2, distribution | urlhaus, malware_filter sources | |
| | 2 | phishing | Phishing, credential theft, fake shops | phishing_filter, hagezi fake | |
| | 3 | ads | Advertising networks | adguard, goodbyeads | |
| | 4 | mixed | Multi-category blocklist domains | stevenblack unified | |
| | 5 | trackers | Tracking, native telemetry | hagezi tif/pro/ultimate, native device telemetry | |
| | 6 | content | Gambling, adult, social media, fake news | Combined content categories | |
| | 7 | dga | Domain generation algorithm | hagezi dga7, campaign-deduplicated | |
| | 8 | nrd | Newly registered domains (past 7 days) | hagezi nrd7 | |
| | 9 | piracy | Piracy-related domains | hagezi anti.piracy | |
| | 10 | bypass | DoH/VPN/proxy bypass | hagezi doh-vpn-proxy-bypass | |
|
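The ID column in the table doubles as the index into the model's 11 output logits. As a rough sketch, a logit vector could be decoded into a label and confidence as follows. It assumes the reported confidence is the softmax probability of the top class, which this card does not state explicitly. `decode` is a hypothetical helper name.

```python
import math

# Index = category ID from the table above.
LABELS = ["benign", "malware", "phishing", "ads", "mixed", "trackers",
          "content", "dga", "nrd", "piracy", "bypass"]


def decode(logits):
    """Softmax over the logits; return (label, confidence) of the top class."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    best = max(range(len(exps)), key=exps.__getitem__)
    return LABELS[best], exps[best] / total
```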
|
| ## Architecture |
|
|
| - **Base model**: [DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) (110M parameters) |
| - **Classifier head**: Dropout(0.1) -> Linear(768, 256) -> ReLU -> Dropout(0.1) -> Linear(256, 11) |
| - **Tokenizer**: BPE with 31,173 merge rules + 36 special context tag tokens |
| - **Max sequence length**: 128 tokens |
| - **Training data**: 13.5M+ samples from 27+ blocklist sources, Tranco top-1M, live DNS traffic |
| - **Oversampling**: Fortune 1000 domains (300x), Tranco top-10K (150x), whitelists (150x), synthetic infrastructure patterns (75x), real benign from live DNS (50x) |
| - **Weight format**: Flat float32 binary (no PyTorch, no ONNX) |
|
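To make the classifier head concrete, here is a dependency-free sketch of its inference-time forward pass. Dropout is the identity at inference, so only the two linear layers and the ReLU remain. The input name `pooled`, the row-major weight layout (one row per output unit), and the function names are assumptions for illustration. The card does not say which encoder output feeds the head.

```python
def linear(x, weight, bias):
    """y = W x + b, with weight stored as one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def classifier_head(pooled, w1, b1, w2, b2):
    """Linear(768, 256) -> ReLU -> Linear(256, 11) at inference time.

    Both Dropout(0.1) layers are no-ops during inference and are omitted.
    """
    return linear(relu(linear(pooled, w1, b1)), w2, b2)


def head_parameter_count(hidden=768, mid=256, classes=11):
    """Weights plus biases of the two linear layers."""
    return hidden * mid + mid + mid * classes + classes
```

With the dimensions listed above, `head_parameter_count()` gives 199,691 parameters, a small fraction of the roughly 110M in the base model.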
|
| ## Performance |
|
|
| | Metric | Value | |
| |--------|-------| |
| | Model load time | 249 ms | |
| | First classification | 30-50 ms | |
| | Cached classification | <1 microsecond | |
| | CPU throughput | 30 domains/sec | |
| | GPU throughput | 4,585 domains/sec | |
| | Model size | 423 MB | |
| | Binary size | ~10 MB (static Go binary) | |
|
|
| ## Usage |
|
|
| This model is designed for use with the [ZMS inference engine](https://github.com/doxxcorp/ZMS) - a pure Go BERT implementation with no Python or ONNX dependencies: |
|
|
| ```bash |
| # Download the model |
| zms -update-model |
| |
| # Start the DNS classifier |
| zms -bind-ipv4 127.0.0.1 -listen 54 |
| |
| # Query via DNS TXT |
| dig @127.0.0.1 -p 54 TXT suspicious-domain.xyz +short |
| # {"label":"phishing","confidence":0.995,"parent_tag":"UNKNOWN","tld_risk":"TLD_RISKY"} |
| ``` |
|
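A client can act on the TXT answer by parsing its JSON payload. The sketch below relies only on the response shape shown in the sample output above. It also accepts the quoted form that `dig +short` typically prints for TXT records, but it does not handle answers split across multiple TXT strings. `should_block` and its 0.9 threshold are a hypothetical policy, not part of ZMS.

```python
import json


def parse_verdict(txt):
    """Parse a ZMS TXT answer into a dict.

    Accepts bare JSON, or a single quoted TXT string with escaped inner
    quotes (outer quotes are stripped and the escapes undone).
    """
    txt = txt.strip()
    if txt.startswith('"') and txt.endswith('"'):
        txt = txt[1:-1].replace('\\"', '"')
    return json.loads(txt)


def should_block(verdict, threshold=0.9):
    """Hypothetical policy: block confident non-benign verdicts."""
    return verdict["label"] != "benign" and verdict["confidence"] >= threshold
```

In practice the threshold would likely be tuned per category, since blocking `ads` and blocking `malware` carry very different costs when the model is wrong.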
|
| ## Zero-Day Catch Examples |
|
|
| ``` |
| 99.8% malware narr9-vector.aurorift.in.net [FREE_HOSTING] |
| 99.7% phishing pub-a7aa109e9db04b97ba2fc89747a05209.r2.dev [CLOUD_STORAGE] |
| 99.7% phishing reappeal-site-c9843io.vercel.app [FREE_HOSTING] |
| 99.6% malware solflare-blocklist.moonshot.workers.dev [FREE_HOSTING] |
| 99.6% phishing mintptojects211.vercel.app [FREE_HOSTING] |
| 99.5% malware svc2base.absolutecontinuity.in.net [FREE_HOSTING] |
| 99.4% phishing claim-nwomyboxpro.firebaseapp.com [FREE_HOSTING] |
| 99.4% phishing smartwebcontractdapps.netlify.app [FREE_HOSTING] |
| 99.3% phishing blocksdappsrectify.vercel.app [FREE_HOSTING] |
| 99.1% phishing trustwalletsupport.vercel.app [FREE_HOSTING] |
| 98.5% phishing claim-150pro.firebaseapp.com [FREE_HOSTING] |
| ``` |
|
|
| Correctly classified as benign (no false positives on infrastructure): |
| ``` |
| 99.6% benign google.com [TECH_PLATFORM] |
| 99.6% benign microsoft.com [TECH_PLATFORM] |
| 98.4% benign apple.com [TECH_PLATFORM] |
| 98.8% benign zoom.us [COMMS] |
| 97.4% benign statuspage.io [ENTERPRISE_APP] |
| ``` |
|
|
| ## Training Data Sources |
|
|
| - **Blocklists**: 27+ sources including urlhaus, malware-filter, phishing-filter, adguard, goodbyeads, hagezi, stevenblack, and native telemetry lists |
| - **Benign**: Tranco top-1M, hagezi whitelists, synthetic infrastructure patterns, real benign subdomains from live DNS traffic |
|
|
| ## License |
|
|
| This model is released under the [MIT License](LICENSE) with a commercial use restriction. Non-commercial use is freely permitted. Commercial use requires written permission from Barrett Lyon / Doxx Corp. Contact legal@doxx.net for licensing. |
|
|
| This model is a derivative work based on [BERT](https://github.com/google-research/bert) (Apache 2.0, Google) and [DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) (Abdelkader El Mahdaouy et al.). |
|
|
| ## Citation |
|
|
| ``` |
| zmsBERT: Zero-Millisecond Security DNS Classifier |
| doxx.net, 2026 |
| https://huggingface.co/doxxnet/zmsBERT |
| ``` |
|
|
| --- |
|
|
| <p align="center"> |
| <a href="https://doxx.net">doxx.net</a> - Privacy without compromise |
| </p> |
|
|