---
license: mit
license_link: LICENSE
library_name: custom
pipeline_tag: text-classification
tags:
- dns
- security
- threat-detection
- bert
- domain-classification
- zero-day
- malware
- phishing
- dga
- cybersecurity
- network-security
thumbnail: thumbnail.jpg
---

# zmsBERT - Zero-Millisecond Security

Real-time AI DNS threat classification by doxx.net

zmsBERT is a fine-tuned BERT model that classifies DNS domain names into 11 threat categories in real time. It catches zero-day phishing, malware, DGA (domain generation algorithm), and other threats that static blocklists miss - from the domain name string alone, with no network lookup required.

## Files

The model requires the following files to run:

| File | Size | Description |
|------|------|-------------|
| `weights.bin` | 423 MB | Model weights (flat float32 binary) |
| `config.json` | 1 KB | Model architecture config (layers, heads, hidden size, labels) |
| `vocab.json` | 567 KB | BPE vocabulary (token to ID mapping) |
| `merges.json` | 377 KB | BPE merge rules (31,173 pairs) |
| `manifest.json` | 28 KB | Tensor layout manifest (name, shape, offset for each weight tensor) |

All files are included in this repository. Download them to a single directory and point ZMS at it with `-weights /path/to/dir`.

Additionally, these optional data files improve classification accuracy:

| File | Description |
|------|-------------|
| `domain_categories.json` | Parent domain trust categories (1.6M+ domains mapped to hosting types) |
| `spam_tlds.txt` | Risky TLD list (437 TLDs from hagezi spam-tlds) |

These are available in the [ZMS repo](https://github.com/doxxcorp/ZMS).

## How It Works

### The Problem

Static DNS blocklists are reactive - a malicious domain must be discovered, reported, analyzed, and added to a list before it's blocked. The window between when an attacker registers a domain and when it appears on blocklists is the **zero-day gap**. zmsBERT closes this gap by classifying domains from their name alone.

### The Insight

Attackers face an unsolvable naming problem.
Malicious domains must either:

- **Deceive humans** (phishing): `secure-paypal-login.xyz`, `microsoft365-verify.club`
- **Be algorithmically generated** (DGA/C2): `w10b8jin2uib3a6fl.shop`, `nexozerapexidexoviro.digital`
- **Mimic legitimate patterns** (typosquatting): `staemcommuniity.com`, `m1cr0s0ft.com.ru`

In all cases, the domain string carries signal that a language model can learn.

### Three Context Signals

Each domain gets three context tags prepended before classification:

**1. Hosting Provider (25 categories)**

The model knows *who hosts* the domain, not just what it looks like. The same suspicious subdomain means different things on different infrastructure:

```
[CDN_ENTERPRISE] [TLD_SAFE] [GEO_TRUSTED] claim-150pro  -> benign   (Akamai, enterprise CDN)
[CDN_FREE]       [TLD_SAFE] [GEO_NEUTRAL] claim-150pro  -> phishing (Cloudflare free tier)
```

Categories are split by actual abuse rates:

- **CDN_ENTERPRISE**: Akamai International, Imperva, Edgecast (<3% abuse)
- **CDN_STANDARD**: Fastly, Akamai Connected Cloud (~17% abuse)
- **CDN_FREE**: Cloudflare (~27% abuse, free tier)
- **TECH_CURATED**: Apple, Microsoft (<5% abuse)
- **TECH_CLOUD**: Amazon AWS, Google Cloud (~17% abuse)
- **HOST_FREESITE**: Wix, Squarespace, Vercel, Netlify (free tier, high abuse)
- **HOST_BUDGET**: Hostinger, Namecheap, GoDaddy, OVH
- Plus: ENTERPRISE_APP, CLOUD_PROVIDER, ECOMMERCE, SOCIAL, MEDIA, COMMS, FINANCE, SEARCH, CODE_HOSTING, GAMING, DYNDNS, and more

**2. TLD Risk**

- `TLD_SAFE`: .com, .org, .net, etc.
- `TLD_RISKY`: 437 spam TLDs (.xyz, .top, .club, .live, etc.)

**3. Geographic Risk**

Based on MaxMind GeoLite2 ASN lookup of the hosting IP:

- `GEO_HOSTILE`: RU, CN, IR, KP, SY, BY
- `GEO_SKETCHY`: CY, VG, SC, IS, MD, LV, HK, PA (bulletproof hosting havens)
- `GEO_MODERATE`: BR, ID, TH, PK, BD, BG, MY, etc.
- `GEO_NEUTRAL`: US, DE, NL, CA, SG, AU, SE, etc.
- `GEO_TRUSTED`: JP, GB, FR, IE, KR, PL, FI, CH, etc.
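The three tags above can be assembled roughly as follows. This is a minimal Python sketch for illustration only, not the ZMS implementation (which is written in Go). The function names (`tld_tag`, `geo_tag`, `build_model_input`), the small lookup tables, and the fallback tiers for unlisted TLDs and countries are assumptions. The real lists are much larger, and `claim-150pro.example.com` is a hypothetical domain.

```python
# Illustrative subsets only; the real risky-TLD list has 437 entries.
RISKY_TLDS = {"xyz", "top", "club", "live"}

# Country tiers copied from the list above (each real tier is longer: "etc.").
GEO_TIERS = {
    "GEO_HOSTILE": {"RU", "CN", "IR", "KP", "SY", "BY"},
    "GEO_SKETCHY": {"CY", "VG", "SC", "IS", "MD", "LV", "HK", "PA"},
    "GEO_MODERATE": {"BR", "ID", "TH", "PK", "BD", "BG", "MY"},
    "GEO_NEUTRAL": {"US", "DE", "NL", "CA", "SG", "AU", "SE"},
    "GEO_TRUSTED": {"JP", "GB", "FR", "IE", "KR", "PL", "FI", "CH"},
}


def tld_tag(domain: str) -> str:
    """Return TLD_RISKY if the last label is on the risky list.

    Treating every other TLD as TLD_SAFE is an assumption of this sketch.
    """
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return "TLD_RISKY" if tld in RISKY_TLDS else "TLD_SAFE"


def geo_tag(country_code: str, default: str = "GEO_NEUTRAL") -> str:
    """Map an ISO country code to a geo tier.

    The default for unlisted countries is an assumption of this sketch.
    """
    code = country_code.upper()
    for tag, countries in GEO_TIERS.items():
        if code in countries:
            return tag
    return default


def build_model_input(host_category: str, domain: str,
                      country_code: str, name: str) -> str:
    """Prepend hosting, TLD, and geo tags (in that order) to the name."""
    tags = [host_category, tld_tag(domain), geo_tag(country_code)]
    return " ".join(f"[{t}]" for t in tags) + " " + name


print(build_model_input("CDN_FREE", "claim-150pro.example.com", "US", "claim-150pro"))
```

Run as-is, this prints `[CDN_FREE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro`, matching the second example line above. The hosting category is passed in directly because resolving it depends on the parent-domain and ASN data, which this sketch does not model.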
### Subdomain Isolation

When a known parent domain is found, it's stripped from the input and replaced with its trust category. The model learns subdomain patterns conditioned on the parent's context:

```
claim-150pro.firebaseapp.com  -> [HOST_FREESITE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro
helix-go-webview.uber.com     -> [ENTERPRISE_APP] [TLD_SAFE] [GEO_NEUTRAL] helix-go-webview
```

This prevents false positives on legitimate infrastructure subdomains (Apple courier servers, Microsoft SmartScreen, Zoom internal APIs) while still catching threats on free hosting platforms.

## Categories

| ID | Label | Description | Examples / sources |
|----|-------|-------------|--------------------|
| 0 | benign | Legitimate domains | google.com, zoom.us |
| 1 | malware | Malware C2, distribution | urlhaus, malware_filter sources |
| 2 | phishing | Phishing, credential theft, fake shops | phishing_filter, hagezi fake |
| 3 | ads | Advertising networks | adguard, goodbyeads |
| 4 | mixed | Multi-category blocklist domains | stevenblack unified |
| 5 | trackers | Tracking, native telemetry | hagezi tif/pro/ultimate, native device telemetry |
| 6 | content | Gambling, adult, social media, fake news | Combined content categories |
| 7 | dga | Domain generation algorithm | hagezi dga7, campaign-deduplicated |
| 8 | nrd | Newly registered domains (past 7 days) | hagezi nrd7 |
| 9 | piracy | Piracy-related domains | hagezi anti.piracy |
| 10 | bypass | DoH/VPN/proxy bypass | hagezi doh-vpn-proxy-bypass |

## Architecture

- **Base model**: [DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) (110M parameters)
- **Classifier head**: Dropout(0.1) -> Linear(768, 256) -> ReLU -> Dropout(0.1) -> Linear(256, 11)
- **Tokenizer**: BPE with 31,173 merge rules + 36 special context tag tokens
- **Max sequence length**: 128 tokens
- **Training data**: 13.5M+ samples from 27+ blocklist sources, Tranco top-1M, live DNS traffic
- **Oversampling**: Fortune 1000 domains (300x), Tranco top-10K (150x), whitelists (150x),
  synthetic infrastructure patterns (75x), real benign from live DNS (50x)
- **Weight format**: Flat float32 binary (no PyTorch, no ONNX)

## Performance

| Metric | Value |
|--------|-------|
| Model load time | 249 ms |
| First classification | 30-50 ms |
| Cached classification | <1 microsecond |
| CPU throughput | 30 domains/sec |
| GPU throughput | 4,585 domains/sec |
| Model size | 423 MB |
| Binary size | ~10 MB (static Go binary) |

## Usage

This model is designed for use with the [ZMS inference engine](https://github.com/doxxcorp/ZMS) - a pure Go BERT implementation with no Python or ONNX dependencies:

```bash
# Download the model
zms -update-model

# Start the DNS classifier
zms -bind-ipv4 127.0.0.1 -listen 54

# Query via DNS TXT
dig @127.0.0.1 -p 54 TXT suspicious-domain.xyz +short
# {"label":"phishing","confidence":0.995,"parent_tag":"UNKNOWN","tld_risk":"TLD_RISKY"}
```

## Zero-Day Catch Examples

```
99.8%  malware   narr9-vector.aurorift.in.net                      [FREE_HOSTING]
99.7%  phishing  pub-a7aa109e9db04b97ba2fc89747a05209.r2.dev       [CLOUD_STORAGE]
99.7%  phishing  reappeal-site-c9843io.vercel.app                  [FREE_HOSTING]
99.6%  malware   solflare-blocklist.moonshot.workers.dev           [FREE_HOSTING]
99.6%  phishing  mintptojects211.vercel.app                        [FREE_HOSTING]
99.5%  malware   svc2base.absolutecontinuity.in.net                [FREE_HOSTING]
99.4%  phishing  claim-nwomyboxpro.firebaseapp.com                 [FREE_HOSTING]
99.4%  phishing  smartwebcontractdapps.netlify.app                 [FREE_HOSTING]
99.3%  phishing  blocksdappsrectify.vercel.app                     [FREE_HOSTING]
99.1%  phishing  trustwalletsupport.vercel.app                     [FREE_HOSTING]
98.5%  phishing  claim-150pro.firebaseapp.com                      [FREE_HOSTING]
```

Correctly benign (no false positives on infrastructure):

```
99.6%  benign  google.com      [TECH_PLATFORM]
99.6%  benign  microsoft.com   [TECH_PLATFORM]
98.4%  benign  apple.com       [TECH_PLATFORM]
98.8%  benign  zoom.us         [COMMS]
97.4%  benign  statuspage.io   [ENTERPRISE_APP]
```

## Training Data Sources

- **Blocklists**: 27+ sources including urlhaus, malware-filter, phishing-filter, adguard,
  goodbyeads, hagezi, stevenblack, and native telemetry lists
- **Benign**: Tranco top-1M, hagezi whitelists, synthetic infrastructure patterns, real benign subdomains from live DNS traffic

## License

This model is released under the [MIT License](LICENSE) with a commercial use restriction. Non-commercial use is freely permitted. Commercial use requires written permission from Barrett Lyon / Doxx Corp. Contact legal@doxx.net for licensing.

This model is a derivative work based on [BERT](https://github.com/google-research/bert) (Apache 2.0, Google) and [DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) (Abdelkader El Mahdaouy et al.).

## Citation

```
zmsBERT: Zero-Millisecond Security DNS Classifier
doxx.net, 2026
https://huggingface.co/doxxnet/zmsBERT
```

---

doxx.net - Privacy without compromise