	---
	license: mit
	license_link: LICENSE
	library_name: custom
	pipeline_tag: text-classification
	tags:
	- dns
	- security
	- threat-detection
	- bert
	- domain-classification
	- zero-day
	- malware
	- phishing
	- dga
	- cybersecurity
	- network-security
	thumbnail: thumbnail.jpg
	---

	# zmsBERT - Zero-Millisecond Security

	<p align="center">
	<img src="doxxnet-logo.png" width="256" alt="doxx.net">
	<br><br>
	<strong>Real-time AI DNS threat classification by <a href="https://doxx.net">doxx.net</a></strong>
	</p>

	zmsBERT is a fine-tuned BERT model that classifies DNS domain names into 11 threat categories in real time. It catches zero-day phishing, malware, DGA (domain generation algorithm), and other threats that static blocklists miss - from the domain name string alone, with no network lookup required.

	## Files

	The model requires the following files to run:

	| File | Size | Description |
	|------|------|-------------|
	| `weights.bin` | 423 MB | Model weights (flat float32 binary) |
	| `config.json` | 1 KB | Model architecture config (layers, heads, hidden size, labels) |
	| `vocab.json` | 567 KB | BPE vocabulary (token to ID mapping) |
	| `merges.json` | 377 KB | BPE merge rules (31,173 pairs) |
	| `manifest.json` | 28 KB | Tensor layout manifest (name, shape, offset for each weight tensor) |

	All files are included in this repository. Download them to a single directory and point ZMS at it with `-weights /path/to/dir`.
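To illustrate how a flat float32 file and a layout manifest fit together, the sketch below reads one tensor by name. The manifest schema used here (a JSON list of objects with `name`, `shape`, and a byte `offset`) and the little-endian byte order are assumptions made for illustration; check `manifest.json` for the actual field names and units before relying on it.

```python
import json
import struct
from math import prod
from pathlib import Path


def read_tensor(model_dir, name):
    """Return (shape, flat list of floats) for one tensor.

    Assumed (hypothetical) manifest schema: a JSON list of
    {"name": str, "shape": [int, ...], "offset": <byte offset>}.
    """
    model_dir = Path(model_dir)
    manifest = json.loads((model_dir / "manifest.json").read_text())
    entry = next(e for e in manifest if e["name"] == name)
    count = prod(entry["shape"])  # number of float32 values in the tensor
    with open(model_dir / "weights.bin", "rb") as f:
        f.seek(entry["offset"])
        raw = f.read(4 * count)  # float32 = 4 bytes per value
    # "<" = little-endian (assumed), "f" = float32
    values = list(struct.unpack(f"<{count}f", raw))
    return entry["shape"], values
```

Because the file holds only raw floats, the manifest is the sole source of tensor names and shapes. A loader needs nothing beyond file seeks and a float32 decoder.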

	Additionally, these optional data files improve classification accuracy:

	| File | Description |
	|------|-------------|
	| `domain_categories.json` | Parent domain trust categories (1.6M+ domains mapped to hosting types) |
	| `spam_tlds.txt` | Risky TLD list (437 TLDs from hagezi spam-tlds) |

	These are available in the [ZMS repo](https://github.com/doxxcorp/ZMS).

	## How It Works

	### The Problem

	Static DNS blocklists are reactive - a malicious domain must be discovered, reported, analyzed, and added to a list before it's blocked. The window between when an attacker registers a domain and when it appears on blocklists is the zero-day gap. zmsBERT closes this gap by classifying domains from their name alone.

	### The Insight

	Attackers face an unsolvable naming problem. A malicious domain must do one of three things:
	- Deceive humans (phishing): `secure-paypal-login.xyz`, `microsoft365-verify.club`
	- Be algorithmically generated (DGA/C2): `w10b8jin2uib3a6fl.shop`, `nexozerapexidexoviro.digital`
	- Mimic legitimate patterns (typosquatting): `staemcommuniity.com`, `m1cr0s0ft.com.ru`

	In all cases, the domain string carries signal that a language model can learn.

	### Three Context Signals

	Each domain gets three context tags prepended before classification:

	1. Hosting Provider (25 categories)

	The model knows who hosts the domain, not just what it looks like. The same suspicious subdomain means different things on different infrastructure:

	```
	[CDN_ENTERPRISE] [TLD_SAFE] [GEO_TRUSTED] claim-150pro -> benign (Akamai, enterprise CDN)
	[CDN_FREE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro -> phishing (Cloudflare free tier)
	```

	Categories are split by actual abuse rates:
	- CDN_ENTERPRISE: Akamai International, Imperva, Edgecast (<3% abuse)
	- CDN_STANDARD: Fastly, Akamai Connected Cloud (~17% abuse)
	- CDN_FREE: Cloudflare (~27% abuse, free tier)
	- TECH_CURATED: Apple, Microsoft (<5% abuse)
	- TECH_CLOUD: Amazon AWS, Google Cloud (~17% abuse)
	- HOST_FREESITE: Wix, Squarespace, Vercel, Netlify (free tier, high abuse)
	- HOST_BUDGET: Hostinger, Namecheap, GoDaddy, OVH
	- Plus: ENTERPRISE_APP, CLOUD_PROVIDER, ECOMMERCE, SOCIAL, MEDIA, COMMS, FINANCE, SEARCH, CODE_HOSTING, GAMING, DYNDNS, and more

	2. TLD Risk
	- `TLD_SAFE`: .com, .org, .net, etc.
	- `TLD_RISKY`: 437 spam TLDs (.xyz, .top, .club, .live, etc.)

	3. Geographic Risk

	Based on MaxMind GeoLite2 ASN lookup of the hosting IP:
	- `GEO_HOSTILE`: RU, CN, IR, KP, SY, BY
	- `GEO_SKETCHY`: CY, VG, SC, IS, MD, LV, HK, PA (bulletproof hosting havens)
	- `GEO_MODERATE`: BR, ID, TH, PK, BD, BG, MY, etc.
	- `GEO_NEUTRAL`: US, DE, NL, CA, SG, AU, SE, etc.
	- `GEO_TRUSTED`: JP, GB, FR, IE, KR, PL, FI, CH, etc.
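The TLD and geographic tags described above can be sketched as two small lookups. The country sets are copied from the lists in this section, which are partial ("etc."), so the fallback for unlisted countries (`DEFAULT_GEO`) is an assumption. The function names and the sample risky-TLD set are also illustrative, not the ZMS implementation.

```python
# Country lists copied from the README; they are partial ("etc."),
# so any code not listed falls back to DEFAULT_GEO (an assumption).
GEO_TAGS = {
    "GEO_HOSTILE": {"RU", "CN", "IR", "KP", "SY", "BY"},
    "GEO_SKETCHY": {"CY", "VG", "SC", "IS", "MD", "LV", "HK", "PA"},
    "GEO_MODERATE": {"BR", "ID", "TH", "PK", "BD", "BG", "MY"},
    "GEO_NEUTRAL": {"US", "DE", "NL", "CA", "SG", "AU", "SE"},
    "GEO_TRUSTED": {"JP", "GB", "FR", "IE", "KR", "PL", "FI", "CH"},
}
DEFAULT_GEO = "GEO_NEUTRAL"  # hypothetical fallback for unlisted countries


def geo_tag(country_code):
    """Map an ISO 3166-1 alpha-2 country code to a geographic risk tag."""
    code = country_code.upper()
    for tag, codes in GEO_TAGS.items():
        if code in codes:
            return tag
    return DEFAULT_GEO


def tld_tag(domain, risky_tlds):
    """Return TLD_RISKY if the last label of the domain is in risky_tlds."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return "TLD_RISKY" if tld in risky_tlds else "TLD_SAFE"


# Small sample of the risky TLDs named above; the real list has 437 entries.
sample_risky = {"xyz", "top", "club", "live"}
print(tld_tag("secure-paypal-login.xyz", sample_risky), geo_tag("ru"))
# prints: TLD_RISKY GEO_HOSTILE
```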

	### Subdomain Isolation

	When a known parent domain is found, it's stripped from the input and replaced with its trust category. The model learns subdomain patterns conditioned on the parent's context:

	```
	claim-150pro.firebaseapp.com -> [HOST_FREESITE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro
	helix-go-webview.uber.com -> [ENTERPRISE_APP] [TLD_SAFE] [GEO_NEUTRAL] helix-go-webview
	```

	This prevents false positives on legitimate infrastructure subdomains (Apple courier servers, Microsoft SmartScreen, Zoom internal APIs) while still catching threats on free hosting platforms.
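A minimal sketch of the isolation step follows, assuming the parent lookup is a plain dictionary from parent domain to trust category. Longest-suffix matching, the `UNKNOWN` tag for unmatched domains, and the function names are assumptions; the README does not spell out how ZMS performs the match.

```python
def isolate_subdomain(domain, parent_categories):
    """Split a domain into (parent tag, remaining subdomain labels).

    parent_categories maps known parent domains to trust categories.
    Longest-suffix matching and the "UNKNOWN" fallback are assumptions.
    """
    name = domain.lower().rstrip(".")
    labels = name.split(".")
    # Try the longest candidate suffix first: a.b.c.com, b.c.com, c.com, ...
    for i in range(len(labels)):
        parent = ".".join(labels[i:])
        if parent in parent_categories:
            return parent_categories[parent], ".".join(labels[:i])
    return "UNKNOWN", name


def build_input(domain, parent_categories, tld_risk, geo_risk):
    """Prepend the three context tags to the isolated subdomain."""
    parent_tag, remainder = isolate_subdomain(domain, parent_categories)
    return f"[{parent_tag}] [{tld_risk}] [{geo_risk}] {remainder}".rstrip()


parents = {"firebaseapp.com": "HOST_FREESITE", "uber.com": "ENTERPRISE_APP"}
print(build_input("claim-150pro.firebaseapp.com", parents, "TLD_SAFE", "GEO_NEUTRAL"))
# prints: [HOST_FREESITE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro
```

Matching the longest suffix first means a more specific parent entry would win over a shorter one if both appear in the lookup.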

	## Categories

	| ID | Label | Description | Examples |
	|----|-------|-------------|----------|
	| 0 | benign | Legitimate domains | google.com, zoom.us |
	| 1 | malware | Malware C2, distribution | urlhaus, malware_filter sources |
	| 2 | phishing | Phishing, credential theft, fake shops | phishing_filter, hagezi fake |
	| 3 | ads | Advertising networks | adguard, goodbyeads |
	| 4 | mixed | Multi-category blocklist domains | stevenblack unified |
	| 5 | trackers | Tracking, native telemetry | hagezi tif/pro/ultimate, native device telemetry |
	| 6 | content | Gambling, adult, social media, fake news | Combined content categories |
	| 7 | dga | Domain generation algorithm | hagezi dga7, campaign-deduplicated |
	| 8 | nrd | Newly registered domains (past 7 days) | hagezi nrd7 |
	| 9 | piracy | Piracy-related domains | hagezi anti.piracy |
	| 10 | bypass | DoH/VPN/proxy bypass | hagezi doh-vpn-proxy-bypass |

	## Architecture

	- Base model: [DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) (110M parameters)
	- Classifier head: Dropout(0.1) -> Linear(768, 256) -> ReLU -> Dropout(0.1) -> Linear(256, 11)
	- Tokenizer: BPE with 31,173 merge rules + 36 special context tag tokens
	- Max sequence length: 128 tokens
	- Training data: 13.5M+ samples from 27+ blocklist sources, Tranco top-1M, live DNS traffic
	- Oversampling: Fortune 1000 domains (300x), Tranco top-10K (150x), whitelists (150x), synthetic infrastructure patterns (75x), real benign from live DNS (50x)
	- Weight format: Flat float32 binary (no PyTorch, no ONNX)
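The classifier head listed above is small enough to write out directly. The sketch below runs it at inference time, where both Dropout(0.1) layers are the identity and are therefore omitted. The final softmax, the 768-dimensional pooled input vector, and the random stand-in weights are assumptions for illustration; real weights come from `weights.bin`.

```python
import math
import random


def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_head(pooled, w1, b1, w2, b2):
    """Linear(768, 256) -> ReLU -> Linear(256, 11), then softmax (assumed)."""
    hidden = relu(linear(pooled, w1, b1))
    return softmax(linear(hidden, w2, b2))


# Random stand-in weights, only to exercise the shapes.
rng = random.Random(0)
w1 = [[rng.uniform(-0.05, 0.05) for _ in range(768)] for _ in range(256)]
b1 = [0.0] * 256
w2 = [[rng.uniform(-0.05, 0.05) for _ in range(256)] for _ in range(11)]
b2 = [0.0] * 11
pooled = [rng.uniform(-1.0, 1.0) for _ in range(768)]

probs = classifier_head(pooled, w1, b1, w2, b2)
label_id = max(range(len(probs)), key=probs.__getitem__)  # index into the 11 categories
```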

	## Performance

	| Metric | Value |
	|--------|-------|
	| Model load time | 249 ms |
	| First classification | 30-50 ms |
	| Cached classification | <1 microsecond |
	| CPU throughput | 30 domains/sec |
	| GPU throughput | 4,585 domains/sec |
	| Model size | 423 MB |
	| Binary size | ~10 MB (static Go binary) |
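The gap between the first and cached classification times is what a result cache produces: a repeated domain costs a hash lookup, not a forward pass. The sketch below shows the idea with Python's `functools.lru_cache`. The README does not describe how ZMS implements its cache, so the cache size and the `run_model` stand-in (including its placeholder return value) are hypothetical.

```python
from functools import lru_cache

calls = {"model": 0}


def run_model(domain):
    """Hypothetical stand-in for the expensive BERT forward pass."""
    calls["model"] += 1
    return ("benign", 0.5)  # placeholder result, not real model output


@lru_cache(maxsize=100_000)
def classify(domain):
    # Only cache misses reach the model; repeats are served from memory.
    return run_model(domain)


classify("example.com")
classify("example.com")
print(calls["model"])  # prints: 1
```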

	## Usage

	This model is designed for use with the [ZMS inference engine](https://github.com/doxxcorp/ZMS) - a pure Go BERT implementation with no Python or ONNX dependencies:

	```bash
	# Download the model
	zms -update-model

	# Start the DNS classifier
	zms -bind-ipv4 127.0.0.1 -listen 54

	# Query via DNS TXT
	dig @127.0.0.1 -p 54 TXT suspicious-domain.xyz +short
	# {"label":"phishing","confidence":0.995,"parent_tag":"UNKNOWN","tld_risk":"TLD_RISKY"}
	```
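On the client side, the TXT payload can be decoded as JSON. The sketch below assumes the payload has the shape shown in the example above and that it fits in a single TXT string. `dig +short` normally wraps TXT data in double quotes and escapes inner quotes, so both the quoted and bare forms are handled. The blocking threshold and the function names are illustrative, not ZMS defaults.

```python
import json


def parse_txt_answer(txt):
    """Decode one TXT answer as printed by `dig +short`.

    Accepts bare JSON (as in the README example) or the quoted form
    with escaped inner quotes that dig usually prints.
    """
    txt = txt.strip()
    if txt.startswith('"') and txt.endswith('"'):
        txt = txt[1:-1].replace('\\"', '"')
    return json.loads(txt)


def should_block(result, threshold=0.9):
    # Threshold and policy are illustrative, not ZMS defaults.
    return result["label"] != "benign" and result["confidence"] >= threshold


answer = '{"label":"phishing","confidence":0.995,"parent_tag":"UNKNOWN","tld_risk":"TLD_RISKY"}'
result = parse_txt_answer(answer)
print(result["label"], should_block(result))  # prints: phishing True
```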

	## Zero-Day Catch Examples

	```
	99.8% malware narr9-vector.aurorift.in.net [FREE_HOSTING]
	99.7% phishing pub-a7aa109e9db04b97ba2fc89747a05209.r2.dev [CLOUD_STORAGE]
	99.7% phishing reappeal-site-c9843io.vercel.app [FREE_HOSTING]
	99.6% malware solflare-blocklist.moonshot.workers.dev [FREE_HOSTING]
	99.6% phishing mintptojects211.vercel.app [FREE_HOSTING]
	99.5% malware svc2base.absolutecontinuity.in.net [FREE_HOSTING]
	99.4% phishing claim-nwomyboxpro.firebaseapp.com [FREE_HOSTING]
	99.4% phishing smartwebcontractdapps.netlify.app [FREE_HOSTING]
	99.3% phishing blocksdappsrectify.vercel.app [FREE_HOSTING]
	99.1% phishing trustwalletsupport.vercel.app [FREE_HOSTING]
	98.5% phishing claim-150pro.firebaseapp.com [FREE_HOSTING]
	```

	Correctly classified as benign (no false positives on infrastructure):
	```
	99.6% benign google.com [TECH_PLATFORM]
	99.6% benign microsoft.com [TECH_PLATFORM]
	98.4% benign apple.com [TECH_PLATFORM]
	98.8% benign zoom.us [COMMS]
	97.4% benign statuspage.io [ENTERPRISE_APP]
	```

	## Training Data Sources

	- Blocklists: 27+ sources including urlhaus, malware-filter, phishing-filter, adguard, goodbyeads, hagezi, stevenblack, and native telemetry lists
	- Benign: Tranco top-1M, hagezi whitelists, synthetic infrastructure patterns, real benign subdomains from live DNS traffic

	## License

	This model is released under the [MIT License](LICENSE) with a commercial use restriction. Non-commercial use is freely permitted. Commercial use requires written permission from Barrett Lyon / Doxx Corp. Contact legal@doxx.net for licensing.

	This model is a derivative work based on [BERT](https://github.com/google-research/bert) (Apache 2.0, Google) and [DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) (Abdelkader El Mahdaouy et al.).

	## Citation

	```
	zmsBERT: Zero-Millisecond Security DNS Classifier
	doxx.net, 2026
	https://huggingface.co/doxxnet/zmsBERT
	```

	---

	<p align="center">
	<a href="https://doxx.net">doxx.net</a> - Privacy without compromise
	</p>