---
library_name: transformers
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- tld
- embeddings
- domains
- multi-task-learning
- bert
pipeline_tag: feature-extraction
widget:
- text: "com"
- text: "io"
- text: "ai"
- text: "co.za"
model-index:
- name: TLD Embedding Model
  results:
  - task:
      type: feature-extraction
      name: TLD Embedding
    metrics:
    - type: spearman_correlation
      value: 0.8976
      name: Average Spearman Correlation
---
# TLD Embedding Model
A TLD (Top-Level Domain) embedding model that learns rich 96-dimensional representations from multiple data sources through multi-task learning. The model reached a **0.8976 average Spearman correlation** across 63 features during training.
## Model Overview
This TLD embedding model creates semantic representations by jointly learning from four complementary prediction tasks:
1. **Research Metrics** (18 features): Brand perception, trust scores, memorability, premium brand indices
2. **Technical Metrics** (5 features): Registration statistics, domain rankings, usage patterns
3. **Economic Indicators** (21 features): Country-level GDP sector breakdowns mapped to TLD registries
4. **Price Predictions** (18 features): Industry-specific market value scores from domain sales data
The model uses a shared BERT encoder with task-specific prediction heads, enabling the embeddings to capture semantic, technical, economic, and market value aspects of each TLD.
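The training objective itself is not reproduced in this card, but the joint setup can be sketched. The snippet below is a minimal illustration, assuming per-task MSE losses combined with the task weights listed under Architecture; the function name and tensor layout are illustrative, not the project's actual code.
```python
import torch.nn.functional as F

# Task weights as listed under Architecture below
TASK_WEIGHTS = {"research": 0.25, "technical": 0.20, "economic": 0.15, "price": 0.40}

def multi_task_loss(predictions, targets):
    """Weighted sum of per-task regression losses.

    Both arguments map task name -> tensor of shape (batch, n_features),
    e.g. (batch, 18) for the research task. MSE per task is an assumption.
    """
    return sum(
        weight * F.mse_loss(predictions[task], targets[task])
        for task, weight in TASK_WEIGHTS.items()
    )
```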
## Training Performance
**Final Training Results (Epoch 25/25):**
- **Overall Average Score**: 0.8976 (average Spearman correlation across all 63 features)
- **Training Loss**: 0.0034
**Task-Specific Performance:**
- **Research Task**: 0.80+ correlation on trust, adoption, and brand metrics
- **Technical Task**: 0.93-0.99 correlation on registration and ranking metrics
- **Economic Task**: 0.89-0.96 correlation on GDP sector predictions
- **Price Task**: 0.90-0.99 correlation on industry-specific price scores
**Best Individual Metrics:**
- `overall_score`: 0.990 Spearman correlation
- `global_top_1m_share`: 0.993 Spearman correlation
- `score_food`: 0.973 Spearman correlation
- `three_letter_registration_percent`: 0.969 Spearman correlation
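All of these scores are plain Spearman rank correlations between model predictions and ground-truth feature values; for reference, one such score can be computed with scipy (the numbers below are illustrative):
```python
from scipy.stats import spearmanr

predicted = [0.91, 0.40, 0.77, 0.12, 0.55]  # model outputs (illustrative)
actual = [0.88, 0.35, 0.80, 0.20, 0.50]     # ground-truth percentile values
rho, p_value = spearmanr(predicted, actual)
print(f"Spearman rho = {rho:.3f}")
```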
## Architecture
- **Base Model**: `google/bert_uncased_L-4_H-256_A-4` (Lightweight BERT)
- **Embedding Dimension**: 96 (optimized for data size)
- **Max Sequence Length**: 8 tokens (optimized for TLDs)
- **MLP Hidden Size**: 192 with 15% dropout
- **Task Weighting**: Research(0.25), Technical(0.20), Economic(0.15), Price(0.40)
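The exact model class ships with the repository's training code; the sketch below only illustrates the shape described above. The dimensions and the `encoder`/`projection` attribute names come from this card and the usage examples below; the head layout, activation choice, and per-task output sizes are assumptions based on the feature counts listed earlier.
```python
import torch.nn as nn
from transformers import AutoModel

# Output sizes per task, from the feature counts in the Model Overview
TASK_SIZES = {"research": 18, "technical": 5, "economic": 21, "price": 18}

class TLDEmbeddingModel(nn.Module):
    """Illustrative sketch of the architecture described above."""

    def __init__(self):
        super().__init__()
        # Lightweight BERT: 4 layers, hidden size 256, 4 attention heads
        self.encoder = AutoModel.from_pretrained("google/bert_uncased_L-4_H-256_A-4")
        # Project the 256-dim [CLS] state down to the 96-dim TLD embedding
        self.projection = nn.Linear(256, 96)
        # One MLP head per task: 96 -> 192 -> n_features, 15% dropout
        self.heads = nn.ModuleDict({
            task: nn.Sequential(
                nn.Linear(96, 192), nn.GELU(), nn.Dropout(0.15),
                nn.Linear(192, n_out),
            )
            for task, n_out in TASK_SIZES.items()
        })

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        embedding = self.projection(hidden[:, 0, :])  # [CLS] token
        return embedding, {task: head(embedding) for task, head in self.heads.items()}
```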
## Training Data Sources
### Research Data (`tld_research_data.jsonl`)
- **Coverage**: 150 TLDs with research metrics
- **Features**: Trust scores, brand associations, memorability, adoption rates
- **Source**: Survey data, brand perception studies, market research
### Technical Data (`tld_technical_data.jsonl`)
- **Coverage**: 716 TLDs with technical metrics
- **Features**: Registration patterns, domain rankings (Majestic), sales volumes
- **Source**: Registry statistics, web crawl data, domain marketplaces
### Economic Data (`country_economic_data.jsonl`)
- **Coverage**: 126 TLDs mapped to country economies
- **Features**: GDP breakdowns by 21 industry sectors
- **Source**: World Bank, IMF economic data mapped to ccTLD registries
### Price Data (`tld_price_scores_by_industry_2025.csv`)
- **Coverage**: 722 TLDs with price predictions
- **Features**: 18 industry-specific price scores plus overall score
- **Source**: Domain sales data processed through a pairwise neural network (`compute_tld_scores_pairwise.py`)
- **Industries**: Finance, healthcare, technology, automotive, food, gaming, etc.
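A quick way to explore the price table is with pandas. The column names used below (`tld`, `overall_score`, `score_food`) are inferred from the metric names reported earlier and may not match the actual file; inspect the header first.
```python
import pandas as pd

prices = pd.read_csv("tld_price_scores_by_industry_2025.csv")
print(prices.columns.tolist())  # check the real schema before relying on it

# Column names below are assumptions inferred from the metrics above
top_food = prices.sort_values("score_food", ascending=False).head(10)
print(top_food[["tld", "score_food", "overall_score"]])
```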
## Installation & Usage
### Loading the Model
```python
from transformers import AutoTokenizer, AutoModel
import torch
# Load model and tokenizer
model_name = "humbleworth/tld-embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()
```
### Getting TLD Embeddings
```python
def get_tld_embedding(tld, model, tokenizer):
    """Get the 96-dimensional embedding for a single TLD."""
    # Use the special token format if present in the vocabulary,
    # otherwise fall back to the dotted form (e.g. ".com")
    tld_text = f"[TLD_{tld}]" if f"[TLD_{tld}]" in tokenizer.get_vocab() else f".{tld}"
    inputs = tokenizer(
        tld_text,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=8,
    )
    with torch.no_grad():
        outputs = model.encoder(**inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        tld_embedding = model.projection(cls_embedding)     # 256 -> 96
    return tld_embedding.squeeze().numpy()

# Example
com_embedding = get_tld_embedding("com", model, tokenizer)
print(f"Embedding shape: {com_embedding.shape}")  # (96,)
```
### Batch Processing
```python
def get_tld_embeddings_batch(tlds, model, tokenizer):
    """Get embeddings for multiple TLDs in a single forward pass."""
    # Use the special token format if present, otherwise prefix with a dot
    tld_texts = [
        f"[TLD_{tld}]" if f"[TLD_{tld}]" in tokenizer.get_vocab() else f".{tld}"
        for tld in tlds
    ]
    inputs = tokenizer(
        tld_texts,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=8,
    )
    with torch.no_grad():
        outputs = model.encoder(**inputs)
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        tld_embeddings = model.projection(cls_embeddings)
    return tld_embeddings.numpy()

# Process multiple TLDs
tlds = ["com", "io", "ai", "co.za", "tech"]
embeddings = get_tld_embeddings_batch(tlds, model, tokenizer)
print(f"Embeddings shape: {embeddings.shape}")  # (5, 96)
```
## Key Features
### Multi-Task Learning Benefits
- **Robust Representations**: Joint learning across diverse tasks creates more stable embeddings
- **Transfer Learning**: Knowledge from technical metrics improves price prediction and vice versa
- **Percentile Normalization**: All features converted to percentiles for balanced learning
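The preprocessing code is not included in this card, but the percentile step amounts to a rank transform. A minimal sketch (not the project's exact implementation):
```python
import numpy as np
from scipy.stats import rankdata

def to_percentiles(values):
    """Map raw feature values to percentile ranks in (0, 1]."""
    values = np.asarray(values, dtype=float)
    # Average ranks handle ties; dividing by n keeps all features on one scale
    return rankdata(values, method="average") / len(values)

# Raw registration counts on wildly different scales become comparable
print(to_percentiles([150_000_000, 3_000_000, 900_000, 900_000]))
# [1.    0.75  0.375 0.375]
```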
### Industry-Specific Intelligence
- **18 Industry Scores**: Specialized predictions for finance, technology, healthcare, etc.
- **Economic Mapping**: Country-level economic data enhances ccTLD understanding
- **Market Dynamics**: Real domain sales data captures market preferences
### Technical Optimizations
- **MPS Support**: Optimized for Apple Silicon (M1/M2) training
- **Gradient Accumulation**: Stable training with effective batch size of 64
- **Early Stopping**: Prevents overfitting with patience-based stopping
- **Task Weighting**: Balanced learning prioritizing price prediction (40% weight)
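The gradient accumulation pattern is standard; below is a self-contained toy version of the loop (the real training uses the multi-task model and loss, and also steps an LR scheduler):
```python
import torch
from torch import nn

# Toy stand-ins so the pattern runs end to end
model = nn.Linear(96, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
data = [(torch.randn(16, 96), torch.randn(16, 1)) for _ in range(8)]

ACCUM_STEPS = 4  # batch size 16 x 4 steps = effective batch of 64

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / ACCUM_STEPS).backward()  # scale so accumulated grads average out
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```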
## Use Cases
1. **Domain Valuation**: Use embeddings as features for ML-based domain appraisal
2. **TLD Recommendation**: Find similar TLDs for branding or investment decisions
3. **Market Analysis**: Cluster TLDs by business characteristics or market positioning
4. **Portfolio Optimization**: Analyze TLD portfolios using semantic similarity
5. **Cross-Market Analysis**: Compare TLD performance across different industries
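As an example of use case 2, TLD recommendation reduces to nearest-neighbor search over the embeddings. This sketch reuses `get_tld_embeddings_batch` and the `model`/`tokenizer` objects from the usage section above:
```python
import numpy as np

tlds = ["com", "io", "ai", "co.za", "tech", "net", "org"]
embs = get_tld_embeddings_batch(tlds, model, tokenizer)

# Cosine similarity of every TLD against a query TLD
query = embs[tlds.index("io")]
sims = embs @ query / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query))
for tld, sim in sorted(zip(tlds, sims), key=lambda pair: -pair[1]):
    print(f"{tld:>6}: {sim:.3f}")
```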
## Training Configuration
**Optimal Hyperparameters (Based on Data Analysis):**
- Epochs: 25 (early stopping at patience=5)
- Batch Size: 16 (effective 64 with accumulation)
- Learning Rate: 5e-4 with warmup
- Warmup Steps: 200
- Gradient Accumulation: 4 steps
- Dropout: 15%
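The card does not state the decay shape of the schedule; a common pairing with warmup, shown here as an assumption, is the linear warmup/decay helper from transformers (`total_steps` is illustrative):
```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-4)
total_steps = 2_000  # optimizer steps over 25 epochs (illustrative)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=200, num_training_steps=total_steps
)
```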
**Training Command:**
```bash
python train_dual_task_embeddings.py \
--epochs 25 \
--batch-size 16 \
--learning-rate 5e-4 \
--warmup-steps 200 \
--output-dir models/tld_embedding_model
```
## Model Files
```
tld_embedding_model/
β”œβ”€β”€ config.json # Model configuration
β”œβ”€β”€ pytorch_model.bin # Model weights
β”œβ”€β”€ tokenizer.json # Tokenizer
β”œβ”€β”€ tokenizer_config.json # Tokenizer config
β”œβ”€β”€ vocab.txt # Vocabulary
β”œβ”€β”€ special_tokens_map.json # Special tokens
β”œβ”€β”€ training_metrics.pt # Training metrics
β”œβ”€β”€ tld_embeddings.json # Pre-computed embeddings
└── README.md # This file
```
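Since `tld_embeddings.json` ships with the model, embeddings can also be looked up without running the encoder at all. Its schema is not documented here; the snippet below assumes a plain `{tld: [96 floats]}` mapping.
```python
import json
import numpy as np

# Assumed schema: {"com": [...96 floats...], "io": [...], ...}
with open("tld_embedding_model/tld_embeddings.json") as f:
    precomputed = json.load(f)

com = np.asarray(precomputed["com"])
print(com.shape)  # (96,)
```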
## Citation
If you use this model in your research, please cite:
```bibtex
@software{tld_embedding_2025,
title = {TLD Embedding Model: Multi-Task Learning for Domain Extensions},
author = {HumbleWorth},
year = {2025},
note = {Achieved 0.8976 average Spearman correlation across 63 features},
url = {https://huggingface.co/humbleworth/tld-embedding}
}
```
## License
This model is released under the Apache 2.0 License.