---
library_name: transformers
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- tld
- embeddings
- domains
- multi-task-learning
- bert
pipeline_tag: feature-extraction
widget:
- text: "com"
- text: "io"  
- text: "ai"
- text: "co.za"
model-index:
- name: TLD Embedding Model
  results:
  - task:
      type: feature-extraction
      name: TLD Embedding
    metrics:
    - type: spearman_correlation
      value: 0.8976
      name: Average Spearman Correlation
---

# TLD Embedding Model

A TLD (top-level domain) embedding model that learns 96-dimensional representations from multiple data sources through multi-task learning. It reached a **0.8976 average Spearman correlation** across 63 features during training.

## Model Overview

This TLD embedding model creates semantic representations by jointly learning from four complementary prediction tasks:

1. **Research Metrics** (18 features): Brand perception, trust scores, memorability, premium brand indices
2. **Technical Metrics** (5 features): Registration statistics, domain rankings, usage patterns  
3. **Economic Indicators** (21 features): Country-level GDP sector breakdowns mapped to TLD registries
4. **Price Predictions** (19 features): 18 industry-specific market value scores plus an overall score, derived from domain sales data

The model uses a shared BERT encoder with task-specific prediction heads, enabling the embeddings to capture semantic, technical, economic, and market value aspects of each TLD.

## Training Performance

**Final Training Results (Epoch 25/25):**
- **Overall Average Score**: 0.8976 (average Spearman correlation across all features)
- **Training Loss**: 0.0034

**Task-Specific Performance:**
- **Research Task**: 0.80+ correlation on trust, adoption, and brand metrics
- **Technical Task**: 0.93-0.99 correlation on registration and ranking metrics  
- **Economic Task**: 0.89-0.96 correlation on GDP sector predictions
- **Price Task**: 0.90-0.99 correlation on industry-specific price scores

**Best Individual Metrics:**
- `overall_score`: 0.990 Spearman correlation
- `global_top_1m_share`: 0.993 Spearman correlation
- `score_food`: 0.973 Spearman correlation
- `three_letter_registration_percent`: 0.969 Spearman correlation
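
The scores above are Spearman rank correlations, averaged over features for the overall number. The following is a minimal pure-Python sketch of that calculation on toy data. The helper names (`rank`, `pearson`, `spearman`, `average_spearman`) and the sample values are illustrative assumptions, not taken from the training code.

```python
from statistics import mean

def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(pred, target):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(rank(pred), rank(target))

def average_spearman(preds_by_feature, targets_by_feature):
    """Average the per-feature Spearman correlations."""
    return mean(
        spearman(preds_by_feature[f], targets_by_feature[f])
        for f in preds_by_feature
    )

# Toy predictions and targets for two features (values are made up)
preds = {"overall_score": [0.1, 0.4, 0.35, 0.8], "score_food": [0.9, 0.2, 0.5, 0.7]}
targets = {"overall_score": [0.2, 0.5, 0.3, 0.9], "score_food": [0.7, 0.1, 0.6, 0.9]}
avg = average_spearman(preds, targets)
```

Because only ranks matter, a prediction that orders TLDs correctly scores well even when its absolute values are off.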

## Architecture

- **Base Model**: `google/bert_uncased_L-4_H-256_A-4` (Lightweight BERT)
- **Embedding Dimension**: 96 (optimized for data size)
- **Max Sequence Length**: 8 tokens (optimized for TLDs)
- **MLP Hidden Size**: 192 with 15% dropout
- **Task Weighting**: Research(0.25), Technical(0.20), Economic(0.15), Price(0.40)
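
The task weights listed above sum to 1.0. One way they could be combined into a single training objective is a weighted sum of per-task losses, sketched below. The function name `combine_task_losses` and the sample loss values are assumptions for illustration; the actual training script may combine losses differently.

```python
# Task weights as listed in the architecture summary
TASK_WEIGHTS = {
    "research": 0.25,
    "technical": 0.20,
    "economic": 0.15,
    "price": 0.40,
}

def combine_task_losses(task_losses, weights=TASK_WEIGHTS):
    """Weighted sum of the per-task losses that are present in this batch."""
    return sum(weights[task] * loss for task, loss in task_losses.items())

# Hypothetical per-task losses for one batch
batch_losses = {"research": 0.010, "technical": 0.004, "economic": 0.006, "price": 0.002}
total_loss = combine_task_losses(batch_losses)
```

With these weights, an error on the price task moves the total loss more than an equally sized error on any other task.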

## Training Data Sources

### Research Data (`tld_research_data.jsonl`)
- **Coverage**: 150 TLDs with research metrics
- **Features**: Trust scores, brand associations, memorability, adoption rates
- **Source**: Survey data, brand perception studies, market research

### Technical Data (`tld_technical_data.jsonl`) 
- **Coverage**: 716 TLDs with technical metrics
- **Features**: Registration patterns, domain rankings (Majestic), sales volumes
- **Source**: Registry statistics, web crawl data, domain marketplaces

### Economic Data (`country_economic_data.jsonl`)
- **Coverage**: 126 TLDs mapped to country economies  
- **Features**: GDP breakdowns by 21 industry sectors
- **Source**: World Bank, IMF economic data mapped to ccTLD registries

### Price Data (`tld_price_scores_by_industry_2025.csv`)
- **Coverage**: 722 TLDs with price predictions
- **Features**: 18 industry-specific price scores plus overall score
- **Source**: Domain sales data processed through pairwise neural network (`compute_tld_scores_pairwise.py`)
- **Industries**: Finance, healthcare, technology, automotive, food, gaming, etc.

## Installation & Usage

### Loading the Model

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
model_name = "humbleworth/tld-embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()
```

### Getting TLD Embeddings

```python
def get_tld_embedding(tld, model, tokenizer):
    """Get 96-dimensional embedding for a single TLD"""
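    # Use special token format if available, otherwise prefix with dot.
    # get_vocab() also covers added tokens, which slow tokenizers omit from .vocab.
    special_token = f"[TLD_{tld}]"
    tld_text = special_token if special_token in tokenizer.get_vocab() else f".{tld}"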
    
    inputs = tokenizer(
        tld_text,
        return_tensors="pt",
        padding="max_length", 
        truncation=True,
        max_length=8
    )
    
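    with torch.no_grad():
        # Assumes the loaded model exposes `encoder` and `projection` submodules
        outputs = model.encoder(**inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token state
        tld_embedding = model.projection(cls_embedding)     # project to 96 dims
    
    # Drop the batch dimension and move to CPU before converting to NumPy
    return tld_embedding.squeeze(0).cpu().numpy()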

# Example
com_embedding = get_tld_embedding("com", model, tokenizer)
print(f"Embedding shape: {com_embedding.shape}")  # (96,)
```

### Batch Processing

```python
def get_tld_embeddings_batch(tlds, model, tokenizer):
    """Get embeddings for multiple TLDs efficiently"""
    # Use special token format if available, otherwise prefix with dot.
    # Build the vocabulary once instead of once per TLD.
    vocab = tokenizer.get_vocab()
    tld_texts = [f"[TLD_{tld}]" if f"[TLD_{tld}]" in vocab else f".{tld}" for tld in tlds]
    
    inputs = tokenizer(
        tld_texts,
        return_tensors="pt",
        padding="max_length",
        truncation=True, 
        max_length=8
    )
    
    with torch.no_grad():
        outputs = model.encoder(**inputs)
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        tld_embeddings = model.projection(cls_embeddings)
    
    return tld_embeddings.cpu().numpy()

# Process multiple TLDs
tlds = ["com", "io", "ai", "co.za", "tech"]
embeddings = get_tld_embeddings_batch(tlds, model, tokenizer)
print(f"Embeddings shape: {embeddings.shape}")  # (5, 96)
```

## Key Features

### Multi-Task Learning Benefits
- **Robust Representations**: Joint learning across diverse tasks creates more stable embeddings
- **Transfer Learning**: Knowledge from technical metrics improves price prediction and vice versa
- **Percentile Normalization**: All features converted to percentiles for balanced learning
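
Percentile normalization puts features with very different scales (registration counts, rankings, survey scores) on a common 0 to 1 range. The exact percentile definition used in training is not stated here, so the sketch below assumes a mid-rank percentile and leaves missing values as `None`. The function name `to_percentiles` and the sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right

def to_percentiles(values):
    """Map raw feature values to mid-rank percentiles in [0, 1]; None stays None."""
    present = sorted(v for v in values if v is not None)
    n = len(present)
    result = []
    for v in values:
        if v is None:
            # Missing value: kept as None (how gaps are handled is an assumption)
            result.append(None)
            continue
        below = bisect_left(present, v)            # values strictly smaller than v
        below_or_equal = bisect_right(present, v)  # values smaller than or equal to v
        result.append((below + below_or_equal) / (2 * n))
    return result

# Hypothetical raw registration counts for five TLDs, one of them missing
registrations = [10, 200, None, 30, 200]
percentiles = to_percentiles(registrations)
```

Tied values receive the same percentile, and an extreme outlier has no more influence than any other top-ranked value.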

### Industry-Specific Intelligence  
- **18 Industry Scores**: Specialized predictions for finance, technology, healthcare, etc.
- **Economic Mapping**: Country-level economic data enhances ccTLD understanding
- **Market Dynamics**: Real domain sales data captures market preferences

### Technical Optimizations
- **MPS Support**: Optimized for Apple Silicon (M1/M2) training
- **Gradient Accumulation**: Stable training with effective batch size of 64
- **Early Stopping**: Prevents overfitting with patience-based stopping
- **Task Weighting**: Balanced learning prioritizing price prediction (40% weight)

## Use Cases

1. **Domain Valuation**: Use embeddings as features for ML-based domain appraisal
2. **TLD Recommendation**: Find similar TLDs for branding or investment decisions  
3. **Market Analysis**: Cluster TLDs by business characteristics or market positioning
4. **Portfolio Optimization**: Analyze TLD portfolios using semantic similarity
5. **Cross-Market Analysis**: Compare TLD performance across different industries
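
Several of these use cases come down to comparing embeddings. A minimal sketch of nearest-neighbor lookup by cosine similarity follows. It uses toy 3-dimensional vectors in place of real 96-dimensional embeddings, and the function names `cosine_similarity` and `most_similar` are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, embeddings, top_k=3):
    """Return the top_k (tld, similarity) pairs closest to the query TLD."""
    query_vec = embeddings[query]
    scored = [
        (tld, cosine_similarity(query_vec, vec))
        for tld, vec in embeddings.items()
        if tld != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy vectors standing in for real embeddings (values are made up)
toy_embeddings = {
    "com": [0.9, 0.1, 0.0],
    "net": [0.8, 0.2, 0.1],
    "io": [0.1, 0.9, 0.2],
    "ai": [0.0, 0.8, 0.3],
}
neighbors = most_similar("io", toy_embeddings, top_k=2)
```

In practice, the dictionary would be filled from `get_tld_embeddings_batch` or from the pre-computed embeddings file, whose exact layout is not documented here.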

## Training Configuration

**Optimal Hyperparameters (Based on Data Analysis):**
- Epochs: 25 (early stopping with patience=5)
- Batch Size: 16 (effective 64 with accumulation) 
- Learning Rate: 5e-4 with warmup
- Warmup Steps: 200
- Gradient Accumulation: 4 steps
- Dropout: 15%
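
Two of these settings interact with the training loop: gradient accumulation sets the effective batch size, and patience controls when early stopping triggers. The sketch below illustrates both with a hypothetical `EarlyStopping` helper. The monitored metric (a validation score where higher is better) and the score sequence are assumptions, not values from the actual run.

```python
class EarlyStopping:
    """Stop once the score has failed to improve for `patience` epochs in a row."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

BATCH_SIZE = 16
ACCUMULATION_STEPS = 4
# Gradients from 4 batches of 16 are summed before each optimizer step
effective_batch_size = BATCH_SIZE * ACCUMULATION_STEPS

stopper = EarlyStopping(patience=5)
# Made-up validation scores, one per epoch
scores = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.84, 0.82, 0.81]
stopped_at = None
for epoch, score in enumerate(scores, start=1):
    if stopper.step(score):
        stopped_at = epoch
        break
```

In this toy sequence the best score arrives at epoch 3, so the fifth non-improving epoch, and therefore the stop, falls on epoch 8.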

**Training Command:**
```bash
python train_dual_task_embeddings.py \
    --epochs 25 \
    --batch-size 16 \
    --learning-rate 5e-4 \
    --warmup-steps 200 \
    --output-dir models/tld_embedding_model
```

## Model Files

```
tld_embedding_model/
├── config.json                    # Model configuration
├── pytorch_model.bin              # Model weights
├── tokenizer.json                 # Tokenizer
├── tokenizer_config.json          # Tokenizer config
├── vocab.txt                      # Vocabulary
├── special_tokens_map.json        # Special tokens
├── training_metrics.pt            # Training metrics
├── tld_embeddings.json            # Pre-computed embeddings
└── README.md                      # This file
```

## Citation

If you use this model in your research, please cite:

```bibtex
@software{tld_embedding_2025,
  title = {TLD Embedding Model: Multi-Task Learning for Domain Extensions},
  author = {HumbleWorth},
  year = {2025},
  note = {Achieved 0.8976 average Spearman correlation across 63 features},
  url = {https://huggingface.co/humbleworth/tld-embedding}
}
```

## License

This model is released under the Apache 2.0 License.