---
library_name: transformers
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- tld
- embeddings
- domains
- multi-task-learning
- bert
pipeline_tag: feature-extraction
widget:
- text: "com"
- text: "io"
- text: "ai"
- text: "co.za"
model-index:
- name: TLD Embedding Model
  results:
  - task:
      type: feature-extraction
      name: TLD Embedding
    metrics:
    - type: spearman_correlation
      value: 0.8976
      name: Average Spearman Correlation
---

# TLD Embedding Model

A TLD (Top-Level Domain) embedding model that learns rich 96-dimensional representations from multiple data sources through multi-task learning. During training, the model reached an average Spearman correlation of **0.8976** across 63 target features.

## Model Overview

This TLD embedding model creates semantic representations by jointly learning from four complementary prediction tasks:

1. **Research Metrics** (18 features): Brand perception, trust scores, memorability, premium brand indices
2. **Technical Metrics** (5 features): Registration statistics, domain rankings, usage patterns
3. **Economic Indicators** (21 features): Country-level GDP sector breakdowns mapped to TLD registries
4. **Price Predictions** (19 features): 18 industry-specific market value scores plus an overall score from domain sales data

The model uses a shared BERT encoder with task-specific prediction heads, enabling the embeddings to capture semantic, technical, economic, and market value aspects of each TLD.

## Training Performance

**Final Training Results (Epoch 25/25):**
- **Overall Average Score**: 0.8976 average Spearman correlation across all 63 features
- **Training Loss**: 0.0034

**Task-Specific Performance:**
- **Research Task**: 0.80+ correlation on trust, adoption, and brand metrics
- **Technical Task**: 0.93-0.99 correlation on registration and ranking metrics
- **Economic Task**: 0.89-0.96 correlation on GDP sector predictions
- **Price Task**: 0.90-0.99 correlation on industry-specific price scores

**Best Individual Metrics:**
- `overall_score`: 0.990 Spearman correlation
- `global_top_1m_share`: 0.993 Spearman correlation
- `score_food`: 0.973 Spearman correlation
- `three_letter_registration_percent`: 0.969 Spearman correlation
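
The overall average score reported above is the mean of the per-feature Spearman correlations. A minimal sketch of this metric using `scipy`, where `preds` and `targets` are assumed to be `(num_tlds, 63)` arrays of predicted and ground-truth percentile values:

```python
import numpy as np
from scipy.stats import spearmanr

def average_spearman(preds: np.ndarray, targets: np.ndarray) -> float:
    """Mean Spearman correlation, computed independently per feature column."""
    scores = [
        spearmanr(preds[:, i], targets[:, i]).correlation
        for i in range(targets.shape[1])
    ]
    return float(np.nanmean(scores))
```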

## Architecture

- **Base Model**: `google/bert_uncased_L-4_H-256_A-4` (lightweight BERT)
- **Embedding Dimension**: 96 (sized to the amount of training data)
- **Max Sequence Length**: 8 tokens (optimized for short TLD strings)
- **MLP Hidden Size**: 192 with 15% dropout
- **Task Weighting**: Research (0.25), Technical (0.20), Economic (0.15), Price (0.40)
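
The training code is not included in this card, but a minimal sketch consistent with the numbers above follows; the `encoder` and `projection` attribute names match the usage examples later in this card, while the head layout, output sizes, and the loss-weight wiring are assumptions:

```python
import torch.nn as nn
from transformers import AutoModel

class TLDEmbeddingModel(nn.Module):
    """Sketch: shared BERT encoder, 96-d projection, four task-specific heads."""

    def __init__(self, base_model="google/bert_uncased_L-4_H-256_A-4"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)  # hidden size 256
        self.projection = nn.Linear(256, 96)                  # shared 96-d embedding

        def head(out_dim):  # task-specific MLP head: 96 -> 192 -> out_dim
            return nn.Sequential(
                nn.Linear(96, 192), nn.ReLU(), nn.Dropout(0.15),
                nn.Linear(192, out_dim),
            )

        # Output sizes follow the feature counts listed in the Model Overview
        self.heads = nn.ModuleDict({
            "research": head(18), "technical": head(5),
            "economic": head(21), "price": head(19),
        })
        # Per-task loss weights, as listed above
        self.task_weights = {"research": 0.25, "technical": 0.20,
                             "economic": 0.15, "price": 0.40}

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0, :]  # [CLS] state
        embedding = self.projection(cls)
        predictions = {name: h(embedding) for name, h in self.heads.items()}
        return embedding, predictions
```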

## Training Data Sources

### Research Data (`tld_research_data.jsonl`)
- **Coverage**: 150 TLDs with research metrics
- **Features**: Trust scores, brand associations, memorability, adoption rates
- **Source**: Survey data, brand perception studies, market research

### Technical Data (`tld_technical_data.jsonl`)
- **Coverage**: 716 TLDs with technical metrics
- **Features**: Registration patterns, domain rankings (Majestic), sales volumes
- **Source**: Registry statistics, web crawl data, domain marketplaces

### Economic Data (`country_economic_data.jsonl`)
- **Coverage**: 126 TLDs mapped to country economies
- **Features**: GDP breakdowns by 21 industry sectors
- **Source**: World Bank and IMF economic data mapped to ccTLD registries

### Price Data (`tld_price_scores_by_industry_2025.csv`)
- **Coverage**: 722 TLDs with price predictions
- **Features**: 18 industry-specific price scores plus an overall score
- **Source**: Domain sales data processed through a pairwise neural network (`compute_tld_scores_pairwise.py`)
- **Industries**: Finance, healthcare, technology, automotive, food, gaming, etc.
## Installation & Usage

### Loading the Model

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the model and tokenizer
model_name = "humbleworth/tld-embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout
```

### Getting TLD Embeddings

```python
def get_tld_embedding(tld, model, tokenizer):
    """Get a 96-dimensional embedding for a single TLD."""
    # Use the special token format if available, otherwise prefix with a dot
    tld_text = f"[TLD_{tld}]" if f"[TLD_{tld}]" in tokenizer.vocab else f".{tld}"

    inputs = tokenizer(
        tld_text,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=8,
    )

    with torch.no_grad():
        outputs = model.encoder(**inputs)                   # shared BERT encoder
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token state
        tld_embedding = model.projection(cls_embedding)     # project to 96 dims

    return tld_embedding.squeeze().numpy()

# Example
com_embedding = get_tld_embedding("com", model, tokenizer)
print(f"Embedding shape: {com_embedding.shape}")  # (96,)
```

### Batch Processing

```python
def get_tld_embeddings_batch(tlds, model, tokenizer):
    """Get embeddings for multiple TLDs efficiently."""
    # Use the special token format if available, otherwise prefix with a dot
    tld_texts = [
        f"[TLD_{tld}]" if f"[TLD_{tld}]" in tokenizer.vocab else f".{tld}"
        for tld in tlds
    ]

    inputs = tokenizer(
        tld_texts,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=8,
    )

    with torch.no_grad():
        outputs = model.encoder(**inputs)
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        tld_embeddings = model.projection(cls_embeddings)

    return tld_embeddings.numpy()

# Process multiple TLDs
tlds = ["com", "io", "ai", "co.za", "tech"]
embeddings = get_tld_embeddings_batch(tlds, model, tokenizer)
print(f"Embeddings shape: {embeddings.shape}")  # (5, 96)
```
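
Once computed, the embeddings can be compared directly. As an illustration, a cosine-similarity ranking over the batch above (the helper below is illustrative, not part of the model):

```python
import numpy as np

def most_similar(query_tld, tlds, embeddings):
    """Rank the other TLDs by cosine similarity to the query TLD."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[tlds.index(query_tld)]
    order = np.argsort(-sims)
    return [(tlds[i], float(sims[i])) for i in order if tlds[i] != query_tld]

print(most_similar("io", tlds, embeddings))
```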

## Key Features

### Multi-Task Learning Benefits
- **Robust Representations**: Joint learning across diverse tasks creates more stable embeddings
- **Transfer Learning**: Knowledge from technical metrics improves price prediction and vice versa
- **Percentile Normalization**: All features are converted to percentiles for balanced learning, as sketched below
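
A minimal sketch of the percentile normalization step, assuming each raw feature column is ranked across the TLDs that have a value for it:

```python
import numpy as np
from scipy.stats import rankdata

def to_percentiles(column: np.ndarray) -> np.ndarray:
    """Map raw feature values to [0, 1] percentile ranks, preserving NaNs."""
    out = np.full(column.shape, np.nan)
    mask = ~np.isnan(column)
    ranks = rankdata(column[mask])  # ranks 1..n, ties get the average rank
    if mask.sum() > 1:
        out[mask] = (ranks - 1) / (mask.sum() - 1)
    return out
```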

### Industry-Specific Intelligence
- **18 Industry Scores**: Specialized predictions for finance, technology, healthcare, etc.
- **Economic Mapping**: Country-level economic data enhances ccTLD understanding
- **Market Dynamics**: Real domain sales data captures market preferences

### Technical Optimizations
- **MPS Support**: Optimized for Apple Silicon (M1/M2) training
- **Gradient Accumulation**: Stable training with an effective batch size of 64 (see the sketch after this list)
- **Early Stopping**: Patience-based stopping prevents overfitting
- **Task Weighting**: Balanced learning prioritizing price prediction (40% weight)
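
For illustration, the accumulation loop these settings imply (batch size 16, 4 accumulation steps, effective batch size 64); `model`, `optimizer`, `loader`, and the `compute_loss` helper are assumed to be defined elsewhere:

```python
accum_steps = 4  # 16 x 4 = effective batch size of 64

optimizer.zero_grad()
for step, batch in enumerate(loader):
    loss = compute_loss(model, batch)  # hypothetical weighted multi-task loss
    (loss / accum_steps).backward()    # scale so accumulated gradients average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```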

## Use Cases

1. **Domain Valuation**: Use embeddings as features for ML-based domain appraisal
2. **TLD Recommendation**: Find similar TLDs for branding or investment decisions
3. **Market Analysis**: Cluster TLDs by business characteristics or market positioning (see the sketch below)
4. **Portfolio Optimization**: Analyze TLD portfolios using semantic similarity
5. **Cross-Market Analysis**: Compare TLD performance across different industries
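
As one example of the market-analysis use case, the embeddings can be clustered with scikit-learn; the cluster count here is an arbitrary illustration:

```python
from sklearn.cluster import KMeans

# `tlds` and `embeddings` from the batch-processing example above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)
for tld, label in zip(tlds, kmeans.labels_):
    print(f"{tld}: cluster {label}")
```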

## Training Configuration

**Hyperparameters:**
- Epochs: 25 (early stopping with patience 5)
- Batch Size: 16 (effective batch size 64 via gradient accumulation)
- Learning Rate: 5e-4 with warmup
- Warmup Steps: 200
- Gradient Accumulation: 4 steps
- Dropout: 15%
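
A sketch of the optimizer and warmup schedule these settings imply, using the standard `transformers` scheduler helper; the AdamW choice and the total step count are assumptions:

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

steps_per_epoch = 100  # placeholder: optimizer steps per epoch after accumulation
optimizer = AdamW(model.parameters(), lr=5e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=200,
    num_training_steps=25 * steps_per_epoch,
)
```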

**Training Command:**
```bash
python train_dual_task_embeddings.py \
    --epochs 25 \
    --batch-size 16 \
    --learning-rate 5e-4 \
    --warmup-steps 200 \
    --output-dir models/tld_embedding_model
```

## Model Files

```
tld_embedding_model/
├── config.json              # Model configuration
├── pytorch_model.bin        # Model weights
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json    # Tokenizer config
├── vocab.txt                # Vocabulary
├── special_tokens_map.json  # Special tokens
├── training_metrics.pt      # Training metrics
├── tld_embeddings.json      # Pre-computed embeddings
└── README.md                # This file
```
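
The repository also ships pre-computed embeddings, so lookups can skip the encoder entirely. A minimal loader, assuming `tld_embeddings.json` maps TLD strings to lists of 96 floats (the exact schema is an assumption):

```python
import json
import numpy as np

with open("tld_embedding_model/tld_embeddings.json") as f:
    precomputed = json.load(f)  # assumed schema: {"com": [...96 floats...], ...}

com_vec = np.asarray(precomputed["com"])
print(com_vec.shape)  # (96,)
```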

## Citation

If you use this model in your research, please cite:

```bibtex
@software{tld_embedding_2025,
  title  = {TLD Embedding Model: Multi-Task Learning for Domain Extensions},
  author = {HumbleWorth},
  year   = {2025},
  note   = {Achieved 0.8976 average Spearman correlation across 63 features},
  url    = {https://huggingface.co/humbleworth/tld-embedding}
}
```

## License

This model is released under the Apache 2.0 License.