File size: 8,607 Bytes

0cc27d5

---
library_name: transformers
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- tld
- embeddings
- domains
- multi-task-learning
- bert
pipeline_tag: feature-extraction
widget:
- text: "com"
- text: "io"  
- text: "ai"
- text: "co.za"
model-index:
- name: TLD Embedding Model
  results:
  - task:
      type: feature-extraction
      name: TLD Embedding
    metrics:
    - type: spearman_correlation
      value: 0.8976
      name: Average Spearman Correlation
---

# TLD Embedding Model

A state-of-the-art TLD (Top-Level Domain) embedding model that learns rich 96-dimensional representations from multiple data sources through multi-task learning. This model achieved an exceptional **0.8976 average Spearman correlation** across 63 features during training.

## Model Overview

This TLD embedding model creates semantic representations by jointly learning from four complementary prediction tasks:

1. **Research Metrics** (18 features): Brand perception, trust scores, memorability, premium brand indices
2. **Technical Metrics** (5 features): Registration statistics, domain rankings, usage patterns  
3. **Economic Indicators** (21 features): Country-level GDP sector breakdowns mapped to TLD registries
4. **Price Predictions** (18 features): Industry-specific market value scores from domain sales data

The model uses a shared BERT encoder with task-specific prediction heads, enabling the embeddings to capture semantic, technical, economic, and market value aspects of each TLD.

## Training Performance

**Final Training Results (Epoch 25/25):**
- **Overall Average Score**: 0.8976 (89.76% Spearman correlation)
- **Training Loss**: 0.0034

**Task-Specific Performance:**
- **Research Task**: 0.80+ correlation on trust, adoption, and brand metrics
- **Technical Task**: 0.93-0.99 correlation on registration and ranking metrics  
- **Economic Task**: 0.89-0.96 correlation on GDP sector predictions
- **Price Task**: 0.90-0.99 correlation on industry-specific price scores

**Best Individual Metrics:**
- `overall_score`: 0.990 Spearman correlation
- `global_top_1m_share`: 0.993 Spearman correlation
- `score_food`: 0.973 Spearman correlation
- `three_letter_registration_percent`: 0.969 Spearman correlation

## Architecture

- **Base Model**: `google/bert_uncased_L-4_H-256_A-4` (Lightweight BERT)
- **Embedding Dimension**: 96 (optimized for data size)
- **Max Sequence Length**: 8 tokens (optimized for TLDs)
- **MLP Hidden Size**: 192 with 15% dropout
- **Task Weighting**: Research(0.25), Technical(0.20), Economic(0.15), Price(0.40)

## Training Data Sources

### Research Data (`tld_research_data.jsonl`)
- **Coverage**: 150 TLDs with research metrics
- **Features**: Trust scores, brand associations, memorability, adoption rates
- **Source**: Survey data, brand perception studies, market research

### Technical Data (`tld_technical_data.jsonl`) 
- **Coverage**: 716 TLDs with technical metrics
- **Features**: Registration patterns, domain rankings (Majestic), sales volumes
- **Source**: Registry statistics, web crawl data, domain marketplaces

### Economic Data (`country_economic_data.jsonl`)
- **Coverage**: 126 TLDs mapped to country economies  
- **Features**: GDP breakdowns by 21 industry sectors
- **Source**: World Bank, IMF economic data mapped to ccTLD registries

### Price Data (`tld_price_scores_by_industry_2025.csv`)
- **Coverage**: 722 TLDs with price predictions
- **Features**: 18 industry-specific price scores plus overall score
- **Source**: Domain sales data processed through pairwise neural network (`compute_tld_scores_pairwise.py`)
- **Industries**: Finance, healthcare, technology, automotive, food, gaming, etc.

## Installation & Usage

### Loading the Model

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
model_name = "humbleworth/tld-embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()
```

### Getting TLD Embeddings

```python
def get_tld_embedding(tld, model, tokenizer):
    """Get 96-dimensional embedding for a single TLD"""
    # Use special token format if available, otherwise prefix with dot
    tld_text = f"[TLD_{tld}]" if f"[TLD_{tld}]" in tokenizer.vocab else f".{tld}"
    
    inputs = tokenizer(
        tld_text,
        return_tensors="pt",
        padding="max_length", 
        truncation=True,
        max_length=8
    )
    
    with torch.no_grad():
        outputs = model.encoder(**inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        tld_embedding = model.projection(cls_embedding)
    
    return tld_embedding.squeeze().numpy()

# Example
com_embedding = get_tld_embedding("com", model, tokenizer)
print(f"Embedding shape: {com_embedding.shape}")  # (96,)
```

### Batch Processing

```python
def get_tld_embeddings_batch(tlds, model, tokenizer):
    """Get embeddings for multiple TLDs efficiently"""
    # Use special token format if available, otherwise prefix with dot
    tld_texts = [f"[TLD_{tld}]" if f"[TLD_{tld}]" in tokenizer.vocab else f".{tld}" for tld in tlds]
    
    inputs = tokenizer(
        tld_texts,
        return_tensors="pt",
        padding="max_length",
        truncation=True, 
        max_length=8
    )
    
    with torch.no_grad():
        outputs = model.encoder(**inputs)
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        tld_embeddings = model.projection(cls_embeddings)
    
    return tld_embeddings.numpy()

# Process multiple TLDs
tlds = ["com", "io", "ai", "co.za", "tech"]
embeddings = get_tld_embeddings_batch(tlds, model, tokenizer)
print(f"Embeddings shape: {embeddings.shape}")  # (5, 96)
```

## Key Features

### Multi-Task Learning Benefits
- **Robust Representations**: Joint learning across diverse tasks creates more stable embeddings
- **Transfer Learning**: Knowledge from technical metrics improves price prediction and vice versa
- **Percentile Normalization**: All features converted to percentiles for balanced learning

### Industry-Specific Intelligence  
- **18 Industry Scores**: Specialized predictions for finance, technology, healthcare, etc.
- **Economic Mapping**: Country-level economic data enhances ccTLD understanding
- **Market Dynamics**: Real domain sales data captures market preferences

### Technical Optimizations
- **MPS Support**: Optimized for Apple Silicon (M1/M2) training
- **Gradient Accumulation**: Stable training with effective batch size of 64
- **Early Stopping**: Prevents overfitting with patience-based stopping
- **Task Weighting**: Balanced learning prioritizing price prediction (40% weight)

## Use Cases

1. **Domain Valuation**: Use embeddings as features for ML-based domain appraisal
2. **TLD Recommendation**: Find similar TLDs for branding or investment decisions  
3. **Market Analysis**: Cluster TLDs by business characteristics or market positioning
4. **Portfolio Optimization**: Analyze TLD portfolios using semantic similarity
5. **Cross-Market Analysis**: Compare TLD performance across different industries

## Training Configuration

**Optimal Hyperparameters (Based on Data Analysis):**
- Epochs: 25 (early stopping at patience=5)
- Batch Size: 16 (effective 64 with accumulation) 
- Learning Rate: 5e-4 with warmup
- Warmup Steps: 200
- Gradient Accumulation: 4 steps
- Dropout: 15%

**Training Command:**
```bash
python train_dual_task_embeddings.py \
    --epochs 25 \
    --batch-size 16 \
    --learning-rate 5e-4 \
    --warmup-steps 200 \
    --output-dir models/tld_embedding_model
```

## Model Files

```
tld_embedding_model/
├── config.json                    # Model configuration
├── pytorch_model.bin              # Model weights  
├── tokenizer.json                 # Tokenizer
├── tokenizer_config.json          # Tokenizer config
├── vocab.txt                      # Vocabulary
├── special_tokens_map.json        # Special tokens
├── training_metrics.pt            # Training metrics
├── tld_embeddings.json           # Pre-computed embeddings
└── README.md                      # This file
```

## Citation

If you use this model in your research, please cite:

```bibtex
@software{tld_embedding_2025,
  title = {TLD Embedding Model: Multi-Task Learning for Domain Extensions},
  author = {HumbleWorth},
  year = {2025},
  note = {Achieved 0.8976 average Spearman correlation across 63 features},
  url = {https://huggingface.co/humbleworth/tld-embedding}
}
```

## License

This model is released under the Apache 2.0 License.