---
library_name: transformers
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- tld
- embeddings
- domains
- multi-task-learning
- bert
pipeline_tag: feature-extraction
widget:
- text: "com"
- text: "io"
- text: "ai"
- text: "co.za"
model-index:
- name: TLD Embedding Model
  results:
  - task:
      type: feature-extraction
      name: TLD Embedding
    metrics:
    - type: spearman_correlation
      value: 0.8976
      name: Average Spearman Correlation
---
# TLD Embedding Model
A TLD (Top-Level Domain) embedding model that learns rich 96-dimensional representations from multiple data sources through multi-task learning. The model reached a **0.8976 average Spearman correlation** across 63 features during training.
## Model Overview
This TLD embedding model creates semantic representations by jointly learning from four complementary prediction tasks:
1. **Research Metrics** (18 features): Brand perception, trust scores, memorability, premium brand indices
2. **Technical Metrics** (5 features): Registration statistics, domain rankings, usage patterns
3. **Economic Indicators** (21 features): Country-level GDP sector breakdowns mapped to TLD registries
4. **Price Predictions** (18 features): Industry-specific market value scores from domain sales data
The model uses a shared BERT encoder with task-specific prediction heads, enabling the embeddings to capture semantic, technical, economic, and market value aspects of each TLD.
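The training objective itself is not reproduced in this card, but the joint setup can be sketched. The snippet below is a minimal illustration, assuming per-task MSE losses combined with the task weights listed under Architecture; the function name and tensor layout are illustrative, not the project's actual code.
```python
import torch.nn.functional as F

# Task weights as listed under Architecture below
TASK_WEIGHTS = {"research": 0.25, "technical": 0.20, "economic": 0.15, "price": 0.40}

def multi_task_loss(predictions, targets):
    """Weighted sum of per-task regression losses.

    Both arguments map task name -> tensor of shape (batch, n_features),
    e.g. (batch, 18) for the research task. MSE per task is an assumption.
    """
    return sum(
        weight * F.mse_loss(predictions[task], targets[task])
        for task, weight in TASK_WEIGHTS.items()
    )
```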
## Training Performance
**Final Training Results (Epoch 25/25):**
- **Overall Average Score**: 0.8976 (average Spearman correlation across all 63 features)
- **Training Loss**: 0.0034
**Task-Specific Performance:**
- **Research Task**: 0.80+ correlation on trust, adoption, and brand metrics
- **Technical Task**: 0.93-0.99 correlation on registration and ranking metrics
- **Economic Task**: 0.89-0.96 correlation on GDP sector predictions
- **Price Task**: 0.90-0.99 correlation on industry-specific price scores
**Best Individual Metrics:**
- `overall_score`: 0.990 Spearman correlation
- `global_top_1m_share`: 0.993 Spearman correlation
- `score_food`: 0.973 Spearman correlation
- `three_letter_registration_percent`: 0.969 Spearman correlation
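All of these scores are plain Spearman rank correlations between model predictions and ground-truth feature values; for reference, one such score can be computed with scipy (the numbers below are illustrative):
```python
from scipy.stats import spearmanr

predicted = [0.91, 0.40, 0.77, 0.12, 0.55]  # model outputs (illustrative)
actual = [0.88, 0.35, 0.80, 0.20, 0.50]     # ground-truth percentile values
rho, p_value = spearmanr(predicted, actual)
print(f"Spearman rho = {rho:.3f}")
```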
## Architecture
- **Base Model**: `google/bert_uncased_L-4_H-256_A-4` (Lightweight BERT)
- **Embedding Dimension**: 96 (optimized for data size)
- **Max Sequence Length**: 8 tokens (optimized for TLDs)
- **MLP Hidden Size**: 192 with 15% dropout
- **Task Weighting**: Research(0.25), Technical(0.20), Economic(0.15), Price(0.40)
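The exact model class ships with the repository's training code; the sketch below only illustrates the shape described above. The dimensions and the `encoder`/`projection` attribute names come from this card and the usage examples below; the head layout, activation choice, and per-task output sizes are assumptions based on the feature counts listed earlier.
```python
import torch.nn as nn
from transformers import AutoModel

# Output sizes per task, from the feature counts in the Model Overview
TASK_SIZES = {"research": 18, "technical": 5, "economic": 21, "price": 18}

class TLDEmbeddingModel(nn.Module):
    """Illustrative sketch of the architecture described above."""

    def __init__(self):
        super().__init__()
        # Lightweight BERT: 4 layers, hidden size 256, 4 attention heads
        self.encoder = AutoModel.from_pretrained("google/bert_uncased_L-4_H-256_A-4")
        # Project the 256-dim [CLS] state down to the 96-dim TLD embedding
        self.projection = nn.Linear(256, 96)
        # One MLP head per task: 96 -> 192 -> n_features, 15% dropout
        self.heads = nn.ModuleDict({
            task: nn.Sequential(
                nn.Linear(96, 192), nn.GELU(), nn.Dropout(0.15),
                nn.Linear(192, n_out),
            )
            for task, n_out in TASK_SIZES.items()
        })

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        embedding = self.projection(hidden[:, 0, :])  # [CLS] token
        return embedding, {task: head(embedding) for task, head in self.heads.items()}
```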
## Training Data Sources
### Research Data (`tld_research_data.jsonl`)
- **Coverage**: 150 TLDs with research metrics
- **Features**: Trust scores, brand associations, memorability, adoption rates
- **Source**: Survey data, brand perception studies, market research
### Technical Data (`tld_technical_data.jsonl`)
- **Coverage**: 716 TLDs with technical metrics
- **Features**: Registration patterns, domain rankings (Majestic), sales volumes
- **Source**: Registry statistics, web crawl data, domain marketplaces
### Economic Data (`country_economic_data.jsonl`)
- **Coverage**: 126 TLDs mapped to country economies
- **Features**: GDP breakdowns by 21 industry sectors
- **Source**: World Bank, IMF economic data mapped to ccTLD registries
### Price Data (`tld_price_scores_by_industry_2025.csv`)
- **Coverage**: 722 TLDs with price predictions
- **Features**: 18 industry-specific price scores plus overall score
- **Source**: Domain sales data processed through a pairwise neural network (`compute_tld_scores_pairwise.py`)
- **Industries**: Finance, healthcare, technology, automotive, food, gaming, etc.
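A quick way to explore the price table is with pandas. The column names used below (`tld`, `overall_score`, `score_food`) are inferred from the metric names reported earlier and may not match the actual file; inspect the header first.
```python
import pandas as pd

prices = pd.read_csv("tld_price_scores_by_industry_2025.csv")
print(prices.columns.tolist())  # check the real schema before relying on it

# Column names below are assumptions inferred from the metrics above
top_food = prices.sort_values("score_food", ascending=False).head(10)
print(top_food[["tld", "score_food", "overall_score"]])
```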
## Installation & Usage
### Loading the Model
```python
from transformers import AutoTokenizer, AutoModel
import torch
# Load model and tokenizer
model_name = "humbleworth/tld-embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()
```
### Getting TLD Embeddings
```python
def get_tld_embedding(tld, model, tokenizer):
    """Get the 96-dimensional embedding for a single TLD."""
    # Use the special token format if present in the vocabulary,
    # otherwise fall back to the dotted form (e.g. ".com")
    tld_text = f"[TLD_{tld}]" if f"[TLD_{tld}]" in tokenizer.get_vocab() else f".{tld}"
    inputs = tokenizer(
        tld_text,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=8,
    )
    with torch.no_grad():
        outputs = model.encoder(**inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        tld_embedding = model.projection(cls_embedding)     # 256 -> 96
    return tld_embedding.squeeze().numpy()

# Example
com_embedding = get_tld_embedding("com", model, tokenizer)
print(f"Embedding shape: {com_embedding.shape}")  # (96,)
```
### Batch Processing
```python
def get_tld_embeddings_batch(tlds, model, tokenizer):
    """Get embeddings for multiple TLDs in a single forward pass."""
    # Use the special token format if present, otherwise prefix with a dot
    tld_texts = [
        f"[TLD_{tld}]" if f"[TLD_{tld}]" in tokenizer.get_vocab() else f".{tld}"
        for tld in tlds
    ]
    inputs = tokenizer(
        tld_texts,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=8,
    )
    with torch.no_grad():
        outputs = model.encoder(**inputs)
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        tld_embeddings = model.projection(cls_embeddings)
    return tld_embeddings.numpy()

# Process multiple TLDs
tlds = ["com", "io", "ai", "co.za", "tech"]
embeddings = get_tld_embeddings_batch(tlds, model, tokenizer)
print(f"Embeddings shape: {embeddings.shape}")  # (5, 96)
```
## Key Features
### Multi-Task Learning Benefits
- **Robust Representations**: Joint learning across diverse tasks creates more stable embeddings
- **Transfer Learning**: Knowledge from technical metrics improves price prediction and vice versa
- **Percentile Normalization**: All features converted to percentiles for balanced learning
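The preprocessing code is not included in this card, but the percentile step amounts to a rank transform. A minimal sketch (not the project's exact implementation):
```python
import numpy as np
from scipy.stats import rankdata

def to_percentiles(values):
    """Map raw feature values to percentile ranks in (0, 1]."""
    values = np.asarray(values, dtype=float)
    # Average ranks handle ties; dividing by n keeps all features on one scale
    return rankdata(values, method="average") / len(values)

# Raw registration counts on wildly different scales become comparable
print(to_percentiles([150_000_000, 3_000_000, 900_000, 900_000]))
# [1.    0.75  0.375 0.375]
```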
### Industry-Specific Intelligence
- **18 Industry Scores**: Specialized predictions for finance, technology, healthcare, etc.
- **Economic Mapping**: Country-level economic data enhances ccTLD understanding
- **Market Dynamics**: Real domain sales data captures market preferences
### Technical Optimizations
- **MPS Support**: Optimized for Apple Silicon (M1/M2) training
- **Gradient Accumulation**: Stable training with effective batch size of 64
- **Early Stopping**: Prevents overfitting with patience-based stopping
- **Task Weighting**: Balanced learning prioritizing price prediction (40% weight)
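The gradient accumulation pattern is standard; below is a self-contained toy version of the loop (the real training uses the multi-task model and loss, and also steps an LR scheduler):
```python
import torch
from torch import nn

# Toy stand-ins so the pattern runs end to end
model = nn.Linear(96, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
data = [(torch.randn(16, 96), torch.randn(16, 1)) for _ in range(8)]

ACCUM_STEPS = 4  # batch size 16 x 4 steps = effective batch of 64

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / ACCUM_STEPS).backward()  # scale so accumulated grads average out
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```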
## Use Cases
1. **Domain Valuation**: Use embeddings as features for ML-based domain appraisal
2. **TLD Recommendation**: Find similar TLDs for branding or investment decisions
3. **Market Analysis**: Cluster TLDs by business characteristics or market positioning
4. **Portfolio Optimization**: Analyze TLD portfolios using semantic similarity
5. **Cross-Market Analysis**: Compare TLD performance across different industries
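As an example of use case 2, TLD recommendation reduces to nearest-neighbor search over the embeddings. This sketch reuses `get_tld_embeddings_batch` and the `model`/`tokenizer` objects from the usage section above:
```python
import numpy as np

tlds = ["com", "io", "ai", "co.za", "tech", "net", "org"]
embs = get_tld_embeddings_batch(tlds, model, tokenizer)

# Cosine similarity of every TLD against a query TLD
query = embs[tlds.index("io")]
sims = embs @ query / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query))
for tld, sim in sorted(zip(tlds, sims), key=lambda pair: -pair[1]):
    print(f"{tld:>6}: {sim:.3f}")
```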
## Training Configuration
**Optimal Hyperparameters (Based on Data Analysis):**
- Epochs: 25 (early stopping at patience=5)
- Batch Size: 16 (effective 64 with accumulation)
- Learning Rate: 5e-4 with warmup
- Warmup Steps: 200
- Gradient Accumulation: 4 steps
- Dropout: 15%
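The card does not state the decay shape of the schedule; a common pairing with warmup, shown here as an assumption, is the linear warmup/decay helper from transformers (`total_steps` is illustrative):
```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-4)
total_steps = 2_000  # optimizer steps over 25 epochs (illustrative)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=200, num_training_steps=total_steps
)
```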
**Training Command:**
```bash
python train_dual_task_embeddings.py \
--epochs 25 \
--batch-size 16 \
--learning-rate 5e-4 \
--warmup-steps 200 \
--output-dir models/tld_embedding_model
```
## Model Files
```
tld_embedding_model/
β”œβ”€β”€ config.json # Model configuration
β”œβ”€β”€ pytorch_model.bin # Model weights
β”œβ”€β”€ tokenizer.json # Tokenizer
β”œβ”€β”€ tokenizer_config.json # Tokenizer config
β”œβ”€β”€ vocab.txt # Vocabulary
β”œβ”€β”€ special_tokens_map.json # Special tokens
β”œβ”€β”€ training_metrics.pt # Training metrics
β”œβ”€β”€ tld_embeddings.json # Pre-computed embeddings
└── README.md # This file
```
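Since `tld_embeddings.json` ships with the model, embeddings can also be looked up without running the encoder at all. Its schema is not documented here; the snippet below assumes a plain `{tld: [96 floats]}` mapping.
```python
import json
import numpy as np

# Assumed schema: {"com": [...96 floats...], "io": [...], ...}
with open("tld_embedding_model/tld_embeddings.json") as f:
    precomputed = json.load(f)

com = np.asarray(precomputed["com"])
print(com.shape)  # (96,)
```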
## Citation
If you use this model in your research, please cite:
```bibtex
@software{tld_embedding_2025,
title = {TLD Embedding Model: Multi-Task Learning for Domain Extensions},
author = {HumbleWorth},
year = {2025},
note = {Achieved 0.8976 average Spearman correlation across 63 features},
url = {https://huggingface.co/humbleworth/tld-embedding}
}
```
## License
This model is released under the Apache 2.0 License.