quantumly committed on
Commit
15c7ad4
·
verified ·
1 Parent(s): b4b4d01

Update README.md

Files changed (1)
  1. README.md +249 -87
README.md CHANGED
@@ -1,6 +1,7 @@
1
  ---
2
  license: mit
3
  library_name: xgboost
 
4
  tags:
5
  - tabular-regression
6
  - ens
@@ -8,13 +9,16 @@ tags:
8
  - web3
9
  - domain-names
10
  - price-prediction
 
11
  datasets:
12
  - quantumly/ens-appraiser-data
 
13
  metrics:
14
  - r_squared
15
  - mape
 
16
  model-index:
17
- - name: ENS Appraiser v0
18
  results:
19
  - task:
20
  type: tabular-regression
@@ -24,92 +28,230 @@ model-index:
24
  type: quantumly/ens-appraiser-data
25
  metrics:
26
  - type: r_squared
27
- value: TODO
28
- name: R² (log USD)
29
  - type: median_ape
30
- value: TODO
31
- name: Median APE
32
  - type: rmse
33
- value: TODO
34
- name: RMSE (log USD)
35
  ---
36
 
37
- # ENS Appraiser
38
 
39
- A gradient-boosted regression model that predicts the USD sale price of an
40
- ENS (`.eth`) domain name. This is the v0 baseline: handcrafted features +
41
- mpnet semantic embeddings + KNN comparable-sale aggregates.
42
 
43
- > ⚠️ Numeric values in the YAML frontmatter (`TODO`) and the **Evaluation**
44
- > table below should be filled in with the values from the training
45
- > notebook's `=== v0 SUMMARY ===` block. The notebook prints exact
46
- > R²/RMSE/MAPE for train/val/test — copy them here before merging.
47
 
48
- ## Model Details
49
 
50
- - **Architecture**: XGBoost regressor on `log(sale_price_usd)`
51
- - **Features**: ~150 total
52
- - 15 handcrafted (length, character composition, palindrome/repetition flags)
53
- - 8 wordlist hits (Wikipedia, GeoNames, US firstnames, ISO 3166, stock tickers, SEC EDGAR, Wiktionary EN)
54
- - ~45 grails club memberships (binary per club)
55
- - 1 trademark conflict flag (active USPTO marks in Nice classes 9/35/36/38/41/42/45)
56
- - 3 holder behavior (name age, registrant portfolio size, lifetime transfer count)
57
- - 5 macro context (Fear & Greed, ETH TVL, ETH stablecoin mcap, ETH DEX volume, NFT marketplace fees)
58
- - 64 PCA-reduced mpnet embedding dims (from `sentence-transformers/all-mpnet-base-v2`)
59
- - 8 KNN comparable-sale aggregates (count, mean/median/p90 log price of nearest neighbors with prior sales)
60
- - **Training data**: ENS secondary sales, Jan 2022 — May 2024 (~384k events)
61
- - **Validation**: temporal split (80/10/10 by sale date, no shuffle to prevent KNN-comp leakage)
62
 
63
- ## Evaluation
64
 
65
- | Split | R² (log USD) | RMSE (log USD) | Median APE |
66
- |---|---|---|---|
67
- | Train | TODO | TODO | TODO |
68
- | Val | TODO | TODO | TODO |
69
- | Test | TODO | TODO | TODO |
70
 
71
- ## Intended Use
72
 
73
- This model predicts sale prices for ENS `.eth` domain names. It's intended
74
- for **research and analytics**, not for live trading or as a price oracle.
75
 
76
- **Use cases it handles well:**
77
 
78
- - Bulk valuation of mid-tier names ($50–$5,000 range)
79
- - Identifying obviously over- or under-priced listings
80
- - Portfolio-level mark-to-market for ENS holdings
81
- - Sanity-checking listing prices
82
 
83
- **Use cases where it's weak:**
84
 
85
- - Celebrity/brand-name premium tail ($50k+ sales) — the model lacks fame data
86
- - Future names not in training distribution (post-May 2024)
87
- - Names registered through pathways the subgraph doesn't index
88
- - Blur-marketplace sales — Alchemy `getNFTSales` v2 doesn't index Blur for ENS,
89
- so the training data has a marketplace coverage gap
90
 
91
  ## Limitations
92
 
93
- - **Sales coverage limitation**: Training data covers Jan 2022 – May 2024 only.
94
- Alchemy's `getNFTSales` v2 endpoint truncates ENS coverage at block 19768978
95
- (~May 2024) and doesn't index Blur sales.
96
- - **Celebrity tail**: Names with significant out-of-band brand value
97
- (`coinbase.eth`, `vault.eth`) will be systematically underpriced because
98
- the model lacks features for "is this a famous person/brand."
99
- - **Out-of-distribution labels**: Pure-digit labels (`0001`), punycode/emoji,
100
- and l33tspeak get less benefit from mpnet embeddings since they were
101
- out-of-distribution for the pretrained model.
102
- - **Time drift**: ENS market regime shifts in 2024-2025 are not captured.
103
- Predictions for current names will lag those regime shifts.
104
 
105
- ## How to Use
106
 
107
  ```python
108
  from huggingface_hub import hf_hub_download
109
  import xgboost as xgb
110
  import pickle
111
 
112
- # Download model artifacts
113
  model_path = hf_hub_download(
114
  repo_id="quantumly/ens-appraiser",
115
  filename="v0_appraiser_xgb.json",
@@ -119,48 +261,68 @@ pca_path = hf_hub_download(
119
  filename="v0_pca_mpnet.pkl",
120
  )
121
 
122
- # Load
123
  booster = xgb.Booster()
124
  booster.load_model(model_path)
125
  with open(pca_path, "rb") as f:
126
  pca = pickle.load(f)
127
 
128
- # To make predictions you'll also need:
129
- # 1. The mpnet embedding for the label (run sentence-transformers all-mpnet-base-v2)
130
- # 2. The handcrafted features, wordlist lookups, club memberships, trademark check
131
- # 3. Macro context for the prediction date (ETH price, Fear & Greed, etc.)
132
- # 4. KNN comp lookup against the FAISS index from the dataset repo
133
  #
134
- # See the inference notebook in the dataset repo for the full pipeline.
135
  ```
136
 
137
- ## Training Data
 
138
 
139
- Built from the [`quantumly/ens-appraiser-data`](https://huggingface.co/datasets/quantumly/ens-appraiser-data)
140
- dataset, which assembles:
141
-
142
- - ENS on-chain registrations, renewals, transfers (The Graph subgraph)
143
- - ENS secondary sales (Alchemy `getNFTSales`)
144
- - CoinGecko hourly OHLC for label denomination
145
- - Discourse forums for governance signal
146
- - DefiLlama for macro signals (TVL, stablecoin mcap, DEX volume, NFT marketplace fees)
147
- - USPTO trademark registry for brand-conflict flags
148
- - Grails club memberships
149
- - Wiktionary, Wikipedia, GeoNames, US Census, SEC EDGAR for wordlist hits
150
- - `sentence-transformers/all-mpnet-base-v2` for semantic embeddings
151
 
152
  ## Citation
153
 
154
  ```bibtex
155
  @misc{ens_appraiser_2026,
156
- author = {Drobnič, Nejc},
157
- title = {ENS Appraiser},
158
- year = {2026},
159
  publisher = {Hugging Face},
160
- url = {https://huggingface.co/quantumly/ens-appraiser}
161
  }
162
  ```
163
 
164
- ## Contact
165
 
166
- nejc@nejc.dev
 
1
  ---
2
  license: mit
3
  library_name: xgboost
4
+ pipeline_tag: tabular-regression
5
  tags:
6
  - tabular-regression
7
  - ens
 
9
  - web3
10
  - domain-names
11
  - price-prediction
12
+ - nft
13
  datasets:
14
  - quantumly/ens-appraiser-data
15
+ base_model: sentence-transformers/all-mpnet-base-v2
16
  metrics:
17
  - r_squared
18
  - mape
19
+ - rmse
20
  model-index:
21
+ - name: ENS Appraiser v0.2
22
  results:
23
  - task:
24
  type: tabular-regression
 
28
  type: quantumly/ens-appraiser-data
29
  metrics:
30
  - type: r_squared
31
+ value: 0.3081
32
+ name: R² (log USD, test)
33
  - type: median_ape
34
+ value: 1.383
35
+ name: Median APE (test)
36
  - type: rmse
37
+ value: 1.5469
38
+ name: RMSE (log USD, test)
39
  ---
40
 
41
+ # ENS Appraiser v0.2
42
 
43
+ A gradient-boosted regressor that predicts the USD sale price of an
44
+ ENS (`.eth`) domain name from on-chain history, semantic embeddings of the
45
+ label, and macro-market context.
46
 
47
+ This is the **v0 baseline**: handcrafted features + mpnet PCA + KNN
48
+ comparable-sale aggregates. Built to establish an honest, leakage-free
49
+ floor that future versions improve on.
 
50
 
51
+ ## Quick numbers
52
 
53
+ Trained on ~265k ENS secondary sales (Jan 2022 – Sep 2023), evaluated on
54
+ 2,744 sales in **Q1–Q2 2024** (held out by date, never seen during training):
55
 
56
+ | Split | n | R² (log USD) | RMSE (log USD) | Median APE | Bias |
57
+ |-------|--------|--------------|----------------|------------|-------|
58
+ | Train | 265,240 | 0.7700 | 0.7744 | 32.5% | +0.000 |
59
+ | Val | 3,545 | 0.6602 | 1.0678 | 57.0% | +0.203 |
60
+ | Test | 2,744 | **0.3081** | 1.5469 | 138.3% | +0.732 |
61
 
62
+ **Plain-English read:** for a typical mid-tier name in test, the model is
63
+ within ~2× of the actual sale price. The long tail (celebrity names,
64
+ 3-letter premiums, regime shifts) is where it misses, often by 100×+ in
65
+ either direction.
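
The metrics in the table above could be computed along the following lines. This is a minimal sketch, not the notebook's code: it assumes R² and RMSE are taken on natural-log USD prices and Median APE on raw USD prices, and `regression_metrics` is a hypothetical helper name.

```python
import math
from statistics import median


def regression_metrics(actual_usd, predicted_usd):
    """Return R2 and RMSE in log-USD space and median APE in USD space."""
    log_actual = [math.log(a) for a in actual_usd]
    log_pred = [math.log(p) for p in predicted_usd]
    n = len(log_actual)
    mean_actual = sum(log_actual) / n

    # Residual and total sums of squares, both in log space.
    ss_res = sum((a - p) ** 2 for a, p in zip(log_actual, log_pred))
    ss_tot = sum((a - mean_actual) ** 2 for a in log_actual)

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        # Absolute percentage error on raw USD, as a fraction (1.383 = 138.3%).
        "median_ape": median(
            abs(p - a) / a for a, p in zip(actual_usd, predicted_usd)
        ),
    }


metrics = regression_metrics([10.0, 100.0, 1000.0], [20.0, 200.0, 2000.0])
print(metrics)
```

Under the natural-log assumption, a log-space RMSE of 1.5469 corresponds to a multiplicative error of about exp(1.5469) ≈ 4.7×. RMSE is dominated by the long tail, so it sits well above the typical mid-tier miss.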
 
66
 
67
+ ## What's good
68
 
69
+ - **Mid-tier names, $50–$5,000 range:** usually within ~2× of actual.
70
+ - **Length and character composition:** strong signals captured well.
71
+ The model knows 3-letter ASCII names are premium and 12-letter random
72
+ handles are cheap.
73
+ - **Wordlist hits:** matches against Wikipedia, GeoNames, US first names,
74
+ stock tickers, and SEC EDGAR are picked up correctly. `paris.eth` is
75
+ flagged as a city, `nike.eth` as a brand.
76
+ - **Comparable-sale anchoring:** the top two features are `knn_mean_log`
77
+ and `knn_p90_log` — the model leans heavily on "what did similar names
78
+ sell for recently?" which is the right intuition for valuation.
79
 
80
+ ## What's not
81
 
82
+ - **Celebrity / brand premium:** a name's value to a known buyer
83
+ (Coinbase wanting `coinbase.eth`, a luxury brand wanting their mark)
84
+ is invisible to this model. It can detect that `nike.eth` is a brand
85
+ word, but not that the sale price reflects Nike's interest specifically.
86
+ - **3-letter premium tail:** names like `mph.eth`, `uma.eth` sold for
87
+ $20k–$40k in test; the model predicted $100–$200. The training set
88
+ underweights short premiums because most sales there are 5+ letters.
89
+ - **Regime shift on test:** test set median price is ~4× higher than
90
+ training median due to the 2023 → 2024 ENS market shift. Recency-weighted
91
+ training (1-year half-life) helps but doesn't fully close the gap.
92
+ - **Bidirectional errors:** worst predictions split roughly evenly
93
+ between under-prediction (hot names the model didn't recognize) and
94
+ over-prediction (cold names that just didn't move). 138% MedianAPE is
95
+ honest but uncomfortable.
96
 
97
+ ## How it's built
98
 
99
+ | Component | Detail |
100
+ |---|---|
101
+ | Algorithm | XGBoost regressor (170 boosted trees, max_depth=7) |
102
+ | Target | `log(sale_price_usd)` |
103
+ | Features | 146 total |
104
+ | Training data | 265,240 sales, Jan 2022 – Sep 2023 |
105
+ | Training time | ~10 min on a single A100 |
106
+ | Model size | 3.3 MB |
107
+
108
+ ### Feature breakdown
109
+
110
+ - **Handcrafted (15):** length, n_digits, n_letters, n_special, palindrome,
111
+ is_all_digits, is_all_letters, is_ascii, has_unicode, starts/ends_digit,
112
+ max_char_run, n_unique_chars
113
+ - **Wordlist hits (8):** Wikipedia titles, GeoNames cities, US first names,
114
+ ISO 3166 countries, stock tickers, SEC EDGAR companies, Wiktionary EN,
115
+ plus a `wordlist_hits` total
116
+ - **Grails clubs (~45):** binary membership in each curated `.eth` club
117
+ (`999club`, `pre-punks`, `palindromes`, `pokemon_gen1`, etc.)
118
+ - **Trademark conflict (1):** active USPTO mark in Nice classes 9, 35, 36,
119
+ 38, 41, 42, 45 with matching `mark_text_norm`
120
+ - **Holder behavior (2):** `name_age_days`, `prior_transfer_count`
121
+ (leakage-safe — only counts transfers strictly before the sale block)
122
+ - **Macro context (5):** Fear & Greed Index, ETH chain TVL, ETH stablecoin
123
+ market cap, ETH DEX volume, total NFT marketplace fees on the sale day
124
+ - **mpnet PCA (64):** 768-dim `all-mpnet-base-v2` embeddings of the label,
125
+ PCA-reduced to 64 dims (95% explained variance)
126
+ - **KNN comparable sales (8):** for each label, FAISS-retrieve top-50
127
+ semantic neighbors (HNSW index), filter near-duplicates (sim > 0.999),
128
+ take the most-recent prior sale of each, aggregate as `knn_count`,
129
+ `knn_mean_log`, `knn_median_log`, `knn_p90_log`, `knn_max_sim`,
130
+ `knn_min_sim`, `knn_log_max`, `knn_log_min`. **Strict leakage prevention:**
131
+ only neighbors with sales **before** the current sale's date count.
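
The KNN comparable-sale aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: a brute-force cosine search stands in for the FAISS HNSW index, the p90 uses a nearest-rank percentile, and `knn_comp_features` and its input layout are hypothetical. How the real pipeline encodes "no comps found" is not stated in this card, so the sketch only returns the count in that case.

```python
import math
from datetime import date
from statistics import mean, median


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def knn_comp_features(query_vec, query_date, candidates, k=50, dup_sim=0.999):
    """Aggregate prior sales of the k most similar labels.

    candidates: list of (embedding, [(sale_date, log_price), ...]) per label.
    """
    # Brute-force top-k by cosine similarity (stand-in for the HNSW lookup).
    scored = [(cosine(query_vec, vec), sales) for vec, sales in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    sims, logs = [], []
    for sim, sales in scored[:k]:
        if sim > dup_sim:
            continue  # drop near-duplicates of the query label
        # Leakage guard: keep only sales strictly before the query sale date.
        prior = [(d, lp) for d, lp in sales if d < query_date]
        if not prior:
            continue
        sims.append(sim)
        logs.append(max(prior, key=lambda s: s[0])[1])  # most recent prior sale

    if not logs:
        return {"knn_count": 0}
    ranked = sorted(logs)
    p90 = ranked[math.ceil(0.9 * len(ranked)) - 1]  # nearest-rank percentile
    return {
        "knn_count": len(logs),
        "knn_mean_log": mean(logs),
        "knn_median_log": median(logs),
        "knn_p90_log": p90,
        "knn_max_sim": max(sims),
        "knn_min_sim": min(sims),
        "knn_log_max": ranked[-1],
        "knn_log_min": ranked[0],
    }


example = knn_comp_features(
    [1.0, 0.0],
    date(2023, 7, 1),
    [
        ([0.9, 0.1], [(date(2023, 6, 1), 6.0)]),
        ([0.8, 0.3], [(date(2022, 1, 1), 4.0)]),
    ],
)
print(example["knn_count"], example["knn_mean_log"])
```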
132
+
133
+ ### Top 10 features by gain
134
+
135
+ | Rank | Feature | Gain |
136
+ |---:|---|---:|
137
+ | 1 | `knn_mean_log` | 1,714 |
138
+ | 2 | `knn_p90_log` | 1,613 |
139
+ | 3 | `len` | 1,364 |
140
+ | 4 | `in_wikipedia` | 1,052 |
141
+ | 5 | `is_all_digits` | 944 |
142
+ | 6 | `knn_median_log` | 604 |
143
+ | 7 | `n_digits` | 338 |
144
+ | 8 | `pca_000` | 289 |
145
+ | 9 | `n_clubs` | 282 |
146
+ | 10 | `ends_digit` | 277 |
147
+
148
+ Four of the top ten are KNN-comp or PCA features, which means the
149
+ embedding pipeline is doing real work — it's not just paying for itself,
150
+ it's the dominant signal alongside length.
151
+
152
+ ## Training data + leakage controls
153
+
154
+ Built from the [`quantumly/ens-appraiser-data`](https://huggingface.co/datasets/quantumly/ens-appraiser-data)
155
+ dataset:
156
+
157
+ - **Sales labels:** Alchemy `getNFTSales` for ENS BaseRegistrar + NameWrapper
158
+ contracts. Wei amounts converted to USD via CoinGecko hourly OHLC at
159
+ the sale's block timestamp. **Coverage gap:** Alchemy `getNFTSales` v2
160
+ truncates at block 19,768,978 (May 2024) and does not index Blur
161
+ marketplace sales. v0 ships with this gap; closing it is a v1 priority.
162
+ - **Registrations + transfers:** The Graph's [ENS subgraph](https://thegraph.com/explorer/subgraphs/5XqPmWe6gjyrJtFn9cLy237i4cWw2j9HcUJEXsP5qGtH).
163
+ - **Wordlists:** Wiktionary dumps, Wikipedia EN article titles, GeoNames
164
+ `cities500`, US Census baby names, NASDAQ Trader ticker dumps,
165
+ SEC EDGAR company tickers, ISO 3166 country list.
166
+ - **Macro:** alternative.me Fear & Greed Index, DefiLlama (TVL, stablecoin
167
+ mcap, DEX volume, NFT marketplace fees).
168
+ - **Trademarks:** USPTO Trademark Case Files Dataset (annual research dump).
169
+ - **Embeddings:** `sentence-transformers/all-mpnet-base-v2`, encoded once
170
+ for all 3.5M ENS labels in the dataset.
171
+
172
+ ### Leakage controls
173
+
174
+ The first version of this model accidentally leaked future information
175
+ through `lifetime_transfer_count` (it counted *all* transfers ever for a
176
+ labelhash, including transfers that happened *after* the sale being
177
+ predicted). The leaky model showed **train R² 0.81 / test R² −0.29**, the
178
+ classic catastrophic-overfit signature: a negative R² means the model did
179
+ worse on held-out data than a constant prediction of the mean.
180
+
181
+ The current model uses `prior_transfer_count`, which only counts transfers
182
+ where `transfer_block < sale_block` per row. It moved to rank #11 in
183
+ feature importance (was #1 by 3.3×). KNN comparable-sale features have a
184
+ similar safeguard: a neighbor's sale only counts if it happened strictly
185
+ before the sale being predicted.
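
The difference between the leaky and the leakage-safe transfer count can be illustrated with a short sketch. The function names and the plain-list input are hypothetical; the only rule taken from this card is `transfer_block < sale_block`.

```python
from bisect import bisect_left


def lifetime_transfer_count(transfer_blocks):
    """Leaky variant: counts every transfer, including ones after the sale."""
    return len(transfer_blocks)


def prior_transfer_count(transfer_blocks, sale_block):
    """Leakage-safe variant: counts transfers with transfer_block < sale_block."""
    blocks = sorted(transfer_blocks)
    # bisect_left returns the number of entries strictly below sale_block.
    return bisect_left(blocks, sale_block)


transfers = [100, 200, 200, 300]
print(lifetime_transfer_count(transfers), prior_transfer_count(transfers, 200))
```

A transfer in the same block as the sale is excluded here because the comparison is strict, which matches the "strictly before" wording above.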
186
+
187
+ ### Train/Val/Test split
188
+
189
+ Fixed-window temporal split:
190
+
191
+ - **Train:** sales with `sale_date < 2023-10-01`
192
+ - **Val:** sales 2023-10-01 → 2023-12-31
193
+ - **Test:** sales 2024-01-01 onwards
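
The split boundaries listed above translate into a simple date rule. A minimal sketch, assuming sale dates are available as `datetime.date` values; `assign_split` is a hypothetical helper name.

```python
from datetime import date

VAL_START = date(2023, 10, 1)   # first day of the validation window
TEST_START = date(2024, 1, 1)   # first day of the test window


def assign_split(sale_date):
    """Map a sale date to 'train', 'val', or 'test' using fixed windows."""
    if sale_date < VAL_START:
        return "train"
    if sale_date < TEST_START:
        return "val"
    return "test"


print(assign_split(date(2023, 9, 30)), assign_split(date(2023, 12, 31)))
```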
194
+
195
+ This prevents the v0.1 mistake of training on 2022 prices and asking the
196
+ model to extrapolate to a 2024 market regime that's ~4× more expensive
197
+ on average. Val and test are in the same regime so val RMSE is a
198
+ meaningful proxy for test.
199
+
200
+ Training rows are weighted with an exponential recency decay (1-year
201
+ half-life, normalized to mean=1.0) so the model leans on 2023 dynamics
202
+ without throwing away the older data entirely.
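
The recency weighting can be sketched as an exponential decay with a 365-day half-life, rescaled so the weights average to 1.0. The choice of `reference_date` (here, the last training day) and the helper name `recency_weights` are assumptions; the card does not say which date the decay is anchored to.

```python
from datetime import date, timedelta


def recency_weights(sale_dates, reference_date, half_life_days=365.0):
    """Exponential recency weights, normalized to a mean of 1.0."""
    # A sale exactly one half-life older than the reference gets half the raw weight.
    raw = [
        0.5 ** ((reference_date - d).days / half_life_days)
        for d in sale_dates
    ]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


cutoff = date(2023, 9, 30)  # assumed anchor: last day of the training window
weights = recency_weights([cutoff, cutoff - timedelta(days=365)], cutoff)
print(weights)
```

Weights of this kind would typically be passed to XGBoost as per-row sample weights, so older sales still contribute but count for less.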
203
+
204
+ ## Intended use
205
+
206
+ This model is intended for **research and analytics**, not as a price
207
+ oracle and not for live trading.
208
+
209
+ **Reasonable uses:**
210
+
211
+ - Bulk valuation of mid-tier ENS portfolios for tax/accounting purposes
212
+ - Identifying obviously over- or under-listed names on secondary markets
213
+ - Sanity-checking a listing price before posting
214
+ - Producing comparable-sale ranges for negotiation context
215
+
216
+ **Out of scope:**
217
+
218
+ - Pricing 3-letter, 1-2 letter, or otherwise-premium names with confidence
219
+ - Pricing celebrity / known-brand names where the buyer pool is concentrated
220
+ - Predicting prices for names in the post-May-2024 marketplace mix
221
+ (Blur dominance, marketplace fee changes)
222
+ - Any high-stakes financial decision based on a single point estimate
223
 
224
  ## Limitations
225
 
226
+ - **Sales coverage**: Jan 2022 – May 2024 only, no Blur. ~2 years of recent
227
+ sales (mid-2024 onwards) are missing entirely from training. Closing
228
+ this gap requires either a new sales source (Reservoir/SimpleHash both
229
+ defunct as of 2024–2025) or direct `eth_getLogs` decoding of Seaport,
230
+ Blur, X2Y2, LooksRare events, planned for v1.
231
+ - **Celebrity premium**: there's no feature here for "is this a famous
232
+ person/place/thing?" beyond Wikipedia-title matching. v1 adds
233
+ LLM-derived structured features (`fame_score`, `name_kind`,
234
+ `crypto_relevance`, `brand_collision_risk`) which should close most
235
+ of this gap.
236
+ - **Out-of-distribution labels**: pure-digit labels (`0001`),
237
+ punycode/emoji, and l33tspeak get less benefit from mpnet embeddings
238
+ since they're out of distribution for the pretrained model. Length and
239
+ charset features partially compensate.
240
+ - **Time drift**: the ENS market shifts noticeably every 6–12 months as
241
+ marketplace dominance, fee structures, and DAO actions move. Predictions
242
+ on names sold "right now" will lag any regime shift since the training
243
+ cutoff.
244
+ - **Test-set thinness**: only 2,744 sales meet the $10 floor and post-Jan-2024
245
+ cutoff. The reported test R² has roughly ±0.08 95% CI — useful as a
246
+ ballpark, not a precise number.
247
 
248
+ ## How to use
249
 
250
  ```python
251
  from huggingface_hub import hf_hub_download
252
  import xgboost as xgb
253
  import pickle
254
 
 
255
  model_path = hf_hub_download(
256
  repo_id="quantumly/ens-appraiser",
257
  filename="v0_appraiser_xgb.json",
 
261
  filename="v0_pca_mpnet.pkl",
262
  )
263
 
 
264
  booster = xgb.Booster()
265
  booster.load_model(model_path)
266
  with open(pca_path, "rb") as f:
267
  pca = pickle.load(f)
268
 
269
+ # Inference also requires:
270
+ # 1. mpnet embedding for the label (sentence-transformers/all-mpnet-base-v2)
271
+ # 2. Handcrafted/wordlist/club/trademark/holder/macro features
272
+ # 3. KNN comp lookup against the dataset repo's FAISS index
 
273
  #
274
+ # A self-contained inference notebook is planned in the dataset repo.
275
  ```
276
 
277
+ The 146 features expected by the booster are listed in `v0_metadata.json`
278
+ under `feature_cols`, in the exact order required by `xgb.DMatrix`.
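
Because column order matters, it may help to build each inference row from the `feature_cols` list and not from dictionary order. The sketch below uses only the standard library; `ordered_row` and `log_to_usd` are hypothetical helpers, the three-column `feature_cols` list is a stand-in for the real 146-entry list, and the back-transform assumes the target is a natural log.

```python
import math


def ordered_row(features, feature_cols):
    """Return feature values as floats in the exact order of feature_cols."""
    missing = [c for c in feature_cols if c not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(features[c]) for c in feature_cols]


def log_to_usd(log_price):
    """Convert a predicted log price back to USD (assumes natural log)."""
    return math.exp(log_price)


# Stand-in for json.load(open("v0_metadata.json"))["feature_cols"].
feature_cols = ["len", "n_digits", "knn_mean_log"]
row = ordered_row({"knn_mean_log": 5.2, "len": 5, "n_digits": 0}, feature_cols)
print(row)
```

The resulting row can then be wrapped in a 2-D NumPy array and passed to `xgb.DMatrix(..., feature_names=feature_cols)` before calling `booster.predict`, with `log_to_usd` applied to the output.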
279
 
280
+ ## Reproducibility
281
+
282
+ The training notebook ([`v0_appraiser_v2.ipynb`](https://huggingface.co/datasets/quantumly/ens-appraiser-data/blob/main/notebooks/v0_appraiser_v2.ipynb))
283
+ runs end-to-end on a Colab A100 high-RAM instance in ~25 minutes:
284
+
285
+ 1. Downloads all source parquets from the dataset repo
286
+ 2. Reconstructs USD prices via CoinGecko hourly OHLC join
287
+ 3. Resolves labels for both BaseRegistrar and NameWrapper sales
288
+ 4. Computes all features
289
+ 5. Builds HNSW index for KNN
290
+ 6. Trains XGBoost with early stopping
291
+ 7. Saves model + metadata + diagnostics
292
+ 8. Uploads to this model repo
293
+
294
+ All randomness is seeded (`seed=42` for XGBoost, PCA, sample weights).
295
+
296
+ ## Roadmap
297
+
298
+ **v1 priorities** (in expected R² delta order):
299
+
300
+ 1. **LLM-derived features** — Llama 3.1 8B local inference over all 3.5M
301
+ labels, extracting `fame_score`, `name_kind`, `cultural_origin`,
302
+ `crypto_relevance`, `brand_collision_risk`, plus a description-embedding.
303
+ Expected delta: +0.05–0.10 test R².
304
+ 2. **Recent sales backfill** via direct `eth_getLogs` decoding of
305
+ Seaport / Blur / Wyvern / X2Y2 / LooksRare events. Closes the
306
+ May 2024 → present coverage gap and adds Blur. Expected delta:
307
+ +0.03–0.06 test R² and a much bigger test set.
308
+ 3. **Multi-embedding ensemble** — concatenate mpnet with `bge-base-en-v1.5`
309
+ and `e5-base-v2`, PCA the joint space. Expected delta: +0.02–0.04.
310
+ 4. **Cross-encoder reranker** for KNN comps. Expected delta: +0.02–0.03.
311
+ 5. **Contrastive fine-tuning** of mpnet on price-similarity triplets.
312
+ Expected delta: +0.03–0.05.
313
 
314
  ## Citation
315
 
316
  ```bibtex
317
  @misc{ens_appraiser_2026,
318
+ author = {Drobnič, Nejc},
319
+ title = {ENS Appraiser v0.2},
320
+ year = {2026},
321
  publisher = {Hugging Face},
322
+ url = {https://huggingface.co/quantumly/ens-appraiser}
323
  }
324
  ```
325
 
326
+ ## License + contact
327
 
328
+ MIT. Questions, corrections, pull requests: nejc@nejc.dev