zirobtc committed
Commit 1cdc0af · 1 Parent(s): 68040ce

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 log.log filter=lfs diff=lfs merge=lfs -text
 store/74c/74c70007-cccd-4669-bfd4-e25f8348ad8c/all_1_35_2/primary.cidx filter=lfs diff=lfs merge=lfs -text
+data/quality_scores.jsonl filter=lfs diff=lfs merge=lfs -text
QUALITY_SCORE_ARCHITECTURE.md ADDED
@@ -0,0 +1,164 @@
# Token Quality / Health Score (q) - Architecture

This document defines the "quality/health" scalar `q` used by Apollo.

## 1) What problem this solves

We want a single number that captures **how healthy / organic vs controlled** a token looks, so a downstream trading policy (e.g., RL agent) can treat it as a **risk/health input**.

Key points:
- This is **not model confidence**.
- `q` is computed **offline** using a token's **full lifetime** (for labels / training targets).
- At **inference**, the model predicts `q` from **partial observations**.
- We avoid hard thresholds and raw-scale features (USD, SOL, counts) by using **within-regime distributions**.

## 2) Core idea (distribution-first, not rules-first)

Raw totals (fees, volume, holders) are mostly **scale** and are extremely heavy-tailed. Using them directly:
- makes the signal unstable across regimes,
- makes it sensitive to market-wide shifts,
- and invites hand-tuned weights ("human bias").

Instead we map each metric to a **percentile** within a comparable peer group, then aggregate.

## 3) Return bucketing (why it is required)

The dataset is highly imbalanced: most tokens die early (<2-3x), while a tiny tail produces 10x-1000x outcomes.

If you compute percentiles globally:
- 100x tokens will always dominate "good" percentiles for scale metrics,
- and "quality" will collapse into "return magnitude".

So we compute distributions **within return regimes**.

### 3.1 Bucket definition (example)

Let `R_max` be the token's lifetime max return multiple (e.g., ATH price / launch price).

Use coarse buckets for the bulk and finer buckets for the tail, e.g.:
- B0: `R_max < 3`
- B1: `3 <= R_max < 10`
- B2: `10 <= R_max < 20`
- B3: `20 <= R_max < 100`
- B4: `100 <= R_max < 10_000`

Notes:
- If a bucket has too few samples, merge it with a neighbor.
- For the extreme tail you can also replace fixed buckets with **quantile buckets** on `log(R_max)` to keep sample counts stable (see the sketch below).

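For the quantile-bucket variant, a minimal sketch (assuming NumPy; the function name and defaults are illustrative, not part of the pipeline):

```python
import numpy as np

def quantile_bucket_edges(r_max: np.ndarray, n_buckets: int = 4) -> np.ndarray:
    """Equal-count bucket edges on log(R_max), keeping per-bucket sample sizes stable."""
    qs = np.linspace(0.0, 1.0, n_buckets + 1)
    return np.exp(np.quantile(np.log(r_max), qs))

# Assign ids 0..n_buckets-1 using the interior edges:
# bucket_id = np.digitize(r_max, quantile_bucket_edges(r_max)[1:-1])
```
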
Interpretation (important):
- `q` is **relative within the bucket**.
- The "best garbage" can have high `q` in B0.
- A 100x token can have low `q` in B4 if it looks worst vs other 100x+ tokens.

This is intentional: return and quality are different axes.

## 4) Feature set and sign conventions

We want `q` to increase for "healthy/organic" structure and decrease for "controlled/manipulated" structure.

All features below are evaluated **within the token's return bucket**.

### 4.1 Scale / activity (high is usually better within-bucket)

Use log transforms for stability before percentiles:
- `log1p(total_volume_usd)`
- `log1p(total_fees_sol)`
- `log1p(unique_holders)`
- `log1p(time_to_ath_sec)` (optional; see note below)

Ratio features (less pure scale; sketched below):
- `fees_per_volume = total_fees_sol / (total_volume_usd + eps)`
- `fees_per_trade = total_fees_sol / (n_trades + eps)` (if `n_trades` exists)
- `holders_per_trade = unique_holders / (n_trades + eps)` (if `n_trades` exists)
- `holders_per_volume = unique_holders / (total_volume_usd + eps)`

Rationale:
- Fees and fee-per-* help separate "real urgency / competition" from "cheap wash".
- Holders and holders-per-* help separate broad participation from concentrated looping.

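A minimal sketch of the ratio features above (field and function names are illustrative; `EPS` plays the role of `eps`):

```python
EPS = 1e-9

def ratio_features(m: dict) -> dict:
    """Ratio features from lifetime totals; None marks a missing metric."""
    def safe_div(num, den):
        return None if num is None or den is None else num / (den + EPS)
    return {
        "fees_per_volume": safe_div(m.get("total_fees_sol"), m.get("total_volume_usd")),
        "fees_per_trade": safe_div(m.get("total_fees_sol"), m.get("n_trades")),
        "holders_per_trade": safe_div(m.get("unique_holders"), m.get("n_trades")),
        "holders_per_volume": safe_div(m.get("unique_holders"), m.get("total_volume_usd")),
    }
```
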
### 4.2 Manipulation / control (high is worse; flip sign)

These are typically "the higher, the less healthy":
- `snipers_pct_supply_top70`
- `bundled_pct_supply`
- `dev_hold_pct_supply`
- `insiders_pct_supply`

We treat exceptions as rare; the model can learn edge cases from context, but the label should reflect the dominant interpretation.

### 4.3 Time-to-ATH note

`time_to_ath_sec` can behave differently across return buckets.
- In high-return buckets, very short times can look like a single spike / control.
- In low-return buckets, many tokens have near-zero times because they never move.

Include it only if it improves downstream behavior; keep it **bucket-relative** either way.

## 5) Turning raw metrics into a signed scalar

We want a single `q` in `[-1, +1]` with direction:
- `+1` = looks healthiest vs peers in the same return bucket
- `-1` = looks most unhealthy vs peers in the same return bucket

### 5.1 Within-bucket percentile (ECDF)

For each feature value `x_i`:
- compute percentile `p_i = ECDF_b(x_i)` using only tokens in bucket `b`
- `p_i` is in `[0, 1]`

Implementation detail:
- Use a rank-based ECDF with a small offset to avoid exact 0/1 if desired:
  - `p_i = (rank(x_i) - 0.5) / n`

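A one-function sketch of this midrank ECDF (assuming NumPy/SciPy; ties share a midrank, as in the committed script below):

```python
import numpy as np
from scipy.stats import rankdata

def ecdf_percentiles(x: np.ndarray) -> np.ndarray:
    """p = (midrank - 0.5) / n; values stay strictly inside (0, 1)."""
    return (rankdata(x, method="average") - 0.5) / len(x)
```
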
115
+
116
+ Convert to signed value:
117
+ - `s_i = 2 * p_i - 1` (now `s_i` is in `[-1, +1]`)
118
+
119
+ If "high is bad" for that feature, flip it:
120
+ - `s_i := -s_i`
121
+
122
+ This gives direction + magnitude in a single number.
123
+
124
+ ### 5.3 Aggregate without hand weights
125
+
126
+ To avoid hand-tuned weights, use a symmetric aggregator:
127
+ - `q_raw = mean_i(s_i)`
128
+
129
+ Optional robustness:
130
+ - clip each `s_i` to `[-0.99, 0.99]` before averaging (limits extreme leverage)
131
+ - use a trimmed mean (drop top/bottom k% of `s_i`) if a single metric can be noisy
132
+
133
+ ### 5.4 Optional: re-rank the aggregate (final calibration)
134
+
135
+ If you want the final `q` to be strictly comparable across time / retrains and more uniform within bucket:
136
+ - `q = 2 * ECDF_b(q_raw) - 1`
137
+
138
+ This keeps the "relative within bucket" meaning while stabilizing scale.
139
+
140
+ ## 6) Training vs inference (how it is used)
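Putting 5.1-5.4 together, a minimal per-bucket sketch (assuming NumPy/SciPy; the names are illustrative, and NaN marks a missing metric):

```python
import numpy as np
from scipy.stats import rankdata

def _pct(x: np.ndarray) -> np.ndarray:
    """NaN-aware midrank ECDF: p = (midrank - 0.5) / n over the non-missing values."""
    p = np.full(x.shape, np.nan)
    mask = ~np.isnan(x)
    if mask.any():
        p[mask] = (rankdata(x[mask], method="average") - 0.5) / mask.sum()
    return p

def quality_for_bucket(features: dict, high_is_bad: set, rerank: bool = True) -> np.ndarray:
    """features maps name -> value array over one bucket's tokens; returns q in [-1, +1]."""
    signed = []
    for name, x in features.items():
        s = 2.0 * _pct(np.asarray(x, dtype=float)) - 1.0  # 5.1 + 5.2
        if name in high_is_bad:
            s = -s                                        # flip "high is bad" features
        signed.append(np.clip(s, -0.99, 0.99))            # 5.3 clipping
    q_raw = np.nanmean(np.vstack(signed), axis=0)         # 5.3 mean over the features present
    return 2.0 * _pct(q_raw) - 1.0 if rerank else q_raw   # 5.4 re-rank within the bucket
```

The committed `scripts/compute_quality_score.py` below implements the same steps with an explicit per-feature, per-bucket midrank helper.
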
141
+
142
+ Offline labeling (training target):
143
+ 1) Compute `R_max` from full lifetime.
144
+ 2) Assign return bucket `b`.
145
+ 3) Compute all chosen metrics from full lifetime.
146
+ 4) Convert metrics -> signed percentiles -> `q`.
147
+
148
+ Inference (model output):
149
+ - The model only sees partial history and must predict the *final* `q` (computed above).
150
+ - The trading policy uses predicted return signals + predicted `q` to decide position sizing / risk.
151
+
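A sketch of the offline labeling run using the committed script's functions (the import path is assumed):

```python
from scripts.compute_quality_score import get_client, compute_quality_scores, write_jsonl

scores = compute_quality_scores(get_client(), max_ret=10_000.0, rerank=True)
# each row: token_address, bucket_id, ret, q_raw, q
write_jsonl("data/quality_scores.jsonl", scores)
```
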
## 7) Practical notes

- Use `eps` (e.g., `1e-9`) in denominators to avoid divide-by-zero.
- If a metric is missing for a token, drop it from that token's mean (or impute with the bucket median).
- When bucket sample counts drift, prefer merging buckets over letting the ECDF get noisy.
- Recompute distributions on the same "source-of-truth" dataset used for training (not ad-hoc caches).

## 8) Summary

`q` is a **return-regime-relative**, **distribution-normalized**, **signed** health score:
- It is not a threshold classifier.
- It avoids raw-scale dependence and hand weighting.
- It cleanly separates "made money" (return) from "looks healthy" (quality).
data/quality_scores.jsonl ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:484a4722dcc9d5bf4928e0926be256df7abecfe74cfbf7f75b04aeab91c2ca23
size 11849315
log.log CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:47df087d32a6eacfb7758e1bcb7f7f82eb11905edbe984e3d1d5fe1fd905a155
-size 233598
+oid sha256:a7e8559fa0dfc6a9356d4078d582a479a5a3cbf8a3348183b3baf336ef73db25
+size 2302
scripts/compute_quality_score.py ADDED
@@ -0,0 +1,545 @@
import os
import sys
import json
import math
import argparse
from typing import Dict, List, Tuple

from clickhouse_driver import Client as ClickHouseClient

# Add parent to path so `models.vocabulary` resolves when run from scripts/
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from models.vocabulary import RETURN_THRESHOLDS

CLICKHOUSE_HOST = os.getenv("CLICKHOUSE_HOST", "localhost")
CLICKHOUSE_PORT = int(os.getenv("CLICKHOUSE_PORT", 9000))
CLICKHOUSE_USER = os.getenv("CLICKHOUSE_USER", "default")
CLICKHOUSE_PASSWORD = os.getenv("CLICKHOUSE_PASSWORD", "")
CLICKHOUSE_DATABASE = os.getenv("CLICKHOUSE_DATABASE", "default")

# Fixed launch price used to turn the lifetime ATH price into a return multiple.
LAUNCH_PRICE_USD = 0.000004
EPS = 1e-9


def get_client():
    return ClickHouseClient(
        host=CLICKHOUSE_HOST,
        port=CLICKHOUSE_PORT,
        user=CLICKHOUSE_USER,
        password=CLICKHOUSE_PASSWORD,
        database=CLICKHOUSE_DATABASE,
    )


def _midrank_percentiles(items: List[Tuple[str, float]]) -> Dict[str, float]:
    """
    Compute midrank percentiles for a list of (token, value).
    Returns p in (0, 1) via (rank - 0.5) / n. Ties get the same midrank.
    """
    if not items:
        return {}
    items_sorted = sorted(items, key=lambda x: x[1])
    n = len(items_sorted)
    out = {}
    i = 0
    while i < n:
        j = i
        v = items_sorted[i][1]
        while j + 1 < n and items_sorted[j + 1][1] == v:
            j += 1
        # midrank is the average of ranks i..j (1-based)
        rank_lo = i + 1
        rank_hi = j + 1
        midrank = 0.5 * (rank_lo + rank_hi)
        p = (midrank - 0.5) / n
        for k in range(i, j + 1):
            out[items_sorted[k][0]] = p
        i = j + 1
    return out


def _bucket_id(ret_val: float) -> int:
    # Map a return multiple to its [lower, upper) bucket; -1 if outside all buckets.
    for i in range(len(RETURN_THRESHOLDS) - 1):
        lower = RETURN_THRESHOLDS[i]
        upper = RETURN_THRESHOLDS[i + 1]
        if lower <= ret_val < upper:
            return i
    return -1


def fetch_token_metrics(client) -> List[dict]:
    """
    Fetches lifetime metrics needed for quality scoring.
    Returns a list of dicts keyed by token_address.
    """
    query = f"""
    WITH
    trade_agg AS (
        SELECT
            base_address,
            sum(priority_fee + coin_creator_fee) AS fees_sol,
            sum(total_usd) AS volume_usd,
            count() AS n_trades,
            min(timestamp) AS t0,
            argMax(timestamp, price_usd) AS t_ath
        FROM trades
        GROUP BY base_address
    ),
    ret_agg AS (
        SELECT
            token_address,
            (argMax(ath_price_usd, updated_at) / {LAUNCH_PRICE_USD}) AS ret,
            argMax(unique_holders, updated_at) AS unique_holders
        FROM token_metrics
        GROUP BY token_address
    ),
    snipers AS (
        SELECT
            m.base_address AS token_address,
            (m.val / t.total_supply * 100) AS snipers_pct
        FROM (
            SELECT
                base_address,
                sumIf(base_amount, buyer_rank <= 70) AS val
            FROM (
                SELECT
                    base_address,
                    base_amount,
                    dense_rank() OVER (PARTITION BY base_address ORDER BY min_slot, min_idx) AS buyer_rank
                FROM (
                    SELECT
                        base_address,
                        maker,
                        min(slot) AS min_slot,
                        min(transaction_index) AS min_idx,
                        sum(base_amount) AS base_amount
                    FROM trades
                    WHERE trade_type = 0
                    GROUP BY base_address, maker
                )
            )
            GROUP BY base_address
        ) m
        JOIN (
            SELECT token_address, argMax(total_supply, updated_at) AS total_supply
            FROM tokens
            GROUP BY token_address
        ) t ON m.base_address = t.token_address
        WHERE t.total_supply > 0
    ),
    bundled AS (
        SELECT
            m.base_address AS token_address,
            (m.val / t.total_supply * 100) AS bundled_pct
        FROM (
            SELECT
                t.base_address,
                sum(t.base_amount) AS val
            FROM trades t
            JOIN (
                SELECT base_address, min(slot) AS min_slot
                FROM trades
                GROUP BY base_address
            ) m ON t.base_address = m.base_address AND t.slot = m.min_slot
            WHERE t.trade_type = 0
            GROUP BY t.base_address
        ) m
        JOIN (
            SELECT token_address, argMax(total_supply, updated_at) AS total_supply
            FROM tokens
            GROUP BY token_address
        ) t ON m.base_address = t.token_address
        WHERE t.total_supply > 0
    ),
    dev_hold AS (
        SELECT
            t.token_address AS token_address,
            (wh.current_balance / (t.total_supply / pow(10, t.decimals)) * 100) AS dev_hold_pct
        FROM (
            SELECT
                token_address,
                argMax(creator_address, updated_at) AS creator_address,
                argMax(total_supply, updated_at) AS total_supply,
                argMax(decimals, updated_at) AS decimals
            FROM tokens
            GROUP BY token_address
        ) t
        JOIN (
            SELECT mint_address, wallet_address, argMax(current_balance, updated_at) AS current_balance
            FROM wallet_holdings
            GROUP BY mint_address, wallet_address
        ) wh ON t.token_address = wh.mint_address AND t.creator_address = wh.wallet_address
        WHERE t.total_supply > 0
    ),
    insiders AS (
        SELECT
            wh.mint_address AS token_address,
            (sum(wh.current_balance) / (t.total_supply / pow(10, t.decimals)) * 100) AS insiders_pct
        FROM (
            SELECT mint_address, wallet_address, argMax(current_balance, updated_at) AS current_balance
            FROM wallet_holdings
            GROUP BY mint_address, wallet_address
        ) wh
        JOIN (
            SELECT
                wallet_address,
                argMax(total_buys_count, updated_at) AS buys,
                argMax(transfers_in_count, updated_at) AS transfers,
                argMax(spl_transfers_in_count, updated_at) AS spl_transfers
            FROM wallet_profile_metrics
            GROUP BY wallet_address
        ) wpm ON wh.wallet_address = wpm.wallet_address
        JOIN (
            SELECT token_address, argMax(total_supply, updated_at) AS total_supply, argMax(decimals, updated_at) AS decimals
            FROM tokens
            GROUP BY token_address
        ) t ON wh.mint_address = t.token_address
        WHERE wpm.buys = 0 AND (wpm.transfers > 0 OR wpm.spl_transfers > 0) AND t.total_supply > 0
        GROUP BY wh.mint_address, t.total_supply, t.decimals
    )
    SELECT
        r.token_address,
        r.ret,
        r.unique_holders,
        f.fees_sol,
        f.volume_usd,
        f.n_trades,
        (f.t_ath - f.t0) AS time_to_ath_sec,
        s.snipers_pct,
        b.bundled_pct,
        d.dev_hold_pct,
        i.insiders_pct
    FROM ret_agg r
    LEFT JOIN trade_agg f ON r.token_address = f.base_address
    LEFT JOIN snipers s ON r.token_address = s.token_address
    LEFT JOIN bundled b ON r.token_address = b.token_address
    LEFT JOIN dev_hold d ON r.token_address = d.token_address
    LEFT JOIN insiders i ON r.token_address = i.token_address
    """
    rows = client.execute(query)
    cols = [
        "token_address",
        "ret",
        "unique_holders",
        "fees_sol",
        "volume_usd",
        "n_trades",
        "time_to_ath_sec",
        "snipers_pct",
        "bundled_pct",
        "dev_hold_pct",
        "insiders_pct",
    ]
    return [dict(zip(cols, r)) for r in rows]


def _compute_quality_scores(
    client,
    max_ret: float = 10000.0,
    rerank: bool = True,
    with_debug: bool = False,
):
    data = fetch_token_metrics(client)

    # feature spec: (name, getter, positive_when_high)
    feature_defs = [
        ("fees_log", lambda d: math.log1p(d["fees_sol"]) if d["fees_sol"] is not None else None, True),
        ("volume_log", lambda d: math.log1p(d["volume_usd"]) if d["volume_usd"] is not None else None, True),
        ("holders_log", lambda d: math.log1p(d["unique_holders"]) if d["unique_holders"] is not None else None, True),
        ("time_to_ath_log", lambda d: math.log1p(d["time_to_ath_sec"]) if d["time_to_ath_sec"] is not None else None, True),
        ("fees_per_volume", lambda d: (d["fees_sol"] / (d["volume_usd"] + EPS)) if d["fees_sol"] is not None and d["volume_usd"] is not None else None, True),
        ("fees_per_trade", lambda d: (d["fees_sol"] / (d["n_trades"] + EPS)) if d["fees_sol"] is not None and d["n_trades"] is not None else None, True),
        ("holders_per_trade", lambda d: (d["unique_holders"] / (d["n_trades"] + EPS)) if d["unique_holders"] is not None and d["n_trades"] is not None else None, True),
        ("holders_per_volume", lambda d: (d["unique_holders"] / (d["volume_usd"] + EPS)) if d["unique_holders"] is not None and d["volume_usd"] is not None else None, True),
        ("snipers_pct", lambda d: d["snipers_pct"], False),
        ("bundled_pct", lambda d: d["bundled_pct"], False),
        ("dev_hold_pct", lambda d: d["dev_hold_pct"], False),
        ("insiders_pct", lambda d: d["insiders_pct"], False),
    ]

    raw_metrics = ["snipers_pct", "bundled_pct", "dev_hold_pct", "insiders_pct"]

    debug = None
    if with_debug:
        debug = {
            "q_raw": [],
            "feature_pairs": {f[0]: [] for f in feature_defs},
            "raw_pairs": {m: [] for m in raw_metrics},
        }

    # Build bucket mapping
    buckets: Dict[int, List[dict]] = {}
    for d in data:
        ret_val = d.get("ret")
        if ret_val is None or ret_val <= 0 or ret_val > max_ret:
            continue
        b = _bucket_id(ret_val)
        if b == -1:
            continue
        d["bucket_id"] = b
        buckets.setdefault(b, []).append(d)

    # Compute percentiles per bucket + feature
    token_scores = []
    for b, items in buckets.items():
        # Precompute percentiles per feature
        feature_percentiles: Dict[str, Dict[str, float]] = {}
        for fname, fget, _pos in feature_defs:
            vals = []
            for d in items:
                v = fget(d)
                if v is None or (isinstance(v, float) and (math.isnan(v) or math.isinf(v))):
                    continue
                vals.append((d["token_address"], v))
            feature_percentiles[fname] = _midrank_percentiles(vals)

        # Compute q_raw for each token
        q_raw_map = {}
        for d in items:
            s_vals = []
            s_map = {}
            for fname, _fget, pos in feature_defs:
                p = feature_percentiles[fname].get(d["token_address"])
                if p is None:
                    continue
                s = 2.0 * p - 1.0
                if not pos:
                    s = -s
                # clip to [-0.99, 0.99] to limit extreme leverage
                s = max(-0.99, min(0.99, s))
                s_vals.append(s)
                s_map[fname] = s
            if not s_vals:
                continue
            q_raw = sum(s_vals) / len(s_vals)
            q_raw_map[d["token_address"]] = q_raw
            if with_debug:
                debug["q_raw"].append(q_raw)
                for fname, s in s_map.items():
                    debug["feature_pairs"][fname].append((q_raw, s))
                for metric in raw_metrics:
                    raw_val = d.get(metric)
                    if raw_val is None:
                        continue
                    debug["raw_pairs"][metric].append((q_raw, raw_val))

        # Optional re-rank within bucket; otherwise q == q_raw
        q_p = _midrank_percentiles(list(q_raw_map.items())) if rerank else None
        for d in items:
            t = d["token_address"]
            if t not in q_raw_map:
                continue
            q_final = 2.0 * q_p[t] - 1.0 if q_p is not None else q_raw_map[t]
            token_scores.append(
                {
                    "token_address": t,
                    "bucket_id": b,
                    "ret": d["ret"],
                    "q_raw": q_raw_map[t],
                    "q": q_final,
                }
            )

    if with_debug:
        return token_scores, debug
    return token_scores


def compute_quality_scores(
    client,
    max_ret: float = 10000.0,
    rerank: bool = True,
) -> List[dict]:
    return _compute_quality_scores(client, max_ret=max_ret, rerank=rerank, with_debug=False)


def write_jsonl(path: str, rows: List[dict]) -> None:
    # `or "."` guards paths without a directory component
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")


def _percentile(sorted_vals: List[float], p: float) -> float:
    # Linear interpolation between closest ranks; expects a sorted list.
    if not sorted_vals:
        return float("nan")
    n = len(sorted_vals)
    if n == 1:
        return sorted_vals[0]
    pos = p * (n - 1)
    lo = int(math.floor(pos))
    hi = int(math.ceil(pos))
    if lo == hi:
        return sorted_vals[lo]
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def _summary_stats(vals: List[float]) -> Dict[str, float]:
    if not vals:
        return {}
    vals_sorted = sorted(vals)
    return {
        "mean": sum(vals_sorted) / len(vals_sorted),
        "min": vals_sorted[0],
        "max": vals_sorted[-1],
        "p10": _percentile(vals_sorted, 0.10),
        "p50": _percentile(vals_sorted, 0.50),
        "p90": _percentile(vals_sorted, 0.90),
        "p99": _percentile(vals_sorted, 0.99),
    }


def _pearson_corr(xs: List[float], ys: List[float]) -> float:
    if not xs or not ys or len(xs) != len(ys) or len(xs) < 2:
        return float("nan")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = 0.0
    den_x = 0.0
    den_y = 0.0
    for i in range(n):
        dx = xs[i] - mean_x
        dy = ys[i] - mean_y
        num += dx * dy
        den_x += dx * dx
        den_y += dy * dy
    denom = math.sqrt(den_x * den_y)
    if denom == 0.0:
        return float("nan")
    return num / denom


def _bucket_label(b: int) -> str:
    lower = RETURN_THRESHOLDS[b]
    upper = RETURN_THRESHOLDS[b + 1] if b + 1 < len(RETURN_THRESHOLDS) else None
    if upper is None:
        return f">= {lower}x"
    return f"{lower}x - {upper}x"


def print_summary(scores: List[dict]) -> None:
    print("=== QUALITY SCORE SUMMARY ===")
    print(f"Total tokens scored: {len(scores)}")
    if not scores:
        return

    overall_q = [s["q"] for s in scores if "q" in s]
    overall_q_raw = [s["q_raw"] for s in scores if "q_raw" in s]
    for name, series in [("q", overall_q), ("q_raw", overall_q_raw)]:
        stats = _summary_stats(series)
        if not stats:
            continue
        print(f"\nOverall {name}:")
        print(f"  Mean: {stats['mean']:.4f} | Min: {stats['min']:.4f} | Max: {stats['max']:.4f}")
        print(f"  Quantiles: p10={stats['p10']:.2f} p50={stats['p50']:.2f} p90={stats['p90']:.2f} p99={stats['p99']:.2f}")

    # Per-bucket summaries
    buckets: Dict[int, List[dict]] = {}
    for s in scores:
        buckets.setdefault(s["bucket_id"], []).append(s)

    for b in sorted(buckets.keys()):
        items = buckets[b]
        q_vals = [i["q"] for i in items if "q" in i]
        q_raw_vals = [i["q_raw"] for i in items if "q_raw" in i]
        print(f"\nSEGMENT: {b}. {_bucket_label(b)}")
        print(f"Tokens in segment: {len(items)}")
        stats_q = _summary_stats(q_vals)
        stats_q_raw = _summary_stats(q_raw_vals)
        if stats_q:
            print("  q:")
            print(f"    Mean: {stats_q['mean']:.4f} | Min: {stats_q['min']:.4f} | Max: {stats_q['max']:.4f}")
            print(f"    Quantiles: p10={stats_q['p10']:.2f} p50={stats_q['p50']:.2f} p90={stats_q['p90']:.2f} p99={stats_q['p99']:.2f}")
        if stats_q_raw:
            print("  q_raw:")
            print(f"    Mean: {stats_q_raw['mean']:.4f} | Min: {stats_q_raw['min']:.4f} | Max: {stats_q_raw['max']:.4f}")
            print(f"    Quantiles: p10={stats_q_raw['p10']:.2f} p50={stats_q_raw['p50']:.2f} p90={stats_q_raw['p90']:.2f} p99={stats_q_raw['p99']:.2f}")


def print_diagnostics(debug: dict) -> None:
    if not debug:
        return
    q_raw_vals = debug.get("q_raw", [])
    if not q_raw_vals:
        return
    print("\n=== QUALITY SCORE DIAGNOSTICS ===")

    feature_pairs = debug.get("feature_pairs", {})
    if feature_pairs:
        print("Correlation with q_raw (signed features):")
        for fname in sorted(feature_pairs.keys()):
            pairs = feature_pairs[fname]
            xs = [p[0] for p in pairs]
            ys = [p[1] for p in pairs]
            corr = _pearson_corr(xs, ys)
            print(f"  {fname}: {corr:.4f} (n={len(pairs)})")

    raw_pairs = debug.get("raw_pairs", {})
    if raw_pairs:
        q_sorted = sorted(q_raw_vals)
        p10 = _percentile(q_sorted, 0.10)
        p90 = _percentile(q_sorted, 0.90)
        print("\nTop/bottom decile raw means (by q_raw):")
        for metric in sorted(raw_pairs.keys()):
            pairs = raw_pairs[metric]
            lows = [v for q, v in pairs if q <= p10]
            highs = [v for q, v in pairs if q >= p90]
            if not lows or not highs:
                continue
            low_mean = sum(lows) / len(lows)
            high_mean = sum(highs) / len(highs)
            print(f"  {metric}: bottom_mean={low_mean:.4f} top_mean={high_mean:.4f} (n_low={len(lows)}, n_high={len(highs)})")


def main():
    parser = argparse.ArgumentParser(description="Compute token quality/health score.")
    parser.add_argument("--max-ret", type=float, default=10000.0, help="Max return to include")
    parser.add_argument("--output", type=str, default="data/quality_scores.jsonl", help="Where to write scores as JSONL")
    parser.add_argument("--no-rerank", action="store_true", help="Disable final rerank within bucket")
    parser.add_argument("--no-summary", action="store_true", help="Disable summary logging")
    parser.add_argument("--no-diagnostics", action="store_true", help="Disable diagnostics logging")
    args = parser.parse_args()

    client = get_client()
    if args.no_diagnostics:
        scores = compute_quality_scores(client, max_ret=args.max_ret, rerank=not args.no_rerank)
        debug = None
    else:
        scores, debug = _compute_quality_scores(
            client,
            max_ret=args.max_ret,
            rerank=not args.no_rerank,
            with_debug=True,
        )
    # Persist scores; the default path matches data/quality_scores.jsonl in this commit.
    write_jsonl(args.output, scores)
    if not args.no_summary:
        print_summary(scores)
    if not args.no_diagnostics:
        print_diagnostics(debug)


if __name__ == "__main__":
    main()