| # Apollo: Oracle Model |
|
|
| ## Project Status |
| **Phase:** Hyperparameter Optimization & Dataset Preparation. |
|
|
| ### Recent Updates (Jan 2026) |
| * **Hyperparameter Tuning**: Analyzed token trade distribution to determine optimal model parameters. |
| * **Max Sequence Length**: Set to **8192**. This covers >2 hours of high-frequency trading activity for high-volume tokens (verified against `HWVY...`) and the full lifecycle for 99% of tokens. |
| * **Prediction Horizons**: Set to **60s, 3m, 5m, 10m, 30m, 1h, 2h**. |
| * **Min Horizon (60s)**: Chosen to accommodate ~20s inference latency while capturing the "meat" of aggressive breakout movers. |
| * **Max Horizon (2h)**: Covers the timeframe where 99% of tokens hit their All-Time High. |
| * **Infrastructure**: |
| * Updated `train.sh` to use these new hyperparameters. |
| * Updated `scripts/cache_dataset.py` to ensure cached datasets are labeled with these horizons. |
| * Verified `DataFetcher` retrieves full trade histories (no hidden limits). |
|
|
| ## Configuration Summary |
|
|
| | Parameter | Value | Rationale | |
| | :--- | :--- | :--- | |
| | **Max Seq Len** | `8192` | Captures >2h of intense pump activity or full rug lifecycle. | |
| | **Horizons** | `60, 180, 300, 600, 1800, 3600, 7200` | From "Scalp/Breakout" (1m) to "Runner/ATH" (2h). | |
| | **Inference Latency** | ~20s | Dictates the 60s minimum horizon. | |
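The table above can be restated as a small config fragment; this is a sketch only, and the constant names are hypothetical (the real values are passed via `train.sh`):

```python
# Hypothetical constant names; train.sh carries the actual hyperparameters.
MAX_SEQ_LEN = 8192
# Prediction horizons in seconds: 60s, 3m, 5m, 10m, 30m, 1h, 2h
HORIZONS_S = [60, 180, 300, 600, 1800, 3600, 7200]
INFERENCE_LATENCY_S = 20  # ~20s; this dictates the 60s minimum horizon

# Sanity check: the shortest horizon must comfortably outlast inference.
assert min(HORIZONS_S) > INFERENCE_LATENCY_S
```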
|
|
| ## Usage |
|
|
| ### 1. Cache Dataset |
| Pre-process data into `.pt` files with correct labels. |
| ```bash |
| ./pre_cache.sh |
| ``` |
|
|
| ### 2. Train Model |
| Launch training with updated hyperparameters. |
| ```bash |
| ./train.sh |
| ``` |
|
|
| ## TODO: Future Enhancements |
|
|
| ### Multi-Task Quality Prediction Head |
| Add a secondary head (Head B) that predicts **token quality percentiles** alongside price returns: |
| - **Fees Percentile** — Predicted future fees relative to class median |
| - **Volume Percentile** — Predicted future volume relative to class median |
| - **Holders Percentile** — Predicted future holder count relative to class median |
|
|
| **Rationale:** The `analyze_distribution.py` script currently uses hard thresholds on future metrics to classify tokens as "Manipulated". This head would let the model **learn to predict** those quality metrics from current features, enabling scam detection at inference time without access to future data. |
|
|
| **Approach Options:** |
| 1. Single composite quality score (simpler) |
| 2. Three separate percentile predictions (more interpretable) |
| 3. Three binary classifications (fees_ok, volume_ok, holders_ok) |
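All three approach options boil down to label construction; a minimal sketch in plain Python, where the field names (`fees`, `volume`, `holders`) and the class-median rule are assumptions drawn from the rationale above, not the project's actual schema:

```python
from statistics import median

def percentile_rank(value, population):
    """Fraction of the population at or below `value` (0.0 - 1.0)."""
    if not population:
        return 0.0
    return sum(1 for v in population if v <= value) / len(population)

def quality_labels(token, class_tokens):
    """Build hypothetical Head B training labels for one token.

    `token` and each entry of `class_tokens` are dicts with assumed
    keys 'fees', 'volume', 'holders' holding *future* metrics.
    """
    labels = {}
    for key in ("fees", "volume", "holders"):
        pop = [t[key] for t in class_tokens]
        # Option 2: three separate percentile predictions
        labels[f"{key}_pct"] = percentile_rank(token[key], pop)
        # Option 3: binary classification vs. the class median
        labels[f"{key}_ok"] = token[key] >= median(pop)
    # Option 1: single composite quality score (mean of the percentiles)
    labels["quality"] = sum(
        labels[f"{k}_pct"] for k in ("fees", "volume", "holders")
    ) / 3
    return labels
```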
| |
### Data Sampling (Context Optimization)
Replace hardcoded H/B/H limits with a dynamic sampling strategy that maximizes the model's context window usage.

**The Problem:** The system currently triggers H/B/H logic at a fixed 30k trade count and uses hardcoded limits (10k early, 15k recent). This mismatch with the model's `max_seq_len` (e.g., 8192) leads to inefficient data usage: either valuable data is truncated arbitrarily, or too little is fed when more could fit.

**The Solution: Dynamic Context Filling.** Implementation moves to `data_loader.py`, since the cache contains the full history.
|
|
**Algorithm:**
1. **Input**: the full sorted list of events (trades, chart segments, etc.) up to `T_cutoff`.
2. **Check**: if `len(events) <= max_seq_len`, use ALL events.
3. **Split**: if `len(events) > max_seq_len`:
   - Reserve space for special tokens (start/end/pad).
   - Calculate the budget: `budget = max_seq_len - reserve` (e.g., 8100).
   - Dynamic split: **Head (early)** = first `budget / 2` events; **Tail (recent)** = last `budget / 2` events.
   - Construct: `[HEAD] ... [GAP_TOKEN] ... [TAIL]`.
**Implementation Changes** (modify `data_loader.py`):
- Remove constants: delete `HBH_EARLY_EVENT_LIMIT` and `HBH_RECENT_EVENT_LIMIT`.
- Update `_generate_dataset_item`:
  - Accept `max_seq_len`.
  - Apply the split logic defined above before returning `event_sequence`.
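The head/tail split described above can be sketched as follows; `GAP_TOKEN` and the `reserve` count are placeholders, not the project's actual identifiers:

```python
GAP_TOKEN = "<GAP>"  # placeholder; the real vocabulary's gap token may differ

def fill_context(events, max_seq_len, reserve=3):
    """Dynamic context filling: use all events when they fit; otherwise
    keep the earliest and most recent halves of the budget with a gap
    marker in between. `reserve` (start/end/pad slots) is an assumption."""
    if len(events) <= max_seq_len:
        return list(events)                      # everything fits: use ALL events
    budget = max_seq_len - reserve               # e.g., 8192 - reserve
    head = events[: budget // 2]                 # earliest events
    tail = events[-(budget - budget // 2):]      # most recent events
    return head + [GAP_TOKEN] + tail
```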
| |
| |
| |
| |
**Explained simply:**

1. Check whether the final event list exceeds the total context available.
2. Filter out the trade events and count the non-aggregable events (e.g., a burn, a deployer trade). These IMPORTANT events are always kept.
3. From the context remaining after those important events, work out how many snapshots will fit (chart segments, holder snapshots, chain stats, etc.).
4. Whatever remains after the snapshots and the important non-aggregable events is used for the H (high-definition) segments; the middle (Blurry) segment keeps only the snapshots.

This works because ~90% of the context is taken up by trades and transfers, so they are the only events that need compressing to free context.
| |
No new tokens are needed, because special tokens already exist for this: `MIDDLE` and `RECENT`. Emit `<MIDDLE>` when switching to the blurry segment, and `<RECENT>` when switching back to high definition.
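The prioritized budget described above can be sketched as follows; the event-type tags are hypothetical stand-ins for the project's real event schema:

```python
# Hypothetical event-type tags; the actual event schema may differ.
IMPORTANT = {"burn", "deployer_trade"}                    # non-aggregable, always kept
SNAPSHOT = {"chart_segment", "holders_snapshot", "chain_stats"}

def allocate_budget(events, max_seq_len):
    """Sketch of the prioritized H/B/H budget:
    1) important non-aggregable events are always kept,
    2) snapshots fill next (they alone populate the blurry middle),
    3) the remainder goes to the high-definition head/tail segments,
       which is where trades and transfers get compressed."""
    n_important = sum(1 for e in events if e["type"] in IMPORTANT)
    n_snapshots = sum(1 for e in events if e["type"] in SNAPSHOT)
    remaining = max_seq_len - n_important
    n_snapshots = min(n_snapshots, remaining)
    hd_budget = max(remaining - n_snapshots, 0)  # split across head and tail
    return {"important": n_important, "snapshots": n_snapshots, "hd": hd_budget}
```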