Update README.md

ffcf731 verified 25 days ago

6.02 kB

	---
	license: mit
	---
	# Refined experiment set

	## Experiment 1:
	Tokenization with the H2 battery.

	Question: Can the H2 battery recall tokenized information, and which format is the most useful format for recall?


	# Stage 1 Battery Array
	The upcoming experiment will be utilizing ngram processing from wordnet previously extracted from GPT 5 mini on GPT 5's induction series.

	https://huggingface.co/datasets/AbstractPhil/wordnet-definitions

	Using the precalculated lexical topology that I'll likely need to rerun anyway in order to extend to ngram8 before we start.

	https://huggingface.co/datasets/AbstractPhil/wordnet-lexical-topology

	The battery array will be;

	1. 1-8gram characters ordinal uncased
	* Direct character sequence recon per ngram, each has it's own vocabulary that will need direct bitwise optimization.
	2. 1-8gram words frequency tuned uncased
	* Direct word converted to bitwise optimized character translation; example: "taco" = 3714 = utf8 unicode char from index
	* Train ordinal using relational vocabulary
	3. 1-8gram definitions ordinal uncased


	# Stage 2 Battery Array Constellation
	Pending success of stage 1 pretrain being usefully differentiated to task;

	1. 3 frequency selectors
	* Each tuned via MSE gating to select the most likely candidates for recon accuracy to the task
	* char, word, definition

	This will use the geolip transformer structure to capture multispectral anchoring first,
	if that doesn't yield I'll switch to full transformer system and concatenation to see if the structure holds.

	If that doesn't work I'll default to chalkboard mode, and construct a better concatenation array system.

	# The Vocabulary

	This is a unique format of bitwise relational complexity for the preliminary tests;

	For characters; encode a single character to a three channel trigram.

	```python
	@staticmethod
	def bytes_to_image(byte_chunk: np.ndarray, img_size: int,
	patch_size: int = 4,
	channels: int = 3) -> np.ndarray:
	"""``[bytes_per_image]`` uint8 → ``[channels, img_size, img_size]`` float32 in [-1, 1].

	Layout: byte stream packs into cells in row-major-across-patches,
	row-major-within-patch order. Cell ``i`` holds bytes
	``byte_chunk[Ci : Ci + C]`` as the C-tuple of channel values,
	where C = ``channels``.
	"""
	gh = gw = img_size // patch_size
	cells_per_patch = patch_size * patch_size
	n_patches = gh * gw
	# Reshape: [n_patches, cells_per_patch, channels]
	rgb = byte_chunk.reshape(n_patches, cells_per_patch, channels).astype(np.float32)
	rgb = (rgb - 127.5) / 127.5 # → [-1, 1]
	per_patch = rgb.reshape(n_patches, patch_size, patch_size, channels)
	grid = per_patch.reshape(gh, gw, patch_size, patch_size, channels)
	# Permute (gh, gw, ps_r, ps_c, channel) → (channel, gh, ps_r, gw, ps_c)
	img = grid.transpose(4, 0, 2, 1, 3)
	img = img.reshape(channels, img_size, img_size)
	return img
	```

	```python
	@staticmethod
	def image_to_bytes(images: torch.Tensor, patch_size: int = 4,
	channels: int = 3) -> torch.Tensor:
	"""``[B, C, H, W]`` float → ``[B, n_cells_total, C]`` in {0..255}.

	Inverse of ``bytes_to_image`` for the same ``patch_size`` and
	``channels``. Maps continuous [-1, 1] back to rounded uint8 byte
	values. C = ``channels``.
	"""
	B, C, H, W = images.shape
	assert C == channels and H == W and H % patch_size == 0, (
	f"Need square C={channels}-ch image div by ps; got "
	f"{tuple(images.shape)}, ps={patch_size}, channels={channels}"
	)
	gh = gw = H // patch_size
	ps = patch_size
	# (B, C, gh, ps, gw, ps) → (B, gh, gw, ps_r, ps_c, channel)
	x = images.reshape(B, channels, gh, ps, gw, ps)
	x = x.permute(0, 2, 4, 3, 5, 1)
	x = x.reshape(B, gh * gw * ps * ps, channels)
	# Recover bytes: float in [-1, 1] → byte in [0, 255]
	bytes_f = x * 127.5 + 127.5
	return bytes_f.clamp(0, 255).round().to(torch.uint8)

	```

	So for the prelim experiments we will be translating directly to 3 channel text as we've been doing.

	Currently, the primary finetune is using wikipedia-103 as a preliminary system.

	We will need to train the entirety of wordnet to reconstruct in sequence, which will provide encoder frequency association in the preliminary stages.

	After this, we'll want our second. This will have an entirely different vocabulary translation matrix meant to target 2gram.

	Our 1gram and 2gram probes will be our first catalyst pair to test downstream vocabulary differentiation.

	My hunch is sentencepiece or something along those lines is more stable, but it won't matter in this model.

	These are asking and answering a very different set of questions and producing a very different set of problems.


	# 2gram Recon

	The system requires combination tokens, which means I will be using the lexical analyzed wordnet frequency dataset to determine my preliminary pairs.

	2gram frequency will determine the order of the values, so nothing too-too specific for the prelim. With this I'll organize a translation matrix.

	Binary matching utf-8 "c" is would be an entirely different value, while both translate to an identical UTF-8 relational value.

	["c", "a"] becomes ["z "] or something in direct translation along those lines with the 2gram, then we just translate with our downstream consumer.

	Extracting the behavior is related to behavior, not the actual vocabulary. Everything downstream is a consumer of the behavior,
	not the curator of the behavior.

	## Likely permanent solution
	Token testing and accumulative research related to tokenization processing via sentencepiece and the like is required.

	Something along the lines of the llama tokenizer, qwen tokenizer, or something relationally universal could work, but I'll need to determine this through testing.