| --- |
| license: mit |
| --- |
| # Refined experiment set |
|
|
| ## Experiment 1: |
| Tokenization with the H2 battery. |
|
|
| Question: Can the H2 battery recall tokenized information, and which format is most useful for recall? |
|
|
|
|
| # Stage 1 Battery Array |
| The upcoming experiment will use ngram processing of wordnet data previously extracted from GPT 5 mini on GPT 5's induction series. |
|
|
| https://huggingface.co/datasets/AbstractPhil/wordnet-definitions |
|
|
| It will use the precalculated lexical topology, which I'll likely need to rerun anyway to extend it to ngram8 before we start. |
|
|
| https://huggingface.co/datasets/AbstractPhil/wordnet-lexical-topology |
|
|
| The battery array will be: |
|
|
| 1. 1-8gram characters ordinal uncased |
| * Direct character sequence recon per ngram; each has its own vocabulary that will need direct bitwise optimization. |
| 2. 1-8gram words frequency tuned uncased |
| * Each word is converted directly to a bitwise optimized character translation; example: "taco" = 3714 = the UTF-8 Unicode character at that index |
| * Train ordinal using relational vocabulary |
| 3. 1-8gram definitions ordinal uncased |
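
As a rough illustration of the first battery (1-8gram characters, ordinal, uncased), here is a minimal sketch of building one vocabulary per ngram size. The function names `char_ngram_vocabs` and `encode` are hypothetical, and "ordinal" is assumed here to mean that ids follow the sorted order of the ngrams; the actual bitwise optimization is not shown.

```python
def char_ngram_vocabs(texts, max_n=8):
    """Build one ordinal vocabulary per ngram size (1..max_n), uncased."""
    vocabs = {}
    for n in range(1, max_n + 1):
        grams = set()
        for text in texts:
            t = text.lower()  # uncased
            # Sliding window of width n over the characters.
            grams.update(t[i:i + n] for i in range(len(t) - n + 1))
        # Assumption: "ordinal" = ids assigned in sorted (code point) order.
        vocabs[n] = {g: idx for idx, g in enumerate(sorted(grams))}
    return vocabs


def encode(text, vocab, n):
    """Map a string to the ids of its character ngrams of size n."""
    t = text.lower()
    return [vocab[t[i:i + n]] for i in range(len(t) - n + 1)]


vocabs = char_ngram_vocabs(["Taco", "cat"])
print(vocabs[1])                       # 1gram vocabulary
print(encode("taco", vocabs[2], 2))    # 2gram ids for "taco"
```

Each ngram size keeps a separate vocabulary, matching the note above that every ngram has its own vocabulary.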
|
|
|
|
| # Stage 2 Battery Array Constellation |
| Pending the stage 1 pretrain proving usefully differentiated to the task: |
|
|
| 1. Three frequency selectors |
| * Each tuned via MSE gating to select the most likely candidates for recon accuracy to the task |
| * char, word, definition |
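
One possible reading of "MSE gating" is sketched below: each candidate reconstruction (char, word, definition) is scored by its mean squared error against the target, and the scores become soft selection weights. This is an assumption about the mechanism, not the planned implementation; `mse_gate` and `temperature` are hypothetical names, and the arrays are toy values.

```python
import numpy as np


def mse_gate(candidates, target, temperature=1.0):
    """Score candidate reconstructions by MSE and return soft gate weights.

    Assumption: lower MSE means a more likely candidate, so the gate is a
    softmax over negative errors.
    """
    names = list(candidates)
    errors = np.array([np.mean((candidates[k] - target) ** 2) for k in names])
    logits = -errors / temperature
    logits = logits - logits.max()          # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    best = names[int(np.argmin(errors))]    # hard selection
    return dict(zip(names, weights)), best


target = np.array([0.0, 1.0, 0.0, 1.0])
candidates = {
    "char": np.array([0.0, 1.0, 0.0, 0.9]),
    "word": np.array([0.5, 0.5, 0.5, 0.5]),
    "definition": np.array([1.0, 0.0, 1.0, 0.0]),
}
weights, best = mse_gate(candidates, target)
print(best, weights)
```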
|
|
| This will first use the geolip transformer structure to capture multispectral anchoring. |
| If that doesn't yield results, I'll switch to a full transformer system with concatenation to see if the structure holds. |
|
|
| If that doesn't work, I'll default to chalkboard mode and construct a better concatenation array system. |
|
|
| # The Vocabulary |
|
|
| This is a unique format of bitwise relational complexity for the preliminary tests. |
|
|
| For characters: encode a single character to a three channel trigram. |
|
|
| ```python |
| import numpy as np  # required by this excerpt |
|
| @staticmethod |
| def bytes_to_image(byte_chunk: np.ndarray, img_size: int, |
|                    patch_size: int = 4, |
|                    channels: int = 3) -> np.ndarray: |
|     """``[bytes_per_image]`` uint8 → ``[channels, img_size, img_size]`` float32 in [-1, 1]. |
|
|     Layout: byte stream packs into cells in row-major-across-patches, |
|     row-major-within-patch order. Cell ``i`` holds bytes |
|     ``byte_chunk[C*i : C*i + C]`` as the C-tuple of channel values, |
|     where C = ``channels``. |
|     """ |
|     gh = gw = img_size // patch_size |
|     cells_per_patch = patch_size * patch_size |
|     n_patches = gh * gw |
|     # Reshape: [n_patches, cells_per_patch, channels] |
|     rgb = byte_chunk.reshape(n_patches, cells_per_patch, channels).astype(np.float32) |
|     rgb = (rgb - 127.5) / 127.5  # → [-1, 1] |
|     per_patch = rgb.reshape(n_patches, patch_size, patch_size, channels) |
|     grid = per_patch.reshape(gh, gw, patch_size, patch_size, channels) |
|     # Permute (gh, gw, ps_r, ps_c, channel) → (channel, gh, ps_r, gw, ps_c) |
|     img = grid.transpose(4, 0, 2, 1, 3) |
|     img = img.reshape(channels, img_size, img_size) |
|     return img |
| ``` |
|
|
| ```python |
| import torch  # required by this excerpt |
|
| @staticmethod |
| def image_to_bytes(images: torch.Tensor, patch_size: int = 4, |
|                    channels: int = 3) -> torch.Tensor: |
|     """``[B, C, H, W]`` float → ``[B, n_cells_total, C]`` in {0..255}. |
|
|     Inverse of ``bytes_to_image`` for the same ``patch_size`` and |
|     ``channels``. Maps continuous [-1, 1] back to rounded uint8 byte |
|     values. C = ``channels``. |
|     """ |
|     B, C, H, W = images.shape |
|     assert C == channels and H == W and H % patch_size == 0, ( |
|         f"Need square C={channels}-ch image div by ps; got " |
|         f"{tuple(images.shape)}, ps={patch_size}, channels={channels}" |
|     ) |
|     gh = gw = H // patch_size |
|     ps = patch_size |
|     # (B, C, gh, ps, gw, ps) → (B, gh, gw, ps_r, ps_c, channel) |
|     x = images.reshape(B, channels, gh, ps, gw, ps) |
|     x = x.permute(0, 2, 4, 3, 5, 1) |
|     x = x.reshape(B, gh * gw * ps * ps, channels) |
|     # Recover bytes: float in [-1, 1] → byte in [0, 255] |
|     bytes_f = x * 127.5 + 127.5 |
|     return bytes_f.clamp(0, 255).round().to(torch.uint8) |
| ``` |
|
|
| So for the preliminary experiments we will translate directly to 3 channel text, as we've been doing. |
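
To make the 3 channel text translation concrete, here is a small NumPy-only round trip that follows the same cell layout as `bytes_to_image` and `image_to_bytes` above. The helper names `text_to_image` and `image_to_text` are hypothetical, and zero-padding the byte stream to fill the image is an assumption made for this sketch (it means genuine trailing NUL bytes would be lost on decode).

```python
import numpy as np


def text_to_image(text, img_size=8, patch_size=4, channels=3):
    """UTF-8 text -> [channels, img_size, img_size] float32 in [-1, 1]."""
    raw = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    capacity = img_size * img_size * channels
    if raw.size > capacity:
        raise ValueError(f"text needs {raw.size} bytes, image holds {capacity}")
    chunk = np.zeros(capacity, dtype=np.uint8)   # assumption: pad with zeros
    chunk[:raw.size] = raw
    g = img_size // patch_size
    cells = (chunk.astype(np.float32) - 127.5) / 127.5
    grid = cells.reshape(g, g, patch_size, patch_size, channels)
    # (gh, gw, ps_r, ps_c, channel) -> (channel, gh, ps_r, gw, ps_c)
    return grid.transpose(4, 0, 2, 1, 3).reshape(channels, img_size, img_size)


def image_to_text(img, patch_size=4):
    """Inverse of text_to_image for the same patch_size."""
    channels, size, _ = img.shape
    g = size // patch_size
    grid = img.reshape(channels, g, patch_size, g, patch_size)
    # (channel, gh, ps_r, gw, ps_c) -> (gh, gw, ps_r, ps_c, channel)
    flat = grid.transpose(1, 3, 2, 4, 0).reshape(-1)
    raw = np.clip(np.round(flat * 127.5 + 127.5), 0, 255).astype(np.uint8)
    return raw.tobytes().rstrip(b"\x00").decode("utf-8")


img = text_to_image("taco")
print(img.shape)
print(image_to_text(img))
```

With three channels, each cell carries three consecutive bytes, so the first cell of "taco" holds `t`, `a`, `c` and the second cell starts with `o`.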
|
|
| Currently, the primary finetune is using wikipedia-103 as a preliminary system. |
|
|
| We will need to train on the entirety of wordnet to reconstruct in sequence, which will provide encoder frequency association in the preliminary stages. |
|
|
| After this, we'll want our second probe. It will have an entirely different vocabulary translation matrix meant to target 2gram. |
|
|
| Our 1gram and 2gram probes will be our first catalyst pair to test downstream vocabulary differentiation. |
|
|
| My hunch is sentencepiece or something along those lines is more stable, but it won't matter in this model. |
|
|
| These are asking and answering a very different set of questions and producing a very different set of problems. |
|
|
|
|
| # 2gram Recon |
|
|
| The system requires combination tokens, which means I will be using the lexically analyzed wordnet frequency dataset to determine my preliminary pairs. |
|
|
| 2gram frequency will determine the order of the values, so nothing too specific for the prelim. With this I'll organize a translation matrix. |
|
|
| A binary matching UTF-8 "c" would be an entirely different value, while both translate to an identical UTF-8 relational value. |
|
|
| ["c", "a"] becomes ["z "] or something along those lines in direct 2gram translation; then we just translate with our downstream consumer. |
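
The frequency-ordered 2gram translation described above could be sketched as follows. Everything here is illustrative: `build_bigram_table`, `translate`, and `untranslate` are hypothetical names, the overlapping sliding window is an assumption, and the Unicode Private Use Area base (`0xE000`) is chosen only so the replacement symbols cannot collide with ordinary text.

```python
from collections import Counter

PRIVATE_USE_START = 0xE000  # assumption: symbols drawn from the Private Use Area


def build_bigram_table(texts, base=PRIVATE_USE_START):
    """Rank 2grams by frequency and assign each a single code point."""
    counts = Counter()
    for text in texts:
        t = text.lower()
        counts.update(t[i:i + 2] for i in range(len(t) - 1))
    # Most frequent first; ties broken alphabetically so the order is stable.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return {g: chr(base + rank) for rank, g in enumerate(ranked)}


def translate(text, table):
    """Replace every overlapping 2gram with its single-symbol translation."""
    t = text.lower()
    return "".join(table[t[i:i + 2]] for i in range(len(t) - 1))


def untranslate(symbols, table):
    """Downstream consumer: recover the text from the 2gram symbols."""
    inverse = {v: k for k, v in table.items()}
    grams = [inverse[s] for s in symbols]
    if not grams:
        return ""
    return grams[0] + "".join(g[1] for g in grams[1:])


table = build_bigram_table(["cat", "taco", "catalog"])
encoded = translate("cat", table)
print([hex(ord(ch)) for ch in encoded])
print(untranslate(encoded, table))
```

The Private Use Area in the Basic Multilingual Plane holds 6400 code points, so a full wordnet 2gram table may need a different or larger target range.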
|
|
| Extracting the behavior is related to behavior, not the actual vocabulary. Everything downstream is a consumer of the behavior, |
| not the curator of the behavior. |
|
|
| ## Likely permanent solution |
| Token testing and cumulative research on tokenization processing via sentencepiece and the like are required. |
|
|
| Something along the lines of the llama tokenizer, qwen tokenizer, or something relationally universal could work, but I'll need to determine this through testing. |