---
license: apache-2.0
base_model:
- nomic-ai/nomic-bert-2048
tags:
- symbolic
- classification
- text_masking
- image_feature
- text_feature
- experimental
- categorical
- similarity
- teacher
---
A great deal of experimentation and testing has now been done on this model. It is more than capable of handling categorization, classification, text masking, similarity detection, similarity offset comparison, and many more tasks that I haven't listed.
It's small, so the cracks show. For version 2 of this model I plan a much more diverse categorical array with a much larger set of symbolic tokens, trained with a more expansive set of masking processes, each more carefully tuned to avoid damaging the alternative pretrained pathways.
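As a minimal usage sketch, assuming the checkpoint loads through the standard `transformers` masked-LM API (the repo id below is a placeholder for wherever this checkpoint is published, `trust_remote_code=True` is needed because the nomic-bert-2048 base ships custom modeling code, and the prompt assumes the semantic tokens listed later in this card are registered in the tokenizer):

```python
# Hedged usage sketch: the repo id is a placeholder, not the confirmed location
# of this checkpoint. trust_remote_code=True is required by the nomic-bert base.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo_id = "your-namespace/bert-beatrix-2048"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Masked categorical prompt; assumes the semantic tokens listed later in this
# card are registered as special tokens in the tokenizer.
text = "<subject> a knight in <material> [MASK] armor under <lighting> moonlight"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Print the top-5 predictions for every [MASK] position.
batch_ids, positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
for b, p in zip(batch_ids, positions):
    top5 = logits[b, p].topk(5).indices.tolist()
    print(tokenizer.convert_ids_to_tokens(top5))
```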








# SEMANTIC TOKEN STATISTICS
* Average similarity between tokens: 0.232
* Std dev of similarities: 0.043
* Max similarity: 0.368
* Min similarity: 0.082
* Most similar token pairs:
  * `<intent>` ↔ `<style>`: 0.368
  * `<hair_style>` ↔ `<hair_length>`: 0.336
  * `<grid>` ↔ `<fabric>`: 0.316
  * `<footwear>` ↔ `<jewelry>`: 0.315
  * `<grid>` ↔ `<offset>`: 0.315

# SHUNT TOKEN STATISTICS
* Average distance between shunts: 1.146
* Std dev of distances: 0.040
* Min distance: 1.056
* Max distance: 1.262

# CATEGORY ANALYSIS
* Subject/Object: 5 tokens, avg within-category similarity 0.234
* Appearance: 4 tokens, avg within-category similarity 0.260
* Clothing: 5 tokens, avg within-category similarity 0.280
* Material/Texture: 5 tokens, avg within-category similarity 0.251
* Spatial/Style: 7 tokens, avg within-category similarity 0.270

# DIMENSIONALITY ANALYSIS
* Variance explained by first 10 PCs: 37.6%
* Components needed for 90% variance: 1
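The numbers above come from an offline analysis pass whose script isn't included in this card. As a rough sketch, similar statistics could be recomputed from the checkpoint's input embeddings along these lines (the repo id is a placeholder and the semantic token list is abbreviated; the full list appears further down):

```python
# Illustrative reconstruction only; not the original analysis script.
import itertools
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo_id = "your-namespace/bert-beatrix-2048"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)

semantic = ["<subject>", "<pose>", "<material>", "<lighting>", "<intent>", "<style>"]  # abbreviated
shunts = [f"[SHUNT_{1000000 + i}]" for i in range(26)]

emb = model.get_input_embeddings().weight.detach()

def vec(tok: str) -> torch.Tensor:
    """Input-embedding row for a (special) token, assuming it exists in the vocab."""
    return emb[tokenizer.convert_tokens_to_ids(tok)]

# Pairwise cosine similarity between semantic tokens.
sims = [torch.cosine_similarity(vec(a), vec(b), dim=0).item()
        for a, b in itertools.combinations(semantic, 2)]
print("avg semantic similarity:", sum(sims) / len(sims))

# Pairwise Euclidean distance between shunt tokens.
dists = [torch.dist(vec(a), vec(b)).item()
         for a, b in itertools.combinations(shunts, 2)]
print("avg shunt distance:", sum(dists) / len(dists))

# Variance explained by the leading principal components of the semantic embeddings.
mat = torch.stack([vec(t) for t in semantic])
mat = mat - mat.mean(dim=0)
s = torch.linalg.svdvals(mat)
var = s**2 / (s**2).sum()
print("variance in first 10 PCs:", var[:10].sum().item())
```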
# Release - bert-beatrix-2048 v1
The pretrained masking window has been entirely saturated, with the focus on expanding the masking potential through subject and shunt allocation tokenization systems.

What we have here is our first subject-burned, saturated RoPE prototype. Though I must admit it took longer than expected to train, it will provide a perfect excitation catalyst for the next step.

After more research, I learned I could probably saturate this context window in a fraction of the steps; however, this is a full pretrain burn, so I let it go the distance.

There is much to be learned here, especially with diffusion embedding shaping.

There may be faults in its core. I've had some issues with the vocab and tokenization structure having a valuation misalignment, but I decided nonetheless to let it complete with those faults intact; there's no telling what that may teach alongside. The actual data itself should reflect the correct tokenization, I just need to make sure it loads correctly. If not, I'll retrain her.
## 2,008,000 total steps at batch size 1024
This is the 26-category finetune of nomic-bert-2048's encoder:
* 130,000,000 - 4-30 masked token samples with 80% mask rate
  * Learned timestep associative noise and why it matters.
* 253,952,000 - 77 token samples with 20% mask rate
  * Learned context over the noise.
* 775,000,000 - 144-256 token samples with 30% mask rate
* 453,800,000 - 385-512 token samples with 30% mask rate
* 227,328,000 - 1024 token samples with 30% mask rate
* 234,112,000 - 2048 token samples with 30% mask rate
The final accuracy is about 95%, give or take, but that figure is measured against the original pretrained data. The point is that the model needs to associate information with information more cleanly, rather than fully enforcing the various logistical and subjective elements instantiated from external elements.

Total samples: 2,056,192,000
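The training pipeline itself isn't released; the sketch below only illustrates the length-bucketed mask rates listed above, with everything else (collator choice, fallback behavior) assumed:

```python
# Illustrative sketch of the length-bucketed masking schedule listed above.
# Bucket boundaries and mask rates are copied from this card; the collator
# choice and fallback are assumptions, not the actual training pipeline.
from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerBase

# (max_sequence_length, mask_rate) buckets from the schedule above.
BUCKETS = [
    (30, 0.80),    # 4-30 masked token samples
    (77, 0.20),    # 77 token samples
    (256, 0.30),   # 144-256 token samples
    (512, 0.30),   # 385-512 token samples
    (1024, 0.30),  # 1024 token samples
    (2048, 0.30),  # 2048 token samples
]

def collator_for_length(tokenizer: PreTrainedTokenizerBase, seq_len: int) -> DataCollatorForLanguageModeling:
    """Return an MLM collator using the mask rate of the bucket covering seq_len."""
    for max_len, rate in BUCKETS:
        if seq_len <= max_len:
            return DataCollatorForLanguageModeling(
                tokenizer=tokenizer, mlm=True, mlm_probability=rate
            )
    # Sequences beyond 2048 tokens were not part of the schedule; default to 30%.
    return DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.30)
```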
The model has learned to associate certain masked patterns with their categories and special tokens.
It turns WHAT YOU WANT into something A BIT more reliable - in a vector-rich modality meant to be subjectively cohesive and applied to categorize SOMETHING ELSE for processing and decoding.
This will likely result in many variant and incorrect pathways if you try to use it as an LLM, or if you try to use it to TRAIN an LLM. However, I'm going to do it anyway, because I can.
---
HOWEVER, the deterministic nature of subjectivity... will have a very crucial role when being shaped by... much more intelligent harmonic influence.
The REAL experiment can now start.

The model's 26 semantic category tokens:

```
<subject>
<subject1>
<subject2>
<pose>
<emotion>
<surface>
<lighting>
<material>
<accessory>
<footwear>
<upper_body_clothing>
<hair_style>
<hair_length>
<headwear>
<texture>
<pattern>
<grid>
<zone>
<offset>
<object_left>
<object_right>
<relation>
<intent>
<style>
<fabric>
<jewelry>
```

With the 26 categorical shunts:

```
[SHUNT_1000000]
[SHUNT_1000001]
[SHUNT_1000002]
[SHUNT_1000003]
[SHUNT_1000004]
[SHUNT_1000005]
[SHUNT_1000006]
[SHUNT_1000007]
[SHUNT_1000008]
[SHUNT_1000009]
[SHUNT_1000010]
[SHUNT_1000011]
[SHUNT_1000012]
[SHUNT_1000013]
[SHUNT_1000014]
[SHUNT_1000015]
[SHUNT_1000016]
[SHUNT_1000017]
[SHUNT_1000018]
[SHUNT_1000019]
[SHUNT_1000020]
[SHUNT_1000021]
[SHUNT_1000022]
[SHUNT_1000023]
[SHUNT_1000024]
[SHUNT_1000025]
```
Each shunt is meant to activate cross-categorical conceptualization within its 77-token window.
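As one possible way to exercise that, here is a hedged sketch (an interpretation of the scheme, not released tooling; the repo id is a placeholder): prepend a shunt to a window of at most 77 tokens and read the shunt position's final hidden state as the window's cross-categorical summary.

```python
# Interpretive sketch, not released tooling. Assumes the checkpoint loads through
# AutoModel and returns last_hidden_state like a standard encoder; the repo id is
# a placeholder.
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "your-namespace/bert-beatrix-2048"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

window = "<subject> knight <material> steel <lighting> moonlit courtyard <intent> guard duty"
inputs = tokenizer(
    "[SHUNT_1000000] " + window,
    truncation=True,
    max_length=77,  # each shunt is scoped to a 77-token window
    return_tensors="pt",
)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)

# Read the shunt position's hidden state as the window's cross-categorical summary.
shunt_id = tokenizer.convert_tokens_to_ids("[SHUNT_1000000]")
shunt_pos = (inputs["input_ids"][0] == shunt_id).nonzero(as_tuple=True)[0]
shunt_vector = hidden[0, shunt_pos]
print(shunt_vector.shape)
```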