This is the second model in the Biggerbrain lineup.
It has 136M parameters and was trained on an RTX 4060 using the training pipeline in the app.py file. The architecture is MoLaMART: a Mixture of Layers (a modified dense MoE) combined with a Memory Augmented Recurrent Transformer. It has a 'recurrent' loop in the center half of the forward pass that runs for 1-3 iterations (the iteration count is preset by the user), with per-iteration embeddings injected into the MoL layers. The FFN uses a 2.7x expansion ratio and SwiGLU for non-linearity and expressiveness. Each expert's first nn.Linear expands the current token's 768-dimensional value to 4147 dimensions, which SwiGLU halves; the result then goes through another nn.Linear that shrinks it back to 768 dimensions and is multiplied by that expert's router value (softmaxed before being used). Finally, all of the experts' outputs are added together to form the layer's output.
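For readers who want the gist in code, here is a minimal PyTorch sketch of one expert and the dense every-expert-runs routing described above. The class names, the expert count, and the even-rounded 4148 expansion (so SwiGLU can split it cleanly) are my illustration choices, not the actual ai_extras implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One FFN expert: expand, gate with SwiGLU, shrink back to model dim."""
    def __init__(self, dim=768, expanded=4148):  # ~2.7x ratio, rounded even
        super().__init__()
        self.up = nn.Linear(dim, expanded)         # expand the token's value
        self.down = nn.Linear(expanded // 2, dim)  # shrink back to 768

    def forward(self, x):
        value, gate = self.up(x).chunk(2, dim=-1)  # SwiGLU halves the expansion
        return self.down(value * F.silu(gate))

class MoLLayer(nn.Module):
    """Dense MoE: every expert runs; outputs are router-weighted and summed."""
    def __init__(self, dim=768, n_experts=4):  # expert count assumed
        super().__init__()
        self.experts = nn.ModuleList(SwiGLUExpert(dim) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):
        weights = self.router(x).softmax(dim=-1)               # softmaxed router values
        stacked = torch.stack([e(x) for e in self.experts], dim=-1)
        return (stacked * weights.unsqueeze(-2)).sum(dim=-1)   # weighted sum of experts
```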
Test here: https://huggingface.co/spaces/Skull18500/BIggerbrain2_136m
The model is intended to improve on V1's reasoning while lowering forward-pass RAM usage. V2 integrates MLA (Multi-head Latent Attention) and GQA (Grouped Query Attention), and 1/3 of the model's layers use a reduced sequence length of 256 tokens while the other layers use a sequence length of 640 tokens. The model integrates partial RoPE (Rotary Positional Embeddings), where RoPE affects only half of the latent-space dimensions. The latent space is 384-dimensional, while the standard dimension of the model is 768. The model has a 4-head hashed ENgram module with trigram token slots, 12192 in total, with an internal dimension of 384 that is scaled up to the model's internal dimension. The ENgram output is gated by a linear layer run through torch.sigmoid, with the gate's bias initialized to -1 and its weights initialized to 0.0; during training the model opened the gate, averaging a ~50% injection rate from the ENgram output into the current value of the model's thoughts, injected just before the iterative loop.
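Here is a hypothetical sketch of that sigmoid gate. Whether the gate reads the hidden state or the ENgram output, and whether the injection is a plain add, are my assumptions; the initialization matches the description above:

```python
import torch
import torch.nn as nn

class ENgramGate(nn.Module):
    """Learned gate on the ENgram output, injected before the iterative loop."""
    def __init__(self, dim=768):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        nn.init.zeros_(self.gate.weight)         # weights initialized to 0.0
        nn.init.constant_(self.gate.bias, -1.0)  # bias initialized to -1

    def forward(self, hidden, engram_out):
        # At init the gate sits at sigmoid(-1) ~ 0.27 everywhere; training
        # reportedly opened it to an average ~50% injection rate.
        g = torch.sigmoid(self.gate(hidden))
        return hidden + g * engram_out
```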
The model achieved a best cross-entropy loss of 4.3 on a custom mix of the linked datasets within around 1 hour 45 minutes of training on my RTX 4060, with 5000 warmup steps, a max LR of 3e-4, a batch size of 4, and 32 accumulation steps. However, due to using weighted random iteration values, the model's loss fluctuated wildly during training: the shuffled chunks of the combined 20 GB dataset are heavily randomized, and on top of that there is an 85% chance of 1 iteration (really no iterations, just a straight forward pass; it is called '1' iteration because it runs inside the loop), a 10% chance of 2 iterations, and a 5% chance of 3. With the embedding layer initialized to an STD of ~0.01, the model reached an embedding STD of 0.0182 around 15 hours into the training run and was beginning to lower its average loss. However, as it kept seeing new data (only 1/28th of the way through the dataset), its loss fluctuated, with the best average at 2.3 and the worst at ~4.9. Still, it continued a bouncy trend downward.
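The iteration schedule itself is trivial to reproduce; a one-function sketch (the helper name is mine, not taken from app.py):

```python
import random

def sample_iterations() -> int:
    """Pick the recurrent loop count: 85% chance of 1, 10% of 2, 5% of 3."""
    return random.choices([1, 2, 3], weights=[0.85, 0.10, 0.05])[0]
```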
If you intend to test this, prefer running app.py. It is a case-insensitive CLI, and the code is simple and self-explanatory for the commands. If you prefer to skip the CLI and integrate this model directly, call
model = initmodel(device)
Where device is the device you want to create the model on; by default it sets the model to BF16. It returns a reference to the model and also compiles it. Then, for testing, call
think(prompt, model, max_length=100, iter=3, top_k=10, temperature=1.0, raw=False, rep_penalty=1.5)
Where prompt is the string input, model is the model initialized above, iter is the integer number of iterations the model makes (not above 3), and raw controls whether chat formatting is added to the input. ai_extras is a single-file library of modules and classes for the Biggerbrain lineup; if you are interested in the model's architecture, it is a good file to read. If you would like to install all dependencies in one step, run pip install -r requirements.txt
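Putting it together, a plausible end-to-end test looks like this, assuming app.py exposes initmodel and think for import (if importing app.py launches the CLI, copy the calls into the file instead):

```python
import torch
from app import initmodel, think  # assumed import path

device = "cuda" if torch.cuda.is_available() else "cpu"
model = initmodel(device)  # BF16 + torch.compile by default
print(think("What is the capital of France?", model,
            max_length=100, iter=1, top_k=10,
            temperature=1.0, raw=False, rep_penalty=1.5))
```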
And then, if you want to train it, you will need a pre-compiled .bin binary file full of uint16 token IDs; then construct the ai_extras dataset class: StreamDataset("total_dataset.bin", seq_len=model.sequencelength)
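A hedged sketch of producing that file (the tokenizer call is a placeholder; use whatever tokenizer the lineup ships with, and note that uint16 caps token IDs at 65535):

```python
import numpy as np
from ai_extras import StreamDataset  # assumed import path

# Placeholder tokenizer call; substitute the lineup's actual tokenizer.
token_ids = my_tokenizer_encode(open("corpus.txt", encoding="utf-8").read())
np.asarray(token_ids, dtype=np.uint16).tofile("total_dataset.bin")

dataset = StreamDataset("total_dataset.bin", seq_len=model.sequencelength)
```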
If you would like to support the future development of the open-source, MIT-licensed Biggerbrain lineup, please donate to my Ko-fi. Any little bit helps: