- Generator: create your own dataset from scratch
- Converter: use existing datasets (Hugging Face support) with reasoning traces to match our SYNTH style
- DEEP Mode: multiple agents working together in various configurations
- Multi-turn Support: pass one DEEP run, let the model ask follow-up questions, and choose who should respond using SYNTH-like thinking
- Firebase/Firestore: download your data directly as a JSONL file or upload it to your Firestore (production mode)
- Data Preview: have data but unsure what's inside? Explore it directly!
- Verifier View: evaluate generated data, remove duplicates, assign ratings
Big news! NeuroBLAST, the outstanding new architecture, has officially arrived on HF! After three intense months of training my 1.9 billion-parameter SLM on my trusty RTX 3090 Ti, I'm happy to announce the results. While it's not perfect just yet, I've dedicated countless hours to optimizing costs while crafting clever layer connections that mimic the brain's centers. Plus, I've introduced a new memory-like layer that's sure to turn heads! I can't wait to dive deep into this journey in my upcoming blog post. Stay tuned for the full scoop!
✔️ A modification of the cross-entropy loss function designed specifically for training LLMs.
✔️ A twist on the standard cross-entropy loss that emphasizes outlier prediction errors and dynamically normalizes token-level variance.
✔️ More stable and efficient training, leading to models that generalize better.
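The post doesn't include the released implementation, so here is a minimal sketch of the general idea: normalize per-token errors by the batch's token-level variance and up-weight outlier prediction errors. The function name, the z-score threshold, and the `outlier_weight` factor are illustrative assumptions, not the actual loss.

```python
import math

def weighted_normalized_ce(token_losses, outlier_weight=2.0):
    """Illustrative loss (not the released implementation): normalize
    per-token cross-entropy by the token-level standard deviation, then
    up-weight tokens whose error is an outlier (z-score above 1.0).

    token_losses: per-token cross-entropy values (floats).
    """
    n = len(token_losses)
    mean = sum(token_losses) / n
    var = sum((l - mean) ** 2 for l in token_losses) / n
    std = math.sqrt(var) or 1.0  # guard against zero variance
    total = 0.0
    for l in token_losses:
        z = (l - mean) / std                     # variance-normalized error
        w = outlier_weight if z > 1.0 else 1.0   # emphasize outlier errors
        total += w * l
    return total / n
```

In a real training loop the same weighting would be applied to the unreduced per-token loss tensor before averaging.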
Check it out, give it a spin, and let me know what you think!
Licensed under the Apache 2.0 license and ready to use. Happy training!
After hours of working with GitHub Copilot to organize the code, I'm keen to announce the release of Blurred Thoughts Supervised-Finetuning (BT-SFT), a new method for fine-tuning LLMs to produce more diverse and creative responses.
BT-SFT introduces:
✅ A smart tokenization method that randomly masks tokens within <think> ... </think> tags, encouraging the model to generate diverse responses that align better with its own probability distribution instead of memorizing the thought process from distilled data.
✅ A reward function that ensures responses are well-structured.
Can we teach a model to think completely on its own without reinforcement learning? Actually, yes.
We can do straightforward supervised fine-tuning using a relatively simple trick: blurring a part of CoT thoughts. But why is this effective?
We observed that various models differ in their thinking processes, and fine-tuning one model on another modelโs thoughts (CoT) can sometimes be inefficientโoften resulting in the model simply memorizing reasoning rather than learning how to actually think.
I discovered that this process can still be efficient if we clearly indicate when the model should start and stop thinking, uncover only part of the CoT together with the expected answer, and blur the remaining part of the CoT. This approach allows the model to learn only a portion of the thought process while still arriving at the expected answer.
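The blurring step above can be sketched as label masking: randomly hide a fraction of the chain-of-thought labels so those tokens are skipped by the loss, while the answer stays fully supervised. This is a minimal sketch, assuming a Hugging Face-style ignore index of -100; the function name and `blur_prob` default are my own, not the released BT-SFT code.

```python
import random

IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def blur_thought_labels(labels, think_start, think_end, blur_prob=0.5, seed=None):
    """Illustrative BT-SFT-style masking: randomly "blur" labels inside
    the <think> ... </think> span so the model learns only part of the
    thought process while the final answer remains fully supervised.

    labels:      list of label token ids for one training example
    think_start: index of the first token inside <think> ... </think>
    think_end:   index one past the last token inside the tags
    """
    rng = random.Random(seed)
    blurred = list(labels)
    for i in range(think_start, think_end):
        if rng.random() < blur_prob:
            blurred[i] = IGNORE_INDEX  # this CoT token no longer contributes
    return blurred
```

Tokens outside the think span, including the expected answer, are never masked, which matches the idea of clearly marking where thinking starts and stops.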
To see this in action, check out my experimental BT-SFT of the meditsolutions/Llama-3.2-SUN-2.5B-chat model, which was fine-tuned on 151 million tokens from the Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B dataset.
Enjoy! ๐
PS. If you were curious enough to read this, leave me a comment. It's always nice to chat with open-minded and intelligent people.
Ok, my 14B DeepSeek R1 merge with Qwen2.5 1M is really hot right nowโit's got 2.6k downloads! It's sitting pretty as the top trending model on the third page. ๐ฅ
Check out Qwen-2.5-14B-DeepSeek-R1-1M! This one's a cool blend of the latest Qwen 2.5 14B with its massive 1 million token context window and the DeepSeek R1 distillation of the Qwen 2.5 14B base model.
Are you fascinated by reasoning models? If so, you won't want to miss my latest project! I've implemented multiple path generations to supercharge the reasoning capabilities of O1-like models. Explore how this work can elevate your model in complex reasoning tasks!
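The project's actual implementation isn't shown here; as a hedged sketch of the general multiple-path idea, one common scheme (often called self-consistency) samples several reasoning paths at non-zero temperature and keeps the final answer that the most paths agree on. The function name and tuple format below are my own assumptions.

```python
from collections import Counter

def select_by_consistency(paths):
    """Pick the final answer supported by the most reasoning paths.

    paths: list of (reasoning_text, final_answer) tuples produced by
    sampling the model several times at non-zero temperature.
    Returns the majority answer and its vote count.
    """
    votes = Counter(answer for _, answer in paths)
    answer, count = votes.most_common(1)[0]
    return answer, count
```

More elaborate variants replace the majority vote with a learned verifier or reward model that scores each path before selection.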
I kindly invite you to try my experimental Llama 3.2 3B with o1-like thinking.
It utilizes <Thought> sections only when needed, so don't be surprised when it doesn't. It also has a minor bug that requires further fine-tuning (it sometimes starts with <|python_tag|> instead of <Thought>).
Enjoy!
Give some likes and whatever to make me feel better and motivated to keep going ๐
Exciting times to come? We are working on a layer "self-esteem" technique that scores each layer's contribution to the final prediction. For now, it already unlocks a lot of knowledge stored in the weights that we couldn't force the model to extract through further fine-tuning!
We built a new small language model, SmolLM2-MedIT-Upscale-2B, based on SmolLM2-1.7B-Instruct from Hugging Face. The premise was simple: increasing the dimensionality of the vectors in the attention layers would positively impact the model's capabilities.
What did we prove? In total, not much really, since we don't have the original trained under the same conditions as our upscale. However...
1. We scaled up the model without losing its quality.
2. We confirmed that the method we devised works.
3. After extremely short fine-tuning, the model achieved much better results on IFEval than the original (53.68 vs 64.29) and a higher overall average score on the Open LLM Leaderboard (14.75 vs 15.17).
I consider this a big success ๐, since surpassing the original in metrics is often very time-consuming, generates high costs, and doesn't always work out.
Meanwhile, we're moving forward, training SmolLM2 400M Instruct as an upscale of 136M.
We're curious about how increasing the base and intermediate vectors will affect the model's quality. We'll compare it to the original and the 360M Instruct version released by Hugging Face.
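The posts don't spell out the exact upscaling recipe, so here is a minimal sketch of one common function-preserving way to widen a projection: append zero-initialized rows so the wider layer reproduces the old outputs in its first coordinates, leaving the new capacity to be filled in by the short fine-tuning stage. The function names and plain-list matrix format are illustrative, not the SmolLM2-MedIT-Upscale code.

```python
def widen_projection(weight, new_out_dim):
    """Function-preserving widening of a linear projection (illustrative,
    not the exact upscale recipe): append zero rows so the widened layer
    initially reproduces the original outputs in its first coordinates.

    weight: matrix as a list of rows (out_dim x in_dim lists of floats).
    Returns a new_out_dim x in_dim matrix.
    """
    out_dim, in_dim = len(weight), len(weight[0])
    assert new_out_dim >= out_dim, "can only widen, not shrink"
    extra = [[0.0] * in_dim for _ in range(new_out_dim - out_dim)]
    return [row[:] for row in weight] + extra

def matvec(weight, x):
    """Plain matrix-vector product, used to check output preservation."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]
```

Because the appended rows are zero, the widened projection's extra outputs start at 0.0 and the original outputs are unchanged, so quality is preserved at the moment of upscaling.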