Sorry I’m late - took a short break

#1
by StentorLabs - opened

I came back after a few days away and genuinely did not expect to find this. Seeing someone take Stentor-30M, scale it to 142M, run a three-stage training pipeline across 2.33 billion tokens, and then credit my little Kaggle experiment as what sparked their journey into deep learning — that hit differently. This is exactly why I shared it openly. You didn’t just fine-tune it, you believed in it more than I did. Thank you, stas122. Keep building. 🙏

Heya!
Honestly, HF isn't very convenient for communication, and I only just noticed this thread - sorry!

We're still at it :3
Btw, I checked out the second Stentor, and the result's solid. I've gotten into tokenizers myself lately - been messing with SentencePiece, training it on different data, mostly Cyrillic but English too. Wanna team up? We could train a tokenizer specifically on FineWeb, like 2^12 tokens or something. Overall, if we normalize the text and get rid of caps entirely, we could get a really dense vocab.


Haha yeah, HF discussions are genuinely terrible for this.

And yeah, I'm really into the vocab efficiency idea. In Stentor2 I went 32K → 8K, and even then the embedding table was eating 16% of the model just as a lookup table, so there's definitely something here worth exploring further.

Also, honestly, the TokenMonster vocab has been a pain. Users still need to pip install tokenmonster as a separate dependency, AutoTokenizer requires trust_remote_code=True, and I had to write a bunch of wrapper code that would've been completely unnecessary with a native HF tokenizer. A proper vocab trained on FineWeb would kill all of that.

One thing I'd love to explore though — have you looked into BPE instead of SentencePiece? Most modern LLMs including LLaMA use BPE, and the HuggingFace tokenizers library has a built-in BPE trainer that produces a fully native tokenizer.json with zero extra dependencies and perfect AutoTokenizer support out of the box. Since LLaMA already uses BPE it'd also be way more compatible with the rest of the ecosystem.
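For what it's worth, here's roughly what that looks like with the HF `tokenizers` library (a sketch only - the corpus, vocab size, and file name are placeholders; I went byte-level in the GPT-2 style so any input byte is coverable):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE sketch. Corpus and vocab size are toy placeholders;
# the real run would stream FineWeb and use 4096 / 8192 / 16384.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "pack my box with five dozen liquor jugs",
] * 100

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
trainer = trainers.BpeTrainer(
    vocab_size=512,  # toy; the byte alphabet alone is 256 entries
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# One self-contained file: no trust_remote_code, no extra pip installs.
tokenizer.save("tokenizer.json")
print(tokenizer.encode("the quick brown fox").tokens)
```

Loading it back is just `Tokenizer.from_file("tokenizer.json")`, and wrapping it in `PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")` gives full AutoTokenizer support.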

On the vocab size — I’m totally fine if you want to make 4K, 8K, and 16K. I just wouldn’t want to do only 4K. From what I’ve seen, 4K starts to hurt more than it helps once you're above ~10M parameters because the tokenization overhead gets pretty aggressive. Having 8K and 16K alongside it would give a much cleaner comparison and cover the sub-50M range better.

My availability is kind of random, so async works way better for me, but if you're down for that I think this could be really cool. Hit me up over email and we can actually talk properly lol

I couldn't find your email anywhere on your profile, so here's mine: lunfromluna@gmail.com
Or just send me yours.
