Evaluation Summary: willow-alpha-base

#1
by GODELEV - opened

MEMORANDUM

TO: North-ML1 Team
DATE: June 9, 2026
SUBJECT: Evaluation Summary: willow-alpha-base

I have run a comprehensive evaluation suite on North-ML1/willow-alpha-base. Having looked closely at the data, I want to thank your team for putting together a solid foundation with this base architecture.

The evaluation metrics demonstrate stable training and highly encouraging capabilities across standard language modeling, grammar, and reasoning benchmarks, especially considering its role as a base checkpoint.

Benchmark Evaluation Metrics

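| Category | Benchmark | Metric | Score / Value | Status |
| --- | --- | --- | --- | --- |
| Linguistics & Grammar | COPA | Accuracy | 64.00% | Success |
| Linguistics & Grammar | BLiMP | Accuracy | 59.23% | Success |
| Commonsense & Reasoning | PIQA | Normalized Accuracy | 53.86% | Success |
| Commonsense & Reasoning | WinoGrande | Accuracy | 50.67% | Success |
| Commonsense & Reasoning | TruthfulQA MC2 | Accuracy | 48.74% | Success |
| Commonsense & Reasoning | BoolQ | Accuracy | 40.21% | Success |
| Commonsense & Reasoning | SWAG | Normalized Accuracy | 29.13% | Success |
| Commonsense & Reasoning | HellaSwag | Normalized Accuracy | 26.71% | Success |
| Commonsense & Reasoning | RACE | Accuracy | 23.16% | Success |
| Commonsense & Reasoning | CommonsenseQA | Accuracy | 20.31% | Success |
| Academic & Knowledge | SciQ | Normalized Accuracy | 35.60% | Success |
| Academic & Knowledge | ARC-Easy | Normalized Accuracy | 34.68% | Success |
| Academic & Knowledge | ARC-Challenge | Normalized Accuracy | 25.60% | Success |
| Academic & Knowledge | OpenBookQA | Normalized Accuracy | 25.00% | Success |
| Academic & Knowledge | MMLU | Accuracy | 23.89% | Success |
| Language Modeling | LAMBADA | Accuracy | 0.23% | Success |
| Language Modeling | WikiText-2 | Word Perplexity | 12524.42 | Success |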
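
When reading these scores, it can help to compare each accuracy with what random guessing would give. The sketch below does that for a handful of rows. The chance levels are assumptions based on the usual number of answer options per task (two for COPA, BLiMP, PIQA, WinoGrande and BoolQ; four for HellaSwag and MMLU); they are not part of the evaluation output, and the helper name is hypothetical.

```python
# Accuracies (in percent) copied from the benchmark results in this post.
SCORES = {
    "COPA": 64.00,
    "BLiMP": 59.23,
    "PIQA": 53.86,
    "WinoGrande": 50.67,
    "BoolQ": 40.21,
    "HellaSwag": 26.71,
    "MMLU": 23.89,
}

# ASSUMED random-guess accuracy (in percent), from the typical number of
# answer options per task. These values are not reported by the evaluation.
CHANCE = {
    "COPA": 50.0,
    "BLiMP": 50.0,
    "PIQA": 50.0,
    "WinoGrande": 50.0,
    "BoolQ": 50.0,
    "HellaSwag": 25.0,
    "MMLU": 25.0,
}


def margin_over_chance(scores, chance):
    """Return score minus assumed chance level, in percentage points."""
    return {
        name: round(scores[name] - chance[name], 2)
        for name in scores
        if name in chance
    }


margins = margin_over_chance(SCORES, CHANCE)

# Print benchmarks from largest to smallest margin.
for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12} {margin:+.2f} pts vs. assumed chance")
```

A positive margin means the model beats the assumed random baseline on that task; a margin near zero or below it means the score is hard to distinguish from guessing under these assumptions.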

Final Evaluation Takeaway

Your model's strongest results are in causal reasoning on COPA (64.00%), alongside competitive linguistic baselines on BLiMP (59.23%) and physical commonsense on PIQA (53.86%).

While the high perplexity on WikiText-2 and the lower baseline on LAMBADA suggest room for further alignment or longer context window exposure, these results indicate a highly capable and balanced foundational model ready for instruction tuning and specialized alignment pipelines.
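
For reference, word-level perplexity is conventionally the exponential of the average negative log-likelihood per word. Here is a minimal sketch of that formula, assuming natural-log likelihoods summed over the whole text and a simple word count. The function name and the toy numbers are hypothetical and are not taken from this evaluation run.

```python
import math


def word_perplexity(total_log_likelihood: float, num_words: int) -> float:
    """exp(-(sum of natural-log likelihoods) / (number of words))."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return math.exp(-total_log_likelihood / num_words)


# Toy example (made-up numbers): a model that assigns probability 1/1000
# to each of 50 words has a word perplexity of about 1000.
toy_log_likelihood = 50 * math.log(1 / 1000)
ppl = word_perplexity(toy_log_likelihood, 50)
print(f"toy word perplexity: {ppl:.2f}")
```

Read this way, a word perplexity of 12524.42 corresponds to a geometric-mean per-word probability of roughly 1/12524 under the model.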

Excellent job on this alpha release. I look forward to seeing how the Willow series develops!

Best regards,
Akshit

North ML org

Hey! Could you also open a PR adding the, what's it called, .eval_results to the GGUF version of the model?
Also, for clarification, the model is actually named Forge-1V. That is an early checkpoint.

Done 👍
Btw, can you give me a peek at the training setup?
Like the hardware you are using, the time to train the full model, etc.
Only if you can; please don't mind my asking.

North ML org

I literally forgot the time 😭
But ~4-9 hours for CPT/pretraining (yes I pretrained from scratch), SFT and more CPT.
I'm still training ...-alpha, the next checkpoint (full name will be revealed later)

North ML org

and A10G + L40S on Modal = hardware!!!
