Evaluation Summary: willow-alpha-base
MEMORANDUM
TO: North-ML1 Team
DATE: June 9, 2026
SUBJECT: Evaluation Summary: willow-alpha-base
I have run a comprehensive evaluation suite on North-ML1/willow-alpha-base. Having looked closely at the data, I want to commend your team for putting together a solid foundation with this base architecture.
The evaluation metrics demonstrate stable training and highly encouraging capabilities across standard language modeling, grammar, and reasoning benchmarks, especially considering its role as a base checkpoint.
Benchmark Evaluation Metrics
| Category | Benchmark | Metric | Score / Value | Status |
|---|---|---|---|---|
| Linguistics & Grammar | COPA | Accuracy | 64.00% | Success |
| | BLiMP | Accuracy | 59.23% | Success |
| Commonsense & Reasoning | PIQA | Normalized Accuracy | 53.86% | Success |
| | WinoGrande | Accuracy | 50.67% | Success |
| | TruthfulQA MC2 | Accuracy | 48.74% | Success |
| | BoolQ | Accuracy | 40.21% | Success |
| | SWAG | Normalized Accuracy | 29.13% | Success |
| | HellaSwag | Normalized Accuracy | 26.71% | Success |
| | RACE | Accuracy | 23.16% | Success |
| | CommonsenseQA | Accuracy | 20.31% | Success |
| Academic & Knowledge | SciQ | Normalized Accuracy | 35.60% | Success |
| | ARC-Easy | Normalized Accuracy | 34.68% | Success |
| | ARC-Challenge | Normalized Accuracy | 25.60% | Success |
| | OpenBookQA | Normalized Accuracy | 25.00% | Success |
| | MMLU | Accuracy | 23.89% | Success |
| Language Modeling | LAMBADA | Accuracy | 0.23% | Success |
| | WikiText-2 | Word Perplexity | 12524.42 | Success |
Final Evaluation Takeaway
Your model showcases impressive strengths on COPA (64.00%), a causal-reasoning benchmark built around choosing the more plausible alternative, alongside competitive linguistics baselines on BLiMP (59.23%) and physical commonsense on PIQA (53.86%).
While the high perplexity on WikiText-2 and the lower baseline on LAMBADA suggest room for further alignment or longer context window exposure, these results indicate a highly capable and balanced foundational model ready for instruction tuning and specialized alignment pipelines.
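To put the numbers above in context, here is a minimal sketch that compares each multiple-choice score against a nominal random-guess baseline and converts the WikiText-2 word perplexity into cross-entropy. The chance levels are my assumptions based on the usual number of answer options per task (uniform guessing), not values produced by the evaluation run; TruthfulQA MC2 and LAMBADA are left out because they have no simple chance level. The helper name `margin_over_chance` is hypothetical.

```python
import math

# Scores copied from the table above, in percent.
SCORES = {
    "COPA": 64.00, "BLiMP": 59.23, "PIQA": 53.86, "WinoGrande": 50.67,
    "BoolQ": 40.21, "SWAG": 29.13, "HellaSwag": 26.71, "RACE": 23.16,
    "CommonsenseQA": 20.31, "SciQ": 35.60, "ARC-Easy": 34.68,
    "ARC-Challenge": 25.60, "OpenBookQA": 25.00, "MMLU": 23.89,
}

# ASSUMPTION: nominal uniform-guess accuracy, from the typical number of
# answer options per task (2, 4, or 5). Not reported by the evaluation itself.
CHANCE = {
    "COPA": 50.0, "BLiMP": 50.0, "PIQA": 50.0, "WinoGrande": 50.0,
    "BoolQ": 50.0, "SWAG": 25.0, "HellaSwag": 25.0, "RACE": 25.0,
    "CommonsenseQA": 20.0, "SciQ": 25.0, "ARC-Easy": 25.0,
    "ARC-Challenge": 25.0, "OpenBookQA": 25.0, "MMLU": 25.0,
}


def margin_over_chance(scores, chance):
    """Return {benchmark: score - chance} in percentage points, largest first."""
    margins = {name: scores[name] - chance[name] for name in scores}
    return dict(sorted(margins.items(), key=lambda kv: kv[1], reverse=True))


margins = margin_over_chance(SCORES, CHANCE)

# Word perplexity is exp(cross-entropy per word), so the natural log
# recovers the average cross-entropy in nats per word.
WIKITEXT2_WORD_PPL = 12524.42
nats_per_word = math.log(WIKITEXT2_WORD_PPL)

if __name__ == "__main__":
    for name, delta in margins.items():
        print(f"{name:15s} {delta:+6.2f} pp vs. assumed chance")
    print(f"WikiText-2 cross-entropy: {nats_per_word:.3f} nats/word")
```

Read this way, the margin column is a rough guide to where the checkpoint is clearly above guessing and where it sits near or below it, under the assumed baselines.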
Excellent job on this alpha release. I look forward to seeing how the Willow series develops!
Best regards,
Akshit
Hey! On the GGUF version of the model, can you add the (what's it called) .eval_results PR?
Also, for clarification, the model is actually named Forge-1V. That is an early checkpoint.
Done
Btw, can you give me a peek at the training setup?
Like the hardware you are using, the time to train the full model, etc.
Only if you can, and please don't mind my asking.
I literally forgot the time
But ~4-9 hours for CPT/pretraining (yes, I pretrained from scratch), SFT, and more CPT.
I'm still training ...-alpha, the next checkpoint (full name will be revealed later)
and A10G + L40S on Modal = hardware!!!