Evaluation Summary: willow-alpha-base

by GODELEV - opened Jun 8

Jun 8

MEMORANDUM

TO: North-ML1 Team
DATE: June 9, 2026
SUBJECT: Evaluation Summary: willow-alpha-base

I have run a comprehensive evaluation suite on North-ML1/willow-alpha-base. Having looked closely at the data, I want to thank your team for putting together a solid foundation with this base architecture.

The evaluation metrics demonstrate stable training and highly encouraging capabilities across standard language modeling, grammar, and reasoning benchmarks, especially considering its role as a base checkpoint.

Benchmark Evaluation Metrics

Category	Benchmark	Metric	Score / Value	Status
Linguistics & Grammar	COPA	Accuracy	64.00%	Success
	BLiMP	Accuracy	59.23%	Success
Commonsense & Reasoning	PIQA	Normalized Accuracy	53.86%	Success
	WinoGrande	Accuracy	50.67%	Success
	TruthfulQA MC2	Accuracy	48.74%	Success
	BoolQ	Accuracy	40.21%	Success
	SWAG	Normalized Accuracy	29.13%	Success
	HellaSwag	Normalized Accuracy	26.71%	Success
	RACE	Accuracy	23.16%	Success
	CommonsenseQA	Accuracy	20.31%	Success
Academic & Knowledge	SciQ	Normalized Accuracy	35.60%	Success
	ARC-Easy	Normalized Accuracy	34.68%	Success
	ARC-Challenge	Normalized Accuracy	25.60%	Success
	OpenBookQA	Normalized Accuracy	25.00%	Success
	MMLU	Accuracy	23.89%	Success
Language Modeling	LAMBADA	Accuracy	0.23%	Success
	WikiText-2	Word Perplexity	12524.42	Success
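When reading the multiple-choice rows, it can help to set each score against random-guess accuracy. The snippet below is a minimal sketch of that comparison for a subset of the table. The scores are copied from the table, while the number of answer options per task is an assumption based on the usual format of each benchmark, not something stated in this evaluation.

```python
# Scores copied from the table above (accuracy in percent).
scores = {
    "COPA": 64.00,
    "PIQA": 53.86,
    "WinoGrande": 50.67,
    "BoolQ": 40.21,
    "HellaSwag": 26.71,
    "CommonsenseQA": 20.31,
    "MMLU": 23.89,
}

# Assumption: answer options per question in the usual task format.
num_choices = {
    "COPA": 2,
    "PIQA": 2,
    "WinoGrande": 2,
    "BoolQ": 2,
    "HellaSwag": 4,
    "CommonsenseQA": 5,
    "MMLU": 4,
}


def margin_over_chance(score_pct: float, n_choices: int) -> float:
    """Return the score minus uniform random-guess accuracy, in percentage points."""
    chance_pct = 100.0 / n_choices
    return round(score_pct - chance_pct, 2)


margins = {
    name: margin_over_chance(scores[name], num_choices[name]) for name in scores
}

for name, margin in margins.items():
    print(f"{name}: {margin:+.2f} points vs. chance")
```

A positive margin means the score is above uniform guessing and a negative one means it is below. TruthfulQA MC2 and the language-modeling rows are left out because they have no simple chance level.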

Final Evaluation Takeaway

Your model's strongest results are on COPA (64.00%), a causal-reasoning task scored by choosing the more plausible of two alternatives, alongside BLiMP (59.23%) for linguistic acceptability and PIQA (53.86%) for physical commonsense.

While the high word perplexity on WikiText-2 (12524.42) and the low accuracy on LAMBADA (0.23%) suggest room for further alignment or longer context window exposure, these results indicate a highly capable and balanced foundational model ready for instruction tuning and specialized alignment pipelines.
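For context on the WikiText-2 figure, word perplexity is commonly defined as the exponential of the total negative log-likelihood divided by the number of words in the text. The following is a minimal sketch of that formula, not the exact code used for this evaluation. The function name and the toy inputs are hypothetical.

```python
import math


def word_perplexity(token_logprobs: list[float], num_words: int) -> float:
    """Compute word-level perplexity from per-token natural-log probabilities.

    The total negative log-likelihood is normalized by the number of words,
    not the number of tokens, so a word split into several tokens
    contributes all of its token log-probabilities to a single word.
    """
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    total_nll = -sum(token_logprobs)
    return math.exp(total_nll / num_words)


# Toy example (hypothetical numbers): four tokens, each assigned probability 0.5.
toy_logprobs = [math.log(0.5)] * 4

# One token per word: perplexity equals the inverse per-token probability.
ppl_one_token_per_word = word_perplexity(toy_logprobs, num_words=4)

# Two tokens per word: the same likelihood is spread over fewer words.
ppl_two_tokens_per_word = word_perplexity(toy_logprobs, num_words=2)
```

Under this definition, lower values mean the model assigns higher probability to the reference text.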

Excellent job on this alpha release. I look forward to seeing how the Willow series develops!

Best regards,
Akshit

arthu1

North ML org Jun 9

Hey! On the GGUF version of the model, can you add the (what's it called) .eval_results PR?
Also, for clarification, the model is actually named Forge-1V. That is an early checkpoint.

GODELEV

Jun 9

Done 👍
Btw, can you give me a peek at the training setup?
Like the hardware you're using, the time to train the full model, etc.
Only if you can, and I hope you don't mind my asking.

arthu1

North ML org Jun 9

I literally forgot the time 😭
But ~4-9 hours for CPT/pretraining (yes I pretrained from scratch), SFT and more CPT.
I'm still training ...-alpha, the next checkpoint (full name will be revealed later)

arthu1

North ML org Jun 9

and A10G + L40S on Modal = hardware!!!
