Willow Alpha

An early-stage version of Forge-1V

Small language model research by North ML.


Overview

Willow Alpha is an early-stage base model checkpoint in the Forge-1V model line.

This model is currently experimental and should be treated as a research checkpoint rather than a polished assistant model. It is useful for testing architecture, pretraining quality, tokenizer behavior, evaluation pipelines, and future SFT/RLHF improvements.


Model Details

| Field | Value |
| --- | --- |
| Model name | Willow Alpha |
| Project | Forge-1V |
| Organization | North ML |
| Model type | Causal Language Model |
| Language | English |
| License | MIT |
| Status | Early-stage / Alpha |

Evaluation Results

All benchmarks below were run in 0-shot mode.

| Benchmark | Metric | Score | Runtime |
| --- | --- | --- | --- |
| HellaSwag | acc_norm | 26.71% | 318.67s |
| PIQA | acc_norm | 53.86% | 38.85s |
| WinoGrande | acc | 50.67% | 23.73s |
| BoolQ | acc | 40.21% | 144.80s |
| ARC-Easy | acc_norm | 34.68% | 51.41s |
| ARC-Challenge | acc_norm | 25.60% | 37.69s |
| OpenBookQA | acc_norm | 25.00% | 21.14s |
| CommonsenseQA | acc | 20.31% | 27.66s |
| LAMBADA | acc | 0.23% | 96.28s |
| BLiMP | acc | 59.23% | 354.79s |
| MMLU | acc | 23.89% | 388.62s |
| WikiText-2 | word_perplexity | 12524.42 | 182.89s |
| WikiText-2 | byte_perplexity | 5.84 | 181.42s |
| SciQ | acc_norm | 35.60% | 87.15s |
| COPA | acc | 64.00% | 17.21s |
| RACE | acc | 23.16% | 334.70s |
| SWAG | acc_norm | 29.13% | 252.00s |
| TruthfulQA MC2 | acc | 48.74% | 126.29s |
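
The two WikiText-2 rows describe the same evaluation at different granularities. Assuming both figures are derived from the same total negative log-likelihood (the usual convention, though this card does not state it), word perplexity equals byte perplexity raised to the average number of bytes per word. The following is a minimal sketch of that consistency check; the function names are illustrative and not part of any Forge-1V tooling.

```python
import math


def implied_bytes_per_word(word_ppl: float, byte_ppl: float) -> float:
    """Return the bytes-per-word ratio implied by two perplexity figures.

    Assumption: word_ppl = exp(NLL / n_words) and byte_ppl = exp(NLL / n_bytes)
    share the same total negative log-likelihood NLL, so
    ln(word_ppl) / ln(byte_ppl) = n_bytes / n_words.
    """
    return math.log(word_ppl) / math.log(byte_ppl)


def bits_per_byte(byte_ppl: float) -> float:
    """Convert byte perplexity to bits per byte (log base 2)."""
    return math.log2(byte_ppl)


# Values copied from the WikiText-2 rows of the results table.
ratio = implied_bytes_per_word(12524.42, 5.84)
bpb = bits_per_byte(5.84)

print(f"implied bytes per word: {ratio:.2f}")
print(f"bits per byte: {bpb:.2f}")
```

An implied ratio of roughly five to six bytes per word is plausible for English text, which would suggest the two figures are mutually consistent and that the very large word perplexity reflects model quality rather than a reporting error.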

Evaluation Summary

| Category | Result |
| --- | --- |
| Number of completed benchmark runs | 18 |
| Successful runs | 18 |
| Failed runs | 0 |
| Best accuracy-style score | COPA: 64.00% |
| Best language-structure score | BLiMP: 59.23% |
| MMLU score | 23.89% |
| WikiText-2 byte perplexity | 5.84 |
| WikiText-2 word perplexity | 12524.42 |
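
The summary figures can be recomputed from the per-benchmark results. The sketch below copies the rows from the results table by hand and derives the run count, the best accuracy-style score, and the total evaluation runtime. The tuple layout and variable names are illustrative assumptions, not the format of the actual evaluation output.

```python
# (benchmark, metric, score, runtime_seconds), copied from the results table.
RESULTS = [
    ("HellaSwag", "acc_norm", 26.71, 318.67),
    ("PIQA", "acc_norm", 53.86, 38.85),
    ("WinoGrande", "acc", 50.67, 23.73),
    ("BoolQ", "acc", 40.21, 144.80),
    ("ARC-Easy", "acc_norm", 34.68, 51.41),
    ("ARC-Challenge", "acc_norm", 25.60, 37.69),
    ("OpenBookQA", "acc_norm", 25.00, 21.14),
    ("CommonsenseQA", "acc", 20.31, 27.66),
    ("LAMBADA", "acc", 0.23, 96.28),
    ("BLiMP", "acc", 59.23, 354.79),
    ("MMLU", "acc", 23.89, 388.62),
    ("WikiText-2", "word_perplexity", 12524.42, 182.89),
    ("WikiText-2", "byte_perplexity", 5.84, 181.42),
    ("SciQ", "acc_norm", 35.60, 87.15),
    ("COPA", "acc", 64.00, 17.21),
    ("RACE", "acc", 23.16, 334.70),
    ("SWAG", "acc_norm", 29.13, 252.00),
    ("TruthfulQA MC2", "acc", 48.74, 126.29),
]

# Perplexities are not percentages, so exclude them when ranking accuracy.
PERPLEXITY_METRICS = {"word_perplexity", "byte_perplexity"}
accuracy_rows = [row for row in RESULTS if row[1] not in PERPLEXITY_METRICS]

best = max(accuracy_rows, key=lambda row: row[2])
total_runtime = sum(row[3] for row in RESULTS)

print(f"completed runs: {len(RESULTS)}")
print(f"best accuracy-style score: {best[0]} ({best[2]:.2f}%)")
print(f"total runtime: {total_runtime:.2f}s")
```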

Notes

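Willow Alpha is still at a very early stage. Several results are unstable or sit at or near chance level (for example, MMLU at 23.89% on four-choice questions and CommonsenseQA at 20.31% on five-choice questions), especially on knowledge-heavy and long-context tasks.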

The strongest early signals are:

  • COPA: 64.00%
  • BLiMP: 59.23%
  • PIQA: 53.86%
  • WinoGrande: 50.67%
  • TruthfulQA MC2: 48.74%

The weakest areas are:

  • LAMBADA
  • WikiText-2 word perplexity
  • CommonsenseQA
  • MMLU
  • RACE

These results suggest the model has some early reasoning and grammar signal, but still needs substantially more pretraining, higher-quality data, and post-training before being useful as a general assistant.
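
To put the accuracy scores in context, each multiple-choice result can be compared against uniform random guessing. The sketch below assumes the usual number of answer options for each benchmark (for example, four for MMLU and two for COPA). These counts are not stated in this card, and a few benchmarks contain items with a different number of options, so the baselines are approximate. LAMBADA and TruthfulQA MC2 are left out because they are not scored as a single choice among a fixed set of options.

```python
# benchmark -> (score in %, assumed number of answer options).
# Scores come from the results table; option counts are assumptions based on
# each benchmark's usual multiple-choice format, not values from this card.
SCORES = {
    "HellaSwag": (26.71, 4),
    "PIQA": (53.86, 2),
    "WinoGrande": (50.67, 2),
    "BoolQ": (40.21, 2),
    "ARC-Easy": (34.68, 4),
    "ARC-Challenge": (25.60, 4),
    "OpenBookQA": (25.00, 4),
    "CommonsenseQA": (20.31, 5),
    "BLiMP": (59.23, 2),
    "MMLU": (23.89, 4),
    "SciQ": (35.60, 4),
    "COPA": (64.00, 2),
    "RACE": (23.16, 4),
    "SWAG": (29.13, 4),
}


def margin_over_chance(score_pct: float, n_options: int) -> float:
    """Percentage points above (or below) uniform random guessing."""
    return score_pct - 100.0 / n_options


margins = {
    name: margin_over_chance(score, n_options)
    for name, (score, n_options) in SCORES.items()
}

# Print benchmarks from largest to smallest margin over chance.
for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {margin:+6.2f} points vs. chance")
```

Under these assumptions, COPA, SciQ, ARC-Easy, and BLiMP clear chance by roughly nine points or more, while WinoGrande, ARC-Challenge, OpenBookQA, CommonsenseQA, MMLU, and RACE fall within about two points of it, and BoolQ lands below the 50% baseline.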


Intended Use

Willow Alpha is intended for:

  • Research
  • Benchmarking
  • Pretraining experiments
  • Fine-tuning experiments
  • Small language model development
  • Forge-1V pipeline testing

It is not yet recommended for production use.


Limitations

This model may:

  • Produce incorrect information
  • Fail basic reasoning tasks
  • Struggle with factual knowledge
  • Generate repetitive or low-quality text
  • Perform poorly on long-context tasks
  • Require additional supervised fine-tuning

Citation

@misc{willow-alpha,
  title = {Willow Alpha},
  author = {North ML},
  year = {2026},
  note = {Early-stage Forge-1V checkpoint}
}

Format: GGUF
Model size: 0.3B params
Architecture: llama
Quantizations: 4-bit, 16-bit
