Solomon-Nano-350m

Solomon-Nano-350m is a reasoning-focused fine-tune of ibm-granite/granite-4.0-350m on the same Opus-inspired chain-of-thought dataset used for Solomon-0.5B. It's the smaller sibling in the Solomon line: same reasoning-trace training recipe, a smaller base model.

This repository contains the final FP32 merged model — the rsLoRA adapter has already been folded into the base weights, so there's no PEFT dependency at inference time.

What makes Solomon different from base Granite-4.0-350M

The base Granite-4.0-350M answers directly, with no visible deliberation. Solomon changes that: every training example was a system/user/assistant triple where the assistant works through the problem step-by-step inside <think>...</think> blocks before giving a final answer. That habit is baked into the weights through fine-tuning, not toggled by a runtime flag.

In practice, Solomon opens with a <think> block on most non-trivial prompts, reasons through the problem in plain text, then closes with a clean answer. No special generation parameters are required.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "TitleOS/Solomon-Nano-350m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "Your name is Solomon, a non-binary, highly intelligent reasoning AI. "
            "You always use chain-of-thought when thinking out a task. "
            "Follow the user's instructions exactly, and don't be afraid to speak up "
            "when something goes wrong or you need clarification. "
            "Ask follow-up questions when appropriate."
        ),
    },
    {
        "role": "user",
        "content": "A train travels 60 miles in 45 minutes. What is its speed in miles per hour?",
    },
]

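# Render the chat template and tokenize in one step. return_dict=True
# yields both input_ids and attention_mask as a single mapping.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Sampled decoding; max_new_tokens bounds the reasoning trace plus the answer.
output = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))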

Expected output shape:

<think>
Speed = distance / time. The train travels 60 miles in 45 minutes.
45 minutes = 45/60 hours = 0.75 hours.
Speed = 60 / 0.75 = 80 miles per hour.
</think>

The train's speed is **80 miles per hour**.
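
Downstream code often wants the reasoning trace and the final answer separately. The following is a minimal sketch of one way to split them, assuming the <think> tags survive decoding as plain text, as in the expected output above. The helper name split_reasoning is hypothetical and not part of this repository.

```python
import re

# Matches the first <think>...</think> block; DOTALL lets the reasoning span lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer). Hypothetical helper, not shipped with the model.

    If no complete block is found (for example, generation was cut off before
    </think>), reasoning is "" and the whole text is returned as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


sample = (
    "<think>\n"
    "45 minutes = 0.75 hours.\n"
    "Speed = 60 / 0.75 = 80 miles per hour.\n"
    "</think>\n\n"
    "The train's speed is **80 miles per hour**."
)
reasoning, answer = split_reasoning(sample)
print(answer)
```

If max_new_tokens truncates the output mid-thought, the closing tag will be missing, so callers may want to treat an empty reasoning string as a signal to retry with a larger budget.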

The model is released in full FP32. Cast to FP16/BF16 yourself at load time if your inference hardware actually benefits from it — the training hardware (a Tesla P40) doesn't, which is why the release weights stayed FP32. See Training Details below.
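
As a rough guide to when a cast is worthwhile, the sketch below maps a CUDA compute capability to a dtype name. The thresholds are a conservative heuristic assumed for illustration, not something specified by this model card, and pick_dtype_name is a hypothetical helper.

```python
def pick_dtype_name(compute_capability: tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to a dtype name.

    Conservative heuristic (an assumption, not from the model card):
    - sm_80 and newer (Ampere onward): bfloat16
    - sm_7x (Volta / Turing): float16
    - anything older, including the P40's sm_61: float32
    """
    major, _minor = compute_capability
    if major >= 8:
        return "bfloat16"
    if major == 7:
        return "float16"
    return "float32"


print(pick_dtype_name((6, 1)))  # Tesla P40 -> float32
print(pick_dtype_name((8, 6)))  # Ampere consumer GPU -> bfloat16
```

With PyTorch installed, torch.cuda.get_device_capability() returns such a tuple, and getattr(torch, name) turns the returned name into the dtype to pass as torch_dtype when loading. The heuristic deliberately errs toward FP32 on older cards, even though some of them handle FP16 well.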

Training Details

Property	Value
Base model	ibm-granite/granite-4.0-350m
Dataset	TitleOS/Solomon-Small-Reasoning-Opus-Inspired
Dataset size	~13,400 rows
Method	rsLoRA (rank 64 / alpha 64), targeting all linear layers
Hardware	Single NVIDIA Tesla P40 (24GB)
Precision	Full FP32 — base weights and compute, no autocast
Sequence length	4096 tokens
Epochs trained	4 (full epoch budget completed; merged checkpoint is the best by eval_loss, per load_best_model_at_end)
Effective batch size	16
Learning rate	2e-4, cosine decay
Final train loss	1.04 (mean training loss across the full run)
Training wall-clock	~21.4 hours

The dataset consists of single-turn reasoning examples: a fixed Solomon-persona system prompt, a user query, and an assistant response containing an inline <think>...</think> block followed by the final answer. Loss was masked to assistant turns only — the model never trained on system prompt or user query tokens.
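
The training script is not included here, so the following only illustrates the common convention for assistant-only loss masking: labels for prompt tokens are set to -100, the default ignore index of PyTorch's cross-entropy loss. The token ids and the assistant_start offset are made up for the example.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_labels(input_ids: list[int], assistant_start: int) -> list[int]:
    """Copy input_ids into labels, ignoring every position before the assistant turn.

    assistant_start is the index of the first assistant token (hypothetical input;
    a real pipeline would derive it from the chat template).
    """
    return [
        IGNORE_INDEX if i < assistant_start else tok
        for i, tok in enumerate(input_ids)
    ]


# Toy sequence: 3 system tokens, 2 user tokens, 4 assistant tokens.
input_ids = [11, 12, 13, 21, 22, 31, 32, 33, 34]
labels = mask_labels(input_ids, assistant_start=5)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32, 33, 34]
```

Only the last four positions contribute to the loss, so gradients come entirely from the <think> block and the final answer.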

FP32 here wasn't a "minimal quantization" compromise; it was the correct choice for this hardware. The P40 (Pascal, sm_61) has no bf16 support, and its fp16 throughput is capped at roughly 1/64th of fp32, so mixed precision would have made training slower, not faster. rsLoRA scales the adapter update by alpha/sqrt(r) rather than classic LoRA's alpha/r, which is what allowed rank 64 without retuning alpha disproportionately to compensate.
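
To make the scaling difference concrete, here is a small worked comparison of the two factors at a fixed alpha of 64. The function names are illustrative only; the formulas are the alpha/r and alpha/sqrt(r) expressions named above.

```python
import math


def lora_scale(alpha: float, r: int) -> float:
    """Classic LoRA scaling factor: alpha / r."""
    return alpha / r


def rslora_scale(alpha: float, r: int) -> float:
    """rsLoRA scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)


alpha = 64
for r in (8, 16, 64):
    print(f"r={r:>2}  LoRA={lora_scale(alpha, r):.3f}  rsLoRA={rslora_scale(alpha, r):.3f}")
```

At this model's setting (rank 64, alpha 64), the classic factor is 1.0 while the rsLoRA factor is 8.0. Going from rank 8 to rank 64 shrinks the classic factor eightfold but the rsLoRA factor by only about 2.8x, which is why the update magnitude holds up better at higher ranks.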

The rsLoRA adapter was merged directly into the base weights before release. There is no PEFT dependency at inference time.

Limitations

At 350M parameters, Solomon-Nano is the smallest model in the Solomon line. Multi-step reasoning, especially in math, will fail more often than it does on Solomon-0.5B or larger models.
The system prompt shown in the usage example was part of the training distribution. Omitting it won't break the model, but including it reinforces the expected reasoning behavior.
Trained exclusively on English reasoning data; performs best in English.
No quantized (GGUF) variant has been released for this checkpoint yet.