---
license: mit
language:
  - en
tags:
  - causal-lm
  - scientific-language-model
  - arxiv
  - mathematics
  - research
library_name: transformers
---

# KiteFish-A1-1.5B

KiteFish-A1-1.5B is a ~1.5B parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics.

This model is a base scientific language model and is not instruction-tuned.


## Overview

KiteFish-A1-1.5B was trained with:

- ~52.18B pretraining tokens
- ~5B post-training tokens
- ~200 GB of processed scientific corpus
- a LLaMA-compatible tokenizer (~102k vocabulary)
- 2× NVIDIA A100 (80 GB) GPUs
- 24 experimental runs for optimization stability

The goal of this model is to explore the practical challenges of training a domain-specialized scientific language model from raw LaTeX archives.


## Intended Use

This model is intended for:

- Scientific text modeling research
- Mathematical language modeling experiments
- Pretraining initialization for domain-specific fine-tuning
- Tokenization and symbolic modeling research

This model is not optimized for:

- General conversational AI
- Instruction following
- Chat-based interaction
- Benchmark competition

## Performance Notes

This is a base model trained from scratch under moderate compute constraints.

Observed characteristics:

- Strong familiarity with scientific writing style
- Stable LaTeX structure modeling
- Limited instruction-following ability
- Limited reasoning depth compared to large instruction-tuned models
- Modest downstream benchmark accuracy without fine-tuning

Users are encouraged to apply supervised fine-tuning (SFT) or LoRA-based adaptation for improved task performance.
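As a sketch, a LoRA adaptation might start from hyperparameters along these lines. The values and target module names below are illustrative assumptions for a LLaMA-style model, not settings published for KiteFish-A1-1.5B:

```python
# Hypothetical LoRA hyperparameters for adapting a LLaMA-style model;
# the values and module names are illustrative, not from this model card.
lora_hyperparams = {
    "r": 16,                  # adapter rank
    "lora_alpha": 32,         # scaling factor
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# With the `peft` library this would translate to (not executed here):
# from peft import LoraConfig, get_peft_model
# peft_config = LoraConfig(task_type="CAUSAL_LM", **lora_hyperparams)
# model = get_peft_model(model, peft_config)
```

Smaller ranks (`r`) keep the adapter footprint low; larger ranks trade memory for capacity.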


## Training Details

### Architecture

- 24 layers
- Hidden size: 2048
- FFN size: 5504
- 16 attention heads
- Context length: 4096 (trained at 768 tokens)
- Dense LLaMA-style transformer
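These hyperparameters roughly account for the ~1.5B parameter count. A back-of-the-envelope estimate, assuming a gated SwiGLU FFN (three projection matrices, as in LLaMA) and tied input/output embeddings, and ignoring norms and biases; both assumptions are mine, not stated in this card:

```python
# Rough parameter count from the architecture above (norms/biases ignored).
# Assumes a SwiGLU FFN (gate/up/down projections) and tied embeddings.
vocab = 102_000      # ~102k vocabulary
hidden = 2048
ffn = 5504
layers = 24

embeddings = vocab * hidden                  # token embedding table
attn_per_layer = 4 * hidden * hidden         # q, k, v, o projections
ffn_per_layer = 3 * hidden * ffn             # gate, up, down projections
total = embeddings + layers * (attn_per_layer + ffn_per_layer)

print(f"{total / 1e9:.2f}B parameters")      # → 1.42B parameters
```

An untied LM head would add another ~0.2B, so either convention lands in the advertised ~1.5B range.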

### Optimization

- AdamW
- Learning rate: 2e-4
- Warmup: 500 steps
- Weight decay: 0.1
- Gradient accumulation: 32
- Gradient checkpointing enabled
- Mixed precision (bf16)
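For illustration, the warmup above (peak LR 2e-4 over 500 steps) can be sketched as a plain schedule function. The card does not say what happens after warmup, so the constant post-warmup phase below is an assumption:

```python
PEAK_LR = 2e-4
WARMUP_STEPS = 500

def learning_rate(step: int) -> float:
    """Linear warmup to the peak LR, then constant.

    The constant post-warmup phase is an assumption; the card does not
    specify the decay schedule."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR

print(learning_rate(250))  # halfway through warmup → 1e-4
```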

### Validation Perplexity

- ~4.2 on held-out scientific corpus
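Perplexity here is the exponential of the mean per-token cross-entropy on the held-out set, so ~4.2 corresponds to a mean loss of ln(4.2) ≈ 1.435 nats per token. A minimal sketch of the relationship:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(mean_nll)

# A held-out mean loss of ~1.435 nats/token matches the reported ~4.2:
print(round(perplexity(1.435), 2))  # → 4.2
```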

## Limitations

- Not instruction-tuned
- Limited reasoning capabilities
- Trained at 768-token sequence length
- Domain restricted to selected arXiv categories
- No RLHF or preference alignment
- Not benchmark-optimized

Performance on general NLP benchmarks may be low.


## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "KiteFishAI/KiteFish-A1-1.5B-Math"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Prove that the sum of two continuous functions is continuous."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```