---
license: mit
language:
- en
tags:
- causal-lm
- scientific-language-model
- arxiv
- mathematics
- research
library_name: transformers
---
# KiteFish-A1-1.5B
**KiteFish-A1-1.5B** is a ~1.5B-parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics.
This model is a base scientific language model and is not instruction-tuned.
## Overview
KiteFish-A1-1.5B was trained with:
- ~52.18B pretraining tokens
- ~5B post-training tokens
- ~200GB of processed scientific corpus
- A LLaMA-compatible tokenizer (~102k vocabulary)
- 2× NVIDIA A100 (80GB) GPUs
- 24 experimental runs for optimization stability
The goal of this model is to explore the practical challenges of training a domain-specialized scientific language model from raw LaTeX archives.
## Intended Use
This model is intended for:
- Scientific text modeling research
- Mathematical language modeling experiments
- Pretraining initialization for domain-specific fine-tuning
- Tokenization and symbolic modeling research
This model is not optimized for:
- General conversational AI
- Instruction following
- Chat-based interaction
- Benchmark competition
## Performance Notes
This is a base model trained from scratch under moderate compute constraints.
Observed characteristics:
- Strong familiarity with scientific writing style
- Stable LaTeX structure modeling
- Limited instruction-following ability
- Limited reasoning depth compared to large instruction-tuned models
- Modest downstream benchmark accuracy without fine-tuning
Users are encouraged to apply supervised fine-tuning (SFT) or LoRA-based adaptation for improved task performance.
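As a sketch of what such an adaptation could look like, the fragment below defines a LoRA configuration with the PEFT library. The rank, alpha, dropout, and target-module choices are illustrative defaults for LLaMA-style models, not settings used or validated for this model.

```python
from peft import LoraConfig

# Hypothetical LoRA settings -- tune rank, alpha, and dropout for your task.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Standard LLaMA-style attention projection names (an assumption here).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

The adapter would then be attached with `peft.get_peft_model(model, lora_config)` before running an ordinary supervised fine-tuning loop.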
## Training Details
### Architecture
- 24 layers
- Hidden size: 2048
- FFN size: 5504
- 16 attention heads
- Context length: 4096 (trained at 768 tokens)
- Dense LLaMA-style transformer
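These dimensions can be sanity-checked with a rough parameter count. The arithmetic below assumes a SwiGLU feed-forward block (three weight matrices) and tied input/output embeddings, both typical of LLaMA-style models but not stated on this card; untied embeddings would add roughly another 0.2B parameters.

```python
# Rough parameter-count estimate from the dimensions listed above.
vocab = 102_000   # ~102k tokenizer vocabulary
hidden = 2048     # hidden size
ffn = 5504        # feed-forward size
layers = 24       # transformer layers

embed = vocab * hidden                 # token embeddings (assumed tied with lm_head)
attn_per_layer = 4 * hidden * hidden   # Q, K, V, O projections
ffn_per_layer = 3 * hidden * ffn       # gate, up, down (assumed SwiGLU)
total = embed + layers * (attn_per_layer + ffn_per_layer)

print(f"~{total / 1e9:.2f}B parameters")  # ~1.42B, consistent with the ~1.5B label
```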
### Optimization
- AdamW
- Learning rate: 2e-4
- Warmup: 500 steps
- Weight decay: 0.1
- Gradient accumulation: 32
- Gradient checkpointing enabled
- Mixed precision (bf16)
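A minimal sketch of a schedule matching these hyperparameters: linear warmup over 500 steps to the peak learning rate of 2e-4. The cosine decay and the total step count are assumptions for illustration; the card only specifies the warmup length and peak rate.

```python
import math

PEAK_LR = 2e-4        # peak learning rate from the card
WARMUP_STEPS = 500    # warmup length from the card
TOTAL_STEPS = 100_000 # hypothetical total step count

def lr_at(step: int) -> float:
    """Linear warmup followed by cosine decay to zero (decay shape assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))

print(lr_at(250), lr_at(500), lr_at(100_000))
```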
### Validation Perplexity
- ~4.2 on a held-out scientific corpus
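Since perplexity is the exponential of the mean per-token negative log-likelihood, this figure corresponds to a mean cross-entropy loss of roughly ln(4.2) ≈ 1.44 nats per token:

```python
import math

# Perplexity = exp(mean NLL per token); invert to recover the loss.
ppl = 4.2
loss = math.log(ppl)
print(f"mean cross-entropy: {loss:.2f} nats/token")  # ~1.44
```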
## Limitations
- Not instruction-tuned
- Limited reasoning capabilities
- Trained at 768-token sequence length
- Domain restricted to selected arXiv categories
- No RLHF or preference alignment
- Not benchmark-optimized
Performance on general NLP benchmarks may be low.
## Example Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "KiteFishAI/KiteFish-A1-1.5B-Math"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Prove that the sum of two continuous functions is continuous."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```