Update README.md
README.md CHANGED
@@ -5,103 +5,105 @@ language:
tags:
- causal-lm
- scientific-language-model
- mathematics
- arxiv
- research
library_name: transformers
---

# KiteFish-A1-1.5B

**KiteFish-A1-1.5B** is a ~1.5B parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources across mathematics, computer science, and theoretical physics.

📄 **Paper:** https://arxiv.org/abs/2602.17288

This is a **base scientific language model** (not instruction-tuned).

---
## Overview

KiteFish-A1-1.5B explores what it takes to train a domain-specialized scientific language model directly from structured LaTeX archives.

**Training Scale**

- ~52B pretraining tokens
- ~5B additional post-training tokens
- ~200GB processed scientific corpus
- LLaMA-compatible tokenizer (~102k vocab)
- 2× NVIDIA A100 (80GB) GPUs
- 24 experimental training runs

The focus of this project is *scientific language modeling robustness*, not benchmark optimization.
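As a quick check of the tokenizer figure above, the snippet below loads it with `transformers` and prints the vocabulary size; the repository id is a placeholder, since the exact Hub path is not given in this card.

```python
from transformers import AutoTokenizer

# Placeholder Hub id; substitute the actual KiteFish-A1-1.5B repository path.
model_id = "KiteFish-A1-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# The card reports a LLaMA-compatible tokenizer with roughly 102k entries.
print(len(tokenizer))
```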
---

## Model Architecture

- 24 Transformer layers
- Hidden size: 2048
- FFN size: 5504
- 16 attention heads
- Context length: 4096 (trained at 768 tokens)
- Dense LLaMA-style architecture
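These bullets correspond to a standard LLaMA-style configuration. The sketch below shows roughly how they map onto a `transformers` `LlamaConfig`; the vocabulary size comes from the ~102k figure in the Overview, and anything not listed in this card (RoPE settings, norm epsilon, etc.) is left at the library default, so treat it as an illustration rather than the exact training configuration.

```python
from transformers import LlamaConfig

# Approximate configuration reconstructed from the bullets above.
# Fields not stated in the model card keep their library defaults.
config = LlamaConfig(
    vocab_size=102_000,            # ~102k LLaMA-compatible tokenizer (approximate)
    hidden_size=2048,              # hidden size
    intermediate_size=5504,        # FFN size
    num_hidden_layers=24,          # Transformer layers
    num_attention_heads=16,        # attention heads
    max_position_embeddings=4096,  # context length (training used 768-token sequences)
)
print(config)
```

With these dimensions the dense parameter count works out to roughly 1.4–1.6B depending on embedding tying, consistent with the 1.5B in the model name.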
|
**Optimization**

- AdamW
- Learning rate: 2e-4
- Warmup: 500 steps
- Weight decay: 0.1
- Gradient accumulation: 32
- bf16 mixed precision
- Gradient checkpointing enabled
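A minimal sketch of this optimization setup in plain PyTorch is shown below. The Hub id, total step count, decay shape after warmup, and the dummy batch are assumptions; none of them are specified in this card.

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model_id = "KiteFish-A1-1.5B"  # placeholder Hub id
total_steps = 10_000           # placeholder; the real step budget is not stated
grad_accum = 32                # gradient accumulation

model = AutoModelForCausalLM.from_pretrained(model_id).cuda().train()
model.gradient_checkpointing_enable()  # gradient checkpointing enabled

# AdamW with lr 2e-4, weight decay 0.1, 500 warmup steps (decay shape assumed cosine).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=500, num_training_steps=total_steps)

# Dummy 768-token batch for illustration; real training iterates a corpus dataloader.
batch = {"input_ids": torch.randint(0, model.config.vocab_size, (1, 768), device="cuda")}
batch["labels"] = batch["input_ids"].clone()

for step in range(total_steps):
    with torch.autocast("cuda", dtype=torch.bfloat16):  # bf16 mixed precision
        loss = model(**batch).loss / grad_accum
    loss.backward()
    if (step + 1) % grad_accum == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```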
|
**Validation Perplexity:** ~4.2 (held-out scientific corpus)
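The held-out corpus behind that number is not released with this card, but perplexity on arbitrary text can be estimated along the same lines; the Hub id and sample sentence below are placeholders.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KiteFish-A1-1.5B"  # placeholder Hub id
text = r"We prove that the operator $T$ is bounded on $L^2(\mathbb{R}^n)$."

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Next-token cross-entropy over the sequence; exponentiating gives perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")
```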
|
---

## Intended Use

KiteFish-A1-1.5B is suitable for:

- Scientific text modeling research
- Mathematical language modeling experiments
- Pretraining initialization for domain fine-tuning
- Tokenization and symbolic modeling research
- Studying LaTeX structure modeling

It is **not optimized for:**

- Instruction following
- Chat-based applications
- General conversational AI
- Benchmark leaderboard performance

---
|
## Performance Notes

This model was trained under moderate compute constraints and without instruction tuning or alignment stages.

Observed characteristics:

- Strong familiarity with scientific writing style
- Stable LaTeX structural modeling
- Reasonable symbolic fluency
- Limited reasoning depth
- Low downstream benchmark accuracy without fine-tuning

Performance improves significantly with supervised fine-tuning (SFT), LoRA adaptation, or domain-specific instruction tuning.
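As one example of such adaptation, a minimal LoRA setup with the `peft` library might look like the sketch below; the Hub id, target modules, and LoRA hyperparameters are illustrative assumptions rather than settings used by the authors.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model_id = "KiteFish-A1-1.5B"  # placeholder Hub id
base_model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative LoRA hyperparameters; tune rank/alpha and target modules for your task.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical attention projections in LLaMA-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```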
|
---

## Limitations

- Not instruction-tuned
- No RLHF or preference alignment
- Trained at 768-token sequence length
- Domain restricted to selected arXiv categories
- Not optimized for reasoning benchmarks
- General NLP benchmark scores may be low

This release is intended primarily for research and experimentation.

---
@@ -124,3 +126,18 @@ with torch.no_grad():

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Citation

If you use this model in your research, please cite:

```
@article{kitefish_a1_2026,
  title={KiteFish-A1: Training a Scientific Language Model from Raw LaTeX Archives},
  author={...},
  year={2026},
  eprint={2602.17288},
  archivePrefix={arXiv}
}
```