# Julian 600M - 40B Tokens

A 600M-parameter decoder-only language model trained from scratch on 39.3B tokens using JAX/Flax on Google Cloud TPUs.
## Model Description

Julian is a causal language model designed for text generation, trained on a mix of English (70%) and French (30%) data. The architecture follows modern best practices: RoPE positional embeddings, SwiGLU activations, and RMSNorm.
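To make those components concrete, here is a minimal Flax sketch of RMSNorm and a SwiGLU MLP. This is an illustration, not the released training code; the epsilon value and the bias-free projections are assumptions on my part.

```python
import jax.numpy as jnp
import flax.linen as nn


class RMSNorm(nn.Module):
    """Pre-norm RMSNorm: rescale activations by their root mean square."""
    dim: int
    eps: float = 1e-6  # assumed; not specified in the model card

    @nn.compact
    def __call__(self, x):
        scale = self.param("scale", nn.initializers.ones, (self.dim,))
        rms = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + self.eps)
        return x / rms * scale


class SwiGLU(nn.Module):
    """Gated MLP: down(silu(gate(x)) * up(x)), LLaMA-style."""
    dim: int      # 1280 for Julian 600M
    hidden: int   # 5120 for Julian 600M

    @nn.compact
    def __call__(self, x):
        gate = nn.Dense(self.hidden, use_bias=False)(x)
        up = nn.Dense(self.hidden, use_bias=False)(x)
        return nn.Dense(self.dim, use_bias=False)(nn.silu(gate) * up)
```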
### Architecture

| Component | Configuration |
|---|---|
| Parameters | 599.9M |
| Layers | 18 |
| Hidden Size | 1280 |
| Attention Heads | 16 |
| Head Dimension | 80 |
| Intermediate Size | 5120 (SwiGLU) |
| Vocabulary | 50,000 (SentencePiece) |
| Context Length | 2048 |
| Positional Encoding | RoPE (θ=10000) |
| Normalization | RMSNorm (pre-norm) |
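The parameter count follows directly from the table. A quick back-of-the-envelope check (assuming untied input/output embeddings and ignoring the small RMSNorm scales, both assumptions on my part):

```python
vocab, d_model, n_layers, d_ffn = 50_000, 1280, 18, 5120

embed   = vocab * d_model         # input embedding table
lm_head = vocab * d_model         # output projection (assumed untied)
attn    = 4 * d_model * d_model   # Q, K, V, O projections per layer
mlp     = 3 * d_model * d_ffn     # SwiGLU: gate, up, down per layer

total = embed + lm_head + n_layers * (attn + mlp)
print(f"{total / 1e6:.1f}M")      # 599.9M — matches the table
```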
## Training Details

| Metric | Value |
|---|---|
| Total Tokens | 39.32B |
| Training Steps | 300,000 |
| Batch Size | 256 (global) |
| Learning Rate | 3e-4 → 3e-5 (cosine decay) |
| Hardware | TPU v5litepod-32 |
| Framework | JAX + Flax |
| Precision | bfloat16 |
| Final Loss | 2.33 |
| Final Perplexity | 10.3 |
## Training Data

| Source | Proportion | Tokens |
|---|---|---|
| Wikipedia EN | ~25% | ~10B |
| Wikipedia FR | ~10% | ~4B |
| OSCAR (EN/FR) | ~40% | ~16B |
| The Stack (Code) | ~15% | ~6B |
| Gutenberg Books | ~10% | ~4B |
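As an illustration of how these weights could drive a sampling-based data mixture: the actual pipeline is not described here, so the source keys and the sampling scheme below are hypothetical.

```python
import random

# Hypothetical mixture weights taken from the table above.
mixture = {
    "wikipedia_en": 0.25,
    "wikipedia_fr": 0.10,
    "oscar_en_fr":  0.40,
    "the_stack":    0.15,
    "gutenberg":    0.10,
}

# Draw the source of each training document according to its weight;
# over many draws the realized token shares approach the targets.
picks = random.choices(list(mixture), weights=list(mixture.values()), k=100_000)
```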
## Benchmark Results

Evaluated using lm-evaluation-harness (0-shot).

| Benchmark | Score |
|---|---|
| HellaSwag | 53.5% |
| PIQA | 66.8% |
| LAMBADA | 37.3% |
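These scores should be reproducible through the harness's Python entry point. The invocation below is a sketch; the exact LAMBADA task variant used is not stated, so `lambada_openai` is an assumption.

```python
import lm_eval

# 0-shot evaluation via lm-evaluation-harness (v0.4+ API).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=JulianKrgd/julian-600m-40b",
    tasks=["hellaswag", "piqa", "lambada_openai"],  # LAMBADA variant assumed
    num_fewshot=0,
)
print(results["results"])
```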
## Comparison with Similar Models

| Model | Params | Tokens | HellaSwag | PIQA | Year |
|---|---|---|---|---|---|
| SmolLM2-135M | 135M | 2T | 43.3% | 67.4% | 2025 |
| Pythia-410M | 410M | 300B | 40.9% | 66.8% | 2023 |
| SmolLM2-360M | 360M | 4T | 54.5% | 71.7% | 2025 |
| Qwen2.5-0.5B | 490M | ~18T | 52.1% | 69.9% | 2024 |
| Julian 600M | 600M | 39B | 53.5% | 66.8% | 2025 |
| Qwen3-0.6B-Base | 600M | 36T | 41.1% | 70.0% | 2025 |
| Pythia-1B | 1B | 300B | 49.7% | 70.7% | 2023 |
| SmolLM2-1.7B | 1.7B | 11T | 68.7% | 76.9% | 2025 |
💡 Key insight: with only 39B training tokens (≈65 tokens per parameter), Julian 600M surpasses Qwen3-0.6B-Base on HellaSwag (53.5% vs 41.1%, same 600M parameter class but trained on 36T tokens) and comes within a point of SmolLM2-360M (54.5%, trained on 4T tokens). That is 100-900× less training data than these baselines, highlighting strong data efficiency.
Sources: SmolLM2 (HuggingFace, 2025), Qwen3 (Alibaba, 2025), Qwen2.5 (Alibaba, 2024), Pythia (EleutherAI, 2023)
## Usage

### With Transformers (after conversion)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("JulianKrgd/julian-600m-40b")
tokenizer = AutoTokenizer.from_pretrained("JulianKrgd/julian-600m-40b")

prompt = "La France est"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With JAX/Flax (native)

```python
import jax.numpy as jnp
import orbax.checkpoint as ocp

from julian_model import JulianLM, JULIAN_600M

# Instantiate the model definition and restore the trained weights.
model = JulianLM(JULIAN_600M)
checkpointer = ocp.PyTreeCheckpointer()
params = checkpointer.restore("path/to/checkpoint")

# Forward pass over a batch of token ids (standard Flax `apply` convention;
# the exact call signature depends on julian_model's implementation).
logits = model.apply({"params": params}, jnp.ones((1, 16), dtype=jnp.int32))
```
## Training Curves

### Loss Progression

| Step | Tokens | Loss | Perplexity |
|---|---|---|---|
| 0 | 0 | 10.5 | 36,316 |
| 50K | 6.5B | 3.20 | 24.5 |
| 100K | 13.1B | 2.70 | 14.9 |
| 150K | 19.6B | 2.50 | 12.2 |
| 200K | 26.2B | 2.40 | 11.0 |
| 250K | 32.8B | 2.35 | 10.5 |
| 300K | 39.3B | 2.33 | 10.3 |
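The perplexity column is simply the exponential of the cross-entropy loss, so the two columns can be checked against each other:

```python
import math

for loss in (10.5, 3.20, 2.70, 2.50, 2.40, 2.35, 2.33):
    print(f"loss {loss:.2f} -> ppl {math.exp(loss):,.1f}")
# e.g. exp(2.33) ≈ 10.3 and exp(10.5) ≈ 36,316, matching the table.
```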
### Compute Budget

| Metric | Value |
|---|---|
| TPU Type | v5litepod-32 |
| TPU Hours | ~120h |
| Total FLOPs | ~2.4e19 |
| Throughput | ~1.1M tok/s |
| Training Time | 5 days |
## Training Configuration

```yaml
learning_rate: 3e-4 → 3e-5 (cosine decay)
warmup_steps: 3000
batch_size: 256 (global)
sequence_length: 2048
weight_decay: 0.1
gradient_clipping: 1.0
precision: bfloat16
optimizer: AdamW
beta1: 0.9
beta2: 0.95
epsilon: 1e-8
```
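In optax terms (the document's JAX stack), the configuration above corresponds roughly to the following. This is a sketch of the schedule and optimizer, not the released training code; the warmup start value of 0.0 is an assumption.

```python
import optax

# Linear warmup over 3,000 steps to 3e-4, then cosine decay to 3e-5
# across the 300,000-step run.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,        # warmup start value: assumed
    peak_value=3e-4,
    warmup_steps=3_000,
    decay_steps=300_000,
    end_value=3e-5,
)

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),  # gradient_clipping: 1.0
    optax.adamw(schedule, b1=0.9, b2=0.95, eps=1e-8, weight_decay=0.1),
)
```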
## Julian Model Family

| Model | Type | Training | HellaSwag | PIQA | LAMBADA | Status |
|---|---|---|---|---|---|---|
| julian-600m-10b | Base | 10B tokens | 45.8% | 67.6% | 35.0% | ✅ Released |
| julian-600m-40b | Base | 39B tokens | 53.5% | 66.8% | 37.3% | ✅ Current |
| julian-600m-10b-instruct-v0.1 | SFT | 10B + 185K ex | 42.7% | 66.2% | 34.6% | ✅ Released |
| julian-600m-40b-sft-2.5M | SFT | 39B + 2.5M ex | ~ | ~ | ~ | 🔄 Coming |
Limitations
- Context Length: Limited to 2048 tokens
- Languages: Primarily English and French
- Benchmarks: Evaluated on HellaSwag, PIQA, LAMBADA
- Safety: Not instruction-tuned or safety-aligned
## Citation

```bibtex
@misc{julian2025,
  author    = {Julian Kerignard},
  title     = {Julian: A 600M Parameter Language Model Trained on 40B Tokens},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/JulianKrgd/julian-600m-40b}
}
```
## License

Apache 2.0
## Acknowledgments
- Google Cloud TPU Research Program for compute resources
- JAX/Flax team for the excellent ML framework
- Hugging Face for model hosting