Julian 600M - 40B Tokens

A 600M parameter decoder-only language model trained from scratch on 39.3B tokens using JAX/Flax on Google Cloud TPUs.

Model Description

Julian is a causal language model designed for text generation, trained on a mix of English (70%) and French (30%) data. The architecture follows modern best practices with RoPE positional embeddings, SwiGLU activations, and RMSNorm.

Architecture

| Component | Configuration |
|---|---|
| Parameters | 599.9M |
| Layers | 18 |
| Hidden Size | 1280 |
| Attention Heads | 16 |
| Head Dimension | 80 |
| Intermediate Size | 5120 (SwiGLU) |
| Vocabulary | 50,000 (SentencePiece) |
| Context Length | 2048 |
| Positional Encoding | RoPE (θ = 10000) |
| Normalization | RMSNorm (pre-norm) |
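
To make the table concrete, here is a minimal, purely illustrative Flax sketch of the RMSNorm and SwiGLU feed-forward pieces with the dimensions above. This is not the actual `julian_model` implementation; module and parameter names are assumptions.

```python
# Illustrative sketch only - not the julian_model source. Shows how RMSNorm and
# a SwiGLU feed-forward block look with hidden size 1280 and intermediate size 5120.
import jax.numpy as jnp
import flax.linen as nn

HIDDEN = 1280          # Hidden Size from the table
INTERMEDIATE = 5120    # Intermediate Size (SwiGLU)

class RMSNorm(nn.Module):
    eps: float = 1e-6

    @nn.compact
    def __call__(self, x):
        # Scale by the root-mean-square of the activations; no mean subtraction.
        scale = self.param("scale", nn.initializers.ones, (x.shape[-1],))
        rms = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + self.eps)
        return x / rms * scale

class SwiGLU(nn.Module):
    @nn.compact
    def __call__(self, x):
        # Gated feed-forward: silu(gate) * up, then project back to the hidden size.
        gate = nn.Dense(INTERMEDIATE, use_bias=False)(x)
        up = nn.Dense(INTERMEDIATE, use_bias=False)(x)
        return nn.Dense(HIDDEN, use_bias=False)(nn.silu(gate) * up)
```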

Training Details

| Metric | Value |
|---|---|
| Total Tokens | 39.32B |
| Training Steps | 300,000 |
| Batch Size | 256 (global) |
| Learning Rate | 3e-4 → 3e-5 (cosine decay) |
| Hardware | TPU v5litepod-32 |
| Framework | JAX + Flax |
| Precision | bfloat16 |
| Final Loss | 2.33 |
| Final Perplexity | 10.3 |
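
For context on the hardware and batch-size rows: in JAX, the global batch is typically sharded across the 32 chips of a v5litepod-32 via a one-axis device mesh. The snippet below is a generic data-parallel sketch, not the training script's actual partitioning.

```python
# Illustrative data-parallel layout over a v5litepod-32; the real training
# script's mesh and partitioning are not documented here.
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_device_mesh((jax.device_count(),))  # 32 chips on a v5e-32 pod
mesh = Mesh(devices, axis_names=("data",))
batch_sharding = NamedSharding(mesh, P("data"))  # split the global batch of 256 across chips
```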

Training Data

| Source | Proportion | Tokens |
|---|---|---|
| Wikipedia EN | ~25% | ~10B |
| Wikipedia FR | ~10% | ~4B |
| OSCAR (EN/FR) | ~40% | ~16B |
| The Stack (Code) | ~15% | ~6B |
| Gutenberg Books | ~10% | ~4B |
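
To make the mixture concrete, here is a minimal, purely illustrative sampler that picks the source of each training document according to these proportions. The actual data pipeline is not published here, and the source keys are hypothetical.

```python
# Hypothetical mixture sampler matching the proportions in the table above.
import numpy as np

MIXTURE = {
    "wikipedia_en": 0.25,
    "wikipedia_fr": 0.10,
    "oscar_en_fr": 0.40,
    "the_stack": 0.15,
    "gutenberg": 0.10,
}

rng = np.random.default_rng(0)
sources, weights = zip(*MIXTURE.items())

def next_source():
    # Draw the source of the next training document in proportion to the mixture.
    return rng.choice(sources, p=weights)

print([next_source() for _ in range(5)])
```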

Benchmark Results

Evaluated using lm-evaluation-harness (0-shot).

| Benchmark | Score |
|---|---|
| HellaSwag | 53.5% |
| PIQA | 66.8% |
| LAMBADA | 37.3% |
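
These scores should be reproducible with the harness's Python API along the lines of the sketch below, assuming the converted Transformers checkpoint; exact task names can vary between harness versions.

```python
# Sketch of re-running the 0-shot evaluation with lm-evaluation-harness (>= 0.4).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=JulianKrgd/julian-600m-40b",
    tasks=["hellaswag", "piqa", "lambada_openai"],
    num_fewshot=0,
)
print(results["results"])
```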

Comparison with Similar Models

| Model | Params | Tokens | HellaSwag | PIQA | Year |
|---|---|---|---|---|---|
| SmolLM2-135M | 135M | 2T | 43.3% | 67.4% | 2025 |
| Pythia-410M | 410M | 300B | 40.9% | 66.8% | 2023 |
| SmolLM2-360M | 360M | 4T | 54.5% | 71.7% | 2025 |
| Qwen2.5-0.5B | 490M | ~18T | 52.1% | 69.9% | 2024 |
| Julian 600M | 600M | 39B | 53.5% | 66.8% | 2025 |
| Qwen3-0.6B-Base | 600M | 36T | 41.1% | 70.0% | 2025 |
| Pythia-1B | 1B | 300B | 49.7% | 70.7% | 2023 |
| SmolLM2-1.7B | 1.7B | 11T | 68.7% | 76.9% | 2025 |

💡 Key insight: With only 39B training tokens (a parameter-to-token ratio of roughly 1:65), Julian 600M surpasses Qwen3-0.6B-Base on HellaSwag (41.1%, trained on 36T tokens at the same 600M parameter scale) and comes within a point of SmolLM2-360M (54.5%, trained on 4T tokens). Julian reaches these scores with roughly 100-900× less training data, highlighting its data efficiency.

Sources: SmolLM2 (HuggingFace, 2025), Qwen3 (Alibaba, 2025), Qwen2.5 (Alibaba, 2024), Pythia (EleutherAI, 2023)

Usage

With Transformers (after conversion)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("JulianKrgd/julian-600m-40b")
tokenizer = AutoTokenizer.from_pretrained("JulianKrgd/julian-600m-40b")

prompt = "La France est"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With JAX/Flax (native)

```python
import jax
import orbax.checkpoint as ocp
from julian_model import JulianLM, JULIAN_600M

# Load the model definition and restore the trained parameters
model = JulianLM(JULIAN_600M)
checkpointer = ocp.PyTreeCheckpointer()
params = checkpointer.restore("path/to/checkpoint")

# Generate: see src/inference/generate.py for a full example
```
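
As a rough sketch of what native generation involves, the loop below performs greedy decoding under the assumption that `JulianLM.apply` returns logits of shape (batch, seq, vocab). The call signature is an assumption, so treat this as illustrative and defer to `src/inference/generate.py`.

```python
# Hypothetical greedy decoding loop; the model's actual apply() signature may differ.
import jax.numpy as jnp

def greedy_generate(model, params, prompt_ids, max_new_tokens=50):
    tokens = jnp.asarray(prompt_ids)[None, :]                # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model.apply({"params": params}, tokens)     # assumed: (1, seq, vocab)
        next_id = jnp.argmax(logits[:, -1, :], axis=-1)      # greedy pick of the next token
        tokens = jnp.concatenate([tokens, next_id[:, None]], axis=-1)
    return tokens[0]
```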

Training Curves

Loss Progression

| Step | Tokens | Loss | Perplexity |
|---|---|---|---|
| 0 | 0 | 10.5 | 36,316 |
| 50K | 6.5B | 3.20 | 24.5 |
| 100K | 13.1B | 2.70 | 14.9 |
| 150K | 19.6B | 2.50 | 12.2 |
| 200K | 26.2B | 2.40 | 11.0 |
| 250K | 32.8B | 2.35 | 10.5 |
| 300K | 39.3B | 2.33 | 10.3 |
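
The perplexity column is simply the exponential of the cross-entropy loss, which is easy to verify:

```python
import math

# Perplexity = exp(loss); e.g. the final checkpoint with loss 2.33:
print(round(math.exp(2.33), 1))  # 10.3
```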

Compute Budget

| Metric | Value |
|---|---|
| TPU Type | v5litepod-32 |
| TPU Hours | ~120h |
| Total FLOPS | ~2.4e19 |
| Throughput | ~1.1M tok/s |
| Training Time | 5 days |

Training Configuration

```yaml
# Hyperparameters
learning_rate: 3e-4 → 3e-5 (cosine decay)
warmup_steps: 3000
batch_size: 256 (global)
sequence_length: 2048
weight_decay: 0.1
gradient_clipping: 1.0
precision: bfloat16

# Optimizer
optimizer: AdamW
beta1: 0.9
beta2: 0.95
epsilon: 1e-8
```
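
This optimizer configuration maps directly onto optax. The sketch below uses the values listed above; the wiring itself is an assumption rather than the actual training script.

```python
# Minimal optax sketch of the schedule and optimizer described in the config.
import optax

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,       # learning_rate peak
    warmup_steps=3_000,    # warmup_steps
    decay_steps=300_000,   # total training steps
    end_value=3e-5,        # final learning rate
)

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),   # gradient_clipping: 1.0
    optax.adamw(
        learning_rate=schedule,
        b1=0.9,
        b2=0.95,
        eps=1e-8,
        weight_decay=0.1,
    ),
)
```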

Julian Model Family

| Model | Type | Training | HellaSwag | PIQA | LAMBADA | Status |
|---|---|---|---|---|---|---|
| julian-600m-10b | Base | 10B tokens | 45.8% | 67.6% | 35.0% | ✅ Released |
| julian-600m-40b | Base | 39B tokens | 53.5% | 66.8% | 37.3% | ✅ Current |
| julian-600m-10b-instruct-v0.1 | SFT | 10B tokens + 185K examples | 42.7% | 66.2% | 34.6% | ✅ Released |
| julian-600m-40b-sft-2.5M | SFT | 39B tokens + 2.5M examples | ~ | ~ | ~ | 🔄 Coming |

Limitations

  • Context Length: Limited to 2048 tokens
  • Languages: Primarily English and French
  • Evaluation: Benchmarked only on HellaSwag, PIQA, and LAMBADA; other capabilities are untested
  • Safety: Not instruction-tuned or safety-aligned

Citation

```bibtex
@misc{julian2025,
  author = {Julian Kerignard},
  title = {Julian: A 600M Parameter Language Model Trained on 40B Tokens},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/JulianKrgd/julian-600m-40b}
}
```

License

Apache 2.0

Acknowledgments

  • Google Cloud TPU Research Program for compute resources
  • JAX/Flax team for the excellent ML framework
  • Hugging Face for model hosting