JULIAN-100M: French-English Language Model
JULIAN-100M is a 107-million parameter GPT-style decoder-only transformer trained from scratch on French and English Wikipedia data. This model was developed as a research project to explore modern transformer architectures and training techniques on TPU hardware.
Model Description
- Model Type: Decoder-only transformer (GPT-style)
- Parameters: 107 million
- Training Data: 4.45 billion tokens (Wikipedia EN + FR)
- Languages: English (70%) and French (30%)
- Training Hardware: Google Cloud TPU v5litepod-32
- Training Framework: JAX/Flax/Optax
- Training Duration: ~1 hour
- Final Perplexity: 11.0
Architecture
JULIAN-100M uses a modern transformer architecture with the following features:
| Component | Specification |
|---|---|
| Hidden Dimension | 640 |
| Layers | 12 |
| Attention Heads | 10 |
| FFN Dimension | 2560 |
| Vocabulary Size | 24,000 (SentencePiece) |
| Context Length | 2048 tokens |
| Normalization | RMSNorm |
| Position Encoding | RoPE (Rotary Position Embedding) |
| Activation | SwiGLU |
| Precision | bfloat16 mixed precision |
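The released configuration lives in src/model/config.py and is not reproduced here, but as a rough sketch, a dataclass matching the table might look like the following (field names are illustrative, not the actual definitions):

```python
from dataclasses import dataclass

# Illustrative only: field names are assumptions, not the actual
# definitions in src/model/config.py. Values mirror the table above.
@dataclass
class JulianConfig:
    vocab_size: int = 24_000
    hidden_dim: int = 640
    num_layers: int = 12
    num_heads: int = 10      # head dimension = 640 / 10 = 64
    ffn_dim: int = 2560
    max_seq_len: int = 2048
    dropout: float = 0.1
```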
Training Details
Training Configuration
- Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1)
- Learning Rate: Peak 3e-4 with warmup (2000 steps) + cosine decay
- Batch Size: 128 global (4 per device × 32 TPU cores)
- Sequence Length: 2048 tokens
- Gradient Clipping: 1.0
- Dropout: 0.1
- Total Steps: ~8,500 (1 epoch through 4.45B tokens)
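As a sketch, here is how the configuration above maps onto Optax; the values mirror the list, and decay_steps is assumed to span the full ~8,500-step run:

```python
import optax

# Warmup + cosine decay learning-rate schedule, as stated above
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,     # peak learning rate
    warmup_steps=2000,   # linear warmup
    decay_steps=8500,    # assumed: cosine decay over the whole run
)

# AdamW with gradient clipping at global norm 1.0
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(schedule, b1=0.9, b2=0.95, weight_decay=0.1),
)
```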
Training Data
The model was trained on cleaned Wikipedia dumps:
- English Wikipedia: ~3.5B tokens
- French Wikipedia: ~0.95B tokens
- Total: 4.45B tokens
Data preprocessing pipeline:
- Downloaded latest Wikipedia dumps
- Cleaned and filtered articles
- Removed duplicates and low-quality content
- Tokenized with SentencePiece (24K BPE vocabulary)
- Packed sequences to 2048 tokens with EOS separators (a minimal sketch follows this list)
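The packing step concatenates tokenized documents into one stream with EOS separators and slices it into fixed-length examples. A minimal sketch, assuming documents arrive as lists of token ids and EOS id 3 (the id checked in the generation code below):

```python
def pack_sequences(docs, eos_id=3, seq_len=2048):
    """Pack tokenized documents into fixed-length training examples."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # EOS separates documents in the stream
    # Slice the flat stream into seq_len-sized examples;
    # the trailing remainder (< seq_len tokens) is dropped.
    n = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n)]
```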
Training Performance
| Metric | Value |
|---|---|
| Final Loss | 2.39 |
| Final Perplexity | 11.0 |
For a rough comparison, GPT-2 Small (117M parameters) is commonly reported at perplexity ~30-40 on comparable text. Perplexities computed with different tokenizers and evaluation sets are not directly comparable, but the result suggests the architecture and training recipe are working well.
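Perplexity here is simply the exponential of the cross-entropy loss, so the two reported numbers are consistent:

```python
import math

# Perplexity = exp(cross-entropy loss): exp(2.39) ≈ 10.9,
# matching the reported final perplexity of 11.0.
print(math.exp(2.39))
```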
Usage
Installation
```bash
pip install "jax[cpu]" flax orbax-checkpoint sentencepiece
```

For GPU support, install a CUDA-enabled JAX build (for example `jax[cuda12]`) instead.
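A quick sanity check that the install worked and shows which devices JAX sees (output varies by backend):

```python
import jax

# On a CPU-only install this prints something like [CpuDevice(id=0)]
print(jax.devices())
```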
Loading the Model
```python
import os
import sys

import jax
import jax.numpy as jnp
import orbax.checkpoint as ocp
import sentencepiece as spm

# Add the model code to the Python path
sys.path.insert(0, "path/to/julian-100m")
from src.model import JULIAN_100M, create_model

# Load the tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.Load("julian-100m/tokenizer/julian_24k.model")

# Create the model and initialize it with a dummy input
model = create_model(JULIAN_100M)
rng = jax.random.PRNGKey(0)
dummy = jnp.ones((1, JULIAN_100M.max_seq_len), dtype=jnp.int32)
variables = model.init(rng, dummy)

# Restore the trained parameters from the Orbax checkpoint
checkpoint_path = os.path.abspath("julian-100m/checkpoint")
checkpointer = ocp.StandardCheckpointer()
restored = checkpointer.restore(
    checkpoint_path,
    args=ocp.args.StandardRestore({"params": variables["params"]})
)
params = restored["params"]

print("✅ Model loaded successfully!")
print(f"Parameters: {sum(p.size for p in jax.tree_util.tree_leaves(params)):,}")
```
Text Generation
```python
def generate_text(prompt: str, max_tokens: int = 100, temperature: float = 0.8):
    """Generate text from a prompt with temperature sampling."""
    # Tokenize the prompt
    input_ids = tokenizer.EncodeAsIds(prompt)
    input_ids = jnp.array([input_ids], dtype=jnp.int32)
    rng = jax.random.PRNGKey(42)

    # Autoregressive generation: one full forward pass per new token
    for _ in range(max_tokens):
        logits = model.apply({"params": params}, input_ids, deterministic=True)
        next_token_logits = logits[0, -1, :] / temperature

        # Sample the next token
        rng, sample_rng = jax.random.split(rng)
        next_token = jax.random.categorical(sample_rng, next_token_logits)
        next_token = jnp.array([[next_token]], dtype=jnp.int32)

        # Append to the context
        input_ids = jnp.concatenate([input_ids, next_token], axis=1)

        # Stop at EOS (id 3 in this tokenizer)
        if int(next_token[0, 0]) == 3:
            break

        # Stream the token to stdout as it is generated
        print(tokenizer.DecodeIds([int(next_token[0, 0])]), end='', flush=True)
    print()
    return tokenizer.DecodeIds(input_ids[0].tolist())

# Example usage
generated = generate_text("La France est un pays", max_tokens=50)
```
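The loop above samples from the full softmax at temperature 0.8 and recomputes the entire context each step (no KV cache), which is fine for short demos. One common tweak that can reduce the repetitive text noted under Limitations is top-k filtering; a minimal sketch (this helper is ours, not part of the released code):

```python
def sample_top_k(rng, logits, k=40, temperature=0.8):
    """Sample from the k highest-probability tokens only."""
    # Keep the k largest logits and remember their vocabulary indices
    topk_logits, topk_indices = jax.lax.top_k(logits, k)
    # Sample within the top-k set, then map back to a vocabulary id
    choice = jax.random.categorical(rng, topk_logits / temperature)
    return topk_indices[choice]
```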
Limitations and Biases
Current Limitations
Base Model Only: This is a pre-trained base model without instruction tuning. It generates text but doesn't follow instructions or engage in dialogue.
Limited Training Data: With only 4.45B tokens (1 epoch), the model has seen relatively little data compared to production LLMs (typically 1-10 trillion tokens).
Generation Quality: As a base model, it may produce:
- Repetitive text
- Incomplete sentences
- Factual inaccuracies
- Grammatical errors
Language Imbalance: Since roughly 70% of the training data is English, French performance may lag behind English.
Domain Limitations: Trained only on Wikipedia, the model may struggle with:
- Conversational language
- Code generation
- Creative writing
- Technical jargon outside Wikipedia's scope
Known Biases
- Wikipedia Bias: Reflects biases present in Wikipedia content
- Language Bias: Better performance on English than French
- Knowledge Cutoff: Limited to Wikipedia dumps from training time
- Demographic Biases: May reflect Wikipedia's contributor demographics
Recommended Use Cases
✅ Appropriate Uses:
- Research and educational projects
- Studying transformer architectures
- Benchmarking and comparison studies
- Understanding TPU training workflows
- Portfolio demonstrations
❌ Not Recommended:
- Production applications
- Critical decision-making systems
- Content requiring factual accuracy
- Applications requiring instruction-following
Future Work
Planned improvements for future versions:
- Instruction Tuning: Fine-tune on instruction datasets (OpenAssistant, Alpaca) to create JULIAN-100M-Instruct
- Increased Scale: Train 250M-500M parameter versions with 20-30B tokens
- Better Tokenization: Experiment with larger vocabularies (50K-100K)
- Multi-Task Learning: Incorporate code, dialogue, and other domains
- Quantization: Implement 8-bit or 4-bit inference for efficiency
Technical Details
Model Files
```
julian-100m/
├── checkpoint/              # Orbax checkpoint (480MB)
│   ├── manifest.ocdbt
│   ├── _METADATA
│   ├── _CHECKPOINT_METADATA
│   └── ocdbt.process_0/
├── src/model/               # Model source code
│   ├── config.py            # Model configurations
│   ├── julian.py            # Model architecture
│   └── hf_config.py         # HuggingFace config wrapper
├── tokenizer/
│   ├── julian_24k.model     # SentencePiece model
│   └── julian_24k.vocab     # Vocabulary file
├── config.json              # Model metadata
└── README.md                # This file
```
Training Logs
Complete training logs are available showing:
- Loss progression: 4.8 → 2.39
- Perplexity improvement: 121 → 11.0
- Token throughput: ~150k tokens/s
- Gradient norms and learning rate schedule
Citation
If you use JULIAN-100M in your research, please cite:
```bibtex
@misc{julian100m2025,
  title={JULIAN-100M: A 107M Parameter French-English Language Model},
  author={Julian Kerignard},
  year={2025},
  howpublished={\url{https://huggingface.co/juliankerignard/JULIAN-100M}}
}
```
License
This model is released under the MIT License. The training data (Wikipedia) is available under Creative Commons licenses (CC BY-SA).
Acknowledgments
This project was made possible thanks to:
Google's TPU Research Cloud (TRC) program for providing TPU v5e access. Training large language models would be prohibitively expensive without such initiatives supporting independent research and education. Special thanks to the TRC team for democratizing access to cutting-edge AI infrastructure.
Google Research for developing and open-sourcing JAX, Flax, and Optax - exceptional tools that make TPU programming accessible and enjoyable.
The Wikimedia Foundation for maintaining Wikipedia as a free, open knowledge resource.
HuggingFace for building an incredible platform that makes sharing models and datasets seamless.
The broader open-source ML community for countless tutorials, papers, and code examples that made this learning journey possible.
Note: This is a research model developed for educational purposes and portfolio demonstration. For production use cases, consider larger, instruction-tuned models like Mistral, Llama, or GPT-4.