JULIAN-100M: French-English Language Model
JULIAN-100M is a 107-million parameter GPT-style decoder-only transformer trained from scratch on French and English Wikipedia data. This model was developed as a research project to explore modern transformer architectures and training techniques on TPU hardware.
Model Description
- Model Type: Decoder-only transformer (GPT-style)
- Parameters: 107 million
- Training Data: 4.45 billion tokens (Wikipedia EN + FR)
- Languages: English (70%) and French (30%)
- Training Hardware: Google Cloud TPU v5litepod-32
- Training Framework: JAX/Flax/Optax
- Training Duration: ~1 hour
- Final Perplexity: 11.0
Architecture
JULIAN-100M uses a modern transformer architecture with the following features:
| Component | Specification |
|---|---|
| Hidden Dimension | 640 |
| Layers | 12 |
| Attention Heads | 10 |
| FFN Dimension | 2560 |
| Vocabulary Size | 24,000 (SentencePiece) |
| Context Length | 2048 tokens |
| Normalization | RMSNorm |
| Position Encoding | RoPE (Rotary Position Embedding) |
| Activation | SwiGLU |
| Precision | bfloat16 mixed precision |
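The released configuration lives in src/model/config.py and is not reproduced here, but as a rough sketch, a dataclass matching the table might look like the following (field names are illustrative, not the actual definitions):

```python
from dataclasses import dataclass

# Illustrative only: field names are assumptions, not the actual
# definitions in src/model/config.py. Values mirror the table above.
@dataclass
class JulianConfig:
    vocab_size: int = 24_000
    hidden_dim: int = 640
    num_layers: int = 12
    num_heads: int = 10      # head dimension = 640 / 10 = 64
    ffn_dim: int = 2560
    max_seq_len: int = 2048
    dropout: float = 0.1
```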
Training Details
Training Configuration
- Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1)
- Learning Rate: Peak 3e-4 with warmup (2000 steps) + cosine decay
- Batch Size: 128 global (4 per device × 32 TPU cores)
- Sequence Length: 2048 tokens
- Gradient Clipping: 1.0
- Dropout: 0.1
- Total Steps: ~8,500 (1 epoch through 4.45B tokens)
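As a sketch, here is how the configuration above maps onto Optax; the values mirror the list, and decay_steps is assumed to span the full ~8,500-step run:

```python
import optax

# Warmup + cosine decay learning-rate schedule, as stated above
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,     # peak learning rate
    warmup_steps=2000,   # linear warmup
    decay_steps=8500,    # assumed: cosine decay over the whole run
)

# AdamW with gradient clipping at global norm 1.0
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(schedule, b1=0.9, b2=0.95, weight_decay=0.1),
)
```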
Training Data
The model was trained on cleaned Wikipedia dumps:
- English Wikipedia: ~3.5B tokens
- French Wikipedia: ~0.95B tokens
- Total: 4.45B tokens
Data preprocessing pipeline:
- Downloaded latest Wikipedia dumps
- Cleaned and filtered articles
- Removed duplicates and low-quality content
- Tokenized with SentencePiece (24K BPE vocabulary)
- Packed sequences to 2048 tokens with EOS separators (a minimal sketch follows this list)
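The packing step concatenates tokenized documents into one stream with EOS separators and slices it into fixed-length examples. A minimal sketch, assuming documents arrive as lists of token ids and EOS id 3 (the id checked in the generation code below):

```python
def pack_sequences(docs, eos_id=3, seq_len=2048):
    """Pack tokenized documents into fixed-length training examples."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # EOS separates documents in the stream
    # Slice the flat stream into seq_len-sized examples;
    # the trailing remainder (< seq_len tokens) is dropped.
    n = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n)]
```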
Training Performance
| Metric | Value |
|---|---|
| Final Loss | 2.39 |
| Final Perplexity | 11.0 |
For a rough comparison, GPT-2 Small (117M parameters) is commonly reported at perplexity ~30-40 on comparable text. Perplexities computed with different tokenizers and evaluation sets are not directly comparable, but the result suggests the architecture and training recipe are working well.
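Perplexity here is simply the exponential of the cross-entropy loss, so the two reported numbers are consistent:

```python
import math

# Perplexity = exp(cross-entropy loss): exp(2.39) ≈ 10.9,
# matching the reported final perplexity of 11.0.
print(math.exp(2.39))
```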
Usage
Installation
```bash
pip install "jax[cpu]" flax orbax-checkpoint sentencepiece
```

For GPU support, install a CUDA-enabled JAX build (for example `jax[cuda12]`) instead.
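A quick sanity check that the install worked and shows which devices JAX sees (output varies by backend):

```python
import jax

# On a CPU-only install this prints something like [CpuDevice(id=0)]
print(jax.devices())
```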
Loading the Model
```python
import os
import sys

import jax
import jax.numpy as jnp
import orbax.checkpoint as ocp
import sentencepiece as spm

# Add the model code to the Python path
sys.path.insert(0, "path/to/julian-100m")
from src.model import JULIAN_100M, create_model

# Load the tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.Load("julian-100m/tokenizer/julian_24k.model")

# Create the model and initialize it with a dummy input
model = create_model(JULIAN_100M)
rng = jax.random.PRNGKey(0)
dummy = jnp.ones((1, JULIAN_100M.max_seq_len), dtype=jnp.int32)
variables = model.init(rng, dummy)

# Restore the trained parameters from the Orbax checkpoint
checkpoint_path = os.path.abspath("julian-100m/checkpoint")
checkpointer = ocp.StandardCheckpointer()
restored = checkpointer.restore(
    checkpoint_path,
    args=ocp.args.StandardRestore({"params": variables["params"]})
)
params = restored["params"]

print("✅ Model loaded successfully!")
print(f"Parameters: {sum(p.size for p in jax.tree_util.tree_leaves(params)):,}")
```
Text Generation
```python
def generate_text(prompt: str, max_tokens: int = 100, temperature: float = 0.8):
    """Generate text from a prompt with temperature sampling."""
    # Tokenize the prompt
    input_ids = tokenizer.EncodeAsIds(prompt)
    input_ids = jnp.array([input_ids], dtype=jnp.int32)
    rng = jax.random.PRNGKey(42)

    # Autoregressive generation: one full forward pass per new token
    for _ in range(max_tokens):
        logits = model.apply({"params": params}, input_ids, deterministic=True)
        next_token_logits = logits[0, -1, :] / temperature

        # Sample the next token
        rng, sample_rng = jax.random.split(rng)
        next_token = jax.random.categorical(sample_rng, next_token_logits)
        next_token = jnp.array([[next_token]], dtype=jnp.int32)

        # Append to the context
        input_ids = jnp.concatenate([input_ids, next_token], axis=1)

        # Stop at EOS (id 3 in this tokenizer)
        if int(next_token[0, 0]) == 3:
            break

        # Stream the token to stdout as it is generated
        print(tokenizer.DecodeIds([int(next_token[0, 0])]), end='', flush=True)
    print()
    return tokenizer.DecodeIds(input_ids[0].tolist())

# Example usage
generated = generate_text("La France est un pays", max_tokens=50)
```
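The loop above samples from the full softmax at temperature 0.8 and recomputes the entire context each step (no KV cache), which is fine for short demos. One common tweak that can reduce the repetitive text noted under Limitations is top-k filtering; a minimal sketch (this helper is ours, not part of the released code):

```python
def sample_top_k(rng, logits, k=40, temperature=0.8):
    """Sample from the k highest-probability tokens only."""
    # Keep the k largest logits and remember their vocabulary indices
    topk_logits, topk_indices = jax.lax.top_k(logits, k)
    # Sample within the top-k set, then map back to a vocabulary id
    choice = jax.random.categorical(rng, topk_logits / temperature)
    return topk_indices[choice]
```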
Limitations and Biases
Current Limitations
Base Model Only: This is a pre-trained base model without instruction tuning. It generates text but doesn't follow instructions or engage in dialogue.
Limited Training Data: With only 4.45B tokens (1 epoch), the model has seen relatively little data compared to production LLMs (typically 1-10 trillion tokens).
Generation Quality: As a base model, it may produce:
- Repetitive text
- Incomplete sentences
- Factual inaccuracies
- Grammatical errors
Language Imbalance: Since roughly 70% of the training data is English, French performance may lag behind English.
Domain Limitations: Trained only on Wikipedia, the model may struggle with:
- Conversational language
- Code generation
- Creative writing
- Technical jargon outside Wikipedia's scope
Known Biases
- Wikipedia Bias: Reflects biases present in Wikipedia content
- Language Bias: Better performance on English than French
- Knowledge Cutoff: Limited to Wikipedia dumps from training time
- Demographic Biases: May reflect Wikipedia's contributor demographics
Recommended Use Cases
✅ Appropriate Uses:
- Research and educational projects
- Studying transformer architectures
- Benchmarking and comparison studies
- Understanding TPU training workflows
- Portfolio demonstrations
❌ Not Recommended:
- Production applications
- Critical decision-making systems
- Content requiring factual accuracy
- Applications requiring instruction-following
Future Work
Planned improvements for future versions:
- Instruction Tuning: Fine-tune on instruction datasets (OpenAssistant, Alpaca) to create JULIAN-100M-Instruct
- Increased Scale: Train 250M-500M parameter versions with 20-30B tokens
- Better Tokenization: Experiment with larger vocabularies (50K-100K)
- Multi-Task Learning: Incorporate code, dialogue, and other domains
- Quantization: Implement 8-bit or 4-bit inference for efficiency
Technical Details
Model Files
```
julian-100m/
├── checkpoint/              # Orbax checkpoint (480MB)
│   ├── manifest.ocdbt
│   ├── _METADATA
│   ├── _CHECKPOINT_METADATA
│   └── ocdbt.process_0/
├── src/model/               # Model source code
│   ├── config.py            # Model configurations
│   ├── julian.py            # Model architecture
│   └── hf_config.py         # HuggingFace config wrapper
├── tokenizer/
│   ├── julian_24k.model     # SentencePiece model
│   └── julian_24k.vocab     # Vocabulary file
├── config.json              # Model metadata
└── README.md                # This file
```
Training Logs
Complete training logs are available showing:
- Loss progression: 4.8 → 2.39
- Perplexity improvement: 121 → 11.0
- Token throughput: ~150k tokens/s
- Gradient norms and learning rate schedule
Citation
If you use JULIAN-100M in your research, please cite:
```bibtex
@misc{julian100m2025,
  title={JULIAN-100M: A 107M Parameter French-English Language Model},
  author={Julian Kerignard},
  year={2025},
  howpublished={\url{https://huggingface.co/juliankerignard/JULIAN-100M}}
}
```
License
This model is released under the MIT License. The training data (Wikipedia) is available under Creative Commons licenses (CC BY-SA).
Acknowledgments
This project was made possible thanks to:
Google's TPU Research Cloud (TRC) program for providing TPU v5e access. Training large language models would be prohibitively expensive without such initiatives supporting independent research and education. Special thanks to the TRC team for democratizing access to cutting-edge AI infrastructure.
Google Research for developing and open-sourcing JAX, Flax, and Optax - exceptional tools that make TPU programming accessible and enjoyable.
The Wikimedia Foundation for maintaining Wikipedia as a free, open knowledge resource.
HuggingFace for building an incredible platform that makes sharing models and datasets seamless.
The broader open-source ML community for countless tutorials, papers, and code examples that made this learning journey possible.
Note: This is a research model developed for educational purposes and portfolio demonstration. For production use cases, consider larger, instruction-tuned models like Mistral, Llama, or GPT-4.