JULIAN-100M: French-English Language Model

JULIAN-100M is a 107-million parameter GPT-style decoder-only transformer trained from scratch on French and English Wikipedia data. This model was developed as a research project to explore modern transformer architectures and training techniques on TPU hardware.

Model Description

  • Model Type: Decoder-only transformer (GPT-style)
  • Parameters: 107 million
  • Training Data: 4.45 billion tokens (Wikipedia EN + FR)
  • Languages: English (70%) and French (30%)
  • Training Hardware: Google Cloud TPU v5litepod-32
  • Training Framework: JAX/Flax/Optax
  • Training Duration: ~1 hour
  • Final Perplexity: 11.0

Architecture

JULIAN-100M uses a modern transformer architecture with the following features:

  • Hidden Dimension: 640
  • Layers: 12
  • Attention Heads: 10
  • FFN Dimension: 2560
  • Vocabulary Size: 24,000 (SentencePiece)
  • Context Length: 2048 tokens
  • Normalization: RMSNorm
  • Position Encoding: RoPE (Rotary Position Embedding)
  • Activation: SwiGLU
  • Precision: bfloat16 mixed precision
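
As a concrete illustration of how two of these pieces fit together, here is a minimal Flax sketch of an RMSNorm + SwiGLU feed-forward block using the dimensions above. The module name and layout are assumptions for exposition, not the actual JULIAN-100M source:

import flax.linen as nn

class SwiGLUBlock(nn.Module):
    hidden_dim: int = 640   # model width from the list above
    ffn_dim: int = 2560     # FFN width from the list above

    @nn.compact
    def __call__(self, x):
        # RMSNorm rescales by the reciprocal root-mean-square of the
        # activations; unlike LayerNorm it does not subtract the mean.
        x = nn.RMSNorm()(x)
        # SwiGLU: a SiLU-gated branch multiplied elementwise with a
        # linear branch, then projected back down to the model width.
        gate = nn.silu(nn.Dense(self.ffn_dim, use_bias=False)(x))
        up = nn.Dense(self.ffn_dim, use_bias=False)(x)
        return nn.Dense(self.hidden_dim, use_bias=False)(gate * up)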

Training Details

Training Configuration

  • Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1)
  • Learning Rate: Peak 3e-4 with warmup (2000 steps) + cosine decay
  • Batch Size: 128 global (4 per device × 32 TPU cores)
  • Sequence Length: 2048 tokens
  • Gradient Clipping: 1.0
  • Dropout: 0.1
  • Total Steps: ~8,500 (1 epoch through 4.45B tokens)
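
Assuming a standard Optax setup, the configuration above might be wired roughly as follows (a sketch, not the project's actual training code):

import optax

# Linear warmup to the peak learning rate, then cosine decay to zero.
lr_schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,
    warmup_steps=2000,
    decay_steps=8500,   # total schedule length, including warmup
)

# Global-norm gradient clipping chained in front of AdamW.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(lr_schedule, b1=0.9, b2=0.95, weight_decay=0.1),
)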

Training Data

The model was trained on cleaned Wikipedia dumps:

  • English Wikipedia: ~3.5B tokens
  • French Wikipedia: ~0.95B tokens
  • Total: 4.45B tokens

Data preprocessing pipeline:

  1. Downloaded latest Wikipedia dumps
  2. Cleaned and filtered articles
  3. Removed duplicates and low-quality content
  4. Tokenized with SentencePiece (24K BPE vocabulary)
  5. Packed sequences to 2048 tokens with EOS separators
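
Step 5 above can be sketched as follows; the helper and its EOS handling are an illustrative assumption of how packing typically works, not the project's actual pipeline:

import numpy as np

def pack_sequences(token_lists, eos_id, seq_len=2048):
    """Concatenate tokenized articles with EOS separators, then slice
    the stream into fixed-length training blocks."""
    stream = []
    for ids in token_lists:
        stream.extend(ids)
        stream.append(eos_id)          # marks the article boundary
    n_full = len(stream) // seq_len    # the trailing partial block is dropped
    return np.asarray(stream[:n_full * seq_len]).reshape(n_full, seq_len)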

Training Performance

  • Final Loss: 2.39
  • Final Perplexity: 11.0

For comparison, GPT-2 Small (117M parameters) is commonly reported at a perplexity of roughly 30-40 on similar data. Perplexities measured with different tokenizers and vocabularies are not directly comparable, but the result is a reasonable sanity check that JULIAN-100M's architecture and training setup are effective.
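
Perplexity is the exponential of the mean cross-entropy loss, so the two reported values agree:

import math
print(math.exp(2.39))   # ≈ 10.9, matching the reported perplexity of ~11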

Usage

Installation

pip install "jax[cpu]" flax orbax-checkpoint sentencepiece

For GPU support, install "jax[cuda]" instead.
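
To confirm the installation, you can check which devices JAX sees:

python -c "import jax; print(jax.devices())"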

Loading the Model

import jax
import jax.numpy as jnp
import sentencepiece as spm
import orbax.checkpoint as ocp
import os

# Add model code to path
import sys
sys.path.insert(0, "path/to/julian-100m")

from src.model import JULIAN_100M, create_model

# Load tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.Load("julian-100m/tokenizer/julian_24k.model")

# Create model
model = create_model(JULIAN_100M)

# Initialize with dummy input
rng = jax.random.PRNGKey(0)
dummy = jnp.ones((1, JULIAN_100M.max_seq_len), dtype=jnp.int32)
variables = model.init(rng, dummy)

# Load checkpoint
checkpoint_path = os.path.abspath("julian-100m/checkpoint")
checkpointer = ocp.StandardCheckpointer()
restored = checkpointer.restore(
    checkpoint_path,
    args=ocp.args.StandardRestore({"params": variables["params"]})
)
params = restored["params"]

print("βœ… Model loaded successfully!")
print(f"Parameters: {sum(p.size for p in jax.tree_util.tree_leaves(params)):,}")

Text Generation

def generate_text(prompt: str, max_tokens: int = 100, temperature: float = 0.8):
    """Generate text from a prompt."""

    # Tokenize
    input_ids = tokenizer.EncodeAsIds(prompt)
    input_ids = jnp.array([input_ids], dtype=jnp.int32)

    rng = jax.random.PRNGKey(42)

    # Autoregressive generation. For simplicity, each step recomputes the
    # full forward pass over the growing context (no KV cache).
    for _ in range(max_tokens):
        # Forward pass: logits has shape (batch, seq_len, vocab_size)
        logits = model.apply({"params": params}, input_ids, deterministic=True)
        next_token_logits = logits[0, -1, :] / temperature

        # Sample the next token from the temperature-scaled distribution
        rng, sample_rng = jax.random.split(rng)
        next_token = jax.random.categorical(sample_rng, next_token_logits)
        next_token = jnp.array([[next_token]], dtype=jnp.int32)

        # Append to context
        input_ids = jnp.concatenate([input_ids, next_token], axis=1)

        # Stop at the end-of-sequence token
        if int(next_token[0, 0]) == tokenizer.eos_id():
            break

        # Print token
        print(tokenizer.DecodeIds([int(next_token[0, 0])]), end='', flush=True)

    print()
    return tokenizer.DecodeIds(input_ids[0].tolist())

# Example usage
generated = generate_text("La France est un pays", max_tokens=50)
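
The loop above samples from the full softmax distribution. A common refinement is top-k sampling, which restricts the draw to the k most likely tokens. The helper below is an illustrative sketch (its name and default k are assumptions, not part of the original script) that could replace the sampling step:

import jax

def sample_top_k(rng, logits, k=40, temperature=0.8):
    # Keep only the k highest-scoring tokens, then sample among them.
    top_logits, top_indices = jax.lax.top_k(logits, k)
    choice = jax.random.categorical(rng, top_logits / temperature)
    return top_indices[choice]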

Limitations and Biases

Current Limitations

  1. Base Model Only: This is a pre-trained base model without instruction tuning. It generates text but doesn't follow instructions or engage in dialogue.

  2. Limited Training Data: With only 4.45B tokens (1 epoch), the model has seen relatively little data compared to production LLMs (typically 1-10 trillion tokens).

  3. Generation Quality: As a base model, it may produce:

    • Repetitive text
    • Incomplete sentences
    • Factual inaccuracies
    • Grammatical errors
  4. Language Imbalance: Because roughly 70% of the training data is English, performance on French may be weaker.

  5. Domain Limitations: Trained only on Wikipedia, the model may struggle with:

    • Conversational language
    • Code generation
    • Creative writing
    • Technical jargon outside Wikipedia's scope

Known Biases

  • Wikipedia Bias: Reflects biases present in Wikipedia content
  • Language Bias: Better performance on English than French
  • Knowledge Cutoff: Limited to Wikipedia dumps from training time
  • Demographic Biases: May reflect Wikipedia's contributor demographics

Recommended Use Cases

✅ Appropriate Uses:

  • Research and educational projects
  • Studying transformer architectures
  • Benchmarking and comparison studies
  • Understanding TPU training workflows
  • Portfolio demonstrations

❌ Not Recommended:

  • Production applications
  • Critical decision-making systems
  • Content requiring factual accuracy
  • Applications requiring instruction-following

Future Work

Planned improvements for future versions:

  1. Instruction Tuning: Fine-tune with instruction datasets (Open Assistant, Alpaca) to create JULIAN-100M-Instruct
  2. Increased Scale: Train 250M-500M parameter versions with 20-30B tokens
  3. Better Tokenization: Experiment with larger vocabularies (50K-100K)
  4. Multi-Task Learning: Incorporate code, dialogue, and other domains
  5. Quantization: Implement 8-bit or 4-bit inference for efficiency

Technical Details

Model Files

julian-100m/
├── checkpoint/              # Orbax checkpoint (480MB)
│   ├── manifest.ocdbt
│   ├── _METADATA
│   ├── _CHECKPOINT_METADATA
│   └── ocdbt.process_0/
├── src/model/               # Model source code
│   ├── config.py            # Model configurations
│   ├── julian.py            # Model architecture
│   └── hf_config.py         # HuggingFace config wrapper
├── tokenizer/
│   ├── julian_24k.model     # SentencePiece model
│   └── julian_24k.vocab     # Vocabulary file
├── config.json              # Model metadata
└── README.md                # This file
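
To inspect the model metadata (assuming config.json is plain JSON, which the layout above suggests):

import json

with open("julian-100m/config.json") as f:
    print(json.dumps(json.load(f), indent=2))   # pretty-print the metadata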

Training Logs

Complete training logs are available showing:

  • Loss progression: 4.8 → 2.39
  • Perplexity improvement: 121 → 11.0
  • Token throughput: ~150k tokens/s
  • Gradient norms and learning rate schedule

Citation

If you use JULIAN-100M in your research, please cite:

@misc{julian100m2025,
  title={JULIAN-100M: A 107M Parameter French-English Language Model},
  author={Julian Kerignard},
  year={2025},
  howpublished={\url{https://huggingface.co/juliankerignard/JULIAN-100M}},
}

License

This model is released under the MIT License. The training data (Wikipedia) is available under Creative Commons licenses.

Acknowledgments

This project was made possible thanks to:

  • Google's TPU Research Cloud (TRC) program for providing TPU v5e access. Training large language models would be prohibitively expensive without such initiatives supporting independent research and education. Special thanks to the TRC team for democratizing access to cutting-edge AI infrastructure.

  • Google Research for developing and open-sourcing JAX, Flax, and Optax - exceptional tools that make TPU programming accessible and enjoyable.

  • The Wikimedia Foundation for maintaining Wikipedia as a free, open knowledge resource.

  • HuggingFace for building an incredible platform that makes sharing models and datasets seamless.

  • The broader open-source ML community for countless tutorials, papers, and code examples that made this learning journey possible.


Note: This is a research model developed for educational purposes and portfolio demonstration. For production use cases, consider larger, instruction-tuned models like Mistral, Llama, or GPT-4.
