French LLM From Scratch - 260M (Fineweb-32k Tokenizer)

Model description

GPT-2-style language model with 292M parameters, trained from scratch on a French corpus. Decoder-only transformer architecture optimized for French text generation.

Key features

Language: French 🇫🇷
Parameters: 292,270,080 (292M)
Architecture: GPT-2 (decoder-only transformer)
Tokenizer: Custom Fineweb-32k BPE (ByteLevel)
Vocab size: 32,000 tokens
Context length: 1024 tokens
Training steps: 200,000

Detailed architecture

├─ Embedding layer: 32,000 × 1024
├─ Position embeddings: 1024 × 1024
├─ Transformer blocks: 18 layers
│  ├─ Multi-head attention: 16 heads, 64 dim/head
│  ├─ Feed-forward: 1024 → 4096 → 1024
│  ├─ LayerNorm (pre-norm architecture)
│  └─ Dropout: 0.1
└─ LM Head: 1024 → 32,000
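
The parameter count can be cross-checked from the layer dimensions listed above. The sketch below assumes the standard GPT-2 layout (a bias on every linear layer inside the blocks, two LayerNorms per block plus a final one, no bias on the LM head), which the card does not spell out. The function name `gpt2_param_count` is illustrative.

```python
def gpt2_param_count(vocab=32_000, ctx=1024, d=1024, d_ff=4096,
                     layers=18, tied_head=True):
    """Count parameters of a GPT-2-style decoder (assumed layout, see text)."""
    tok_emb = vocab * d
    pos_emb = ctx * d
    attn = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV projection + output projection
    mlp = (d * d_ff + d_ff) + (d_ff * d + d)   # 1024 -> 4096 -> 1024
    norms = 2 * (2 * d)                        # two LayerNorms (weight + bias) per block
    block = attn + mlp + norms
    final_norm = 2 * d
    head = 0 if tied_head else d * vocab       # an untied head adds its own matrix
    return tok_emb + pos_emb + layers * block + final_norm + head

tied = gpt2_param_count(tied_head=True)
untied = gpt2_param_count(tied_head=False)
print(f"tied LM head:   {tied:,}")    # 260,550,656
print(f"untied LM head: {untied:,}")  # 293,318,656
```

Under these assumptions, the tied count (about 260.6M) lines up with the "260M" in the model name, and the untied count minus the 1,048,576 position-embedding weights equals the 292,270,080 reported above. This suggests, but does not confirm, that the reported figure counts the LM head separately from the token embedding.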

Training dataset

Combined corpus of ~375M tokens (197M tokens when tokenized with the Mistral tokenizer):

French UltraChat: high-quality instruction-style conversations (60%)
French OASST2: OpenAssistant dialogues
French Dolly: translated Databricks instructions
French Wikipedia: articles on AI and computer science
French FineWeb: filtered educational content

Preprocessing

Train/test split: 85% / 15% with reproducible shuffle (seed=42)
Tokenization: Custom BPE Fineweb-32k (~2.8 chars/token)
Minimum length: 512 tokens
Batch size: 4 (gradient accumulation possible)
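
A reproducible 85/15 split with a fixed seed could look like the sketch below. It uses only Python's standard library and is an illustration, not the author's preprocessing code; `split_corpus` and the toy documents are hypothetical.

```python
import random

def split_corpus(docs, test_fraction=0.15, seed=42):
    """Shuffle with a fixed seed, then cut into train and test parts."""
    shuffled = list(docs)                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # local RNG: same seed gives the same order
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

docs = [f"doc-{i}" for i in range(100)]      # placeholder documents
train, test = split_corpus(docs)
print(len(train), len(test))                 # 85 15
```

Using a local `random.Random(seed)` instance rather than the global RNG keeps the split stable even if other code draws random numbers first.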

Training

Hardware

GPU: NVIDIA RTX 5080 16GB (compute capability 12.0)
Duration: ~23 hours for 200k steps
Throughput: ~8,700 steps/h (with torch.compile)
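
The throughput figure follows from the step count and duration, and the same numbers give a rough upper bound on tokens processed. The sketch assumes no gradient accumulation and that every sequence fills the 1024-token context; neither is stated in the card, so the token total is a ceiling rather than a measurement.

```python
steps = 200_000
hours = 23              # approximate wall-clock time from the card
batch_size = 4
context = 1024
corpus_tokens = 375e6   # approximate corpus size from the card

steps_per_hour = steps / hours
max_tokens_seen = steps * batch_size * context   # upper bound: all sequences full length
passes = max_tokens_seen / corpus_tokens

print(f"{steps_per_hour:,.0f} steps/h")                  # 8,696 steps/h
print(f"{max_tokens_seen / 1e6:,.1f}M tokens at most")   # 819.2M tokens at most
print(f"{passes:.2f} passes over the corpus at most")    # 2.18 passes
```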

Hyperparameters

{
    "max_steps": 200000,
    "batch_size": 4,
    "learning_rate": 3e-06,
    "weight_decay": 0.0,
    "optimizer": "AdamW (apex FusedAdam if available)",
    "scheduler": "Cosine with warmup",
    "precision": "mixed (AMP)",
    "gradient_clipping": 1.0,
    "dropout": 0.1,
    "seed": 13
}
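
For readers unfamiliar with the "Cosine with warmup" scheduler, the sketch below shows one common form: linear warmup to the peak learning rate, then cosine decay. The card does not give the warmup length or the final learning rate, so `warmup_steps=2_000` and `min_lr=0.0` are assumed values chosen for illustration, and `lr_at` is a hypothetical helper.

```python
import math

def lr_at(step, max_steps=200_000, peak_lr=3e-6,
          warmup_steps=2_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    warmup_steps and min_lr are assumptions, not values from the card.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)            # clamp in case step exceeds max_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 1_999, 2_000, 101_000, 200_000):
    print(s, f"{lr_at(s):.3e}")
```

With these assumed settings the rate reaches 3e-06 at the end of warmup, is half of that at the midpoint of the decay, and reaches zero at step 200,000.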

Optimizations

torch.compile(mode='max-autotune'): +15% speed
Mixed precision (AMP): 2x throughput
Optimized DataLoader: num_workers=4, pin_memory=True
Apex FusedAdam optimizer (GPU-native)

Performance

Metrics

Metric	Value
Final train loss	~4.5
Final validation loss	~4.79
Perplexity (val)	~120
Training steps	200,000
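
The perplexity in the table is consistent with the validation loss if the loss is the mean cross-entropy per token in nats (the PyTorch default, assumed here since the card does not say). A quick check:

```python
import math

val_loss = 4.79    # final validation loss from the table
train_loss = 4.5   # final train loss from the table

val_ppl = math.exp(val_loss)      # perplexity = exp(mean cross-entropy)
train_ppl = math.exp(train_loss)

print(f"val perplexity:   {val_ppl:.1f}")    # about 120.3
print(f"train perplexity: {train_ppl:.1f}")  # about 90.0
```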

Generation quality

✅ Excellent: coherent generation, fluent French, correct grammar

Usage

Installation

pip install torch transformers safetensors tokenizers

Loading the model

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Vincent-PRO-AI/french-gpt2-260m-fineweb32k")
tokenizer = AutoTokenizer.from_pretrained("Vincent-PRO-AI/french-gpt2-260m-fineweb32k")

# Generation (do_sample=True is required for temperature and top_k to take effect)
prompt = "Bonjour, comment vas-tu ?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Generation with pipeline

from transformers import pipeline

generator = pipeline('text-generation', model='Vincent-PRO-AI/french-gpt2-260m-fineweb32k')
result = generator("Expliquez-moi l'intelligence artificielle", max_length=200)
print(result[0]['generated_text'])

Limitations

Language: optimized for French only (custom tokenizer)
Context: limited to 1024 tokens (~2,800 chars)
Tokenizer: not compatible with standard tokenizers (Mistral/GPT-2)
Data cutoff: data up to November 2024
Biases: may reflect biases present in the source datasets
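
Because of the 1024-token limit, a long prompt plus the requested generation length can exceed the context window. One simple way to handle this is to keep only the most recent prompt tokens, as sketched below. `fit_to_context` is a hypothetical helper operating on a plain list of token IDs, not part of the model's API.

```python
def fit_to_context(token_ids, max_new_tokens=100, context_length=1024):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    return token_ids[-budget:]

ids = list(range(1500))            # stand-in for tokenizer output
trimmed = fit_to_context(ids)
print(len(trimmed), trimmed[0])    # 924 576
```

Truncating from the left preserves the end of the prompt, which usually matters most for continuation; other strategies (summarizing or chunking) may suit some uses better.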

License

MIT License: free to use for research and production.

Citation

@misc{french-llm-from-scratch,
  author = {Vincent-PRO-AI},
  title = {French LLM From Scratch - 260M with Fineweb-32k Tokenizer},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Vincent-PRO-AI/french-llm-from-scratch}}
}

Contact

Repository : french-llm-from-scratch
Issues : GitHub Issues

Model trained on November 2

Safetensors

Model size: 0.3B params
Tensor type: F32