---
language: fr
language_name: French
language_family: romance_galloitalic
tags:
- wikilangs
- nlp
- tokenizer
- embeddings
- n-gram
- markov
- wikipedia
- feature-extraction
- sentence-similarity
- tokenization
- n-grams
- markov-chain
- text-mining
- fasttext
- babelvec
- vocabulous
- vocabulary
- monolingual
- family-romance_galloitalic
license: mit
library_name: wikilangs
pipeline_tag: text-generation
datasets:
- omarkamali/wikipedia-monthly
dataset_info:
  name: wikipedia-monthly
  description: Monthly snapshots of Wikipedia articles across 300+ languages
metrics:
- name: best_compression_ratio
  type: compression
  value: 4.573
- name: best_isotropy
  type: isotropy
  value: 0.7808
- name: best_alignment_r10
  type: alignment
  value: 0.9680
- name: vocabulary_size
  type: vocab
  value: 1519124
generated: 2026-03-03
---

# French — Wikilangs Models

Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on **French** Wikipedia by [Wikilangs](https://wikilangs.org).

🌐 [Language Page](https://wikilangs.org/languages/fr/) · 🎮 [Playground](https://wikilangs.org/playground/?lang=fr) · 📊 [Full Research Report](RESEARCH_REPORT.md)

## Language Samples

Example sentences drawn from the French Wikipedia corpus:

> Le décane est un alcane linéaire de formule brute qui possède 136 isomères. Ces diverses molécules comportent toutes dix [en grec δέκα (déca)] atomes de carbone. Notes et références linéaire du décane

> Cette liste représente les plus importantes villes de l'Égypte antique ordonnées par nome et suivies des divinités qui y étaient adorées. Basse-Égypte

> L'eicosane est un alcane linéaire de formule brute . Il possède isomères structuraux. Notes et références linéaire

> Hapy se réfère à Hâpi : Génie à tête de singe de la mythologie égyptienne. Hâpy : Dieu du Nil dans la mythologie égyptienne.

> L'heptane ou n-heptane est l'hydrocarbure saturé de la famille des alcanes linéaires de formule CH. Notes et références linéaire de l'heptane

## Quick Start

### Load the Tokenizer

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("fr_tokenizer_32k.model")

text = "Lapon peut désigner : les Samis ; les langues sames ; Lapon, une ville du Soudan"
tokens = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)

print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back
print(sp.DecodeIds(ids))
```

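To compare vocabulary sizes on your own text, one simple measure is characters per token. The sketch below assumes that definition of compression; the exact formula behind the compression figures reported in this card is not stated here, and `compression_ratio` is a hypothetical helper, not part of SentencePiece.

```python
def compression_ratio(text: str, pieces: list[str]) -> float:
    """Characters of input text per token (an assumed definition)."""
    if not pieces:
        raise ValueError("pieces must not be empty")
    return len(text) / len(pieces)


# Hand-written pieces for illustration only; they are not real model output.
text = "les langues sames"
pieces = ["▁les", "▁langues", "▁sam", "es"]

ratio = compression_ratio(text, pieces)
print(f"{ratio:.2f} characters per token")
```

With a loaded tokenizer, pass `sp.EncodeAsPieces(text)` as `pieces`.
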
<details>
<summary><b>Tokenization examples (click to expand)</b></summary>

**Sample 1:** `Lapon peut désigner : les Samis ; les langues sames ; Lapon, une ville du Soudan…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁lap on ▁peut ▁désigner ▁: ▁les ▁sam is ▁; ▁les … (+27 more)` | 37 |
| 16k | `▁lap on ▁peut ▁désigner ▁: ▁les ▁sam is ▁; ▁les … (+27 more)` | 37 |
| 32k | `▁lap on ▁peut ▁désigner ▁: ▁les ▁sam is ▁; ▁les … (+26 more)` | 36 |
| 64k | `▁lap on ▁peut ▁désigner ▁: ▁les ▁sam is ▁; ▁les … (+25 more)` | 35 |

**Sample 2:** `Le pentadécane est un alcane linéaire de formule brute . Il possède 4 347 isomèr…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁le ▁pent ad éc ane ▁est ▁un ▁al c ane … (+27 more)` | 37 |
| 16k | `▁le ▁pent ad éc ane ▁est ▁un ▁alc ane ▁linéaire … (+22 more)` | 32 |
| 32k | `▁le ▁pent ad éc ane ▁est ▁un ▁alc ane ▁linéaire … (+20 more)` | 30 |
| 64k | `▁le ▁pent ad éc ane ▁est ▁un ▁alcane ▁linéaire ▁de … (+18 more)` | 28 |

**Sample 3:** `L'eicosane est un alcane linéaire de formule brute . Il possède isomères structu…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁l ' e ic os ane ▁est ▁un ▁al c … (+22 more)` | 32 |
| 16k | `▁l ' e icos ane ▁est ▁un ▁alc ane ▁linéaire … (+16 more)` | 26 |
| 32k | `▁l ' e icos ane ▁est ▁un ▁alc ane ▁linéaire … (+14 more)` | 24 |
| 64k | `▁l ' e icos ane ▁est ▁un ▁alcane ▁linéaire ▁de … (+12 more)` | 22 |

</details>

### Load Word Embeddings

```python
from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("fr_embeddings_128d_aligned.kv")

# "word" is a placeholder; use any token present in the vocabulary
similar = wv.most_similar("word", topn=5)
for word, score in similar:
    print(f" {word}: {score:.3f}")
```

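`most_similar` ranks neighbours by cosine similarity. The plain-Python sketch below shows that ranking on made-up 3-dimensional vectors; `toy_vectors`, `cosine`, and `rank_neighbours` are illustrative names and values, not part of gensim or of the released embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def rank_neighbours(query, vectors, topn=5):
    """Return the topn (word, score) pairs closest to `query`, excluding itself."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical vectors, chosen by hand for illustration.
toy_vectors = {
    "paris": [0.9, 0.1, 0.0],
    "lyon": [0.8, 0.2, 0.1],
    "alcane": [0.0, 0.1, 0.9],
}

neighbours = rank_neighbours("paris", toy_vectors, topn=2)
for word, score in neighbours:
    print(f"{word}: {score:.3f}")
```
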
### Load N-gram Model

```python
import pyarrow.parquet as pq

df = pq.read_table("fr_3gram_word.parquet").to_pandas()
print(df.head())
```

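The Parquet schema is not described here, so inspect `df.columns` before relying on any column name. As an illustration of what a 3-gram table can be used for, the sketch below estimates next-word probabilities from hypothetical `(w1, w2, w3) -> count` entries; the toy counts and the helper name `next_word_probs` are assumptions, not the pipeline's own code.

```python
from collections import defaultdict


def next_word_probs(trigram_counts, context):
    """Maximum-likelihood estimate of P(next word | two-word context)."""
    totals = defaultdict(int)
    for (w1, w2, w3), count in trigram_counts.items():
        if (w1, w2) == context:
            totals[w3] += count
    total = sum(totals.values())
    if total == 0:
        return {}  # context never seen
    return {word: count / total for word, count in totals.items()}


# Hypothetical counts for illustration; not taken from fr_3gram_word.parquet.
toy_counts = {
    ("est", "un", "alcane"): 3,
    ("est", "un", "dieu"): 1,
    ("de", "formule", "brute"): 4,
}

probs = next_word_probs(toy_counts, ("est", "un"))
print(probs)
```
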
## Models Overview

![Performance Dashboard](visualizations/performance_dashboard.png)

| Category | Assets |
|----------|--------|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword) |
| Embeddings | 32d, 64d, 128d — mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |

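A Zipf analysis of a frequency list can be approximated by fitting a line to log-frequency versus log-rank and reporting the slope and R². This is a minimal sketch assuming an ordinary least-squares fit; the pipeline's actual procedure may differ, and `zipf_fit` and the synthetic frequencies are hypothetical.

```python
import math


def zipf_fit(frequencies):
    """Least-squares fit of log(frequency) on log(rank); returns (slope, r_squared)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # For simple linear regression, R² equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, r_squared


# Synthetic, perfectly Zipfian frequencies (frequency proportional to 1 / rank).
toy_frequencies = [1000 / rank for rank in range(1, 51)]
slope, r_squared = zipf_fit(toy_frequencies)
print(f"slope={slope:.3f}, R²={r_squared:.4f}")
```

A slope near -1 with R² close to 1 indicates a frequency list that follows Zipf's law closely.
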
## Metrics Summary

| Component | Model | Key Metric | Value |
|-----------|-------|------------|-------|
| Tokenizer | 8k BPE | Compression | 3.72x |
| Tokenizer | 16k BPE | Compression | 4.08x |
| Tokenizer | 32k BPE | Compression | 4.37x |
| Tokenizer | 64k BPE | Compression | 4.57x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 251 🏆 |
| N-gram | 2-gram (word) | Perplexity | 197,170 |
| N-gram | 3-gram (subword) | Perplexity | 1,988 |
| N-gram | 3-gram (word) | Perplexity | 1,815,223 |
| N-gram | 4-gram (subword) | Perplexity | 11,120 |
| N-gram | 4-gram (word) | Perplexity | 5,518,864 |
| N-gram | 5-gram (subword) | Perplexity | 46,850 |
| N-gram | 5-gram (word) | Perplexity | 4,103,331 |
| Markov | ctx-1 (subword) | Predictability | 0.0% |
| Markov | ctx-1 (word) | Predictability | 9.5% |
| Markov | ctx-2 (subword) | Predictability | 39.6% |
| Markov | ctx-2 (word) | Predictability | 53.5% |
| Markov | ctx-3 (subword) | Predictability | 36.4% |
| Markov | ctx-3 (word) | Predictability | 74.5% |
| Markov | ctx-4 (subword) | Predictability | 34.0% |
| Markov | ctx-4 (word) | Predictability | 87.3% 🏆 |
| Vocabulary | full | Size | 1,519,124 |
| Vocabulary | full | Zipf R² | 0.9927 |
| Embeddings | mono_32d | Isotropy | 0.7808 🏆 |
| Embeddings | mono_64d | Isotropy | 0.7574 |
| Embeddings | mono_128d | Isotropy | 0.6995 |
| Embeddings | aligned_32d | Isotropy | 0.7808 |
| Embeddings | aligned_64d | Isotropy | 0.7574 |
| Embeddings | aligned_128d | Isotropy | 0.6995 |
| Alignment | aligned_32d | R@1 / R@5 / R@10 | 48.2% / 74.8% / 82.4% |
| Alignment | aligned_64d | R@1 / R@5 / R@10 | 70.8% / 89.6% / 94.2% |
| Alignment | aligned_128d | R@1 / R@5 / R@10 | 81.2% / 93.4% / 96.8% 🏆 |

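For reference, perplexity is conventionally the exponential of the average negative log-probability a model assigns to each token. The sketch below uses that textbook definition; whether the figures in the table were computed exactly this way is not stated here, and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability over the observed tokens."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if any(p <= 0 or p > 1 for p in token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# A model that is uniformly unsure among 4 choices at every step.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(f"{uniform_ppl:.2f}")
```

Under this definition, lower values mean the model is less surprised by the text, and a model that always assigns probability 1 to the observed token has perplexity 1.
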
📊 **[Full ablation study, per-model breakdowns, and interpretation guide →](RESEARCH_REPORT.md)**

---

## About

Trained on [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly) — monthly snapshots of 300+ Wikipedia languages.

A project by **[Wikilangs](https://wikilangs.org)** · Maintainer: [Omar Kamali](https://omarkamali.com) · [Omneity Labs](https://omneitylabs.com)

### Citation

```bibtex
@misc{wikilangs2025,
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
  doi = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}
```

### Links

- 🌐 [wikilangs.org](https://wikilangs.org)
- 🌍 [Language page](https://wikilangs.org/languages/fr/)
- 🎮 [Playground](https://wikilangs.org/playground/?lang=fr)
- 🤗 [HuggingFace models](https://huggingface.co/wikilangs)
- 📊 [wikipedia-monthly dataset](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
- 👤 [Omar Kamali](https://huggingface.co/omarkamali)
- 🤝 Sponsor: [Featherless AI](https://featherless.ai)

**License:** MIT — free for academic and commercial use.

---
*Generated by Wikilangs Pipeline · 2026-03-03 05:41:40*