AI & ML interests

Wikilangs is an open-source initiative to democratize access to natural language processing models for every language represented on Wikipedia. A project by @OmarKamali, graciously sponsored by Featherless.ai.

Recent Activity

omarkamali updated a model 24 days ago
wikilangs/hu
omarkamali updated a model 24 days ago
wikilangs/ar
omarkamali updated a model 24 days ago
wikilangs/ceb

omarkamali posted an update 3 days ago
Omneity Labs LID Benchmark is live 🔥

- 8 Evals
- 10 Models (GlotLID, OpenLID, our own Gherbal and others)
- 200+ Languages
- One Leaderboard To Rule Them All!

Come find your language and see which LID model supports it best in this space 👇

omneity-labs/lid-benchmark
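
For a sense of what the benchmarked models actually do, here is a minimal sketch of a single LID call using GlotLID, one of the ten models on the board. The `model.bin` filename is an assumption on my part; check the cis-lmu/glotlid model card for the exact file.

```python
# A minimal sketch of one LID call with GlotLID, one of the benchmarked
# models. Assumes its fastText checkpoint is published as model.bin in
# cis-lmu/glotlid on the Hub; verify the filename on the model card.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
model = fasttext.load_model(model_path)

# GlotLID labels combine an ISO 639-3 code with a script, e.g. eng_Latn.
labels, probs = model.predict("Wikipedia ni ensaiklopidia huru.", k=3)
for label, prob in zip(labels, probs):
    print(label.removeprefix("__label__"), round(float(prob), 3))
```

The leaderboard runs comparisons like this across 200+ languages, which is what makes per-language support visible at a glance.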
omarkamali posted an update 4 days ago
I just might have cracked tokenizer-free LLMs. No vocab, no softmax.

I'm training a 22M-parameter LLM right now to test this "thing", and it's already able to formulate coherent sentences 🤯

Bear in mind, this is a completely new, tokenizer-free LLM architecture with built-in language universality.

Check the explainer video to understand what's happening. Feedback welcome on this approach!
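
The post doesn't disclose the architecture, so the sketch below is not Omar's design. It only illustrates, under my own assumptions, one generic way a model can have "no vocab, no softmax": treat raw UTF-8 bytes as fixed continuous codes and decode by nearest neighbor instead of a softmax over a learned vocabulary.

```python
# NOT the architecture from the post; a generic illustration of
# "no tokenizer, no vocab, no softmax" under my own assumptions:
# raw bytes become fixed random codes, and decoding is a nearest-
# neighbor match against those codes rather than a softmax.
import torch

d_model = 64
byte_table = torch.randn(256, d_model)            # one fixed code per byte value
byte_table /= byte_table.norm(dim=-1, keepdim=True)

def encode(text: str) -> torch.Tensor:
    """UTF-8 bytes -> continuous codes: no tokenizer, no learned vocab."""
    ids = torch.tensor(list(text.encode("utf-8")))
    return byte_table[ids]

def decode(vecs: torch.Tensor) -> str:
    """Nearest-neighbor match against the byte codes: no softmax."""
    ids = (vecs @ byte_table.T).argmax(dim=-1)
    return bytes(ids.tolist()).decode("utf-8", errors="replace")

x = encode("hello, wikipedia")                    # shape: (16, 64)
print(decode(x))                                  # round-trips the input
```

A model trained on such codes would regress the next vector directly (e.g. with a cosine or MSE loss) instead of emitting logits over a vocabulary; whether the actual architecture works this way is unknown.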

omarkamali posted an update 21 days ago
You're probably training on outdated Wikipedia data right now and don't know it. 💡

In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles that are nowhere to be found on HuggingFace."

He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.
• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.

I could've shrugged and moved on. Instead, I spent the following months building an automated monthly pipeline for 340+ languages on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).

Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.

Here's the full story of how I built Wikipedia Monthly 👇

https://omarkamali.com/blog/wikipedia-monthly-pipeline
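
If you want to check what you're actually training on, here is a minimal sketch of pulling a fresh snapshot. The omarkamali/wikipedia-monthly repo id and the `latest.ary` config name follow what I believe is the dataset's `<snapshot>.<lang>` convention, but verify the exact names on the dataset card.

```python
# A minimal sketch, assuming the dataset lives at omarkamali/wikipedia-monthly
# and exposes <snapshot>.<lang> configs (e.g. "latest.ary" for the most recent
# Moroccan Arabic dump); verify the exact config names on the dataset card.
from datasets import load_dataset

ds = load_dataset("omarkamali/wikipedia-monthly", "latest.ary", split="train")
print(ds.num_rows)        # article count in the current snapshot
print(ds[0]["title"])     # columns presumably mirror the legacy wikipedia dataset
```

Comparing that row count against the 2023-era snapshot for your language is the quickest way to see how much text you've been missing.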