omarkamali
posted an update 26 days ago
Exciting updates to the Wikipedia Monthly dataset for November! 🚀

・ Fixed a bug so that infobox leftovers and other wiki markers such as __TOC__ are now removed
・ New Python package, wikisets (https://pypi.org/project/wikisets): a dataset builder with efficient sampling, so you can seamlessly combine the languages you want for any date (ideal for pretraining data, but works for any purpose)
・ Moved the pipeline to a large server. Costs are much higher, but reliability and predictability are better (let me know if you'd like to sponsor this!)
・ Dataset sizes are unfortunately missing this month due to shenanigans with the migration, but they should be back in December's update
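
For anyone curious what removing markers such as __TOC__ can look like, here is a minimal sketch. It is not the actual pipeline code: the `strip_wiki_markers` helper and the `__[A-Z]+__` pattern are assumptions made for illustration, based on the double-underscore, all-caps shape of markers like __TOC__ and __NOTOC__.

```python
import re

# Assumed pattern: markers shaped like __TOC__, __NOTOC__, __NOEDITSECTION__.
# Lowercase names such as __init__ are deliberately left alone.
BEHAVIOR_SWITCH = re.compile(r"__[A-Z]+__")


def strip_wiki_markers(text: str) -> str:
    """Remove marker tokens and tidy the whitespace they leave behind."""
    cleaned = BEHAVIOR_SWITCH.sub("", text)
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned)  # drop trailing spaces on a line
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)  # collapse runs of blank lines
    return cleaned.strip()


if __name__ == "__main__":
    sample = "Intro text.\n__TOC__\nBody text."
    print(strip_wiki_markers(sample))
```

A single regex keeps the sketch short; a real cleaner would likely also need to handle infobox templates, which are nested and cannot be matched reliably with one pattern.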

Check out the dataset:
omarkamali/wikipedia-monthly