Greek Corpus 150B is now live on the Hub.
A globally deduplicated, ~146B-token Greek dataset for pretraining and fine-tuning foundation models: a pretrain layer plus an instruction (SFT) layer, both in one unified schema.
📊 49.6M documents / ~146B pretrain tokens
📚 Web (FineWeb-2) + long-form PDFs (FinePDFs) + FineWiki + native Greek legislation (47k statutes from the Government Gazette)
💬 ~10B-token SFT layer (9.9M conversations)
The newest in my Global Corpus family (Dutch, Turkish, Bulgarian, Greek), built on a consistent, reproducible pipeline.
🔗 hasankursun/greek-corpus-150b
#greek #llm #dataset #multilingual