AI & ML interests

LLM, OCR, Embedding Models, Private Intelligence

hasankursun 
posted an update 2 days ago
Greek Corpus 150B is now live on the Hub.
A globally deduplicated, ~146B-token Greek dataset for pretraining and fine-tuning foundation models. It has a pretrain layer and an instruction (SFT) layer that share one unified schema.
📊 49.6M documents / ~146B pretrain tokens

📚 Web (FineWeb-2) + long-form PDFs (FinePDFs) + FineWiki + native Greek legislation (47k statutes from the Government Gazette)

💬 ~10B-token SFT layer (9.9M conversations)
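
For a rough sense of scale, the figures above imply an average length per document and per conversation. The sketch below only does that arithmetic on the rounded numbers quoted in this post; the variable and function names are illustrative, and the real averages depend on the tokenizer and the exact counts.

```python
# Rounded figures as quoted in the post (approximate, not measured here).
PRETRAIN_DOCS = 49.6e6      # documents in the pretrain layer
PRETRAIN_TOKENS = 146e9     # tokens in the pretrain layer
SFT_CONVERSATIONS = 9.9e6   # conversations in the SFT layer
SFT_TOKENS = 10e9           # tokens in the SFT layer


def mean_tokens(total_tokens: float, items: float) -> float:
    """Average number of tokens per item, given two totals."""
    return total_tokens / items


pretrain_avg = mean_tokens(PRETRAIN_TOKENS, PRETRAIN_DOCS)
sft_avg = mean_tokens(SFT_TOKENS, SFT_CONVERSATIONS)

print(f"pretrain: ~{pretrain_avg:,.0f} tokens per document")
print(f"SFT: ~{sft_avg:,.0f} tokens per conversation")
```

That works out to roughly 2,900 tokens per pretrain document and about 1,000 tokens per SFT conversation, which fits a mix that includes long-form PDFs and legislation alongside web pages.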
The newest in my Global Corpus family — Dutch, Turkish, Bulgarian, Greek — built on a consistent, reproducible pipeline.
🔗 hasankursun/greek-corpus-150b
#greek #llm #dataset #multilingual