Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
ajibawa-2023Β 
posted an update 4 days ago
Post
3623
Cpp-Code-Large
Dataset: ajibawa-2023/Cpp-Code-Large

Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines of C++ code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.

By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks.

Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.

I'm curious if there is any part of the dataset that involves AI infra.

Β·

In this case, unfortunately no.

huggingface_test01