Crux37/MultiLang-Code-Parser-Dataset-bucket

114 GB

42 files

Updated 6 days ago

Name	Size	Uploaded	Xet hash
.gitattributes	2.44 kB xet	6 days ago	b16d233d
README.md	6.27 kB xet	6 days ago	47919f99
c_parsed_1.parquet	2.79 GB xet	6 days ago	47be985a
c_parsed_2.parquet	2.81 GB xet	6 days ago	fc318ae6
c_parsed_3.parquet	2.8 GB xet	6 days ago	b82be1c4
c_parsed_4.parquet	2.83 GB xet	6 days ago	a022254a
c_sharp_parsed_1.parquet	2.28 GB xet	6 days ago	5b0aeaf6
c_sharp_parsed_2.parquet	2.28 GB xet	6 days ago	ba50ac7d
c_sharp_parsed_3.parquet	2.29 GB xet	6 days ago	e15c7093
c_sharp_parsed_4.parquet	2.28 GB xet	6 days ago	52ab76da
cpp_parsed_1.parquet	4.95 GB xet	6 days ago	3c91b625
cpp_parsed_2.parquet	4.97 GB xet	6 days ago	b4eb46f1
cpp_parsed_3.parquet	4.96 GB xet	6 days ago	a3ad6031
cpp_parsed_4.parquet	4.95 GB xet	6 days ago	9c6c6fc1
go_parsed_1.parquet	3.9 GB xet	6 days ago	3be393c8
go_parsed_2.parquet	3.89 GB xet	6 days ago	621b8183
go_parsed_3.parquet	3.9 GB xet	6 days ago	0a070a40
go_parsed_4.parquet	3.89 GB xet	6 days ago	e7ae4851
java_parsed_1.parquet	2.84 GB xet	6 days ago	6b11e344
java_parsed_2.parquet	2.84 GB xet	6 days ago	b8e6003c
java_parsed_3.parquet	2.85 GB xet	6 days ago	eaaa748f
java_parsed_4.parquet	2.85 GB xet	6 days ago	de974858
javascript_parsed_1.parquet	2.32 GB xet	6 days ago	b2bb62e1
javascript_parsed_2.parquet	2.32 GB xet	6 days ago	19c5a7e7
javascript_parsed_3.parquet	2.33 GB xet	6 days ago	d6548468
javascript_parsed_4.parquet	2.33 GB xet	6 days ago	b55f4eb9
python_parsed_1.parquet	3.17 GB xet	6 days ago	a15ab37f
python_parsed_2.parquet	3.18 GB xet	6 days ago	05c7c26b
python_parsed_3.parquet	3.18 GB xet	6 days ago	fcb402ea
python_parsed_4.parquet	3.18 GB xet	6 days ago	bf09fa18
ruby_parsed_1.parquet	1.35 GB xet	6 days ago	c29bb790
ruby_parsed_2.parquet	1.35 GB xet	6 days ago	950e9f9d
ruby_parsed_3.parquet	1.35 GB xet	6 days ago	4e0137e5
ruby_parsed_4.parquet	1.35 GB xet	6 days ago	34f2e98b
scala_parsed_1.parquet	2.79 GB xet	6 days ago	37d7fdee
scala_parsed_2.parquet	2.78 GB xet	6 days ago	4c608fe7
scala_parsed_3.parquet	2.78 GB xet	6 days ago	152ab768
scala_parsed_4.parquet	2.78 GB xet	6 days ago	a1e029fd
typescript_parsed_1.parquet	2.06 GB xet	6 days ago	0a9decd3
typescript_parsed_2.parquet	2.06 GB xet	6 days ago	3540ca9d
typescript_parsed_3.parquet	2.06 GB xet	6 days ago	b370fd2b
typescript_parsed_4.parquet	2.05 GB xet	6 days ago	42c02060

README.md

MultiLang Code Parser Dataset (MLCPD)

MultiLang-Code-Parser-Dataset (MLCPD) provides a large-scale, unified dataset of parsed source code across 10 major programming languages (C, C#, C++, Go, Java, JavaScript, Python, Ruby, Scala, and TypeScript), represented under a universal schema that captures syntax, semantics, and structure in a consistent format.

Each entry corresponds to one parsed source file and includes:

Language metadata
Code-level statistics (lines, errors, AST nodes)
Universal Schema JSON (normalized structural representation)

MLCPD enables robust cross-language analysis, code understanding, and representation learning by providing a consistent, language-agnostic data structure suitable for both traditional ML and modern LLM-based workflows.

📂 Dataset Structure

MultiLang-Code-Parser-Dataset/
├── c_parsed_1.parquet
├── c_parsed_2.parquet
├── c_parsed_3.parquet
├── c_parsed_4.parquet
├── c_sharp_parsed_1.parquet
├── ...
└── typescript_parsed_4.parquet

Each file corresponds to one partition of a language (~175k rows each).

Each record contains:

Field	Type	Description
`language`	`str`	Programming language name
`code`	`str`	Raw source code
`avg_line_length`	`float`	Average line length
`line_count`	`int`	Number of lines
`lang_specific_parse`	`str`	Tree-sitter parse output
`ast_node_count`	`int`	Number of AST nodes
`num_errors`	`int`	Parse errors
`universal_schema`	`str`	JSON-formatted unified schema
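
Because `universal_schema` is stored as a JSON string, it has to be decoded before it can be traversed. The sketch below relies only on the field names in the table above. The sample record is a made-up stand-in, and the `nodes` key inside the schema is hypothetical, since the internal layout of the schema is not described here.

```python
import json


def decode_record(record):
    """Return a copy of the record with `universal_schema` parsed from JSON."""
    decoded = dict(record)
    decoded["universal_schema"] = json.loads(record["universal_schema"])
    return decoded


# Hypothetical record for illustration; real rows come from the Parquet files.
sample = {
    "language": "python",
    "code": "print('hi')\n",
    "line_count": 1,
    "num_errors": 0,
    "universal_schema": json.dumps({"nodes": []}),  # placeholder content
}

decoded = decode_record(sample)
# Inspect the top-level keys before assuming any particular structure.
print(sorted(decoded["universal_schema"].keys()))
```

Printing the top-level keys of a few real records is a reasonable first step before writing code that depends on the schema's shape.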

📊 Key Statistics

Metric	Value
Total Languages	10
Total Files	40
Total Records	7,021,722
Successful Conversions	7,021,718 (99.9999%)
Failed Conversions	4 (3 in C, 1 in C++)
Disk Size	~114 GB (Parquet format)
Memory Size	~600 GB (when loaded in memory)

The dataset is balanced by record count across languages: each language has four partitions of roughly 175k rows each.
It can be used per language or as a combined cross-language corpus.
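
The figures in the table can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers stated above: the record and file counts, the four failed conversions, and the ~114 GB on-disk and ~600 GB in-memory sizes.

```python
total_records = 7_021_722
total_files = 40
failed_conversions = 4
disk_gb = 114
memory_gb = 600

# Average rows per Parquet file, matching the "~175k rows each" figure.
rows_per_file = total_records / total_files

# Share of records that converted to the universal schema.
success_rate = (total_records - failed_conversions) / total_records * 100

# Rough growth factor when the Parquet data is loaded into memory.
expansion = memory_gb / disk_gb

print(f"rows per file: {rows_per_file:,.0f}")
print(f"success rate: {success_rate:.5f}%")
print(f"in-memory / on-disk: {expansion:.1f}x")
```

The roughly fivefold growth from disk to memory is a reason to prefer loading single partitions or streaming over loading everything at once.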

🚀 Use Cases

MLCPD can be directly used for:

Cross-language code representation learning
Program understanding and code similarity tasks
Syntax-aware pretraining for LLMs
Code summarization, clone detection, and bug prediction
Graph-based learning on universal ASTs
Benchmark creation for cross-language code reasoning

🔍 Features

Universal Schema: A unified structural representation harmonizing AST node types across languages.
Compact Format: Stored in Apache Parquet, allowing fast access and efficient querying.
Cross-Language Compatibility: Enables comparative code structure analysis across multiple programming ecosystems.
Near-Complete Conversion: 99.9999% of ~7M code files convert successfully to the universal schema; parse errors are recorded per file in `num_errors`.
Statistical Richness: Includes per-language metrics such as mean line count, AST size, and error ratios.
Ready for ML Pipelines: Compatible with PyTorch, TensorFlow, Hugging Face Transformers, and graph-based models.

📥 How to Access the Dataset

Using the Hugging Face `datasets` Library

This dataset is hosted on the Hugging Face Hub and can be easily accessed using the datasets library.

Install the Required Library

pip install datasets

Import Library

from datasets import load_dataset

Load the Entire Dataset

# Downloads all 40 Parquet files (~114 GB on disk)
dataset = load_dataset(
    "jugalgajjar/MultiLang-Code-Parser-Dataset"
)

Load a Specific Language File

dataset = load_dataset(
    "jugalgajjar/MultiLang-Code-Parser-Dataset",
    data_files="python_parsed_1.parquet"
)
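
To load every partition of one language instead of a single file, `data_files` also accepts a list of file names. The sketch below builds that list from the naming pattern in the dataset structure (`<language>_parsed_<n>.parquet`, four partitions per language). The helper names `partition_files` and `load_language` are illustrative and not part of any library.

```python
def partition_files(language, n_parts=4):
    """Build the Parquet file names for one language, e.g. python_parsed_1.parquet."""
    return [f"{language}_parsed_{i}.parquet" for i in range(1, n_parts + 1)]


def load_language(language, streaming=True):
    """Load all partitions of one language (needs network access when called)."""
    from datasets import load_dataset  # imported lazily so the helper above has no dependencies

    return load_dataset(
        "jugalgajjar/MultiLang-Code-Parser-Dataset",
        data_files=partition_files(language),
        streaming=streaming,
    )


print(partition_files("c_sharp"))
```

Language names must match the file prefixes exactly, for example `c_sharp` and `cpp` rather than `C#` and `C++`.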

Stream Data

dataset = load_dataset(
    "jugalgajjar/MultiLang-Code-Parser-Dataset",
    data_files="python_parsed_1.parquet",
    streaming=True
)
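
When streaming, records can be filtered on the fly without holding a full partition in memory. The sketch below is one possible approach: it keeps only records that parsed without errors and, optionally, stay under a line limit, using the `num_errors` and `line_count` fields. The generator name `clean_examples` and the fake records are illustrative only.

```python
from itertools import islice


def clean_examples(records, max_lines=None):
    """Yield records with no parse errors and, optionally, at most max_lines lines."""
    for record in records:
        if record["num_errors"] != 0:
            continue
        if max_lines is not None and record["line_count"] > max_lines:
            continue
        yield record


# Stand-in for a streamed split; real records carry more fields.
fake_stream = [
    {"language": "python", "line_count": 10, "num_errors": 0},
    {"language": "python", "line_count": 12, "num_errors": 3},
    {"language": "python", "line_count": 500, "num_errors": 0},
    {"language": "python", "line_count": 40, "num_errors": 0},
]

# Take the first two records that pass the filter.
kept = list(islice(clean_examples(fake_stream, max_lines=100), 2))
print([r["line_count"] for r in kept])
```

With a streamed dataset, `dataset["train"]` can be passed in place of `fake_stream`, since it is iterable and yields one dictionary per record.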

Access Data Content (After Loading)

from itertools import islice

# islice works for both downloaded and streamed datasets
try:
    for example in islice(dataset["train"], 5):
        print(example)
        print("-" * 25)
except Exception as e:
    print(f"An error occurred: {e}")

Manual Download

You can also manually download specific language files from the Hugging Face repository page:

Visit https://huggingface.co/datasets/jugalgajjar/MultiLang-Code-Parser-Dataset
Navigate to the Files tab
Click on the language file you want (e.g., python_parsed_1.parquet)
Use the Download button to save locally
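
The same files can also be fetched by URL, which is convenient for scripted downloads. The sketch below builds a direct link using the Hub's `resolve` URL pattern for dataset repositories. The `main` revision is an assumption, and `resolve_url` is an illustrative helper rather than a library function.

```python
def resolve_url(repo_id, filename, revision="main"):
    """Build a direct download URL for a file in a Hugging Face dataset repository."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"


url = resolve_url(
    "jugalgajjar/MultiLang-Code-Parser-Dataset",
    "python_parsed_1.parquet",
)
print(url)
```

The printed URL can be passed to any HTTP client that follows redirects. Each partition is several gigabytes, so a resumable downloader is advisable.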

🧾 Citation

If you use this dataset in your research or work, please cite the following paper:

Gajjar, J., & Subramaniakuppusamy, K. (2025).
MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema.
arXiv preprint arXiv:2510.16357

@article{gajjar2025mlcpd,
  title={MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema},
  author={Gajjar, Jugal and Subramaniakuppusamy, Kamalasankari},
  journal={arXiv preprint arXiv:2510.16357},
  year={2025}
}

📜 License

This dataset is released under the MIT License.
You are free to use, modify, and redistribute it for research and educational purposes, with proper attribution.

🙏 Acknowledgements

StarCoder Dataset for source code samples
Tree-sitter for parsing
Hugging Face for dataset hosting

📧 Contact

For questions, collaborations, or feedback:

Primary Author: Jugal Gajjar
Email: 812jugalgajjar@gmail.com
LinkedIn: linkedin.com/in/jugal-gajjar/

⭐ If you find this dataset useful, consider liking the dataset and the GitHub repository and sharing your work that uses it.

Last updated: Jun 28
