
	---
	license: mit
	language:
	- en
	tags:
	- code
	---

	# MultiLang Code Parser Dataset (MLCPD)

	[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
	[![GitHub](https://img.shields.io/badge/GitHub-Repository-181717.svg?logo=github)](https://github.com/JugalGajjar/MultiLang-Code-Parser-Dataset)
	[![arXiv](https://img.shields.io/badge/arXiv-2510.16357-b31b1b.svg)](https://arxiv.org/abs/2510.16357)


	MultiLang-Code-Parser-Dataset (MLCPD) provides a large-scale, unified dataset of parsed source code across 10 major programming languages, represented under a universal schema that captures syntax, semantics, and structure in a consistent format.

	Each entry corresponds to one parsed source file and includes:
	- Language metadata
	- Code-level statistics (lines, errors, AST nodes)
	- Universal Schema JSON (normalized structural representation)

	MLCPD enables robust cross-language analysis, code understanding, and representation learning by providing a consistent, language-agnostic data structure suitable for both traditional ML and modern LLM-based workflows.

	---

	## 📂 Dataset Structure

	```
	MultiLang-Code-Parser-Dataset/
	├── c_parsed_1.parquet
	├── c_parsed_2.parquet
	├── c_parsed_3.parquet
	├── c_parsed_4.parquet
	├── c_sharp_parsed_1.parquet
	├── ...
	└── typescript_parsed_4.parquet
	```
	Each file corresponds to one partition of a language (~175k rows each).
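
The naming pattern in the tree above can be turned into a list of file names for one language. This is a minimal sketch, assuming every language has four partitions named `<language>_parsed_<n>.parquet` (consistent with 40 files across 10 languages); the helper name `partition_files` is hypothetical and not part of the dataset tooling.

```python
def partition_files(language: str, partitions: int = 4) -> list[str]:
    """Return the expected Parquet file names for one language.

    Assumes the `<language>_parsed_<n>.parquet` pattern shown in the tree,
    with partitions numbered from 1.
    """
    return [f"{language}_parsed_{n}.parquet" for n in range(1, partitions + 1)]


print(partition_files("c_sharp"))
```

The resulting list can be passed as the `data_files` argument of `load_dataset` to load all partitions of a single language.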

	Each record contains:

	| Field | Type | Description |
	|--------|------|-------------|
	| `language` | `str` | Programming language name |
	| `code` | `str` | Raw source code |
	| `avg_line_length` | `float` | Average line length |
	| `line_count` | `int` | Number of lines |
	| `lang_specific_parse` | `str` | TreeSitter parse output |
	| `ast_node_count` | `int` | Number of AST nodes |
	| `num_errors` | `int` | Parse errors |
	| `universal_schema` | `str` | JSON-formatted unified schema |
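
Because `universal_schema` is stored as a JSON string, it has to be decoded before its structure can be inspected. The sketch below uses a toy record with the field names from the table; the values, and in particular the keys inside the toy `universal_schema`, are invented for illustration and do not describe the real schema layout. The helper name `decode_schema` is likewise hypothetical.

```python
import json

# Toy record using the field names from the table. The contents of
# `universal_schema` are invented for illustration; the real schema
# layout is defined by the dataset, not by this sketch.
record = {
    "language": "python",
    "code": "print('hi')\n",
    "avg_line_length": 11.0,
    "line_count": 1,
    "lang_specific_parse": "(module (expression_statement (call)))",
    "ast_node_count": 3,
    "num_errors": 0,
    "universal_schema": json.dumps({"language": "python", "nodes": []}),
}


def decode_schema(rec: dict) -> dict:
    """Decode the JSON string stored in `universal_schema` into a dict."""
    return json.loads(rec["universal_schema"])


schema = decode_schema(record)
print(type(schema).__name__, sorted(schema))
```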

	---

	## 📊 Key Statistics

	| Metric | Value |
	|--------|--------|
	| Total Languages | 10 |
	| Total Files | 40 |
	| Total Records | 7,021,722 |
	| Successful Conversions | 7,021,718 (99.9999%) |
	| Failed Conversions | 4 (3 in C, 1 in C++) |
	| Disk Size | ~114 GB (Parquet format) |
	| Memory Size | ~600 GB (Parquet format) |

	The dataset is clean, lossless, and statistically balanced across languages.
	It offers both per-language and combined cross-language representations.
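
The figures in the table are mutually consistent, which can be checked with a few lines of arithmetic. The snippet only uses the numbers reported above; the variable names are illustrative.

```python
total_records = 7_021_722
failed = 4          # 3 in C, 1 in C++
total_files = 40

successful = total_records - failed
success_rate = successful / total_records * 100
rows_per_file = total_records / total_files

print(successful)               # 7021718
print(f"{success_rate:.5f}%")   # 99.99994%
print(round(rows_per_file))     # 175543, matching "~175k rows each"
```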

	---

	## 🚀 Use Cases

	MLCPD can be directly used for:
	- Cross-language code representation learning
	- Program understanding and code similarity tasks
	- Syntax-aware pretraining for LLMs
	- Code summarization, clone detection, and bug prediction
	- Graph-based learning on universal ASTs
	- Benchmark creation for cross-language code reasoning

	---

	## 🔍 Features

	- Universal Schema: A unified structural representation harmonizing AST node types across languages.
	- Compact Format: Stored in Apache Parquet, allowing fast access and efficient querying.
	- Cross-Language Compatibility: Enables comparative code structure analysis across multiple programming ecosystems.
	- Reliable Conversion: 99.9999% successful schema conversions across ~7M code files (4 failures in total).
	- Statistical Richness: Includes per-language metrics such as mean line count, AST size, and error ratios.
	- Ready for ML Pipelines: Compatible with PyTorch, TensorFlow, Hugging Face Transformers, and graph-based models.
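
The per-language metrics mentioned above can be recomputed from the record fields. The following is a rough sketch over toy records, not the authors' pipeline: the function name `per_language_metrics` is hypothetical, and "error ratio" is assumed here to mean the fraction of records with at least one parse error.

```python
from collections import defaultdict


def per_language_metrics(records):
    """Aggregate mean line count, mean AST size, and error ratio by language.

    "Error ratio" is taken to mean the fraction of records with at least
    one parse error; that definition is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["language"]].append(rec)

    metrics = {}
    for language, recs in groups.items():
        n = len(recs)
        metrics[language] = {
            "mean_line_count": sum(r["line_count"] for r in recs) / n,
            "mean_ast_node_count": sum(r["ast_node_count"] for r in recs) / n,
            "error_ratio": sum(r["num_errors"] > 0 for r in recs) / n,
        }
    return metrics


# Toy records with invented values, using the dataset's field names.
toy_records = [
    {"language": "c", "line_count": 10, "ast_node_count": 50, "num_errors": 0},
    {"language": "c", "line_count": 30, "ast_node_count": 150, "num_errors": 1},
    {"language": "go", "line_count": 8, "ast_node_count": 40, "num_errors": 0},
]
print(per_language_metrics(toy_records))
```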

	---

	## 📥 How to Access the Dataset

	### Using the Hugging Face `datasets` Library

	This dataset is hosted on the Hugging Face Hub and can be easily accessed using the `datasets` library.

	#### Install the Required Library

	```bash
	pip install datasets
	```

	#### Import Library

	```python
	from datasets import load_dataset
	```

	#### Load the Entire Dataset

	```python
	dataset = load_dataset(
	    "jugalgajjar/MultiLang-Code-Parser-Dataset"
	)
	```

	#### Load a Specific Language File

	```python
	dataset = load_dataset(
	    "jugalgajjar/MultiLang-Code-Parser-Dataset",
	    data_files="python_parsed_1.parquet"
	)
	```

	#### Stream Data

	```python
	dataset = load_dataset(
	    "jugalgajjar/MultiLang-Code-Parser-Dataset",
	    data_files="python_parsed_1.parquet",
	    streaming=True
	)
	```
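
A streamed split is consumed lazily as an iterable of record dicts, so filtering can be done on the fly without materializing the file. The sketch below keeps only records with `num_errors == 0`; it runs on a small in-memory stand-in so that it works offline, and the helper name `parsed_without_errors` and the toy values are assumptions made for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator


def parsed_without_errors(records: Iterable[dict], limit: int = 5) -> Iterator[dict]:
    """Lazily yield up to `limit` records whose `num_errors` is 0."""
    clean = (rec for rec in records if rec["num_errors"] == 0)
    return islice(clean, limit)


# Stand-in for a streamed split: any iterable of dicts with the same fields.
toy_stream = [
    {"language": "python", "line_count": 10, "num_errors": 0},
    {"language": "python", "line_count": 42, "num_errors": 2},
    {"language": "python", "line_count": 7, "num_errors": 0},
]

for rec in parsed_without_errors(toy_stream):
    print(rec["line_count"])
```

With the real dataset, the same function could be called on `dataset["train"]` from the streaming example in place of `toy_stream`.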

	#### Access Data Content (After Downloading)

	```python
	try:
	    # Print the first five records of the "train" split.
	    for example in dataset["train"].take(5):
	        print(example)
	        print("-" * 25)
	except Exception as e:
	    print(f"An error occurred: {e}")
	```

	### Manual Download

	You can also manually download specific language files from the Hugging Face repository page:

	1. Visit https://huggingface.co/datasets/jugalgajjar/MultiLang-Code-Parser-Dataset
	2. Navigate to the Files tab
	3. Click on the language file you want (e.g., `python_parsed_1.parquet`)
	4. Use the Download button to save locally
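
The manual steps above can also be scripted by building a direct-download URL for a file. This is a sketch under two assumptions: that the Hub's `/resolve/<revision>/<path>` URL layout applies, and that the Parquet files sit at the repository root on the `main` revision. The helper name `download_url` is hypothetical.

```python
from urllib.parse import quote

REPO_ID = "jugalgajjar/MultiLang-Code-Parser-Dataset"


def download_url(filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for one file in the dataset repository.

    Assumes the Hub's `/resolve/<revision>/<path>` URL layout and that the
    file lives at the repository root on the given revision.
    """
    return (
        f"https://huggingface.co/datasets/{REPO_ID}"
        f"/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


print(download_url("python_parsed_1.parquet"))
```

The printed URL can then be fetched with any HTTP client or download tool.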

	---

	## 🧾 Citation

	If you use this dataset in your research or work, please cite the following paper:

	> Gajjar, J., & Subramaniakuppusamy, K. (2025).
	> MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema.
	> arXiv preprint [arXiv:2510.16357](https://arxiv.org/abs/2510.16357)

	```bibtex
	@article{gajjar2025mlcpd,
	title={MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema},
	author={Gajjar, Jugal and Subramaniakuppusamy, Kamalasankari},
	journal={arXiv preprint arXiv:2510.16357},
	year={2025}
	}
	```

	---

	## 📜 License

	This dataset is released under the MIT License.<br>
	You are free to use, modify, and redistribute it for research and educational purposes, with proper attribution.

	---

	## 🙏 Acknowledgements

	- [StarCoder Dataset](https://huggingface.co/datasets/bigcode/starcoderdata) for source code samples
	- [TreeSitter](https://tree-sitter.github.io/tree-sitter/) for parsing
	- [Hugging Face](https://huggingface.co/) for dataset hosting

	---

	## 📧 Contact

	For questions, collaborations, or feedback:

	- Primary Author: Jugal Gajjar
	- Email: [812jugalgajjar@gmail.com](mailto:812jugalgajjar@gmail.com)
	- LinkedIn: [linkedin.com/in/jugal-gajjar/](https://www.linkedin.com/in/jugal-gajjar/)

	---

	⭐ If you find this dataset useful, consider liking the dataset and the [GitHub repository](https://github.com/JugalGajjar/MultiLang-Code-Parser-Dataset) and sharing your work that uses it.
