| license: mit | |
| language: | |
| - en | |
| tags: | |
| - code | |
| # MultiLang Code Parser Dataset (MLCPD) | |
| [License: MIT](https://opensource.org/licenses/MIT) | |
| [GitHub repository](https://github.com/JugalGajjar/MultiLang-Code-Parser-Dataset) | |
| [arXiv:2510.16357](https://arxiv.org/abs/2510.16357) | |
| **MultiLang-Code-Parser-Dataset (MLCPD)** provides a large-scale, unified dataset of parsed source code across 10 major programming languages, represented under a universal schema that captures syntax, semantics, and structure in a consistent format. | |
| Each entry corresponds to one parsed source file and includes: | |
| - Language metadata | |
| - Code-level statistics (lines, errors, AST nodes) | |
| - Universal Schema JSON (normalized structural representation) | |
| MLCPD enables robust cross-language analysis, code understanding, and representation learning by providing a consistent, language-agnostic data structure suitable for both traditional ML and modern LLM-based workflows. | |
| --- | |
| ## Dataset Structure | |
| ``` | |
| MultiLang-Code-Parser-Dataset/ | |
| ├── c_parsed_1.parquet | |
| ├── c_parsed_2.parquet | |
| ├── c_parsed_3.parquet | |
| ├── c_parsed_4.parquet | |
| ├── c_sharp_parsed_1.parquet | |
| ├── ... | |
| └── typescript_parsed_4.parquet | |
| ``` | |
| Each file corresponds to one partition of a language (~175k rows each). | |
| Each record contains: | |
| | Field | Type | Description | | |
| |--------|------|-------------| | |
| | `language` | `str` | Programming language name | | |
| | `code` | `str` | Raw source code | | |
| | `avg_line_length` | `float` | Average line length | | |
| | `line_count` | `int` | Number of lines | | |
| | `lang_specific_parse` | `str` | Language-specific Tree-sitter parse output | | |
| | `ast_node_count` | `int` | Number of AST nodes | | |
| | `num_errors` | `int` | Parse errors | | |
| | `universal_schema` | `str` | JSON-formatted unified schema | | |
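Because `universal_schema` is stored as a JSON string, it has to be decoded before its structure can be inspected. The sketch below relies only on the field names from the table above. The helper name `decode_record` and the sample record are hypothetical, and the key inside the sample schema is a placeholder, not the dataset's real layout.

```python
import json


def decode_record(record):
    """Return a copy of `record` with `universal_schema` parsed from JSON.

    Hypothetical helper: only the field names come from the table above.
    """
    decoded = dict(record)
    decoded["universal_schema"] = json.loads(record["universal_schema"])
    return decoded


# Made-up record for illustration; the schema content is a placeholder.
sample = {
    "language": "python",
    "code": "print('hi')\n",
    "line_count": 1,
    "num_errors": 0,
    "universal_schema": json.dumps({"placeholder_key": []}),
}

decoded = decode_record(sample)
print(type(decoded["universal_schema"]).__name__)
print(sorted(decoded["universal_schema"].keys()))
```

Listing the top-level keys of a real record in this way is a quick first step for exploring the schema before writing any processing code against it.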
| --- | |
| ## Key Statistics | |
| | Metric | Value | | |
| |--------|--------| | |
| | Total Languages | 10 | | |
| | Total Files | 40 | | |
| | Total Records | 7,021,722 | | |
| | Successful Conversions | 7,021,718 (99.9999%) | | |
| | Failed Conversions | 4 (3 in C, 1 in C++) | | |
| | Disk Size | ~114 GB (Parquet format) | | |
| | Memory Size | ~600 GB (Parquet format) | | |
| The dataset is clean, lossless, and statistically balanced across languages. | |
| It offers both per-language and combined cross-language representations. | |
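The figures in the table are mutually consistent, which a few lines of arithmetic can confirm. This is only a sanity check on the published numbers. The files-per-language value assumes the 40 files are split evenly across the 10 languages, which matches the four C partitions shown in the listing but is not stated explicitly.

```python
# Figures copied from the Key Statistics table.
total_records = 7_021_722
failed_conversions = 4
total_files = 40
total_languages = 10

successful = total_records - failed_conversions
success_rate = successful / total_records
rows_per_file = total_records / total_files
# Assumption: files are divided evenly among languages.
files_per_language = total_files // total_languages

print(f"successful conversions: {successful:,}")
print(f"success rate: {success_rate:.4%}")
print(f"average rows per file: {rows_per_file:,.0f}")
print(f"files per language: {files_per_language}")
```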
| --- | |
| ## Use Cases | |
| MLCPD can be directly used for: | |
| - Cross-language code representation learning | |
| - Program understanding and code similarity tasks | |
| - Syntax-aware pretraining for LLMs | |
| - Code summarization, clone detection, and bug prediction | |
| - Graph-based learning on universal ASTs | |
| - Benchmark creation for cross-language code reasoning | |
| --- | |
| ## Features | |
| - **Universal Schema:** A unified structural representation harmonizing AST node types across languages. | |
| - **Compact Format:** Stored in Apache Parquet, allowing fast access and efficient querying. | |
| - **Cross-Language Compatibility:** Enables comparative code structure analysis across multiple programming ecosystems. | |
| - **Near-Complete Schema Conversion:** 99.9999% of the ~7M code files were converted to the universal schema successfully; only 4 conversions failed. | |
| - **Statistical Richness:** Includes per-language metrics such as mean line count, AST size, and error ratios. | |
| - **Ready for ML Pipelines:** Compatible with PyTorch, TensorFlow, Hugging Face Transformers, and graph-based models. | |
| --- | |
| ## 📥 How to Access the Dataset | |
| ### Using the Hugging Face `datasets` Library | |
| This dataset is hosted on the Hugging Face Hub and can be easily accessed using the `datasets` library. | |
| #### Install the Required Library | |
| ```bash | |
| pip install datasets | |
| ``` | |
| #### Import Library | |
| ```python | |
| from datasets import load_dataset | |
| ``` | |
| #### Load the Entire Dataset | |
| ```python | |
| dataset = load_dataset( | |
|     "jugalgajjar/MultiLang-Code-Parser-Dataset" | |
| ) | |
| ``` | |
| #### Load a Specific Language File | |
| ```python | |
| dataset = load_dataset( | |
|     "jugalgajjar/MultiLang-Code-Parser-Dataset", | |
|     data_files="python_parsed_1.parquet" | |
| ) | |
| ``` | |
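To load every partition of one language, the file names can be generated and not typed out by hand. The helper below is a hypothetical sketch: `partition_files` is not part of the dataset or of the `datasets` library, and it assumes each language has four files following the `<language>_parsed_<n>.parquet` pattern seen in the directory listing.

```python
def partition_files(language, num_partitions=4):
    """Build Parquet file names such as `python_parsed_1.parquet`.

    Assumption: every language is split into `num_partitions` files that
    follow the `<language>_parsed_<n>.parquet` pattern from the listing.
    """
    return [
        f"{language}_parsed_{i}.parquet"
        for i in range(1, num_partitions + 1)
    ]


python_files = partition_files("python")
print(python_files)

# The list can then be passed as `data_files`, for example:
# dataset = load_dataset(
#     "jugalgajjar/MultiLang-Code-Parser-Dataset",
#     data_files=python_files,
# )
```

Passing a list to `data_files` loads all named files into a single split, so the four partitions of a language behave as one dataset.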
| #### Stream Data | |
| ```python | |
| dataset = load_dataset( | |
|     "jugalgajjar/MultiLang-Code-Parser-Dataset", | |
|     data_files="python_parsed_1.parquet", | |
|     streaming=True | |
| ) | |
| ``` | |
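When streaming, records arrive one at a time as dictionaries, so they can be filtered lazily without loading a whole partition. The generator below is a minimal sketch that uses the `num_errors` and `line_count` fields from the schema table. The name `clean_records`, the `max_lines` threshold, and the sample records are illustrative assumptions.

```python
def clean_records(records, max_lines=None):
    """Yield records that parsed without errors.

    If `max_lines` is given, records longer than that are skipped too.
    """
    for record in records:
        if record["num_errors"] != 0:
            continue
        if max_lines is not None and record["line_count"] > max_lines:
            continue
        yield record


# Made-up records standing in for a streamed split.
fake_stream = [
    {"language": "python", "line_count": 12, "num_errors": 0},
    {"language": "python", "line_count": 900, "num_errors": 0},
    {"language": "python", "line_count": 30, "num_errors": 2},
]

kept = list(clean_records(fake_stream, max_lines=100))
print(len(kept))
```

With a streamed dataset, the same generator could be applied as `clean_records(dataset["train"], max_lines=100)`, which keeps iteration lazy.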
| #### Access Data Content (After Downloading) | |
| ```python | |
| try: | |
|     # Print the first five records of the train split. | |
|     for example in dataset["train"].take(5): | |
|         print(example) | |
|         print("-" * 25) | |
| except Exception as e: | |
|     print(f"An error occurred: {e}") | |
| ``` | |
| ### Manual Download | |
| You can also manually download specific language files from the Hugging Face repository page: | |
| 1. Visit https://huggingface.co/datasets/jugalgajjar/MultiLang-Code-Parser-Dataset | |
| 2. Navigate to the Files tab | |
| 3. Click on the language file you want (e.g., `python_parsed_1.parquet`) | |
| 4. Use the Download button to save locally | |
| --- | |
| ## 🧾 Citation | |
| If you use this dataset in your research or work, please cite the following paper: | |
| > **Gajjar, J., & Subramaniakuppusamy, K. (2025).** | |
| > *MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema.* | |
| > *arXiv preprint* [arXiv:2510.16357](https://arxiv.org/abs/2510.16357) | |
| ```bibtex | |
| @article{gajjar2025mlcpd, | |
| title={MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema}, | |
| author={Gajjar, Jugal and Subramaniakuppusamy, Kamalasankari}, | |
| journal={arXiv preprint arXiv:2510.16357}, | |
| year={2025} | |
| } | |
| ``` | |
| --- | |
| ## License | |
| This dataset is released under the MIT License.<br> | |
| You are free to use, modify, and redistribute it for research and educational purposes, with proper attribution. | |
| --- | |
| ## Acknowledgements | |
| - [StarCoder Dataset](https://huggingface.co/datasets/bigcode/starcoderdata) for source code samples | |
| - [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) for parsing | |
| - [Hugging Face](https://huggingface.co/) for dataset hosting | |
| --- | |
| ## 📧 Contact | |
| For questions, collaborations, or feedback: | |
| - **Primary Author**: Jugal Gajjar | |
| - **Email**: [812jugalgajjar@gmail.com](mailto:812jugalgajjar@gmail.com) | |
| - **LinkedIn**: [linkedin.com/in/jugal-gajjar/](https://www.linkedin.com/in/jugal-gajjar/) | |
| --- | |
| ⭐ If you find this dataset useful, consider liking the dataset and the [GitHub repository](https://github.com/JugalGajjar/MultiLang-Code-Parser-Dataset) and sharing your work that uses it. |