
MultiLang Code Parser Dataset (MLCPD)


MultiLang-Code-Parser-Dataset (MLCPD) provides a large-scale, unified dataset of parsed source code across 10 major programming languages, represented under a universal schema that captures syntax, semantics, and structure in a consistent format.

Each entry corresponds to one parsed source file and includes:

  • Language metadata
  • Code-level statistics (lines, errors, AST nodes)
  • Universal Schema JSON (normalized structural representation)

MLCPD enables robust cross-language analysis, code understanding, and representation learning by providing a consistent, language-agnostic data structure suitable for both traditional ML and modern LLM-based workflows.


📂 Dataset Structure

MultiLang-Code-Parser-Dataset/
├── c_parsed_1.parquet
├── c_parsed_2.parquet
├── c_parsed_3.parquet
├── c_parsed_4.parquet
├── c_sharp_parsed_1.parquet
├── ...
└── typescript_parsed_4.parquet

Each file corresponds to one partition of a language (~175k rows each).

Each record contains:

| Field | Type | Description |
|---|---|---|
| language | str | Programming language name |
| code | str | Raw source code |
| avg_line_length | float | Average line length |
| line_count | int | Number of lines |
| lang_specific_parse | str | TreeSitter parse output |
| ast_node_count | int | Number of AST nodes |
| num_errors | int | Parse errors |
| universal_schema | str | JSON-formatted unified schema |
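
Because `universal_schema` is stored as a JSON string, it has to be decoded before it can be traversed. The following is a minimal sketch using a hand-made stand-in record: the field names match the field list above, but every value, including the contents of the schema object, is a placeholder and not real dataset output. `decode_schema` is a hypothetical helper name.

```python
import json

# Hypothetical record with the documented fields; all values are placeholders.
record = {
    "language": "python",
    "code": "print('hi')\n",
    "avg_line_length": 11.0,
    "line_count": 1,
    "lang_specific_parse": "(placeholder parse output)",
    "ast_node_count": 7,
    "num_errors": 0,
    "universal_schema": '{"placeholder_key": "placeholder_value"}',
}


def decode_schema(rec):
    """Return the universal schema of a record as a Python object."""
    return json.loads(rec["universal_schema"])


schema = decode_schema(record)
# Inspect the top-level keys before relying on any particular layout.
print(type(schema).__name__, list(schema))
```

Inspecting the top-level keys of a real record first is a reasonable way to learn the schema layout, since this sketch makes no assumption about it.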

📊 Key Statistics

| Metric | Value |
|---|---|
| Total Languages | 10 |
| Total Files | 40 |
| Total Records | 7,021,722 |
| Successful Conversions | 7,021,718 (99.9999%) |
| Failed Conversions | 4 (3 in C, 1 in C++) |
| Disk Size | ~114 GB (Parquet format) |
| Memory Size | ~600 GB (loaded in memory) |
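
As a quick sanity check, the figures above are consistent with one another. The short calculation below uses only the reported totals to recover the success rate and the approximate number of rows per file.

```python
# Figures taken from the statistics above.
total_records = 7_021_722
failed = 4
total_files = 40

successful = total_records - failed
success_rate = 100 * successful / total_records
rows_per_file = total_records / total_files

print(successful)              # 7021718
print(f"{success_rate:.4f}%")  # 99.9999%
print(round(rows_per_file))    # 175543, i.e. roughly 175k rows per file
```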

The dataset is clean, lossless, and statistically balanced across languages.
It offers both per-language and combined cross-language representations.


🚀 Use Cases

MLCPD can be directly used for:

  • Cross-language code representation learning
  • Program understanding and code similarity tasks
  • Syntax-aware pretraining for LLMs
  • Code summarization, clone detection, and bug prediction
  • Graph-based learning on universal ASTs
  • Benchmark creation for cross-language code reasoning

๐Ÿ” Features

  • Universal Schema: A unified structural representation harmonizing AST node types across languages.
  • Compact Format: Stored in Apache Parquet, allowing fast access and efficient querying.
  • Cross-Language Compatibility: Enables comparative code structure analysis across multiple programming ecosystems.
  • Near Error-Free Parsing: 99.9999% successful schema conversions across ~7M code files (4 failures).
  • Statistical Richness: Includes per-language metrics such as mean line count, AST size, and error ratios.
  • Ready for ML Pipelines: Compatible with PyTorch, TensorFlow, Hugging Face Transformers, and graph-based models.

📥 How to Access the Dataset

Using the Hugging Face datasets Library

This dataset is hosted on the Hugging Face Hub and can be easily accessed using the datasets library.

Install the Required Library

pip install datasets

Import Library

from datasets import load_dataset

Load the Entire Dataset

dataset = load_dataset(
    "jugalgajjar/MultiLang-Code-Parser-Dataset"
)

Load a Specific Language File

dataset = load_dataset(
    "jugalgajjar/MultiLang-Code-Parser-Dataset",
    data_files="python_parsed_1.parquet"
)
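
Each language is split across four partitions, so loading a whole language means naming all four files. The sketch below builds those names from the repository's naming pattern (`<language>_parsed_<n>.parquet`); `partition_files` is a hypothetical helper, not part of the dataset or of the datasets library.

```python
# Language prefixes as they appear in the repository's file names.
LANGUAGES = [
    "c", "c_sharp", "cpp", "go", "java",
    "javascript", "python", "ruby", "scala", "typescript",
]
PARTITIONS_PER_LANGUAGE = 4


def partition_files(language):
    """Return the Parquet file names for every partition of one language."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language prefix: {language!r}")
    return [
        f"{language}_parsed_{i}.parquet"
        for i in range(1, PARTITIONS_PER_LANGUAGE + 1)
    ]


print(partition_files("python"))

# The resulting list can be passed as data_files (not executed here):
# dataset = load_dataset(
#     "jugalgajjar/MultiLang-Code-Parser-Dataset",
#     data_files=partition_files("python"),
# )
```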

Stream Data

dataset = load_dataset(
    "jugalgajjar/MultiLang-Code-Parser-Dataset",
    data_files="python_parsed_1.parquet",
    streaming=True
)
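
Streaming pairs naturally with lazy filtering, for example keeping only records that parsed without errors. The sketch below shows one way to do this with a plain generator over the documented `num_errors` and `line_count` fields; `clean_examples` and the stand-in records are illustrative assumptions, and the sample omits the other fields for brevity.

```python
from itertools import islice


def clean_examples(examples, max_lines=None):
    """Yield records with no parse errors, optionally capped by line count."""
    for ex in examples:
        if ex["num_errors"] != 0:
            continue
        if max_lines is not None and ex["line_count"] > max_lines:
            continue
        yield ex


# Stand-in records; a streamed split can be passed in the same way,
# because the function only iterates over its input.
sample = [
    {"language": "python", "line_count": 12, "num_errors": 0},
    {"language": "python", "line_count": 300, "num_errors": 0},
    {"language": "python", "line_count": 8, "num_errors": 2},
]

# Take at most five matching records without materializing the rest.
kept = list(islice(clean_examples(sample, max_lines=100), 5))
print(kept)
```

With the streamed dataset, `clean_examples(dataset["train"])` would be consumed the same way, so records are fetched only as they are needed.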

Access Data Content (After Downloading)

from itertools import islice

# islice works for both downloaded and streamed datasets
for example in islice(dataset["train"], 5):
    print(example)
    print("-" * 25)

Manual Download

You can also manually download specific language files from the Hugging Face repository page:

  1. Visit https://huggingface.co/datasets/jugalgajjar/MultiLang-Code-Parser-Dataset
  2. Navigate to the Files tab
  3. Click on the language file you want (e.g., python_parsed_1.parquet)
  4. Use the Download button to save locally

🧾 Citation

If you use this dataset in your research or work, please cite the following paper:

Gajjar, J., & Subramaniakuppusamy, K. (2025).
MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema.
arXiv preprint arXiv:2510.16357

@article{gajjar2025mlcpd,
  title={MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema},
  author={Gajjar, Jugal and Subramaniakuppusamy, Kamalasankari},
  journal={arXiv preprint arXiv:2510.16357},
  year={2025}
}

📜 License

This dataset is released under the MIT License.
You are free to use, modify, and redistribute it for research and educational purposes, with proper attribution.


๐Ÿ™ Acknowledgements


📧 Contact

For questions, collaborations, or feedback:


โญ If you find this dataset useful, consider liking the dataset and the GitHub repository and sharing your work that uses it.
