| license: cc-by-sa-4.0 | |
| language: | |
| - en | |
| tags: | |
| - code | |
| pretty_name: PROBE | |
| size_categories: | |
| - 1K<n<10K | |
| # PROBE Dataset | |
| The dataset is provided as a single JSONL file: `dataset.jsonl` | |
| --- | |
| ## Dataset structure | |
| Each line in the file corresponds to one programming problem and contains a JSON object with the following fields: | |
| **problem_id:** A unique identifier for the problem. | |
| **difficulty_level:** Indicates the difficulty of the task. Values range from 0 (easiest) to 3 (hardest). Difficulty was estimated from the cyclomatic complexity, logical lines of code (LLOC), and Halstead effort of the Python reference solutions. | |
| **prompt:** The natural language description of the programming task. | |
| **unit_tests:** A list of unit test specifications associated with the problem. Each unit test is an object with the following fields: | |
| - number: unit test identifier. | |
| - input: the input provided to the program. | |
| - output: the expected output for the given input. | |
| **references:** A list of reference solutions for the problem. Each reference solution is an object with the following fields: | |
| - language: the programming language of the solution (e.g., Python, C++, Java, C, Rust). | |
| - id: an identifier for the reference solution. | |
| - code: the source code implementing a correct solution for the problem. | |
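As a minimal sketch of how the structure above can be consumed, the following reads the JSONL file line by line and summarizes one record. The helper names `load_problems` and `summarize` are illustrative, not part of the dataset; the field names are the ones listed above.

```python
import json
from pathlib import Path


def load_problems(path):
    """Yield one problem dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def summarize(problem):
    """Return a small summary built from the documented fields."""
    return {
        "problem_id": problem["problem_id"],
        "difficulty_level": problem["difficulty_level"],
        "n_unit_tests": len(problem["unit_tests"]),
        "languages": sorted({ref["language"] for ref in problem["references"]}),
    }
```

Typical usage would be `for problem in load_problems("dataset.jsonl"): print(summarize(problem))`.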
| --- | |
| ## Dataset Statistics | |
| - **Total problems:** 1,651 | |
| - **Reference solutions per problem:** | |
| - Python, C++: 3–250 | |
| - Java, C: 0–250 | |
| - Rust: 0–180 | |
| - **Unit tests per problem:** 6–131 | |
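Per-language ranges like the ones above could be recomputed from loaded records along these lines. This is a sketch: `count_ranges` is a hypothetical helper, and it assumes the `language` field uses labels such as `"Python"` and `"C++"`, which should be checked against the actual data.

```python
from collections import Counter


def references_per_language(problem):
    """Count the reference solutions of one problem by language label."""
    return Counter(ref["language"] for ref in problem["references"])


def count_ranges(problems, languages):
    """Return {language: (min, max)} reference counts across all problems."""
    ranges = {}
    for problem in problems:
        counts = references_per_language(problem)
        for language in languages:
            n = counts.get(language, 0)  # a language may be absent for a problem
            low, high = ranges.get(language, (n, n))
            ranges[language] = (min(low, n), max(high, n))
    return ranges
```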
| --- | |
| ## Data Sources | |
| This dataset is based on the [Project CodeNet](https://github.com/IBM/Project_CodeNet) dataset, which contains problems from two online judge platforms: **Aizu** and **AtCoder**. | |
| - **Prompts:** | |
| Extracted from the HTML files containing problem descriptions and organized into a structured format: | |
| ``` | |
| Problem Description: | |
| Input Format: | |
| Output Format: | |
| Constraints: | |
| ``` | |
| - **Reference solutions:** | |
| Filtered to keep only correct solutions. For each problem, a random subset was selected, with a maximum of 250 reference solutions per problem. | |
| - **Unit tests:** | |
| Most unit tests were obtained directly from the online judge platforms. Additional tests were generated using the available reference solutions to ensure coverage. | |
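The four-part prompt layout described under **Prompts** can be split back into its sections with a small parser. This sketch assumes each header starts its own line exactly as shown in the template; `split_prompt` is an illustrative name.

```python
SECTION_HEADERS = (
    "Problem Description:",
    "Input Format:",
    "Output Format:",
    "Constraints:",
)


def split_prompt(prompt):
    """Split a structured prompt into {section name: section text}."""
    sections = {}
    current = None
    for line in prompt.splitlines():
        stripped = line.strip()
        header = next((h for h in SECTION_HEADERS if stripped.startswith(h)), None)
        if header is not None:
            current = header.rstrip(":")
            rest = stripped[len(header):].strip()  # text on the header line itself
            sections[current] = [rest] if rest else []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}
```

Text that appears before the first header is ignored, and a section missing from the prompt is simply absent from the result.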
| --- | |
| # Generated code | |
| The zip file `generated_code.zip` contains LLM-generated solutions for these problems. | |
| The solutions were generated by six different models: | |
| - GPT-4.1-mini | |
| - Gemini-2.0-Flash | |
| - Deepseek-Coder-v2 (16b) | |
| - Qwen2.5-Coder (14b) | |
| - Qwen2.5-Coder (7/14b) | |
| Each model generated five independent solutions for each problem. | |
| The process was repeated for five programming languages (Python, C++, Java, C, Rust). | |
| When a solution was incorrect, the model was given feedback (up to two iterations) and asked to provide a new solution. The files are named as follows, where `{termination}` is the file extension (py, cpp, java, c, rs): | |
| - solution.{termination}: first solution generated by the model (before any feedback). | |
| - solution_0.{termination}, solution_1.{termination}: first and second solutions generated after feedback. | |
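The naming scheme above can be expressed as a small helper. This is a sketch: the directory layout inside `generated_code.zip` is not described here, so the code handles file names only, and the language-to-extension pairing is inferred from the order of the two lists above.

```python
# Assumed pairing of language names and file extensions.
EXTENSIONS = {"Python": "py", "C++": "cpp", "Java": "java", "C": "c", "Rust": "rs"}


def attempt_filenames(language):
    """Return the file names for the initial attempt and both feedback rounds."""
    ext = EXTENSIONS[language]
    return [f"solution.{ext}", f"solution_0.{ext}", f"solution_1.{ext}"]


def final_attempt(present, language):
    """Return the latest attempt found among `present` file names, or None."""
    for name in reversed(attempt_filenames(language)):
        if name in present:
            return name
    return None
```

For example, `final_attempt({"solution.rs", "solution_0.rs"}, "Rust")` picks `solution_0.rs`, the solution produced after the first round of feedback.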
| --- | |
| # Intended Use | |
| This dataset is intended for research and evaluation of Large Language Models in the task of text-to-code generation. | |
| The presence of both large-scale unit tests and multiple reference implementations enables comprehensive functional correctness evaluation as well as comparison against human-written solutions. Reference solutions are provided in five programming languages, allowing cross-language analysis and benchmarking of multilingual code generation capabilities. | |
| The dataset supports: | |
| - Functional correctness evaluation using extensive unit testing. | |
| - Similarity analysis to human-written implementations, supporting metrics such as syntactic, semantic, or structural similarity. | |
| - Code quality assessment, both for comparing different models and for evaluating generated code relative to high-quality human reference implementations. |
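A functional correctness check for a Python solution might look like the sketch below. It assumes that solutions read from standard input and write to standard output, that `input` and `output` are strings, and that outputs match when they are equal after trimming trailing whitespace; none of this is specified by the dataset. The helper names are illustrative, and untrusted generated code should be run in a sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_python_solution(code, stdin_text, timeout=10):
    """Run `code` as a script, feed it `stdin_text`, and return its stdout."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "solution.py"
        script.write_text(code, encoding="utf-8")
        completed = subprocess.run(
            [sys.executable, str(script)],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return completed.stdout


def normalize(text):
    """Drop trailing whitespace on each line and blank lines at both ends."""
    return "\n".join(line.rstrip() for line in text.strip().splitlines())


def passed_tests(code, unit_tests):
    """Return {test number: True/False} for one solution."""
    results = {}
    for test in unit_tests:
        try:
            actual = run_python_solution(code, test["input"])
        except subprocess.TimeoutExpired:
            results[test["number"]] = False  # a timeout counts as a failure
            continue
        results[test["number"]] = normalize(actual) == normalize(test["output"])
    return results
```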