---
license: apache-2.0
authors:
- Vinayak Patel
task_categories:
- question-answering
- text-generation
language:
- en
size_categories:
- 10M<n<100M
tags:
- llm-benchmark
- frontier-models
- reasoning
- recursive-reasoning
- long-context
- benchmark
- evaluation
- prompt-based-evaluation
- encrypted-reasoning
- recursive-execution
- dependency-resolution
- noisy-reasoning
pretty_name: Vinayak Multistep Recursive Reasoning Benchmark
configs:
- config_name: questions
  data_files:
  - split: test
    path: questionSheet.txt
- config_name: answers
  data_files:
  - split: test
    path: AnswerSheet.csv
---

# Vinayak Multistep Recursive Reasoning Benchmark (VMRRB)

## Overview

The **Vinayak Multistep Recursive Reasoning Benchmark (VMRRB)** is a large-scale, prompt-based benchmark designed to evaluate the advanced reasoning, recursive dependency resolution, encrypted task traversal, and robustness of frontier AI systems.

The benchmark evaluates a model's ability to:

- Perform recursive multistep reasoning
- Resolve interdependent question chains
- Execute encrypted dependency traversal
- Perform chained decryption using computed answers
- Parse noisy mathematical expressions
- Maintain consistency across extremely long contexts
- Handle recursive execution workflows
- Preserve memory across large contextual inputs

The benchmark contains approximately **10 million interdependent reasoning tasks**.

Unlike conventional QA datasets, VMRRB operates as a continuous evaluation prompt in which:

- Questions are encrypted
- Dependencies must be recursively resolved
- Intermediate answers become future decryption keys
- Noise must be filtered semantically
- Questions must be solved in dependency order

This benchmark is intended exclusively for **evaluation and benchmarking** of frontier AI systems.

There is **no training split** provided.

---

## Files

- `questionSheet.txt`
- `AnswerSheet.csv`

All files belong to the `test` split.

---

## Dataset Structure

### questions config

The `questions` configuration contains `questionSheet.txt`, which is a monolithic long-context evaluation prompt.

The file contains:

1. Benchmark execution instructions
2. Recursive dependency rules
3. Encrypted question traversal logic
4. Noise-handling constraints
5. Mathematical parsing rules
6. Decryption code
7. The encrypted benchmark question corpus

The benchmark operates as a recursive execution chain.

Questions may contain references such as:

```text
[ Answer N ]
```

which indicates that:

- Question `N` must first be solved
- Its computed answer must be substituted back into the current question
- Dependencies must be resolved recursively

Additionally:

- Questions are encrypted
- Computed answers are used as decryption keys
- Decrypted outputs may still contain semantic noise
- Models must identify valid mathematical structure while ignoring irrelevant text

The benchmark is designed to be processed as a single continuous context window whenever possible.

Arbitrary chunking may invalidate dependency chains and decryption traversal logic.
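
The `[ Answer N ]` substitution rule can be illustrated with a small memoized resolver. This is only a sketch under stated assumptions: the questions are treated as already decrypted, noise-free arithmetic strings, and `QUESTIONS`, `evaluate`, and `solve` are hypothetical names that do not come from the benchmark. The real question format and decryption logic are defined inside `questionSheet.txt`.

```python
import ast
import operator
import re

# Hypothetical, already-decrypted questions (illustrative only).
QUESTIONS = {
    1: "2 + 3",
    2: "[ Answer 1 ] * 4",
    3: "[ Answer 2 ] - [ Answer 1 ]",
}

REFERENCE = re.compile(r"\[\s*Answer\s+(\d+)\s*\]")
OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression):
    """Evaluate +, -, *, / arithmetic without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
            return OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)


def solve(number, questions, cache=None, in_progress=None):
    """Resolve every `[ Answer N ]` reference first, then compute the answer."""
    cache = {} if cache is None else cache
    in_progress = set() if in_progress is None else in_progress
    if number in cache:
        return cache[number]
    if number in in_progress:
        raise ValueError(f"circular dependency at question {number}")
    in_progress.add(number)
    # Substitute each reference with the recursively computed answer.
    text = REFERENCE.sub(
        lambda m: repr(solve(int(m.group(1)), questions, cache, in_progress)),
        questions[number],
    )
    in_progress.discard(number)
    cache[number] = evaluate(text)
    return cache[number]


print(solve(3, QUESTIONS))  # 15.0
```

The cache ensures each question is computed once even when several later questions reference it. Very deep chains would exceed Python's default recursion limit, so an iterative variant may be needed at benchmark scale.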

---

### answers config

The `answers` configuration contains `AnswerSheet.csv`, which stores the benchmark answer key.

Schema:

| Column | Description |
|---|---|
| `Question Number` | Sequential benchmark question identifier |
| `Answer` | Final computed numeric answer |

Example:

```csv
Question Number,Answer
1,807.184
2,6.148
3,400.891
```

Question numbering corresponds to the ordering of benchmark questions within `questionSheet.txt`.
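
For reference, the schema above can be read into a lookup table with Python's standard `csv` module. This is a minimal sketch: `load_answer_key` is a hypothetical helper name, and it assumes the header matches the example exactly and that every `Answer` value parses as a float.

```python
import csv
import io


def load_answer_key(handle):
    """Map each `Question Number` to its `Answer` as a float."""
    reader = csv.DictReader(handle)
    return {int(row["Question Number"]): float(row["Answer"]) for row in reader}


# Inline sample copied from the example rows; for the real file use
# open("AnswerSheet.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO("Question Number,Answer\n1,807.184\n2,6.148\n3,400.891\n")
answer_key = load_answer_key(sample)
print(answer_key[2])  # 6.148
```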

---

## Benchmark Workflow

The intended benchmark workflow is:

1. Start from the initial question specified in the "Start Guidance" section
2. Decrypt the question using the provided password or dependency answer
3. Remove semantic noise from the decrypted output
4. Resolve referenced dependencies recursively
5. Compute the final answer
6. Use the computed answer as the decryption key for the next question
7. Continue recursively until traversal is complete

The benchmark evaluates both reasoning correctness and recursive execution consistency.
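
The control flow of these steps can be sketched as a loop in which each computed answer becomes the next key. Everything concrete here is an assumption made for illustration: `toy_decrypt` is a stand-in XOR cipher (not the benchmark's decryption code, which ships inside `questionSheet.txt`), the noise filter is a simple regular expression, questions are visited in numeric order, references are filled from earlier answers only, and the key is the answer formatted to three decimals.

```python
import ast
import operator
import re

REFERENCE = re.compile(r"\[\s*Answer\s+(\d+)\s*\]")
NUMBER = r"-?\d+(?:\.\d+)?"
EXPRESSION = re.compile(rf"{NUMBER}(?:\s*[-+*/]\s*{NUMBER})+")
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def toy_decrypt(text, key):
    """Stand-in cipher (NOT the benchmark's): XOR each character with the key."""
    return "".join(chr(ord(c) ^ ord(key[i % len(key)])) for i, c in enumerate(text))


def strip_noise(text):
    """Toy noise filter: keep the first run of numbers joined by operators."""
    match = EXPRESSION.search(text)
    if match is None:
        raise ValueError("no arithmetic expression found")
    return match.group(0)


def compute(expression):
    """Evaluate +, -, *, / arithmetic without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant):
            return float(node.value)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)


def run_chain(encrypted, start_key):
    answers, key = {}, start_key
    for number in sorted(encrypted):  # assumption: numeric traversal order
        plaintext = toy_decrypt(encrypted[number], key)          # step 2
        filled = REFERENCE.sub(                                  # step 4
            lambda m: str(answers[int(m.group(1))]), plaintext)
        answers[number] = compute(strip_noise(filled))           # steps 3 and 5
        key = f"{answers[number]:.3f}"                           # step 6 (assumed key format)
    return answers


# Build a two-question toy chain; XOR is symmetric, so toy_decrypt also encrypts.
plain = {
    1: "the sky is blue 3 + 4 and so on",
    2: "ignore this [ Answer 1 ] * 2 filler",
}
encrypted = {1: toy_decrypt(plain[1], "seed"), 2: toy_decrypt(plain[2], "7.000")}
print(run_chain(encrypted, "seed"))  # {1: 7.0, 2: 14.0}
```

A wrong answer to question 1 would yield a wrong key, so question 2 would decrypt to unreadable text. This is why the chain tests execution consistency as well as arithmetic.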

---

## Reproducibility

The benchmark includes the complete decryption logic and execution rules required for deterministic evaluation.

---

## Important Notes

- This is not a conventional row-wise supervised dataset.
- The benchmark is designed as a prompt-based recursive reasoning corpus.
- Dependency chains must be resolved recursively before computing answers.
- Intermediate answers may function as future decryption keys.
- Decrypted questions may contain semantic noise that must be ignored.
- Full-context loading is strongly recommended.
- Arbitrary chunking may invalidate recursive traversal logic and dependency resolution.
- Questions are not intended to be solved independently.

---

## Resource Requirements

`questionSheet.txt` is intentionally large and designed for long-context evaluation.

Substantial memory, compute, and context-window capacity may be required for full benchmark execution.

---

## Warning

VMRRB is intentionally designed to stress recursive execution, long-context reasoning, encrypted traversal, and dependency resolution capabilities. Review the resource requirements above before attempting a full evaluation.

---

## Example Usage

```python
# `model` is a placeholder for your own inference client; it is not defined by this dataset.
with open("questionSheet.txt", "r", encoding="utf-8") as f:
    benchmark_prompt = f.read()

# Feed the complete benchmark prompt into the evaluation model
response = model.generate(benchmark_prompt)
```

Answers can be validated against `AnswerSheet.csv` using the corresponding `Question Number` field.
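
One possible way to run that validation is sketched below. It assumes the model's response has already been parsed into a `{question number: numeric answer}` mapping, since this card does not specify a response format. The `score` helper and the `1e-3` absolute tolerance are illustrative choices and not an official matching rule.

```python
import math


def score(predictions, answer_key, abs_tol=1e-3):
    """Return (accuracy, question numbers that are wrong or missing)."""
    wrong = [
        number
        for number, expected in answer_key.items()
        if number not in predictions
        or not math.isclose(predictions[number], expected, abs_tol=abs_tol)
    ]
    accuracy = 1 - len(wrong) / len(answer_key) if answer_key else 0.0
    return accuracy, wrong


# Answer key values copied from the AnswerSheet.csv example rows.
answer_key = {1: 807.184, 2: 6.148, 3: 400.891}
predictions = {1: 807.184, 2: 6.15}  # question 3 left unanswered
accuracy, wrong = score(predictions, answer_key)
print(wrong)  # [2, 3]
```

Unanswered questions count as wrong here, which matches a chained benchmark where a missing answer also blocks every question that depends on it.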

---

## Intended Usage

- Long-context evaluation
- Recursive dependency reasoning evaluation
- Encrypted reasoning evaluation
- Dependency traversal benchmarking
- Memory and dependency tracking evaluation
- Frontier AI evaluation
- Evaluation of retrieval-aware reasoning systems
- Recursive execution evaluation
- Robustness testing under chained reasoning tasks

---

## Limitations

This dataset is intended solely for benchmarking and evaluation.

It is not intended as:

- A supervised training corpus
- A standard row-wise machine learning dataset
- A retrieval benchmark
- A conventional mathematical QA dataset

---

## Citation

If you use VMRRB in research or evaluation pipelines, please cite the dataset repository.

---

## Author

Vinayak Patel

---

## License

Apache License 2.0