NanoCodeRAG / NanoCodeRAGProgrammingSolutions

Overview

CodeRAG-Bench uses programming-solution documents, including HumanEval- and MBPP-style canonical problems, as direct retrieval support for code generation. This Nano split keeps the prompt-to-solution view: a short natural-language Python programming request must retrieve the compact function that implements it. The sampled positives are short snippets for list manipulation, tuple sorting, monotonic-array checks, divisor sums, and similar operations, so the retriever must map intent to executable code even when the prompt and code share little surface wording.

Details

What the Original Data Measures

The paper "CodeRAG-Bench: Can Retrieval Augment Code Generation?" uses programming solutions as one of its five retrieval sources. According to the paper, these documents are built from basic programming problems with canonical solutions, such as HumanEval and MBPP, by concatenating the natural-language problem with the program solution. In CodeRAG-Bench, such canonical documents serve both as direct support for generation and as a retrieval-evaluation target.

This Nano split isolates the programming-solution source: each query describes a small Python task, and the positive document is the code solution. Unlike documentation or tutorial retrieval, the positive may be only a few lines of code and may not repeat the query's descriptive words.

Observed Data Profile

The Nano split has 200 queries, 984 documents, and 200 positive qrel rows. Every query has one positive. Queries average 78.28 characters and are mostly prompts such as "Write a python function to check whether the given array is monotonic" or "find the sum of common divisors". Documents average only 189.05 characters and are compact Python functions.
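The profile figures above are simple aggregates over the queries, documents, and qrels. The sketch below shows one way such a profile could be computed. The ids, the toy documents, and the `profile` helper are illustrative assumptions, not the dataset's actual schema or the script used to produce the reported numbers.

```python
from statistics import mean

def profile(queries, documents, qrels):
    """Summarize a retrieval split: counts, mean character lengths, positives per query."""
    positives = {}
    for query_id, doc_id in qrels:
        positives.setdefault(query_id, set()).add(doc_id)
    per_query = [len(positives.get(query_id, ())) for query_id in queries]
    return {
        "queries": len(queries),
        "documents": len(documents),
        "positive_qrels": sum(per_query),
        "query_mean_chars": mean(len(text) for text in queries.values()),
        "document_mean_chars": mean(len(text) for text in documents.values()),
        "multi_positive_queries": sum(count > 1 for count in per_query),
    }

# Toy split with hypothetical ids; the real split has 200 queries and 984 documents.
queries = {
    "q1": "# Write a python function to check whether the given array is monotonic or not.",
    "q2": "# Write a function to add the given list to the given tuples.",
}
documents = {
    "d1": "def is_Monotonic(A): ...",
    "d2": "def add_lists(test_list, test_tup): ...",
    "d3": "def unrelated(x): return x",
}
qrels = [("q1", "d1"), ("q2", "d2")]

stats = profile(queries, documents, qrels)
print(stats)
```

On the full split, the same aggregation should yield one positive per query and no multi-positive queries, matching the counts reported above.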

The sampled positives include one-line or short-loop implementations for list manipulation, tuple sorting, monotonic-array checks, divisor sums, and string editing. Some HumanEval-style queries in the observed sample are truncated to an import line such as `from typing import List`. This makes retrieval much harder, because the visible query contains almost no semantic problem statement.

BM25 Difficulty

Using the dataset-provided BM25 candidate column, BM25 reaches nDCG@10 = 0.0138 and hit@10 = 0.0250. It ranks only one positive first and places only five of the 200 positives in the top 10. This is the hardest NanoCodeRAG split for BM25 by a wide margin.
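Because every query has exactly one positive, both metrics have a simple form: hit@10 is the fraction of queries whose positive appears in the top 10 (0.0250 × 200 = 5 queries), and nDCG@10 for a single binary positive reduces to 1 / log2(rank + 1). The following is a minimal sketch of that standard definition. Whether it matches the leaderboard's evaluation code in every detail is an assumption, and the toy ids are made up.

```python
import math

def hit_at_k(ranked_ids, positive_id, k=10):
    """1.0 if the single positive appears in the top k, else 0.0."""
    return 1.0 if positive_id in ranked_ids[:k] else 0.0

def ndcg_at_k(ranked_ids, positive_id, k=10):
    """nDCG@k for one binary positive: 1 / log2(rank + 1), or 0.0 if it is missed."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == positive_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def mean_metric(metric, runs, k=10):
    """Average a per-query metric over (ranked_ids, positive_id) pairs."""
    return sum(metric(ranked, positive, k) for ranked, positive in runs) / len(runs)

# Toy runs with hypothetical ids, not taken from the dataset.
toy_runs = [
    (["d1", "d2", "d3"], "d1"),  # positive at rank 1
    (["d4", "d5", "d6"], "d6"),  # positive at rank 3
    (["d7", "d8", "d9"], "d0"),  # positive not retrieved
]
print(mean_metric(hit_at_k, toy_runs))
print(mean_metric(ndcg_at_k, toy_runs))
```

With a single positive per query, nDCG@10 can never exceed hit@10, which is consistent with the reported 0.0138 and 0.0250.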

The failure mode is structural. A natural-language prompt such as "sum of common divisors" must retrieve code built from the `%` operator, loops, and accumulator variables, so lexical overlap is minimal. For some HumanEval items, the query text is only an import, so BM25 retrieves unrelated snippets that share imports or common Python tokens. A useful retriever needs NL-to-code semantic matching and should recognize algorithmic behavior, not only shared words.
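The lexical gap can be made concrete with a simple token-overlap check on one of the example pairs. This is a rough sketch that uses Jaccard overlap of lowercase alphanumeric tokens as a stand-in for lexical similarity. It is not BM25, and the line breaks in the code snippet are restored by hand from the flattened example.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(left, right):
    """Jaccard overlap between the token sets of two texts."""
    left_tokens, right_tokens = tokens(left), tokens(right)
    union = left_tokens | right_tokens
    return len(left_tokens & right_tokens) / len(union) if union else 0.0

query = "Write a python function to find the sum of common divisors of two given numbers."
positive = """def sum(a,b):
    sum = 0
    for i in range (1,min(a,b)):
        if (a % i == 0 and b % i == 0):
            sum += i
    return sum"""

shared = tokens(query) & tokens(positive)
print(sorted(shared))
print(round(jaccard(query, positive), 3))
```

The only shared tokens are `sum` and `a`, and the second is a coincidence: the article "a" in the prompt matches the parameter name `a` in the code. Words such as "common" and "divisors" have no counterpart in the snippet at all.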

Training Data That May Help

Useful training data includes non-overlapping HumanEval, MBPP, APPS, CodeContests, and CodeSearchNet-style natural-language-to-code pairs, plus execution-verified code solutions with hard negatives from similar tasks. Training should exclude the NanoCodeRAG programming-solution evaluation queries, qrels, and positive solution snippets.
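One way to apply the exclusion rule is to drop any candidate training pair whose prompt or code matches an evaluation query or positive after light normalization. The sketch below assumes training pairs are `(prompt, code)` tuples, and the helper names are hypothetical. It catches only exact matches after lowercasing and whitespace collapsing, so near-duplicates would need fuzzy or n-gram matching on top.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not hide overlap."""
    return re.sub(r"\s+", " ", text.strip().lower())

def drop_eval_overlap(train_pairs, eval_queries, eval_positives):
    """Keep only (prompt, code) pairs that match no evaluation query or positive snippet."""
    blocked_prompts = {normalize(query) for query in eval_queries}
    blocked_code = {normalize(code) for code in eval_positives}
    return [
        (prompt, code)
        for prompt, code in train_pairs
        if normalize(prompt) not in blocked_prompts
        and normalize(code) not in blocked_code
    ]

# Toy data: the first pair repeats an evaluation prompt with different spacing and case.
eval_queries = ["# Write a function to add the given list to the given tuples."]
eval_positives = ["def add_lists(test_list, test_tup): res = tuple(list(test_tup) + test_list) return (res)"]
train_pairs = [
    ("# write a function to add the given list  to the given tuples.", "def f(l, t): return tuple(list(t) + l)"),
    ("Write a function to reverse a string.", "def reverse(s): return s[::-1]"),
]

clean_pairs = drop_eval_overlap(train_pairs, eval_queries, eval_positives)
print(len(clean_pairs))
```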

Training should preserve identifiers, control flow, and algorithmic behavior. Pairs that include tests or input-output examples are useful because they teach retrievers that two prompts with similar wording can require different code.

Synthetic Data Guidance

For document-to-question generation, use non-evaluation Python functions and generate natural programming prompts that describe the function's behavior, inputs, and expected output. Include edge cases and examples when they are grounded in the code.
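To keep generated prompts grounded in the code, it can help to extract the function name and parameters first and pass them to the generator alongside the source. The sketch below uses Python's standard `ast` module for this. The request wording and the `build_generation_request` helper are illustrative assumptions, and the question generator itself (for example, an LLM) is out of scope.

```python
import ast

def describe_signature(source):
    """Return (function_name, parameter_names) for each top-level function in the source."""
    tree = ast.parse(source)
    return [
        (node.name, [arg.arg for arg in node.args.args])
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]

def build_generation_request(source):
    """Assemble a text request for a question generator, grounded in the parsed signature."""
    lines = ["Describe the behavior, inputs, and expected output of this Python code."]
    for name, params in describe_signature(source):
        lines.append(f"Function: {name}({', '.join(params)})")
    lines.append("Code:")
    lines.append(source)
    return "\n".join(lines)

# A non-evaluation function written for this example.
snippet = "def count_even(values):\n    return sum(1 for v in values if v % 2 == 0)\n"
print(build_generation_request(snippet))
```

Parsing the snippet also acts as a cheap filter: source that raises `SyntaxError` in `ast.parse` can be discarded before any prompt is generated.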

For joint generation, create small executable Python functions plus prompts that ask for exactly that behavior. Hard negatives should solve nearby tasks with similar words but different conditions, such as sum vs product, first vs last, or increasing vs decreasing. Do not use Nano evaluation prompts or solution snippets as seeds.
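A hard negative is only useful if it really behaves differently from the positive. The sketch below pairs a sum function with a product function, mirroring the "sum vs product" case above, and checks by execution that they disagree on at least one probe input. All names, the prompt text, and the probe inputs are assumptions made for illustration.

```python
def sum_of_list(values):
    """Positive for the prompt: return the sum of the numbers in a list."""
    total = 0
    for value in values:
        total += value
    return total

def product_of_list(values):
    """Hard negative: same loop shape, but multiplies instead of adding."""
    result = 1
    for value in values:
        result *= value
    return result

def behaves_differently(candidate, reference, probe_inputs):
    """True if the two functions disagree on at least one probe input."""
    return any(candidate(probe) != reference(probe) for probe in probe_inputs)

probes = [[1, 2, 3], [2, 3, 4], []]
triple = {
    "query": "Write a function to find the sum of the numbers in a list.",
    "positive": sum_of_list,
    "hard_negative": product_of_list,
}

# Keep the triple only if the hard negative is verifiably different from the positive.
is_valid_negative = behaves_differently(triple["hard_negative"], triple["positive"], probes)
print(is_valid_negative)
```

The probe `[1, 2, 3]` alone would not separate the two functions, since both return 6. Several probes, including edge cases such as the empty list, make the check more reliable.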

Example Data

| Query | Positive document |
| --- | --- |
| # Write a python function to check whether the given array is monotonic or not. (79 chars) | `def is_Monotonic(A): return (all(A[i] <= A[i + 1] for i in range(len(A) - 1)) or all(A[i] >= A[i + 1] for i in range(len(A) - 1)))` (149 chars) |
| # Write a python function to find the sum of common divisors of two given numbers. (82 chars) | `def sum(a,b): sum = 0 for i in range (1,min(a,b)): if (a % i == 0 and b % i == 0): sum += i return sum` (143 chars) |
| # Write a function to add the given list to the given tuples. (61 chars) | `def add_lists(test_list, test_tup): res = tuple(list(test_tup) + test_list) return (res)` (94 chars) |
| # Write a function to extract the index minimum value record from the given tuples. (83 chars) | `from operator import itemgetter def index_minimum(test_list): res = min(test_list, key = itemgetter(1))[0] return (res)` (127 chars) |
| # Write a python function to check whether the sum of divisors are same or not. (79 chars) | `import math def divSum(n): sum = 1; i = 2; while(i * i <= n): if (n % i == 0): sum = (sum + i +math.floor(n / i)); i += 1; return sum; def areEquivalent(num1,num2): return divSum(num1) == divSum(num2);` (269 chars) |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoCodeRAG |
| Backing dataset | NanoCodeRAG |
| Task / split | NanoCodeRAGProgrammingSolutions |
| Hugging Face dataset | hakari-bench/NanoCodeRAG |
| Language | en |
| Category | code |
| Queries | 200 |
| Documents | 984 |
| Positive qrels | 200 |
| BM25 nDCG@10 | 0.0138 |
| BM25 hit@10 | 0.0250 |
| Query length avg chars | 78.28 |
| Document length avg chars | 189.05 |

Public Sources

Hugging Face Links

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| CodeRAG-Bench: Can Retrieval Augment Code Generation? | 2025 | arXiv paper | https://arxiv.org/abs/2406.14497 |
| CodeRAG-Bench project page | 2025 | project page | https://code-rag-bench.github.io/ |
| code-rag-bench/programming-solutions | 2024 | dataset card | https://huggingface.co/datasets/code-rag-bench/programming-solutions |

Machine-Readable Metadata

benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: NanoCodeRAG
  backing_dataset: NanoCodeRAG
  dataset_id: hakari-bench/NanoCodeRAG
  task_name: NanoCodeRAGProgrammingSolutions
  split_name: NanoCodeRAGProgrammingSolutions
  language: en
  category: code
  document_path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGProgrammingSolutions.md
  source_research:
    primary_source_type: benchmark_paper
    paper_pdf_or_html_checked: true
    paper_url: https://arxiv.org/abs/2406.14497
    additional_source_urls:
      - https://aclanthology.org/2025.findings-naacl.176/
      - https://code-rag-bench.github.io/
      - https://github.com/code-rag-bench/code-rag-bench
      - https://huggingface.co/datasets/code-rag-bench/programming-solutions
  counts:
    queries: 200
    documents: 984
    positive_qrels: 200
  positives_per_query:
    average: 1.0
    min: 1
    median: 1.0
    max: 1
    multi_positive_queries: 0
    multi_positive_query_percent: 0.0
  text_stats_chars:
    query_mean: 78.28
    document_mean: 189.053862
  bm25:
    ndcg_at_10: 0.0137666396
    hit_at_10: 0.025
    source: dataset_bm25_column
  learning:
    original_train_split: unknown
    evaluation_split_origin: CodeRAG-Bench programming solutions retrieval source sampled into NanoCodeRAG
    train_eval_overlap_audit: not_audited
    leakage_note: exclude NanoCodeRAG programming prompts, qrels, and positive solution snippets
    useful_training_data:
      - non-overlapping HumanEval and MBPP style prompt-to-code pairs
      - APPS and CodeContests natural-language-to-code solutions
      - CodeSearchNet summary-to-code retrieval pairs
      - execution-verified Python functions with behaviorally similar hard negatives
    synthetic_data:
      document_generation: small executable Python functions with identifiers, control flow, and edge-case behavior
      question_generation: natural programming prompts describing inputs, outputs, constraints, and examples
      answerability: the solution code should implement exactly the behavior requested by the prompt
    multi_positive_training: single_positive_question_document_focus
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoCodeRAG
    source_urls:
      - label: CodeRAG-Bench arXiv
        url: https://arxiv.org/abs/2406.14497
      - label: CodeRAG-Bench project page
        url: https://code-rag-bench.github.io/
      - label: CodeRAG-Bench GitHub
        url: https://github.com/code-rag-bench/code-rag-bench
      - label: code-rag-bench/programming-solutions
        url: https://huggingface.co/datasets/code-rag-bench/programming-solutions
    source_notes: []
  references:
    - title: "CodeRAG-Bench: Can Retrieval Augment Code Generation?"
      url: https://arxiv.org/abs/2406.14497
      year: 2025
      doi: 10.18653/v1/2025.findings-naacl.176
      is_paper: true
      source_confidence: definitive_paper_link