	# NanoCodeRAG / NanoCodeRAGProgrammingSolutions

	## Overview

	CodeRAG-Bench uses programming-solution documents, including HumanEval- and
	MBPP-style canonical problems, as direct retrieval support for code generation.
	This Nano split keeps the prompt-to-solution view: a short natural-language
	Python programming request must retrieve the compact function that implements
	it. The sampled positives are short snippets for list manipulation, tuple
	sorting, monotonic-array checks, divisor sums, and similar operations, so the
	retriever must map intent to executable code even when the prompt and code
	share little surface wording.

	## Details

	### What the Original Data Measures

	[CodeRAG-Bench: Can Retrieval Augment Code Generation?](https://arxiv.org/abs/2406.14497)
	uses programming solutions as one of its five retrieval sources. The paper says
	these documents are created from basic programming problems with canonical
	solutions, such as HumanEval and MBPP, by concatenating the natural-language
	problem and program solution. In CodeRAG-Bench, such canonical documents can act
	as direct support for generation and as a retrieval-evaluation target.

	This Nano split isolates the programming-solution source: each query describes a
	small Python task, and the positive document is the code solution. Unlike
	documentation or tutorial retrieval, the positive may be only a few lines of code
	and may not repeat the query's descriptive words.

	### Observed Data Profile

	The Nano split has 200 queries, 984 documents, and 200 positive qrel rows. Every
	query has one positive. Queries average 78.28 characters and are mostly prompts
	such as "Write a python function to check whether the given array is monotonic"
	or "find the sum of common divisors". Documents average only 189.05 characters
	and are compact Python functions.
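The profile figures above can be recomputed from any queries/documents/qrels triple. The following is a minimal sketch on toy in-memory data; the dict layout and the `profile` helper are assumptions made for illustration and do not reflect the dataset's actual column names.

```python
from statistics import mean

def profile(queries, documents, qrels):
    """Summarize a retrieval split.

    queries / documents: dict mapping id -> text (hypothetical layout).
    qrels: list of (query_id, document_id) positive pairs.
    """
    positives = {}
    for qid, did in qrels:
        positives.setdefault(qid, set()).add(did)
    return {
        "queries": len(queries),
        "documents": len(documents),
        "positive_qrels": len(qrels),
        "query_mean_chars": mean(len(t) for t in queries.values()),
        "document_mean_chars": mean(len(t) for t in documents.values()),
        "max_positives_per_query": max(len(p) for p in positives.values()),
    }

# Toy data, not taken from the evaluation split.
queries = {"q1": "reverse a list", "q2": "square a number"}
documents = {
    "d1": "def rev(xs): return xs[::-1]",
    "d2": "def sq(n): return n * n",
    "d3": "def neg(n): return -n",
}
qrels = [("q1", "d1"), ("q2", "d2")]
stats = profile(queries, documents, qrels)
```

On the real split, the same summary would be expected to report 200 queries, 984 documents, and at most one positive per query.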

	The sampled positives include one-line or short-loop implementations for list
	manipulation, tuple sorting, monotonic-array checks, divisor sums, and string
	editing. Some HumanEval-style queries in the observed sample are truncated to
	imports such as `from typing import List`, which makes the retrieval task much
	harder because the visible query contains almost no semantic problem statement.

	### BM25 Difficulty

	Using the dataset-provided BM25 candidate column, BM25 reaches nDCG@10 = 0.0138
	and hit@10 = 0.0250. It ranks only one positive first and places only five of
	the 200 positives in the top 10 (5 / 200 = 0.0250). This is the hardest
	NanoCodeRAG split for BM25 by a wide margin.
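Because every query has exactly one positive, both metrics reduce to simple functions of the positive's rank. The sketch below assumes the common binary-gain, log2-discount form of nDCG; whether the reported scores were computed with exactly this formulation is an assumption, and the ranks used are hypothetical.

```python
import math

def ndcg_at_10_single_positive(rank):
    """nDCG@10 for a query with exactly one positive.

    rank is the 1-based position of the positive, or None if it was not
    retrieved. With one positive the ideal DCG is 1, so nDCG equals DCG.
    """
    if rank is None or rank > 10:
        return 0.0
    return 1.0 / math.log2(rank + 1)

def hit_at_10(rank):
    """1.0 if the positive appears in the top 10, else 0.0."""
    return 1.0 if rank is not None and rank <= 10 else 0.0

# Hypothetical ranks for four queries: first, fourth, eleventh, not retrieved.
ranks = [1, 4, 11, None]
mean_ndcg = sum(ndcg_at_10_single_positive(r) for r in ranks) / len(ranks)
mean_hit = sum(hit_at_10(r) for r in ranks) / len(ranks)

# Consistency check on the reported figures: 0.0250 * 200 queries = 5 hits.
reported_hits = round(0.0250 * 200)
```

Under this formulation a positive at rank 1 contributes 1.0 and a positive at rank 10 contributes about 0.29, so a mean nDCG@10 near 0.014 over 200 queries means almost all positives fall outside the top 10.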

	The failure mode is structural. A natural-language prompt such as "sum of common
	divisors" must retrieve code using `%`, loops, and accumulator variables; lexical
	overlap is minimal. For some HumanEval items, the query text is only an import,
	so BM25 retrieves unrelated snippets sharing imports or common Python tokens.
	A useful retriever needs NL-to-code semantic matching and should recognize
	algorithmic behavior, not only shared words.
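The lexical gap can be made concrete by counting shared tokens between one of the example queries and its positive. This is a rough sketch: the regex tokenizer is an assumption standing in for whatever tokenizer the BM25 column used, and token-set overlap is only a proxy for BM25 scoring.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens, a rough stand-in for a BM25 tokenizer."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

query = ("Write a python function to find the sum of common divisors "
         "of two given numbers.")
code = ("def sum(a,b): sum = 0 for i in range (1,min(a,b)): "
        "if (a % i == 0 and b % i == 0): sum += i return sum")

shared = tokens(query) & tokens(code)
union = tokens(query) | tokens(code)
jaccard = len(shared) / len(union)
```

Here the only shared tokens are `a` and `sum`. The first is a collision between an English article and a variable name, and the second matches only because the function happens to be named `sum`, so neither carries much discriminative signal across a corpus of short Python functions.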

	### Training Data That May Help

	Useful training data includes non-overlapping HumanEval, MBPP, APPS, CodeContests,
	and CodeSearchNet-style natural-language-to-code pairs, plus execution-verified
	code solutions with hard negatives from similar tasks. Training should exclude
	the NanoCodeRAG programming-solution evaluation queries, qrels, and positive
	solution snippets.
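One simple way to apply the exclusion is an exact-match filter after light normalization. The sketch below is an assumption about how such a filter could look, with hypothetical helper names and placeholder strings; it does not catch paraphrases or near-duplicates, which would need fuzzy or embedding-based matching.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide overlap."""
    return re.sub(r"\s+", " ", text.strip().lower())

def drop_leaked_pairs(train_pairs, eval_queries, eval_positives):
    """Remove (prompt, code) pairs whose prompt or code matches evaluation
    text exactly after normalization."""
    blocked_prompts = {normalize(q) for q in eval_queries}
    blocked_code = {normalize(d) for d in eval_positives}
    return [
        (prompt, code)
        for prompt, code in train_pairs
        if normalize(prompt) not in blocked_prompts
        and normalize(code) not in blocked_code
    ]

# Toy data; the evaluation strings are placeholders, not real Nano items.
eval_queries = ["Write a function to reverse a list."]
eval_positives = ["def rev(xs): return xs[::-1]"]
train_pairs = [
    ("write a function to  reverse a list.", "def r(xs): return list(reversed(xs))"),
    ("Write a function to square a number.", "def sq(n): return n * n"),
    ("Return the list backwards.", "def rev(xs):   return xs[::-1]"),
]
clean = drop_leaked_pairs(train_pairs, eval_queries, eval_positives)
```

The first pair is dropped because its prompt matches an evaluation query, and the third because its code matches an evaluation positive; only the unrelated pair survives.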

	Training data should keep identifiers, control flow, and algorithmic behavior
	intact. Pairs that include tests or input-output examples are useful because
	they teach retrievers that two prompts with similar wording can require
	different code.

	### Synthetic Data Guidance

	For document-to-question generation, use non-evaluation Python functions and
	generate natural programming prompts that describe the function's behavior,
	inputs, and expected output. Include edge cases and examples when they are
	grounded in the code.

	For joint generation, create small executable Python functions plus prompts that
	ask for exactly that behavior. Hard negatives should solve nearby tasks with
	similar words but different conditions, such as sum vs product, first vs last,
	or increasing vs decreasing. Do not use Nano evaluation prompts or solution
	snippets as seeds.
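A hard negative is only useful if it really behaves differently from the positive. The sketch below shows one way to check that by executing both functions on a few probe inputs; the triple layout, function names, and probes are all hypothetical and chosen to illustrate the "increasing vs decreasing" case.

```python
def is_non_decreasing(xs):
    """Positive: True when each element is >= the previous one."""
    return all(a <= b for a, b in zip(xs, xs[1:]))

def is_non_increasing(xs):
    """Hard negative: same shape and vocabulary, opposite condition."""
    return all(a >= b for a, b in zip(xs, xs[1:]))

def behaviors_differ(f, g, probes):
    """True if f and g disagree on at least one probe input."""
    return any(f(p) != g(p) for p in probes)

triple = {
    "query": "Check whether a list is non-decreasing.",
    "positive": is_non_decreasing,
    "hard_negative": is_non_increasing,
}
probes = [[1, 2, 3], [3, 2, 1], [1, 1, 1], []]

# Keep the triple only if execution separates the positive from the negative.
keep = behaviors_differ(triple["positive"], triple["hard_negative"], probes)
```

Probes such as `[1, 1, 1]` and `[]` do not separate the two functions, which is why the probe set should include inputs that exercise the differing condition.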

	## Example Data

	| Query | Positive document |
	| --- | --- |
	| # Write a python function to check whether the given array is monotonic or not. (79 chars) | `def is_Monotonic(A): return (all(A[i] <= A[i + 1] for i in range(len(A) - 1)) or all(A[i] >= A[i + 1] for i in range(len(A) - 1)))` (149 chars) |
	| # Write a python function to find the sum of common divisors of two given numbers. (82 chars) | `def sum(a,b): sum = 0 for i in range (1,min(a,b)): if (a % i == 0 and b % i == 0): sum += i return sum` (143 chars) |
	| # Write a function to add the given list to the given tuples. (61 chars) | `def add_lists(test_list, test_tup): res = tuple(list(test_tup) + test_list) return (res)` (94 chars) |
	| # Write a function to extract the index minimum value record from the given tuples. (83 chars) | `from operator import itemgetter def index_minimum(test_list): res = min(test_list, key = itemgetter(1))[0] return (res)` (127 chars) |
	| # Write a python function to check whether the sum of divisors are same or not. (79 chars) | `import math def divSum(n): sum = 1; i = 2; while(i * i <= n): if (n % i == 0): sum = (sum + i +math.floor(n / i)); i += 1; return sum; def areEquivalent(num1,num2): return divSum(num1) == divSum(num2);` (269 chars) |

	## Dataset Information

	| Field | Value |
	| --- | --- |
	| Nano set | NanoCodeRAG |
	| Backing dataset | NanoCodeRAG |
	| Task / split | NanoCodeRAGProgrammingSolutions |
	| Hugging Face dataset | [hakari-bench/NanoCodeRAG](https://huggingface.co/datasets/hakari-bench/NanoCodeRAG) |
	| Language | en |
	| Category | code |
	| Queries | 200 |
	| Documents | 984 |
	| Positive qrels | 200 |
	| BM25 nDCG@10 | 0.0138 |
	| BM25 hit@10 | 0.0250 |
	| Query length avg chars | 78.28 |
	| Document length avg chars | 189.05 |

	### Public Sources

	- [CodeRAG-Bench: Can Retrieval Augment Code Generation?](https://arxiv.org/abs/2406.14497); 2025; Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, and Daniel Fried; DOI: `10.18653/v1/2025.findings-naacl.176`.
	- [CodeRAG-Bench project page](https://code-rag-bench.github.io/).
	- [CodeRAG-Bench GitHub repository](https://github.com/code-rag-bench/code-rag-bench).
	- [code-rag-bench/programming-solutions dataset card](https://huggingface.co/datasets/code-rag-bench/programming-solutions).

	### Hugging Face Links

	- Nano dataset: [hakari-bench/NanoCodeRAG](https://huggingface.co/datasets/hakari-bench/NanoCodeRAG)
	- Source dataset: [code-rag-bench/programming-solutions](https://huggingface.co/datasets/code-rag-bench/programming-solutions)

	### Source Reference Table

	| Title | Year | Type | URL |
	| --- | ---: | --- | --- |
	| CodeRAG-Bench: Can Retrieval Augment Code Generation? | 2025 | arXiv paper | https://arxiv.org/abs/2406.14497 |
	| CodeRAG-Bench project page | 2025 | project page | https://code-rag-bench.github.io/ |
	| code-rag-bench/programming-solutions | 2024 | dataset card | https://huggingface.co/datasets/code-rag-bench/programming-solutions |

	## Machine-Readable Metadata

	<!-- benchmark-task-metadata:v1 -->

	```yaml
	benchmark_task_metadata:
	  schema_version: 1
	  document_status: first_pass
	  nano_set: NanoCodeRAG
	  backing_dataset: NanoCodeRAG
	  dataset_id: hakari-bench/NanoCodeRAG
	  task_name: NanoCodeRAGProgrammingSolutions
	  split_name: NanoCodeRAGProgrammingSolutions
	  language: en
	  category: code
	  document_path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGProgrammingSolutions.md
	  source_research:
	    primary_source_type: benchmark_paper
	    paper_pdf_or_html_checked: true
	    paper_url: https://arxiv.org/abs/2406.14497
	    additional_source_urls:
	      - https://aclanthology.org/2025.findings-naacl.176/
	      - https://code-rag-bench.github.io/
	      - https://github.com/code-rag-bench/code-rag-bench
	      - https://huggingface.co/datasets/code-rag-bench/programming-solutions
	  counts:
	    queries: 200
	    documents: 984
	    positive_qrels: 200
	  positives_per_query:
	    average: 1.0
	    min: 1
	    median: 1.0
	    max: 1
	    multi_positive_queries: 0
	    multi_positive_query_percent: 0.0
	  text_stats_chars:
	    query_mean: 78.28
	    document_mean: 189.053862
	  bm25:
	    ndcg_at_10: 0.0137666396
	    hit_at_10: 0.025
	    source: dataset_bm25_column
	  learning:
	    original_train_split: unknown
	    evaluation_split_origin: CodeRAG-Bench programming solutions retrieval source sampled into NanoCodeRAG
	    train_eval_overlap_audit: not_audited
	    leakage_note: exclude NanoCodeRAG programming prompts, qrels, and positive solution snippets
	    useful_training_data:
	      - non-overlapping HumanEval and MBPP style prompt-to-code pairs
	      - APPS and CodeContests natural-language-to-code solutions
	      - CodeSearchNet summary-to-code retrieval pairs
	      - execution-verified Python functions with behaviorally similar hard negatives
	  synthetic_data:
	    document_generation: small executable Python functions with identifiers, control flow, and edge-case behavior
	    question_generation: natural programming prompts describing inputs, outputs, constraints, and examples
	    answerability: the solution code should implement exactly the behavior requested by the prompt
	    multi_positive_training: single_positive_question_document_focus
	  links:
	    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoCodeRAG
	    source_urls:
	      - label: CodeRAG-Bench arXiv
	        url: https://arxiv.org/abs/2406.14497
	      - label: CodeRAG-Bench project page
	        url: https://code-rag-bench.github.io/
	      - label: CodeRAG-Bench GitHub
	        url: https://github.com/code-rag-bench/code-rag-bench
	      - label: code-rag-bench/programming-solutions
	        url: https://huggingface.co/datasets/code-rag-bench/programming-solutions
	  source_notes: []
	  references:
	    - title: "CodeRAG-Bench: Can Retrieval Augment Code Generation?"
	      url: https://arxiv.org/abs/2406.14497
	      year: 2025
	      doi: 10.18653/v1/2025.findings-naacl.176
	      is_paper: true
	      source_confidence: definitive_paper_link
	```