	# NanoCodeRAG / NanoCodeRAGLibraryDocumentationSolutions

	## Overview

	CodeRAG-Bench studies whether retrieval can support code generation, and its
	library-documentation source is built from official Python library references
	collected through devdocs.io. This Nano task uses API names or short reference
	descriptions as queries and retrieves documentation entries, often TensorFlow
	pages. The observed records include signatures, aliases, arguments, examples,
	and migration notes, so the task asks whether a retriever can find the exact
	reference page that would ground API-aware generation.

	## Details

	### What the Original Data Measures

	[CodeRAG-Bench: Can Retrieval Augment Code Generation?](https://arxiv.org/abs/2406.14497)
	introduces a retrieval-augmented code generation benchmark with a heterogeneous
	retrieval datastore. The paper reports five retrieval sources: programming
	solutions, online tutorials, Python library documentation, Stack Overflow posts,
	and GitHub files. For library documentation, it collects the official
	documentation that devdocs.io provides for Python libraries. This source is
	intended mainly to help open-domain and repository-level programming tasks
	that require library-specific functions.

	The same paper manually annotates canonical documents for code-generation tasks
	and evaluates retrieval with nDCG@10, precision, and recall. It also finds that
	current retrievers still struggle when useful contexts have limited lexical
	overlap with the query. In this Nano split, the retrieval surface is the
	documentation source itself: the correct document is the API documentation
	entry associated with the query.

	### Observed Data Profile

	The Nano split has 200 queries, 8,683 documents, and 200 positive qrel rows.
	Every query has exactly one positive. Queries average 397.43 characters, but
	the median is only 110 characters; the long tail comes from API entries whose
	query text includes unusually long reference material. Documents average
	2,045.70 characters, and some documentation pages are very long: one positive
	in the example table below runs to 55,309 characters.
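
The gap between mean and median length is the usual signature of a long tail. The following is a minimal sketch of how such a profile could be computed, assuming the queries are available as a plain list of strings. The helper name `length_profile` and the toy strings are illustrative and are not taken from the dataset.

```python
from statistics import mean, median

def length_profile(texts):
    """Return character-length mean, median, and max for a list of strings."""
    lengths = [len(t) for t in texts]
    return {"mean": mean(lengths), "median": median(lengths), "max": max(lengths)}

# Hypothetical toy queries: four short API lookups plus one very long entry.
toy_queries = ["a" * 100, "b" * 110, "c" * 110, "d" * 120, "e" * 1500]
profile = length_profile(toy_queries)
print(profile)
```

A single long entry is enough to pull the mean far above the median, which is the same pattern the query statistics above show.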

	The sampled data is dominated by TensorFlow-style API documentation: function or
	class names such as `tf.autodiff.ForwardAccumulator`,
	`tf.compat.v1.confusion_matrix`, and `tf.compat.v1.batch_to_space_nd`, followed
	by a short description and alias notes. The relevant documents contain method
	signatures, argument descriptions, examples, deprecation warnings, and migration
	guidance.

	### BM25 Difficulty

	Using the dataset-provided BM25 candidate column, BM25 reaches nDCG@10 = 0.2279
	and hit@10 = 0.3800. It ranks the positive first for 19 of the 200 queries and
	places it in the top 10 for 76 of them (76/200 = 0.38). This is a difficult
	lexical retrieval task because many documents repeat generic documentation
	phrases such as "View aliases", "Compat aliases", and "Migration guide", while
	the meaningful disambiguator may be a dotted API path.

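Because every query has a single positive, both metrics reduce to simple functions of the positive's rank. The sketch below assumes the standard binary-gain nDCG formulation; the function names and the toy ranks are hypothetical, and the exact scoring code behind the reported numbers is not shown in this document.

```python
import math

def ndcg_at_10_single_positive(rank):
    """nDCG@10 for a query with exactly one relevant document.

    `rank` is the 1-based position of the positive, or None if it was not
    retrieved. With one positive the ideal DCG is 1, so nDCG reduces to
    1 / log2(rank + 1) inside the top 10 and 0 outside it.
    """
    if rank is None or rank > 10:
        return 0.0
    return 1.0 / math.log2(rank + 1)

def hit_at_10(rank):
    """1.0 if the positive appears in the top 10, else 0.0."""
    return 1.0 if rank is not None and rank <= 10 else 0.0

def evaluate(ranks):
    """Average both metrics over a list of per-query positive ranks."""
    n = len(ranks)
    return {
        "ndcg@10": sum(ndcg_at_10_single_positive(r) for r in ranks) / n,
        "hit@10": sum(hit_at_10(r) for r in ranks) / n,
    }

# Hypothetical ranks for four queries: first, third, not retrieved, twelfth.
toy_ranks = [1, 3, None, 12]
scores = evaluate(toy_ranks)
print(scores)
```

Under this formulation, the 19 rank-1 positives alone would contribute 19/200 = 0.095 to nDCG@10, so most of the remaining score would come from the 57 positives found at ranks 2 through 10.
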
	Observed failures include TensorFlow AutoGraph and audio APIs where BM25 ranks
	unrelated Keras constraint or optimizer documentation above the positive. A
	strong retriever must preserve exact API identifiers and namespace structure,
	while also using semantic clues from the short API summary.
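
The effect of tokenization on dotted API paths can be illustrated with inverse document frequency. The sketch below uses a common non-negative BM25 IDF variant and three made-up documents. How the dataset's BM25 column actually tokenizes text is not stated here, so this is only an illustration of the general effect, not a reconstruction of that pipeline.

```python
import math
import re
from collections import Counter

def idf_table(tokenized_docs):
    """BM25-style IDF per term: log(1 + (N - df + 0.5) / (df + 0.5))."""
    n = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

# Hypothetical documents that mimic the boilerplate-heavy reference pages.
toy_docs = [
    "tf.compat.v1.confusion_matrix View aliases Migration guide",
    "tf.compat.v1.batch_to_space_nd View aliases Migration guide",
    "tf.keras.constraints.MaxNorm View aliases",
]

# Tokenizer A splits dotted paths into components; tokenizer B keeps them whole.
split_tokens = [re.findall(r"[a-z0-9_]+", d.lower()) for d in toy_docs]
whole_tokens = [d.lower().split() for d in toy_docs]

split_idf = idf_table(split_tokens)
whole_idf = idf_table(whole_tokens)

print("split 'tf':", round(split_idf["tf"], 3))
print("whole path:", round(whole_idf["tf.compat.v1.confusion_matrix"], 3))
print("boilerplate 'aliases':", round(whole_idf["aliases"], 3))
```

When the path is split, the `tf` component appears in every document and carries as little weight as the boilerplate word `aliases`; kept whole, the full path is rare and discriminative.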

	### Training Data That May Help

	Useful training data includes non-overlapping Python API documentation retrieval,
	DocPrompting-style natural-language intent to documentation pairs, API search
	logs, docstring-to-reference retrieval, and library-specific examples paired
	with the reference page that explains them. Training should exclude the
	CodeRAG-Bench library-documentation evaluation queries, qrels, and positive
	documentation entries used by this Nano split.
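
The exclusion described above could be approximated with a simple text-level filter. This is a minimal sketch under the assumption that training data is a list of `(query, document)` string pairs; the function name `remove_eval_overlap` and the toy data are hypothetical, and exact matching after normalization will not catch near-duplicates.

```python
def remove_eval_overlap(train_pairs, eval_queries, eval_positive_docs):
    """Drop training pairs whose query or document matches evaluation text.

    Matching is exact after lowercasing and whitespace normalization;
    near-duplicate detection would need a stronger method.
    """
    def norm(text):
        return " ".join(text.lower().split())

    blocked_queries = {norm(q) for q in eval_queries}
    blocked_docs = {norm(d) for d in eval_positive_docs}
    return [
        (q, d)
        for q, d in train_pairs
        if norm(q) not in blocked_queries and norm(d) not in blocked_docs
    ]

# Hypothetical data: the second pair reuses an evaluation query and the
# third pair reuses an evaluation positive document.
train_pairs = [
    ("how to build a confusion matrix", "doc A"),
    ("tf.compat.v1.confusion_matrix  Computes the confusion matrix", "doc B"),
    ("random dataset usage", "Eval Positive Doc"),
]
eval_queries = ["tf.compat.v1.confusion_matrix Computes the confusion matrix"]
eval_positive_docs = ["eval positive doc"]

clean_pairs = remove_eval_overlap(train_pairs, eval_queries, eval_positive_docs)
print(clean_pairs)
```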

	Models should be trained to keep identifiers intact: dotted module paths,
	function names, argument names, and versioned aliases are often the decisive
	tokens. Generic documentation boilerplate should be treated as weak evidence.
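
One way to keep identifiers intact in a lexical or sparse pipeline is to emit each dotted path as a single token alongside its components. The tokenizer below is an illustrative sketch, not a method prescribed by CodeRAG-Bench or this benchmark; the regular expressions and the function name `tokenize_keep_identifiers` are assumptions.

```python
import re

# A dotted identifier: two or more Python-style names joined by dots.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)+")
WORD = re.compile(r"[A-Za-z0-9_]+")

def tokenize_keep_identifiers(text):
    """Tokenize text, keeping dotted API paths whole and also adding their parts.

    Ordinary words are lowercased; identifiers and their components keep
    their original case.
    """
    tokens = []
    pos = 0
    for match in IDENTIFIER.finditer(text):
        # Plain words that appear before this identifier.
        tokens.extend(w.lower() for w in WORD.findall(text[pos:match.start()]))
        full = match.group(0)
        tokens.append(full)             # whole dotted path
        tokens.extend(full.split("."))  # individual components
        pos = match.end()
    # Plain words after the last identifier.
    tokens.extend(w.lower() for w in WORD.findall(text[pos:]))
    return tokens

example_tokens = tokenize_keep_identifiers(
    "tf.compat.v1.confusion_matrix Computes the confusion matrix"
)
print(example_tokens)
```

Emitting both the whole path and its components lets an exact-path match score highly while still allowing partial matches on a function name alone.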

	### Synthetic Data Guidance

	For document-to-question generation, use non-evaluation API reference pages and
	generate short programming questions, API-name lookups, and usage-intent queries
	that are answerable from the selected documentation. Preserve signatures,
	argument names, return types, warnings, and version-specific notes.

	For joint generation, create realistic library documentation entries and
	developer queries that ask how to use or locate an API. Hard negatives should be
	nearby APIs in the same namespace or functions with similar boilerplate but
	different behavior. Do not seed synthetic data with Nano evaluation queries or
	positive documentation entries.
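
Selecting nearby APIs in the same namespace as hard negatives can be sketched by grouping on the parent of the dotted name. The helpers below are hypothetical and assume each document is keyed by its dotted API name; the candidate list reuses API names that appear in the example data of this page purely for illustration.

```python
def namespace(api_name):
    """Return the parent namespace of a dotted API name.

    For example, 'tf.compat.v1.foo' maps to 'tf.compat.v1'.
    """
    return api_name.rpartition(".")[0]

def same_namespace_negatives(positive, candidates, limit=3):
    """Pick up to `limit` candidates that share the positive's parent namespace."""
    parent = namespace(positive)
    return [c for c in candidates if c != positive and namespace(c) == parent][:limit]

candidates = [
    "tf.compat.v1.confusion_matrix",
    "tf.compat.v1.batch_to_space_nd",
    "tf.autodiff.ForwardAccumulator",
    "tf.compat.v1.distribute.OneDeviceStrategy",
]
negatives = same_namespace_negatives("tf.compat.v1.confusion_matrix", candidates)
print(negatives)
```

A fuller pipeline might also rank candidates by shared boilerplate or by retriever score, but namespace proximity alone already yields negatives that differ from the positive mainly in the final identifier.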

	## Example Data

	| Query | Positive document |
	| --- | --- |
	| tf.autodiff.ForwardAccumulator Computes Jacobian-vector products ("JVP"s) using forward-mode autodiff. (102 chars) | tf.autodiff.ForwardAccumulator( primals, tangents ) Compare to tf.GradientTape which computes vector-Jacobian products ("VJP"s) using reverse-mode autodiff (backprop). Reverse mode is more attractive when computing gradients ... [truncated 225 chars] (6087 chars) |
	| tf.compat.v1.data.experimental.RandomDataset A Dataset of pseudorandom values. Inherits From: Dataset, Dataset (110 chars) | tf.compat.v1.data.experimental.RandomDataset( seed=None ) Attributes element_spec The type specification of an element of this dataset. dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3]) dataset.element_spec TensorSpec(s ... [truncated 225 chars] (55309 chars) |
	| tf.compat.v1.confusion_matrix Computes the confusion matrix from predictions and labels. View aliases Compat aliases for migration (132 chars) | See Migration guide for more details. tf.compat.v1.math.confusion_matrix tf.compat.v1.confusion_matrix( labels, predictions, num_classes=None, dtype=tf.dtypes.int32, name=None, weights=None ) The matrix columns represent the ... [truncated 225 chars] (1943 chars) |
	| tf.compat.v1.batch_to_space_nd BatchToSpace for N-D tensors of type T. View aliases Compat aliases for migration (114 chars) | See Migration guide for more details. tf.compat.v1.manip.batch_to_space_nd tf.compat.v1.batch_to_space_nd( input, block_shape, crops, name=None ) This operation reshapes the "batch" dimension 0 into M + 1 dimensions of shape ... [truncated 225 chars] (3558 chars) |
	| tf.compat.v1.distribute.OneDeviceStrategy A distribution strategy for running on a single device. Inherits From: Strategy (121 chars) | tf.compat.v1.distribute.OneDeviceStrategy( device ) Using this strategy will place any variables created in its scope on the specified device. Input distributed through this strategy will be prefetched to the specified device ... [truncated 225 chars] (30793 chars) |

	## Dataset Information

	| Field | Value |
	| --- | --- |
	| Nano set | NanoCodeRAG |
	| Backing dataset | NanoCodeRAG |
	| Task / split | NanoCodeRAGLibraryDocumentationSolutions |
	| Hugging Face dataset | [hakari-bench/NanoCodeRAG](https://huggingface.co/datasets/hakari-bench/NanoCodeRAG) |
	| Language | en |
	| Category | code |
	| Queries | 200 |
	| Documents | 8,683 |
	| Positive qrels | 200 |
	| BM25 nDCG@10 | 0.2279 |
	| BM25 hit@10 | 0.3800 |
	| Query length avg chars | 397.43 |
	| Document length avg chars | 2,045.70 |

	### Public Sources

	- [CodeRAG-Bench: Can Retrieval Augment Code Generation?](https://arxiv.org/abs/2406.14497); 2025; Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, and Daniel Fried; DOI: `10.18653/v1/2025.findings-naacl.176`.
	- [CodeRAG-Bench project page](https://code-rag-bench.github.io/).
	- [CodeRAG-Bench GitHub repository](https://github.com/code-rag-bench/code-rag-bench).
	- [code-rag-bench/library-documentation dataset card](https://huggingface.co/datasets/code-rag-bench/library-documentation).

	### Hugging Face Links

	- Nano dataset: [hakari-bench/NanoCodeRAG](https://huggingface.co/datasets/hakari-bench/NanoCodeRAG)
	- Source dataset: [code-rag-bench/library-documentation](https://huggingface.co/datasets/code-rag-bench/library-documentation)

	### Source Reference Table

	| Title | Year | Type | URL |
	| --- | ---: | --- | --- |
	| CodeRAG-Bench: Can Retrieval Augment Code Generation? | 2025 | arXiv paper | https://arxiv.org/abs/2406.14497 |
	| CodeRAG-Bench project page | 2025 | project page | https://code-rag-bench.github.io/ |
	| code-rag-bench/library-documentation | 2024 | dataset card | https://huggingface.co/datasets/code-rag-bench/library-documentation |

	## Machine-Readable Metadata

	<!-- benchmark-task-metadata:v1 -->

	```yaml
	benchmark_task_metadata:
	  schema_version: 1
	  document_status: first_pass
	  nano_set: NanoCodeRAG
	  backing_dataset: NanoCodeRAG
	  dataset_id: hakari-bench/NanoCodeRAG
	  task_name: NanoCodeRAGLibraryDocumentationSolutions
	  split_name: NanoCodeRAGLibraryDocumentationSolutions
	  language: en
	  category: code
	  document_path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGLibraryDocumentationSolutions.md
	  source_research:
	    primary_source_type: benchmark_paper
	    paper_pdf_or_html_checked: true
	    paper_url: https://arxiv.org/abs/2406.14497
	    additional_source_urls:
	      - https://aclanthology.org/2025.findings-naacl.176/
	      - https://code-rag-bench.github.io/
	      - https://github.com/code-rag-bench/code-rag-bench
	      - https://huggingface.co/datasets/code-rag-bench/library-documentation
	  counts:
	    queries: 200
	    documents: 8683
	    positive_qrels: 200
	    positives_per_query:
	      average: 1.0
	      min: 1
	      median: 1.0
	      max: 1
	      multi_positive_queries: 0
	      multi_positive_query_percent: 0.0
	  text_stats_chars:
	    query_mean: 397.43
	    document_mean: 2045.703098
	  bm25:
	    ndcg_at_10: 0.227871825
	    hit_at_10: 0.38
	    source: dataset_bm25_column
	  learning:
	    original_train_split: unknown
	    evaluation_split_origin: CodeRAG-Bench library documentation retrieval source sampled into NanoCodeRAG
	    train_eval_overlap_audit: not_audited
	    leakage_note: exclude NanoCodeRAG library-documentation queries, qrels, and positive documentation entries
	    useful_training_data:
	      - non-overlapping Python API documentation retrieval pairs
	      - DocPrompting-style natural-language intent to documentation pairs
	      - docstring and example code to reference-page retrieval
	      - library search logs and API usage examples with overlap removed
	  synthetic_data:
	    document_generation: realistic Python API documentation with signatures, parameters, examples, aliases, and version notes
	    question_generation: API-name, usage-intent, and troubleshooting queries grounded in those documentation entries
	    answerability: the selected document should contain the exact API behavior, signature, or argument needed by the query
	    multi_positive_training: single_positive_question_document_focus
	  links:
	    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoCodeRAG
	    source_urls:
	      - label: CodeRAG-Bench arXiv
	        url: https://arxiv.org/abs/2406.14497
	      - label: CodeRAG-Bench project page
	        url: https://code-rag-bench.github.io/
	      - label: CodeRAG-Bench GitHub
	        url: https://github.com/code-rag-bench/code-rag-bench
	      - label: code-rag-bench/library-documentation
	        url: https://huggingface.co/datasets/code-rag-bench/library-documentation
	  source_notes: []
	  references:
	    - title: "CodeRAG-Bench: Can Retrieval Augment Code Generation?"
	      url: https://arxiv.org/abs/2406.14497
	      year: 2025
	      doi: 10.18653/v1/2025.findings-naacl.176
	      is_paper: true
	      source_confidence: definitive_paper_link
	```