NanoCodeRAG / NanoCodeRAGLibraryDocumentationSolutions

Overview

CodeRAG-Bench studies whether retrieval can support code generation, and its library-documentation source is built from official Python library references collected through devdocs.io. This Nano task uses API names or short reference descriptions as queries and retrieves documentation entries, often TensorFlow pages. The observed records include signatures, aliases, arguments, examples, and migration notes, so the task asks whether a retriever can find the exact reference page that would ground API-aware generation.

Details

What the Original Data Measures

CodeRAG-Bench: Can Retrieval Augment Code Generation? introduces a retrieval-augmented code generation benchmark with a heterogeneous retrieval datastore. The paper reports five retrieval sources: programming solutions, online tutorials, Python library documentation, Stack Overflow posts, and GitHub files. For library documentation, it collects official documentation provided by devdocs.io for Python libraries, which is especially intended to help open-domain and repository-level programming tasks that require library-specific functions.

The same paper manually annotates canonical documents for code-generation tasks and evaluates retrieval with nDCG@10, precision, and recall. It also finds that current retrievers still struggle when useful contexts have limited lexical overlap with the query. In this Nano split, the retrieval surface is the documentation source itself: the correct document is the API documentation entry associated with the query.
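Because the Nano split reports exactly one positive per query, nDCG@10 reduces to a simple function of the positive's rank. The sketch below assumes binary relevance and the common log2 discount; the source does not state which nDCG variant the evaluation code uses, and the function names and sample ranks are illustrative only.

```python
import math

def ndcg_at_k_single_positive(rank, k=10):
    """nDCG@k for a query with exactly one relevant document.

    `rank` is the 1-based position of the positive in the ranked list,
    or None if it was not retrieved. With one positive the ideal DCG is
    1.0, so nDCG equals the discounted gain of that positive.
    """
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)

def mean_ndcg_at_k(ranks, k=10):
    """Average nDCG@k over a list of per-query positive ranks."""
    return sum(ndcg_at_k_single_positive(r, k) for r in ranks) / len(ranks)

# Hypothetical ranks: first place, third place, not retrieved, outside top 10.
example_ranks = [1, 3, None, 12]
example_mean = mean_ndcg_at_k(example_ranks)
```

Under these assumptions a positive at rank 1 scores 1.0, a positive at rank 3 scores 0.5, and anything outside the top 10 scores 0.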

Observed Data Profile

The Nano split has 200 queries, 8,683 documents, and 200 positive qrel rows. Every query has one positive. Queries average 397.43 characters, but the median is only 110 characters; the long tail comes from API entries whose query text includes unusually long reference material. Documents average 2,045.70 characters, with some very long documentation pages.
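The gap between the mean and the median is what a small number of very long queries would produce. As a minimal illustration, the sketch below profiles character lengths for a list of texts; the sample lengths are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median

def length_profile(texts):
    """Return mean, median, and max character length for a list of strings."""
    lengths = [len(t) for t in texts]
    return {"mean": mean(lengths), "median": median(lengths), "max": max(lengths)}

# Hypothetical queries: five short ones and a single very long one.
texts = ["q" * n for n in (100, 105, 110, 115, 120, 2500)]
profile = length_profile(texts)
```

In this toy sample the single long entry pulls the mean far above the median, the same pattern the split's query statistics show.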

The sampled data is dominated by TensorFlow-style API documentation. Queries consist of a function or class name such as tf.autodiff.ForwardAccumulator, tf.compat.v1.confusion_matrix, or tf.compat.v1.batch_to_space_nd, followed by a short description and alias notes. The relevant documents contain method signatures, argument descriptions, examples, deprecation warnings, and migration guidance.

BM25 Difficulty

Using the dataset-provided BM25 candidate column, BM25 reaches nDCG@10 = 0.2279 and hit@10 = 0.3800. BM25 ranks 19 positives first and finds 76 positives in the top 10. This is a difficult lexical retrieval task because many documents repeat generic documentation phrases such as "View aliases", "Compat aliases", and "Migration guide", while the meaningful disambiguator may be a dotted API path.
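The reported figures can be cross-checked with a little arithmetic. The sketch below assumes binary relevance with a log2 discount and one positive per query; the variable names are illustrative, and the derived "average gain" is a consistency check, not a number reported by the source.

```python
import math

num_queries = 200
top1_hits = 19      # positives ranked first
top10_hits = 76     # positives found in the top 10
ndcg_at_10 = 0.2279

hit_at_10 = top10_hits / num_queries   # 0.38, matches the reported hit@10
hit_at_1 = top1_hits / num_queries     # 0.095

# Each rank-1 positive contributes a gain of 1.0, so the remaining nDCG
# mass must come from the 57 positives at ranks 2 through 10.
total_gain = ndcg_at_10 * num_queries
remaining_gain = total_gain - top1_hits * 1.0
avg_gain_ranks_2_to_10 = remaining_gain / (top10_hits - top1_hits)

# A single positive at rank r contributes 1 / log2(r + 1).
max_gain = 1.0 / math.log2(3)    # rank 2
min_gain = 1.0 / math.log2(11)   # rank 10
is_consistent = min_gain <= avg_gain_ranks_2_to_10 <= max_gain

# Rank whose gain equals the average gain (a gain-equivalent rank,
# not the arithmetic mean of the actual ranks).
gain_equivalent_rank = 2 ** (1.0 / avg_gain_ranks_2_to_10) - 1
```

Under these assumptions the numbers are mutually consistent: the positives that BM25 retrieves but does not rank first contribute an average gain of about 0.47 each, which lies between the rank-2 and rank-10 bounds.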

Observed failures include TensorFlow AutoGraph and audio APIs where BM25 ranks unrelated Keras constraint or optimizer documentation above the positive. A strong retriever must preserve exact API identifiers and namespace structure, while also using semantic clues from the short API summary.
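One way to preserve exact API identifiers in a lexical pipeline is to tokenize dotted paths as single tokens while also emitting their components. The following is a minimal sketch of that idea, not the tokenizer used to build the dataset's BM25 column; the regular expression and function name are assumptions.

```python
import re

# A dotted API path such as tf.compat.v1.confusion_matrix is matched as one
# token; anything else falls back to a plain word token.
TOKEN_RE = re.compile(
    r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)+|[A-Za-z0-9_]+"
)

def tokenize(text):
    """Lowercase tokens, keeping dotted identifiers whole and also split."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        tokens.append(tok.lower())
        if "." in tok:
            # Also emit the components so partial namespace matches can score.
            tokens.extend(part.lower() for part in tok.split("."))
    return tokens

query_tokens = tokenize("tf.compat.v1.confusion_matrix Computes the confusion matrix")
```

Keeping the full path as its own token gives a scorer a rare, highly discriminative term, while the component tokens retain some recall for aliases that share a namespace.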

Training Data That May Help

Useful training data includes non-overlapping Python API documentation retrieval, DocPrompting-style natural-language intent to documentation pairs, API search logs, docstring-to-reference retrieval, and library-specific examples paired with the reference page that explains them. Training should exclude the CodeRAG-Bench library-documentation evaluation queries, qrels, and positive documentation entries used by this Nano split.
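The exclusion rule can be sketched as a filter over candidate training pairs. The version below only catches exact matches after whitespace and case normalization, which is an assumption of this sketch; near-duplicate documentation entries would need fuzzier matching. All names and sample strings are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def filter_training_pairs(train_pairs, eval_queries, eval_positive_docs):
    """Drop (query, document) pairs that overlap with evaluation data."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_docs = {normalize(d) for d in eval_positive_docs}
    kept = []
    for query, doc in train_pairs:
        if normalize(query) in blocked_queries or normalize(doc) in blocked_docs:
            continue
        kept.append((query, doc))
    return kept

# Hypothetical data for illustration.
eval_queries = ["tf.autodiff.ForwardAccumulator Computes Jacobian-vector products"]
eval_docs = ["tf.autodiff.ForwardAccumulator( primals, tangents )"]
train_pairs = [
    ("tf.autodiff.ForwardAccumulator  computes jacobian-vector products", "some doc"),
    ("how do I pad a sequence", "tf.autodiff.ForwardAccumulator( primals, tangents )"),
    ("how do I pad a sequence", "an unrelated documentation entry"),
]
clean_pairs = filter_training_pairs(train_pairs, eval_queries, eval_docs)
```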

Models should be trained to keep identifiers intact: dotted module paths, function names, argument names, and versioned aliases are often the decisive tokens. Generic documentation boilerplate should be treated as weak evidence.
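A blunt way to treat boilerplate as weak evidence is to remove it before lexical scoring or before building training text. The sketch below strips a small phrase list drawn from the boilerplate quoted in this document; the list is an assumption and is not exhaustive, and down-weighting these terms would be a softer alternative to deletion.

```python
import re

# Boilerplate phrases observed in the example queries and documents.
BOILERPLATE_PHRASES = [
    "See Migration guide for more details.",
    "Compat aliases for migration",
    "View aliases",
]
_BOILERPLATE_RE = re.compile(
    "|".join(re.escape(p) for p in BOILERPLATE_PHRASES), re.IGNORECASE
)

def strip_boilerplate(text):
    """Remove known boilerplate phrases and collapse leftover whitespace."""
    cleaned = _BOILERPLATE_RE.sub(" ", text)
    return " ".join(cleaned.split())

cleaned_query = strip_boilerplate(
    "tf.compat.v1.confusion_matrix Computes the confusion matrix from "
    "predictions and labels. View aliases Compat aliases for migration"
)
```

After stripping, the query keeps only the dotted API path and its one-sentence summary, which are the tokens that distinguish it from neighboring entries.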

Synthetic Data Guidance

For document-to-question generation, use non-evaluation API reference pages and generate short programming questions, API-name lookups, and usage-intent queries that are answerable from the selected documentation. Preserve signatures, argument names, return types, warnings, and version-specific notes.

For joint generation, create realistic library documentation entries and developer queries that ask how to use or locate an API. Hard negatives should be nearby APIs in the same namespace or functions with similar boilerplate but different behavior. Do not seed synthetic data with Nano evaluation queries or positive documentation entries.
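The hard-negative guidance can be approximated by ranking candidate APIs by how much of their dotted namespace they share with the positive. This is a minimal sketch under that assumption; the helper names are hypothetical, and a real pipeline would likely combine namespace proximity with a lexical or embedding similarity score.

```python
def shared_prefix_len(a, b):
    """Number of leading dotted components two API names have in common."""
    count = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        count += 1
    return count

def mine_hard_negatives(positive, candidates, k=3):
    """Pick the k candidates closest to the positive in namespace terms."""
    others = [c for c in candidates if c != positive]
    # Longest shared prefix first; alphabetical order breaks ties deterministically.
    others.sort(key=lambda c: (-shared_prefix_len(positive, c), c))
    return others[:k]

# API names taken from the example data in this document.
candidates = [
    "tf.autodiff.ForwardAccumulator",
    "tf.compat.v1.batch_to_space_nd",
    "tf.compat.v1.confusion_matrix",
    "tf.compat.v1.data.experimental.RandomDataset",
    "tf.compat.v1.distribute.OneDeviceStrategy",
]
negatives = mine_hard_negatives("tf.compat.v1.confusion_matrix", candidates, k=2)
```

Here the two selected negatives both live under tf.compat.v1, so they share the positive's namespace and boilerplate but document different behavior.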

Example Data

Query	Positive document
tf.autodiff.ForwardAccumulator Computes Jacobian-vector products ("JVP"s) using forward-mode autodiff. (102 chars)	tf.autodiff.ForwardAccumulator( primals, tangents ) Compare to tf.GradientTape which computes vector-Jacobian products ("VJP"s) using reverse-mode autodiff (backprop). Reverse mode is more attractive when computing gradients ... [truncated 225 chars](6087 chars)
tf.compat.v1.data.experimental.RandomDataset A Dataset of pseudorandom values. Inherits From: Dataset, Dataset (110 chars)	tf.compat.v1.data.experimental.RandomDataset( seed=None ) Attributes element_spec The type specification of an element of this dataset. dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3]) dataset.element_spec TensorSpec(s ... [truncated 225 chars](55309 chars)
tf.compat.v1.confusion_matrix Computes the confusion matrix from predictions and labels. View aliases Compat aliases for migration (132 chars)	See Migration guide for more details. tf.compat.v1.math.confusion_matrix tf.compat.v1.confusion_matrix( labels, predictions, num_classes=None, dtype=tf.dtypes.int32, name=None, weights=None ) The matrix columns represent the ... [truncated 225 chars](1943 chars)
tf.compat.v1.batch_to_space_nd BatchToSpace for N-D tensors of type T. View aliases Compat aliases for migration (114 chars)	See Migration guide for more details. tf.compat.v1.manip.batch_to_space_nd tf.compat.v1.batch_to_space_nd( input, block_shape, crops, name=None ) This operation reshapes the "batch" dimension 0 into M + 1 dimensions of shape ... [truncated 225 chars](3558 chars)
tf.compat.v1.distribute.OneDeviceStrategy A distribution strategy for running on a single device. Inherits From: Strategy (121 chars)	tf.compat.v1.distribute.OneDeviceStrategy( device ) Using this strategy will place any variables created in its scope on the specified device. Input distributed through this strategy will be prefetched to the specified device ... [truncated 225 chars](30793 chars)

Dataset Information

Field	Value
Nano set	NanoCodeRAG
Backing dataset	NanoCodeRAG
Task / split	NanoCodeRAGLibraryDocumentationSolutions
Hugging Face dataset	hakari-bench/NanoCodeRAG
Language	en
Category	code
Queries	200
Documents	8,683
Positive qrels	200
BM25 nDCG@10	0.2279
BM25 hit@10	0.3800
Query length avg chars	397.43
Document length avg chars	2,045.70

Public Sources

CodeRAG-Bench: Can Retrieval Augment Code Generation?; 2025; Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, and Daniel Fried; DOI: 10.18653/v1/2025.findings-naacl.176.
CodeRAG-Bench project page.
CodeRAG-Bench GitHub repository.
code-rag-bench/library-documentation dataset card.

Hugging Face Links

Nano dataset: hakari-bench/NanoCodeRAG
Source dataset: code-rag-bench/library-documentation

Source Reference Table

Title	Year	Type	URL
CodeRAG-Bench: Can Retrieval Augment Code Generation?	2025	arXiv paper	https://arxiv.org/abs/2406.14497
CodeRAG-Bench project page	2025	project page	https://code-rag-bench.github.io/
code-rag-bench/library-documentation	2024	dataset card	https://huggingface.co/datasets/code-rag-bench/library-documentation

Machine-Readable Metadata

benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: NanoCodeRAG
  backing_dataset: NanoCodeRAG
  dataset_id: hakari-bench/NanoCodeRAG
  task_name: NanoCodeRAGLibraryDocumentationSolutions
  split_name: NanoCodeRAGLibraryDocumentationSolutions
  language: en
  category: code
  document_path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGLibraryDocumentationSolutions.md
  source_research:
    primary_source_type: benchmark_paper
    paper_pdf_or_html_checked: true
    paper_url: https://arxiv.org/abs/2406.14497
    additional_source_urls:
      - https://aclanthology.org/2025.findings-naacl.176/
      - https://code-rag-bench.github.io/
      - https://github.com/code-rag-bench/code-rag-bench
      - https://huggingface.co/datasets/code-rag-bench/library-documentation
  counts:
    queries: 200
    documents: 8683
    positive_qrels: 200
  positives_per_query:
    average: 1.0
    min: 1
    median: 1.0
    max: 1
    multi_positive_queries: 0
    multi_positive_query_percent: 0.0
  text_stats_chars:
    query_mean: 397.43
    document_mean: 2045.703098
  bm25:
    ndcg_at_10: 0.227871825
    hit_at_10: 0.38
    source: dataset_bm25_column
  learning:
    original_train_split: unknown
    evaluation_split_origin: CodeRAG-Bench library documentation retrieval source sampled into NanoCodeRAG
    train_eval_overlap_audit: not_audited
    leakage_note: exclude NanoCodeRAG library-documentation queries, qrels, and positive documentation entries
    useful_training_data:
      - non-overlapping Python API documentation retrieval pairs
      - DocPrompting-style natural-language intent to documentation pairs
      - docstring and example code to reference-page retrieval
      - library search logs and API usage examples with overlap removed
    synthetic_data:
      document_generation: realistic Python API documentation with signatures, parameters, examples, aliases, and version notes
      question_generation: API-name, usage-intent, and troubleshooting queries grounded in those documentation entries
      answerability: the selected document should contain the exact API behavior, signature, or argument needed by the query
    multi_positive_training: single_positive_question_document_focus
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoCodeRAG
    source_urls:
      - label: CodeRAG-Bench arXiv
        url: https://arxiv.org/abs/2406.14497
      - label: CodeRAG-Bench project page
        url: https://code-rag-bench.github.io/
      - label: CodeRAG-Bench GitHub
        url: https://github.com/code-rag-bench/code-rag-bench
      - label: code-rag-bench/library-documentation
        url: https://huggingface.co/datasets/code-rag-bench/library-documentation
    source_notes: []
  references:
    - title: "CodeRAG-Bench: Can Retrieval Augment Code Generation?"
      url: https://arxiv.org/abs/2406.14497
      year: 2025
      doi: 10.18653/v1/2025.findings-naacl.176
      is_paper: true
      source_confidence: definitive_paper_link