# NanoCodeRAG / NanoCodeRAGLibraryDocumentationSolutions

## Overview

CodeRAG-Bench studies whether retrieval can support code generation, and its
library-documentation source is built from official Python library references
collected through devdocs.io. This Nano task uses API names or short reference
descriptions as queries and retrieves documentation entries, often TensorFlow
pages. The observed records include signatures, aliases, arguments, examples,
and migration notes, so the task asks whether a retriever can find the exact
reference page that would ground API-aware generation.

## Details

### What the Original Data Measures

[CodeRAG-Bench: Can Retrieval Augment Code Generation?](https://arxiv.org/abs/2406.14497)
introduces a retrieval-augmented code generation benchmark with a heterogeneous
retrieval datastore. The paper reports five retrieval sources: programming
solutions, online tutorials, Python library documentation, Stack Overflow posts,
and GitHub files. For library documentation, it collects the official Python
library documentation provided by devdocs.io. This source is intended mainly to
help open-domain and repository-level programming tasks that require
library-specific functions.

The same paper manually annotates canonical documents for code-generation tasks
and evaluates retrieval with nDCG@10, precision, and recall. It also finds that
current retrievers still struggle when useful contexts have limited lexical
overlap with the query. In this Nano split, the retrieval surface is the
documentation source itself: the correct document is the API documentation entry
associated with the query.

### Observed Data Profile

The Nano split has 200 queries, 8,683 documents, and 200 positive qrel rows.
Every query has one positive. Queries average 397.43 characters, but the median
is only 110 characters; the long tail comes from API entries whose query text
includes unusually long reference material. Documents average 2,045.70
characters, with some very long documentation pages.
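The gap between the mean and the median can be illustrated with a minimal sketch. The lengths below are made-up values chosen only to show the effect of a single long query; they are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical query lengths in characters (illustrative, not dataset values):
# most queries are short, one carries a long block of reference text.
query_lengths = [95, 102, 110, 114, 121, 132, 2400]

avg_len = mean(query_lengths)    # pulled upward by the single long query
med_len = median(query_lengths)  # unaffected by how long the outlier is

print(f"mean={avg_len:.2f} median={med_len}")
```

A single outlier is enough to push the mean to several times the median, which is the same pattern the split statistics show.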

The sampled data is dominated by TensorFlow-style API documentation: function or
class names such as `tf.autodiff.ForwardAccumulator`,
`tf.compat.v1.confusion_matrix`, and `tf.compat.v1.batch_to_space_nd`, followed
by a short description and alias notes. The relevant documents contain method
signatures, argument descriptions, examples, deprecation warnings, and migration
guidance.

### BM25 Difficulty

Using the dataset-provided BM25 candidate column, BM25 reaches nDCG@10 = 0.2279
and hit@10 = 0.3800. BM25 ranks 19 positives first (9.5% of queries) and finds
76 positives (38%) in the top 10. This is a difficult lexical retrieval task
because many documents repeat generic documentation phrases such as
"View aliases", "Compat aliases", and "Migration guide", while the meaningful
disambiguator may be a dotted API path.
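Because every query has exactly one positive, both metrics reduce to simple functions of the positive's rank. The sketch below assumes binary relevance and the standard log2 discount (so the ideal DCG is 1); the function names are illustrative and not part of any benchmark tooling. It also checks that the reported counts are consistent with the reported scores.

```python
import math

def ndcg_at_k_single_positive(rank, k=10):
    """nDCG@k for a query with exactly one relevant document.

    `rank` is the 1-based position of the positive, or None if it was
    not retrieved. With one positive and binary gain, the ideal DCG is 1.
    """
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)

def hit_at_k(rank, k=10):
    """1.0 if the positive appears in the top k, else 0.0."""
    return 1.0 if rank is not None and rank <= k else 0.0

# Counts reported in this section.
num_queries = 200
ranked_first = 19
in_top_10 = 76

hit_rate = in_top_10 / num_queries

# The other top-10 positives sit somewhere in ranks 2..10, which bounds
# the mean nDCG@10 from below (all at rank 10) and above (all at rank 2).
others = in_top_10 - ranked_first
lower = (ranked_first + others * ndcg_at_k_single_positive(10)) / num_queries
upper = (ranked_first + others * ndcg_at_k_single_positive(2)) / num_queries

print(f"hit@10={hit_rate:.4f} nDCG@10 in [{lower:.4f}, {upper:.4f}]")
```

The reported nDCG@10 of 0.2279 falls inside these bounds, so the rank counts and the aggregate scores agree.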

Observed failures include TensorFlow AutoGraph and audio APIs where BM25 ranks
unrelated Keras constraint or optimizer documentation above the positive. A
strong retriever must preserve exact API identifiers and namespace structure,
while also using semantic clues from the short API summary.
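One way to keep API identifiers and namespace structure visible to a lexical or hybrid retriever is to tokenize dotted paths as whole units and also emit their components. This is a minimal sketch under that assumption; the regular expression and the `tokenize_api_text` helper are hypothetical and do not describe how the dataset's BM25 column was produced.

```python
import re

# An identifier, optionally followed by more dot-separated identifiers.
# Tokens that start with a digit are skipped in this simplified sketch.
_TOKEN = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*")

def tokenize_api_text(text):
    """Split text into tokens, keeping dotted API paths intact.

    A dotted path is emitted verbatim (case preserved) and then again as
    its individual components, so both exact and partial matches remain
    possible. Plain words are lowercased.
    """
    tokens = []
    for match in _TOKEN.finditer(text):
        token = match.group(0)
        if "." in token:
            tokens.append(token)             # full path, e.g. tf.compat.v1.x
            tokens.extend(token.split("."))  # namespace components
        else:
            tokens.append(token.lower())
    return tokens

tokens = tokenize_api_text(
    "tf.compat.v1.confusion_matrix Computes the confusion matrix."
)
print(tokens)
```

Keeping `confusion_matrix` as one component, rather than splitting on underscores, preserves the distinction between the function name and the plain words "confusion" and "matrix" in the summary.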

### Training Data That May Help

Useful training data includes non-overlapping Python API documentation retrieval,
DocPrompting-style natural-language intent to documentation pairs, API search
logs, docstring-to-reference retrieval, and library-specific examples paired
with the reference page that explains them. Training should exclude the
CodeRAG-Bench library-documentation evaluation queries, qrels, and positive
documentation entries used by this Nano split.
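The exclusion rule above can be sketched as a simple filter. The sketch assumes training data is a list of `(query, document)` string pairs and only removes matches that are identical after whitespace and case normalization; the function names are hypothetical, and near-duplicate detection would need a stronger method.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def filter_training_pairs(train_pairs, eval_queries, eval_positive_docs):
    """Drop training pairs whose query or document appears in the eval set."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_docs = {normalize(d) for d in eval_positive_docs}
    kept = []
    for query, doc in train_pairs:
        if normalize(query) in blocked_queries:
            continue  # query overlaps an evaluation query
        if normalize(doc) in blocked_docs:
            continue  # document overlaps an evaluation positive
        kept.append((query, doc))
    return kept

# Toy data for illustration only.
train_pairs = [
    ("how to build a confusion matrix", "doc A"),
    ("tf.autodiff.ForwardAccumulator  computes JVPs", "doc B"),
    ("reshape a batch dimension", "Doc  C"),
]
eval_queries = ["tf.autodiff.ForwardAccumulator computes JVPs"]
eval_positive_docs = ["doc c"]

clean_pairs = filter_training_pairs(train_pairs, eval_queries, eval_positive_docs)
print(clean_pairs)
```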

Models should be trained to keep identifiers intact: dotted module paths,
function names, argument names, and versioned aliases are often the decisive
tokens. Generic documentation boilerplate should be treated as weak evidence.
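One common way to quantify how little evidence a boilerplate term carries is inverse document frequency. The sketch below uses the BM25-style IDF formula on a toy corpus assembled for illustration; the four documents are not dataset records, and whitespace tokenization is an assumption made to keep dotted paths whole.

```python
import math

def bm25_idf(num_docs, doc_freq):
    """BM25-style inverse document frequency (non-negative variant)."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Toy corpus for illustration only.
docs = [
    "tf.compat.v1.confusion_matrix view aliases compat aliases migration guide",
    "tf.compat.v1.batch_to_space_nd view aliases compat aliases migration guide",
    "tf.autodiff.ForwardAccumulator computes jacobian-vector products",
    "tf.compat.v1.distribute.OneDeviceStrategy migration guide",
]

def idf_of(term):
    """IDF of `term` over the toy corpus, counting each document once."""
    doc_freq = sum(1 for doc in docs if term in set(doc.split()))
    return bm25_idf(len(docs), doc_freq)

idf_name = idf_of("tf.compat.v1.confusion_matrix")
idf_aliases = idf_of("aliases")
idf_guide = idf_of("guide")

print(f"api name={idf_name:.3f} aliases={idf_aliases:.3f} guide={idf_guide:.3f}")
```

The full API path receives the highest weight and the most widespread boilerplate term the lowest, which matches the intent of treating identifiers as decisive and boilerplate as weak evidence.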

### Synthetic Data Guidance

For document-to-question generation, use non-evaluation API reference pages and
generate short programming questions, API-name lookups, and usage-intent queries
that are answerable from the selected documentation. Preserve signatures,
argument names, return types, warnings, and version-specific notes.

For joint generation, create realistic library documentation entries and
developer queries that ask how to use or locate an API. Hard negatives should be
nearby APIs in the same namespace or functions with similar boilerplate but
different behavior. Do not seed synthetic data with Nano evaluation queries or
positive documentation entries.
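Selecting hard negatives from nearby APIs can be sketched by ranking candidates on how many leading namespace components they share with the positive. This is one possible heuristic, not a prescribed procedure; the helper names are hypothetical, and the API names are the examples already quoted in this document.

```python
def shared_prefix_depth(a, b):
    """Number of leading dot-separated components two API paths share."""
    depth = 0
    for left, right in zip(a.split("."), b.split(".")):
        if left != right:
            break
        depth += 1
    return depth

def pick_hard_negatives(positive, candidates, limit=2):
    """Return up to `limit` candidates closest to `positive` by namespace.

    Ties are broken alphabetically so the result is deterministic.
    """
    others = [c for c in candidates if c != positive]
    others.sort(key=lambda c: (-shared_prefix_depth(positive, c), c))
    return others[:limit]

candidates = [
    "tf.compat.v1.batch_to_space_nd",
    "tf.autodiff.ForwardAccumulator",
    "tf.compat.v1.distribute.OneDeviceStrategy",
    "tf.compat.v1.confusion_matrix",
]
negatives = pick_hard_negatives("tf.compat.v1.confusion_matrix", candidates)
print(negatives)
```

Candidates under `tf.compat.v1` are preferred over `tf.autodiff` entries because they share more of the path and therefore more of the surrounding boilerplate.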

## Example Data

| Query | Positive document |
| --- | --- |
| tf.autodiff.ForwardAccumulator Computes Jacobian-vector products ("JVP"s) using forward-mode autodiff. (102 chars) | tf.autodiff.ForwardAccumulator( primals, tangents ) Compare to tf.GradientTape which computes vector-Jacobian products ("VJP"s) using reverse-mode autodiff (backprop). Reverse mode is more attractive when computing gradients ... [truncated 225 chars] (6087 chars) |
| tf.compat.v1.data.experimental.RandomDataset A Dataset of pseudorandom values. Inherits From: Dataset, Dataset (110 chars) | tf.compat.v1.data.experimental.RandomDataset( seed=None ) Attributes element_spec The type specification of an element of this dataset. dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3]) dataset.element_spec TensorSpec(s ... [truncated 225 chars] (55309 chars) |
| tf.compat.v1.confusion_matrix Computes the confusion matrix from predictions and labels. View aliases Compat aliases for migration (132 chars) | See Migration guide for more details. tf.compat.v1.math.confusion_matrix tf.compat.v1.confusion_matrix( labels, predictions, num_classes=None, dtype=tf.dtypes.int32, name=None, weights=None ) The matrix columns represent the ... [truncated 225 chars] (1943 chars) |
| tf.compat.v1.batch_to_space_nd BatchToSpace for N-D tensors of type T. View aliases Compat aliases for migration (114 chars) | See Migration guide for more details. tf.compat.v1.manip.batch_to_space_nd tf.compat.v1.batch_to_space_nd( input, block_shape, crops, name=None ) This operation reshapes the "batch" dimension 0 into M + 1 dimensions of shape ... [truncated 225 chars] (3558 chars) |
| tf.compat.v1.distribute.OneDeviceStrategy A distribution strategy for running on a single device. Inherits From: Strategy (121 chars) | tf.compat.v1.distribute.OneDeviceStrategy( device ) Using this strategy will place any variables created in its scope on the specified device. Input distributed through this strategy will be prefetched to the specified device ... [truncated 225 chars] (30793 chars) |

## Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoCodeRAG |
| Backing dataset | NanoCodeRAG |
| Task / split | NanoCodeRAGLibraryDocumentationSolutions |
| Hugging Face dataset | [hakari-bench/NanoCodeRAG](https://huggingface.co/datasets/hakari-bench/NanoCodeRAG) |
| Language | en |
| Category | code |
| Queries | 200 |
| Documents | 8,683 |
| Positive qrels | 200 |
| BM25 nDCG@10 | 0.2279 |
| BM25 hit@10 | 0.3800 |
| Query length avg chars | 397.43 |
| Document length avg chars | 2,045.70 |

### Public Sources

- [CodeRAG-Bench: Can Retrieval Augment Code Generation?](https://arxiv.org/abs/2406.14497); 2025; Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, and Daniel Fried; DOI: `10.18653/v1/2025.findings-naacl.176`.
- [CodeRAG-Bench project page](https://code-rag-bench.github.io/).
- [CodeRAG-Bench GitHub repository](https://github.com/code-rag-bench/code-rag-bench).
- [code-rag-bench/library-documentation dataset card](https://huggingface.co/datasets/code-rag-bench/library-documentation).

### Hugging Face Links

- Nano dataset: [hakari-bench/NanoCodeRAG](https://huggingface.co/datasets/hakari-bench/NanoCodeRAG)
- Source dataset: [code-rag-bench/library-documentation](https://huggingface.co/datasets/code-rag-bench/library-documentation)

### Source Reference Table

| Title | Year | Type | URL |
| --- | ---: | --- | --- |
| CodeRAG-Bench: Can Retrieval Augment Code Generation? | 2025 | paper (Findings of NAACL 2025; arXiv preprint) | https://arxiv.org/abs/2406.14497 |
| CodeRAG-Bench project page | 2025 | project page | https://code-rag-bench.github.io/ |
| code-rag-bench/library-documentation | 2024 | dataset card | https://huggingface.co/datasets/code-rag-bench/library-documentation |

## Machine-Readable Metadata

<!-- benchmark-task-metadata:v1 -->

```yaml
benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: NanoCodeRAG
  backing_dataset: NanoCodeRAG
  dataset_id: hakari-bench/NanoCodeRAG
  task_name: NanoCodeRAGLibraryDocumentationSolutions
  split_name: NanoCodeRAGLibraryDocumentationSolutions
  language: en
  category: code
  document_path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGLibraryDocumentationSolutions.md
  source_research:
    primary_source_type: benchmark_paper
    paper_pdf_or_html_checked: true
    paper_url: https://arxiv.org/abs/2406.14497
    additional_source_urls:
      - https://aclanthology.org/2025.findings-naacl.176/
      - https://code-rag-bench.github.io/
      - https://github.com/code-rag-bench/code-rag-bench
      - https://huggingface.co/datasets/code-rag-bench/library-documentation
  counts:
    queries: 200
    documents: 8683
    positive_qrels: 200
  positives_per_query:
    average: 1.0
    min: 1
    median: 1.0
    max: 1
    multi_positive_queries: 0
    multi_positive_query_percent: 0.0
  text_stats_chars:
    query_mean: 397.43
    document_mean: 2045.703098
  bm25:
    ndcg_at_10: 0.227871825
    hit_at_10: 0.38
    source: dataset_bm25_column
  learning:
    original_train_split: unknown
    evaluation_split_origin: CodeRAG-Bench library documentation retrieval source sampled into NanoCodeRAG
    train_eval_overlap_audit: not_audited
    leakage_note: exclude NanoCodeRAG library-documentation queries, qrels, and positive documentation entries
    useful_training_data:
      - non-overlapping Python API documentation retrieval pairs
      - DocPrompting-style natural-language intent to documentation pairs
      - docstring and example code to reference-page retrieval
      - library search logs and API usage examples with overlap removed
    synthetic_data:
      document_generation: realistic Python API documentation with signatures, parameters, examples, aliases, and version notes
      question_generation: API-name, usage-intent, and troubleshooting queries grounded in those documentation entries
      answerability: the selected document should contain the exact API behavior, signature, or argument needed by the query
    multi_positive_training: single_positive_question_document_focus
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoCodeRAG
    source_urls:
      - label: CodeRAG-Bench arXiv
        url: https://arxiv.org/abs/2406.14497
      - label: CodeRAG-Bench project page
        url: https://code-rag-bench.github.io/
      - label: CodeRAG-Bench GitHub
        url: https://github.com/code-rag-bench/code-rag-bench
      - label: code-rag-bench/library-documentation
        url: https://huggingface.co/datasets/code-rag-bench/library-documentation
    source_notes: []
  references:
    - title: "CodeRAG-Bench: Can Retrieval Augment Code Generation?"
      url: https://arxiv.org/abs/2406.14497
      year: 2025
      doi: 10.18653/v1/2025.findings-naacl.176
      is_paper: true
      source_confidence: definitive_paper_link
```