# NanoCodeRAG / NanoCodeRAGLibraryDocumentationSolutions
## Overview
CodeRAG-Bench studies whether retrieval can support code generation, and its
library-documentation source is built from official Python library references
collected through devdocs.io. This Nano task uses API names or short reference
descriptions as queries and retrieves documentation entries, often TensorFlow
pages. The observed records include signatures, aliases, arguments, examples,
and migration notes, so the task asks whether a retriever can find the exact
reference page that would ground API-aware generation.
## Details
### What the Original Data Measures
[CodeRAG-Bench: Can Retrieval Augment Code Generation?](https://arxiv.org/abs/2406.14497)
introduces a retrieval-augmented code generation benchmark with a heterogeneous
retrieval datastore. The paper reports five retrieval sources: programming
solutions, online tutorials, Python library documentation, Stack Overflow posts,
and GitHub files. For library documentation, it collects the official
documentation that devdocs.io provides for Python libraries, a source aimed
mainly at open-domain and repository-level programming tasks that require
library-specific functions.
The same paper manually annotates canonical documents for code-generation tasks
and evaluates retrieval with nDCG@10, precision, and recall. It also finds that
current retrievers still struggle when useful contexts have limited lexical
overlap. In this Nano split, the retrieval surface is the documentation source
itself: the correct document is the API documentation entry associated with the
query.
### Observed Data Profile
The Nano split has 200 queries, 8,683 documents, and 200 positive qrel rows.
Every query has one positive. Queries average 397.43 characters, but the median
is only 110 characters; the long tail comes from API entries whose query text
includes unusually long reference material. Documents average 2,045.70
characters, with some very long documentation pages.
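The gap between the mean and the median is what a long-tailed length distribution looks like. The sketch below illustrates the effect on hypothetical toy strings, not on the actual Nano queries; `length_profile` and `toy_queries` are names introduced here for illustration only.

```python
from statistics import mean, median


def length_profile(texts):
    """Return mean, median, and max character length for a list of strings."""
    lengths = [len(t) for t in texts]
    return {"mean": mean(lengths), "median": median(lengths), "max": max(lengths)}


# Hypothetical toy queries: four short entries and one very long outlier.
toy_queries = ["a" * 100, "b" * 110, "c" * 120, "d" * 105, "e" * 2000]
profile = length_profile(toy_queries)
print(profile)
```

A single long entry is enough to pull the mean far above the median, so the median is likely the better guide to a typical query length in this split.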
The sampled data is dominated by TensorFlow-style API documentation. A typical
query is a function or class name such as `tf.autodiff.ForwardAccumulator`,
`tf.compat.v1.confusion_matrix`, or `tf.compat.v1.batch_to_space_nd`, followed
by a short description and alias notes. The relevant documents contain method
signatures, argument descriptions, examples, deprecation warnings, and migration
guidance.
### BM25 Difficulty
Using the dataset-provided BM25 candidate column, BM25 reaches nDCG@10 = 0.2279
and hit@10 = 0.3800. BM25 ranks 19 positives first and finds 76 of the 200
positives in the top 10, which is where the hit@10 of 0.38 comes from. This is a
difficult lexical retrieval task because many documents repeat
generic documentation phrases such as "View aliases", "Compat aliases", and
"Migration guide", while the meaningful disambiguator may be a dotted API path.
Observed failures include TensorFlow AutoGraph and audio APIs where BM25 ranks
unrelated Keras constraint or optimizer documentation above the positive. A
strong retriever must preserve exact API identifiers and namespace structure,
while also using semantic clues from the short API summary.
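Because every query has exactly one positive, both metrics reduce to simple functions of the positive's rank: the ideal DCG is 1, so nDCG@10 is `1 / log2(rank + 1)` when the positive is in the top 10 and 0 otherwise. The sketch below is a minimal illustration of that arithmetic, not the evaluation code used to produce the reported numbers. The function names are assumptions, and the rank lists are hypothetical: the source reports only that 19 positives rank first and 76 land in the top 10, so the ranks of the other 57 hits are bracketed between the best case (rank 2) and the worst case (rank 10).

```python
import math


def ndcg_at_k_single_positive(rank, k=10):
    """nDCG@k when a query has exactly one relevant document.

    `rank` is the 1-based position of the positive, or None if it was not retrieved.
    With one positive the ideal DCG is 1, so nDCG equals the discounted gain.
    """
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)


def hit_at_k(rank, k=10):
    """1.0 if the positive appears within the top k, else 0.0."""
    return 1.0 if rank is not None and rank <= k else 0.0


def evaluate(ranks, k=10):
    """Average both metrics over a list of per-query ranks."""
    n = len(ranks)
    return {
        "ndcg": sum(ndcg_at_k_single_positive(r, k) for r in ranks) / n,
        "hit": sum(hit_at_k(r, k) for r in ranks) / n,
    }


# Hypothetical rank lists consistent with the reported counts:
# 19 positives at rank 1, 57 more inside the top 10, 124 misses.
worst_case = evaluate([1] * 19 + [10] * 57 + [None] * 124)
best_case = evaluate([1] * 19 + [2] * 57 + [None] * 124)
print(worst_case, best_case)
```

The reported nDCG@10 of 0.2279 falls between the two bracketing cases, which suggests that the positives BM25 does retrieve outside rank 1 are spread across the top 10 rather than clustered at either end.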
### Training Data That May Help
Useful training data includes non-overlapping Python API documentation retrieval,
DocPrompting-style natural-language intent to documentation pairs, API search
logs, docstring-to-reference retrieval, and library-specific examples paired
with the reference page that explains them. Training should exclude the
CodeRAG-Bench library-documentation evaluation queries, qrels, and positive
documentation entries used by this Nano split.
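One way to apply this exclusion is to drop any candidate training pair whose query or document matches evaluation material. The sketch below assumes training data arrives as `(query, document)` string pairs and uses exact matching after case and whitespace normalization; `filter_training_pairs` is a hypothetical helper, and near-duplicate detection is not covered.

```python
def _normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def filter_training_pairs(pairs, eval_queries, eval_positive_docs):
    """Keep only (query, document) pairs that do not overlap with evaluation data.

    A pair is dropped if its query matches an evaluation query or its document
    matches an evaluation positive, after normalization.
    """
    blocked_queries = {_normalize(q) for q in eval_queries}
    blocked_docs = {_normalize(d) for d in eval_positive_docs}
    return [
        (q, d)
        for q, d in pairs
        if _normalize(q) not in blocked_queries and _normalize(d) not in blocked_docs
    ]
```

Exact matching is only a lower bar. Documentation pages are often republished with small edits, so a fuzzier comparison may be needed before claiming a split is leak-free.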
Models should be trained to keep identifiers intact: dotted module paths,
function names, argument names, and versioned aliases are often the decisive
tokens. Generic documentation boilerplate should be treated as weak evidence.
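As an illustration of what keeping identifiers intact could mean at the tokenization level, the sketch below emits each dotted API path as a single token and also emits its components, so both the exact path and the partial namespace can match. This is an assumed preprocessing scheme, not one described in the source. `tokenize` and `drop_boilerplate` are hypothetical helpers, the phrase list is taken from the boilerplate examples quoted earlier, and removing boilerplate outright is a cruder stand-in for down-weighting it.

```python
import re

# A dotted identifier path such as "tf.compat.v1.confusion_matrix", or a bare number.
_TOKEN = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*|\d+")

# Boilerplate phrases that recur across many documentation pages.
BOILERPLATE_PHRASES = ("View aliases", "Compat aliases", "Migration guide")


def drop_boilerplate(text):
    """Remove recurring boilerplate phrases (a crude form of down-weighting)."""
    for phrase in BOILERPLATE_PHRASES:
        text = text.replace(phrase, " ")
    return text


def tokenize(text):
    """Split text into tokens, keeping dotted paths whole and adding their parts."""
    tokens = []
    for match in _TOKEN.finditer(text):
        token = match.group(0)
        tokens.append(token)  # the full path stays a single token
        if "." in token:
            tokens.extend(token.split("."))  # components allow partial matches
    return tokens
```

For example, `tokenize("tf.compat.v1.confusion_matrix Computes the confusion matrix.")` yields the full path first, then `tf`, `compat`, `v1`, and `confusion_matrix`, then the plain words. The pattern also treats abbreviations such as "e.g" as dotted paths, which a production tokenizer would likely want to handle separately.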
### Synthetic Data Guidance
For document-to-question generation, use non-evaluation API reference pages and
generate short programming questions, API-name lookups, and usage-intent queries
that are answerable from the selected documentation. Preserve signatures,
argument names, return types, warnings, and version-specific notes.
For joint generation, create realistic library documentation entries and
developer queries that ask how to use or locate an API. Hard negatives should be
nearby APIs in the same namespace or functions with similar boilerplate but
different behavior. Do not seed synthetic data with Nano evaluation queries or
positive documentation entries.
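The hard-negative advice can be sketched as a namespace lookup: for a given positive API, pick other APIs that share its parent module. The helpers below are hypothetical and operate on API names only; the example names come from the sampled data described above and are used here just as strings, not as a claim about which pairs are actually confusable.

```python
def namespace(api_path):
    """Return the parent namespace of a dotted API path ('' if there is none)."""
    return api_path.rsplit(".", 1)[0] if "." in api_path else ""


def hard_negatives(positive, candidates, limit=3):
    """Pick up to `limit` candidates that share the positive's parent namespace."""
    target = namespace(positive)
    same_namespace = [c for c in candidates if c != positive and namespace(c) == target]
    return sorted(same_namespace)[:limit]


api_names = [
    "tf.autodiff.ForwardAccumulator",
    "tf.compat.v1.batch_to_space_nd",
    "tf.compat.v1.confusion_matrix",
    "tf.compat.v1.distribute.OneDeviceStrategy",
]
negatives = hard_negatives("tf.compat.v1.confusion_matrix", api_names)
print(negatives)
```

Matching on the immediate parent keeps the negatives close to the positive. Widening the match to a shared prefix such as `tf.compat.v1` would give more, but easier, negatives.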
## Example Data
| Query | Positive document |
| --- | --- |
| tf.autodiff.ForwardAccumulator Computes Jacobian-vector products ("JVP"s) using forward-mode autodiff. (102 chars) | tf.autodiff.ForwardAccumulator( primals, tangents ) Compare to tf.GradientTape which computes vector-Jacobian products ("VJP"s) using reverse-mode autodiff (backprop). Reverse mode is more attractive when computing gradients ... [truncated 225 chars](6087 chars) |
| tf.compat.v1.data.experimental.RandomDataset A Dataset of pseudorandom values. Inherits From: Dataset, Dataset (110 chars) | tf.compat.v1.data.experimental.RandomDataset( seed=None ) Attributes element_spec The type specification of an element of this dataset. dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3]) dataset.element_spec TensorSpec(s ... [truncated 225 chars](55309 chars) |
| tf.compat.v1.confusion_matrix Computes the confusion matrix from predictions and labels. View aliases Compat aliases for migration (132 chars) | See Migration guide for more details. tf.compat.v1.math.confusion_matrix tf.compat.v1.confusion_matrix( labels, predictions, num_classes=None, dtype=tf.dtypes.int32, name=None, weights=None ) The matrix columns represent the ... [truncated 225 chars](1943 chars) |
| tf.compat.v1.batch_to_space_nd BatchToSpace for N-D tensors of type T. View aliases Compat aliases for migration (114 chars) | See Migration guide for more details. tf.compat.v1.manip.batch_to_space_nd tf.compat.v1.batch_to_space_nd( input, block_shape, crops, name=None ) This operation reshapes the "batch" dimension 0 into M + 1 dimensions of shape ... [truncated 225 chars](3558 chars) |
| tf.compat.v1.distribute.OneDeviceStrategy A distribution strategy for running on a single device. Inherits From: Strategy (121 chars) | tf.compat.v1.distribute.OneDeviceStrategy( device ) Using this strategy will place any variables created in its scope on the specified device. Input distributed through this strategy will be prefetched to the specified device ... [truncated 225 chars](30793 chars) |
## Dataset Information
| Field | Value |
| --- | --- |
| Nano set | NanoCodeRAG |
| Backing dataset | NanoCodeRAG |
| Task / split | NanoCodeRAGLibraryDocumentationSolutions |
| Hugging Face dataset | [hakari-bench/NanoCodeRAG](https://huggingface.co/datasets/hakari-bench/NanoCodeRAG) |
| Language | en |
| Category | code |
| Queries | 200 |
| Documents | 8,683 |
| Positive qrels | 200 |
| BM25 nDCG@10 | 0.2279 |
| BM25 hit@10 | 0.3800 |
| Query length avg chars | 397.43 |
| Document length avg chars | 2,045.70 |
### Public Sources
- [CodeRAG-Bench: Can Retrieval Augment Code Generation?](https://arxiv.org/abs/2406.14497); 2025; Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, and Daniel Fried; DOI: `10.18653/v1/2025.findings-naacl.176`.
- [CodeRAG-Bench project page](https://code-rag-bench.github.io/).
- [CodeRAG-Bench GitHub repository](https://github.com/code-rag-bench/code-rag-bench).
- [code-rag-bench/library-documentation dataset card](https://huggingface.co/datasets/code-rag-bench/library-documentation).
### Hugging Face Links
- Nano dataset: [hakari-bench/NanoCodeRAG](https://huggingface.co/datasets/hakari-bench/NanoCodeRAG)
- Source dataset: [code-rag-bench/library-documentation](https://huggingface.co/datasets/code-rag-bench/library-documentation)
### Source Reference Table
| Title | Year | Type | URL |
| --- | ---: | --- | --- |
| CodeRAG-Bench: Can Retrieval Augment Code Generation? | 2025 | paper (arXiv preprint; Findings of NAACL 2025) | https://arxiv.org/abs/2406.14497 |
| CodeRAG-Bench project page | 2025 | project page | https://code-rag-bench.github.io/ |
| code-rag-bench/library-documentation | 2024 | dataset card | https://huggingface.co/datasets/code-rag-bench/library-documentation |
## Machine-Readable Metadata
<!-- benchmark-task-metadata:v1 -->
```yaml
benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: NanoCodeRAG
  backing_dataset: NanoCodeRAG
  dataset_id: hakari-bench/NanoCodeRAG
  task_name: NanoCodeRAGLibraryDocumentationSolutions
  split_name: NanoCodeRAGLibraryDocumentationSolutions
  language: en
  category: code
  document_path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGLibraryDocumentationSolutions.md
  source_research:
    primary_source_type: benchmark_paper
    paper_pdf_or_html_checked: true
    paper_url: https://arxiv.org/abs/2406.14497
    additional_source_urls:
      - https://aclanthology.org/2025.findings-naacl.176/
      - https://code-rag-bench.github.io/
      - https://github.com/code-rag-bench/code-rag-bench
      - https://huggingface.co/datasets/code-rag-bench/library-documentation
  counts:
    queries: 200
    documents: 8683
    positive_qrels: 200
    positives_per_query:
      average: 1.0
      min: 1
      median: 1.0
      max: 1
    multi_positive_queries: 0
    multi_positive_query_percent: 0.0
  text_stats_chars:
    query_mean: 397.43
    document_mean: 2045.703098
  bm25:
    ndcg_at_10: 0.227871825
    hit_at_10: 0.38
    source: dataset_bm25_column
  learning:
    original_train_split: unknown
    evaluation_split_origin: CodeRAG-Bench library documentation retrieval source sampled into NanoCodeRAG
    train_eval_overlap_audit: not_audited
    leakage_note: exclude NanoCodeRAG library-documentation queries, qrels, and positive documentation entries
    useful_training_data:
      - non-overlapping Python API documentation retrieval pairs
      - DocPrompting-style natural-language intent to documentation pairs
      - docstring and example code to reference-page retrieval
      - library search logs and API usage examples with overlap removed
  synthetic_data:
    document_generation: realistic Python API documentation with signatures, parameters, examples, aliases, and version notes
    question_generation: API-name, usage-intent, and troubleshooting queries grounded in those documentation entries
    answerability: the selected document should contain the exact API behavior, signature, or argument needed by the query
    multi_positive_training: single_positive_question_document_focus
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoCodeRAG
    source_urls:
      - label: CodeRAG-Bench arXiv
        url: https://arxiv.org/abs/2406.14497
      - label: CodeRAG-Bench project page
        url: https://code-rag-bench.github.io/
      - label: CodeRAG-Bench GitHub
        url: https://github.com/code-rag-bench/code-rag-bench
      - label: code-rag-bench/library-documentation
        url: https://huggingface.co/datasets/code-rag-bench/library-documentation
    source_notes: []
  references:
    - title: "CodeRAG-Bench: Can Retrieval Augment Code Generation?"
      url: https://arxiv.org/abs/2406.14497
      year: 2025
      doi: 10.18653/v1/2025.findings-naacl.176
      is_paper: true
      source_confidence: definitive_paper_link
```