NanoCodeRAG

Overview

NanoCodeRAG is the compact CodeRAG-Bench group for code retrieval-augmented generation. It evaluates whether a retriever can find useful programming context before a code-generation model answers a developer request. The group covers four source genres: Python library documentation, online tutorials, compact programming solutions, and Stack Overflow-style posts.

The tasks are all English code retrieval tasks, but they do not have the same retrieval shape. Documentation and tutorials are long explanatory documents, Stack Overflow posts mix problem statements with answers and discussion, and programming solutions are short snippets whose semantics may not repeat the query wording. This makes NanoCodeRAG a source-genre benchmark as much as a code benchmark.

Details

What the Original Group Measures

The CodeRAG-Bench paper, "CodeRAG-Bench: Can Retrieval Augment Code Generation?", studies retrieval sources for code-generation tasks and asks whether retrieved programming context can improve generation. NanoCodeRAG packages compact retrieval tasks from that setting. In each task, the query is a code-related information need, and the positive document is the library documentation entry, tutorial, programming solution, or Stack Overflow post that should help answer it.

The group measures practical developer retrieval behavior. A good retriever should find API documentation for library usage, tutorials for procedural tasks, posts for error-oriented or Q&A-style problems, and compact code solutions for implementation prompts.

Subtask Coverage

NanoCodeRAGLibraryDocumentationSolutions: API names or short API descriptions retrieve Python library documentation entries.
NanoCodeRAGOnlineTutorials: short tutorial or programming-problem titles retrieve long tutorial articles.
NanoCodeRAGProgrammingSolutions: natural-language Python prompts retrieve compact solution snippets.
NanoCodeRAGStackoverflowPosts: programming questions retrieve Stack Overflow-style posts containing answers, snippets, and discussion.

All four subtasks are single-positive in the current Nano qrels.

Observed Group Profile

The task pages report 800 queries, 800 positive qrels, and 29,664 split-local candidate documents. Queries average 184.36 characters when weighted by query count. NanoCodeRAGLibraryDocumentationSolutions has the longest average query length because some queries include detailed API descriptions, while NanoCodeRAGOnlineTutorials has short title-like queries.

Documents are much longer than in many code retrieval benchmarks: the split-local document-weighted average is 4,129.84 characters. Online tutorials and Stack Overflow posts dominate this length, while ProgrammingSolutions uses short snippets averaging 189.13 characters. This contrast is central to the group: retrieval over long prose-plus-code resources behaves differently from retrieval over compact solution code.
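The two weighted means quoted above can be reproduced from the per-task figures in the Task Summary table. This is a minimal sketch that assumes only those published per-task values. The names `TASK_STATS` and `weighted_mean` are illustrative and not part of any dataset tooling.

```python
# Per-task figures copied from the Task Summary table:
# (queries, documents, mean query chars, mean document chars)
TASK_STATS = {
    "NanoCodeRAGLibraryDocumentationSolutions": (200, 8683, 397.43, 2045.70),
    "NanoCodeRAGOnlineTutorials": (200, 9997, 51.91, 5722.55),
    "NanoCodeRAGProgrammingSolutions": (200, 984, 78.28, 189.13),
    "NanoCodeRAGStackoverflowPosts": (200, 10000, 209.83, 4735.02),
}


def weighted_mean(values, weights):
    """Return sum(value * weight) / sum(weight)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


stats = list(TASK_STATS.values())
query_counts = [s[0] for s in stats]
doc_counts = [s[1] for s in stats]

# Query lengths are weighted by query count, document lengths by document count.
query_mean = weighted_mean([s[2] for s in stats], query_counts)
doc_mean = weighted_mean([s[3] for s in stats], doc_counts)

print(sum(query_counts), sum(doc_counts))
print(round(query_mean, 2), round(doc_mean, 2))
```

Because the per-task averages are themselves rounded to two decimals, the recomputed document mean matches the reported value only to about two decimal places.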

BM25 Difficulty

Using the dataset-provided BM25 candidate columns, NanoCodeRAG has query-weighted BM25 nDCG@10 = 0.4198 and hit@10 = 0.5100. The strongest lexical task is NanoCodeRAGOnlineTutorials (nDCG@10 = 0.7472, hit@10 = 0.8400), followed by NanoCodeRAGStackoverflowPosts (nDCG@10 = 0.6902, hit@10 = 0.7950). The long documents in these two tasks often repeat query terms, API names, and error or topic words, which favors lexical matching.

The weakest task is NanoCodeRAGProgrammingSolutions (nDCG@10 = 0.0138, hit@10 = 0.0250). Short solution snippets may contain little of the natural-language prompt, so BM25 rarely finds the correct implementation. Library documentation sits between these extremes: API names help, but documentation entries can be long, nested, or phrased differently from the query.
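Because every query has exactly one positive, both metrics reduce to a function of the rank at which that positive appears. The sketch below assumes binary relevance and the standard log2 discount. The function names and the toy ranking are illustrative, and the aggregate is recomputed from the rounded per-task values in the Task Summary table rather than from the dataset's BM25 columns.

```python
import math


def hit_at_k(ranked_ids, positive_id, k=10):
    """1.0 if the single positive appears in the top k, else 0.0."""
    return 1.0 if positive_id in ranked_ids[:k] else 0.0


def ndcg_at_k(ranked_ids, positive_id, k=10):
    """nDCG@k with one relevant document.

    The ideal DCG is 1 (positive at rank 1), so nDCG equals
    1 / log2(rank + 1) when the positive is in the top k, else 0.
    """
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == positive_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Toy ranking: the positive "d2" is retrieved at rank 2.
ranking = ["d7", "d2", "d9"]
print(hit_at_k(ranking, "d2"), round(ndcg_at_k(ranking, "d2"), 4))

# Query-weighted aggregate from per-task values: (queries, nDCG@10, hit@10).
PER_TASK = [
    (200, 0.2279, 0.3800),
    (200, 0.7472, 0.8400),
    (200, 0.0138, 0.0250),
    (200, 0.6902, 0.7950),
]
total_queries = sum(n for n, _, _ in PER_TASK)
group_ndcg = sum(n * ndcg for n, ndcg, _ in PER_TASK) / total_queries
group_hit = sum(n * hit for n, _, hit in PER_TASK) / total_queries
print(round(group_ndcg, 4), round(group_hit, 4))
```

Since each task contributes 200 queries, the query-weighted aggregate coincides with the simple mean over the four tasks.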

Training Data That May Help

Useful training data includes non-overlapping CodeRAG-Bench retrieval pairs, API-documentation search pairs, tutorial title-to-article pairs, Stack Overflow question-answer retrieval pairs, and natural-language prompt-to-code solution pairs. Training should preserve source genre, because a single pooled code retrieval objective can overfit to long prose documents and underperform on short implementation snippets.
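One possible way to keep source genre visible during training is to draw each batch from a single genre and rotate through the genres, instead of sampling from one pooled list. The sketch below only illustrates that idea. The `genre_batches` helper, the genre labels, and the toy pairs are hypothetical, and the benchmark does not prescribe this procedure.

```python
import random
from itertools import cycle


def genre_batches(pairs_by_genre, batch_size, num_batches, seed=0):
    """Yield (genre, batch) tuples, rotating genres so each gets an equal share of batches."""
    rng = random.Random(seed)
    genres = cycle(sorted(pairs_by_genre))
    for _ in range(num_batches):
        genre = next(genres)
        pool = pairs_by_genre[genre]
        # Sample without replacement inside a batch; cap at the pool size.
        yield genre, rng.sample(pool, min(batch_size, len(pool)))


# Toy (query, document) pairs grouped by hypothetical genre labels.
toy_pairs = {
    "library_documentation": [("q1", "d1"), ("q2", "d2"), ("q3", "d3")],
    "programming_solutions": [("q4", "d4"), ("q5", "d5")],
}

batches = list(genre_batches(toy_pairs, batch_size=2, num_batches=4))
for genre, batch in batches:
    print(genre, batch)
```

With this rotation, a small genre such as short solution snippets receives as many batches as a large one, at the cost of revisiting its pairs more often.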

Training should exclude NanoCodeRAG evaluation queries, qrels, and positive documents. Public Stack Overflow, tutorial, or documentation data should be deduplicated against the evaluation positives when used for supervised training.
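The deduplication step can be sketched as a fingerprint filter. This is a minimal example assuming exact matching after lowercasing and whitespace normalization. The helper names are illustrative, and near-duplicates (reformatted code, partial excerpts) would need a fuzzier method that is not shown here.

```python
import hashlib
import re


def fingerprint(text):
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_eval_overlap(train_docs, eval_positive_docs):
    """Return the training documents whose fingerprint matches no evaluation positive."""
    blocked = {fingerprint(doc) for doc in eval_positive_docs}
    return [doc for doc in train_docs if fingerprint(doc) not in blocked]


# Toy data: the first training document differs from the evaluation
# positive only in whitespace, so it is removed.
train_docs = ["def add(a, b):\n    return a + b", "print('hi')"]
eval_positives = ["def  add(a, b): return a + b"]

kept = drop_eval_overlap(train_docs, eval_positives)
print(kept)
```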

Synthetic Data Guidance

Synthetic data should generate both developer requests and source documents in the correct genre. For documentation retrieval, use API signatures, parameter descriptions, examples, and version constraints. For tutorials, generate longer step-by-step articles with code examples. For Stack Overflow-style data, include titles, errors, failed attempts, answers, and accepted-solution cues. For programming-solution retrieval, generate concise prompts paired with correct short implementations.

Do not seed generation with NanoCodeRAG evaluation queries or positive documents. Negatives should share language, APIs, or task type while failing to answer the request or solve the implementation.
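A simple way to approximate negatives that share vocabulary with the request is to rank non-positive candidates by token overlap with the query. The sketch below uses Jaccard overlap as a stand-in heuristic chosen for illustration, and the helper names and toy documents are hypothetical. Overlap alone cannot confirm that a candidate fails to answer the request, so selected negatives still need a check for false negatives.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric/underscore tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_hard_negatives(query, candidates, positive_id, n=2):
    """Return ids of the n non-positive candidates with the highest token overlap.

    candidates maps doc_id -> document text.
    """
    query_tokens = tokens(query)
    scored = [
        (jaccard(query_tokens, tokens(text)), doc_id)
        for doc_id, text in candidates.items()
        if doc_id != positive_id
    ]
    # Highest overlap first; ties broken by doc_id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored[:n]]


candidates = {
    "pos": "pandas.read_csv reads a csv file",
    "neg_close": "pandas.to_csv writes a csv file",
    "neg_far": "numpy array sum",
}
hard = pick_hard_negatives("read a csv file with pandas", candidates, "pos", n=1)
print(hard)
```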

Task Summary

| Task | Retrieval focus | Queries | Docs | Positive qrels | BM25 nDCG@10 | BM25 hit@10 | Query avg chars | Doc avg chars |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoCodeRAGLibraryDocumentationSolutions | API or library query to documentation | 200 | 8,683 | 200 | 0.2279 | 0.3800 | 397.43 | 2,045.70 |
| NanoCodeRAGOnlineTutorials | tutorial title to tutorial article | 200 | 9,997 | 200 | 0.7472 | 0.8400 | 51.91 | 5,722.55 |
| NanoCodeRAGProgrammingSolutions | programming prompt to solution snippet | 200 | 984 | 200 | 0.0138 | 0.0250 | 78.28 | 189.13 |
| NanoCodeRAGStackoverflowPosts | programming question to Stack Overflow post | 200 | 10,000 | 200 | 0.6902 | 0.7950 | 209.83 | 4,735.02 |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoCodeRAG |
| Backing dataset | NanoCodeRAG |
| Hugging Face dataset | hakari-bench/NanoCodeRAG |
| Language | en |
| Category | code |
| Subtasks | 4 |
| Total queries | 800 |
| Split-local documents | 29,664 |
| Positive qrels | 800 |
| Positives per query | exactly 1.00 for every subtask |
| Query-weighted BM25 nDCG@10 | 0.4198 |
| Query-weighted BM25 hit@10 | 0.5100 |
| Mean query length | 184.36 chars, weighted by query count |
| Mean document length | 4,129.84 chars, weighted by split-local document count |

Public Sources

CodeRAG-Bench: Can Retrieval Augment Code Generation?; 2025; DOI: 10.18653/v1/2025.findings-naacl.176.

Hugging Face Links

Nano dataset: hakari-bench/NanoCodeRAG

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| CodeRAG-Bench: Can Retrieval Augment Code Generation? | 2025 | benchmark paper | https://aclanthology.org/2025.findings-naacl.176/ |

Machine-Readable Metadata

benchmark_task_group_metadata:
  schema_version: 1
  document_status: reviewed_manual
  nano_set: NanoCodeRAG
  backing_dataset: NanoCodeRAG
  dataset_id: hakari-bench/NanoCodeRAG
  language: en
  category: code
  document_path: docs/benchmark_tasks/NanoCodeRAG/index.md
  source_research:
    primary_source_type: benchmark_paper
    paper_pdf_or_html_checked: true
    no_paper_note: null
  counts:
    tasks: 4
    queries: 800
    split_local_documents: 29664
    positive_qrels: 800
  positives_per_query:
    average: 1.0
    min: 1
    median: 1.0
    max: 1
    multi_positive_tasks: 0
    multi_positive_queries: 0
  text_stats_chars:
    query_mean_weighted_by_queries: 184.3625
    document_mean_weighted_by_documents: 4129.841626299554
  bm25:
    ndcg_at_10_query_weighted: 0.419752935175
    hit_at_10_query_weighted: 0.51
    source: dataset_bm25_column
    strongest_task_by_ndcg_at_10: NanoCodeRAGOnlineTutorials
    weakest_task_by_ndcg_at_10: NanoCodeRAGProgrammingSolutions
  tasks:
    - name: NanoCodeRAGLibraryDocumentationSolutions
      path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGLibraryDocumentationSolutions.md
      retrieval_focus: api_or_library_query_to_documentation
      queries: 200
      documents: 8683
      positive_qrels: 200
      bm25_ndcg_at_10: 0.2279
      bm25_hit_at_10: 0.38
    - name: NanoCodeRAGOnlineTutorials
      path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGOnlineTutorials.md
      retrieval_focus: tutorial_title_to_tutorial_article
      queries: 200
      documents: 9997
      positive_qrels: 200
      bm25_ndcg_at_10: 0.7472
      bm25_hit_at_10: 0.84
    - name: NanoCodeRAGProgrammingSolutions
      path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGProgrammingSolutions.md
      retrieval_focus: programming_prompt_to_solution_snippet
      queries: 200
      documents: 984
      positive_qrels: 200
      bm25_ndcg_at_10: 0.0138
      bm25_hit_at_10: 0.025
    - name: NanoCodeRAGStackoverflowPosts
      path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGStackoverflowPosts.md
      retrieval_focus: programming_question_to_stackoverflow_post
      queries: 200
      documents: 10000
      positive_qrels: 200
      bm25_ndcg_at_10: 0.6902
      bm25_hit_at_10: 0.795
  learning:
    leakage_note: exclude NanoCodeRAG evaluation queries, qrels, and positive documents; deduplicate public documentation, tutorial, and Stack Overflow data before training
    useful_training_data:
      - CodeRAG-Bench retrieval pairs
      - API documentation search pairs
      - tutorial title-to-article pairs
      - Stack Overflow question-answer retrieval pairs
      - natural-language prompt-to-code solution pairs
    synthetic_data:
      document_generation: documentation pages, tutorials, Stack Overflow posts, and compact solution snippets with realistic APIs and code
      question_generation: developer requests, API queries, tutorial titles, errors, and implementation prompts grounded in the source document
      answerability: positives must help solve the programming request or supply the needed code context
    multi_positive_training: single_positive_question_document_focus
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoCodeRAG
    source_urls:
      - label: CodeRAG-Bench ACL Anthology
        url: https://aclanthology.org/2025.findings-naacl.176/
  references:
    - title: "CodeRAG-Bench: Can Retrieval Augment Code Generation?"
      url: https://aclanthology.org/2025.findings-naacl.176/
      year: 2025
      doi: 10.18653/v1/2025.findings-naacl.176
      is_paper: true
      source_confidence: definitive_paper_link