NanoCodeRAG / NanoCodeRAGStackoverflowPosts

Overview

CodeRAG-Bench treats Stack Overflow posts as developer knowledge that can augment code generation, using question-answer posts from the StackExchange portion of RedPajama-1T as retrievable documents. In this Nano split, a programming question, usually beginning with a title and short problem description, must retrieve a long post containing answers, code examples, caveats, and discussion. The observed topics include Photoshop automation, locked files in C#, concurrent database editing, MySQL triggers, and IIS bandwidth behavior. A relevant post is therefore one that offers a practical fix or design answer for the specific problem, not merely one that shares the topic.

Details

What the Original Data Measures

The paper "CodeRAG-Bench: Can Retrieval Augment Code Generation?" collects Stack Overflow posts as one of its five developer retrieval sources, using the StackExchange split of RedPajama-1T. It treats each post as a retrievable document with a question, code responses, and textual explanations. Its open-retrieval analysis reports that Stack Overflow posts can improve general programming generation, because a retrieved post may contain the same programming problem together with code and detailed explanations.

This Nano split focuses on retrieving those community Q&A documents directly. The source is less formal than documentation: posts include multiple answers, partial fixes, warnings, tool recommendations, and conversational text.

Observed Data Profile

The Nano split has 200 queries, 10,000 documents, and 200 positive qrel rows. Every query has exactly one positive. Queries average 209.84 characters and often begin with "Q:" followed by a title and problem details. Documents average 4,735.05 characters and may contain several answers, code examples, caveats, and links.
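The figures above can be recomputed from the raw split with a few lines of Python. The following is a minimal sketch, assuming the queries and documents are available as id-to-text dictionaries and the qrels as (query id, document id) pairs; that layout, the function name `profile_split`, and the toy records are illustrative assumptions, not the dataset's actual schema.

```python
from collections import Counter
from statistics import mean


def profile_split(queries, documents, qrels):
    """Summarize a retrieval split.

    queries / documents: dict mapping id -> text (assumed layout).
    qrels: list of (query_id, doc_id) positive pairs (assumed layout).
    """
    positives_per_query = Counter(query_id for query_id, _ in qrels)
    return {
        "queries": len(queries),
        "documents": len(documents),
        "positive_qrels": len(qrels),
        "query_mean_chars": mean(len(text) for text in queries.values()),
        "document_mean_chars": mean(len(text) for text in documents.values()),
        "max_positives_per_query": max(positives_per_query.values()),
        # How many queries use the "Q:" prefix described above.
        "queries_starting_with_q": sum(
            text.startswith("Q:") for text in queries.values()
        ),
    }


# Toy records written for this sketch; they are not rows from the dataset.
toy_queries = {
    "q1": "Q: How do I delete a locked file?",
    "q2": "Q: MySQL trigger error",
}
toy_documents = {
    "d1": "A: Use a handle viewer to find the locking process.",
    "d2": "A: Raise an error from inside the trigger body.",
    "d3": "An unrelated post about font paths.",
}
toy_qrels = [("q1", "d1"), ("q2", "d2")]

profile = profile_split(toy_queries, toy_documents, toy_qrels)
```

Run against the real split, the same function should reproduce the counts and character averages listed in this section, provided the text fields are measured in characters rather than tokens.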

The sampled queries include Mac font lookup from Photoshop automation, deleting a locked file in C#, concurrent database editing, MySQL trigger errors, and IIS bandwidth throttling. The positives are practical answer threads, not polished reference pages, so relevance may depend on matching the exact error condition or development environment.

BM25 Difficulty

Using the dataset-provided BM25 candidate column, BM25 reaches nDCG@10 = 0.6902 and hit@10 = 0.7950. It ranks 115 positives first and finds 159 positives in the top 10. Lexical matching is often strong because the query text is copied from the post's question and the answer thread repeats product names, languages, and error phrases.

Most misses are near-neighbor confusions, where BM25 ranks a post about similar technology above the positive. For TortoiseSVN branching, BM25 retrieves another version-control discussion before the positive. For SQL Server's equivalent of MySQL REPLACE INTO, it retrieves a database hosting cost discussion. The retriever must distinguish the requested operation, platform, and tool from other posts that use similar technology words.
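Because every query has exactly one positive, both metrics reduce to simple functions of the positive's rank: the ideal DCG is 1, so nDCG@10 for a query is 1 / log2(rank + 1) when the positive is in the top 10 and 0 otherwise. The sketch below illustrates that calculation under the assumption that rankings are plain lists of document ids; the function name and toy rankings are hypothetical, and the dataset's own BM25 numbers may have been produced with a different tool.

```python
import math


def single_positive_metrics(rankings, positives, k=10):
    """Compute nDCG@k and hit@k when each query has one positive.

    rankings: dict query_id -> ranked list of doc ids (assumed layout).
    positives: dict query_id -> the single positive doc id.
    """
    ndcg_total = 0.0
    hits = 0
    ranked_first = 0
    for query_id, ranked in rankings.items():
        top_k = ranked[:k]
        positive_id = positives[query_id]
        if positive_id in top_k:
            rank = top_k.index(positive_id) + 1  # 1-based rank
            # Ideal DCG is 1 / log2(2) = 1, so DCG equals nDCG here.
            ndcg_total += 1.0 / math.log2(rank + 1)
            hits += 1
            if rank == 1:
                ranked_first += 1
    total = len(rankings)
    return {
        "ndcg_at_k": ndcg_total / total,
        "hit_at_k": hits / total,
        "ranked_first": ranked_first,
        "hits": hits,
    }


# Toy rankings: one positive at rank 1, one at rank 3, one missing.
toy_rankings = {
    "q1": ["d1", "d7", "d8"],
    "q2": ["d5", "d6", "d2"],
    "q3": ["d4", "d5", "d6"],
}
toy_positives = {"q1": "d1", "q2": "d2", "q3": "d9"}

metrics = single_positive_metrics(toy_rankings, toy_positives)
```

For the figures reported in this section, 159 hits over 200 queries gives hit@10 = 0.795, and the 115 first-place positives alone contribute 115 / 200 = 0.575 to nDCG@10.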

Training Data That May Help

Useful training data includes non-overlapping Stack Overflow question-to-answer thread retrieval, duplicate-question retrieval, issue-to-fix pairs, API usage Q&A, and documentation-linked Q&A. Training should exclude the NanoCodeRAG Stack Overflow evaluation queries, qrels, and positive posts.
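One way to apply the exclusion rule is to filter candidate training pairs against the evaluation queries and positive post ids before training. This is a minimal sketch, assuming training pairs are dictionaries with hypothetical `query` and `doc_id` keys; it matches queries only after lowercasing and whitespace normalization, so paraphrased or near-duplicate leaks would need a stronger check.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(text.lower().split())


def drop_eval_overlap(train_pairs, eval_query_texts, eval_positive_doc_ids):
    """Remove training pairs that reuse an evaluation query or positive post.

    train_pairs: list of dicts with 'query' and 'doc_id' keys (assumed layout).
    """
    blocked_queries = {normalize(text) for text in eval_query_texts}
    blocked_docs = set(eval_positive_doc_ids)
    return [
        pair
        for pair in train_pairs
        if normalize(pair["query"]) not in blocked_queries
        and pair["doc_id"] not in blocked_docs
    ]


# Toy records written for this sketch; they are not rows from the dataset.
toy_train_pairs = [
    {"query": "Q:  Delete a locked file in C#", "doc_id": "t1"},  # leaked query
    {"query": "How to branch in SVN", "doc_id": "e9"},  # leaked positive post
    {"query": "Rotate logs in nginx", "doc_id": "t3"},  # safe to keep
]
toy_eval_queries = ["q: delete a locked file in c#"]
toy_eval_positive_ids = ["e9"]

kept_pairs = drop_eval_overlap(
    toy_train_pairs, toy_eval_queries, toy_eval_positive_ids
)
```

Filtering on both the query text and the document id matters here because a leaked positive post can appear in training under a differently worded question.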

Community Q&A data benefits from negatives that share tags or error messages but answer a different problem. Models should learn to use both the question title and body, because the title alone may be too broad.
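The two recommendations above can be sketched as a small preprocessing step: concatenate title and body into the query text, and pick negatives by tag overlap with the positive post. The post layout (a `tags` set per post), the function names, and the toy posts are assumptions made for illustration. Tag overlap alone does not prove that a post answers a different problem, so candidates may still need a duplicate check before use.

```python
def build_query_text(title, body):
    """Join title and body so the model sees more than the broad title."""
    return f"{title.strip()} {body.strip()}".strip()


def tag_matched_negatives(positive_id, posts, max_negatives=2):
    """Return ids of posts that share the most tags with the positive.

    posts: dict post_id -> {"tags": set of str} (assumed layout).
    """
    target_tags = posts[positive_id]["tags"]
    scored = []
    for post_id, post in posts.items():
        if post_id == positive_id:
            continue
        overlap = len(target_tags & post["tags"])
        if overlap > 0:
            scored.append((overlap, post_id))
    # Highest overlap first; ties broken by id for deterministic output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [post_id for _, post_id in scored[:max_negatives]]


# Toy posts written for this sketch; they are not rows from the dataset.
toy_posts = {
    "p1": {"tags": {"mysql", "triggers"}},
    "p2": {"tags": {"mysql", "replace"}},
    "p3": {"tags": {"mysql", "triggers", "insert"}},
    "p4": {"tags": {"iis"}},
}

negatives = tag_matched_negatives("p1", toy_posts)
query_text = build_query_text(
    "How to throw an error in a MySQL trigger?",
    " I have a BEFORE UPDATE trigger. ",
)
```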

Synthetic Data Guidance

For document-to-question generation, use non-evaluation Q&A posts and generate developer questions that preserve the language, framework, error, environment, and desired operation. The selected post should contain a usable answer, warning, or workaround.

For joint generation, create realistic Stack Overflow-style threads with a question, accepted answer, alternative answers, code snippets, and caveats. Hard negatives should share the same tags or tool names but solve a different failure mode. Do not use Nano evaluation queries or positive posts as seeds.
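A generated thread can be held in a small record that renders both sides of a training pair. The sketch below is illustrative only: the `SyntheticThread` class, its field names, and the rendering that prefixes the question with "Q:" and each answer with "A:" are assumptions loosely modeled on the example rows in this document, not a format defined by CodeRAG-Bench.

```python
from dataclasses import dataclass, field


@dataclass
class SyntheticThread:
    """One generated Stack Overflow-style thread (hypothetical schema)."""

    title: str
    question_body: str
    accepted_answer: str
    alternative_answers: list[str] = field(default_factory=list)
    tags: set[str] = field(default_factory=set)

    def to_query(self) -> str:
        # Title plus body, so the query keeps language, tool, and error details.
        return f"Q: {self.title} {self.question_body}"

    def to_document(self) -> str:
        # Accepted answer first, then alternatives, each marked with "A:".
        answers = [self.accepted_answer, *self.alternative_answers]
        return " ".join(f"A: {answer}" for answer in answers)


# Toy thread written for this sketch; it is not derived from evaluation data.
thread = SyntheticThread(
    title="Throttle downloads per client in a web service",
    question_body="Local clients should not be throttled, VPN clients should.",
    accepted_answer="Check the client address and cap the response rate.",
    alternative_answers=["Configure throttling on the server instead."],
    tags={"asp.net", "iis"},
)

query = thread.to_query()
document = thread.to_document()
```

Keeping tags on the record makes it straightforward to pair each synthetic thread with hard negatives that share tags but describe a different failure mode.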

Example Data

Query	Positive document
Q: How can I find the full path to a font from its display name on a Mac? I am using the Photoshop's javascript API to find the fonts in a given PSD. (149 chars)	Given a font name returned by the API, I want to find the actual physical font file that font name corresponds to on the disc. This is all happening in a python program running on OSX so I guess I'm looking for one of: * *Som ... [truncated 225 chars](5076 chars)
Q: How do I delete a file which is locked by another process in C#? I'm looking for a way to delete a file which is locked by another process using C#. I suspect the method must be able to find which process is locking the fi ... [truncated 225 chars](396 chars)	A: If you want to do it programmatically. I'm not sure... and I'd really recommend against it. If you're just troubleshooting stuff on your own machine, SysInternals Process Explorer can help you Run it, use the Find Handle c ... [truncated 225 chars](13199 chars)
Q: Editing database records by multiple users I have designed database tables (normalised, on an MS SQL server) and created a standalone windows front end for an application that will be used by a handful of users to add and ... [truncated 225 chars](334 chars)	I am concerned that if two users start editing the same record then the last to commit the update would be the 'winner' and important information may be lost. A number of solutions come to mind but I'm not sure if I am going ... [truncated 225 chars](4026 chars)
Q: Throw an error preventing a table update in a MySQL trigger If I have a trigger before the update on a table, how can I throw an error that prevents the update on that table? (177 chars)	A: CREATE TRIGGER sample_trigger_msg BEFORE INSERT FOR EACH ROW BEGIN IF(NEW.important_value) < (1*2) THEN DECLARE dummy INT; SELECT Enter your Message Here!!! INTO dummy FROM mytable WHERE mytable.id=new.id END IF; END; A: H ... [truncated 225 chars](5314 chars)
Q: Bandwith throttling in IIS 6 by IP Address I am writing an application that downloads large files in the background. All clients are logged in locally, or through a VPN. When they are logged in locally, I do not want to th ... [truncated 225 chars](391 chars)	Since this is an AIR Application, I figure I will throttle via server-side since I can do it from either the server itself (IIS 6) or the web service (asp.net / C#). Throttling through IIS 6 seems to work fine, but it seems l ... [truncated 225 chars](922 chars)

Dataset Information

Field	Value
Nano set	NanoCodeRAG
Backing dataset	NanoCodeRAG
Task / split	NanoCodeRAGStackoverflowPosts
Hugging Face dataset	hakari-bench/NanoCodeRAG
Language	en
Category	code
Queries	200
Documents	10,000
Positive qrels	200
BM25 nDCG@10	0.6902
BM25 hit@10	0.7950
Query length avg chars	209.84
Document length avg chars	4,735.05

Public Sources

CodeRAG-Bench: Can Retrieval Augment Code Generation?; 2025; Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, and Daniel Fried; DOI: 10.18653/v1/2025.findings-naacl.176.
CodeRAG-Bench project page.
CodeRAG-Bench GitHub repository.
code-rag-bench/stackoverflow-posts dataset card.

Hugging Face Links

Nano dataset: hakari-bench/NanoCodeRAG
Source dataset: code-rag-bench/stackoverflow-posts

Source Reference Table

Title	Year	Type	URL
CodeRAG-Bench: Can Retrieval Augment Code Generation?	2025	arXiv paper	https://arxiv.org/abs/2406.14497
CodeRAG-Bench project page	2025	project page	https://code-rag-bench.github.io/
code-rag-bench/stackoverflow-posts	2024	dataset card	https://huggingface.co/datasets/code-rag-bench/stackoverflow-posts

Machine-Readable Metadata

benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: NanoCodeRAG
  backing_dataset: NanoCodeRAG
  dataset_id: hakari-bench/NanoCodeRAG
  task_name: NanoCodeRAGStackoverflowPosts
  split_name: NanoCodeRAGStackoverflowPosts
  language: en
  category: code
  document_path: docs/benchmark_tasks/NanoCodeRAG/NanoCodeRAGStackoverflowPosts.md
  source_research:
    primary_source_type: benchmark_paper
    paper_pdf_or_html_checked: true
    paper_url: https://arxiv.org/abs/2406.14497
    additional_source_urls:
      - https://aclanthology.org/2025.findings-naacl.176/
      - https://code-rag-bench.github.io/
      - https://github.com/code-rag-bench/code-rag-bench
      - https://huggingface.co/datasets/code-rag-bench/stackoverflow-posts
  counts:
    queries: 200
    documents: 10000
    positive_qrels: 200
  positives_per_query:
    average: 1.0
    min: 1
    median: 1.0
    max: 1
    multi_positive_queries: 0
    multi_positive_query_percent: 0.0
  text_stats_chars:
    query_mean: 209.835
    document_mean: 4735.0462
  bm25:
    ndcg_at_10: 0.6901992074
    hit_at_10: 0.795
    source: dataset_bm25_column
  learning:
    original_train_split: unknown
    evaluation_split_origin: CodeRAG-Bench Stack Overflow posts retrieval source sampled into NanoCodeRAG
    train_eval_overlap_audit: not_audited
    leakage_note: exclude NanoCodeRAG Stack Overflow queries, qrels, and positive posts
    useful_training_data:
      - non-overlapping Stack Overflow question-to-answer thread retrieval
      - duplicate-question and related-question retrieval pairs
      - issue-to-fix and API usage Q&A pairs
      - documentation-linked Q&A with tag-matched hard negatives
    synthetic_data:
      document_generation: realistic Stack Overflow-style threads with question, accepted answer, alternative answers, code snippets, caveats, and environment details
      question_generation: developer questions preserving language, framework, error message, and desired operation
      answerability: the selected post should contain a usable answer, workaround, warning, or API usage pattern
    multi_positive_training: single_positive_question_document_focus
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoCodeRAG
    source_urls:
      - label: CodeRAG-Bench arXiv
        url: https://arxiv.org/abs/2406.14497
      - label: CodeRAG-Bench project page
        url: https://code-rag-bench.github.io/
      - label: CodeRAG-Bench GitHub
        url: https://github.com/code-rag-bench/code-rag-bench
      - label: code-rag-bench/stackoverflow-posts
        url: https://huggingface.co/datasets/code-rag-bench/stackoverflow-posts
    source_notes: []
  references:
    - title: "CodeRAG-Bench: Can Retrieval Augment Code Generation?"
      url: https://arxiv.org/abs/2406.14497
      year: 2025
      doi: 10.18653/v1/2025.findings-naacl.176
      is_paper: true
      source_confidence: definitive_paper_link