
NanoDAPFAM / NanoDAPFAMInTitlAbsToFullText

Overview

NanoDAPFAMInTitlAbsToFullText retrieves full-text patent-family records from title-abstract queries. Positives are IN-domain DAPFAM citation links where the query and target share an IPC3 class.
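The IN-domain condition can be sketched as a set-overlap check. This is a minimal illustration, not the dataset's own code. It assumes that an IPC3 class is the first three characters of an IPC symbol (section letter plus two-digit class, such as `E01`), that each family carries a list of IPC symbols, and that sharing at least one class is enough. The function names and the IPC symbols are hypothetical.

```python
def ipc3_classes(ipc_symbols):
    """Return the set of IPC3 prefixes (section + two-digit class) for a family."""
    return {s.strip().upper()[:3] for s in ipc_symbols if len(s.strip()) >= 3}


def is_in_domain(query_ipc, target_ipc):
    """True when query and target share at least one IPC3 class (assumed rule)."""
    return bool(ipc3_classes(query_ipc) & ipc3_classes(target_ipc))


# Hypothetical IPC symbols for a query family and two candidate families.
query = ["E01H 5/04", "B60L 15/20"]
print(is_in_domain(query, ["E01H 5/06"]))   # both sides contain E01
print(is_in_domain(query, ["A24F 40/50"]))  # no shared IPC3 class
```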

Details

What the Original Data Measures

The source paper, "DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval", reports that domain partitions are central to understanding patent retrieval difficulty. This split represents the same-domain (IN-domain) setting, with compact title-abstract queries and full-text target documents.

Observed Data Profile

The Nano split has 200 queries, 10,000 documents, and 3,072 positive qrels. Queries average 771.33 characters, and full-text targets average 68,924.34 characters. Positives per query average 15.36.
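The averages above follow directly from the counts. The short check below recomputes the positives-per-query average and adds two derived figures (the document-to-query length ratio and the share of labeled pairs) that are not reported in the source and are shown only for orientation.

```python
queries = 200
documents = 10_000
positive_qrels = 3_072
query_mean_chars = 771.33
doc_mean_chars = 68_924.34

positives_per_query = positive_qrels / queries
# Derived figures, not reported in the source.
length_ratio = doc_mean_chars / query_mean_chars
positive_density = positive_qrels / (queries * documents)

print(f"positives per query: {positives_per_query:.2f}")
print(f"doc/query length ratio: {length_ratio:.1f}")
print(f"positive density: {positive_density:.4%}")
```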

The query title and abstract usually identify the technical problem and proposed solution, while full target records contain detailed descriptions and claims from same-domain patent families.

BM25 Difficulty

Using the dataset-provided BM25 candidate column and scoring only the best-ranked positive per query, BM25 reaches nDCG@10 = 0.6643 and hit@10 = 0.8500. A positive is ranked first for 95 of the 200 queries, and every query has a positive in the top 100.
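The source does not spell out the exact scoring formula. The sketch below shows one plausible reading, assuming that only the best-ranked positive counts as the single relevant document, so nDCG@k reduces to `1 / log2(rank + 1)` when that rank is within `k` and 0 otherwise. The function names and the toy IDs are hypothetical.

```python
import math


def best_positive_rank(ranked_doc_ids, positive_ids):
    """1-based rank of the first positive in the candidate list, or None."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in positive_ids:
            return rank
    return None


def score_run(run, qrels, k=10):
    """Average hit@k and single-positive nDCG@k over all queries in `qrels`."""
    hits, gains = 0, 0.0
    for qid, positives in qrels.items():
        rank = best_positive_rank(run.get(qid, []), positives)
        if rank is not None and rank <= k:
            hits += 1
            # With one relevant document the ideal DCG is 1, so no division is needed.
            gains += 1.0 / math.log2(rank + 1)
    n = len(qrels)
    return {"hit": hits / n, "ndcg": gains / n}


# Toy data with hypothetical IDs, not taken from the dataset.
run = {"q1": ["d3", "d1", "d9"], "q2": ["d7", "d8", "d2"], "q3": ["d5", "d6"]}
qrels = {"q1": {"d3"}, "q2": {"d2", "d4"}, "q3": {"d0"}}
print(score_run(run, qrels, k=10))
```

Under this reading, a query whose best positive sits at rank 1 contributes 1.0 and one at rank 3 contributes 0.5, so the gap between hit@10 and nDCG@10 reflects how far down the top 10 the first positive tends to appear.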

The same-domain restriction and long targets make lexical matching effective, though the task still requires choosing among many related patents.

Training Data That May Help

Useful data includes same-domain title-abstract prior-art search against full patent records, family-level citation retrieval, and patent semantic search outside NanoDAPFAM evaluation families.
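Keeping training data outside the evaluation families can be sketched as a simple filter. This is a hedged illustration: the record layout (`query_family`, `target_family`) and the family IDs are hypothetical, and the real identifiers would come from the NanoDAPFAM evaluation split.

```python
def drop_eval_families(training_pairs, eval_family_ids):
    """Keep only pairs where neither side belongs to an evaluation family."""
    blocked = set(eval_family_ids)
    return [
        pair
        for pair in training_pairs
        if pair["query_family"] not in blocked
        and pair["target_family"] not in blocked
    ]


# Hypothetical training pairs and evaluation family IDs.
pairs = [
    {"query_family": "F10", "target_family": "F11"},
    {"query_family": "F10", "target_family": "F99"},  # target is an eval family
    {"query_family": "F98", "target_family": "F12"},  # query is an eval family
]
eval_ids = {"F98", "F99"}
kept = drop_eval_families(pairs, eval_ids)
print(kept)
```

Filtering on both sides of the pair is the conservative choice, since an evaluation family can appear either as a query or as a cited target.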

Synthetic Data Guidance

Generate compact source summaries and long same-domain target records. Positives should share a cited technical mechanism, while hard negatives should have similar IPC terms but no direct prior-art relation.
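The hard-negative rule above can be sketched as follows. This is a minimal illustration under stated assumptions: each family is represented by a precomputed set of IPC3 classes, "no direct prior-art relation" is read as no citation link in either direction, and all names and IDs are hypothetical.

```python
import random


def mine_hard_negatives(query_id, family_classes, citations, k=2, seed=0):
    """Pick up to k families that share an IPC3 class with the query but are not cited.

    family_classes: family ID -> set of IPC3 classes
    citations: family ID -> set of family IDs it cites
    """
    query_classes = family_classes[query_id]
    cited = citations.get(query_id, set())
    pool = sorted(
        fid
        for fid, classes in family_classes.items()
        if fid != query_id
        and fid not in cited                          # query does not cite it
        and query_id not in citations.get(fid, set())  # it does not cite the query
        and query_classes & classes                   # shares at least one IPC3 class
    )
    random.Random(seed).shuffle(pool)  # seeded so the sample is reproducible
    return pool[:k]


# Hypothetical families and citation links.
family_classes = {
    "F1": {"E01"},
    "F2": {"E01"},  # cited by F1, so it is a positive, not a negative
    "F3": {"E01"},  # same class, no citation link: hard negative
    "F4": {"A24"},  # different class: easy negative, skipped
    "F5": {"E01"},  # cites F1, so it is excluded as well
}
citations = {"F1": {"F2"}, "F5": {"F1"}}
print(mine_hard_negatives("F1", family_classes, citations))
```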

Example Data

| Query | Positive document |
| --- | --- |
| snow removal equipment with automatic walking function the invention relates to snow removal equipment with an automatic walking function. the snow removal equipment comprises a walking module, a working module and a control ... [truncated 225 chars] (988 chars) | multifunctional device for clearing snow an apparatus and method for clearing an accumulation of matter from a surface that includes a blade configured to collect matter upon movement of the apparatus and means to shift the c ... [truncated 225 chars] (59310 chars) |
| waste disposal devices waste disposal device including a housing defining a waste compartment for receiving enclosed waste and arranged to removably receive a cartridge containing a length of flexible tubing which operatively ... [truncated 225 chars] (891 chars) | cassette for dispensing pleated tubing a cassette for use in dispensing a pleated tubing. the cassette includes an annular body having a generally u shaped housing with an open central cylindrical core. the annular body inclu ... [truncated 225 chars] (36269 chars) |
| an article including identification for use in an electrically heated smoking system. there is provided an electrically heated smoking system (101) for receiving a smoking article (115) or cleaning article (205) configured fo ... [truncated 225 chars] (1164 chars) | apparatus for generating aerosol from an aerosolisable medium, an article of aerosolisable medium and a method of determining a parameter of an article to provide an apparatus that heats an aerosolizable medium to volatilize ... [truncated 225 chars] (52756 chars) |
| low weight carpet and carpet tile and methods of manufacture low weight and non-square carpet tile suitable for use in mass transit vehicles, particularly passenger aircraft. the carpet tile preferably weighs less than about ... [truncated 225 chars] (565 chars) | anti-static mats and carpets a novel carpet material or mat which is characterized by an extraordinary ability to quickly and comfortably discharge any build-up of a static electricity charge on a person who has built up such ... [truncated 225 chars] (17195 chars) |
| organosilicon precursors for interlayer dielectric films with low dielectric constants a method of forming a low dielectric constant interlayer dielectric film on a substrate by reacting, under chemical vapor deposition condi ... [truncated 225 chars] (546 chars) | radiation shield a radiation shield and an assembly and a reactor including the radiation shield are disclosed. the radiation shield can be used to control heat flux from a susceptor heater assembly and thereby enable better ... [truncated 225 chars] (48536 chars) |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoDAPFAM |
| Backing dataset | NanoDAPFAM |
| Task / split | NanoDAPFAMInTitlAbsToFullText |
| Hugging Face dataset | hakari-bench/NanoDAPFAM |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 10000 |
| Positive qrels | 3072 |
| Positives per query | avg 15.36, min 1, median 18.0, max 20 |
| BM25 nDCG@10 | 0.6643 |
| BM25 hit@10 | 0.8500 |
| Query length avg chars | 771.33 |
| Document length avg chars | 68924.34 |

Public Sources

Hugging Face Links

Nano dataset: https://huggingface.co/datasets/hakari-bench/NanoDAPFAM
Source dataset: https://huggingface.co/datasets/datalyes/DAPFAM_patent

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval | 2026 | arXiv paper | https://arxiv.org/abs/2506.22141 |
| DAPFAM DOI record | 2026 | DOI | https://doi.org/10.1016/j.array.2026.100720 |
| datalyes/DAPFAM_patent | 2025 | dataset card | https://huggingface.co/datasets/datalyes/DAPFAM_patent |

Machine-Readable Metadata

benchmark_task_metadata:
  schema_version: 1
  document_status: first_pass
  nano_set: NanoDAPFAM
  backing_dataset: NanoDAPFAM
  dataset_id: hakari-bench/NanoDAPFAM
  task_name: NanoDAPFAMInTitlAbsToFullText
  split_name: NanoDAPFAMInTitlAbsToFullText
  language: en
  category: natural_language
  document_path: docs/benchmark_tasks/NanoDAPFAM/NanoDAPFAMInTitlAbsToFullText.md
  source_research:
    primary_source_type: benchmark_paper
    paper_pdf_or_html_checked: true
  counts:
    queries: 200
    documents: 10000
    positive_qrels: 3072
  positives_per_query:
    average: 15.36
    min: 1
    median: 18.0
    max: 20
    multi_positive_queries: 198
  text_stats_chars:
    query_mean: 771.33
    document_mean: 68924.3436
  bm25:
    ndcg_at_10: 0.6642533082
    hit_at_10: 0.85
    source: dataset_bm25_column
  learning:
    original_train_split: not_confirmed
    evaluation_split_origin: DAPFAM IN-domain title-abstract to full-text patent-family retrieval
    train_eval_overlap_audit: not_audited
    leakage_note: exclude NanoDAPFAM evaluation family IDs and citation labels
    useful_training_data:
      - same-domain title-abstract patent retrieval
      - full-text patent prior-art search
      - family-level citation retrieval
    synthetic_data:
      document_generation: full-text same-domain patent family records
      question_generation: title and abstract source patent summaries
      answerability: positives should be cited same-domain families
    multi_positive_training: citation_family_multi_positive
  links:
    nano_dataset: https://huggingface.co/datasets/hakari-bench/NanoDAPFAM
    source_urls:
      - label: DAPFAM arXiv
        url: https://arxiv.org/abs/2506.22141
      - label: DAPFAM DOI
        url: https://doi.org/10.1016/j.array.2026.100720
      - label: datalyes/DAPFAM_patent
        url: https://huggingface.co/datasets/datalyes/DAPFAM_patent
  references:
    - title: "DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval"
      url: https://arxiv.org/abs/2506.22141
      year: 2026
      doi: 10.1016/j.array.2026.100720
      is_paper: true
      source_confidence: definitive_paper_link