---
base_model: sentence-transformers/all-MiniLM-L6-v2
language: en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- retrieval
- tool-use
- llm-agent
- r-language
---

![DARE Banner](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)

DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model that retrieves statistical and data analysis tools (R functions) **conditioned on both the user query and the data profile**.

It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.

- **Paper:** [DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval](https://huggingface.co/papers/2603.04743)
- **Repository:** [GitHub](https://github.com/AMA-CMFAI/DARE)
- **Project Page:** [DARE Webpage](https://ama-cmfai.github.io/DARE_webpage/)

## Model Details
- **Architecture:** Bi-encoder (Sentence Transformer)
- **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
- **Task:** Dense Retrieval for Tool-Augmented LLMs
- **Performance:** State of the art on R package retrieval tasks (93.47% NDCG@10).
- **Domain:** R programming language, Data Science, Statistical Analysis functions
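
As a rough illustration of what the bi-encoder setup above implies, the query and each candidate function are embedded independently, and candidates are ranked by vector similarity. The sketch below uses toy 3-dimensional vectors and cosine similarity; the helper names (`cosine`, `rank_by_cosine`) and the vectors are hypothetical, and the similarity measure used by DARE's released index is not stated here, so treat this as a minimal sketch rather than the model's actual scoring code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (hypothetical values).
toy_docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
toy_query = [0.8, 0.6, 0.0]

ranking = rank_by_cosine(toy_query, toy_docs, top_k=2)
for index, score in ranking:
    print(f"doc {index}: similarity {score:.2f}")
```

Because the document vectors do not depend on the query, they can be computed once and stored in a vector database, which is the role RPKB plays in the usage examples.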

### Usage (Sentence-Transformers)

First, install the `sentence-transformers` library, plus `chromadb`, which the RPKB examples use:
```bash
pip install -U sentence-transformers chromadb
```

### Usage with RPKB (Recommended)
Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB) to perform conditional retrieval.

```python
from huggingface_hub import snapshot_download
import chromadb
import os

# 1. Download the database folder from Hugging Face
db_path = snapshot_download(
    repo_id="Stephen-SMJ/RPKB", 
    repo_type="dataset",
    allow_patterns="RPKB/*"
)

# 2. Connect to the local ChromaDB instance
client = chromadb.PersistentClient(path=f"{db_path}/RPKB")

# 3. Access the specific collection
collection = client.get_collection(name="inference")

print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
```

### Retrieval with DARE
```python
from sentence_transformers import SentenceTransformer

# 1. Load the DARE model
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")

# 2. Define the exact input format: Query + Data Profile
# Adjacent string literals are concatenated, so the query stays one long string.
query = (
    "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. "
    "I need to identify driver elements by estimating regulatory scores based on the counts provided "
    "in the data. Please set the random seed to 123 at the start. "
    "I need to filter for fragment lengths between 150 and 600 bp and use a DNA count filter of 5. "
    "For my evaluation, please print the first value of the estimated scores (est_a) "
    "for the very first region identified."
)

# 3. Generate embedding
query_embedding = model.encode(query).tolist()

# 4. Search the database (`collection` is the ChromaDB collection loaded in the RPKB snippet)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["metadatas", "distances", "documents"]
)

# Display Top-1 Result
print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
```
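
The snippet above prints only the top-1 match. To inspect every returned candidate, the result dictionary can be unpacked as sketched below. The helper `format_hits` and the `mock_results` values are hypothetical; the mock only mirrors the nested-list layout that `collection.query` returns (one inner list per query) and the `package_name` / `function_name` metadata keys used above.

```python
def format_hits(results, query_index=0):
    """Return (rank, 'package::function', distance) tuples for one query."""
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    hits = []
    for rank, (meta, dist) in enumerate(zip(metadatas, distances), start=1):
        name = f'{meta["package_name"]}::{meta["function_name"]}'
        hits.append((rank, name, dist))
    return hits


# Hypothetical stand-in for the dict returned by collection.query(...).
mock_results = {
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "fit_model"},
        {"package_name": "pkgB", "function_name": "summarize"},
    ]],
    "distances": [[0.12, 0.35]],
}

hits = format_hits(mock_results)
for rank, name, dist in hits:
    print(f"{rank}. {name} (distance={dist:.2f})")
```

With a real query, pass the `results` object from the retrieval snippet instead of `mock_results`. Chroma reports distances, so smaller values indicate closer matches.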

## Citation

If you find DARE, RPKB, or RCodingAgent useful in your research, please cite:

```bibtex
@article{sun2026dare,
      title={DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval}, 
      author={Maojun Sun and Yue Wu and Yifei Xie and Ruijian Han and Binyan Jiang and Defeng Sun and Yancheng Yuan and Jian Huang},
      year={2026},
      eprint={2603.04743},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2603.04743}, 
}
```