---
language: en
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- retrieval
- tool-use
- llm-agent
- r-language
license: apache-2.0
base_model: sentence-transformers/all-MiniLM-L6-v2
---

![Gemini_Generated_Image_h25dizh25dizh25d (3)](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)

DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model that retrieves statistical and data analysis tools (R functions) conditioned on **both the user query and the data profile**.

It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.

## Model Details
- **Architecture:** Bi-encoder (Sentence Transformer)
- **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
- **Task:** Dense Retrieval for Tool-Augmented LLMs
- **Performance:** State-of-the-art (SoTA) on R package retrieval tasks.
- **Domain:** R programming language, Data Science, Statistical Analysis functions

<!-- ## 💡 Why DARE? (The Input Formatting)
Unlike traditional semantic search models that only take a natural language query, DARE is trained to be **distribution-conditional**. It expects a concatenated input of the user's intent AND the data profile (e.g., high-dimensional, sparse, categorical). -->

### Usage (Sentence-Transformers)

First, install the `sentence-transformers` library:
```bash
pip install -U sentence-transformers
```
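
As a bi-encoder, the model embeds the query and each candidate tool description separately, and retrieval reduces to ranking candidates by cosine similarity. The following is a minimal sketch of that ranking step, not the authors' pipeline. `rank_by_similarity`, `toy_encode`, and the candidate strings are hypothetical names and data introduced for illustration; `toy_encode` is a bag-of-words stand-in so the sketch runs without downloading any model.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(encode, query, candidates):
    """Rank candidate texts by cosine similarity to the query.

    `encode` is any callable mapping a list of strings to a sequence of
    vectors (hypothetical interface chosen for this sketch).
    """
    vectors = encode([query, *candidates])
    query_vec, candidate_vecs = vectors[0], vectors[1:]
    scored = [(text, cosine(query_vec, vec))
              for text, vec in zip(candidates, candidate_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_encode(texts):
    """Stand-in encoder (bag of words), used only to keep the sketch offline."""
    vocab = sorted({word for text in texts for word in text.lower().split()})
    vectors = []
    for text in texts:
        counts = Counter(text.lower().split())
        vectors.append([float(counts[word]) for word in vocab])
    return vectors


# Illustrative candidate descriptions, not entries from any real knowledge base.
candidates = [
    "fit a linear regression model",
    "cluster sparse count data",
    "plot a histogram",
]
ranked = rank_by_similarity(toy_encode, "cluster my sparse count matrix", candidates)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

To rank with DARE instead of the stand-in, pass `SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval").encode` as the `encode` argument; it accepts a list of strings and returns one embedding per string.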

### Usage with our RPKB (Optional but Recommended)
Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB). The snippet below also requires the `huggingface_hub` and `chromadb` packages:

```python
from huggingface_hub import snapshot_download
import chromadb

# 1. Download the database folder from Hugging Face
db_path = snapshot_download(
    repo_id="Stephen-SMJ/RPKB", 
    repo_type="dataset",
    allow_patterns="RPKB/*"  # Adjust this if your folder name is different
)

# 2. Connect to the local ChromaDB instance
client = chromadb.PersistentClient(path=f"{db_path}/RPKB")

# 3. Access the specific collection
collection = client.get_collection(name="inference")

print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
```

### Then, load the DARE model and run retrieval:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# 1. Load the DARE model
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")

# 2. Define the input: the query and the data profile in a single string
user_query = (
    "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. "
    "I need to identify driver elements by estimating regulatory scores based on the counts provided "
    "in the data. Please set the random seed to 123 at the start. "
    "I need to filter for fragment lengths between 150 and 600 bp and use a DNA count filter of 5. "
    "For my evaluation, please print the first value of the estimated scores (est_a) "
    "for the very first region identified."
)

# 3. Generate the query embedding
query_embedding = model.encode(user_query).tolist()

# 4. Search the RPKB collection (loaded in the previous step) for the 3 nearest functions
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["metadatas", "distances", "documents"]
)

# Display Top-1 Result
print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
```
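
To inspect more than the top hit, the nested result returned by `collection.query` can be flattened into a ranked list. The sketch below assumes the result layout used above (one inner list per query, with `package_name` and `function_name` in each metadata entry). `top_functions` and the `example` result are hypothetical, included only so the snippet runs without the database.

```python
def top_functions(results, query_index=0):
    """Return (package::function, distance) pairs for one query, in result order."""
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    ranked = []
    for meta, dist in zip(metadatas, distances):
        name = f'{meta["package_name"]}::{meta["function_name"]}'
        ranked.append((name, dist))
    return ranked


# Hypothetical result with made-up package and function names.
example = {
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "fit_model"},
        {"package_name": "pkgB", "function_name": "score_regions"},
    ]],
    "distances": [[0.12, 0.34]],
}

for rank, (name, dist) in enumerate(top_functions(example), start=1):
    print(f"{rank}. {name} (distance={dist:.2f})")
```

With the real collection, call `top_functions(results)` on the output of the query above; smaller distances indicate closer matches.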