---
base_model: sentence-transformers/all-MiniLM-L6-v2
language: en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- retrieval
- tool-use
- llm-agent
- r-language
---

DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model that retrieves statistical and data analysis tools (R functions) conditioned on **both the user query and the data profile**.
It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.
- **Paper:** [DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval](https://huggingface.co/papers/2603.04743)
- **Repository:** [GitHub](https://github.com/AMA-CMFAI/DARE)
- **Project Page:** [DARE Webpage](https://ama-cmfai.github.io/DARE_webpage/)
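To illustrate the bi-encoder idea described above, here is a minimal sketch of embedding-based ranking in plain Python. It assumes the query and each R function have already been encoded into vectors; the 3-dimensional vectors and the `rank_tools` helper are made up for illustration and are not part of DARE or RPKB.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_tools(query_vec, tool_vecs, top_k=3):
    """Return (tool_name, score) pairs sorted by descending cosine similarity."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in tool_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy embeddings; real embeddings would come from model.encode().
tool_vecs = {
    "stats::t.test": [0.9, 0.1, 0.0],
    "stats::wilcox.test": [0.6, 0.7, 0.1],
    "stats::lm": [0.0, 0.2, 0.9],
}
query_vec = [0.5, 0.8, 0.1]  # stands in for an encoded query plus data profile

ranking = rank_tools(query_vec, tool_vecs, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice a vector database performs this nearest-neighbor search over the full function collection, so the loop above is only meant to show what is being computed.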
## Model Details
- **Architecture:** Bi-encoder (Sentence Transformer)
- **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
- **Task:** Dense Retrieval for Tool-Augmented LLMs
- **Performance:** State-of-the-art on R package retrieval tasks (93.47% NDCG@10).
- **Domain:** R programming language, Data Science, Statistical Analysis functions
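For readers unfamiliar with the NDCG@10 metric quoted above, the following is a minimal sketch of how it can be computed for a single query. It assumes binary relevance labels and computes the ideal ordering from the retrieved list only; the exact evaluation protocol behind the reported number is not described here, so treat this as an illustration of the metric rather than a reproduction of it.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of its ideal reordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical rankings: 1 marks the correct R function, 0 marks any other candidate.
hit_at_rank_1 = ndcg_at_k([1, 0, 0], k=10)
hit_at_rank_2 = ndcg_at_k([0, 1, 0], k=10)
print(hit_at_rank_1, hit_at_rank_2)
```

The score drops as the correct function moves down the list, which is why the metric rewards retrievers that place the right tool near the top.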
### Usage (Sentence-Transformers)
First, install the `sentence-transformers` library:
```bash
pip install -U sentence-transformers
```
### Usage with RPKB (Recommended)
Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB) to perform conditional retrieval.
```python
from huggingface_hub import snapshot_download
import chromadb
import os
# 1. Download the database folder from Hugging Face
db_path = snapshot_download(
    repo_id="Stephen-SMJ/RPKB",
    repo_type="dataset",
    allow_patterns="RPKB/*"
)
# 2. Connect to the local ChromaDB instance
client = chromadb.PersistentClient(path=os.path.join(db_path, "RPKB"))
# 3. Access the specific collection
collection = client.get_collection(name="inference")
print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
```
### Retrieval with DARE
```python
from sentence_transformers import SentenceTransformer
# 1. Load the DARE model
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retriever")
# 2. Define the exact input format: Query + Data Profile
query = (
    "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. "
    "I need to identify driver elements by estimating regulatory scores based on the counts provided "
    "in the data. Please set the random seed to 123 at the start. "
    "I need to filter for fragment lengths between 150 and 600 bp and use a DNA count filter of 5. "
    "For my evaluation, please print the first value of the estimated scores (est_a) "
    "for the very first region identified."
)
# 3. Generate embedding
query_embedding = model.encode(query).tolist()
# 4. Search in the database
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["metadatas", "distances", "documents"]
)
# Display Top-1 Result
print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
```
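To inspect more than the top-1 hit, the nested result lists can be flattened into a ranked list. The sketch below assumes the results follow the shape used in the snippet above (one inner list per query, with `package_name` and `function_name` in each metadata entry). The `top_functions` helper and the mock values are hypothetical and exist only so the example runs without the database.

```python
def top_functions(results, query_index=0):
    """Flatten one query's results into (rank, 'package::function', distance) tuples."""
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    ranked = []
    for rank, (meta, dist) in enumerate(zip(metadatas, distances), start=1):
        name = f'{meta["package_name"]}::{meta["function_name"]}'
        ranked.append((rank, name, dist))
    return ranked

# Mock results with made-up values, shaped like the output of collection.query().
mock_results = {
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "fit_model"},
        {"package_name": "pkgB", "function_name": "score_regions"},
    ]],
    "distances": [[0.12, 0.34]],
}

for rank, name, dist in top_functions(mock_results):
    print(f"{rank}. {name} (distance={dist:.2f})")
```

With a live collection, passing the `results` object from the query above in place of `mock_results` should list all `n_results` candidates in rank order.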
## Citation
If you find DARE, RPKB, or RCodingAgent useful in your research, please cite:
```bibtex
@article{sun2026dare,
title={DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval},
author={Maojun Sun and Yue Wu and Yifei Xie and Ruijian Han and Binyan Jiang and Defeng Sun and Yancheng Yuan and Jian Huang},
year={2026},
eprint={2603.04743},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2603.04743},
}
```