---
language: en
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- retrieval
- tool-use
- llm-agent
- r-language
license: apache-2.0
base_model: sentence-transformers/all-MiniLM-L6-v2
---
![Gemini_Generated_Image_h25dizh25dizh25d (3)](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)
DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model that retrieves statistical and data analysis tools (R functions) conditioned on **both the user query and the data profile**.
It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.
## Model Details
- **Architecture:** Bi-encoder (Sentence Transformer)
- **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
- **Task:** Dense Retrieval for Tool-Augmented LLMs
- **Performance:** State-of-the-art (SoTA) on R package retrieval tasks.
- **Domain:** R programming language, Data Science, Statistical Analysis functions
<!-- ## 💡 Why DARE? (The Input Formatting)
Unlike traditional semantic search models that only take a natural language query, DARE is trained to be **distribution-conditional**. It expects a concatenated input of the user's intent AND the data profile (e.g., high-dimensional, sparse, categorical). -->
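As a rough illustration of what "bi-encoder" means here: the query and every candidate tool are embedded independently, and candidates are ranked by vector similarity. The sketch below uses toy 3-dimensional vectors and three well-known R functions as stand-ins; the vector values and the helper names `cosine_similarity` and `rank_tools` are hypothetical and are not part of DARE or RPKB.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_tools(query_vec, tool_vecs, top_k=3):
    """Return the top_k (tool_name, score) pairs, highest score first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in tool_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (hypothetical values).
tool_vecs = {
    "stats::lm": [0.9, 0.1, 0.0],
    "glmnet::glmnet": [0.2, 0.9, 0.1],
    "stats::kmeans": [0.0, 0.2, 0.9],
}
query_vec = [0.1, 0.8, 0.2]

ranking = rank_tools(query_vec, tool_vecs, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice the vectors come from `model.encode(...)` and the nearest-neighbor search is delegated to a vector database, so this loop is only meant to show the scoring idea.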
### Usage (Sentence-Transformers)
First, install the `sentence-transformers` library:
```bash
pip install -U sentence-transformers
```
### Usage with our RPKB (Optional but Recommended)
Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB), a prebuilt ChromaDB database of R functions:
```python
from huggingface_hub import snapshot_download
import chromadb
# 1. Download the database folder from Hugging Face
db_path = snapshot_download(
    repo_id="Stephen-SMJ/RPKB",
    repo_type="dataset",
    allow_patterns="RPKB/*",  # Adjust this if your folder name is different
)
# 2. Connect to the local ChromaDB instance
client = chromadb.PersistentClient(path=f"{db_path}/RPKB")
# 3. Access the specific collection
collection = client.get_collection(name="inference")
print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
```
### Then, load the DARE model and run retrieval:
```python
from sentence_transformers import SentenceTransformer

# 1. Load the DARE model
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")

# 2. Define the input: Query + Data Profile
query = (
    "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. "
    "I need to identify driver elements by estimating regulatory scores based on the counts provided "
    "in the data. Please set the random seed to 123 at the start. I need to filter for fragment lengths "
    "between 150 and 600 bp and use a DNA count filter of 5. For my evaluation, please print the "
    "first value of the estimated scores (est_a) for the very first region identified."
)

# 3. Generate the embedding
query_embedding = model.encode(query).tolist()

# 4. Search the database (`collection` is the RPKB collection loaded above);
#    pass a `where` clause here to apply hard metadata filters
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["metadatas", "distances", "documents"],
)

# Display Top-1 Result
print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
```
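To look beyond the top-1 hit, the nested response can be flattened into a ranked list. The following is a minimal sketch, assuming the response keeps the shape used above (one inner list per query, with `package_name` and `function_name` in each metadata entry). The helper name `top_hits` and the mock values are hypothetical and only stand in for a real `collection.query(...)` result.

```python
def top_hits(results, query_index=0):
    """Flatten one query's ChromaDB-style results into a ranked list of dicts."""
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    hits = []
    for rank, (meta, dist) in enumerate(zip(metadatas, distances), start=1):
        hits.append(
            {
                "rank": rank,
                "function": f'{meta["package_name"]}::{meta["function_name"]}',
                "distance": dist,  # smaller distance means a closer match
            }
        )
    return hits


# Mock response with made-up package and function names (hypothetical values).
mock_results = {
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "fit_model"},
        {"package_name": "pkgB", "function_name": "estimate_scores"},
    ]],
    "distances": [[0.12, 0.34]],
}

hits = top_hits(mock_results)
for hit in hits:
    print(f'{hit["rank"]}. {hit["function"]} (distance={hit["distance"]:.2f})')
```

With the real `results` object from the snippet above, `top_hits(results)` would list all three retrieved functions in rank order.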