---
language: en
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- retrieval
- tool-use
- llm-agent
- r-language
license: apache-2.0
base_model: sentence-transformers/all-MiniLM-L6-v2
---

![Gemini_Generated_Image_h25dizh25dizh25d (3)](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)

DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model that retrieves statistical and data analysis tools (R functions) based on **both the user query and the data profile**. It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) agents in automated data science workflows.

## Model Details

- **Architecture:** Bi-encoder (Sentence Transformer)
- **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
- **Task:** Dense retrieval for tool-augmented LLMs
- **Performance:** State of the art (SoTA) on R package retrieval tasks
- **Domain:** R programming language, data science, statistical analysis functions

### Usage (Sentence-Transformers)

First, install the `sentence-transformers` library:

```bash
pip install -U sentence-transformers
```

### Usage with our RPKB (Optional but Recommended)

Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB). The snippet below also requires the `chromadb` package.

```python
from huggingface_hub import snapshot_download
import chromadb

# 1. Download the database folder from Hugging Face
db_path = snapshot_download(
    repo_id="Stephen-SMJ/RPKB",
    repo_type="dataset",
    allow_patterns="RPKB/*"  # Adjust this if your folder name is different
)

# 2. Connect to the local ChromaDB instance
client = chromadb.PersistentClient(path=f"{db_path}/RPKB")

# 3. Access the specific collection
collection = client.get_collection(name="inference")
print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
```

### Then, load the DARE model and run retrieval:

```python
from sentence_transformers import SentenceTransformer

# 1. Load the DARE model
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")

# 2. Define the input: the user query, including a description of the data profile
query = "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. I need to identify driver elements by estimating regulatory scores based on the counts provided in the data. Please set the random seed to 123 at the start. I need to filter for fragment lengths between 150 and 600 bp and use a DNA count filter of 5. For my evaluation, please print the first value of the estimated scores (est_a) for the very first region identified."

# 3. Generate the embedding (a plain list, as expected by ChromaDB)
query_embedding = model.encode(query).tolist()

# 4. Search the database. No metadata filter is applied here;
#    pass a `where` clause to collection.query to add hard filters.
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["metadatas", "distances", "documents"]
)

# 5. Display the top-1 result (first hit of the first query)
top_hit = results["metadatas"][0][0]
print("Top-1 Function:", top_hit["package_name"], "::", top_hit["function_name"])
```
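
The retrieval snippet above prints only the top-1 match, although it requests three results. As a minimal sketch, the helper below walks every returned hit. It assumes the result has the nested-list layout used above (`results["metadatas"][0]` and `results["distances"][0]` hold the hits for the first query) and that each metadata entry carries the `package_name` and `function_name` keys. The name `rank_hits` and the mock data are hypothetical, introduced here only for illustration, and are not part of DARE or RPKB.

```python
from typing import Any, Dict, List, Tuple


def rank_hits(results: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Flatten the hits for the first query into (qualified_name, distance) pairs.

    The order returned by the query is preserved; nothing is re-sorted.
    """
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    hits = []
    for meta, dist in zip(metadatas, distances):
        name = f'{meta["package_name"]}::{meta["function_name"]}'
        hits.append((name, float(dist)))
    return hits


# Hypothetical result for a single query: package names, function names,
# and distances are made up to show the expected shape.
mock_results = {
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "fit_model"},
        {"package_name": "pkgB", "function_name": "summarise_counts"},
    ]],
    "distances": [[0.12, 0.34]],
}

for rank, (name, dist) in enumerate(rank_hits(mock_results), start=1):
    print(f"Top-{rank}: {name} (distance={dist:.2f})")
```

With a real query, pass the `results` object from `collection.query(...)` in place of `mock_results`. Showing the distance next to each candidate can help an agent decide whether the top hit is clearly ahead of the alternatives.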