Stephen-SMJ committed
Commit 707c5ae · verified · 1 Parent(s): 1d99f91

Update README.md

Files changed (1): README.md (+36 −17)
README.md CHANGED
@@ -24,7 +24,6 @@ It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a hig
  - **Task:** Dense Retrieval for Tool-Augmented LLMs
  - **Performance:** SoTA on R package retrieval tasks.
  - **Domain:** R programming language, Data Science, Statistical Analysis functions
- - **Max Sequence Length:** 256 tokens
 
  <!-- ## 💡 Why DARE? (The Input Formatting)
  Unlike traditional semantic search models that only take a natural language query, DARE is trained to be **distribution-conditional**. It expects a concatenated input of the user's intent AND the data profile (e.g., high-dimensional, sparse, categorical). -->
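The commented section above describes DARE's distribution-conditional input: the user's intent concatenated with a serialized data profile. A minimal sketch of assembling that input string (the query text and profile keys are taken from the usage example; treat the exact serialization as an assumption and match whatever format the model was trained on):

```python
import json

# Illustrative query and data profile; the profile keys mirror the
# README's example but are not an exhaustive schema.
query = "I want to perform PCA to reduce dimensions."
data_profile = {
    "data_modality": "tabular",
    "dimensionality": "high",
    "distribution": "sparse matrix",
}

# DARE expects intent and profile concatenated into a single string.
x_q = f"{query} {json.dumps(data_profile)}"
print(x_q)
```

The resulting `x_q` is what gets passed to `model.encode(...)` in place of the bare query.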
@@ -36,30 +35,50 @@ First, install the `sentence-transformers` library:
  pip install -U sentence-transformers
  ```
 
- Then, you can load the model and compute embeddings:
  ```
  from sentence_transformers import SentenceTransformer
  from sentence_transformers.util import cos_sim
 
  # 1. Load the DARE model
- model = SentenceTransformer("Stephen-SMJ/DARE-MiniLM-L6-v2")
 
  # 2. Define the exact input format: Query + Data Profile
- query = "I want to perform PCA to reduce dimensions."
- data_profile = "{ 'data_modality': 'tabular', 'dimensionality': 'high', 'distribution': 'sparse matrix' }"
-
- # ⚠️ Crucial Step: Concatenate them!
- x_q = f"{query} {data_profile}"
 
- # 3. Formulate the Document/Function format: Data Constraints + Function Description
- doc = """Data Constraints: {"data_modality": "tabular", "dimensionality": "high", "distribution_assumption": "sparse"}. Task: dimensionality_reduction.
- R Package: sparsepca. Function: spca(). Description: Calculates sparse principal components for high-dimensional and sparse datasets."""
 
- # 4. Compute embeddings
- query_emb = model.encode(x_q)
- doc_emb = model.encode(doc)
 
- # 5. Compute cosine similarity
- similarity = cos_sim(query_emb, doc_emb)
- print(f"Similarity Score: {similarity.item():.4f}")
  ```
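For reference, the `cos_sim` call in the snippet above reduces to a dot product over the product of L2 norms. A pure-Python sketch with toy 3-d vectors (real MiniLM-family embeddings are much higher-dimensional):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a · b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only
q = [0.2, 0.1, 0.9]
d = [0.3, 0.0, 0.8]
print(f"Similarity Score: {cosine_similarity(q, d):.4f}")
```

Scores near 1.0 indicate near-identical directions; near 0.0, unrelated embeddings.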
 
+ ### Usage with our RPKB (Optional but Recommended)
+ ```python
+ from huggingface_hub import snapshot_download
+ import chromadb
+
+ # 1. Download the database folder from Hugging Face
+ db_path = snapshot_download(
+     repo_id="Stephen-SMJ/RPKB",
+     repo_type="dataset",
+     allow_patterns="RPKB/*"  # Adjust this if your folder name is different
+ )
+
+ # 2. Connect to the local ChromaDB instance
+ client = chromadb.PersistentClient(path=f"{db_path}/RPKB")
+
+ # 3. Access the specific collection
+ collection = client.get_collection(name="inference")
+
+ print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
  ```
+
+ ### Then, you can load the DARE model and run retrieval
+ ```python
  from sentence_transformers import SentenceTransformer
  from sentence_transformers.util import cos_sim
 
  # 1. Load the DARE model
+ model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")
 
  # 2. Define the exact input format: Query + Data Profile
+ query = (
+     "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. "
+     "I need to identify driver elements by estimating regulatory scores based on the counts provided in the data. "
+     "Please set the random seed to 123 at the start. "
+     "I need to filter for fragment lengths between 150 and 600 bp and use a DNA count filter of 5. "
+     "For my evaluation, please print the first value of the estimated scores (est_a) for the very first region identified."
+ )
+
+ # 3. Generate the query embedding
+ query_embedding = model.encode(query).tolist()
+
+ # 4. Search the database for the closest matches
+ results = collection.query(
+     query_embeddings=[query_embedding],
+     n_results=3,
+     include=["metadatas", "distances", "documents"]
+ )
+
+ # Display the Top-1 result
+ top = results["metadatas"][0][0]
+ print("Top-1 Function:", top["package_name"], "::", top["function_name"])
  ```
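The query above retrieves purely by embedding similarity; ChromaDB can additionally apply hard metadata filters via the `where` argument of `collection.query`. A small helper sketch for building such a filter; the field names (`package_name`, `task`) are assumptions about how RPKB entries are keyed, so verify them against the collection's actual metadata:

```python
# Build a ChromaDB `where` filter dict from keyword conditions.
# Field names here are assumed; check the collection's metadata keys.
def build_where(**conditions):
    clauses = [{field: value} for field, value in conditions.items()
               if value is not None]
    if not clauses:
        return None           # no filter: pure vector search
    if len(clauses) == 1:
        return clauses[0]     # single equality clause
    return {"$and": clauses}  # ChromaDB needs $and for multiple clauses

print(build_where(package_name="sparsepca"))
```

The result would be passed as `collection.query(query_embeddings=[query_embedding], n_results=3, where=build_where(package_name="sparsepca"))`.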