Update README.md
- **Task:** Dense Retrieval for Tool-Augmented LLMs
- **Performance:** SoTA on R package retrieval tasks
- **Domain:** R programming language, Data Science, Statistical Analysis functions

<!-- ## 💡 Why DARE? (The Input Formatting)
Unlike traditional semantic search models that only take a natural language query, DARE is trained to be **distribution-conditional**. It expects a concatenated input of the user's intent AND the data profile (e.g., high-dimensional, sparse, categorical). -->
First, install the `sentence-transformers` library:

```
pip install -U sentence-transformers
```
### Usage with our RPKB (Optional and Recommended)

```python
from huggingface_hub import snapshot_download
import chromadb

# 1. Download the database folder from Hugging Face
db_path = snapshot_download(
    repo_id="Stephen-SMJ/RPKB",
    repo_type="dataset",
    allow_patterns="RPKB/*",  # Adjust this if your folder name is different
)

# 2. Connect to the local ChromaDB instance
client = chromadb.PersistentClient(path=f"{db_path}/RPKB")

# 3. Access the specific collection
collection = client.get_collection(name="inference")

print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
```
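For orientation, each record in the collection pairs a retrievable text document with its metadata. Below is a hedged sketch of one entry; the schema is an assumption inferred from the metadata fields this card reads back (`package_name`, `function_name`), using the `sparsepca::spca()` function as the example:

```python
# Hypothetical RPKB-style entry; the exact schema is an assumption.
document = (
    "R Package: sparsepca. Function: spca(). Description: Calculates sparse "
    "principal components for high-dimensional and sparse datasets."
)
metadata = {"package_name": "sparsepca", "function_name": "spca"}

# The document text is what gets embedded; metadata drives display and filtering.
print(metadata["package_name"], "::", metadata["function_name"])
```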
### Then, you can load the DARE model and run retrieval:

```python
from sentence_transformers import SentenceTransformer

# 1. Load the DARE model
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")

# 2. Define the exact input format: Query + Data Profile
query = (
    "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my "
    "environment. I need to identify driver elements by estimating regulatory "
    "scores based on the counts provided in the data. Please set the random seed "
    "to 123 at the start. I need to filter for fragment lengths between 150 and "
    "600 bp and use a DNA count filter of 5. For my evaluation, please print the "
    "first value of the estimated scores (est_a) for the very first region "
    "identified."
)

# 3. Generate the query embedding
query_embedding = model.encode(query).tolist()

# 4. Search the database (uses the `collection` loaded from RPKB above)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["metadatas", "distances", "documents"],
)

# Display the Top-1 result
top = results["metadatas"][0][0]
print("Top-1 Function:", top["package_name"], "::", top["function_name"])
```
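ChromaDB returns hits as parallel lists of lists, with one inner list per query embedding. A minimal sketch of unpacking the top-k hits, using a mocked `results` dict in that shape (the entries are illustrative, not actual RPKB output):

```python
# Mocked ChromaDB query result: one inner list per query embedding.
results = {
    "metadatas": [[
        {"package_name": "sparsepca", "function_name": "spca"},
        {"package_name": "stats", "function_name": "prcomp"},
    ]],
    "distances": [[0.12, 0.34]],  # smaller distance = closer match
}

# Unpack the hits for the first (and only) query.
for meta, dist in zip(results["metadatas"][0], results["distances"][0]):
    print(f"{meta['package_name']}::{meta['function_name']} (distance={dist:.2f})")
```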