Stephen-SMJ committed
Commit 707c5ae · verified · 1 Parent(s): 1d99f91

Update README.md

Files changed (1): README.md (+36 −17)
README.md CHANGED
@@ -24,7 +24,6 @@ It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a hig
  - **Task:** Dense Retrieval for Tool-Augmented LLMs
  - **Performance:** SoTA on R package retrieval tasks.
  - **Domain:** R programming language, Data Science, Statistical Analysis functions
- - **Max Sequence Length:** 256 tokens
 
  <!-- ## 💡 Why DARE? (The Input Formatting)
  Unlike traditional semantic search models that only take a natural language query, DARE is trained to be **distribution-conditional**. It expects a concatenated input of the user's intent AND the data profile (e.g., high-dimensional, sparse, categorical). -->
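The commented section above describes DARE's distribution-conditional input: the user's intent concatenated with a serialized data profile. A minimal sketch of assembling that input string (the query text and profile keys are taken from the usage example; treat the exact serialization as an assumption and match whatever format the model was trained on):

```python
import json

# Illustrative query and data profile; the profile keys mirror the
# README's example but are not an exhaustive schema.
query = "I want to perform PCA to reduce dimensions."
data_profile = {
    "data_modality": "tabular",
    "dimensionality": "high",
    "distribution": "sparse matrix",
}

# DARE expects intent and profile concatenated into a single string.
x_q = f"{query} {json.dumps(data_profile)}"
print(x_q)
```

The resulting `x_q` is what gets passed to `model.encode(...)` in place of the bare query.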
@@ -36,30 +35,50 @@ First, install the `sentence-transformers` library:
  pip install -U sentence-transformers
  ```
 
- Then, you can load the model and compute embeddings:
  ```
  from sentence_transformers import SentenceTransformer
  from sentence_transformers.util import cos_sim
 
  # 1. Load the DARE model
- model = SentenceTransformer("Stephen-SMJ/DARE-MiniLM-L6-v2")
 
  # 2. Define the exact input format: Query + Data Profile
- query = "I want to perform PCA to reduce dimensions."
- data_profile = "{ 'data_modality': 'tabular', 'dimensionality': 'high', 'distribution': 'sparse matrix' }"
-
- # ⚠️ Crucial Step: Concatenate them!
- x_q = f"{query} {data_profile}"
 
- # 3. Formulate the Document/Function format: Data Constraints + Function Description
- doc = """Data Constraints: {"data_modality": "tabular", "dimensionality": "high", "distribution_assumption": "sparse"}. Task: dimensionality_reduction.
- R Package: sparsepca. Function: spca(). Description: Calculates sparse principal components for high-dimensional and sparse datasets."""
 
- # 4. Compute embeddings
- query_emb = model.encode(x_q)
- doc_emb = model.encode(doc)
 
- # 5. Compute cosine similarity
- similarity = cos_sim(query_emb, doc_emb)
- print(f"Similarity Score: {similarity.item():.4f}")
  ```
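For reference, the `cos_sim` call in the snippet above reduces to a dot product over the product of L2 norms. A pure-Python sketch with toy 3-d vectors (real MiniLM-family embeddings are much higher-dimensional):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a · b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only
q = [0.2, 0.1, 0.9]
d = [0.3, 0.0, 0.8]
print(f"Similarity Score: {cosine_similarity(q, d):.4f}")
```

Scores near 1.0 indicate near-identical directions; near 0.0, unrelated embeddings.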
 
+ ### Usage with our RPKB (Optional but Recommended)
+ ```python
+ from huggingface_hub import snapshot_download
+ import chromadb
+
+ # 1. Download the database folder from Hugging Face
+ db_path = snapshot_download(
+     repo_id="Stephen-SMJ/RPKB",
+     repo_type="dataset",
+     allow_patterns="RPKB/*"  # Adjust this if your folder name is different
+ )
+
+ # 2. Connect to the local ChromaDB instance
+ client = chromadb.PersistentClient(path=f"{db_path}/RPKB")
+
+ # 3. Access the specific collection
+ collection = client.get_collection(name="inference")
+
+ print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
  ```
+
+ ### Then, you can load the DARE model and run retrieval
+ ```python
  from sentence_transformers import SentenceTransformer
  from sentence_transformers.util import cos_sim
 
  # 1. Load the DARE model
+ model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")
 
  # 2. Define the exact input format: Query + Data Profile
+ query = (
+     "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. "
+     "I need to identify driver elements by estimating regulatory scores based on the counts provided in the data. "
+     "Please set the random seed to 123 at the start. "
+     "I need to filter for fragment lengths between 150 and 600 bp and use a DNA count filter of 5. "
+     "For my evaluation, please print the first value of the estimated scores (est_a) for the very first region identified."
+ )
+
+ # 3. Generate the query embedding
+ query_embedding = model.encode(query).tolist()
+
+ # 4. Search the database for the closest matches
+ results = collection.query(
+     query_embeddings=[query_embedding],
+     n_results=3,
+     include=["metadatas", "distances", "documents"]
+ )
+
+ # Display the Top-1 result
+ top = results["metadatas"][0][0]
+ print("Top-1 Function:", top["package_name"], "::", top["function_name"])
  ```
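The query above retrieves purely by embedding similarity; ChromaDB can additionally apply hard metadata filters via the `where` argument of `collection.query`. A small helper sketch for building such a filter; the field names (`package_name`, `task`) are assumptions about how RPKB entries are keyed, so verify them against the collection's actual metadata:

```python
# Build a ChromaDB `where` filter dict from keyword conditions.
# Field names here are assumed; check the collection's metadata keys.
def build_where(**conditions):
    clauses = [{field: value} for field, value in conditions.items()
               if value is not None]
    if not clauses:
        return None           # no filter: pure vector search
    if len(clauses) == 1:
        return clauses[0]     # single equality clause
    return {"$and": clauses}  # ChromaDB needs $and for multiple clauses

print(build_where(package_name="sparsepca"))
```

The result would be passed as `collection.query(query_embeddings=[query_embedding], n_results=3, where=build_where(package_name="sparsepca"))`.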