Asish22/code-crawler
Mihail Eric committed on Oct 1, 2024

Commit

ca05bc9

1 Parent(s): 4e68d3a

download nltk if not downloaded

Files changed (2)

README.md CHANGED

@@ -89,7 +89,9 @@ pip install git+https://github.com/Storia-AI/sage.git@main
     export PINECONE_INDEX_NAME=...
     ```
-3. For reranking, we support <a href="https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/">NVIDIA</a>, <a href="https://docs.voyageai.com/docs/reranker">Voyage</a>, <a href="https://cohere.com/rerank">Cohere</a>, and <a href="https://jina.ai/reranker/">Jina</a>. According to [our experiments](benchmark/retrieval/README.md), NVIDIA performs best. Export the API key of the desired provider:
+3. For reranking, we support <a href="https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/">NVIDIA</a>, <a href="https://docs.voyageai.com/docs/reranker">Voyage</a>, <a href="https://cohere.com/rerank">Cohere</a>, and <a href="https://jina.ai/reranker/">Jina</a>. According to [our experiments](benchmark/retrieval/README.md), NVIDIA performs best. Note: for NVIDIA you should use the `nvidia/nv-rerankqa-mistral-4b-v3` reranker.
+Export the API key of the desired provider:
     ```
     export NVIDIA_API_KEY=...  # or
     export VOYAGE_API_KEY=...  # or
@@ -102,6 +104,19 @@ pip install git+https://github.com/Storia-AI/sage.git@main
     export ANTHROPIC_API_KEY=...
     ```
+For easier configuration, create a `.sage-env` file with the following contents (change the API keys names based on your desired setup):
+```
+# Embeddings
+export OPENAI_API_KEY=
+# Vector store
+export PINECONE_API_KEY=
+# Reranking
+export NVIDIA_API_KEY=
+# Generation LLM
+export ANTHROPIC_API_KEY=
+# Github issues
+export GITHUB_TOKEN=
+```
 </details>
 ### Optional
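
The `.sage-env` file added above only helps if its variables actually reach the process. A minimal sketch of a pre-flight check follows, assuming the file has been loaded into the shell environment first (for example with `source .sage-env`, which the README does not spell out). The helper name `missing_keys` and the `REQUIRED_KEYS` list are hypothetical and not part of sage; the key names are copied from the template and should be adjusted to the providers actually in use.

```python
import os
from typing import Mapping, Optional, Sequence

# Key names taken from the .sage-env template in the README diff.
# Assumption: this particular provider combination is the one in use.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "NVIDIA_API_KEY",
    "ANTHROPIC_API_KEY",
    "GITHUB_TOKEN",
]


def missing_keys(
    required: Sequence[str],
    environ: Optional[Mapping[str, str]] = None,
) -> list:
    """Return the required variables that are unset or empty.

    Hypothetical helper, not part of sage. `environ` defaults to
    os.environ and can be replaced with a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys(REQUIRED_KEYS)
    if missing:
        print("Missing or empty variables: " + ", ".join(missing))
    else:
        print("All expected variables are set.")
```

An empty value counts as missing here because the template ships every key as `export NAME=` with nothing after the equals sign.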

sage/index.py CHANGED

@@ -1,6 +1,7 @@
 """Runs a batch job to compute embeddings for an entire repo and stores them into a vector store."""
 import logging
+import nltk
 import time
 import configargparse
@@ -13,10 +14,20 @@ from sage.embedder import build_batch_embedder_from_flags
 from sage.github import GitHubIssuesChunker, GitHubIssuesManager
 from sage.vector_store import build_vector_store_from_args
+from nltk.data import find
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger()
 logger.setLevel(logging.INFO)
+def is_punkt_downloaded():
+    try:
+        find('tokenizers/punkt_tab')
+        return True
+    except LookupError:
+        return False
 def main():
     parser = configargparse.ArgParser(
@@ -42,6 +53,14 @@ def main():
     if args.embedding_provider == "marqo" and args.vector_store_provider != "marqo":
         parser.error("When using the marqo embedder, the vector store type must also be marqo.")
+    # We need nltk tokenizers for
+    if is_punkt_downloaded():
+        print("punkt is already downloaded")
+    else:
+        print("punkt is not downloaded")
+        # Optionally download it
+        nltk.download('punkt_tab')
     ######################
     # Step 1: Embeddings #
     ######################
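
The check-then-download logic in this commit could also be folded into a single reusable helper. The sketch below is one possible variant, not the code that was committed: the name `ensure_resource` and its `find`/`download` parameters are hypothetical, introduced so the behavior can be exercised without network access. When they are omitted, it falls back to `nltk.data.find` and `nltk.download`, the same calls the commit uses, and it reports through `logging` rather than `print` to match the logger already configured in `sage/index.py`.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def ensure_resource(
    resource_path: str,
    package: str,
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[[str], object]] = None,
) -> bool:
    """Make sure an NLTK resource is available locally.

    Returns True if the resource was already present and False if a
    download had to be triggered. Hypothetical helper: `find` and
    `download` are injectable for testing and default to the NLTK calls.
    """
    if find is None or download is None:
        # Import lazily so the helper can be tested without nltk installed.
        import nltk
        from nltk.data import find as nltk_find

        find = find or nltk_find
        download = download or nltk.download

    try:
        # nltk.data.find raises LookupError when the resource is absent.
        find(resource_path)
    except LookupError:
        logger.info("%s is not downloaded, fetching it", package)
        download(package)
        return False

    logger.info("%s is already downloaded", package)
    return True
```

Under these assumptions, the block added to `main()` would reduce to one call such as `ensure_resource("tokenizers/punkt_tab", "punkt_tab")`.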