setu / module_a /pinecone_vector_db /PINECONE_SETUP.md
khagu's picture
chore: finally untrack large database files
3998131

Pinecone Vector Database Setup Guide

This guide will help you set up Pinecone cloud storage for your vector database.

Prerequisites

  1. A Pinecone account (sign up at https://app.pinecone.io/)
  2. A Pinecone API key

Setup Steps

1. Get Your Pinecone API Key

  1. Go to https://app.pinecone.io/
  2. Sign up or log in
  3. Navigate to your API keys section
  4. Copy your API key

2. Set the API Key

You have two options:

Option A: Environment Variable (Recommended)

Set the environment variable before running your code:

Windows (PowerShell):

$env:PINECONE_API_KEY="your-api-key-here"

Windows (Command Prompt):

set PINECONE_API_KEY=your-api-key-here

Linux/Mac:

export PINECONE_API_KEY="your-api-key-here"

Option B: Direct Configuration

Edit module_a/config.py and set:

PINECONE_API_KEY = "your-api-key-here"

Note: Option A is recommended for security reasons.

3. Install Dependencies

Make sure you have the Pinecone client installed:

pip install pinecone-client[grpc]>=3.0.0

Or install all requirements:

pip install -r module_a/requirements.txt

4. Build Your Vector Database

Run the build script:

python -m module_a.build_vector_db

The script will automatically detect if Pinecone is configured and use it instead of ChromaDB.

5. Verify Setup

The build script will:

  • Create a Pinecone index if it doesn't exist
  • Upload your document chunks to Pinecone
  • Store full text in a local JSON file (to avoid Pinecone metadata limits)

How It Works

Text Storage

Pinecone has a 40KB limit on metadata per vector. To work around this:

  • Full text is stored in a local JSON file (data/module-A/pinecone_text_storage.json)
  • Only a text preview is stored in Pinecone metadata
  • The system automatically loads and saves this storage file

Index Configuration

  • Index Name: nepal-legal-docs (configurable in config.py)
  • Dimension: 384 (matches the embedding model)
  • Metric: Cosine similarity
  • Cloud: AWS
  • Region: us-east-1 (configurable in pinecone_vector_db.py)

Troubleshooting

"PINECONE_API_KEY must be set"

  • Make sure you've set the API key (see Step 2)
  • Check that the environment variable is set in the same terminal session

"Index creation failed"

  • Check your Pinecone dashboard for quota limits
  • Verify your API key is valid
  • Try a different region if us-east-1 is unavailable

"Failed to connect to index"

  • Wait a few minutes after index creation (it takes time to initialize)
  • Check your network connection
  • Verify the index exists in your Pinecone dashboard

Text not found in queries

  • Make sure pinecone_text_storage.json exists and contains your data
  • The file is automatically created when you build the database
  • If you delete it, you'll need to rebuild the database

Switching Between ChromaDB and Pinecone

The system automatically uses Pinecone if PINECONE_API_KEY is set, otherwise it falls back to ChromaDB.

Important: The application will automatically fall back to ChromaDB if:

  • Pinecone API key is not set
  • Pinecone initialization fails
  • Pinecone client is not installed

This means your application will work even without Pinecone configured - it will just use the local ChromaDB instead.

To switch:

  • Use Pinecone: Set PINECONE_API_KEY environment variable
  • Use ChromaDB: Unset or remove PINECONE_API_KEY

The RAG chain (LegalRAGChain) automatically detects which vector database to use at initialization time.

Cost Considerations

Pinecone offers a free tier with:

  • 1 index
  • 100K vectors
  • 1M queries/month

Check https://www.pinecone.io/pricing/ for current pricing.

Support

For Pinecone-specific issues, check: