# Create and run a local RAG pipeline from scratch

The goal of this notebook is to build a RAG (Retrieval Augmented Generation) pipeline from scratch.

Specifically, we'd like to be able to open a PDF file, ask questions (queries) of it and have them answered by a Large Language Model (LLM).

## Requirements and setup

In [1]:
# Perform Google Colab installs (if running in Google Colab)
import os

if "COLAB_GPU" in os.environ:
    print("[INFO] Running in Google Colab, installing requirements.")
    !pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
    !pip install PyMuPDF # for reading PDFs with Python
    !pip install tqdm # for progress bars
    !pip install cohere pinecone-client # for embedding models and vector database
    !pip install accelerate # for quantization model loading
    !pip install bitsandbytes # for quantizing models (less storage space)
    !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

[INFO] Running in Google Colab, installing requirements.
Collecting torch
  Downloading torch-2.4.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-n

## 1. Document/Text Processing and Embedding Creation

Ingredients:
* PDF document of choice.
* Embedding model Cohere.

Steps:
1. Import PDF document.
2. Process text for embedding (e.g. split into chunks of sentences).
3. Embed text chunks with embedding model.
4. Save embeddings to vector database Pinecone.

### Import PDF Document

We're going to work on the open-source PDF textbook [*Human Nutrition: 2020 Edition*](https://pressbooks.oer.hawaii.edu/humannutrition2/).

There are several libraries to open PDFs with Python but I found that [PyMuPDF](https://github.com/pymupdf/pymupdf) works quite well.

First we'll download the PDF if it doesn't exist.

In [2]:
# Download PDF file
import os
import requests

# Get PDF document
pdf_path = "human-nutrition-text.pdf"

# Download PDF if it doesn't already exist
if not os.path.exists(pdf_path):
  print("File doesn't exist, downloading...")

  # The URL of the PDF you want to download
  url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

  # The local filename to save the downloaded file
  filename = pdf_path

  # Send a GET request to the URL
  response = requests.get(url)

  # Check if the request was successful
  if response.status_code == 200:
      # Open a file in binary write mode and save the content to it
      with open(filename, "wb") as file:
          file.write(response.content)
      print(f"The file has been downloaded and saved as {filename}")
  else:
      print(f"Failed to download the file. Status code: {response.status_code}")
else:
  print(f"File {pdf_path} exists.")

File doesn't exist, downloading...
The file has been downloaded and saved as human-nutrition-text.pdf


PDF acquired!

We can import the pages of our PDF to text by first defining the PDF path and then opening and reading it with PyMuPDF (`import fitz`).

We'll write a small helper function to preprocess the text as it gets read.

We'll save each page to a dictionary and then append that dictionary to a list for ease of use later.

In [3]:
import fitz  # PyMuPDF
from tqdm.auto import tqdm # for progress bars, requires !pip install tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # this might be different for each doc

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# this only focuses on text
def open_and_read_pdf(pdf_path: str, page_offset: int = 0) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number - page_offset,  # adjust page numbers since our PDF starts on page 42
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path, page_offset=42)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': -42,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -41,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

Now let's get a random sample of the pages.

In [4]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 591,
  'page_char_count': 313,
  'page_word_count': 52,
  'page_sentence_count_raw': 3,
  'page_token_count': 78.25,
  'text': 'recommended that users complete these activities using a  desktop or laptop computer and in Google Chrome.  \xa0 An interactive or media element has been  excluded from this version of the text. You can  view it online here:  http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=347  \xa0 592  |  Water-Soluble Vitamins'},
 {'page_number': 1126,
  'page_char_count': 1533,
  'page_word_count': 218,
  'page_sentence_count_raw': 25,
  'page_token_count': 383.25,
  'text': 'an incessant fear of weight gain but instead have an obsession with  “feeling pure, healthy and natural.”7 People affected by orthorexia  nervosa tend to follow diets tied to a philosophy or theory and  believe that their theory of eating is the best.8 9 Such diets often  have a redemptive quality that involves denying oneself of “bad” or  “wrong” foods.10 In extreme cases, affec

### Get some stats on the text

Let's perform a rough exploratory data analysis (EDA) to get an idea of the size of the texts (e.g. character counts, word counts etc) we're working with.

The different sizes of texts will be a good indicator into how we should split our texts.

For now, let's turn our list of dictionaries into a DataFrame and explore it.

In [5]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-42,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-41,0,1,1,0.0,
2,-40,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-39,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...
4,-38,797,145,2,199.25,Contents Preface University of Hawai‘i at Mā...


In [6]:
# Get stats
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,561.5,1148.0,198.3,9.97,287.0
std,348.86,560.38,95.76,6.19,140.1
min,-42.0,0.0,1.0,1.0,0.0
25%,259.75,762.0,134.0,4.0,190.5
50%,561.5,1231.5,214.5,10.0,307.88
75%,863.25,1603.5,271.0,14.0,400.88
max,1165.0,2308.0,429.0,32.0,577.0


Okay, looks like our average token count per page is 287.



### Further text processing (splitting pages into sentences)

The ideal is processing text before embedding.

A simple method is to break the text into chunks of sentences.

As in, chunk a page of text into groups of 5, 7, 10 or more sentences.

We will follow the workflow of:

`Ingest text -> split it into groups/chunks -> embed the groups/chunks -> use the embeddings`

Some options for splitting text into sentences:

1. Split into sentences with simple rules (e.g. split on ". " with `text = text.split(". ")`, like we did above).
2. Split into sentences with a natural language processing (NLP) library such as [spaCy](https://spacy.io/) or [nltk](https://www.nltk.org/).

Why split into sentences?

* Easier to handle than larger pages of text (especially if pages are densely filled with text).
* Can get specific and find out which group of sentences were used to help within a RAG pipeline.

> **Resource:** See [spaCy install instructions](https://spacy.io/usage).

Let's use spaCy to break our text into sentences since it's likely a bit more robust than just using `text.split(". ")`.

In [7]:
from spacy.lang.en import English  # English language model

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/
nlp.add_pipe("sentencizer")

# Create a document instance as an example
doc = nlp("This is a sentence. This another sentence.")
assert len(list(doc.sents)) == 2

# Access the sentences of the document
list(doc.sents)

[This is a sentence., This another sentence.]

So let's run our small sentencizing pipeline on our pages of text.

In [8]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [9]:
# Inspect an example
random.sample(pages_and_texts, k=1)

[{'page_number': 578,
  'page_char_count': 1284,
  'page_word_count': 216,
  'page_sentence_count_raw': 8,
  'page_token_count': 321.0,
  'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Folate is especially essential for the growth and specialization of  cells of the central nervous system. Children whose mothers were  folate-deficient during pregnancy have a higher risk of neural-tube  birth defects. Folate deficiency is causally linked to the development  of spina bifida, a neural-tube defect that occurs when the spine  does not completely enclose the spinal cord. Spina bifida can lead  to many physical and mental disabilities (Figure 9.18\xa0“Spina Bifida in  Infants” ). Observational studies show that the prevalence of neural- tube defects was decreased after the fortification of enriched cereal  grain products with folate in 1996 in the United States (and 1998  in Canada) compared to before grain products were fortified with  folate.  Additionally, results of clinical trials h

Wonderful!

Now let's turn out list of dictionaries into a DataFrame and get some stats.

In [10]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,561.5,1148.0,198.3,9.97,287.0,10.32
std,348.86,560.38,95.76,6.19,140.1,6.3
min,-42.0,0.0,1.0,1.0,0.0,0.0
25%,259.75,762.0,134.0,4.0,190.5,5.0
50%,561.5,1231.5,214.5,10.0,307.88,10.0
75%,863.25,1603.5,271.0,14.0,400.88,15.0
max,1165.0,2308.0,429.0,32.0,577.0,28.0


For our set of text, it looks like our raw sentence count (e.g. splitting on `". "`) is quite close to what spaCy came up with.

Now we've got our text split into sentences, lets group those sentences!

### Chunking our sentences together

Let's take a step to break down our list of sentences/text into smaller chunks.

This process is referred to as **chunking**.

Why do we do this?

1. Easier to manage similar sized chunks of text.
2. Don't overload the embedding models capacity for tokens (e.g. if an embedding model has a capacity of 384 tokens, there could be information loss if we try to embed a sequence of 400+ tokens).
3. Our LLM context window (the amount of tokens an LLM can take in) may be limited and requires compute power so we want to make sure we're using it as well as possible.

There are many different ways emerging for creating chunks of information/text.

For now, we're going to keep it simple and break our pages of sentences into groups of 10 (this number is arbitrary and can be changed).

On average each of our pages has 10 sentences.

And an average total of 287 tokens per page.

So our groups of 10 sentences will also be ~287 tokens long.

To split our groups of sentences into chunks of 10 or less, let's create a function which accepts a list as input and recursively breaks into down into sublists of a specified size.

In [11]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function that recursively splits a list into desired sizes
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [12]:
# Sample an example from the group (note: many samples have only 1 chunk as they have <=10 sentences total)
random.sample(pages_and_texts, k=1)

[{'page_number': 684,
  'page_char_count': 297,
  'page_word_count': 51,
  'page_sentence_count_raw': 3,
  'page_token_count': 74.25,
  'text': 'recommended that users complete these activities using a  desktop or laptop computer and in Google Chrome.  \xa0 An interactive or media element has been  excluded from this version of the text. You can  view it online here:  http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=393  \xa0 Iodine  |  685',
  'sentences': ['recommended that users complete these activities using a  desktop or laptop computer and in Google Chrome.',
   ' \xa0 An interactive or media element has been  excluded from this version of the text.',
   'You can  view it online here:  http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=393  \xa0 Iodine  |  685'],
  'page_sentence_count_spacy': 3,
  'sentence_chunks': [['recommended that users complete these activities using a  desktop or laptop computer and in Google Chrome.',
    ' \xa0 An interactive or media element has

In [13]:
# Create a DataFrame to get stats
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,561.5,1148.0,198.3,9.97,287.0,10.32,1.53
std,348.86,560.38,95.76,6.19,140.1,6.3,0.64
min,-42.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,259.75,762.0,134.0,4.0,190.5,5.0,1.0
50%,561.5,1231.5,214.5,10.0,307.88,10.0,1.0
75%,863.25,1603.5,271.0,14.0,400.88,15.0,2.0
max,1165.0,2308.0,429.0,32.0,577.0,28.0,3.0


The average number of chunks is around 1.5, this is expected since many of our pages only contain an average of 10 sentences.

### Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

Let's create a new list of dictionaries each containing a single chunk of sentences with relative information such as page number as well statistics about each chunk.

In [14]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

  0%|          | 0/1208 [00:00<?, ?it/s]

1843

In [15]:
# View a random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': 803,
  'sentence_chunk': 'and eclampsia, which is sometimes referred to as toxemia of pregnancy. This disorder is marked by elevated blood pressure and protein in the urine and is associated with swelling. To prevent preeclampsia, the WHO recommends increasing calcium intake for women consuming diets low in that micronutrient, administering a low dosage of aspirin (75 milligrams), and increasing prenatal checkups. The WHO does not recommend the restriction of dietary salt intake during pregnancy with the aim of preventing the development of pre-eclampsia and its complications12. About 4 percent of pregnant women suffer from a condition known as gestational diabetes, which is abnormal glucose tolerance during pregnancy. The body becomes resistant to the hormone insulin, which enables cells to transport glucose from the blood. Gestational diabetes is usually diagnosed around twenty-four to twenty-six weeks, although it is possible for the condition to develop later into 

Excellent!

Now we've broken our whole textbook into chunks of 10 sentences or less as well as the page number they came from.

This means we could reference a chunk of text and know its source.

Let's get some stats about our chunks.

In [16]:
# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,582.38,734.44,112.33,183.61
std,347.79,447.54,71.22,111.89
min,-42.0,12.0,3.0,3.0
25%,279.5,315.0,44.0,78.75
50%,585.0,746.0,114.0,186.5
75%,889.0,1118.5,173.0,279.62
max,1165.0,1831.0,297.0,457.75


Looks like some of our chunks have quite a low token count.

Let's check for samples with less than 30 tokens (about the length of a sentence) and see if they are worth keeping?

In [17]:
# Show random chunks with under 30 tokens in length
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 24.75 | Text: http://www.ajcn.org/content/87/1/64.long. Accessed September 22, 2017. 554 | Water-Soluble Vitamins
Chunk token count: 26.5 | Text: It is stored in the rectum until it is expelled through the anus via defecation. The Digestive System | 77
Chunk token count: 11.0 | Text: Accessed October 5, 2017. Introduction | 433
Chunk token count: 19.25 | Text: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=519   Introduction | 991
Chunk token count: 17.75 | Text: Table 6.1 Essential and Nonessential Amino Acids Defining Protein | 365


Looks like many of these are headers and footers of different pages.

They don't seem to offer too much information.

Let's filter our DataFrame/list of dictionaries to only include chunks with over 30 tokens in length.

In [18]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -40,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

Smaller chunks filtered!

Time to embed our chunks of text!

### Embedding our text chunks

While humans understand text, machines understand numbers best.

An [embedding](https://vickiboykis.com/what_are_embeddings/index.html) is a broad concept.

A simple definitions is "a useful numerical representation".

The most powerful thing about modern embeddings is that they are *learned* representations.

Meaning rather than directly mapping words/tokens/characters to numbers directly (e.g. `{"a": 0, "b": 1, "c": 3...}`), the numerical representation of tokens is learned by going through large corpuses of text and figuring out how different tokens relate to each other.

Our goal is to turn each of our chunks into a numerical representation (an embedding vector, where a vector is a sequence of numbers arranged in order).


To do so, we'll use the [Cohere](https://cohere.com/embed) embedding model.

Specifically, we'll get the `embed-english-v2.0` model (you can see the model's intended use on the [Model](https://docs.cohere.com/reference/embed)).

Upload these vector embeddings into [Pinecone](https://docs.pinecone.io/guides/get-started/quickstart).

In [19]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

In [20]:
COHERE_KEY = 'mMJW7g9UDwQCFhtW905hK854aQJPoU13cRevsrvg'

#### Create Embeddings

In [21]:
import cohere

co = cohere.Client(COHERE_KEY)

In [22]:
%%time

# Embed all texts
embeds = co.embed(
    texts=text_chunks,
    model='embed-english-v2.0',
    input_type='search_query',
    truncate='END'
).embeddings

CPU times: user 46.6 s, sys: 1.16 s, total: 47.8 s
Wall time: 52.9 s


Check the dimensionality of the returned vectors. We will need to save the embedding dimensionality from this to be used when initializing your Pinecone index later

In [23]:
import numpy as np

shape = np.array(embeds).shape
shape

(1680, 4096)

We can see the 4096 embedding dimensionality produced by Cohere’s `embed-english-v2.0` model, and the 1680 samples we built embeddings for.

#### Store the embeddings

Now that we have our embeddings, we can move on to indexing them in the Pinecone vector database.

We first initialize our connection to Pinecone and then create a new index called cohere-pinecone for storing the embeddings. When creating the index, we specify that we would like to use the cosine similarity metric to align with Cohere’s embeddings, and also pass the embedding dimensionality of 4096.

In [24]:
from pinecone import Pinecone, ServerlessSpec
import os

# Use the API key to initialize the Pinecone client
pc = Pinecone(api_key='bdb5ea29-449c-4e3d-8075-a0898d1b8404')

index_name = 'cohere-pinecone'

# if the index does not exist, we create it
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=shape[1],
        metric="cosine",
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

# connect to index
index = pc.Index(index_name)

Now we can begin populating the index with our embeddings. Pinecone expects us to provide a list of tuples in the format (*id, vector, metadata*), where the metadata field is an optional extra field where we can store anything we want in a dictionary format. For this example, we will store the original text of the embeddings.

While uploading our data, we will batch everything to avoid pushing too much data in one go.

In [25]:
batch_size = 128

ids = [str(i) for i in range(shape[0])]
# create list of metadata dictionaries
meta = [{'text': text} for text in text_chunks]

# create list of (id, vector, metadata) tuples to be upserted
to_upsert = list(zip(ids, embeds, meta))

for i in range(0, shape[0], batch_size):
    i_end = min(i+batch_size, shape[0])
    index.upsert(vectors=to_upsert[i:i_end])

# let's view the index statistics
index.describe_index_stats()

{'dimension': 4096,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1680}},
 'total_vector_count': 1680}

We can see from `index.describe_index_stats()` that we have a 4096-dimensionality index populated with 1680 embeddings. Note that serverless indexes scale automatically as needed, so the index_fullness metric is relevant only for pod-based indexes.

### Semantic Search

Now that we have our indexed vectors, we can perform a few search queries. When searching, we will first embed our query using Cohere, and then search using the returned vector in Pinecone.

In [26]:
# Functionising the semantic search
import textwrap # for wrapping text

def search_queries(queries: list[str], k: int = 1) -> dict:
  """
  Function to embed multiple queries, search in Pinecone, and return the top-k results.

  Args:
  - queries (list): A list of query strings.
  - k (int): The number of top results to retrieve for each query (default is 1).

  Returns:
  - results (dict): A dictionary where each query maps to its top-k results.
  """
  # Step 1: Create embeddings for all queries
  query_embeddings = co.embed(
      texts=queries,
      model='embed-english-v2.0',
      input_type='search_query',
      truncate='END'
  ).embeddings

  # Step 2: Perform Pinecone search for each query embedding
  all_results = {}

  for i, query_embedding in enumerate(query_embeddings):
      # Query Pinecone index with each query embedding
      res = index.query(vector=query_embedding, top_k=k, include_metadata=True)

      # Store the result for each query (as a list of matches)
      all_results[queries[i]] = res['matches']

  # Step 3: Display results for each query
  wrapper = textwrap.TextWrapper(width=80)
  for query, matches in all_results.items():
      print(f"Results for Query: {query}\n")

      # Iterate over the top-k matches
      for match in matches:
          score = match['score']
          text = match['metadata']['text']

          # Wrap the text to fit within 80 characters per line
          wrapped_text = wrapper.fill(text=text)

          # Print the score and corresponding text in a readable format
          print(f"Score: {score:.2f}")
          print("Document:\n")
          print(f"{wrapped_text}\n{'-'*50}\n")  # Divider to separate results

      # Larger divider between different queries
      print(f"\n{'='*100}\n")


Let's Look our Result

In [27]:
result = search_queries(queries=["macro nutrients"],
                        k=1)
result

Results for Query: macro nutrients

Score: 0.63
Document:

Macronutrients Nutrients that are needed in large amounts are called
macronutrients. There are three classes of macronutrients: carbohydrates,
lipids, and proteins. These can be metabolically processed into cellular energy.
The energy from macronutrients comes from their chemical bonds. This chemical
energy is converted into cellular energy that is then utilized to perform work,
allowing our bodies to conduct their basic functions. A unit of measurement of
food energy is the calorie. On nutrition food labels the amount given for
“calories” is actually equivalent to each calorie multiplied by one thousand. A
kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with
the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a
macronutrient in the sense that you require a large amount of it, but unlike the
other macronutrients, it does not yield calories. Carbohydrates Carbohydrates
are molec

### Several example queries

Multiple-queries

In [33]:
result = search_queries(queries=["what is carbohydrates", "what is fats", "What is Starch"],
                        k=1)
result

Results for Query: what is carbohydrates

Score: 0.62
Document:

Carbohydrat es are broken down into the subgroups simple and complex
carbohydrate s. These subgroups are further categorized into mono-, di-, and
polysacchari des. indigestible carbohydrates provide a good amount of fiber with
a host of other health benefits. Plants synthesize the fast-releasing
carbohydrate, glucose, from carbon dioxide in the air and water, and by
harnessing the sun’s energy. Recall that plants convert the energy in sunlight
to chemical energy in the molecule, glucose. Plants use glucose to make other
larger, more slow-releasing carbohydrates. When we eat plants we harvest the
energy of glucose to support life’s processes. Figure 4.1 Carbohydrate
Classification Scheme Carbohydrates are a group of organic compounds containing
a ratio of one carbon atom to two hydrogen atoms to one oxygen atom. Basically,
they are hydrated carbons. The word “carbo” means carbon and “hydrate” means
water. Glucose, the most

In [32]:
result = search_queries(queries=["what are fat-soluble vitamins?", "What are the causes of type 2 diabetes?"],
                        k=1)
result

Results for Query: what are fat-soluble vitamins?

Score: 0.67
Document:

subcutaneous fat, or fat underneath the skin. This blanket layer of tissue
insulates the body from extreme temperatures and helps keep the internal climate
under control. It pads our hands and buttocks and prevents friction, as these
areas frequently come in contact with hard surfaces. It also gives the body the
extra padding required when engaging in physically demanding activities such as
ice- or roller skating, horseback riding, or snowboarding. Aiding Digestion and
Increasing Bioavailability The dietary fats in the foods we eat break down in
our digestive systems and begin the transport of precious micronutrients. By
carrying fat-soluble nutrients through the digestive process, intestinal
absorption is improved. This improved absorption is also known as increased
bioavailability. Fat-soluble nutrients are especially important for good health
and exhibit a variety of functions. Vitamins A, D, E, and K—the fat-

In [37]:
result = search_queries(queries=["What is the importance of hydration for physical performance?", "What role does fibre play in digestion?", "What is the RDA for protein per day?"],
                        k=1)
result

Results for Query: What is the importance of hydration for physical performance?

Score: 0.68
Document:

Image by Allison Calabrese / CC BY 4.0 Water and Electrolyte Needs UNIVERSITY OF
HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM AND HUMAN NUTRITION
PROGRAM During exercise, being appropriately hydrated contributes to
performance. Water is needed to cool the body, transport oxygen and nutrients,
and remove waste products from the muscles. Water needs are increased during
exercise due to the extra water losses through evaporation and sweat.
Dehydration can occur when there is inadequate water levels in the body and can
be very hazardous to the health of an individual. As the severity of dehydration
increases, the exercise performance of an individual will begin to decline (see
Figure 16.9 “Dehydration Effect on Exercise Performance”). It is important to
continue to consume water before, during and after exercise to avoid dehydration
as much as possible. Figure 16.9 Dehydrat

In [50]:
result = search_queries(queries=["what are other health benefits of Calcium in the body?", "define weight gain during pregnancy?", "How does saliva help with digestion?"],
                        k=1)
result

Results for Query: what are other health benefits of Calcium in the body?

Score: 0.78
Document:

Image by Allison Calabrese / CC BY 4.0 Other Health Benefits of Calcium in the
Body Besides forming and maintaining strong bones and teeth, calcium has been
shown to have other health benefits for the body, including: • Cancer. The
National Cancer Institute reports that there is enough scientific evidence to
conclude that higher intakes of calcium decrease colon cancer risk and may
suppress the growth of polyps that often precipitate cancer. Although higher
calcium consumption protects against colon cancer, some studies have looked at
the relationship between calcium and prostate cancer and found higher intakes
may increase the risk for prostate cancer; however the data is inconsistent and
more studies are needed to confirm any negative association. • Blood pressure.
Multiple studies provide clear evidence that higher calcium consumption reduces
blood pressure. A review of twenty-three obs

In [51]:
result = search_queries(queries=["How often should infants be breastfed??", "what is water soluble vitamins", "What are symptoms of pellagra?"],
                        k=1)
result

Results for Query: How often should infants be breastfed??

Score: 0.57
Document:

milk is the best source to fulfill nutritional requirements. An exclusively
breastfed infant does not even need extra water, including in hot climates. A
newborn infant (birth to 28 days) requires feedings eight to twelve times a day
or more. Between 1 and 3 months of age, the breastfed infant becomes more
efficient, and the number of feedings per day often become fewer even though the
amount of milk consumed stays the same. After about six months, infants can
gradually begin to consume solid foods to help meet nutrient needs. Foods that
are added in addition to breastmilk are called complementary foods.
Complementary foods should be nutrient dense to provide optimal nutrition.
Complementary foods include baby meats, vegetables, fruits, infant cereal, and
dairy products such as yogurt, but not infant formula. Infant formula is a
substitute, not a complement to breastmilk. In addition to complementary foo

At end lets delete our Pinecone index.

In [52]:
pc.delete_index(index_name)