Spaces:

Khaquan
/

Burhan02data-sci-chatbot

Runtime error

File size: 18,527 Bytes

df420ed

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c96b164c",
   "metadata": {},
   "source": [
    "# Overview "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22c9cad3",
   "metadata": {},
   "source": [
    "<span style=\"color:red; font-size:18px; font-weight:bold;\">In this notebook, we will try to create a workflow between Langchain and Mixtral LLM.</br>\n",
    "We want to accomplish the following:</span>\n",
    "1. Establish a pipeline to read in a csv and pass it to our LLM. \n",
    "2. Establish a Basic Inference Agent.\n",
    "3. Test the Basic Inference on a few tasks."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bd62bd1",
   "metadata": {},
   "source": [
    "# Setting up the Environment "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "76b3d212",
   "metadata": {},
   "outputs": [],
   "source": [
    "####################################################################################################\n",
    "import os\n",
    "import re\n",
    "import subprocess\n",
    "import sys\n",
    "\n",
    "from langchain import hub\n",
    "from langchain.agents.agent_types import AgentType\n",
    "from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent\n",
    "from langchain_experimental.utilities import PythonREPL\n",
    "from langchain.agents import Tool\n",
    "from langchain.agents.format_scratchpad.openai_tools import (\n",
    "    format_to_openai_tool_messages,\n",
    ")\n",
    "from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser\n",
    "\n",
    "from langchain.agents import AgentExecutor\n",
    "\n",
    "from langchain.agents import create_structured_chat_agent\n",
    "from langchain.memory import ConversationBufferWindowMemory\n",
    "\n",
    "from langchain.output_parsers import PandasDataFrameOutputParser\n",
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
    "\n",
    "from langchain_community.chat_models import ChatAnyscale\n",
    "import pandas as pd \n",
    "import matplotlib.pyplot as plt \n",
    "import numpy as np\n",
    "\n",
    "plt.style.use('ggplot')\n",
    "####################################################################################################"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "9da52e1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# insert your API key here\n",
    "os.environ[\"ANYSCALE_API_KEY\"] = \"esecret_8btufnh3615vnbpd924s1t3q7p\"\n",
    "memory_key = \"history\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edcd6ca7",
   "metadata": {},
   "source": [
    "# Reading the Dataframe and saving it \n",
    "<span style=\"font-size:16px;color:red;\">We are saving the dataframe in a csv file in the cwd because we will iteratively update this df if need be between each inference. This way the model will continuously have the most upto date version of the dataframe at its disposal</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64a086f2",
   "metadata": {},
   "source": [
    "# DECLARE YOUR LLM HERE "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3f686305",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = ChatAnyscale(model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b1d8f8f",
   "metadata": {},
   "source": [
    "# REPL Tool \n",
    "<span style=\"font-size:16px;color:red;\">This tool enables our LLM to 'execute' python code so that we can actually display the results to the end-user.</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "23045c5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "python_repl = PythonREPL()\n",
    "\n",
    "repl_tool = Tool(\n",
    "    name=\"python_repl\",\n",
    "    description=\"\"\"A Python shell. Shell can dislay charts too. Use this to execute python commands.\\\n",
    "    You have access to all libraries in python including but not limited to sklearn, pandas, numpy,\\\n",
    "    matplotlib.pyplot, seaborn etc. Input should be a valid python command. If the user has not explicitly\\\n",
    "    asked you to plot the results, always print the final output using print(...)\"\"\",\n",
    "    func=python_repl.run,\n",
    ")\n",
    "\n",
    "tools = [repl_tool]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5caa356",
   "metadata": {},
   "source": [
    "# Building an Agent "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "44c07305",
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"\"\"You are Machine Learning Inference agent. Your job is to use your tools to answer a user query\\\n",
    "            in the best manner possible.\\\n",
    "            The user gives you two things:\n",
    "            1. A query, and a target task.\n",
    "            Your job is to extract what you need to do from this query and then accomplish that task.\\\n",
    "            In case the user asks you to do things like regression/classification, you will need to follow this:\\\n",
    "            READ IN YOUR DATAFRAME FROM: 'df.csv'\n",
    "            1. Split the dataframe into X and y (y= target, X=features)\n",
    "            2. Perform some preprocessing on X as per user instructions. \n",
    "            3. Split preprocessed X and y into train and test sets.\n",
    "            4. If a model is specified, use that model or else pick your model of choice for the given task.\n",
    "            5. Run this model on train set. \n",
    "            6. Predict on Test Set.\n",
    "            5. Print accuracy score by predicting on the test.\n",
    "            \n",
    "            Provide no explanation for your code. Enclose all your code between triple backticks ``` \"\"\",\n",
    "        ),\n",
    "        (\"user\", \"Dataframe named df:\\n{df}\\nQuery: {input}\\nList of Tools: {tools}\"),\n",
    "        MessagesPlaceholder(variable_name=\"agent_scratchpad\"),\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0784c5a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.mkdir('code')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "f1da1de1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_code(code):\n",
    "    \n",
    "    with open(f'{os.getcwd()}/code/code.py', 'w') as file:\n",
    "        file.write(code)\n",
    "        \n",
    "    try:\n",
    "        print(\"Running code ...\\n\")\n",
    "        result = subprocess.run([sys.executable, 'code/code.py'], capture_output=True, text=True, check=True, timeout=20)\n",
    "        return result.stdout, False\n",
    "    \n",
    "    except subprocess.CalledProcessError as e:\n",
    "        print(\"Error in code!\\n\")\n",
    "        return e.stdout + e.stderr, True\n",
    "    \n",
    "    except subprocess.TimeoutExpired:\n",
    "        return \"Execution timed out.\", True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "9602539f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def infer(user_input, df, llm):\n",
    "\n",
    "    agent = (\n",
    "        {\n",
    "            \"input\": lambda x: x[\"input\"],\n",
    "            \"tools\": lambda x:x['tools'],\n",
    "            \"df\": lambda x:x['df'],\n",
    "            \"agent_scratchpad\": lambda x: format_to_openai_tool_messages(\n",
    "                x[\"intermediate_steps\"]\n",
    "            )\n",
    "        }\n",
    "        | prompt\n",
    "        | llm\n",
    "        | OpenAIToolsAgentOutputParser()\n",
    "    )\n",
    "    \n",
    "    agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)\n",
    "    \n",
    "    error_flag = True    \n",
    "    \n",
    "    while error_flag:\n",
    "        result = list(agent_executor.stream({\"input\": user_input, \n",
    "                                         \"df\":pd.read_csv('df.csv'), \"tools\":tools}))\n",
    "        \n",
    "        pattern = r\"```python\\n(.*?)\\n```\"\n",
    "        matches = re.findall(pattern, result[0]['output'], re.DOTALL)\n",
    "        code = \"\\n\".join(matches)\n",
    "        code += \"\\ndf.to_csv('./df.csv', index=False)\"\n",
    "        \n",
    "        res, error_flag = run_code(code)\n",
    "        \n",
    "        print(res)\n",
    "            \n",
    "    # execute the code\n",
    "    return res"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "692483e9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
      "\u001b[32;1m\u001b[1;3m ```python\n",
      "from sklearn.model_selection import train_test_split\n",
      "from sklearn.linear_model import LinearRegression\n",
      "from sklearn.preprocessing import LabelEncoder, PolynomialFeatures\n",
      "import pandas as pd\n",
      "import numpy as np\n",
      "\n",
      "df = pd.read_csv('df.csv')\n",
      "\n",
      "# Drop rows with NaN values and non-numerical columns\n",
      "df = df.dropna().select_dtypes(include=['int64', 'float64'])\n",
      "\n",
      "X = df.drop('Fare', axis=1)\n",
      "y = df['Fare']\n",
      "\n",
      "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
      "\n",
      "model = LinearRegression()\n",
      "model.fit(X_train, y_train)\n",
      "\n",
      "y_pred = model.predict(X_test)\n",
      "\n",
      "print(model.score(X_test, y_test))\n",
      "```\u001b[0m\n",
      "\n",
      "\u001b[1m> Finished chain.\u001b[0m\n",
      "Running code ...\n",
      "\n",
      "0.3610845260081037\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'0.3610845260081037\\n'"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "infer('help me predict the fare using linear regression. Drop Nan rows and non-numerical columns.', df, llm)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b119c18",
   "metadata": {},
   "source": [
    "# Testing the Agent"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "786a9a3e",
   "metadata": {},
   "source": [
    "### First we perform some data cleaning. Observe how each infer call updates the df and passes it onwards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "fbef0fff",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Braund, Mr. Owen Harris</td>\n",
       "      <td>male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A/5 21171</td>\n",
       "      <td>7.25</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived  Pclass                     Name   Sex   Age  SibSp  \\\n",
       "0            1         0       3  Braund, Mr. Owen Harris  male  22.0      1   \n",
       "\n",
       "   Parch     Ticket  Fare Cabin Embarked  \n",
       "0      0  A/5 21171  7.25   NaN        S  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "634e0004",
   "metadata": {},
   "outputs": [],
   "source": [
    "# infer(\"Please drop the following columns inplace from the original dataframe: Name, Ticket, Cabin, PassengerId.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "53ad6af5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# infer(\"Please map the strings in the following columns of the original dataframe itself to integers: sex, embarked\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "6000364b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# infer(\"Please remove NaNs with most frequent value in the respective column. Do this inplace in original dataframe.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c9d0c90",
   "metadata": {},
   "source": [
    "### Linear Regression to predict fare"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93e026c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "infer(\"\"\"Now I want you to help me run linear regression to predict the Fare. Please drop NaN rows and drop \\n\n",
    "    non numerical columns too.\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ded19d06",
   "metadata": {},
   "source": [
    "### SVR to predict Fare"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "586cad9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n",
    "1. Split the dataframe into X and y. Where y is Fare column.\n",
    "2. Split X and y into train, test sets.\n",
    "3. Fit a SVR model to the train set.\n",
    "4. Predict on test set.\n",
    "5. print the MSE loss on the predictions.\n",
    "\n",
    "Run all steps from 1 to 5.\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1d8d758",
   "metadata": {},
   "source": [
    "### Logistic Regression to Classify Survival"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e0c12cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n",
    "1. Split the dataframe into X and y. Where y is Survived column.\n",
    "2. Split X and y into train, test sets.\n",
    "3. Fit a Logistic Regression model to the train set.\n",
    "4. Predict on test set.\n",
    "5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.\n",
    "\n",
    "Run all steps from 1 to 5.\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77dd622b",
   "metadata": {},
   "source": [
    "### SVM to Classify Survival "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f7cdf6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n",
    "1. Split the dataframe into X and y. Where y is Survived column.\n",
    "2. Split X and y into train, test sets.\n",
    "3. Fit a Gaussian SVM with Poly kernel to the train set.\n",
    "4. Predict on test set.\n",
    "5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.\n",
    "\n",
    "Run all steps from 1 to 5.\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "583e657d",
   "metadata": {},
   "source": [
    "### Naive Bayes to Classify Survival"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd88d9d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n",
    "1. Split the dataframe into X and y. Where y is Survived column.\n",
    "2. Split X and y into train, test sets.\n",
    "3. Fit a Naive Bayes model to the train set.\n",
    "4. Predict on test set.\n",
    "5. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.\n",
    "\n",
    "Run all steps from 1 to 5.\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a6e463b",
   "metadata": {},
   "source": [
    "### KNN to Classify Survival\n",
    "<span style=\"font-size:16px;\">We perform a grid search to figure out the best n_neighbors parameter!</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d978330c",
   "metadata": {},
   "outputs": [],
   "source": [
    "infer(\"\"\"Now I want you to follow all these steps in order:\\n\n",
    "1. Split the dataframe into X and y. Where y is Survived column.\n",
    "2. Split X and y into train and test sets.\n",
    "3. Fit a KNN model to the train set.\n",
    "4. Find the best n_neighbors parameter out of range (1,10). Import some grid search method to do this.\n",
    "5. Predict with best model on test set.\n",
    "6. print the accuracy of the predictions. Round to 4 decimal places and print it out neatly.\n",
    "\n",
    "Run all steps from 1 to 6.\n",
    "\"\"\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}