{ "cells": [ { "cell_type": "markdown", "id": "c96b164c", "metadata": {}, "source": [ "# Overview " ] }, { "cell_type": "markdown", "id": "22c9cad3", "metadata": {}, "source": [ "In this notebook, we will try to create a workflow between Langchain and Mixtral LLM.\n", "We want to accomplish the following:\n", "1. Establish a pipeline to read in a csv and pass it to our LLM. \n", "2. Establish a Basic Inference Agent.\n", "3. Test the Basic Inference on a few tasks." ] }, { "cell_type": "markdown", "id": "3bd62bd1", "metadata": {}, "source": [ "# Setting up the Environment " ] }, { "cell_type": "code", "execution_count": 21, "id": "76b3d212", "metadata": {}, "outputs": [], "source": [ "####################################################################################################\n", "import os\n", "import re\n", "import subprocess\n", "import sys\n", "\n", "from langchain import hub\n", "from langchain.agents.agent_types import AgentType\n", "from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent\n", "from langchain_experimental.utilities import PythonREPL\n", "from langchain.agents import Tool\n", "from langchain.agents.format_scratchpad.openai_tools import (\n", " format_to_openai_tool_messages,\n", ")\n", "from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser\n", "\n", "from langchain.agents import AgentExecutor\n", "\n", "from langchain.agents import create_structured_chat_agent\n", "from langchain.memory import ConversationBufferWindowMemory\n", "\n", "from langchain.output_parsers import PandasDataFrameOutputParser\n", "from langchain.prompts import PromptTemplate\n", "\n", "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n", "\n", "from langchain_community.chat_models import ChatAnyscale\n", "import pandas as pd \n", "import matplotlib.pyplot as plt \n", "import numpy as np\n", "\n", "plt.style.use('ggplot')\n", "####################################################################################################" ] }, { "cell_type": "code", "execution_count": 3, "id": "9da52e1f", "metadata": {}, "outputs": [], "source": [ "# insert your API key here\n", "os.environ[\"ANYSCALE_API_KEY\"] = \"esecret_8btufnh3615vnbpd924s1t3q7p\"\n", "memory_key = \"history\"" ] }, { "cell_type": "markdown", "id": "edcd6ca7", "metadata": {}, "source": [ "# Reading the Dataframe and saving it \n", "We are saving the dataframe in a csv file in the cwd because we will iteratively update this df if need be between each inference. This way the model will continuously have the most upto date version of the dataframe at its disposal" ] }, { "cell_type": "markdown", "id": "64a086f2", "metadata": {}, "source": [ "# DECLARE YOUR LLM HERE " ] }, { "cell_type": "code", "execution_count": 5, "id": "3f686305", "metadata": {}, "outputs": [], "source": [ "llm = ChatAnyscale(model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0)" ] }, { "cell_type": "markdown", "id": "4b1d8f8f", "metadata": {}, "source": [ "# REPL Tool \n", "This tool enables our LLM to 'execute' python code so that we can actually display the results to the end-user." ] }, { "cell_type": "code", "execution_count": 6, "id": "23045c5d", "metadata": {}, "outputs": [], "source": [ "python_repl = PythonREPL()\n", "\n", "repl_tool = Tool(\n", " name=\"python_repl\",\n", " description=\"\"\"A Python shell. Shell can dislay charts too. Use this to execute python commands.\\\n", " You have access to all libraries in python including but not limited to sklearn, pandas, numpy,\\\n", " matplotlib.pyplot, seaborn etc. Input should be a valid python command. If the user has not explicitly\\\n", " asked you to plot the results, always print the final output using print(...)\"\"\",\n", " func=python_repl.run,\n", ")\n", "\n", "tools = [repl_tool]" ] }, { "cell_type": "markdown", "id": "c5caa356", "metadata": {}, "source": [ "# Building an Agent " ] }, { "cell_type": "code", "execution_count": 30, "id": "44c07305", "metadata": {}, "outputs": [], "source": [ "prompt = ChatPromptTemplate.from_messages(\n", " [\n", " (\n", " \"system\",\n", " \"\"\"You are Machine Learning Inference agent. Your job is to use your tools to answer a user query\\\n", " in the best manner possible.\\\n", " The user gives you two things:\n", " 1. A query, and a target task.\n", " Your job is to extract what you need to do from this query and then accomplish that task.\\\n", " In case the user asks you to do things like regression/classification, you will need to follow this:\\\n", " READ IN YOUR DATAFRAME FROM: 'df.csv'\n", " 1. Split the dataframe into X and y (y= target, X=features)\n", " 2. Perform some preprocessing on X as per user instructions. \n", " 3. Split preprocessed X and y into train and test sets.\n", " 4. If a model is specified, use that model or else pick your model of choice for the given task.\n", " 5. Run this model on train set. \n", " 6. Predict on Test Set.\n", " 5. Print accuracy score by predicting on the test.\n", " \n", " Provide no explanation for your code. Enclose all your code between triple backticks ``` \"\"\",\n", " ),\n", " (\"user\", \"Dataframe named df:\\n{df}\\nQuery: {input}\\nList of Tools: {tools}\"),\n", " MessagesPlaceholder(variable_name=\"agent_scratchpad\"),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "0784c5a9", "metadata": {}, "outputs": [], "source": [ "os.mkdir('code')" ] }, { "cell_type": "code", "execution_count": 18, "id": "f1da1de1", "metadata": {}, "outputs": [], "source": [ "def run_code(code):\n", " \n", " with open(f'{os.getcwd()}/code/code.py', 'w') as file:\n", " file.write(code)\n", " \n", " try:\n", " print(\"Running code ...\\n\")\n", " result = subprocess.run([sys.executable, 'code/code.py'], capture_output=True, text=True, check=True, timeout=20)\n", " return result.stdout, False\n", " \n", " except subprocess.CalledProcessError as e:\n", " print(\"Error in code!\\n\")\n", " return e.stdout + e.stderr, True\n", " \n", " except subprocess.TimeoutExpired:\n", " return \"Execution timed out.\", True" ] }, { "cell_type": "code", "execution_count": 32, "id": "9602539f", "metadata": {}, "outputs": [], "source": [ "def infer(user_input, df, llm):\n", "\n", " agent = (\n", " {\n", " \"input\": lambda x: x[\"input\"],\n", " \"tools\": lambda x:x['tools'],\n", " \"df\": lambda x:x['df'],\n", " \"agent_scratchpad\": lambda x: format_to_openai_tool_messages(\n", " x[\"intermediate_steps\"]\n", " )\n", " }\n", " | prompt\n", " | llm\n", " | OpenAIToolsAgentOutputParser()\n", " )\n", " \n", " agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)\n", " \n", " error_flag = True \n", " \n", " while error_flag:\n", " result = list(agent_executor.stream({\"input\": user_input, \n", " \"df\":pd.read_csv('df.csv'), \"tools\":tools}))\n", " \n", " pattern = r\"```python\\n(.*?)\\n```\"\n", " matches = re.findall(pattern, result[0]['output'], re.DOTALL)\n", " code = \"\\n\".join(matches)\n", " code += \"\\ndf.to_csv('./df.csv', index=False)\"\n", " \n", " res, error_flag = run_code(code)\n", " \n", " print(res)\n", " \n", " # execute the code\n", " return res" ] }, { "cell_type": "code", "execution_count": 33, "id": "692483e9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3m ```python\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.preprocessing import LabelEncoder, PolynomialFeatures\n", "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.read_csv('df.csv')\n", "\n", "# Drop rows with NaN values and non-numerical columns\n", "df = df.dropna().select_dtypes(include=['int64', 'float64'])\n", "\n", "X = df.drop('Fare', axis=1)\n", "y = df['Fare']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "model = LinearRegression()\n", "model.fit(X_train, y_train)\n", "\n", "y_pred = model.predict(X_test)\n", "\n", "print(model.score(X_test, y_test))\n", "```\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n", "Running code ...\n", "\n", "0.3610845260081037\n", "\n" ] }, { "data": { "text/plain": [ "'0.3610845260081037\\n'" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "infer('help me predict the fare using linear regression. Drop Nan rows and non-numerical columns.', df, llm)" ] }, { "cell_type": "markdown", "id": "2b119c18", "metadata": {}, "source": [ "# Testing the Agent" ] }, { "cell_type": "markdown", "id": "786a9a3e", "metadata": {}, "source": [ "### First we perform some data cleaning. Observe how each infer call updates the df and passes it onwards." ] }, { "cell_type": "code", "execution_count": 11, "id": "fbef0fff", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | PassengerId | \n", "Survived | \n", "Pclass | \n", "Name | \n", "Sex | \n", "Age | \n", "SibSp | \n", "Parch | \n", "Ticket | \n", "Fare | \n", "Cabin | \n", "Embarked | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "1 | \n", "0 | \n", "3 | \n", "Braund, Mr. Owen Harris | \n", "male | \n", "22.0 | \n", "1 | \n", "0 | \n", "A/5 21171 | \n", "7.25 | \n", "NaN | \n", "S | \n", "