{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ML Practice Series: Module 19 - Natural Language Processing (NLP)\n", "\n", "Welcome to Module 19! **Natural Language Processing** (NLP) lets machines analyze, interpret, and generate human language. It is the technology behind products such as Siri, Google Translate, and ChatGPT.\n", "\n", "### Objectives:\n", "1. **Text Cleaning**: Removing punctuation and stopwords.\n", "2. **Tokenization & Lemmatization**: Splitting text into tokens and reducing words to their base forms.\n", "3. **TF-IDF**: Weighting each word by its importance to a document relative to the whole collection.\n", "4. **Sentiment Analysis**: Predicting whether a text is positive or negative.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Setup\n", "We will use a small, hand-labeled sample of movie reviews (1 = positive, 0 = negative) to perform sentiment analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score\n", "\n", "# Sample dataset: (review text, label), where 1 = positive and 0 = negative\n", "reviews = [\n", "    (\"I loved this movie! The acting was great.\", 1),\n", "    (\"Terrible film, a complete waste of time.\", 0),\n", "    (\"The plot was boring but the music was okay.\", 0),\n", "    (\"Truly a masterpiece of cinema.\", 1),\n", "    (\"I would not recommend this to anybody.\", 0),\n", "    (\"Best experience I have had in a theater.\", 1)\n", "]\n", "df = pd.DataFrame(reviews, columns=['text', 'sentiment'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Text Transformation\n", "\n", "### Task 1: TF-IDF Vectorization\n", "Convert the review texts into a numerical matrix using `TfidfVectorizer` (TF-IDF stands for Term Frequency-Inverse Document Frequency).\n", "\n", "*Web Reference: [ML Guide - NLP Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
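\n", "Click to see Solution\n", "\n", "```python\n", "# Drop common English stopwords; learn the vocabulary and IDF weights from the reviews\n", "tfidf = TfidfVectorizer(stop_words='english')\n", "X = tfidf.fit_transform(df['text'])  # sparse matrix: one row per review, one column per term\n", "y = df['sentiment']\n", "# Show the first 10 terms of the learned vocabulary\n", "print(\"Feature names:\", tfidf.get_feature_names_out()[:10])\n", "```\n", "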
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Sentiment Classification\n", "\n", "### Task 2: Training the Classifier\n", "Train a `LogisticRegression` model on the TF-IDF matrix and predict the sentiment of: \"This was a really fun movie!\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Click to see Solution\n", "\n", "```python\n", "model = LogisticRegression()\n", "model.fit(X, y)\n", "\n", "new_review = [\"This was a really fun movie!\"]\n", "# Use transform (not fit_transform) so the new text is mapped onto the vocabulary learned from the training reviews\n", "new_vec = tfidf.transform(new_review)\n", "pred = model.predict(new_vec)\n", "\n", "print(\"Positive\" if pred[0] == 1 else \"Negative\")\n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--- \n", "### NLP Mission Accomplished! \n", "You've learned how to turn human language into math. \n", "This is your final module in the core series!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 4 }