{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# ML Practice Series: Module 19 - Natural Language Processing (NLP)\n",
                "\n",
                "Welcome to Module 19! **Natural Language Processing** allows machines to understand, interpret, and generate human language. It is the technology behind products such as Siri, Google Translate, and ChatGPT.\n",
                "\n",
                "### Objectives:\n",
                "1. **Text Cleaning**: Removing punctuation and stopwords.\n",
                "2. **Tokenization & Lemmatization**: Splitting text into tokens and reducing words to their base forms.\n",
                "3. **TF-IDF**: Weighting each word by how important it is to a document relative to the rest of the collection.\n",
                "4. **Sentiment Analysis**: Predicting whether a text is positive or negative.\n",
                "\n",
                "---"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 1. Setup\n",
                "We will use a small, hand-labeled sample of movie reviews to perform sentiment analysis. Each review is labeled `1` (positive) or `0` (negative)."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "import numpy as np\n",
                "from sklearn.model_selection import train_test_split\n",
                "from sklearn.feature_extraction.text import TfidfVectorizer\n",
                "from sklearn.linear_model import LogisticRegression\n",
                "from sklearn.metrics import accuracy_score\n",
                "\n",
                "# Sample Dataset\n",
                "reviews = [\n",
                "    (\"I loved this movie! The acting was great.\", 1),\n",
                "    (\"Terrible film, a complete waste of time.\", 0),\n",
                "    (\"The plot was boring but the music was okay.\", 0),\n",
                "    (\"Truly a masterpiece of cinema.\", 1),\n",
                "    (\"I would not recommend this to anybody.\", 0),\n",
                "    (\"Best experience I have had in a theater.\", 1)\n",
                "]\n",
                "df = pd.DataFrame(reviews, columns=['text', 'sentiment'])\n",
                "df"
            ]
        },
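        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The first objective, text cleaning, is not exercised directly by the tasks in this module, because `TfidfVectorizer` can lowercase text, strip punctuation, and drop stopwords on its own. As a rough illustration of what that step involves, here is a minimal sketch. The helper name `clean_text` is hypothetical, and using scikit-learn's `ENGLISH_STOP_WORDS` set as the stopword list is an assumption made for this example.\n",
                "\n",
                "```python\n",
                "import re\n",
                "from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS\n",
                "\n",
                "def clean_text(text):\n",
                "    # Hypothetical helper: lowercase, strip non-letters, drop stopwords\n",
                "    text = text.lower()\n",
                "    text = re.sub(r'[^a-z]+', ' ', text)  # punctuation and digits become spaces\n",
                "    tokens = text.split()                 # simple whitespace tokenization\n",
                "    return [tok for tok in tokens if tok not in ENGLISH_STOP_WORDS]\n",
                "\n",
                "cleaned = clean_text('I loved this movie! The acting was great.')\n",
                "print(cleaned)\n",
                "```\n",
                "\n",
                "Lemmatization (objective 2) needs a linguistic resource such as a dictionary of word forms, so it is left out of this sketch."
            ]
        },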
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 2. Text Transformation\n",
                "\n",
                "### Task 1: TF-IDF Vectorization\n",
                "Convert the text reviews into a numerical matrix using `TfidfVectorizer` (Term Frequency-Inverse Document Frequency).\n",
                "\n",
                "*Web Reference: [ML Guide - NLP Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)*"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
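        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "# Build the vocabulary (dropping English stopwords) and compute TF-IDF weights\n",
                "tfidf = TfidfVectorizer(stop_words='english')\n",
                "X = tfidf.fit_transform(df['text'])  # sparse matrix: one row per review, one column per term\n",
                "y = df['sentiment']\n",
                "\n",
                "# Show the first ten terms of the learned vocabulary\n",
                "print(\"Feature names:\", tfidf.get_feature_names_out()[:10])\n",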
                "```\n",
                "</details>"
            ]
        },
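        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To make the weights in `X` less of a black box, the following sketch recomputes TF-IDF by hand for a toy corpus using only the standard library. It assumes scikit-learn's default settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The toy documents and the names `docs`, `doc_freq`, and `tfidf_vector` are made up for illustration.\n",
                "\n",
                "```python\n",
                "import math\n",
                "from collections import Counter\n",
                "\n",
                "# Toy corpus, already tokenized and stripped of stopwords (illustrative data)\n",
                "docs = [\n",
                "    ['loved', 'movie', 'acting', 'great'],\n",
                "    ['terrible', 'film', 'waste', 'time'],\n",
                "    ['movie', 'great', 'music'],\n",
                "]\n",
                "n_docs = len(docs)\n",
                "\n",
                "# Document frequency: in how many documents does each term appear?\n",
                "doc_freq = Counter(term for doc in docs for term in set(doc))\n",
                "\n",
                "def tfidf_vector(doc):\n",
                "    # Raw term counts times smoothed IDF, then L2-normalized\n",
                "    tf = Counter(doc)\n",
                "    weights = {\n",
                "        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)\n",
                "        for term, count in tf.items()\n",
                "    }\n",
                "    norm = math.sqrt(sum(w * w for w in weights.values()))\n",
                "    return {term: w / norm for term, w in weights.items()}\n",
                "\n",
                "vec = tfidf_vector(docs[0])\n",
                "for term, weight in sorted(vec.items()):\n",
                "    print(f'{term}: {weight:.3f}')\n",
                "```\n",
                "\n",
                "Terms that occur in fewer documents (here `loved` and `acting`) end up with larger weights than terms shared across documents (`movie` and `great`)."
            ]
        },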
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 3. Sentiment Classification\n",
                "\n",
                "### Task 2: Training the Classifier\n",
                "Train a `LogisticRegression` model on the TF-IDF matrix and predict the sentiment of: \"This was a really fun movie!\""
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "# Train on the full TF-IDF matrix from Task 1\n",
                "model = LogisticRegression()\n",
                "model.fit(X, y)\n",
                "\n",
                "# Vectorize the new review with the already-fitted vectorizer (transform, not fit_transform)\n",
                "new_review = [\"This was a really fun movie!\"]\n",
                "new_vec = tfidf.transform(new_review)\n",
                "pred = model.predict(new_vec)\n",
                "\n",
                "# Label 1 means positive, 0 means negative\n",
                "print(\"Positive\" if pred[0] == 1 else \"Negative\")\n",
                "```\n",
                "</details>"
            ]
        },
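        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "One detail of Task 2 is worth checking: `transform` silently ignores any word the vectorizer did not see during fitting. The sketch below rebuilds the pipeline in a self-contained form and prints which words of the new review the model actually uses, along with the predicted class probabilities. The variable names `texts`, `labels`, `vectorizer`, `clf`, and `known` are chosen for this example, and with only six training reviews the probabilities are illustrative at best.\n",
                "\n",
                "```python\n",
                "from sklearn.feature_extraction.text import TfidfVectorizer\n",
                "from sklearn.linear_model import LogisticRegression\n",
                "\n",
                "# Same six reviews as in the Setup cell (1 = positive, 0 = negative)\n",
                "texts = [\n",
                "    'I loved this movie! The acting was great.',\n",
                "    'Terrible film, a complete waste of time.',\n",
                "    'The plot was boring but the music was okay.',\n",
                "    'Truly a masterpiece of cinema.',\n",
                "    'I would not recommend this to anybody.',\n",
                "    'Best experience I have had in a theater.',\n",
                "]\n",
                "labels = [1, 0, 0, 1, 0, 1]\n",
                "\n",
                "vectorizer = TfidfVectorizer(stop_words='english')\n",
                "clf = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)\n",
                "\n",
                "new_vec = vectorizer.transform(['This was a really fun movie!'])\n",
                "\n",
                "# Columns with nonzero weight are the words the model can actually see\n",
                "vocab = vectorizer.get_feature_names_out()\n",
                "known = [vocab[i] for i in new_vec.nonzero()[1]]\n",
                "\n",
                "# predict_proba columns follow clf.classes_, here [0, 1]\n",
                "proba = clf.predict_proba(new_vec)[0]\n",
                "print('Words known to the model:', known)\n",
                "print('P(negative), P(positive):', proba)\n",
                "```\n",
                "\n",
                "Because `fun` never appears in the training reviews, it contributes nothing to the prediction. A larger training set is the usual remedy."
            ]
        },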
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "---\n",
                "### NLP Mission Accomplished!\n",
                "You've learned how to turn text into TF-IDF features and train a sentiment classifier on them.\n",
                "This is your final module in the core series!"
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.12.7"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}