{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ML Practice Series: Module 19 - Natural Language Processing (NLP)\n",
"\n",
"Welcome to Module 19! **Natural Language Processing** allows machines to understand, interpret, and generate human language. This is the tech behind Siri, Google Translate, and ChatGPT.\n",
"\n",
"### Objectives:\n",
"1. **Text Cleaning**: Removing punctuation and stopwords.\n",
"2. **Tokenization & Lemmatization**: Breaking down words to their roots.\n",
"3. **TF-IDF**: Weighing word importance in a document.\n",
"4. **Sentiment Analysis**: Predicting if a text is positive or negative.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Setup\n",
"We will use a dataset of movie reviews to perform sentiment analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Sample Dataset\n",
"reviews = [\n",
" (\"I loved this movie! The acting was great.\", 1),\n",
" (\"Terrible film, a complete waste of time.\", 0),\n",
" (\"The plot was boring but the music was okay.\", 0),\n",
" (\"Truly a masterpiece of cinema.\", 1),\n",
" (\"I would not recommend this to anybody.\", 0),\n",
" (\"Best experience I have had in a theater.\", 1)\n",
"]\n",
"df = pd.DataFrame(reviews, columns=['text', 'sentiment'])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Text Transformation\n",
"\n",
"### Task 1: TF-IDF Vectorization\n",
"Convert the text reviews into a numerical matrix using `TfidfVectorizer` (Term Frequency-Inverse Document Frequency).\n",
"\n",
"*Web Reference: [ML Guide - NLP Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary><b>Click to see Solution</b></summary>\n",
"\n",
"```python\n",
"tfidf = TfidfVectorizer(stop_words='english')\n",
"X = tfidf.fit_transform(df['text'])\n",
"y = df['sentiment']\n",
"print(\"Feature names:\", tfidf.get_feature_names_out()[:10])\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Sentiment Classification\n",
"\n",
"### Task 2: Training the Classifier\n",
"Train a `LogisticRegression` model on the TF-IDF matrix and predict the sentiment of: \"This was a really fun movie!\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary><b>Click to see Solution</b></summary>\n",
"\n",
"```python\n",
"model = LogisticRegression()\n",
"model.fit(X, y)\n",
"\n",
"new_review = [\"This was a really fun movie!\"]\n",
"new_vec = tfidf.transform(new_review)\n",
"pred = model.predict(new_vec)\n",
"\n",
"print(\"Positive\" if pred[0] == 1 else \"Negative\")\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"### NLP Mission Accomplished! \n",
"You've learned how to turn human language into math. \n",
"This is your final module in the core series!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}