{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ML Practice Series: Module 19 - Natural Language Processing (NLP)\n",
"\n",
"Welcome to Module 19! **Natural Language Processing** allows machines to understand, interpret, and generate human language. This is the tech behind Siri, Google Translate, and ChatGPT.\n",
"\n",
"### Objectives:\n",
"1. **Text Cleaning**: Removing punctuation and stopwords.\n",
"2. **Tokenization & Lemmatization**: Splitting text into tokens and reducing words to their base forms.\n",
"3. **TF-IDF**: Weighting each word by how often it appears in a document and how rare it is across the collection.\n",
"4. **Sentiment Analysis**: Predicting whether a text is positive or negative.\n",
"\n",
"---"
]
},
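{
"cell_type": "markdown",
"metadata": {},
"source": [
"Objectives 1 and 2 can be illustrated with a minimal sketch that uses only the standard library. The `STOPWORDS` set and the `clean_text` helper are assumptions made up for this illustration (real stopword lists are much longer), and lemmatization is left out because it needs a dictionary-backed library such as NLTK or spaCy.\n",
"\n",
"```python\n",
"import string\n",
"\n",
"# Illustrative stopword list (an assumption; real lists are much longer).\n",
"STOPWORDS = {'i', 'the', 'this', 'was', 'a', 'of', 'to', 'but'}\n",
"\n",
"def clean_text(text):\n",
"    # Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords.\n",
"    lowered = text.lower()\n",
"    no_punct = lowered.translate(str.maketrans('', '', string.punctuation))\n",
"    tokens = no_punct.split()  # naive whitespace tokenization\n",
"    return [tok for tok in tokens if tok not in STOPWORDS]\n",
"\n",
"print(clean_text('I loved this movie! The acting was great.'))\n",
"# prints ['loved', 'movie', 'acting', 'great']\n",
"```\n",
"\n",
"In the tasks below, `TfidfVectorizer` handles lowercasing, tokenization, and stopword removal internally, so this manual step is shown only to make the idea concrete."
]
},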
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Setup\n",
"We will use a small, hand-written dataset of six movie reviews, each labeled 1 (positive) or 0 (negative), to perform sentiment analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Sample dataset: (review text, label) pairs, where 1 = positive and 0 = negative\n",
"reviews = [\n",
" (\"I loved this movie! The acting was great.\", 1),\n",
" (\"Terrible film, a complete waste of time.\", 0),\n",
" (\"The plot was boring but the music was okay.\", 0),\n",
" (\"Truly a masterpiece of cinema.\", 1),\n",
" (\"I would not recommend this to anybody.\", 0),\n",
" (\"Best experience I have had in a theater.\", 1)\n",
"]\n",
"df = pd.DataFrame(reviews, columns=['text', 'sentiment'])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Text Transformation\n",
"\n",
"### Task 1: TF-IDF Vectorization\n",
"Convert the text reviews into a numerical matrix using `TfidfVectorizer` (Term Frequency-Inverse Document Frequency).\n",
"\n",
"*Web Reference: [ML Guide - NLP Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)*"
]
},
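{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what a TF-IDF weight is, here is a rough pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as the scikit-learn documentation describes it: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The toy `docs` list and both helper functions are hypothetical and exist only for this illustration.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"# Hypothetical, already-tokenized toy documents (not the reviews above).\n",
"docs = [\n",
"    ['loved', 'movie', 'acting', 'great'],\n",
"    ['terrible', 'film', 'waste', 'time'],\n",
"    ['great', 'movie', 'great', 'music'],\n",
"]\n",
"\n",
"def smoothed_idf(term, docs):\n",
"    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1\n",
"    n_docs = len(docs)\n",
"    doc_freq = sum(1 for doc in docs if term in doc)\n",
"    return math.log((1 + n_docs) / (1 + doc_freq)) + 1\n",
"\n",
"def tfidf_vector(doc, docs):\n",
"    # Raw count times IDF for each term, then scale the row to unit L2 length.\n",
"    weights = {term: doc.count(term) * smoothed_idf(term, docs) for term in set(doc)}\n",
"    norm = math.sqrt(sum(w * w for w in weights.values()))\n",
"    return {term: w / norm for term, w in weights.items()}\n",
"\n",
"vec = tfidf_vector(docs[0], docs)\n",
"for term in sorted(vec, key=vec.get, reverse=True):\n",
"    print(term, round(vec[term], 3))\n",
"```\n",
"\n",
"In the first toy document, terms that occur in fewer documents (`loved`, `acting`) end up with larger weights than terms shared with another document (`movie`, `great`)."
]
},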
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"tfidf = TfidfVectorizer(stop_words='english')\n",
"X = tfidf.fit_transform(df['text'])\n",
"y = df['sentiment']\n",
"print(\"Feature names:\", tfidf.get_feature_names_out()[:10])\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Sentiment Classification\n",
"\n",
"### Task 2: Training the Classifier\n",
"Train a `LogisticRegression` model on the TF-IDF matrix and predict the sentiment of: \"This was a really fun movie!\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"model = LogisticRegression()\n",
"model.fit(X, y)\n",
"\n",
"new_review = [\"This was a really fun movie!\"]\n",
"new_vec = tfidf.transform(new_review)\n",
"pred = model.predict(new_vec)\n",
"\n",
"print(\"Positive\" if pred[0] == 1 else \"Negative\")\n",
"```\n",
"</details>"
]
},
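{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the hood, logistic regression turns a weighted sum of the TF-IDF features into a probability with the sigmoid function, and the predicted class is 1 when that probability is above 0.5. A minimal sketch with made-up numbers: the `weights`, `bias`, and `features` values below are hypothetical and are not taken from the fitted model.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def sigmoid(z):\n",
"    # Map any real-valued score to a probability between 0 and 1.\n",
"    return 1.0 / (1.0 + math.exp(-z))\n",
"\n",
"# Hypothetical learned parameters, made up for illustration.\n",
"weights = {'fun': 1.2, 'boring': -1.5, 'movie': 0.1}\n",
"bias = -0.2\n",
"\n",
"# Hypothetical TF-IDF features for a new review.\n",
"features = {'fun': 0.8, 'movie': 0.6}\n",
"\n",
"# Weighted sum of features plus bias; unknown terms contribute nothing.\n",
"score = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())\n",
"prob_positive = sigmoid(score)\n",
"label = 'Positive' if prob_positive > 0.5 else 'Negative'\n",
"print(label, round(prob_positive, 3))\n",
"```\n",
"\n",
"With only six training reviews, the weights the real model learns are unreliable, so treat its prediction for a new review as a demonstration of the workflow rather than a trustworthy result."
]
},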
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"### NLP Mission Accomplished! \n",
"You've learned how to turn human language into math. \n",
"This is your final module in the core series!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}