nlp_machine / app.py
Gowthamvemula's picture
Update app.py
a18d360 verified
import streamlit as st
# App title with emoji
st.title("๐Ÿ’ก Behind the Scenes of NLP")
# Sidebar navigation with icons
st.sidebar.title("๐ŸŒ Find Your Way")
# Define main sections and their subpoints
sections = {
"๐Ÿ“š Introduction to NLP": [],
"๐Ÿ”„ Lifecycle of NLP": [
" Problem Statement",
" Data Collection",
" Simple EDA",
" Data Preprocessing",
" Feature Extraction",
" Model Selection",
" Model Training and Evaluation",
" Deployment",
" Monitoring and Maintenance",
],
"โš™๏ธ NLP Techniques": [
" Tokenization",
" Stemming",
" Lemmatization",
" Stop Words",
" One Hot Encoding",
" Bag Of Words",
" Binary Bag Of Words",
" TF-IDF",
" Word Embeddings",
" Part-of-Speech (POS) Tagging",
" Named Entity Recognition (NER)",
" Sentiment Analysis",
],
}
# Display main sections and subpoints in sidebar
selected_page = st.sidebar.radio("Steps, Guidance, Clarity", list(sections.keys()))
selected_subpoint = None
if sections[selected_page]:
st.sidebar.write("### Grow & Achieve")
selected_subpoint = st.sidebar.radio("Grow & Achieve", sections[selected_page])
# Content rendering
if selected_page == "๐Ÿ“š Introduction to NLP":
st.header("What is Natural Language Processing (NLP)? ๐Ÿง ")
st.write("""
**What is NLP? ๐Ÿค–๐Ÿ’ฌ**:
**Natural Language Processing (NLP)** is a field of Artificial Intelligence (AI) that focuses on the interaction between computers and humans through natural language.
The goal is to enable machines to understand, interpret, and generate human language in a way that is meaningful and useful.
### Why NLP? ๐Ÿ’ก
NLP helps machines bridge the gap between human communication and machine understanding. It allows computers to process large amounts of unstructured text, making them capable of tasks like translation, sentiment analysis, text summarization, and more.
""")
st.header("Application of NLP:")
st.write("""
1. **Chatbots and Virtual Assistants** ๐Ÿค–๐Ÿ’ฌ
Example: Siri or Alexa understanding your voice commands and providing responses.
2. **Sentiment Analysis** โค๏ธ๐Ÿ’”
Example: Analyzing Twitter comments to determine if they are positive, negative, or neutral about a product.
3. **Machine Translation** ๐ŸŒ
Example: Google Translate converting a sentence from Spanish to English: "Hola, ยฟcรณmo estรกs?" โ†’ "Hello, how are you?"
4. **Text Summarization** โœ‚๏ธ
Example: Automatically summarizing a long article into a few key points.
- NLP is crucial in making machines more intelligent and interactive by understanding and responding to human language effectively.
""")
elif selected_page == "๐Ÿ”„ Lifecycle of NLP":
st.header("NLP Lifecycle ๐Ÿ”„")
if selected_subpoint:
if selected_subpoint == " Problem Statement":
st.write("""
Data collection is a critical first step in any NLP project. The quality and quantity of the data collected directly influence the success of the subsequent stages, such as data preprocessing, model training, and evaluation. Here's a breakdown of the data collection process and the different types of data formats used in NLP:
**Defining the Problem**:
- ๐Ÿ•ต๏ธ **Identify the challenge**: Understand the goal and define the scope of the NLP task.
- ๐Ÿงฉ **Determine the input**: What type of text data will be used? For instance, customer reviews, emails, or social media posts.
- ๐ŸŽฏ **Specify the outcome**: What result is expected? Is it classification, summarization, or language translation?
**Example**:
- ๐Ÿ“ฆ **Task**: Build a chatbot to assist customers in an e-commerce store.
- ๐Ÿ’ฌ **Input**: Customer queries like "Where is my order?" or "How do I return this product?"
- ๐ŸŽ‰ **Output**: A conversational response such as "Your order is on its way!" or "Please follow these steps to return your product."
""")
elif selected_subpoint == " Data Collection":
st.write("""
**Data Collection**: Gather text data from various sources and handle diverse file formats.
**Supported File Types**:
- ๐Ÿ“„ **CSV**: Structured data files (e.g., `data.csv`).
- ๐Ÿ“‹ **XLSX**: Excel spreadsheets (e.g., `data.xlsx`).
- ๐ŸŒ **HTML**: Scraped web data (e.g., `table.html`).
- ๐Ÿ“‚ **JSON**: API responses or hierarchical data (e.g., `data.json`).
- ๐Ÿ“œ **XML**: Nested data like RSS feeds (e.g., `data.xml`).
**Example**:
```python
import pandas as pd
# Load a CSV file
csv_file = "data.csv"
csv_data = pd.read_csv(csv_file)
# Load an Excel file
excel_file = "data.xlsx"
excel_data = pd.read_excel(excel_file)
# Load an HTML file
html_file = "table.html"
html_data = pd.read_html(html_file)[0]
# Load a JSON file
json_file = "data.json"
json_data = pd.read_json(json_file)
# Load an XML file
xml_file = "data.xml"
xml_data = pd.read_xml(xml_file)
# Print sample outputs
print(csv_data.head())
print(excel_data.head())
print(html_data.head())
print(json_data.head())
print(xml_data.head())
```
""")
elif selected_subpoint == " Simple EDA":
st.write("""
**Simple EDA (Exploratory Data Analysis)**:
- ๐Ÿ“Š **Visualize the Data**: Use plots and histograms to understand distributions.
- ๐Ÿ“ˆ **Summary Statistics**: Calculate mean, median, and other stats to find patterns.
- ๐Ÿ” **Detect Anomalies**: Identify missing values or outliers.
**Example**:
```python
# Example EDA Code
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("data.csv")
# Summary statistics
print(df.describe())
# Visualize distributions
df['column_name'].hist()
plt.show()
```
""")
elif selected_subpoint == " Data Preprocessing":
st.write("""
Text preprocessing is the foundation of Natural Language Processing (NLP). It transforms raw text into a clean, structured format, making it suitable for analysis and machine learning.
Below are the key steps involved in preprocessing:
""")
st.write("### ๐Ÿ”— Key Steps in Text Preprocessing")
st.write("""
- **โœ‚๏ธ Tokenization**: Splitting text into smaller units like words or sentences.
- Example: `"The cat is happy"` โ†’ `["The", "cat", "is", "happy"]`
- **โŒ Stop Word Removal**: Removing commonly used words (e.g., "the", "is", "and") that add little meaning to the analysis.
- Example: `"The cat is happy"` โ†’ `["cat", "happy"]`
- **๐ŸŒฑ Lemmatization**: Converting words to their root form using grammar rules.
- Example: `"running"` โ†’ `"run"`
- **โœ‚๏ธ Stemming**: Truncating words to their base form by removing prefixes or suffixes.
- Example: `"runner"` โ†’ `"run"`
- **๐Ÿ”ก Lowercasing**: Converting all text to lowercase to maintain uniformity.
- Example: `"Hello World!"` โ†’ `"hello world!"`
- **๐Ÿ”ง HTML Tag Removal**: Cleaning out HTML tags like `<a>`, `<p>`, `<b>` from text.
- Example: `"Hello <b>World</b>!"` โ†’ `"Hello World!"`
- **๐ŸŒ URL Removal**: Stripping URLs from text.
- Example: `"Visit us at http://example.com"` โ†’ `"Visit us at"`
- **๐Ÿ˜Š Emoji Removal**: Eliminating emojis that may not contribute to textual analysis.
- Example: `"I love NLP! ๐Ÿš€"` โ†’ `"I love NLP!"`
- **๐Ÿ“Œ Hashtag Removal**: Removing hashtags that may not be relevant for the analysis.
- Example: `"#MachineLearning is awesome!"` โ†’ `"is awesome!"`
- **โ— Special Character Removal**: Cleaning symbols like `@`, `#`, `%`, etc.
- Example: `"Hello @user!"` โ†’ `"Hello user"`
""")
st.write("### Example of Preprocessing")
st.write("""
Consider the sentence:
`"Check out this amazing post! ๐Ÿ˜Š #awesome #data http://example.com ๐Ÿš€ Let's talk about AI!"`
After applying the preprocessing steps:
- Tokenization: `["Check", "out", "this", "amazing", "post", "๐Ÿ˜Š", "#awesome", "#data", "http://example.com", "๐Ÿš€", "Let's", "talk", "about", "AI"]`
- Stop Word Removal: `["Check", "amazing", "post", "๐Ÿ˜Š", "#awesome", "#data", "http://example.com", "๐Ÿš€", "Let's", "talk", "AI"]`
- URL Removal: `["Check", "amazing", "post", "๐Ÿ˜Š", "#awesome", "#data", "๐Ÿš€", "Let's", "talk", "AI"]`
- Emoji Removal: `["Check", "amazing", "post", "#awesome", "#data", "Let's", "talk", "AI"]`
- Hashtag Removal: `["Check", "amazing", "post", "Let's", "talk", "AI"]`
""")
st.write("### Code Example")
st.code("""
import re
from bs4 import BeautifulSoup
# Sample text
text = "Check out this amazing post! ๐Ÿ˜Š #awesome #data http://example.com ๐Ÿš€ Let's talk about AI!"
# Remove HTML tags
cleaned_text = BeautifulSoup(text, "html.parser").get_text()
# Remove URLs
cleaned_text = re.sub(r'http\\S+|www\\S+', '', cleaned_text)
# Remove emojis
cleaned_text = re.sub(r'[^\w\s,]', '', cleaned_text)
# Remove hashtags
cleaned_text = re.sub(r'#\\w+', '', cleaned_text)
# Output cleaned text
st.write(f"Cleaned Text: {cleaned_text}")
""", language='python')
st.write("""
**Final Output:**
`"Check out this amazing post Let's talk about AI"`
""")
elif selected_subpoint == " Feature Extraction":
st.write("""
**Feature Extraction**:
- ๐Ÿ”ข **Bag of Words (BoW)**: Represent text as a frequency matrix.
- ๐Ÿงฎ **TF-IDF**: Measure word importance using term frequency and inverse document frequency.
- ๐Ÿค– **Word Embeddings**: Represent words in continuous vector spaces (e.g., Word2Vec, GloVe).
**Example**:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
documents = ["This is a sample document.", "Another document for testing."]
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(documents)
print(features.toarray())
```
""")
elif selected_subpoint == " Model Selection":
st.write("""
**Model Selection**:
- ๐Ÿ” Choose models based on data and requirements:
- ๐Ÿ“Š **Classification**: Logistic Regression, Naive Bayes.
- ๐Ÿ”ค **Sequence Models**: RNNs, LSTMs.
- ๐ŸŒŸ **Transformers**: BERT, GPT.
**Example**:
```python
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
```
""")
elif selected_subpoint == " Model Training and Evaluation":
st.write("""
**Model Training & Evaluation**:
- ๐Ÿ“– Train models using training datasets.
- ๐Ÿ“Š Evaluate using metrics like accuracy, precision, and recall.
**Example**:
```python
from sklearn.metrics import accuracy_score
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```
""")
elif selected_subpoint == " Deployment":
st.write("""
**Deployment**:
- ๐ŸŒ Deploy models as APIs using tools like Flask or FastAPI.
- ๐Ÿ“ฆ Integrate into web or mobile apps.
**Example**:
```bash
flask run
```
""")
elif selected_subpoint == " Monitoring and Maintenance":
st.write("""
**Monitoring & Maintenance**:
- ๐Ÿ“ก Monitor performance using real-time data.
- ๐Ÿ”„ Retrain models periodically with updated data.
""")
elif selected_page == "โš™๏ธ NLP Techniques":
st.header("โš™๏ธ NLP Techniques")
if selected_subpoint:
if selected_subpoint == " Tokenization":
st.write("### Tokenization")
st.write("""
Breaking down text into smaller units such as words or sentences to make it manageable for analysis.
**Example:**
- Input: `"Artificial Intelligence is fascinating."`
- Word Tokens: `["Artificial", "Intelligence", "is", "fascinating", "."]`
- Sentence Tokens: `["Artificial Intelligence is fascinating."]`
""")
elif selected_subpoint == " Stemming":
st.write("### ๐ŸŒฑ Stemming")
st.write("""
Stemming reduces words to their root form by removing prefixes or suffixes, often resulting in a non-grammatical base.
**Example:**
- Input: `["running", "runner", "runs"]`
- Output: `["run", "runner", "run"]` (Porter Stemmer)
**Key Points:**
- **Fast** and **simple**, but can lead to over-stemming or under-stemming.
- Example of over-stemming: `"generous"` โ†’ `"gener"`
**Code Example:**
```python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runner", "runs"]
print([stemmer.stem(word) for word in words])
# Output: ['run', 'runner', 'run']
```
""")
elif selected_subpoint == " Lemmatization":
st.write("### ๐ŸŒฟ Lemmatization")
st.write("""
Lemmatization reduces words to their dictionary base form (lemma), ensuring grammatical correctness.
**Example:**
- Input: `["running", "ran", "better"]`
- Output: `["run", "run", "good"]`
**Key Points:**
- Context-aware and accurate.
- More computationally intensive than stemming.
**Code Example:**
```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "ran", "better"]
print([lemmatizer.lemmatize(word, pos="v") for word in words])
```
""")
elif selected_subpoint == " Stop Words":
st.write("### ๐Ÿšซ Stop Words")
st.write("""
Stop words are common words (e.g., *the*, *is*) that are removed during text processing as they don't add much meaning.
**Example:**
- Input: `"This is a simple sentence."`
- Output: `"simple sentence"`
**Code Example:**
```python
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
sentence = "This is a simple sentence."
filtered_sentence = [word for word in sentence.split() if word.lower() not in stop_words]
print(filtered_sentence) # Output: ['simple', 'sentence']
```
""")
elif selected_subpoint == " One Hot Encoding":
st.write("### ๐Ÿ”ฅ One-Hot Encoding")
st.write("""
Representing categorical data as binary vectors to make it suitable for machine learning models.
**How it works:**
- Each unique category is assigned a unique binary vector.
- A binary vector has all values as `0` except for the position representing the category, which is `1`.
**Example:**
- Categories: `["Apple", "Banana", "Cherry"]`
- Encoding:
- `Apple`: `[1, 0, 0]`
- `Banana`: `[0, 1, 0]`
- `Cherry`: `[0, 0, 1]`
""")
st.write("### ๐ŸŽ Example with Fruits")
st.write("""
**Input Categories:** `["Apple", "Banana", "Cherry", "Banana", "Apple"]`
**Output (One-Hot Encoding):**
- `Apple`: `[1, 0, 0]`
- `Banana`: `[0, 1, 0]`
- `Cherry`: `[0, 0, 1]`
- `Banana`: `[0, 1, 0]`
- `Apple`: `[1, 0, 0]`
""")
elif selected_subpoint == " Bag Of Words":
st.write("### ๐Ÿ‘œ Bag of Words (BoW)")
st.write("""
Bag of Words converts text into a matrix of word frequencies, where each word is represented by a unique index in the vocabulary.
**Example:**
- Input: `["I love NLP", "NLP is fun"]`
- Vocabulary: `["I", "love", "NLP", "is", "fun"]`
- BoW Matrix:
- `"I love NLP"`: `[1, 1, 1, 0, 0]`
- `"NLP is fun"`: `[0, 0, 1, 1, 1]`
**Code Example:**
```python
from sklearn.feature_extraction.text import CountVectorizer
documents = ["I love NLP", "NLP is fun"]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print(bow_matrix.toarray()) # Output: [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```
""")
elif selected_subpoint == " Binary Bag Of Words":
st.write("### ๐Ÿ”ฒ Binary Bag of Words")
st.write("""
Binary Bag of Words is a variation of the BoW model where each word is represented by `1` if present in the document and `0` if absent, ignoring word frequencies.
**Example:**
- Input: `["I love NLP", "NLP is fun"]`
- Vocabulary: `["I", "love", "NLP", "is", "fun"]`
- Binary BoW Matrix:
- `"I love NLP"`: `[1, 1, 1, 0, 0]`
- `"NLP is fun"`: `[0, 0, 1, 1, 1]`
**Code Example:**
```python
from sklearn.feature_extraction.text import CountVectorizer
documents = ["I love NLP", "NLP is fun"]
vectorizer = CountVectorizer(binary=True)
binary_bow_matrix = vectorizer.fit_transform(documents)
print(binary_bow_matrix.toarray()) # Output: [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```
""")
elif selected_subpoint == " TF-IDF":
st.write("### ๐Ÿงฎ TF-IDF (Term Frequency - Inverse Document Frequency)")
st.write("""
TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It considers two factors:
- **Term Frequency (TF)**: The frequency of a word in a document.
- **Inverse Document Frequency (IDF)**: The importance of the word across all documents in the corpus. Words that appear in many documents are less important.
The formula for TF-IDF is:
- **TF-IDF = TF * IDF**
**Example:**
Consider three documents:
1. `"I love programming"`
2. `"Programming is fun"`
3. `"I love Python programming"`
- **TF (for "programming")**:
- Document 1: `1/3`
- Document 2: `1/3`
- Document 3: `1/3`
- **IDF (for "programming")**:
- IDF = log(3/3) = 0 (common word, less informative)
**Code Example:**
```python
from sklearn.feature_extraction.text import TfidfVectorizer
documents = ["I love programming", "Programming is fun", "I love Python programming"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print(tfidf_matrix.toarray()) # Output will show TF-IDF scores for each word in each document
```
""")
elif selected_subpoint == " Word Embeddings":
st.write("### ๐Ÿค– Word Embeddings")
st.write("""
Word embeddings are dense vector representations of words in a continuous vector space, capturing semantic meanings and relationships between words.
**Types of Word Embeddings:**
1. **Word2Vec**: Learns word associations from context using two approaches: Word2Vec is a model that transforms words into dense vector representations in a continuous vector space, capturing semantic relationships. It learns these representations by predicting words based on their context.
- **Skip-gram model**: This model predicts context words from a target word, similar to Word2Vec's Skip-gram model. It's useful for capturing word relationships, and it works well with smaller datasets.
- **CBOW (Continuous Bag of Words) model**: This model predicts a target word from a context window of words, similar to Word2Vec's CBOW model. It's effective for larger datasets and works well when words occur frequently.
2. **GloVe (Global Vectors for Word Representation)**: Uses a co-occurrence matrix to capture the relationships between words. It factors the matrix to produce low-dimensional vectors.
3. **FastText**: Extends Word2Vec by breaking words into subword units, which helps capture morphology and represent rare or unseen words.
**Example:**
- Words like `"king"` and `"queen"` will have similar vector representations in embedding space, reflecting their semantic relationship.
**Code Example (using Word2Vec):**
```python
from gensim.models import Word2Vec
# Sample sentences
sentences = [["I", "love", "programming"], ["Word", "embeddings", "are", "cool"]]
# Train Word2Vec model
model = Word2Vec(sentences, min_count=1)
# Get the vector for the word 'programming'
vector = model.wv['programming']
print(vector)
```
""")
elif selected_subpoint == " Part-of-Speech (POS) Tagging":
st.write("### ๐Ÿ–‡๏ธ Part-of-Speech (POS) Tagging")
st.write("""
Assigning grammatical labels to each word in a sentence, indicating its role in context.
**Example:**
- Input: `"Birds fly high"`
- Output: `["Birds (NOUN)", "fly (VERB)", "high (ADJ)"]`
""")
elif selected_subpoint == " Named Entity Recognition (NER)":
st.write("### ๐ŸŒ Named Entity Recognition (NER)")
st.write("""
Detecting and categorizing entities like names, dates, and locations from text.
**Example:**
- Input: `"Tesla, founded by Elon Musk, is based in California."`
- Output: `["Tesla (ORGANIZATION)", "Elon Musk (PERSON)", "California (LOCATION)"]`
""")
elif selected_subpoint == " Sentiment Analysis":
st.write("### ๐ŸŽญ Sentiment Analysis")
st.write("""
Classifying the emotional tone of a text into categories such as positive, negative, or neutral.
**Example:**
- Input: `"The service was exceptional!"`
- Output: `Positive`
""")
# Footer
st.sidebar.write("---")
st.sidebar.write("๐Ÿ” Made with Streamlit.")