Update app.py

app.py (CHANGED)

@@ -1,273 +1,229 @@
Removed (old version, query-parameter navigation):

```python
import streamlit as st

# Function to redirect to different pages
def redirect_to_page(page):
    st.experimental_set_query_params(page=page)

# Title of the app
st.title('Natural Language Processing (NLP) Overview')

st.write("""
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that enables machines to understand,
interpret, and generate human language. NLP is used in a wide variety of applications, such as chatbots, search engines,
translation systems, and voice assistants.

Some common NLP tasks include:
- Text Classification
- Sentiment Analysis
- Named Entity Recognition (NER)
- Language Translation
- Text Summarization
- Part-of-Speech Tagging

### Importance of NLP:
- **Automation of manual tasks**: NLP is widely used to automate tasks such as document categorization, content summarization, and sentiment analysis.
- **Understanding and generating human language**: NLP allows machines to understand the meaning behind words, sentences, and paragraphs, making human-machine interactions more natural.
""")

# NLP Lifecycle
st.header('NLP Lifecycle')
st.write("""
The NLP lifecycle consists of several stages, each contributing to transforming raw text into useful insights or predictions. Here are the stages of the NLP lifecycle:

1. **Data Collection**: Collect text data from various sources such as websites, social media, surveys, etc.
2. **Text Preprocessing**: Clean and preprocess the text data, removing unnecessary information like stopwords, punctuation, etc.
3. **Text Representation**: Convert the preprocessed text into numerical form using methods like Bag of Words (BoW), TF-IDF, or Word Embeddings.
4. **Model Training**: Train machine learning models on the text data to solve the NLP problem, such as classification or entity recognition.
5. **Evaluation**: Assess the model's performance using evaluation metrics like accuracy, precision, recall, and F1-score.
6. **Deployment**: Deploy the trained model to a real-world application, such as a chatbot or sentiment analysis tool, and continuously monitor and retrain the model as needed.

These stages are crucial for building effective NLP applications that provide value to users.
""")
# Define the available NLP lifecycle stages
lifecycle_stages = ['Data Collection', 'Text Preprocessing', 'Text Representation',
                    'Model Training', 'Evaluation', 'Deployment']

# Add a selectbox for the user to choose a lifecycle stage
selected_lifecycle_stage = st.selectbox('Choose an NLP Lifecycle Stage:', lifecycle_stages)

# If lifecycle stage is selected, update query params and display new content
if selected_lifecycle_stage:
    redirect_to_page(selected_lifecycle_stage)

# Get the page from the query params
params = st.experimental_get_query_params()
selected_page = params.get("page", [None])[0]

# Define content for different lifecycle stages
if selected_page == 'Data Collection':
    st.write("""
    ### Data Collection:
    The first stage of the NLP lifecycle involves gathering text data from various sources such as:
    - Social media posts
    - Websites and blogs
    - News articles
    - Customer reviews
    - Books and papers

    **Example**: Collecting customer feedback from surveys or scraping news articles to analyze sentiment.

    **Key Points**:
    - Data must be relevant to the task you are solving (e.g., sentiment analysis, text classification).
    - The data can be structured (e.g., databases) or unstructured (e.g., plain text from websites).
    """)

elif selected_page == 'Text Preprocessing':
    st.write("""
    ### Text Preprocessing:
    Text preprocessing is essential for preparing raw text data for analysis. The steps involved include:
    - **Tokenization**: Breaking text into smaller units like words or sentences.
    - **Removing Stop Words**: Stop words (e.g., "the", "a", "is") are common words that don't carry much information and are often removed.
    - **Stemming**: Reducing words to their base or root form (e.g., "running" → "run").
    - **Lemmatization**: Similar to stemming but more accurate, it reduces words to their dictionary form (e.g., "better" → "good").
    - **Lowercasing**: Converting all text to lowercase to avoid treating the same word in different cases (e.g., "Hello" vs "hello").
    - **Removing Special Characters**: Eliminating punctuation marks, numbers, and other non-alphabetic characters that may not contribute to the analysis.

    **Key Points**:
    - Preprocessing is crucial for reducing noise in the text, ensuring that the machine learning models focus on the important features.
    """)

elif selected_page == 'Text Representation':
    st.write("""
    ### Text Representation:
    After preprocessing, text needs to be converted into a numerical form for machine learning algorithms.
    The common techniques for text representation include:
    - **Bag of Words (BoW)**: Converts text into a matrix of word frequencies.
    - **TF-IDF (Term Frequency - Inverse Document Frequency)**: A statistical method to evaluate the importance of a word within a document relative to a collection of documents.
    - **Word Embeddings**: Maps words to dense vectors, preserving semantic meaning (e.g., Word2Vec, GloVe, FastText).

    **Key Points**:
    - BoW and TF-IDF are more traditional methods, while word embeddings capture semantic relationships and are widely used in modern NLP tasks.
    """)

elif selected_page == 'Model Training':
    st.write("""
    ### Model Training:
    In the model training stage, machine learning algorithms are used to train a model on the preprocessed and represented data.
    The choice of model depends on the task at hand. For example:
    - For **text classification**, algorithms like Naive Bayes, SVM, or neural networks are commonly used.
    - For **named entity recognition (NER)**, sequence models such as CRF (Conditional Random Fields) or LSTM (Long Short-Term Memory) can be used.
    - For **sentiment analysis**, simple models like logistic regression or complex models like BERT can be employed.

    **Key Points**:
    - The choice of model depends on the task (e.g., classification, sequence generation, summarization).
    - The model learns patterns and relationships in the text data, which it will use to make predictions.
    """)

elif selected_page == 'Evaluation':
    st.write("""
    ### Evaluation:
    Once a model is trained, it is evaluated to understand its performance. Common evaluation metrics include:
    - **Accuracy**: The proportion of correct predictions.
    - **Precision**: The ratio of correctly predicted positive observations to the total predicted positives.
    - **Recall**: The ratio of correctly predicted positive observations to the total actual positives.
    - **F1-Score**: The weighted average of precision and recall.
    - **ROC and AUC**: Performance measurement for classification problems.

    **Key Points**:
    - Evaluation helps determine if the model is overfitting (memorizing the training data) or underfitting (not learning the data properly).
    - It ensures that the model will perform well on unseen data (real-world applications).
    """)

elif selected_page == 'Deployment':
    st.write("""
    ### Deployment:
    The final stage is deploying the trained model for real-time use. The model can be integrated into applications like:
    - Chatbots for customer service
    - Sentiment analysis for social media monitoring
    - Language translation systems
    - Search engines for better query results

    **Key Points**:
    - Continuous monitoring and maintenance are necessary to ensure that the model stays effective over time, especially as new data comes in.
    - Retraining may be required periodically to account for changes in language usage or new trends in the data.
    """)

# NLP Techniques
st.header('NLP Techniques')
st.write("""
Some key techniques used in NLP include:

- **Tokenization**: The process of breaking down text into smaller units, such as words or sentences.
- **Stop Word Removal**: The process of removing common words (e.g., "the", "a", "and") that do not contribute significant meaning to the text.
- **Stemming**: Reducing words to their root form (e.g., "running" → "run").
- **Lemmatization**: Similar to stemming but more accurate, reducing words to their dictionary form (e.g., "better" → "good").
- **Named Entity Recognition (NER)**: Identifying entities such as people, organizations, and locations within text.
- **Part-of-Speech Tagging**: Identifying the grammatical structure of words in a sentence, such as nouns, verbs, adjectives, etc.
- **Word Embeddings**: A technique that maps words into continuous vector space, capturing semantic relationships between words (e.g., Word2Vec, GloVe).
- **Text Classification**: Categorizing text into predefined labels or categories (e.g., spam detection, sentiment analysis).
- **Sentiment Analysis**: Determining the sentiment expressed in a text, such as whether it is positive, negative, or neutral.
""")

# (The NLP-task selector is only partially legible in the diff view. Surviving
# fragments: a task list ending with 'Text Generation', 'Text Similarity', and
# a call to redirect_to_page(selected_task).)

# Get the task from the query params
selected_task_page = params.get("page", [None])[0]

# Define content for different NLP tasks
if selected_task_page == 'Text Classification':
    st.write("""
    ### Text Classification:
    Text Classification is the task of categorizing text into predefined labels.
    This can be used for spam detection, topic categorization, etc.
    **Example**: Categorizing news articles into topics like 'Sports', 'Politics', etc.

    **Techniques**:
    - Bag of Words (BoW)
    - TF-IDF
    - Word Embeddings
    """)

elif selected_task_page == 'Sentiment Analysis':
    st.write("""
    ### Sentiment Analysis:
    Sentiment Analysis determines the sentiment of a given text, such as whether it is positive, negative, or neutral.
    **Example**: Analyzing product reviews to determine customer satisfaction.

    **Techniques**:
    - Lexicon-based (e.g., VADER)
    - Machine Learning (e.g., Naive Bayes, SVM)
    """)

elif selected_task_page == 'Named Entity Recognition (NER)':
    st.write("""
    ### Named Entity Recognition (NER):
    **Example**: Extracting names of people and organizations from news articles.
    """)

# (The remaining removed lines are blank or illegible in the diff view.)
```
Added (new version, sidebar navigation backed by `st.session_state`):

```python
import streamlit as st

# Title of the app
st.title('Natural Language Processing (NLP) Overview')

# Sidebar for navigation
sidebar = st.sidebar

# Sidebar header
sidebar.header('NLP Navigation')

# Sidebar options for NLP Overview, Lifecycle, and Techniques
sidebar_option = sidebar.radio('Choose a section to explore:', ['What is NLP?', 'NLP Lifecycle', 'NLP Techniques'])

# Store the selected page in session state
if 'selected_page' not in st.session_state:
    st.session_state.selected_page = sidebar_option

# Update the selected page if the user selects a different option
if sidebar_option != st.session_state.selected_page:
    st.session_state.selected_page = sidebar_option
```
```python
# Content for "What is NLP?"
if st.session_state.selected_page == 'What is NLP?':
    st.write("""
    ### What is NLP?
    Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human (natural) languages. The goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is meaningful.

    NLP is essential for enabling computers to process and analyze large amounts of natural language data, such as:
    - Text from documents
    - Speech from conversations
    - Images with textual descriptions

    #### Key Components of NLP:
    - **Syntax**: Refers to the arrangement of words in a sentence.
    - **Semantics**: Focuses on the meaning of the words and sentences.
    - **Pragmatics**: Involves the context and intent behind language.
    - **Discourse**: Studies how previous sentences and context influence meaning.

    #### Example Applications of NLP:
    - **Machine Translation**: Automatic translation of text from one language to another (e.g., Google Translate).
    - **Speech Recognition**: Converting spoken language into text (e.g., Siri, Alexa).
    - **Sentiment Analysis**: Analyzing text to determine the sentiment (positive, negative, or neutral), e.g., analyzing customer reviews.
    - **Text Summarization**: Creating a short summary of a long text (e.g., summarizing articles).

    NLP is used across multiple domains like healthcare, finance, and customer service to automate and improve various tasks.
    """)
```
```python
# Content for NLP Lifecycle
elif st.session_state.selected_page == "NLP Lifecycle":
    lifecycle_option = sidebar.radio("Select NLP Lifecycle Step:", [
        "Data Collection",
        "Text Preprocessing",
        "Text Representation",
        "Model Training",
        "Evaluation",
        "Deployment"
    ])

    if lifecycle_option == "Data Collection":
        st.write("""
        #### 1. Data Collection
        Data collection is the first stage of the NLP lifecycle. It involves gathering relevant text data from various sources to analyze and process.
        - **Sources**:
            - Social media posts (e.g., tweets, Facebook status updates)
            - News articles (e.g., for summarization or sentiment analysis)
            - Customer reviews (e.g., on e-commerce platforms)
            - Books and research papers (e.g., for topic modeling or classification)

        **Example**: Scraping customer reviews from Amazon to analyze sentiment and feedback about a product.
        """)
```
```python
    elif lifecycle_option == "Text Preprocessing":
        st.write("""
        #### 2. Text Preprocessing
        Text preprocessing prepares raw text for further analysis. This stage involves cleaning and transforming the data into a structured format that machine learning models can understand.
        - **Tokenization**: Splitting text into smaller units (e.g., words, phrases).
        - **Stop Words Removal**: Removing common words that don't contribute much information.
        - **Lemmatization**: Converting words into their base or dictionary form.
        - **Stemming**: Cutting off prefixes or suffixes from words.
        - **Lowercasing**: Converting all characters in the text to lowercase.

        **Example**: For the sentence "The quick brown fox is running fast", after preprocessing:
        - Tokenization: ["The", "quick", "brown", "fox", "is", "running", "fast"]
        - Stop Words Removal: ["quick", "brown", "fox", "running", "fast"]
        - Lemmatization: ["quick", "brown", "fox", "run", "fast"]
        """)
```
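The preprocessing walk-through above can be sketched in plain Python. The stop-word set and lemma table below are tiny illustrative stand-ins (not part of the app) for what a library such as NLTK or spaCy would provide:

```python
import re

# Minimal stand-ins for a real stop-word list and lemmatizer
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "and", "or"}
LEMMAS = {"running": "run", "better": "good"}

def preprocess(text):
    # Tokenization: keep runs of alphabetic characters
    tokens = re.findall(r"[A-Za-z]+", text)
    # Stop-word removal (case-insensitive)
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    # Lowercasing plus dictionary-lookup "lemmatization"
    return [LEMMAS.get(t.lower(), t.lower()) for t in kept]

print(preprocess("The quick brown fox is running fast"))
# → ['quick', 'brown', 'fox', 'run', 'fast']
```

A real pipeline would also handle punctuation-sensitive tokenization and context-aware lemmatization, but the order of the steps is the same.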
```python
    elif lifecycle_option == "Text Representation":
        st.write("""
        #### 3. Text Representation
        After preprocessing, the text data needs to be converted into a numerical format for use in machine learning models. There are several methods for text representation:
        - **Bag of Words (BoW)**: Converts text into a matrix of word frequencies.
        - **TF-IDF**: Weighs words based on their frequency in a specific document relative to their frequency across the entire dataset.
        - **Word Embeddings**: Transforms words into dense vectors that capture semantic meaning.

        **Example**: Using BoW to convert the sentence "I love NLP" into a vector representation:
        - Vocabulary: ["I", "love", "NLP"]
        - Vector: [1, 1, 1] (word frequency representation)
        """)
```
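The BoW example above can be reproduced with the standard library alone; this is a minimal sketch, with the vocabulary fixed by hand rather than learned from a corpus:

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vocabulary = ["I", "love", "NLP"]
print(bow_vector("I love NLP", vocabulary))  # → [1, 1, 1]
```

Libraries such as scikit-learn's CountVectorizer do the same thing while also building the vocabulary and handling tokenization.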
```python
    elif lifecycle_option == "Model Training":
        st.write("""
        #### 4. Model Training
        In the model training stage, machine learning algorithms are trained on the preprocessed and represented text data. The choice of model depends on the task:
        - **Text Classification**: Naive Bayes, Support Vector Machines (SVM), or neural networks.
        - **Named Entity Recognition (NER)**: Conditional Random Fields (CRF), LSTMs, or transformers.
        - **Sentiment Analysis**: Logistic regression, Naive Bayes, or transformer-based models like BERT.

        **Example**: Training a Naive Bayes classifier to categorize news articles into topics such as "Sports", "Politics", etc.
        """)
```
```python
    elif lifecycle_option == "Evaluation":
        st.write("""
        #### 5. Evaluation
        After training the model, it's important to evaluate its performance. Common evaluation metrics include:
        - **Accuracy**: The percentage of correctly classified samples.
        - **Precision**: The proportion of true positive predictions among all positive predictions.
        - **Recall**: The proportion of true positive predictions among all actual positive cases.
        - **F1-Score**: The harmonic mean of precision and recall.
        - **ROC and AUC**: Metrics used to evaluate classification models.

        **Example**: Using a confusion matrix to evaluate the performance of a sentiment analysis model.
        """)
```
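The metric definitions above follow directly from confusion-matrix counts; here is a small sketch, with hypothetical counts for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical counts for a sentiment classifier
p, r, f1, acc = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(p, r, f1, acc)
```

With these counts all four metrics come out to 0.8; in practice precision and recall usually diverge, which is why F1 is reported.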
```python
    elif lifecycle_option == "Deployment":
        st.write("""
        #### 6. Deployment
        Once the model is trained and evaluated, it is deployed to production for real-world use. This might include integration with applications like chatbots, recommendation systems, or text summarization tools.
        - **Monitoring**: Continuous monitoring to ensure that the model performs well over time.
        - **Retraining**: The model might need to be retrained periodically as new data becomes available.

        **Example**: Deploying a chatbot powered by an NLP model to assist users on a website.
        """)
```
```python
# Content for "NLP Techniques"
elif st.session_state.selected_page == "NLP Techniques":
    technique_option = sidebar.radio("Select NLP Technique:", [
        "Tokenization",
        "Stop Words Removal",
        "Lemmatization",
        "Stemming",
        "Bag of Words (BoW)",
        "TF-IDF",
        "Word Embeddings",
        "Named Entity Recognition (NER)",
        "Part-of-Speech (POS) Tagging",
        "Sentiment Analysis"
    ])

    if technique_option == "Tokenization":
        st.write("""
        #### 1. Tokenization
        Tokenization is the process of splitting text into smaller units, such as words, sentences, or subwords. This is a key preprocessing step for many NLP tasks.
        - **Example**:
            - Sentence: "Natural Language Processing is awesome!"
            - Tokenized words: ["Natural", "Language", "Processing", "is", "awesome"]
        """)
```
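The tokenization example above can be reproduced with a simple regular expression; real tokenizers (NLTK, spaCy) handle contractions, hyphens, and punctuation far more carefully, so treat this as a sketch:

```python
import re

def tokenize(text):
    """Naive word tokenization: keep runs of word characters, drop punctuation."""
    return re.findall(r"\w+", text)

print(tokenize("Natural Language Processing is awesome!"))
# → ['Natural', 'Language', 'Processing', 'is', 'awesome']
```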
```python
    elif technique_option == "Stop Words Removal":
        st.write("""
        #### 2. Stop Words Removal
        Stop words are commonly used words like "the", "is", "at", etc., that do not carry much information in many NLP tasks. Removing stop words helps reduce the dimensionality and noise in the data.
        - **Example**: Removing "is" from the sentence "NLP is amazing!"
        """)
```
```python
    elif technique_option == "Lemmatization":
        st.write("""
        #### 3. Lemmatization
        Lemmatization is the process of converting words into their root or base form based on context. It is more sophisticated than stemming, as it considers the meaning of words.
        - **Example**: "better" → "good", "running" → "run".
        """)
```
```python
    elif technique_option == "Stemming":
        st.write("""
        #### 4. Stemming
        Stemming is the process of reducing words to their root form by removing prefixes or suffixes. This technique may result in non-dictionary words.
        - **Example**: "running" → "run", "happiness" → "happi".
        """)
```
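A deliberately tiny suffix-stripping stemmer reproduces the two examples above; it is my own toy rule set, not the Porter algorithm, which applies many more rules (NLTK's PorterStemmer is the usual choice in practice):

```python
def stem(word):
    """Toy stemmer: strip one common suffix, then undouble a trailing consonant."""
    for suffix in ("ness", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            break
    # "running" → "runn" after stripping "ing"; undouble to get "run"
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print(stem("running"), stem("happiness"))  # → run happi
```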
```python
    elif technique_option == "Bag of Words (BoW)":
        st.write("""
        #### 5. Bag of Words (BoW)
        The Bag of Words model represents text as a set of individual words, disregarding grammar and word order but keeping multiplicity. It is a simple and widely used method for text representation.
        - **Example**:
            - Text: "I love NLP"
            - BoW: {"I": 1, "love": 1, "NLP": 1}
        """)
```
```python
    elif technique_option == "TF-IDF":
        st.write("""
        #### 6. TF-IDF (Term Frequency-Inverse Document Frequency)
        TF-IDF helps determine the importance of a word in a document relative to the entire dataset. It reduces the weight of common words and increases the weight of rare but important words.
        - **Example**: The word "data" might have a high TF-IDF score in a document about data analysis but a low score in a document about cooking.
        """)
```
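The "data" example above can be worked through by hand. This sketch uses raw term frequency and one common smoothed-IDF variant; libraries differ in the exact formula (scikit-learn's TfidfVectorizer, for instance, also normalizes the vectors):

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF with raw term frequency and a smoothed IDF (one common variant)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf

docs = [
    ["data", "analysis", "with", "data"],   # document about data analysis
    ["cooking", "pasta", "at", "home"],     # document about cooking
]
# "data" scores higher in the data-analysis document than in the cooking one
print(tf_idf("data", docs[0], docs), tf_idf("data", docs[1], docs))
```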
```python
    elif technique_option == "Word Embeddings":
        st.write("""
        #### 7. Word Embeddings
        Word embeddings are vector representations of words that capture semantic relationships. Words with similar meanings have similar vectors. Common word embedding models include:
        - **Word2Vec**
        - **GloVe**
        - **FastText**

        **Example**: The words "king" and "queen" would have similar vector representations because they share semantic relationships.
        """)
```
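"Similar vectors" is usually made precise with cosine similarity. The 3-dimensional vectors below are made up purely for illustration; real embeddings from Word2Vec or GloVe have hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy "embeddings" (illustrative only, not from a trained model)
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "pasta": [0.1, 0.0, 0.9],
}
# "king" is closer to "queen" than to "pasta"
print(cosine_similarity(vectors["king"], vectors["queen"]))
print(cosine_similarity(vectors["king"], vectors["pasta"]))
```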
```python
    elif technique_option == "Named Entity Recognition (NER)":
        st.write("""
        #### 8. Named Entity Recognition (NER)
        NER is the task of identifying named entities such as persons, organizations, locations, and dates in text. This technique is commonly used for information extraction.
        - **Example**: "Barack Obama was born in Hawaii."
            - Entities: ["Barack Obama" (Person), "Hawaii" (Location)]
        """)
```
```python
    elif technique_option == "Part-of-Speech (POS) Tagging":
        st.write("""
        #### 9. Part-of-Speech (POS) Tagging
        POS tagging involves assigning grammatical labels (such as noun, verb, adjective) to each word in a sentence.
        - **Example**: "The cat sat on the mat."
            - POS Tags: [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
        """)
```
```python
    elif technique_option == "Sentiment Analysis":
        st.write("""
        #### 10. Sentiment Analysis
        Sentiment analysis involves determining the sentiment of a piece of text, typically categorizing it as positive, negative, or neutral. This is commonly used for customer feedback and social media monitoring.
        - **Example**: "I love this product!" → Positive Sentiment
        """)
```