Update app.py
app.py
CHANGED
@@ -1,5 +1,6 @@
 import streamlit as st
 
 def show_home_page():
     st.title("Natural Language Processing (NLP)")
     st.markdown(
@@ -10,72 +11,44 @@ def show_home_page():
         language in a way that is both meaningful and useful. NLP powers a wide range of applications like chatbots,
         translation tools, sentiment analysis, and search engines.
 
-        Use the
         """
     )
 
-    if st.button("NLP Terminologies"):
-        st.session_state["page"] = "terminologies"
-    if st.button("One-Hot Vectorization"):
-        st.session_state["page"] = "one_hot"
-    if st.button("Bag of Words"):
-        st.session_state["page"] = "bow"
-    if st.button("TF-IDF Vectorizer"):
-        st.session_state["page"] = "tfidf"
-    if st.button("Word2Vec"):
-        st.session_state["page"] = "word2vec"
-    if st.button("FastText"):
-        st.session_state["page"] = "fasttext"
-    if st.button("Tokenization"):
-        st.session_state["page"] = "tokenization"
-    if st.button("Stop Words"):
-        st.session_state["page"] = "stop_words"
-
 def show_page(page):
-    if page == "terminologies":
         st.title("NLP Terminologies")
         st.markdown(
             """
             ### NLP Terminologies (Detailed Explanation)
 
-            - **Tokenization**:
-
-
-            - **
-            during preprocessing because they carry little unique information.
-
-            - **Stemming**: Stemming reduces words to their root form by removing suffixes. For example, "running" -> "run".
-            It may produce non-lexical words (e.g., "better" -> "bett").
-
-            - **Lemmatization**: Unlike stemming, lemmatization converts a word to its dictionary base form (e.g., "running" -> "run").
-
             - **Corpus**: A large collection of text used for NLP training and analysis.
-
-            - **
-
-            - **
-
-            - **POS Tagging**: Assigning parts of speech to words, like noun, verb, etc.
-
-            - **Named Entity Recognition (NER)**: Identifying entities like names, locations, and organizations in text.
-
-            - **Parsing**: Analyzing grammatical structure and relationships between words.
             """
         )
-    elif page == "one_hot":
         st.title("One-Hot Vectorization")
         st.markdown(
             """
            ### One-Hot Vectorization
 
-
 
             #### How It Works:
             - Each unique word in the corpus is assigned an index.
             - The vector for a word is all zeros except for a 1 at the index corresponding to that word.
 
             #### Example:
-
             - "cat" -> [1, 0, 0]
             - "dog" -> [0, 1, 0]
             - "bird" -> [0, 0, 1]
@@ -91,7 +64,7 @@ def show_page(page):
             - Useful for small datasets and when computational simplicity is prioritized.
             """
         )
-    elif page == "bow":
         st.title("Bag of Words (BoW)")
         st.markdown(
             """
@@ -124,7 +97,7 @@ def show_page(page):
             - Text classification and clustering.
             """
         )
-    elif page == "tfidf":
         st.title("TF-IDF Vectorizer")
         st.markdown(
             """
@@ -138,11 +111,22 @@ def show_page(page):
             - **Term Frequency (TF)**: Number of times a term appears in a document divided by total terms in the document.
             - **Inverse Document Frequency (IDF)**: Logarithm of total documents divided by the number of documents containing the term.
 
             #### Applications:
             - Search engines, information retrieval, and document classification.
             """
         )
-    elif page == "word2vec":
         st.title("Word2Vec")
         st.markdown(
             """
@@ -154,11 +138,18 @@ def show_page(page):
             - **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
             - **Skip-gram**: Predicts the context from the target word.
 
             #### Applications:
             - Text classification, sentiment analysis, and recommendation systems.
             """
         )
-    elif page == "fasttext":
         st.title("FastText")
         st.markdown(
             """
@@ -166,36 +157,87 @@ def show_page(page):
 
             FastText is an extension of Word2Vec that represents words as a combination of character n-grams.
 
             #### Applications:
             - Multilingual text processing.
             - Handling noisy and incomplete data.
             """
         )
-    elif page == "tokenization":
         st.title("Tokenization")
         st.markdown(
             """
             ### Tokenization
 
             Tokenization is the process of breaking text into smaller units (tokens) such as words, phrases, or sentences.
             """
         )
-    elif page == "stop_words":
         st.title("Stop Words")
         st.markdown(
             """
             ### Stop Words
 
             Stop words are commonly used words in a language that are often removed during text preprocessing.
             """
         )
 
-#
-
-
-
-
-
     show_home_page()
 else:
-    show_page(st.session_state["page"])
@@ -1,5 +1,6 @@
 import streamlit as st
 
+# Function to display the Home Page
 def show_home_page():
     st.title("Natural Language Processing (NLP)")
     st.markdown(
@@ -10,72 +11,44 @@ def show_home_page():
         language in a way that is both meaningful and useful. NLP powers a wide range of applications like chatbots,
         translation tools, sentiment analysis, and search engines.
 
+        Use the menu in the sidebar to explore each topic in detail.
         """
     )
 
+# Function to display specific topic pages
 def show_page(page):
+    if page == "NLP Terminologies":
         st.title("NLP Terminologies")
         st.markdown(
             """
             ### NLP Terminologies (Detailed Explanation)
 
+            - **Tokenization**: Breaking text into smaller units like words or sentences.
+            - **Stop Words**: Commonly used words (e.g., "the", "is") often removed during preprocessing.
+            - **Stemming**: Reducing words to their root forms (e.g., "running" -> "run").
+            - **Lemmatization**: Converting words to their dictionary base forms (e.g., "running" -> "run").
             - **Corpus**: A large collection of text used for NLP training and analysis.
+            - **Vocabulary**: The set of all unique words in a corpus.
+            - **n-grams**: Continuous sequences of n words/characters from text.
+            - **POS Tagging**: Assigning parts of speech to words.
+            - **NER (Named Entity Recognition)**: Identifying names, places, organizations, etc.
+            - **Parsing**: Analyzing grammatical structure of text.
             """
         )
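As a side note on the terminology list above, word n-grams and suffix-stripping stemming can be sketched in a few lines of plain Python. This is a toy illustration, not how NLTK's or SpaCy's implementations work; `naive_stem` is a deliberately crude rule that also shows why stemmers can emit non-lexical roots.

```python
def ngrams(tokens, n):
    # Continuous sequences of n items from a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def naive_stem(word):
    # Toy stemmer: strip a few common suffixes. Real stemmers
    # (e.g., Porter) apply ordered rule sets with conditions.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(ngrams(["NLP", "is", "fun"], 2))  # bigrams of the sentence
print(naive_stem("running"))            # "runn" - a non-lexical root
```

Note that `naive_stem("running")` yields `"runn"`, the same kind of artifact as the "better" -> "bett" example in the removed text.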
+    elif page == "One-Hot Vectorization":
         st.title("One-Hot Vectorization")
         st.markdown(
             """
             ### One-Hot Vectorization
 
+            A simple representation where each word in the vocabulary is represented as a binary vector.
 
             #### How It Works:
             - Each unique word in the corpus is assigned an index.
             - The vector for a word is all zeros except for a 1 at the index corresponding to that word.
 
             #### Example:
+            Vocabulary: ["cat", "dog", "bird"]
             - "cat" -> [1, 0, 0]
             - "dog" -> [0, 1, 0]
             - "bird" -> [0, 0, 1]
@@ -91,7 +64,7 @@ def show_page(page):
             - Useful for small datasets and when computational simplicity is prioritized.
             """
         )
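The one-hot scheme described on this page can be reproduced with a short pure-Python sketch, assuming the three-word vocabulary from the example:

```python
vocabulary = ["cat", "dog", "bird"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # All zeros except a 1 at the word's index
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0]
```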
+    elif page == "Bag of Words":
         st.title("Bag of Words (BoW)")
         st.markdown(
             """
@@ -124,7 +97,7 @@ def show_page(page):
             - Text classification and clustering.
             """
         )
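The BoW idea (a vector of per-word counts over a fixed vocabulary) can be sketched with `collections.Counter`; the vocabulary here is illustrative:

```python
from collections import Counter

def bow_vector(doc, vocabulary):
    # Count how often each vocabulary word occurs in the document
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["nlp", "is", "fun", "amazing"]
print(bow_vector("NLP is fun fun", vocab))  # [1, 1, 2, 0]
```

Unlike one-hot encoding, the vector entries are counts, so word frequency is preserved while word order is discarded.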
+    elif page == "TF-IDF Vectorizer":
         st.title("TF-IDF Vectorizer")
         st.markdown(
             """
@@ -138,11 +111,22 @@ def show_page(page):
             - **Term Frequency (TF)**: Number of times a term appears in a document divided by total terms in the document.
             - **Inverse Document Frequency (IDF)**: Logarithm of total documents divided by the number of documents containing the term.
 
+            #### Advantages:
+            - Reduces the weight of common words.
+            - Highlights unique and important words.
+
+            #### Example:
+            For the corpus:
+            - Doc1: "NLP is amazing."
+            - Doc2: "NLP is fun and amazing."
+
+            TF-IDF highlights words like "fun" and "amazing" over commonly occurring words like "is".
+
             #### Applications:
             - Search engines, information retrieval, and document classification.
             """
         )
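The TF and IDF formulas stated on this page can be worked through directly on the two-document example. This sketch uses the plain log(N/df) form from the text (library implementations such as scikit-learn's `TfidfVectorizer` smooth the IDF, so their numbers differ):

```python
import math

docs = [
    ["nlp", "is", "amazing"],
    ["nlp", "is", "fun", "and", "amazing"],
]

def tf(term, doc):
    # Occurrences of the term divided by total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total documents / documents containing the term)
    df = sum(term in d for d in docs)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "is" appears in both documents -> idf = log(2/2) = 0, so its weight
# vanishes; "fun" appears in only one document and keeps a positive weight.
print(tf_idf("is", docs[1], docs))   # 0.0
print(tf_idf("fun", docs[1], docs))  # > 0
```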
+    elif page == "Word2Vec":
         st.title("Word2Vec")
         st.markdown(
             """
@@ -154,11 +138,18 @@ def show_page(page):
             - **CBOW (Continuous Bag of Words)**: Predicts the target word from its context.
             - **Skip-gram**: Predicts the context from the target word.
 
+            #### Advantages:
+            - Captures semantic meaning (e.g., "king" - "man" + "woman" ≈ "queen").
+            - Efficient for large datasets.
+
             #### Applications:
             - Text classification, sentiment analysis, and recommendation systems.
+
+            #### Limitations:
+            - Requires significant computational resources.
             """
         )
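The "king" - "man" + "woman" ≈ "queen" analogy can be demonstrated mechanically with hand-picked toy vectors. Real Word2Vec embeddings are learned from a corpus (e.g., with gensim); these 3-dimensional vectors are fabricated purely to show the vector arithmetic and nearest-neighbor lookup:

```python
import math

# Fabricated toy embeddings; dimensions loosely mean (royalty, male, female)
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.1, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```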
+    elif page == "FastText":
         st.title("FastText")
         st.markdown(
             """
@@ -166,36 +157,87 @@ def show_page(page):
 
             FastText is an extension of Word2Vec that represents words as a combination of character n-grams.
 
+            #### Advantages:
+            - Handles rare and out-of-vocabulary words.
+            - Captures subword information (e.g., prefixes and suffixes).
+
+            #### Example:
+            The word "playing" might be represented by n-grams like "pla", "lay", "ayi", "ing".
+
             #### Applications:
             - Multilingual text processing.
             - Handling noisy and incomplete data.
+
+            #### Limitations:
+            - Higher computational cost compared to Word2Vec.
             """
         )
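The subword decomposition of "playing" can be sketched in plain Python. This helper is illustrative, not the FastText library's API; the `<` and `>` padding follows FastText's convention of marking word boundaries:

```python
def char_ngrams(word, n=3):
    # Pad with boundary markers, then slide an n-character window
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```

Because an unseen word shares n-grams with known words ("playing" and "played" share "<pl", "pla", "lay"), FastText can build vectors for out-of-vocabulary words.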
+    elif page == "Tokenization":
         st.title("Tokenization")
         st.markdown(
             """
             ### Tokenization
 
             Tokenization is the process of breaking text into smaller units (tokens) such as words, phrases, or sentences.
+
+            #### Types of Tokenization:
+            - **Word Tokenization**: Splits text into words.
+            - **Sentence Tokenization**: Splits text into sentences.
+
+            #### Libraries for Tokenization:
+            - NLTK, SpaCy, and Hugging Face Transformers.
+
+            #### Example:
+            Sentence: "NLP is exciting."
+            - Word Tokens: ["NLP", "is", "exciting", "."]
+
+            #### Applications:
+            - Preprocessing for machine learning models.
+
+            #### Challenges:
+            - Handling complex text like abbreviations and multilingual data.
             """
         )
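The word-tokenization example on this page can be approximated with a single regex; real tokenizers in NLTK or SpaCy handle far more edge cases (abbreviations, contractions, multilingual scripts), which is exactly the challenge the page notes:

```python
import re

def word_tokenize(text):
    # Runs of word characters become word tokens;
    # each remaining punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("NLP is exciting."))  # ['NLP', 'is', 'exciting', '.']
```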
+    elif page == "Stop Words":
         st.title("Stop Words")
         st.markdown(
             """
             ### Stop Words
 
             Stop words are commonly used words in a language that are often removed during text preprocessing.
+
+            #### Examples of Stop Words:
+            - English: "is", "the", "and", "in".
+            - Spanish: "es", "el", "y", "en".
+
+            #### Why Remove Stop Words?
+            - To reduce noise in text data.
+
+            #### Applications:
+            - Sentiment analysis, text classification, and search engines.
+
+            #### Challenges:
+            - Some stop words might carry context-specific importance.
             """
         )
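Filtering out the English stop words listed on this page is a short list comprehension. The set below is the page's tiny illustrative list; production code would typically use a fuller list such as NLTK's `stopwords` corpus:

```python
STOP_WORDS = {"is", "the", "and", "in"}  # tiny illustrative list

def remove_stop_words(tokens):
    # Keep only tokens that are not stop words (case-insensitive)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["NLP", "is", "fun", "and", "amazing"]))
# ['NLP', 'fun', 'amazing']
```

As the page's "Challenges" note says, blind removal can hurt: dropping "not" from "not good" inverts the sentiment, so stop-word lists should be tuned per task.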
 
+# Sidebar navigation
+st.sidebar.title("NLP Topics")
+menu_options = [
+    "Home",
+    "NLP Terminologies",
+    "One-Hot Vectorization",
+    "Bag of Words",
+    "TF-IDF Vectorizer",
+    "Word2Vec",
+    "FastText",
+    "Tokenization",
+    "Stop Words",
+]
+selected_page = st.sidebar.radio("Select a topic", menu_options)
+
+# Display the selected page
+if selected_page == "Home":
     show_home_page()
 else:
+    show_page(selected_page)